You can’t delegate responsibility to an algorithm

In an interview, computer science professor Ulrike von Luxburg talks about the possibilities and difficulties of making machine learning systems fair – and why she is convinced that people, not machines, should decide on certain issues.

Machine Learning for Science: Reports of algorithmic discrimination have been piling up lately. Seemingly objective, algorithm-based systems decide to the detriment of individuals and thus turn out to be unfair – for example, facial recognition programs that simply fail to recognize Black people, or programs that pre-sort applications for a job and rate men's resumes higher than women's. The demand society places on developers, but also on researchers, is often: fix the algorithm! Is it as simple as that?

Ulrike von Luxburg: Push a button and the algorithm is fair – no, it's not that simple. Machine learning is not just the one algorithm you apply; it is actually a very long pipeline.

What do you mean?

It can already start with the data: who collected it, how it was labeled. It can also be a matter of how the groups are defined – that is, whom I have to be fair to – and only then does the algorithm come into play. And across this whole pipeline, you have to think about fairness.

Let's go through the various steps of this pipeline together. At the beginning, there is the data – the data the algorithm is trained with, which it uses to learn to make decisions.

"Often the data was not compiled for the purpose it is used for in machine learning."

This is where the first bias comes in – the first distortion or prejudice. In many of the publicly discussed applications in which Black people have fared poorly, things already go wrong at this point: there were far too few images of Black people in the data. By now, I think it is clear to everyone doing facial recognition that people with different skin colors need to be well represented in the data set. Or take the resumes: if mostly men have been hired in the past, then of course that is in the data, and a system trained on it will try to mimic that behavior. Often, the data was not even compiled for the purpose it is later used for in machine learning. So an important question is: Where does the data come from, and who selected it? Was it collected specifically, or simply crawled from the Internet? And who evaluates it, who labels it?
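The representation question can be checked mechanically before any training starts. The sketch below is a minimal illustration with invented numbers – the annotation labels and the 9:1 split are hypothetical, not drawn from any real dataset:

```python
from collections import Counter

def group_shares(labels):
    """Return each group's share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical skin-tone annotations for a face dataset: 900 light, 100 dark.
annotations = ["light"] * 900 + ["dark"] * 100
shares = group_shares(annotations)
print(shares)  # {'light': 0.9, 'dark': 0.1} -- one group is badly underrepresented
```

A check like this only catches how many examples each group contributes; whether the labels themselves carry bias is a separate question.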

The labeling, i.e. the categorization, is often done by crowdworkers – people who take on small jobs as freelancers on Internet platforms, more or less on demand.

And this is how, for example, tools are created that are supposed to evaluate the attractiveness of a person in a photo with the help of machine learning. The labeling is then done by 25-year-old men – crowdworkers are mostly male and young – and they rate how attractive the people in the photos are. Such a dataset is then "biased" from the start and primarily reflects the preferences of the crowdworkers involved.

Let's move on to the next step, the question of who you want or should be fair to.

"For any particular application, one must first consider what 'fair' even means in this context."

Ulrike von Luxburg has been heading the Cluster of Excellence "Machine Learning" together with Philipp Berens since 2019 © SOPHIA CARRARA/UNIVERSITY OF TÜBINGEN

First of all, there is the definition of fairness: Which groups do I want to be fair to? Women compared to men? Black people versus white people? For me to make an algorithm fair, I have to "tell" it that up front, and the more groups I name, the harder it becomes. And then comes the concept of fairness itself. For each specific application, you first have to consider what fair means in this context. There are a few standard notions. One, for example, is "demographic parity," or balanced shares. You could say that when admitting students, a university should reflect the ratio of men and women in the population, i.e. admit about half women and half men.

Here, one only looks at the absolute numbers, not at the qualifications of the applicants. Another fairness notion would be "equalized odds" or "equal opportunity," which can be translated as equality of opportunity. If we stick with the college admissions example, that would mean: if you have the same aptitude, you should be admitted to the degree program – whether you are a woman or a man, whether you are Black or white. You cannot fulfill all of these fairness notions at the same time; you have to choose one.
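The two notions she names can be written down as simple metrics. The admissions data below is entirely invented for illustration – the group labels, decisions, and aptitude flags are hypothetical:

```python
# Hypothetical admissions data: six applicants from two groups.
decisions = [1, 1, 0, 1, 0, 0]          # 1 = admitted
groups    = ["w", "m", "w", "m", "w", "m"]
qualified = [1, 1, 1, 1, 0, 0]          # 1 = has the required aptitude

def positive_rate(decisions, indices):
    """Share of positive decisions over the given applicant indices."""
    return sum(decisions[i] for i in indices) / len(indices)

def demographic_parity_gap(decisions, groups):
    """Gap in admission rates between the two groups, regardless of aptitude."""
    rates = [positive_rate(decisions, [i for i, g in enumerate(groups) if g == label])
             for label in sorted(set(groups))]
    return abs(rates[0] - rates[1])

def equal_opportunity_gap(decisions, groups, qualified):
    """Gap in admission rates among *qualified* applicants only."""
    rates = [positive_rate(decisions,
                           [i for i, g in enumerate(groups) if g == label and qualified[i]])
             for label in sorted(set(groups))]
    return abs(rates[0] - rates[1])

dp_gap = demographic_parity_gap(decisions, groups)            # ~0.33: men admitted twice as often
eo_gap = equal_opportunity_gap(decisions, groups, qualified)  # 0.5: a gap even among the qualified
```

A gap of zero would mean the criterion is satisfied exactly; in practice one usually tolerates a small threshold. The impossibility result she alludes to means you generally cannot drive both gaps to zero at once on the same data.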

It all sounds like machine learning processes can be made fair, as long as you are clear about what fairness is. Where's the catch?

The moment I want to create fairness, other things go down the drain. If I want more fairness, the accuracy of the predictions, for example, goes down.

What does that mean exactly?

This may sound a bit abstract, but it simply means: if it is about lending, for example, and I want equal numbers of white and Black people, or men and women, to get a loan, then I might also give loans to people who may not be able to pay them back. But at the end of the day, the money has to come from somewhere: the bank, its customers, or society would then have to jointly cover the lost money. This means there are costs involved. The question then becomes: How much is fairness worth to us?

So, after the question of data collection and the definition of fairness, we have now arrived at the algorithm.

There are two goals for the algorithm: on the one hand it should be fair, on the other it should be accurate. To stay with our example: despite the defined fairness criteria, the algorithm should, as far as possible, pick out the loan candidates who will actually pay the loan back. Now I have to resolve this "trade-off", this balancing act, between fairness and the actual goal of the algorithm. Here I have a dial to turn: How much fairness do I want, how much accuracy? As a bank, I can decide, for example, to give ten percent of my loans to needy people – or only five percent. Depending on how I decide, fairness goes up or down, and with it the accuracy, and thus ultimately also the costs incurred.
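The "dial" she describes can be sketched with a toy lending example. All scores and repayment outcomes below are invented; the point is only that lowering one group's approval threshold buys demographic parity at the price of accuracy:

```python
# Each applicant: (predicted repayment score, actually repaid? 1/0).
# Both groups and all numbers are hypothetical.
group_a = [(0.9, 1), (0.8, 1), (0.7, 1), (0.4, 0)]
group_b = [(0.6, 1), (0.5, 0), (0.45, 0), (0.2, 0)]

def evaluate(applicants, threshold):
    """Approve everyone at or above the score threshold; report
    the approval rate and the share of correct decisions."""
    decisions = [(score >= threshold, repaid) for score, repaid in applicants]
    approval_rate = sum(d for d, _ in decisions) / len(decisions)
    accuracy = sum(d == bool(r) for d, r in decisions) / len(decisions)
    return approval_rate, accuracy

rate_a, acc_a = evaluate(group_a, 0.65)   # 0.75 approved, perfectly accurate
strict  = evaluate(group_b, 0.65)         # (0.0, 0.75): accurate, but far from parity
relaxed = evaluate(group_b, 0.40)         # (0.75, 0.5): parity reached, accuracy drops
```

Turning the dial (the second group's threshold) from 0.65 down to 0.40 equalizes the approval rates at 0.75 each, but the accuracy on that group falls from 0.75 to 0.5 – those misclassified loans are exactly the costs the interview mentions.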

Let's say I actually decide as a university to use a startup's algorithm to automate student selection and save on staff and costs. Then, of course, I would like to know whether this algorithm makes a reasonably fair selection. But how algorithms are built is usually a trade secret that companies rarely disclose.

That is a question I find exciting: How could a state try to certify something like that? Thinking about the future: there are a lot of start-ups bringing out algorithms, and they want to be able to say, "We do this well." They would like to have something like a TÜV seal on their website that says "Tested by the Federal Data Protection Office: is fair" – or at least "fair within the bounds of what is possible." But what would something like this look like? How would one define some kind of minimum standard that can be tested afterwards without having to disclose the algorithm? I often discuss this with my staff, but we don't have a ready-made solution either.

In her research, Ulrike von Luxburg is concerned with understanding machine learning algorithms from a theoretical standpoint. © SOPHIA CARRARA/UNIVERSITY OF TÜBINGEN

In your opinion, how should a society or a state position itself as long as there is no "TÜV" for algorithms? Is the only option to declare sensitive areas, where discriminatory decisions would have far-reaching consequences, off-limits for algorithms?

I think there are actually areas where, for ethical reasons, I would not want to have such a system. When it comes to decisions that affect people's lives – going to jail or not, taking a child away from someone or not – you can't just delegate that responsibility to an algorithm.

One could argue that the algorithm-based system does not necessarily have to make the decision. It could also just be an assistance system making suggestions to us humans.

You hear this argument all the time, but in practice it often doesn't work out. A judge who is under time pressure anyway will rarely decide against the assistance system; the tendency will always be to follow its recommendations. But there are other areas where I would say: systems that work with machine learning can also do some good. Medicine is a typical example – an assistance system that suggests diagnoses or medications. If it is done well, the benefit may be greater than the harm. In any case, I see potential there in the near future.

"It could be that machine learning systems are better or fairer than humans in some places."

In general, we will have to get used to the idea that these systems are not perfect, and we simply have to deal with that fact. But it could be that in some places they are better or fairer than people. Because one thing is clear: human decision-makers are not always fair either, and they have biases that influence their decisions. The difference is perhaps this: we now have methods at hand to evaluate the fairness or accuracy of an algorithm – but also the fairness or accuracy of human decision-makers. The comparison between the two could sometimes favor the humans and sometimes the machines, depending on the application.