Last week’s announcement of a future assurance framework for AI acknowledges the importance of auditing as a component of broader algorithmic impact assessment. This is critical if AI is both to promote innovation and to prevent historic inequalities from being projected into the future. Against this backdrop, it is important to recognise that there is variation in precisely how such audits can be conducted, how differential treatment can be revealed, and therefore how unfair outcomes for different groups can be mitigated. As this blog will discuss, ideas of group fairness are core to this debate.
As someone born and raised in France and now living in London, I often reflect upon the two countries’ very distinct ways of addressing the integration of communities with histories of difference: in France our approach was very much one of assimilation (“laïcité”, becoming French - whatever that means!), whereas in the UK the sense is one of embracing multiculturalism (accommodating, and dare I say celebrating, difference). More explicitly, French assimilation attempts to erase differences and treat everyone the same, whilst British multiculturalism allows sub-communities to form and identities to be expressed.
When assessing bias in AI, much the same debate occurs. Researchers have developed numerous definitions of fairness and metrics to measure bias. Individual fairness concepts do not consider group differences and only seek to treat similar individuals (with respect to the task at hand) similarly (the French spirit). Group fairness concepts, on the other hand, acknowledge that there may be a difference of treatment between different groups of individuals, so we should first split the population into groups, then make sure some fairness metric is similar enough across all groups (the British spirit). Choosing between the two is not the end of the story, as each still contains competing definitions of fairness to choose from.
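To make the contrast concrete, here is a minimal Python sketch of how each notion might be checked. It uses made-up arrays of scores, features, outcomes and groups purely for illustration, and is not tied to any particular fairness library: individual fairness compares predictions for candidates who are similar on task-relevant features, while group fairness compares an aggregate statistic (here, the selection rate) across groups.

```python
import numpy as np

def individual_fairness_gap(scores, features, i, j):
    """Individual fairness (sketch): similar candidates, judged on
    task-relevant features only, should receive similar scores.
    Returns the score difference scaled by how far apart the two
    candidates are in feature space."""
    distance = np.linalg.norm(features[i] - features[j])
    return abs(scores[i] - scores[j]) / max(distance, 1e-9)

def selection_rate_gap(outcomes, groups):
    """Group fairness (sketch): the selection rate (proportion of
    positive outcomes) should be similar across groups.
    Returns the largest gap between any two groups."""
    rates = {g: outcomes[groups == g].mean() for g in np.unique(groups)}
    return max(rates.values()) - min(rates.values())
```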
Group fairness seems to be the most common approach in many practical applications to measure bias, such as in recruitment and facial recognition. In the former, we want to check whether, for instance, there is no racial and gender bias in the hiring process by monitoring the proportion of individuals from each group that “succeed”, whilst in facial recognition, one may want to check that the models perform similarly well on different types of faces. This approach has proven fruitful in many examples, revealing societal issues in the way algorithms are tested and deployed. Some famous examples are COMPAS (an AI system in criminal justice), the Amazon hiring algorithm (scrapped before use), and in gender classification.
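For the facial recognition case, one comparable check is simply to compute the model’s accuracy separately for each group of faces and look at the gap. The sketch below uses made-up labels, predictions and group assignments purely for illustration.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Per-group accuracy: does the model perform similarly well on each group of faces?"""
    return {g: float(np.mean(y_true[groups == g] == y_pred[groups == g]))
            for g in np.unique(groups)}

# Made-up example: the model is right 3 times out of 4 for group "A",
# but only half the time for group "B".
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(accuracy_by_group(y_true, y_pred, groups))  # {'A': 0.75, 'B': 0.5}
```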
Critically, there are significant challenges that come with this approach. The first is to decide what type of discrimination is ‘most’ significant to consider (race, gender, age?). This will almost always depend on context, e.g. the use case of the algorithm, the geography, the societal context and so on. For a data scientist assessing bias, this is an extremely important and contentious decision to make (one, perhaps, we are not best placed to determine, making the requirement for disclosure, and subsequently contestation, of these choices critical). Fortunately, in some countries government guidance already exists. In the UK, the 2010 Equality Act defines nine “protected characteristics” against which it is unlawful to discriminate. They are: age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex and sexual orientation. However, as noted by the Institute for the Future of Work, it is nearly impossible to comply with the Act through statistical auditing, and a more comprehensive form of Equality Impact Assessment is needed.
The second challenge is the need to define subgroups within these categories. At the moment, the need to consider intersectional issues is often overlooked in the academic literature on auditing, with some exceptions. This must be remedied, in theory and in practice, as evidence demonstrates that AI can find proxies for, and detect, granular differences which could compound intersectional inequality.
Let’s say one decides to measure bias against black individuals (compared to white), and against female individuals (compared to male). Classically, one will look at evening out some fairness metric across black and white, and across male and female. But fairness across subgroups such as “black female” and “white male” is ignored. In essence, four subgroups have been created, and evening out a metric along one dimension may negatively impact one of these subgroups.
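A small worked example (with made-up numbers) shows how this can happen: in the table below, the marginal selection rates are identical along each single dimension, 30% for black and white applicants and 30% for female and male applicants, yet black women are selected at half the rate of black men or white women.

```python
import pandas as pd

# Hypothetical applicant data: applications and selections per intersectional subgroup.
counts = pd.DataFrame({
    "race":     ["black", "black", "white", "white"],
    "sex":      ["female", "male", "female", "male"],
    "applied":  [100, 100, 100, 100],
    "selected": [20, 40, 40, 20],
})

# Marginal selection rates are even along each dimension...
for dim in ["race", "sex"]:
    totals = counts.groupby(dim)[["selected", "applied"]].sum()
    print(totals["selected"] / totals["applied"])  # 0.3 for every category

# ...yet the subgroup rates differ by a factor of two.
counts["rate"] = counts["selected"] / counts["applied"]
print(counts[["race", "sex", "rate"]])  # black female: 0.2, black male: 0.4, ...
```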
As part of my work at Holistic AI, I perform audits of algorithms, and this may include an assessment of bias. On some datasets, it is easy enough to check for intersectional effects along the race and sex dimensions, as data is usually available. In the case of recruitment, for instance, we may want to check for disparate impact, which in simple terms compares the “success rate” of different subgroups, i.e. the proportion of candidates who succeed in the interview rounds where AI is used. We have done this by performing chi-square tests of independence, a statistical test that examines the relationship between sex and race in the success data. Our assumption is that if the two variables (sex and race) are independent, then there is no significant intersectionality effect.
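As a rough illustration of this kind of test, the sketch below builds a contingency table of successful candidates cross-classified by sex and race (the counts are made up, and this is not our actual audit code or data) and runs scipy’s chi-square test of independence on it.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of successful candidates, cross-classified by sex and race.
#                       black  white
successes = np.array([[   12,    35],   # female
                      [   30,    33]])  # male

chi2, p_value, dof, expected = chi2_contingency(successes)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")

# Under the assumption described above, a small p-value suggests sex and race
# are not independent in these counts, flagging a possible intersectional
# effect that warrants closer investigation.
```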
The point is that, given the right data and the right guidance, a data scientist or technical auditor has tools available to measure fairness across groups and subgroups. However, they have to make decisions on issues that have a wide societal impact and that are usually settled outside the technical realm. This is because there are always trade-offs in determining the fairness of automated decision-making systems.
It is for this reason that we strongly support the recommendations of the APPG on the Future of Work and the Institute for the Future of Work for a comprehensive model of algorithmic impact assessment, one which accounts for equality impacts among a range of others. As part of this, statistical audit should be used to detect granular risks, but should form part of a broader process of stakeholder engagement geared towards technical and non-technical forms of mitigation. In taking this approach, we would be driving towards a much fairer AI ecosystem.
Roseline is a research assistant at UCL and chief auditor at Holistic AI, a start-up focused on software for auditing and risk management of AI systems.
Roseline Polle