Many AI systems today learn from examples — whether images, text, or audio — that have been labeled by human annotators. The labels enable the systems to extrapolate the relationships between the examples (e.g., the link between the caption “black bear” and a photo of a black bear) to data that the systems haven’t seen before (e.g., photos of black bears that weren’t included in the data used to “teach” the model). This works remarkably well. For example, it’s trivially easy to train a system to distinguish between different animal species, like cats versus dogs.
But annotations can introduce new problems — or exacerbate existing ones. Companies and researchers recruit teams of human annotators to label examples in AI training datasets, typically from crowdsourcing platforms like Amazon Mechanical Turk. And these annotators bring their own sets of perspectives — and biases — to the table. In a 2019 study, scientists found that labelers were more likely to mark phrases in the African American English (AAE) dialect as toxic than their general American English equivalents. In another example of the pitfalls of annotation, some labelers for MIT’s and NYU’s 80 Million Tiny Images dataset contributed racist, sexist, and otherwise offensive annotations several years ago.
AI systems amplify these and other biases as they train, and the biases often trickle down to real-world systems. In 2019, engineers at Meta (formerly Facebook) reportedly discovered that a moderation algorithm at Meta-owned Instagram was 50% more likely to ban Black users than white users. Google’s Cloud Vision API at one time labeled thermometers held by Black people as “guns” while labeling thermometers held by light-skinned subjects as “electronic devices.” And facial recognition systems — which, to be clear, are flawed in many respects — do a poor job of identifying trans and non-binary people.
In search of a solution to the problem of annotator bias, researchers at Stanford recently investigated an approach that they call “jury learning.” The idea is to model “individual voices” in training datasets toward designing a system that makes it possible for developers to explore — and ideally shift — the behavior of AI systems.
“Whose labels should a model learn to emulate?” is often the pressing question in AI system development. For applications ranging from detecting toxic comments to diagnosing diseases, different societal groups might have irreconcilable disagreements about labels. Commonly, data scientists resolve these disagreements by using majority voting, where the labels from multiple annotators are aggregated on a per-example basis into a single label. This works well enough when there’s little disagreement on the labels in question. But when the annotators do disagree, majority voting has the effect of overriding minority groups’ labels.
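The standard aggregation step is simple enough to sketch in a few lines. The snippet below is a toy illustration (the labels and annotator groups are hypothetical), showing how a majority vote erases a dissenting minority view before the model ever sees it:

```python
from collections import Counter

def majority_vote(labels):
    """Collapse several annotators' labels into a single training label."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical contested example: three annotators see a comment as
# non-toxic, two annotators from a targeted group see it as toxic.
labels = ["non-toxic", "non-toxic", "non-toxic", "toxic", "toxic"]
print(majority_vote(labels))  # -> "non-toxic"; the minority view is discarded
```

The model trained on this pipeline only ever observes “non-toxic” for this example; the two dissenting labels leave no trace in the training data.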
Research shows that up to a third of expert annotators — most of whom originate from the U.S. and India — disagree with each other when labeling an average example. In one study, properly accounting for minority groups reduced the accuracy of an online comment toxicity detector from 95% to 73%, showing the degree to which these groups can be muzzled.
Rather than silencing these disagreements, the Stanford researchers’ “jury learning” technique is designed to resolve them through the metaphor of a jury. Jury learning aims to define which people or groups determine a system’s prediction and in what proportion, allowing developers to analyze — and respond to — dissent.
Mitchell Gordon, a lead researcher on the study and a Ph.D. student at Stanford, told VentureBeat that the idea came to him and associate professor Michael Bernstein a couple of years ago, when the two were trying to train a toxicity classifier using a popular dataset. “We noticed that if we simulated re-collecting that dataset with a different set of randomly chosen annotators, something like 40% of the ground truth labels would flip (from toxic to non-toxic, or vice versa),” he said via email. “The labels flipped because each label was decided among a few annotators, and the majority vote among this small group would flip depending on exactly who annotated the example. Imagine running a small survey on some societally-contested topic with only five people and then taking a majority vote: the answer is going to flip depending upon which five people you survey.”
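Gordon’s observation about flipping labels is easy to reproduce in simulation. The sketch below (with a hypothetical, contested annotator pool, not the dataset from the study) estimates how often a majority vote over a small random group of annotators disagrees with the full pool’s majority:

```python
import random

def majority_label(votes):
    """1 ('toxic') if more than half of the votes are toxic, else 0."""
    return int(sum(votes) > len(votes) / 2)

def relabel_instability(annotator_opinions, group_size=5, trials=10_000, seed=0):
    """Fraction of simulated re-collections whose majority label disagrees
    with the majority of the full annotator pool."""
    rng = random.Random(seed)
    full_pool_label = majority_label(annotator_opinions)
    flips = sum(
        majority_label(rng.sample(annotator_opinions, group_size)) != full_pool_label
        for _ in range(trials)
    )
    return flips / trials

# Hypothetical societally-contested example: 40 of 100 annotators call it toxic.
pool = [1] * 40 + [0] * 60
print(f"label flips on about {relabel_instability(pool):.0%} of re-collections")
```

With a pool this divided, the five-person majority flips roughly a third of the time, echoing the instability Gordon describes: the “ground truth” depends on exactly which five people happened to label the example.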
Jury learning tries to capture differences in opinion by modeling every individual annotator and predicting each annotator’s label before outputting a joint annotator prediction. Rather than a typical toxicity detection system outputting a label of, for example, “toxic” or “not toxic,” a jury learning system might output a prediction like, “For this group of six men and six women annotators, which is split evenly between White, Hispanic, AAPI, and Black jurors, 58% of the annotators are predicted to agree that this comment is toxic.” Developers using jury learning can also define annotator compositions for a task that reflect stakeholders across gender and racial identities, political affiliations, and more, for example prominently featuring women and Black people because they’re commonly targets of online harassment.
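The structural difference from majority-vote training can be sketched as follows. This is a simplified toy, not the Stanford system: the annotator attributes and jury composition are hypothetical, and the per-annotator predictor is a stand-in heuristic where the real approach learns a model (e.g., a neural network conditioned on annotator embeddings) per annotator:

```python
from dataclasses import dataclass

@dataclass
class Annotator:
    id: str
    gender: str
    ethnicity: str

def predicted_toxic_prob(annotator, comment):
    """Stand-in for a learned per-annotator model. A toy heuristic is used
    here purely so the aggregation step below is runnable."""
    return 0.9 if "idiot" in comment.lower() else 0.1

def jury_verdict(jury, comment, threshold=0.5):
    """Predict each juror's label, then report the share predicted to agree
    that the comment is toxic -- the jury's joint output."""
    agree = sum(predicted_toxic_prob(a, comment) > threshold for a in jury)
    return agree / len(jury)

# A developer-chosen jury: six men and six women, split evenly across
# White, Hispanic, AAPI, and Black jurors (mirroring the example above).
jury = [Annotator(f"a{i}", "F" if i % 2 else "M", eth)
        for i, eth in enumerate(["White", "Hispanic", "AAPI", "Black"] * 3)]
print(f"{jury_verdict(jury, 'You are an idiot'):.0%} of jurors predicted to call this toxic")
```

The key design choice is that the jury composition is an explicit, inspectable input at prediction time, so a developer can re-run the same comment under different juries and see how the verdict shifts, rather than inheriting whatever mix of annotators happened to label the training data.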
“Today’s machine learning pipeline [is] modeling a sort of aggregate pseudo-human, predicting the majority vote label while ignoring annotators who disagree with the majority. And ignoring people who disagree can be really problematic because voice matters,” Mitchell said. “For instance, in content moderation, healthy spaces and communities have their own distinct norms and values. Non-parents might not be the right voices to decide which topics are fair game in a parenting forum. And the political problems taking place in the Star Wars universe require a very different standard of discussion than those taking place in a forum on Saudi Arabia. So, when a dataset’s annotators disagree, we wanted to empower the people deploying machine learning models to make explicit choices about which voices their models reflect.”
Mitchell and coauthors found that jury learning can seemingly lead to “more diverse” representation than traditional approaches. In their study, they had 18 moderators of online communities create pools of annotators to label a training dataset for toxic comment detection. They found that the pools contained almost three times the number of nonwhite annotators and roughly 32 times the number of gender nonbinary annotators compared with a large, public toxicity dataset. The increase in diversity had a positive downstream effect — when a toxicity detection system originally trained on the public toxicity dataset was trained on the more diverse pools, it altered 14% of the system’s classifications.
Mitchell admits that jury learning isn’t a panacea. A malicious — or simply careless — data scientist could make annotator pools exclude underrepresented voices. But he and coauthors view it as a tool to ensure that decisions about voice are being made “explicitly and carefully,” rather than “implicitly or incidentally.”
“Beyond the toxicity detection task, we use as the primary application domain in our paper (and other social computing tasks one could easily imagine, like misinformation detection), we also envision jury learning being important in other high-disagreement user-facing tasks where we’re increasingly seeing AI play a role,” Mitchell said. “Consider that when using AI to help with a visual design task (e.g., designing a poster), we could select design decisions from designers trained in a particular school of thought. Or when using AI to suggest the right treatment options for a patient, the practitioner could weigh treatment opinions from doctors with a particular expertise. Or think of many of the domains where AI has been criticized for its impact on minoritized groups: often, it’s because the groups’ voices are not integrated into the design and decision-making phases of these efforts.”
Toward greater representation
While jury learning isn’t perfect, it could increase representation in a field that’s severely lacking in it. One recent study showed that only a dozen universities and corporations are responsible for creating the datasets used in AI more than 50% of the time (some of which might treat queer people unfairly). In health care, AI training data containing medical records and imagery mostly come from patients in North America, Europe, and China. And economists have spotlighted that credit scores tend to be less precise for underrepresented minorities and low-income groups because they reward traditional credit rather than everyday payments like on-time rent.
Jury learning — and techniques like it — could also foster trust in AI systems, a quality that many enterprise executives profess to value. In a 2021 report by CognitiveScale, 34% of C-level decision-makers said that the most important AI capability is “explainable and trusted.” Explainability will become key as the public grows increasingly skeptical of AI. According to a 2022 Ipsos poll, half of adults trust companies that use AI as much as they trust other companies, and adults from emerging countries are significantly more likely than those from more economically developed countries to have a positive outlook on the impact of AI-powered products and services in their lives.
“Any time a company wants to use AI for a task in which people genuinely disagree about the right answer, we think they should care about the problem that jury learning set out to solve. No matter what decisions the product makes, it’s going to make some people happy, and some people unhappy. We think it’s almost always in a company’s best interest to make an explicit, carefully considered choice about who those people are, or to empower their users to make that decision for themselves,” Mitchell continued. “Looking to the future, we’re eyeing ways to make jurors more expressive (e.g., what if each juror could provide some sort of reasoning as to why they made a particular decision?). We’re also thinking more broadly about how we might codify an ethical framework that helps practitioners think about who to side with when their machine learning models have to make decisions based on competing points of view.”