If you’ve written a text message or email recently, chances are AI suggested synonyms, phrases, or ways to finish a sentence. The rise of AI-powered autosuggestion tools like Google’s Smart Compose has coincided with the digital transformation of enterprise communications, which now live mostly online. It’s estimated that the typical worker replies to about 40 emails each day and sends more than 200 Slack messages per week.
Messaging threatens to consume an increasing portion of the workday, with Adobe pegging the amount of time that workers spend answering emails at 15.5 hours a week. The constant task switching is a death knell for productivity, which studies show benefits from uninterrupted work. Research from the University of California and Humboldt University found that workers can lose up to 23 minutes on a task every time they’re interrupted, further lengthening the workday.
Autosuggestion tools promise to save time by streamlining message-writing and replying. Google’s Smart Reply, for instance, suggests quick responses to emails that’d normally take minutes to type out. But the AI behind these tools has shortcomings that could introduce biases or influence the language used in messaging in undesirable ways.
The growth in autosuggestion and text autocompletion
Predictive text isn’t a new technology. One of the first widely available examples, T9, which allows words to be formed from a single keypress for each letter, came standard on many cellphones in the late ’90s. But the advent of more sophisticated, scalable AI techniques in language led to leaps in the quality — and breadth — of autosuggestion tools.
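To make the T9 idea concrete, here is a minimal Python sketch of one-keypress-per-letter lookup. It assumes the standard phone keypad letter groups; the tiny vocabulary and the function names (`to_digits`, `build_index`, `suggest`) are hypothetical and chosen only for illustration, and real T9 implementations also ranked candidates, which this sketch does not attempt.

```python
from collections import defaultdict

# Standard phone keypad letter groups (digits 2-9).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_digits(word):
    """Encode a word as the digit sequence typed with one keypress per letter."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(words):
    """Group vocabulary words by their digit sequence, keeping input order."""
    index = defaultdict(list)
    for word in words:
        index[to_digits(word)].append(word)
    return index


def suggest(index, digits):
    """Return every known word that matches the typed digit sequence."""
    return index.get(digits, [])


# Hypothetical four-word vocabulary, for illustration only.
index = build_index(["good", "home", "gone", "hello"])
print(suggest(index, "4663"))
```

Because "good," "home," and "gone" all map to the same keys, a single digit sequence can yield several candidates, which is why the user still had to pick the intended word.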
In 2017, Google launched Smart Reply in Gmail, which the company later brought to other Google services including Chat and third-party apps. According to Google, the AI behind Smart Reply generates reply suggestions “based on the full context of a conversation,” not just a single message — ostensibly resulting in suggestions that are more timely and relevant. Smart Compose, which suggests complete sentences in emails, arrived in Gmail a year later and Google Docs soon afterward. A similar feature called suggested replies came to Microsoft Outlook in 2018 and Teams in 2020.
The technology behind the new crop of autosuggestion tools — which some academic circles refer to as “AI-mediated communication” — is leaps beyond what existed in the ’90s. For example, the AI model underpinning Smart Compose was created using billions of examples of emails and runs in the cloud on custom accelerator hardware. Meanwhile, Smart Reply — which served as the foundation for Smart Compose — takes a “hierarchical approach” to suggestions, inspired by how humans understand languages and concepts.
“The content of language is deeply hierarchical, reflected in the structure of language itself …” Google research scientist Brian Strope and engineering director Ray Kurzweil explain in a blog post. “Consider the message, ‘That interesting person at the cafe we like gave me a glance.’ … In proposing an appropriate response to this message we might consider the meaning of the word ‘glance,’ which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, ‘Cool!’ Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions. Given enough examples of language, a machine learning approach can discover many of these subtle distinctions.”
But as with all technologies, even the most capable autosuggestion tools are susceptible to flaws that crop up during the development — and deployment — process.
In December 2016, it was revealed that Google Search’s autocomplete feature suggested hateful and offensive endings for specific search phrases, like “are jews evil?” for the phrase “are jews”. According to the company, at fault was an algorithmic system that updates suggestions based on what other users have searched for recently. While Google eventually implemented a fix, it took several more years for the company to block autocompletion suggestions for controversial political statements including false claims about voting requirements and the legitimacy of electoral processes.
Smart Reply has been found to offer the “person wearing turban” emoji in response to a message that included a gun emoji. And Apple’s autocompletion on iOS previously suggested only male emoji for executive roles including CEO, COO, and CTO.
Flaws in autocompletion and autosuggestion systems often arise from biased data. The millions to billions of examples from which the systems learn can be tainted with text from toxic websites that associate certain genders, races, ethnicities, and religions with hurtful concepts. Illustrating the problem, Codex, a code-generating model developed by research lab OpenAI, can be prompted to write “terrorist” when fed the word “Islam.” Another large language model from AI startup Cohere tends to associate men and women with stereotypically “male” and “female” occupations, like “male scientist” and “female housekeeper.”
Annotations in the data can introduce new problems — or exacerbate existing ones. Because many models learn from labels that communicate whether a word, sentence, paragraph or document has certain characteristics, like a positive or negative sentiment, companies and researchers recruit teams of human annotators to label examples, typically from crowdsourcing platforms like Amazon Mechanical Turk. These annotators bring their own sets of perspectives — and biases — to the table.
In a study from the Allen Institute for AI, Carnegie Mellon, and the University of Washington, scientists found that labelers are more likely to annotate phrases in the African American English (AAE) dialect as more toxic than general American English equivalents, even though AAE speakers understand them as non-toxic. Jigsaw, the organization working under Google parent company Alphabet to tackle cyberbullying and disinformation, has drawn similar conclusions in its experiments. Researchers at the company have discovered differences in the annotations between labelers who self-identify as African Americans and members of the LGBTQ+ community versus annotators who don’t identify as either of those groups.
Sometimes the bias is intentional, a matter of vernacular trade-offs. For example, Writer, a startup developing an AI assistant for content generation, says that it prioritizes “business English” in its writing suggestions. CEO May Habib gave the example of the “habitual be” in African American Vernacular English (AAVE), a verb tense that doesn’t exist in any other style of English.
“Since [the habitual be] traditionally hasn’t been used in business English, and thus doesn’t show up in high frequency in our datasets, we would correct ‘Y’all be doing some strange things out here’ to ‘Y’all are doing some strange things out here,’” Habib told VentureBeat via email. “[That said,] we did manually ensure that vernacular-based greetings and sign-offs would not be flagged by Writer. Some vernacular is more gender-neutral than formal business English, [for instance,] so is more modern and on-brand for companies.”
When biases — intentional or not — make it into autocompletion and autosuggestion systems, they can change the way that we write. The enormous scale at which these systems operate makes them difficult (if not impossible) to completely avoid. Smart Reply was responsible for 10% of all Gmail replies sent from smartphones in 2016.
In one of the more comprehensive audits of autocompletion tools, a team of Microsoft researchers conducted interviews with volunteers who were told to give their thoughts on auto-generated replies in Outlook. The interviewees found some of the replies to be over-positive, wrong in their assumptions about culture and gender, and too impolite for certain contexts, like corporate correspondence. Even so, experiments during the study showed that users were more likely to favor short, positive, and polite replies suggested by Outlook.
A separate Harvard study found that when people writing about a restaurant were presented with “positive” autocomplete suggestions, the resulting reviews tended to be more positive than if they were presented with negative suggestions. “It’s exciting to think about how predictive text systems of the future might help people become far more effective writers, but we also need transparency and accountability to protect against suggestions that may be biased or manipulated,” Ken Arnold, a researcher at Harvard’s School of Engineering and Applied Sciences who was involved in the study, told the BBC.
If there’s an all-encompassing solution to the problem of harmful autocompletion, it hasn’t been discovered yet. Google opted to simply block gender-based pronoun suggestions in Smart Compose because the system proved to be a poor predictor of recipients’ sexes and gender identities. Microsoft’s LinkedIn also avoids gendered pronouns in Smart Replies, its predictive messaging tool, to prevent potential blunders.
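As a rough illustration of what blocking gendered pronoun suggestions could look like, the Python sketch below drops any candidate suggestion containing a gendered pronoun. This is not Google’s or LinkedIn’s actual implementation, which neither company details here; the pronoun list, the function name `drop_gendered`, and the sample suggestions are all assumptions made for the example.

```python
import re

# Hypothetical blocklist of gendered pronouns; a production system would
# likely need a broader, language-aware list.
GENDERED_PRONOUNS = {
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
}


def drop_gendered(suggestions):
    """Keep only suggestions that contain no gendered pronoun as a whole word."""
    kept = []
    for text in suggestions:
        # Tokenize into lowercase words so "there" is not mistaken for "her".
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if tokens.isdisjoint(GENDERED_PRONOUNS):
            kept.append(text)
    return kept


candidates = [
    "I will ask him tomorrow",
    "Sounds good",
    "She is out today",
    "Is there a better time?",
]
print(drop_gendered(candidates))
```

Filtering on whole words rather than substrings keeps harmless suggestions such as "Is there a better time?" while still removing the ones that guess at a recipient’s gender.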
The coauthors of the Microsoft study warn that if system designers don’t proactively address the shortcomings in autocompletion technologies, they’ll run the risk of not only offending users but causing them to mistrust the systems. “System designers should explore personalization strategies at the individual and social network level, consider how cultural values and societal biases may be perpetuated by their systems, and explore social interaction modeling in order to begin addressing the limitations and issues,” they wrote. “[O]ur findings indicate that current text recommendation systems for email and other [like] technologies remain insufficiently nuanced to reflect the subtleties of real-world social relationships and communication needs.”