LANGUAGE BIASES IN TECH: A FULL STACK PROBLEM

AN XIAO MINA
February 10, 2016
8:28 am

Take a minute to imagine you’re a newcomer to the internet. First of all, you are not alone. The web has been around for decades, yes, but on the scale of the world’s population, regular connectivity is still technically a minority experience. With an estimated 3.3 billion internet users out of a world population of 7.2 billion, and a stunning 833 percent growth rate over the past five years, we can expect diversity on the internet to increase significantly, especially as the world internet population inches toward a tipping point.

Now imagine you don’t speak English, Chinese, Arabic, Spanish or another majority language on the internet. Imagine you speak Bihari or Ilokano, minority languages in India and the Philippines, respectively. Again, your experience isn’t unique. With the so-called “next billion” coming online, we can expect a significant increase in language diversity on the internet.

For English speakers, the internet might seem like a teeming wonderland of information and games and social connections, but for those who are just coming online, the internet has a dearth of content—if any—in their native languages. The pipelines for voice and civic action that we’ve seen for much of the world are facing a significant challenge: crossing language and cultural barriers.

For one, some languages are completely invisible and unusable on browsers, operating systems, and keyboards. In the words of Tibetan blogger Dechen Pemba, who can’t access the Tibetan language on a phone:

Given that the Tibetan literary tradition goes back to the 7th century and its linguistic influence reaches far across the Himalayas encompassing areas of India, Bhutan, Mongolia, Russia and Pakistan, my pet hate is when Tibetan language is described as “obscure”. I wonder how it is possible that the language of Tibetan Buddhism and Tibetan Buddhists, comprising of as many as 60 million people, can be wilfully left behind in terms of modern technology? For instance, Google has failed to incorporate a Tibetan font into its Android software, failed to develop a Tibetan language interface and failed to include Tibetan in Google Translate, the most useful of tools. At least Apple has seen the light there.

In a recent series of lectures at UCLA hosted by the Digital Media Arts program and the Processing Foundation, I talked through some of these issues, drawing on an essay I’d written for the Digital Asia Hub, a new think tank in Hong Kong that’s grown out of the Berkman Center for Internet and Society.

Here’s a summary of the key points I think we should be paying attention to with regard to the language biases inherent in our technologies. These are pulled directly from the Digital Asia Hub essay and transcripts from the UCLA talk provided by the terrific Open Transcripts, with minor editing to contextualize the words for this piece:

Language biases create sharp divides in the global web—laying the foundation for digital ghettos of information and community.

Without improved language and writing script support, new netizens run the risk of living in digital ghettos created by their native tongues. Any online actions they engage in or media they create will be largely invisible and unappreciated by those outside their cultural-linguistic spheres. This can have significant effects, for instance, on human rights advocacy, which can depend so heavily on using social media and email to raise awareness among international news sources.

New internet users who don’t speak majority languages will likely be unable to participate in global internet culture and conversations as both readers and contributors. A number of internet researchers looking at language divides online have noted that minority-language speakers, especially those from the global south, will experience substantial information inequality online. Indeed, people’s inability to speak English can significantly affect their very adoption and use of the internet, even if they are aware of its existence.

The internet has proven to be a crucial pipeline for attention for those who have traditionally been marginalized. But language barriers can prevent the broader public from understanding their voices.

I think a lot of us are familiar with the internet’s role in building social movements and the ability to amplify one’s perspective and words. Certainly the Umbrella Movement in Hong Kong and the Black Lives Matter movement here in the U.S. rely on the ability to broadcast a message, to use hashtags, and to create a pipeline from social media to mainstream media, and then hopefully to other audiences.

And certainly we can think about major hashtags and major movements that’ve been in English or a majority language: #TweetLikeAForeignJournalist in Kenya was a critique of media coverage of East Africa. And then #JeSuisCharlie, a simple enough French phrase for people to remember, understand and repeat online and offline.

But there are a number of other movements in other languages that are more difficult to understand, and get significantly less attention: There’s #sassoufit in Congo; there’s the gau wu (#鳩嗚) movement, part of the Hong Kong Umbrella Movement, but also a tangential group with different aims and strategies. As I argued at a recent panel on the topic of biased data, language is one important barrier that prevents these movements from reaching a wider audience.

Ultimately, language biases in our technologies are a full stack problem. These compound on each other, and as technologists, we have to think holistically about solutions.

In technology design we talk about the full stack, a series of layers, such as the code and the user interface, on which software is built. As we noted during the biased data panel discussion, the human-facing part of that code is in English. Admittedly, much code is constructed from simple phrases, like “if” and “then”. Yes, you can learn those phrases, but imagine trying to relearn code in a language that you don’t speak, and suddenly having to learn two languages: the programming language and then the language in which the programming language is expressed.

And then it moves up to the typography pressures. Until recently, the ability to input Arabic on a mobile phone was severely limited, and Arabic speakers developed “Arabizi”, a chat language made of Roman letters and numbers to express their language online. This was incredibly creative, but it was also a response to a lack of support for the Arabic script. The same lack of support affects many other languages whose primary script is not Latin.
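As a rough illustration of how Arabizi leans on numbers, the sketch below maps a handful of digits back to the Arabic letters they commonly stand in for. The mapping is an assumption for this example: it is a small, commonly cited subset, conventions vary by region and by writer, and the function name `restore_digit_letters` is hypothetical. It leaves Latin letters untouched, so it is nowhere near a real transliterator.

```python
import unicodedata

# Hypothetical, partial mapping of Arabizi digits to Arabic letters.
# Usage differs across regions and writers; this is illustrative only.
ARABIZI_DIGITS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ain
    "5": "\u062e",  # khah
    "6": "\u0637",  # tah
    "7": "\u062d",  # hah
    "9": "\u0635",  # sad
}


def restore_digit_letters(text: str) -> str:
    """Replace mapped digits with Arabic letters; leave everything else as typed."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)


# Show the official Unicode name of the letter each digit stands in for.
for digit, letter in ARABIZI_DIGITS.items():
    print(digit, "->", unicodedata.name(letter))

# The digit 7 in a word like "7abibi" stands in for the letter hah.
print(restore_digit_letters("7abibi"))
```

Even this toy version hints at the cost: sounds with no Latin equivalent get squeezed into whatever characters the keyboard offers.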

Then it goes up from there into content. If you want to engage with the broader internet, you have to have access, and we can include language as a form of access. As one example, Stack Overflow is a critical go-to source for the open source community and coders in general, but the majority of the knowledge on the site is only available in English and Portuguese right now. If someone who speaks neither language wants to ask a question from this rich community of more experienced practitioners, whom could they ask?

And then the stack moves all the way to the typography. We’re talking about the political decisions around typography. In languages that use Latin letters, you have a wide variety of typography and fonts that you can use, and if you have that kind of critical knowledge about the implications of all these fonts you can really make important design decisions. But if you have access to only one or two fonts, suddenly your ability to create a space around the very content and sites that you’re trying to build again becomes limited, and you’re inheriting someone else’s designs around your typography.

To be clear, language biases in tech are an extension of the language biases we live with in broader society. As we discuss what it means to “speak American” in this diverse, multilingual country, and as we look to a multilingual global internet, it’s important to remember how often language barriers manifest. Just recently, I wrote about U.S. candidates’ attempts at Spanish language engagement on Twitter, which sometimes fall flat for native speakers. Both Clinton and Sanders have been taken to task online for their not-always-perfect Spanish:

https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/725

https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/706

This is a bias of content, one that is higher up on the technology stack, but that creates a barrier between a candidate and their electorate. Whether a language is misunderstood, or, like Tibetan, completely invisible, the barrier of understanding creates a barrier to access. Solving this at all levels will take a lot of work, but it will be essential for a truly interconnected, accessible, and civically engaged internet.

Civicist

CIVIC TECH NEWS & ANALYSIS
