LANGUAGE BIASES IN TECH: A FULL STACK PROBLEM

AN XIAO MINA
February 10, 2016
8:28 am

Take a minute to imagine you’re a newcomer to the internet. First of all, you are not alone. The web has been around for decades, yes, but on the scale of the world’s population, regular connectivity is still technically a minority experience. With an estimated 3.3 billion internet users out of a world population of 7.2 billion, and a stunning 833 percent growth rate over the past five years, we can expect diversity on the internet to increase significantly, especially as the world internet population inches toward a tipping point.

Now imagine you don’t speak English, Chinese, Arabic, Spanish or another majority language on the internet. Imagine you speak Bihari or Ilokano, minority languages in India and the Philippines, respectively. Again, your experience isn’t unique. With the so-called “next billion” coming online, we can expect a significant increase in language diversity on the internet.

For English speakers, the internet might seem like a teeming wonderland of information and games and social connections, but for those who are just coming online, the internet has a dearth of content—if any—in their native languages. The pipelines for voice and civic action that we’ve seen for much of the world are facing a significant challenge: crossing language and cultural barriers.

For one, some languages are completely invisible and unusable on browsers, operating systems, and keyboards. In the words of Tibetan blogger Dechen Pemba, who can’t access the Tibetan language on a phone:

Given that the Tibetan literary tradition goes back to the 7th century and its linguistic influence reaches far across the Himalayas encompassing areas of India, Bhutan, Mongolia, Russia and Pakistan, my pet hate is when Tibetan language is described as “obscure”. I wonder how it is possible that the language of Tibetan Buddhism and Tibetan Buddhists, comprising of as many as 60 million people, can be wilfully left behind in terms of modern technology? For instance, Google has failed to incorporate a Tibetan font into its Android software, failed to develop a Tibetan language interface and failed to include Tibetan in Google Translate, the most useful of tools. At least Apple has seen the light there.

In a recent series of lectures at UCLA hosted by the Digital Media Arts program and the Processing Foundation, I talked through some of these issues, drawing on an essay I’d written for the Digital Asia Hub, a new think tank in Hong Kong that’s grown out of the Berkman Center for Internet and Society.

Here’s a summary of the key points I think we should be paying attention to with regard to the language biases inherent to our technologies. These are pulled directly from the Digital Asia Hub essay and transcripts from the UCLA talk provided by the terrific Open Transcripts, with minor editing to contextualize the words for this piece:

Language biases create sharp divides in the global web—laying the foundation for digital ghettos of information and community.

Without improved language and writing script support, new netizens run the risk of living in digital ghettos created by their native tongues. Any online actions they engage in or media they create will be largely invisible and unappreciated by those outside their cultural-linguistic spheres. This can have significant effects, for instance, on human rights advocacy, which can depend so heavily on using social media and email to raise awareness among international news sources.

New internet users who don’t speak majority languages will likely be unable to participate in global internet culture and conversations as both readers and contributors. A number of internet researchers looking at language divides online have noted that minority language speakers, especially those from the global south, will experience substantial information inequality online. Indeed, people’s inability to speak English can significantly affect their very adoption and use of the internet, even if they are aware of its existence.

The internet has proven to be a crucial pipeline for attention for those who have traditionally been marginalized. But language barriers can prevent the broader public from understanding their voices.

I think a lot of us are familiar with the internet’s role in building social movements and the ability to amplify one’s perspective and words. Certainly the Umbrella Movement in Hong Kong and the Black Lives Matter movement here in the U.S. rely on the ability to broadcast a message, to use hashtags, and to create a pipeline from social media to mainstream media, and then hopefully to other audiences.

And certainly we can think about major hashtags and major movements that’ve been in English or a majority language: #TweetLikeAForeignJournalist in Kenya was a critique of media coverage of East Africa. And then #JeSuisCharlie, a simple enough French phrase for people to remember, understand and repeat online and offline.

But there are a number of other movements in other languages that are more difficult to understand, and get significantly less attention: There’s #sassoufit in Congo; there’s the gau wu (#鳩嗚) movement, part of the Hong Kong Umbrella Movement, but also a tangential group with different aims and strategies. As I argued at a recent panel on the topic of biased data, language is one important barrier that prevents these movements from reaching a wider audience.

Ultimately, language biases in our technologies are a full stack problem. These compound on each other, and as technologists, we have to think holistically about solutions.

In technology design we talk about the full stack, a series of layers, such as the code and the user interface, on which software is built. As we noted during the biased data panel discussion, the human-facing part of that code is in English. Admittedly, much of code is constructed from simple phrases, like “if” and “then”. Yes, you can learn those phrases, but imagine trying to relearn code in a language that you don’t speak, and suddenly having to learn two languages: the programming language and then the language in which the programming language is expressed.
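To make the point about English-based code concrete, here is a minimal sketch in Python, chosen only as one illustrative language. It lists the words the language reserves for itself and checks that they are all plain ASCII. The variable names are my own; the exact list depends on the Python version you run.

```python
import keyword

# The reserved words of the Python interpreter running this script.
reserved = keyword.kwlist

# Every reserved word is plain ASCII; most are English words or
# abbreviations of English words ("if", "while", "def", "elif").
all_ascii = all(word.isascii() for word in reserved)

print(len(reserved), "reserved words; all ASCII:", all_ascii)
print(", ".join(reserved))
```

A learner who does not read English has to memorize each of these as an opaque symbol before writing a single line.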

And then it moves up to the typography pressures. The ability to input Arabic on a mobile phone up until recently was severely limited, and Arabic speakers developed “Arabizi”, a chat language made of Roman letters and numbers to express their language online. This was incredibly creative, but it was also a response to a lack of support for the Arabic script. This affects many other languages whose primary script is not Latin.
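As a rough illustration of how Arabizi works, the sketch below maps a few digits to the Arabic letters they commonly stand in for. The table is an assumption based on widely cited conventions; usage varies by region and by writer, and the names `ARABIZI_DIGITS` and `digits_used` are hypothetical, made up for this example.

```python
# A few commonly cited Arabizi digit conventions. This is illustrative,
# not a standard: usage differs between regions and writers.
ARABIZI_DIGITS = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ain
    "5": "\u062e",  # khah
    "7": "\u062d",  # hah
}


def digits_used(text):
    """Return (digit, Arabic letter) pairs for each mapped digit in text."""
    return [(ch, ARABIZI_DIGITS[ch]) for ch in text if ch in ARABIZI_DIGITS]


# "mar7aba" is a common Arabizi spelling of a greeting; the 7 stands in
# for a letter that has no Latin equivalent.
for digit, letter in digits_used("mar7aba"):
    print(digit, "stands in for", letter)
```

The digits were chosen by writers partly because they loosely resemble the shapes of the letters they replace, a workaround that only exists because the script itself was hard to type.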

Then it goes up from there into content. If you want to engage with the broader internet, you have to have access, and we can include language as a form of access. As one example, Stack Overflow is a critical go-to source for the open source community and coders in general, but the majority of the knowledge on the site is only available in English and Portuguese right now. If someone who speaks neither language wants to ask a question from this rich community of more experienced practitioners, whom could they ask?

And then the stack moves all the way to the typography. We’re talking about the political decisions around typography. In languages that use Latin letters, you have a wide variety of typography and fonts that you can use, and if you have that kind of critical knowledge about the implications of all these fonts you can really make important design decisions. But if you have access to only one or two fonts, suddenly the ability for you to create a space around the very content and the sites that you’re trying to create again becomes limited and you’re inheriting someone else’s designs around your typography.

To be clear, language biases in tech are an extension of the language biases we live with in broader society. As we discuss what it means to “speak American” in this diverse, multilingual country, and as we look to a multilingual worldwide internet, it’s important to remember how often language barriers manifest. Just recently, I wrote about U.S. candidates’ attempts at Spanish language engagement on Twitter, which sometimes fall flat for native speakers. Both Clinton and Sanders have been taken to task online for their not-always-perfect Spanish:

https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/725

https://speakbridge.io/medias/embed/democratic-debates-2016/democratic-debates-2016-general/706

This is a bias of content, one that is higher up on the technology stack, but that creates a barrier between a candidate and their electorate. Whether a language is misunderstood, or, like Tibetan, completely invisible, the barrier of understanding creates a barrier to access. Solving this at all levels will take a lot of work, but it will be essential for a truly interconnected, accessible, and civically-engaged internet.

What’s Going On in German Civic Tech?

TOM STEINBERG
February 9, 2016
2:16 pm

WHY GERMANY?

A couple of years ago I was idly scanning through Google Zeitgeist, the search giant’s annual data release of each year’s top search trends. Somehow I found my way onto the international results, and picking almost at random I chose to look at the search terms for Germany.

There, sitting at the top of the pile, was something I could barely believe. The term in pole position was ‘Wahl-o-mat.’ Despite not being a German speaker, I recognized it: it was the brand name of a German website that helps people work out who to vote for.

Not a recently deceased TV star, or a major movie, or a massively viral YouTube video, but an old-fashioned, 36-question online quiz that ultimately spat out a suggested political party. Further searching revealed that it had been used, through to completion, over 13 million times in the 2013 national elections. Even more astonishing is that the quiz is run by an arm’s-length public body, effectively a ‘who to vote for’ service delivered by part of the state.

Since then, I’ve been acutely aware that Germany has a social-impact technology scene that is somewhat unlike that of many other rich countries. So in January this year I set out on a trip to Berlin to find out about tech initiatives that might be a bit different from what you find elsewhere.

CONTEXT: THE SECURITY AND PRIVACY SCENE

It is no great secret that Germany has been closely associated with the groundswell of discontent since the Snowden revelations. But I wasn’t prepared for just how big that discontent is, how central it is to the way all technology is viewed, or how widely the suspicion of digital technologies has spread.

The best yardstick of how big the security and privacy tech community is in Germany is to consider the attendance of the year’s biggest community shindig, the Chaos Communication Congress (CCC), held in Hamburg. There were an astonishing 12,000 people present this year, and demand for tickets still substantially outstripped supply. Nearly as many people go to CCC as go to Defcon in America, but in a country with about a quarter of the population. And the number rises rapidly every year.

The concerns are much more widespread than the NSA reading German email, too. After a few days I realized that several people I talked to were using the word ‘algorithm’ (referring to automated technologies like Facebook’s wall) with a kind of distasteful wince. It was similar to the way that a lawyer might reluctantly use swear words when quoting a defendant in front of a judge. This is because the very idea of algorithmic sorting of content in social media has become a kind of dirty word in the tech community—yet another way that big institutions could exploit the rest of us. Poor Al-Khwārizmī, who gave his name to the mathematical concept, must be rolling in his grave.

Several people I talked to remarked that Berlin has become a kind of sanctuary to people who work for both well-known and obscure privacy enhancing technology projects. Living there meant not only more like-minded people to hang out with, it meant less hassle at airports, less likelihood of being followed around or interviewed, less of a feeling of being a bad or wanted person generally. You can buy more stuff with cash. Everyone speaks English, and many people the language of cryptography too. People were not naive about the fact that Germany has its own well-staffed security apparatus, but clearly to this community it feels like a much more acceptable home than most other alternatives.

There wasn’t any consensus about what led to Berlin becoming the hub of this community. More than one person strongly contested the almost-standard idea that the history of the Stasi and of the Nazis has made the average German more worried about surveillance than the average Brit. I was told that Google and Facebook usage was sky-high in Germany, and that these behaviors in aggregate just didn’t fit the theory of national suspiciousness. Ultimately, I had no objective way of assessing why there is such a large security and privacy community in Berlin, but if it isn’t due to the sad, violent history of this place then there’s clearly some other very interesting explanation lurking. Theories on an encrypted post-card, please.

My final observation on the privacy and security scene is that the energy surrounding privacy tech and privacy laws has created opportunity costs for the wider civic and social impact tech scene. There were actually, overall, fewer big mainstream civic tech or social impact tech projects than I would have expected to find in a country with the wealth, tech chops and political consciousness that Germany has. I suspect it’s because more than a few ideas die in the cradle, smothered by concerns about how user data might be abused. At least one person told me they’d seen this happen.

IMPRESSIVE CIVIC & SOCIAL IMPACT ORGANIZATIONS I DISCOVERED ON MY ADVENTURE

I talked to a lot of people during my stay. The following list, which is in no particular order, simply attempts to give a taste of the interesting projects and people I met, rather than a verbatim record. If I spoke to you and you’re not here, please don’t feel slighted!
