Citizen Science Crowdsourcing Data Science



About six years ago, Shane Davis quit his job as a biologist to spend all of his time aggregating and analyzing data about oil and gas.

Davis says he gives citizens the tools “to fight off” the oil and gas industry. With information he provides, Davis says communities can go to companies and tell them:

‘Wait a minute. We have information here showing that over the last X amount of years your operations have already caused groundwater contamination at this rate or your patterns of spills are at this rate.’ Or this many people got hurt. Or there are complaints, laundry lists of complaints about your operation. Or maybe these operators have had huge problems with their well casings. Maybe they’ve contaminated private water wells or an aquifer.

In other words, Davis is stepping in where he says state regulatory agencies have failed to protect the people, earth, and water of Colorado from pollution. Indeed, all over the U.S. and world, environmentalists have stopped relying on government agencies to monitor everything from oil spills to the spread of invasive species. Instead, calling themselves “citizen scientists,” they’re taking matters into their own computers and smartphones: gathering, analyzing, and publishing data. Some, like Davis, have adopted a combative stance toward the government. Many others would rather work with government agencies and say they’re contributing information the state simply doesn’t have the capacity to gather.

Davis gets his information from a state agency, the Colorado Oil and Gas Conservation Commission, or COGCC. When companies send mandatory reports about their operations to the state, the COGCC uploads them to its website as PDFs. Since PDFs are nearly impossible to aggregate and analyze, Davis developed tools to scrape information from them and dump it into an Excel spreadsheet: “a really, really topnotch spreadsheet, a spreadsheet that has incredible functionality,” he says. Then, he looks for patterns: spills, groundwater contamination, well casing failures.
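The article does not describe how Davis's tools are built, so the following is only a rough sketch of the general scrape-and-aggregate idea. It assumes the text of each report has already been extracted from its PDF, and the field labels (`Operator`, `Incident Type`, `Date`), the operator names, and the sample reports are all hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical field labels; real report layouts will differ.
FIELD_PATTERN = re.compile(
    r"^(Operator|Incident Type|Date):[ \t]*(.+)$", re.MULTILINE
)


def parse_report(text):
    """Turn the extracted text of one report into a flat dict of fields."""
    return {label: value.strip() for label, value in FIELD_PATTERN.findall(text)}


def incidents_per_operator(reports):
    """Count parsed reports by (operator, incident type)."""
    counts = Counter()
    for report in reports:
        key = (report.get("Operator", "unknown"),
               report.get("Incident Type", "unknown"))
        counts[key] += 1
    return counts


def to_csv(counts):
    """Render the counts as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["operator", "incident_type", "count"])
    for (operator, kind), n in sorted(counts.items()):
        writer.writerow([operator, kind, n])
    return buffer.getvalue()


# Made-up sample text standing in for three extracted reports.
SAMPLE_TEXTS = [
    "Operator: Acme Energy\nIncident Type: Spill\nDate: 2013-04-02\n",
    "Operator: Acme Energy\nIncident Type: Spill\nDate: 2013-06-11\n",
    "Operator: Example Oil\nIncident Type: Casing failure\nDate: 2012-09-30\n",
]

reports = [parse_report(text) for text in SAMPLE_TEXTS]
counts = incidents_per_operator(reports)
csv_text = to_csv(counts)
```

Once the reports are in tabular form, spotting a pattern such as one operator's repeated spills becomes a matter of sorting or filtering rows rather than reading documents one at a time.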

At first, Davis spent 70 to 80 hours a week doing this work. “I was basically in my own solitary confinement for two years,” he recalls. “Now, it’s gotten a lot easier. I’m pretty quick at it.”

When Davis finds useful information, he brings it to communities and shares it.

“My presentations are not data-heavy. You’re going to lose everyone if you just jam it packed full of data,” he explains. “I’ll put in images, satellite images of the shale formation, and I’ll pick a bunch of people—politicians or environmental whatever. And I’ll show where they live, right on top of that shale formation. But then I’ll show them all the other well bores that are around their house and some failures that have happened. Maybe there’s a benzene spill.”

Davis is not the only citizen scientist to adopt this tactic of mining the troves of data available on government websites to tell stories. Adrian Cotter, who’s been with the Sierra Club for 13 years, says environmentalists have always considered themselves “data-based” and the only difference is the accessibility of data: “There’s just a lot of resources online for finding everything from all the oil rigs in Alaska to all the [oil] leases in the Gulf.” Overlaying that data with maps, as Cotter did with oil rigs and the migratory paths of the caribou herds of the Arctic National Wildlife Refuge, can tell powerful stories.

In 2012 and 2013, six Colorado communities voted on bans or moratoriums on fracking within their borders. Davis visited all of them, bearing information about the oil and gas industry. He recalls telling them, “Hey, look what’s happening in your backyard! Near your schools, your playgrounds, your universities, public parks. This is the information that I can give you.” In its turn, the Colorado Oil and Gas Association, the industry trade association, spent hundreds of thousands of dollars fighting the measures. It’s unclear to what extent Davis’ data managed to sway voters, but he’s been credited with inciting anti-fracking sentiment in Colorado and with coining the term “fractivist.” Five of the measures passed, one of them by only 13 votes.

Since then, the Colorado Oil and Gas Association has sued all five communities, saying only the state has the power to regulate the industry. The state has joined the industry in the suits, which Davis views as yet more evidence that “regulatory agencies are designed by those they benefit the most” and are not capable of protecting the citizenry. Courts struck down three of the fracking bans, and the Colorado Supreme Court is considering the other two.

Some citizen scientists find it more effective to work with government agencies than to fight them. David Newell, a professor at Southern Cross University, in New South Wales, Australia, and a researcher of frogs and toads, says, “We need to work collaboratively to be able to bring about change.”

For example, he wanted to find out how far the invasive cane toad had spread across his state of New South Wales. It would have been expensive to launch a study—and, besides, people already had the information; there just wasn’t an easy way for them to share it.

“Well, they can ring up their local park service office and somebody would write it down on a piece of paper and if we’re lucky it might end up in a database,” he says he thought at the time. “But let’s actually come up with a mechanism by which people can use technology, log their record, and also at the same time gain additional information around what it is that they can actually do—so give them a portal to be able to contribute to the database and then get additional information.”

That’s why Newell decided to build Toad Tracker, which has since become Toad Scan, a means for average Australians to report cane toad sightings. Toad Scan is the opposite of a PDF: Instead of uploading information in such a way that anyone interested in the data has to wade through reams of documents in order to find anything useful, all the data is right there, instantaneously available to the public.
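To make the contrast with PDFs concrete, here is a minimal sketch of what structured, queryable sighting records might look like. It is not Toad Scan's actual data model, which the article does not describe; the `Sighting` fields, the `within_box` helper, and the sample records are assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sighting:
    """One citizen-submitted record (hypothetical fields)."""
    species: str
    latitude: float
    longitude: float
    observed_on: date


def validate(sighting):
    """Reject records whose coordinates are not valid latitude/longitude."""
    if not -90 <= sighting.latitude <= 90:
        raise ValueError("latitude out of range")
    if not -180 <= sighting.longitude <= 180:
        raise ValueError("longitude out of range")
    return sighting


def within_box(sightings, south, north, west, east):
    """Return the sightings that fall inside a latitude/longitude box."""
    return [
        s for s in sightings
        if south <= s.latitude <= north and west <= s.longitude <= east
    ]


# Made-up records: one in northern New South Wales, one near Sydney.
SIGHTINGS = [
    validate(Sighting("cane toad", -28.81, 153.28, date(2015, 2, 14))),
    validate(Sighting("cane toad", -33.87, 151.21, date(2015, 3, 1))),
]

northern_nsw = within_box(SIGHTINGS, south=-30.0, north=-28.0,
                          west=152.0, east=154.0)
```

Because each record carries its own coordinates and date, a question like "which sightings fall in this region?" is a one-line filter instead of a search through documents.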

That democratization of information at first made government regulators uncomfortable, Newell recalls, because “they see themselves as the knowledge-holders of databases.” That said, he adds, once “government agencies see that this is a really valuable tool to be able to capture spatial information,” they generally come around. His goal is for citizens, academics, and government agencies to work together toward conservation goals.

As for Shane Davis in Colorado, he’ll continue to fight. “I’m not stopping until we change law so it favors communities and the environment and does not favor corporations, corporate capitalism, oil and gas,” he says.

Civic Hacking Data Science open data



The founder of a Hacker News-style site for data for social good projects says that there is not enough replication in the civic hacking community, and he means to change that.

A year after launching DataLook, a Hacker News-style site highlighting data projects for social good, Tobias Pfaff and his colleagues are spearheading a 10-week replication marathon of some of the site’s top reusable projects in advance of a TEDx competition they qualified for this spring. Participants are finding each other and collaborating on Slack, although if it makes more sense to take problem solving to outside sites (GitHub’s issue tracker, for example), they are encouraged to do that as well.

“I think there is not enough focus on replicating projects [in the civic tech community],” founder Tobias Pfaff tells Civicist in a Skype interview. “I think it might be less sexy to do things that other people have done before.”

However, Pfaff also points out that replicating projects can be faster and easier than starting an open data project from scratch. Replication, he says, “can be super sexy” because you can get things done—and start having an impact—quickly. He points Civicist to Jason Hibbets’ framework for civic hackers, which outlines three kinds of projects: green fields (new and untested); cloned (tested, approved, and repeated); and augmented (tested and improved upon).

One successful and much-discussed replication is the late U.S. Politwoops, a transparency project documenting politicians’ deleted tweets, which was based on a project first launched in the Netherlands in 2010. The service recently made headlines after Twitter pulled its API access for violating terms of service. However, other iterations of Politwoops continue to run smoothly in 30 other countries.

The first project replicated as part of DataLook’s marathon was a Twitter bot that automatically posts information about animals up for adoption at local shelters. The person behind it, Slack user justnisdead, says that future replications would only take 15-30 minutes per bot.

DataLook’s goal for the marathon is to demonstrate the impact that replication can have in just 10 weeks, and then to challenge the TEDx judges to imagine what they could accomplish if the marathon were extended to a year or more.

DataLook (originally Data for Good, until they found that name was already a registered trademark in the U.S.) was built during a startup weekend in Germany last year. It was always meant to be a home for replicable data-for-good projects. In the year since, however, Pfaff has found that the user base is too small to sustain a robust upvote/downvote-style site. There just isn’t enough traffic.

(He speculated this might be because many of the major players in the civic tech scene—Code for America, for example—are hosting many of these conversations in private or semi-private/branded spaces, and that others are spread out on various platforms like Reddit and DataTau.)

And yet Pfaff and his DataLook colleagues know many of the projects on the site are worth replicating. “A month ago,” Pfaff says, “we went through our complete database and discussed which [projects] are really cool and which are reusable…[which solve] generic problems that appear in every city around the world and at the same time the code is open source.” These are the projects they pulled out for a shortlist, and are actively encouraging data scientists to replicate during the marathon. The shortlist of projects includes Councilmatic; FixMyStreet; a food inspection forecasting app; Link-SF, a resource for homeless and low-income city residents; and more.

DataLook is encouraging interested parties to join an open Slack channel, find the projects that most interest them, and connect with like-minded people. There are currently twenty or so members of the general DataLook channel.

Pfaff makes clear that the end of the marathon is not meant to be the end of replicating projects, but that the purpose of the marathon is “to see what is possible within a given timeframe.”

“And then we can see what happens next,” he adds.