AI adjusts for gaps in citizen science data

January 28, 2019 by Melanie Lefkowitz, Cornell University
Heat maps showing bird observations in New York state. The map on the left shows the original samples submitted to eBird and the map on the right shows the distribution after it was adjusted using a model developed by Cornell researchers to reduce location bias in citizen science projects. Credit: Cornell Lab of Ornithology

Citizen science is a boon for researchers, providing reams of data about everything from animal species to distant galaxies.

But crowdsourced information can be inconsistent. More reports come from densely populated areas and fewer from spots that are hard to access, creating challenges for researchers who need evenly distributed data.

"There is a huge bias in the data set because the data is collected by volunteers," said Di Chen, a doctoral student in computer science and first author of "Bias Reduction via End-to-End Shift Learning: Application to Citizen Science," which will be presented at the AAAI Conference on Artificial Intelligence, Jan. 27-Feb. 1 in Honolulu.

"Since this is highly motivated by their personal interest, the distribution of this kind of data is not what scientists want," Chen said. "All the data is actually distributed along main roads and in urban areas because most people don't want to drive 200 miles to help us explore birds in a desert."

To compensate, Chen and Carla Gomes, professor of computer science and director of the Institute for Computational Sustainability, developed a deep learning model that effectively corrects for location biases in citizen science by comparing the population densities of various locations. Gomes and Chen tested their model on data from the Cornell Lab of Ornithology's eBird, which collects more than 100 million bird sightings submitted annually by birdwatchers worldwide.

"When I communicate with conservation biologists and ecologists, a big part of communicating about these estimates is convincing them that we are aware of these biases and, to the degree possible, controlling for them," said Daniel Fink, a senior research associate at the Lab of Ornithology who is collaborating with Gomes and Chen on this work. "This gives [biologists and ecologists] a better reason to trust these results and actually use them, and base decisions on them."

Researchers have long been aware of the problems with citizen science data and have tried various methods to address them, including other types of statistical models. Projects that offer incentives to entice volunteers to travel to remote spots or search for less-popular species have shown promise, but these can be expensive and hard to conduct on a large scale.

A massive data set like eBird's is useful in machine learning, where large amounts of data are used to train computers to make predictions and solve problems. But because of the location biases, a model created with the eBird data would make inaccurate predictions.

Adjusting for location bias in the eBird data is further complicated by the data's many characteristics. Each bird sighting in the system comprises 16 distinct pieces of information, making the correction computationally challenging.

Chen and Gomes solved the problem using a deep learning model, a kind of artificial intelligence that is good at classifying, that adjusts for population differences in different areas by comparing their ratios of density.
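The general idea of reweighting by a density ratio can be illustrated with a small sketch. This is not the researchers' deep learning model; it is the classic histogram-based importance weighting that the same idea reduces to in the simplest case. The habitat labels, the toy checklists and the function names are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def density_ratio_weights(train_bins, target_bins):
    """Weight each training sample by p_target(bin) / p_train(bin).

    Both arguments are lists of discrete labels (hypothetical habitat bins).
    Assumes every training bin also appears in the target sample.
    """
    train_counts = Counter(train_bins)
    target_counts = Counter(target_bins)
    n_train, n_target = len(train_bins), len(target_bins)
    return [
        (target_counts[b] * n_train) / (train_counts[b] * n_target)
        for b in train_bins
    ]


def weighted_mean(values, weights):
    """Average of values in which each sample counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: volunteers mostly report from urban areas,
# while the region of interest is mostly remote.
train_bins = ["urban"] * 8 + ["remote"] * 2
train_seen = [1, 1, 1, 1, 1, 1, 0, 0] + [0, 0]  # 1 = species reported
target_bins = ["urban"] * 2 + ["remote"] * 8

weights = density_ratio_weights(train_bins, target_bins)
naive_rate = sum(train_seen) / len(train_seen)
adjusted_rate = weighted_mean(train_seen, weights)
```

In this toy case the unweighted reporting rate is 0.6, while the reweighted rate falls to 0.15, because the overrepresented urban checklists are down-weighted to match how rare urban areas are in the target region.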

"Right now the data we get is essentially biased because the birds don't just stay around cities, so we need to factor that in and correct that," Gomes said. "We need to make sure the training data is going to match what you would have in the real world."

Chen and Gomes tested several models and found their deep learning algorithm to be more effective than other statistical or machine learning models at predicting where bird species might be found.

Though they worked with eBird, their findings could be used in any kind of citizen science project, Gomes said.

"There are many, many applications that rely on citizen science, and this problem is prevalent, so you really need to correct for it, whether people are classifying birds, galaxies or other situations where data biases can skew the learned model," she said.

More information: Di Chen and Carla P Gomes. Bias Reduction via End-to-End Shift Learning: Application to Citizen Science. arxiv.org/pdf/1811.00458.pdf