Researchers say readers' identities can reveal much about content of articles

Aug 12, 2013

Articles that people share on social networks can reveal a lot about those readers, research has shown. But a new Carnegie Mellon University study reverses the proposition, asking the question: What can be learned about an article from the attributes of its readers?

To find out, the CMU researchers, along with colleagues at the University of Washington, analyzed almost 3 million and the public profiles of the people who shared those articles on Twitter. This enabled them to generate a few thousand "badges" that characterized the content of the shared news articles and also could be used to analyze any subsequent article, including those that had never been shared or even read.

"Because these badges are based on the readers' self-described interests, rather than only on the words in each article, we found that the badges provide a consistent, reliable means of representing content," said Khalid El-Arini, a recent Ph.D. graduate of the CMU Computer Science Department. "For instance, while what it means to be 'liberal' changes from month to month, there will always be people who describe themselves as 'liberal' on Twitter, allowing us to produce a direct correspondence between 'liberal' topics from different periods of time."

This approach could thus be used in recommendation systems, but also can provide interesting insights about writers and the state of .

El-Arini will present the findings today at the Conference on Knowledge Discovery and Data Mining (KDD 2013) in Chicago. Other were Min Xu, a Ph.D. student in CMU's Machine Learning Department; Emily Fox, assistant professor of statistics at the University of Washington, and Carlos Guestrin, professor of computer science and engineering at Washington.

In order to train their model, the team began by looking at three months of —from September of 2010, 2011 and 2012—and selecting those that included links to mainstream news articles and came from a user who had filled out a Twitter profile.

Each news article was then downloaded and the most meaningful, unique words were extracted, creating a "bag of words" for each article; similar to a visual word cloud, these bags give greater weight to more important words. Likewise, from each user's Twitter profile, a set of descriptive words, or badges, was extracted.

By comparing the bags of words with badges from the people who shared the articles, the researchers were able to create a dictionary that associated each badge with its characteristic words. For example, people who self-identify with the music badge in their profiles are likely to share articles with words such as "band," "album" and "song." Different dictionaries were created for each year to compensate for interests or topics that change over time. These dictionaries were then used to encode new articles, leading to a document representation based on attributes of potential readers.

"It is important to point out that the badge dictionaries can be used to encode articles never before seen or shared," El-Arini said. "We take the content of the articles, and use our model to predict which types of readers would be likely to share them."

The researchers included a case study that used badges to explore the readership of prominent political columnists. This method showed, not surprisingly, that New York Times columnist Maureen Dowd had readers who tended to be progressive. This association was notable because Dowd never explicitly uses the word "progressive" in the articles analyzed by the researchers. Rather, the algorithm detected that the words Dowd uses in these articles correspond to the type of content self-described progressives tend to share on Twitter.

El-Arini, now a research scientist at Facebook, said the algorithm the team designed also takes into account relationships between badges. The system understands, for instance, that "liberal" is similar to "progressive" and "school" is associated with "student."

"Our methodology leads to thematically coherent topics that are more consistent over time than popular alternative approaches," El-Arini said. "We believe this method will lead to better performance on personalization tasks."

Explore further: Vatican's manuscripts digital archive now available online

Related Stories

Study examines how news spreads on Twitter

Jan 09, 2013

Nearly every major news organization has a Twitter account these days, but just how effective is the microblogging website at spreading news? That's the question University of Arizona professor Sudha Ram set out to answer ...

Free articles get read but don't generate more citations

Jul 31, 2008

When academic articles are "open access" or free online, they get read more often, but they don't -- going against conventional wisdom -- get cited more often in academic literature, finds a new Cornell study.

Some readers will get around paywall: NY Times

Mar 23, 2011

The publisher of The New York Times said Wednesday that some people will manage to find ways around paying for the newspaper online but they will mostly be teenagers and the unemployed.

Scientists analyze millions of news articles

Nov 26, 2012

Researchers in the UK have used artificial intelligence algorithms to analyze 2.5 million articles from 498 different English-language online news outlets over ten months.

Recommended for you

When emotions control objects

1 hour ago

Dimming a light, immersive playing on a computer, and tracking yoga exercises in real time – sensors developed by SmartCardia use various vital signs to transmit data to a host of everyday objects.

First-of-a-kind supercritical CO2 turbine

2 hours ago

Toshiba Corporation today announced that it will supply a first-of-a-kind supercritical CO2 turbine to a demonstration plant being built in Texas, USA. The plant will be developed by NET Power, LLC, a U.S. venture, together w ...

User comments : 0