Detecting Twitter users' gender, en français

Nov 28, 2013 by Chris Chipello

With 230 million users, Twitter has become a global force in social media. And not just in English.

Data miners have been hard at work trying to figure out attributes of Twitter users, such as gender and age, that aren't explicitly revealed on Twitter feeds. That information could be hugely valuable to marketers, enabling them to target messages to their desired audience. Nearly all the research done so far, however, has focused on English-speaking users and English-language content.

Now, a McGill University research team has conducted one of the first studies designed to figure out the gender of Twitter users who primarily use languages other than English.

Among the key findings: by using a special detector based on French-language syntax, the researchers showed that it is very easy to classify gender for Twitter users in French – and probably for other Romance languages. In particular, the researchers developed an algorithm to look for masculine or feminine adjectives or past participles following the phrase "Je suis" (or variants such as "je ne suis pas").

Based on this construction, the detector was able to determine the gender of users with 90% accuracy – significantly higher than the accuracy rates of 80% to 85% achieved by various algorithms that have been developed to analyze English-language content.

Because French adjectives and past participles have masculine and feminine forms that are often spelled differently, "you don't have to get too fancy" to develop an effective gender detector for tweets in the language, says Derek Ruths, a McGill computer-science professor who co-authored the study.
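The "je suis" construction described above can be illustrated with a short Python sketch. It is not the researchers' code: the article does not give their implementation, so the tiny word list, the handful of intensifiers, the majority-vote rule and the names `JE_SUIS` and `guess_gender` are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical mini-lexicon of (masculine, feminine) adjective and
# past-participle pairs. Illustrative only; a real detector would need
# a far larger list, which the article does not describe.
GENDERED_PAIRS = [
    ("content", "contente"),
    ("heureux", "heureuse"),
    ("fatigué", "fatiguée"),
    ("allé", "allée"),
    ("prêt", "prête"),
    ("désolé", "désolée"),
]
MASCULINE = {m for m, _ in GENDERED_PAIRS}
FEMININE = {f for _, f in GENDERED_PAIRS}

# Matches "je suis X" and "je ne suis pas X", optionally skipping one
# common intensifier (an assumed, incomplete list) before the word X.
JE_SUIS = re.compile(
    r"\bje\s+(?:ne\s+)?suis\s+(?:pas\s+)?"
    r"(?:(?:très|trop|si|vraiment)\s+)?([\w'-]+)"
)


def guess_gender(tweets):
    """Return 'male', 'female' or 'unknown' for one user's tweets."""
    votes = Counter()
    for tweet in tweets:
        # Normalize accents to one encoding, then lowercase.
        text = unicodedata.normalize("NFC", tweet).lower()
        for word in JE_SUIS.findall(text):
            if word in MASCULINE:
                votes["male"] += 1
            elif word in FEMININE:
                votes["female"] += 1
    if votes["male"] == votes["female"]:
        return "unknown"  # no evidence, or a tie
    return "male" if votes["male"] > votes["female"] else "female"


if __name__ == "__main__":
    sample = ["Je suis très contente aujourd'hui", "je ne suis pas fatiguée"]
    print(guess_gender(sample))
```

A user whose tweets never contain the construction, or whose masculine and feminine matches tie, is left as "unknown" here rather than guessed. How the actual study handled such users is not stated in the article.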

Since most individuals include photos of themselves on their Tweets, identifying male and female users might seem as simple as looking at the photos. But sorting through hundreds of millions of tweets is a task for computers, and "computers aren't good at looking at pictures," Ruths notes.

The McGill study was presented at a recent international conference in Seattle organized by the Association for Computational Linguistics. The paper also examines Twitter data sets for Japanese, Indonesian and Turkish. Japanese proved to be the toughest language for inferring gender.

The results obtained for French show that some languages have features better suited for certain classification tasks. "Identifying and leveraging such features promises to be an interesting and effective direction for future work," adds McGill linguistics professor Morgan Sonderegger, who co-authored the paper with Ruths and computer-science undergraduate student Morgane Ciot.

More information: Link to the paper: www.derekruths.com/static/publication_files/CiotSondereggerRuths_EMNLP2013.pdf
Link to the conference website: hum.csse.unimelb.edu.au/emnlp2013/
