Social network analysis privacy tackled

February 14, 2015, Pennsylvania State University

Protecting people's privacy in an age of online big data is difficult, but doing so when using visual representations of such things as social network data may present unique challenges, according to a Penn State computer scientist.

"Our goal is to be able to release information without making personal or sensitive data available and still be accurate," said Sofya Raskhodnikova, associate professor of computer science and engineering. "But we have to figure out what 'privacy' means and develop a rigorous mathematical foundation for protecting privacy."

In a way, the privacy issue is one of balancing protection of personal data against the benefits of global statistics as a public good. Large databases like those of the U.S. Census Bureau contain data that, when aggregated and analyzed, can illuminate social, economic and health problems and solutions. However, protecting privacy requires more than simply removing identifying data from files before publishing databases and analytic results. With multiple open public databases available, data can easily be correlated between databases to pull together bits and pieces of deleted data and recover the identifying information.

Raskhodnikova gives the example of a database of movies from a paid service that included, in some cases, movies that people might prefer others did not know they viewed. The database owners wanted to be able to predict what people wanted to view, so they set up a competition to develop the best learning algorithms for real data. They thought that removing identifying information made it safe to publish the database. However, when viewing patterns from the paid service were compared to viewing patterns on a public, self-reporting movie-viewing database, there was enough information to identify an overwhelming majority of the individuals involved.

"Similar problems occur with search engine histories and other databases," said Raskhodnikova. "It isn't enough to just strip the obviously identifying information."

She suggests that "differential privacy" is needed to maximize accuracy of analysis while preventing identification of individual records. Differential privacy restricts the types of analyses that can be performed to those for which the presence or absence of one person does not matter much. It guarantees that an analysis performed on two databases that differ in only one record will return nearly the same result.
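This guarantee can be written down precisely. Using the standard formulation (not spelled out in the article), a randomized analysis A is ε-differentially private if, for any two databases D and D' that differ in a single record,

```latex
\Pr[A(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in S]
\quad \text{for every set of outcomes } S.
```

A small ε means the distribution of possible outputs barely changes when any one person's record is added or removed, which is exactly why no individual record can be pinned down from the result.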

"One approach for achieving differential privacy is adding a small amount of noise to the actual statistics before publishing them," said Raskhodnikova. "One problem is figuring out how much noise—it depends on how sensitive the statistic is—and how to do it so that we can still retain accuracy of results. There are other more sophisticated approaches for achieving differential privacy."
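To make the noise-addition approach concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. The function names and the toy salary data are illustrative assumptions, not taken from Raskhodnikova's work; the key point is that a count has sensitivity 1 (one person changes it by at most 1), so Laplace noise with scale 1/ε suffices for ε-differential privacy.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = 0.0
    while u == 0.0:          # avoid log(0) at the boundary
        u = random.random()
    u -= 0.5                 # now u is in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon hides
    any single individual's presence or absence.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy example (illustrative data): how many salaries exceed $100,000?
salaries = [52_000, 61_000, 48_000, 75_000, 1_200_000]
noisy = private_count(salaries, lambda s: s > 100_000, epsilon=0.5)
```

Smaller ε gives stronger privacy but noisier answers, and statistics with higher sensitivity need proportionally more noise, which is precisely the accuracy trade-off Raskhodnikova describes.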

The notion of differential privacy may be especially important to the protection of graph data, such as social networks.

Publishing a network of romantic relationships between students in one high school using only gender as an identifier would seem innocuous, for example, but in reality is not. A person with knowledge of the school under study could easily correlate the observed behavior of an outlier - a student with a minority sexual preference, for example - to identify that student's node within the network. Once one node is linked to a particular person, only a small amount of additional information is required to identify the rest of the students included in the network.

Researchers once thought that protecting this type of information was unimportant as long as the school and students were not named. However, they now know that, using other databases that might be open and available online and various clues from the graph, such as the total number of students and the ratio of boys to girls, it might be possible to identify the school and then the individual students.

"What the researchers tried to do is strip identifiers—name, etc., and publish, but this is not enough," said Raskhodnikova. "Other information is available that can be correlated with the network information."

Researchers have found differentially private methods for releasing many graph statistics. Examples include the number of occurrences of specified small patterns in a graph and the degree distribution. The degree distribution of a social network specifies what fraction of people have no friends, one friend, two friends, etc. However, some information, for example, the exact number of connections a specific person has, is inherently too sensitive to be released with differential privacy.
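The degree-distribution example can be sketched under edge-level privacy, a common assumption in which a single friendship, rather than a whole person, is the unit being hidden. Adding or removing one edge changes two nodes' degrees, shifting each between two adjacent histogram bins, so the histogram's L1 sensitivity is at most 4. The helper names below are hypothetical, not from any particular library.

```python
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = 0.0
    while u == 0.0:          # avoid log(0) at the boundary
        u = random.random()
    u -= 0.5                 # now u is in (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_degree_histogram(edges, num_nodes, epsilon):
    """Release a noisy degree histogram under edge-level privacy.

    One edge affects two degrees, each moving between two adjacent
    bins, so the L1 sensitivity is at most 4; per-bin Laplace noise
    with scale 4/epsilon gives epsilon-differential privacy.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    hist = Counter(degree[n] for n in range(num_nodes))
    scale = 4.0 / epsilon
    return {d: hist.get(d, 0) + laplace_noise(scale)
            for d in range(num_nodes)}

# Tiny friendship graph: node 1 has two friends; nodes 0 and 2 have one.
noisy_hist = private_degree_histogram([(0, 1), (1, 2)], num_nodes=3,
                                      epsilon=0.5)
```

By contrast, a query like "how many friends does this specific person have" changes completely when that person's edges are removed, which is why, as noted above, such statistics are inherently too sensitive to release with differential privacy.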

Data that qualifies as an outlier also poses a problem. If only one person in a salary database makes over $1 million a year, it becomes fairly easy to identify that person. Some types of data inherently pose a threat to privacy, said Raskhodnikova, and should not be published.
