Data may not compute: Program stores older Web research files, left at risk by technological leaps
September 19, 2011 By Alvin Powell
“Data is not like a book. If you get a 300-year-old book and you know the language, you can usually read it,” said Gary King, the Albert J. Weatherhead III University Professor and head of Harvard’s Institute for Quantitative Social Science (IQSS). “Data changes formats. If it’s from even five years ago, you might not be able to read it at all.” Credit: Kris Snibbe/Harvard Staff Photographer
Modern scholars are wrestling with a problem that ancient monks and early authors managed to master: how to keep their work accessible to future generations.
While the books, papers, and journals of early scientists remain readable to anyone who can lay hands on them and knows the language, that is not the case for those whose work is stored on early computer media, just a few decades old.
The breakneck pace of technology’s advance has left data in its dust, stored on tapes, floppy disks, and other media now unreadable by newer computers. And it’s not just the nature of storage media that is rapidly changing. File formats change as new programs are developed, rendering older programs obsolete even while giving researchers powerful new tools.
“Data is not like a book. If you get a 300-year-old book and you know the language, you can usually read it,” said Gary King, the Albert J. Weatherhead III University Professor and head of Harvard’s Institute for Quantitative Social Science (IQSS). “Data changes formats. If it’s from even five years ago, you might not be able to read it at all.”
King has watched those changes since he arrived at Harvard in 1987. As head of the Harvard Data Center, then the Harvard-MIT Data Center, and now the institute, King realized long ago that efforts had to be made to ensure access to digital data for future scholars.
While publication in academic and scientific journals provides summaries of research, King said those articles are like advertisements for the underlying work, the reams of data gathered during exhaustive social science surveys, years of field observations, and long nights in the lab. Further, he said, today more grant-making agencies and journals require researchers to make their data available to others as a condition of a grant or of publication.
“It’s very important in science and social science to share research data,” King said.
One solution to the problem already exists, on computers at Harvard and in a growing network of corporations, universities, and other institutions. Called the Dataverse Network Project and spearheaded by the IQSS, the effort provides archival storage for research projects, initially in the social sciences but recently expanding to the physical sciences and humanities.
The Dataverse project solves problems that plagued the two most common previous data storage strategies, King said. The first is that researchers sometimes use major archives to hold their data. The problem with that, King said, involves loss of control over the data and, potentially, a loss of credit for gathering it, because the archive is sometimes cited as the source. The second commonly used strategy is to store the data on personal computers or servers, making it available on the Web through a researcher’s Web page. The problem there, King said, is that Web pages don’t endure for long. Researchers change institutions, links are lost, and access to data is gone as well.
“The average age of a link on the Web is very short,” King said. “Servers under the desk break or are replaced; the data can disappear.”
The Dataverse project is designed to solve both problems, King said. First, the IQSS employs professional archiving standards that ensure access to data long into the future. Once a researcher’s data is put into the system, it is converted from its original file format into a basic one that ensures the information will remain readable for decades to come. When that format becomes obsolete, King said, the system will automatically convert it to a new format, also designed to endure for decades. To guard against loss, the data is backed up on servers at different locations.
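For readers who want a concrete picture, the two ideas in this paragraph (conversion to a basic format, and backup copies in several places) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Dataverse software: the choice of plain UTF-8 CSV as the "basic" format, the SHA-256 check, and the function names `to_plain_text`, `checksum`, and `replicate` are all hypothetical, and local folders stand in for servers at different locations.

```python
import csv
import hashlib
import shutil
import tempfile
from pathlib import Path


def to_plain_text(rows, out_path):
    """Write rows (a list of lists) as UTF-8 CSV, assumed here to be the 'basic' format."""
    out_path = Path(out_path)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return out_path


def checksum(path):
    """Return the SHA-256 hex digest of a file so copies can be verified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def replicate(path, locations):
    """Copy the file into each location and confirm every copy matches the original."""
    expected = checksum(path)
    copies = []
    for location in locations:
        location = Path(location)
        location.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(path, location))
        if checksum(copy) != expected:
            raise IOError(f"copy at {copy} does not match the original")
        copies.append(copy)
    return copies


if __name__ == "__main__":
    # Hypothetical demo: temporary folders stand in for separate backup sites.
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        dataset = to_plain_text([["id", "score"], ["1", "0.5"]], work / "survey.csv")
        backups = replicate(dataset, [work / "site_a", work / "site_b"])
        print(f"stored {len(backups)} verified copies")
```

A real archive would also need to record which format each file is in, so that a later migration to a newer format knows what it is converting from; that bookkeeping is left out of this sketch.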
Instead of being locked away somewhere, the data remains accessible to the researcher through a Web interface designed to look like just another page holding a list of datasets on the researcher’s website. Instead of bringing visitors who click on a page to a researcher’s server, though, it links directly to a Dataverse server. The data sets, like the journal articles that result from them, have their own citations so that, if they are used by other scientists, a researcher gets credit for the work.
“As a researcher, I don’t need to do anything. It looks like it’s mine, but it’s preserved in the background,” King said.
There are Dataverses at several different levels, including the Dataverse Network Project, which has developed and distributed the software; the IQSS’s Dataverse Network, which is the Harvard-centered network, holding the data of Harvard researchers; the Dataverse networks of other institutions; and the Dataverses of individual researchers, which are individual archives from their specific projects and which reside on the networks at specific institutions.
Mercè Crosas, director of product development at IQSS, led the development efforts of the Dataverse Network software. She said IQSS currently hosts more than 350 individual researchers’ Dataverses. Those Dataverses hold about 40,000 studies, made up of 665,000 files. Although Dataverse has so far mainly been used by social scientists, Crosas said some groups in the sciences, including the Harvard-Smithsonian Center for Astrophysics, are beginning to explore Dataverse options.
She expects the size of the files stored there to double in the next five years, as more researchers seek solutions to the problem of storing data into perpetuity. To help that expansion, she said, the Dataverse software is open source, meaning that the code is open to others to download and edit. Among the institutions that have adopted the Dataverse approach are the University of North Carolina, the University of Michigan, and several campuses of the University of California.
The software’s open-source nature means that other institutions can have their own programmers add features that can then be shared with the community of users.
Of course, preserving anything into perpetuity is a tall order, and King acknowledged that will be a central challenge as people and institutions change. The advantage of a place like Harvard, though, is that it is stable and likely to endure.
“You need the community to persist,” King said. “That’s the kind of thing Harvard does best.”
This story is published courtesy of the Harvard Gazette, Harvard University’s official newspaper. For additional university news, visit Harvard.edu.
More information: http://dvn.iq.harvard.edu/dvn/
Provided by
Harvard University
What does this mean? Decrepit is not a verb. Many old files are decrepit. Perhaps meant to decipher? or to decode?
de·crep·it Adjective
1. (of a person) Elderly and infirm.
2. Worn out or ruined because of age or neglect.
Either way it's a very real problem - files on 3inch Amstrad WP disks in some strange compressed format are as good as lost!
Sep 19, 2011
XML was the hope to make this an issue of the past -- but good documentation once again could solve much of this issue, mainly how to read and interpret the data in a given file.
Sep 19, 2011
just like a .txt file from 1985 is still readable today. Granted, it's probably a little more sophisticated than that - but it is probably all stored in some kind of markup language and scalable image type.
randolm...you are absolutely incorrect. There are plenty of archaic filetypes that are not universal or standard that would be a pain in the ass to find software that is compatible with modern hardware and can decode it.
As for the physical media, they have their own server cluster - presumably that would replicate across the network as they upgrade the nodes.
and @ diverse...why don't we just use our comprehension skills and assume that randolmj meant 'decript'.
Sep 19, 2011