Computer scientists say it's time to start looking at treatment of data waste

Jul 19, 2011 by Bob Yirka

(PhysOrg.com) -- As anyone who has used a Windows-based computer for any length of time knows, the longer you have it, the slower it gets. One reason is the accumulation of data files and system log entries, information that in many cases isn’t really necessary. In other words, our computers slow down due to the accumulation of "waste." Now, two computer scientists from Johns Hopkins University have published a paper on arXiv in which they argue that data waste on computer systems could, and should, be handled much the way physical-world waste is managed.

In their paper, Ragib Hasan and Randal Burns pick up where computer scientists at Cornell University left off after discovering in 1999 that up to 80% of files written to the hard drive by the Windows NT operating system were deleted within five seconds of being created.

Hasan and Burns analyzed three computers: a MacBook laptop, a desktop running Ubuntu Linux and a Fedora Linux fileserver in the University Library (Linux is a Unix-like operating system widely used at educational and research institutions). Their intent was to find out what percentage of the files on each computer had not been accessed since their creation. They found that the percentages were 20.6 for the MacBook, 47.4 for the desktop and 57.1 for the server, and that the corresponding percentages of disk space were 98.5, 38.1 and 99.5, respectively. This indicates that a large number of files using a lot of disk space had never been used again after being created, which is an inefficient use of resources.
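
The article does not say how the researchers detected unaccessed files. One rough way to approximate the measurement, sketched here in Python, is to walk a directory tree and count regular files whose last-access time is no later than their last-modification time. The function name `unaccessed_stats` and the access-time comparison are assumptions made for illustration, not the authors' method, and the result depends on how the filesystem records access times (a volume mounted with `noatime`, for example, does not update them, which would inflate the count).

```python
import os
import stat


def unaccessed_stats(root):
    """Return (percent of files, percent of bytes) that look never-read.

    Assumption for illustration: a file counts as "unaccessed" when its
    last-access time is not later than its last-modification time.
    """
    total_files = total_bytes = 0
    stale_files = stale_bytes = 0

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, devices
            total_files += 1
            total_bytes += info.st_size
            if info.st_atime <= info.st_mtime:
                stale_files += 1
                stale_bytes += info.st_size

    if total_files == 0:
        return 0.0, 0.0
    pct_files = 100.0 * stale_files / total_files
    pct_bytes = 100.0 * stale_bytes / total_bytes if total_bytes else 0.0
    return pct_files, pct_bytes
```

Calling `unaccessed_stats("/some/directory")` returns two percentages comparable in spirit to the two sets of figures quoted above, though not necessarily computed the same way.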

For this reason, the duo suggest a new approach to data waste, one that takes advantage of the research already done on physical waste. Specifically, they suggest a pyramid similar to the one used by physical waste management companies. The worst-case option sits at the bottom, and each step up is preferable to the one below it. From bottom to top, the levels are labeled Dispose, Recover, Recycle, Reuse and Reduce, with zero data waste being the optimal goal.

In this case, Dispose means just that: erasing the data. Recover refers to extracting usable components, Recycle to refurbishing components for reuse, and Reuse to using those recovered components in another way. Reduce, the ultimate goal, means creating software that doesn’t create waste data in the first place.

Besides slowing computers down through I/O bottlenecks, data waste can also contribute to faster burnout of flash storage, which can sustain only a limited number of lifetime writes and rewrites before dying. The authors point out that this will likely become more important as such technology is increasingly used in hand-held computing devices.

More information: The Life and Death of Unwanted Bits: Towards Proactive Waste Data Management in Digital Ecosystems, Ragib Hasan, Randal Burns, arXiv:1106.6062v2 [cs.ET] arxiv.org/abs/1106.6062

Abstract
Our everyday data processing activities create massive amounts of data. Like physical waste and trash, unwanted and unused data also pollutes the digital environment by degrading the performance and capacity of storage systems and requiring costly disposal. In this paper, we propose using the lessons from real life waste management in handling waste data. We show the impact of waste data on the performance and operational costs of our computing systems. To allow better waste data management, we define a waste hierarchy for digital objects and provide insights into how to identify and categorize waste data. Finally, we introduce novel ways of reusing, reducing, and recycling data and software to minimize the impact of data wastage.

User comments : 14

Eikka
2 / 5 (5) Jul 19, 2011
I think there's a grave misunderstanding of what is actually "data waste".

For example, there's a roughly 600 MB folder in the Windows XP operating system that contains various generic drivers for devices like digital pens, ZIP drives, all sorts of obsolete hardware and all sorts of new hardware like bluetooth dongles which may never ever get used on a particular computer.

But they might - and that's the point.

I have an older netbook with just 4 GB for the system partition. I removed all the extra bits like update uninstallers and the drivers folder, and now the system is much smaller with room to breathe. But if I wanted to plug in a gamepad, which I've never used on the machine, it probably wouldn't recognize it because I removed the drivers from the driver cache.
Eikka
1.8 / 5 (5) Jul 19, 2011
And on the Linux operating system, you could ask whether it's a sensible use of the disk space to have tens of thousands of small plain-text configuration files, and other only-do-one-thing files like the Unix mantra is?

Because, on large filesystems, large sector sizes lead to reduced formatting losses because less data is needed to index where the files are. This however means that small files become increasingly wasteful since they won't use all of the disk space appointed to them.

A small text file may need 1.5 sectors, but it will reserve 2. This multiplied by tens of thousands of files is incredibly wasteful.

Of course you could always format a system partition where you have small sectors, and a data partition that has large sectors, except it becomes a problem when one eventually grows out of its bounds and needs to borrow space from the other. Usually it's the system partition, which grows and grows as the user adds software and updates to the software.
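
The rounding described in the comment above can be sketched in a few lines of Python. This is a simplified model that assumes every file is rounded up to a whole number of fixed-size allocation units. The function names and the example sizes are illustrative assumptions, and real filesystems can behave differently (some store very small files inline, for instance).

```python
def allocated_size(file_size, unit_size):
    """Bytes reserved on disk when space is handed out in whole units."""
    if file_size < 0 or unit_size <= 0:
        raise ValueError("file_size must be >= 0 and unit_size must be > 0")
    units = -(-file_size // unit_size)  # ceiling division on integers
    return units * unit_size


def slack(file_size, unit_size):
    """Bytes reserved for the file but not used by its contents."""
    return allocated_size(file_size, unit_size) - file_size
```

Under these assumptions, a 6,144-byte file (1.5 units of 4,096 bytes) reserves 8,192 bytes and leaves 2,048 unused. Across 10,000 such files that adds up to about 20 MB of slack.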
Eikka
1 / 5 (6) Jul 19, 2011
And you know what happens when you run out of space on the system partition in Ubuntu while installing software?

The package manager conks out and s**ts itself because it refuses to delete anything before it has completed the previous operation, which can't commence because there's no space to put the files in.

And in Linux, you Do Not Manually Delete anything unless you're willing to become the package manager yourself and figure out where everything is and where it belongs, which isn't nice because the files are shot all over the directory tree with a cannon.
J-n
4.3 / 5 (3) Jul 19, 2011
Wow, a bit of misunderstanding of how linux and ext3 work. That's to be expected.

In the windows system, it would seem to me to be a bit more efficient to have the drivers stored somewhere on the net, microsoft hosted, where when you need a driver for a device it would download and use it.

"The package manager conks out and s**ts itself because it refuses to delete anything before it has completed the previous operation, which can't commence because there's no space to put the files in."


Why didn't you read when it said "your install will take X space and you have Y space available"? You could also cancel out of the install, and uninstall the package.

"which isn't nice because the files are shot all over the directory tree with a cannon."


And bits of information aren't scattered across the computer when you install a windows program? 1/2 of which isn't even cleaned up when you use the Windows uninstall feature. At least with Linux I know it's all been removed.
Eikka
1 / 5 (1) Jul 19, 2011
"Why didn't you read when it said 'your install will take X space and you have Y space available'? You could also cancel out of the install, and uninstall the package."

Because it didn't. Because I couldn't, because it wouldn't. (It: the Ubuntu software center thingy)

I know it says something like that somewhere there, but when you look for new software it says one thing, and then downloads ten libraries that weren't included in the package. Ubuntu is supposed to be user friendly, but like always they haven't thought it all the way through and eventually you need manual intervention.


"And bits of information aren't scattered across the computer when you install a windows program?"


True, but then again a great deal of the software is just a folder in the program files, and a couple lines in the registry.

I especially like the fact that I can move a folder to a different drive, and the program usually works without having to make symbolic links to patch it up.
Eikka
1 / 5 (2) Jul 19, 2011
As I understood, there is no standard way for a program in Linux/Unix to ask "Where am I?" to find its own files from the file system.

Which is why certain types of files must be put into pre-determined places in the file system, and the whole thing starts to look like a bunch of angry octopodes wrestling with their tentacles.

You don't know where one thing ends and another one starts, so you can't open the "programs" folder and see "hey this program X takes Y amount of space on my drive, let's delete or re-locate it". You need a special program to do that, and re-location would break all the hard-coded paths.

It's kinda like 1995 again whenever I use it.
Shelgeyr
1 / 5 (1) Jul 19, 2011
"Don't delete your waste data, recycle it!"

I swear that has to be the company motto behind some of the spam I receive, which looks like it was poorly translated to Korean and back at least twice. Recycled, I guess...
Eikka
1 / 5 (3) Jul 19, 2011
"In the windows system, it would seem to me to be a bit more efficient to have the drivers stored somewhere on the net, microsoft hosted, where when you need a driver for a device it would download and use it."


On the disk space usage point, yes, but it would induce a horrible lag: "Generic Device found, Hitachi External DVD drive, downloading drivers, 4% complete (2.35 kb/s)"

Which is actually one of the major points I personally have against Linux. It's downright useless if it isn't tethered to the internet constantly. You can't do -anything- unless you're one click away from google to ask for help, or download some library or a patch to something. You can't even carry a software package on a USB stick because it's still missing N other things that you didn't know you didn't have or need.

Oh the joy when installing Linux, and the network refuses to work for some reason.
DrEvilBetty
not rated yet Jul 19, 2011
Your example is where the "Reduce" part of the plan would come in. If the OS asked at install if you wanted the full driver package or a minimal set of the most common drivers, you could be spared that wasted space at the beginning.

Alternatively, the drivers could be stored in a repository online and downloaded as needed. Another option is for device manufacturers to embed the drivers for the device on-board the device itself.

Any of these, or a combination, would reduce wasted space and time needed in scanning useless files.
gmurphy
3.5 / 5 (4) Jul 19, 2011
@Eikka, you really don't sound like you know what you're doing. I live and breathe this stuff, and let me tell you, the functionality, flexibility and reliability of a Linux system eclipses Windows to obscurity and beyond.
gjbloom
5 / 5 (3) Jul 19, 2011
@Eikka - you said "As I understood, there is no standard way for a program in Linux/Unix to ask "Where am I?" to find its own files from the file system."

My apps do this all the time. In perl, I say:
use FindBin;
my $appDir = $FindBin::Bin;

If my app then wanted to find all the files owned by the user my app is running as, my app could say:

my @files = qx/find $appDir -user $ENV{"USER"}/;

What's more, UNIX has a standard /tmp directory where applications can create their temporary files and not have to worry (much) about cleaning them up. Some variants of UNIX clean the /tmp directory every time the system boots.
gwrede
5 / 5 (1) Jul 19, 2011
What I find pathetic is when technical articles keep assuming nobody has ever heard of Linux. And then they go on assuming everybody knows unix, and explain linux in terms of unix -- usually getting even that somehow wrong.

That's like deep articles on cosmology and physics, where they always remember to explain what this Light Year thingy is, while never explaining the stuff you'd need to know. (Like what the interviewer could have asked, and now each reader has to spend an hour googling around.)
ronfinch
1.8 / 5 (4) Jul 19, 2011
you people should use macs
frajo
not rated yet Jul 20, 2011
"Usually it's the system partition, which grows and grows as the user adds software and updates to the software."
Depends who's in control. On my OS (neither Windows nor linux nor Mac), I don't allow non-system software to install to the system partition and even system enhancements are placed elsewhere when reasonable.
"And you know what happens when you run out of space on the system partition in Ubuntu while installing software?"
I'm not working with Ubuntu (it's too windowish) but who told you that software can be installed in /root only? Or what do you mean by "system partition"?
And how do you manage to run out of disk space in the age of TB disks?
"And in Linux, you Do Not Manually Delete anything unless you're willing to become the package manager yourself and figure out where everything is and where it belongs"
You don't have to delete manually. Use your package manager, synaptic, to remove an application.