March 1, 2013

Big data: Searching in large amounts of data quickly and efficiently

Scientific institutes and companies alike collect enormous amounts of data. Traditional database management systems are often unable to cope with such volumes, and suitable tools for retrieving information from big data are lacking. Computer scientists from Saarbrücken have developed an approach that makes searching large amounts of data fast and efficient. The researchers will present their results at the Cebit trade fair in Hannover, starting on March 5.

The term "big data" refers to digital information so large and so complex that conventional database technology cannot process it. Scientific institutes like the nuclear research center CERN are not the only ones that store huge amounts of data. Companies like Google and Facebook do so as well, and analyze the data to make better strategic decisions for their business. How successful such an analysis can be was shown in a New York Times article published last year. It reported on the US-based company "Target" which, by analyzing the buying patterns of a young woman, knew about her pregnancy before her father did.

The data to be analyzed is distributed across several servers on the internet, and search queries are sent to several servers in parallel. Traditional database management systems do not suit every use case: either they cannot cope with big data, or they overwhelm the user. Data analysts therefore favor tools that are based on the open-source software framework Apache Hadoop and that use its efficient file system, HDFS. These tools do not require expert knowledge. "If you are used to the programming language Java, you can already do a lot with it", explains Jens Dittrich, professor of information systems at Saarland University. But he also adds that Hadoop is not able to query big datasets as efficiently as database systems that are designed for parallel processing.
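The idea of sending one query to several servers in parallel can be illustrated with a minimal Python sketch. It is not Hadoop code: the in-memory lists in `partitions` stand in for the data held by individual servers, and the names `scan_partition` and `parallel_query` as well as the sample rows are assumptions made only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows stored on one server.
partitions = [
    [("alice", 34), ("bob", 19)],
    [("carol", 52), ("dave", 41)],
    [("erin", 27)],
]


def scan_partition(partition, min_age):
    """Scan a single partition completely and keep the matching rows."""
    return [row for row in partition if row[1] >= min_age]


def parallel_query(partitions, min_age):
    """Send the same query to every partition at once and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = pool.map(lambda p: scan_partition(p, min_age), partitions)
        # map() yields results in partition order, so the merge is deterministic.
        return [row for rows in partial for row in rows]


if __name__ == "__main__":
    print(parallel_query(partitions, 30))
```

Each worker still reads its whole partition here, which is the kind of full scan that an index is meant to avoid.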

The solution developed by Dittrich and his colleagues is the "Hadoop Aggressive Indexing Library", HAIL for short. It stores enormous amounts of data in HDFS in such a way that queries are answered up to 100 times faster. The researchers use a method familiar from the telephone book: so that you do not have to read the complete list of names, the entries are sorted by surname. This sorting of the names creates the so-called index.
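The telephone book analogy can be made concrete with a short Python sketch. It is only an illustration of the general principle, not HAIL's implementation: the sample entries and the helper names `full_scan` and `index_lookup` are invented for this example. Once the entries are sorted by surname, a binary search finds a name without reading the whole list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical phone book entries: (surname, phone number), in arbitrary order.
entries = [
    ("Meyer", "555-0101"),
    ("Becker", "555-0102"),
    ("Schmidt", "555-0103"),
    ("Becker", "555-0104"),
    ("Wagner", "555-0105"),
]


def full_scan(records, surname):
    """Without an index, every record has to be read."""
    return [r for r in records if r[0] == surname]


# Sorting by surname once produces the index.
sorted_entries = sorted(entries, key=lambda r: r[0])
surnames = [r[0] for r in sorted_entries]


def index_lookup(surname):
    """Binary search on the sorted surnames, then return the matching run."""
    lo = bisect_left(surnames, surname)
    hi = bisect_right(surnames, surname)
    return sorted_entries[lo:hi]


if __name__ == "__main__":
    print(index_lookup("Becker"))
```

Both functions return the same entries; the difference is that the lookup on the sorted list needs only a logarithmic number of comparisons instead of one per record.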

The researchers generate such an index for the datasets they distribute across several servers. But in contrast to the telephone book, they sort the data by several criteria at once and store multiple copies of it. "The more criteria you provide, the higher the probability that you find the specified data very fast", Dittrich explains. "To use the telephone book example again, it means that you have six different books. Every one contains a different sorting of the data: according to name, street, ZIP code, city and telephone number. With the right telephone book you can search according to different criteria and will succeed faster." In addition, Dittrich and his research group managed to generate the indexes at no additional cost. They organized the indexing in such a way that it requires no additional computing time and causes no delay. Even the additional storage space required is low.
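The "several telephone books" idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the HAIL library or its API: the class `IndexedReplicas`, the `Entry` record and the sample data are hypothetical. Each copy of the data is sorted by a different attribute, and a query is routed to the copy whose sort order matches the attribute being searched.

```python
from bisect import bisect_left, bisect_right
from collections import namedtuple

# Hypothetical record layout following the telephone book example.
Entry = namedtuple("Entry", ["name", "street", "zip_code", "city", "phone"])

records = [
    Entry("Meyer", "Hauptstrasse 1", "66111", "Saarbruecken", "555-0101"),
    Entry("Becker", "Bahnhofstrasse 7", "30159", "Hannover", "555-0102"),
    Entry("Schmidt", "Marktplatz 3", "66111", "Saarbruecken", "555-0103"),
    Entry("Wagner", "Bahnhofstrasse 7", "30159", "Hannover", "555-0104"),
]


class IndexedReplicas:
    """Keeps one copy of the data per indexed attribute, each sorted differently."""

    def __init__(self, records, attributes):
        self.records = list(records)
        self.replicas = {}
        for attr in attributes:
            ordered = sorted(self.records, key=lambda r: getattr(r, attr))
            keys = [getattr(r, attr) for r in ordered]
            self.replicas[attr] = (keys, ordered)

    def query(self, attr, value):
        """Use the copy sorted by attr if one exists, otherwise scan everything."""
        if attr in self.replicas:
            keys, ordered = self.replicas[attr]
            return ordered[bisect_left(keys, value):bisect_right(keys, value)]
        return [r for r in self.records if getattr(r, attr) == value]


if __name__ == "__main__":
    store = IndexedReplicas(records, ["name", "zip_code", "city"])
    print([r.name for r in store.query("zip_code", "66111")])
```

In this sketch a query on `street` falls back to a full scan because no copy is sorted by that attribute, which mirrors the point that more indexed criteria raise the chance of a fast lookup.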

More information: Conference Paper: vldb.org/pvldb/vol5/p1591_jens … ittrich_vldb2012.pdf

Provided by Saarland University

Citation: Big data: Searching in large amounts of data quickly and efficiently (2013, March 1) retrieved 18 June 2024 from https://phys.org/news/2013-03-big-large-amounts-quickly-efficiently.html

