September 7, 2018

What happens when data scientists crunch through three centuries of Robinson Crusoe?

Since Daniel Defoe's shipwreck tale "Robinson Crusoe" was first published nearly 300 years ago, thousands of editions and spinoff versions have appeared in hundreds of languages.

A research team led by Grant Glass, a Ph.D. student in English and comparative literature at the University of North Carolina at Chapel Hill, wanted to know how the story changed as it went through various editions, imitations and translations, and to see which parts stood the test of time.

Reading through them all at a pace of one a day would take years. Instead, the researchers are training computers to do it for them.

This summer, Glass' team in the Data+ summer research program used computer algorithms and machine learning techniques to sift through 1,482 full-text versions of Robinson Crusoe, compiled from online archives.

"A lot of times we think of a book as set in stone," Glass said. "But a project like this shows you it's messy. There's a lot of variance to it."

"When you pick up a book it's important to know what copy it is, because that can affect the way you think about the story," Glass said.

Just getting the texts into a form that a computer could process proved half the battle, said undergraduate team member Orgil Batzaya, a Duke double major in math and computer science.

Credit: Duke Research Blog

The books were already scanned and posted online, so the students used software to download the scans from the internet, via a process called "scraping." But processing the scanned pages of old printed books, some of which had smudges, specks or worn type, and converting them to a machine-readable format proved trickier than they thought.

The software struggled to decode the strange spellings ("deliver'd," "wish'd," "perswasions," "shore" versus "shoar"), different typefaces between editions, and other quirks.

Special characters unique to 18th century fonts, such as the curious f-shaped version of the letter "s," make even humans read "diftance" and "poffible" with a mental lisp.
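For readers curious what cleaning up such text might involve, here is a minimal Python sketch. It is not the team's code: the function name `normalize` and the tiny `VARIANTS` lookup are assumptions made for illustration, and a real project would need a far larger spelling list. It converts the long s character to a modern "s," expands contractions such as "wish'd," and swaps a few old spellings for modern ones.

```python
import re

# Hypothetical lookup of old spellings, built only from the examples in the article.
VARIANTS = {"shoar": "shore", "perswasions": "persuasions"}

def normalize(text: str) -> str:
    """Modernize a few 18th-century spelling conventions (illustrative only)."""
    # Replace the long s (U+017F) with an ordinary "s".
    text = text.replace("\u017f", "s")
    # Expand contractions like "wish'd" -> "wished" (crude: "try'd" becomes "tryed").
    text = re.sub(r"(\w)'d\b", r"\1ed", text)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        # Replacements come back lowercase; unknown words pass through unchanged.
        return VARIANTS.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)
```

Note that this only helps when the scan preserves the long s as its own character. If the software has already misread it as an "f" ("diftance"), telling genuine f's from mistaken ones takes a dictionary check or a better-trained recognition model.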

Their first attempts came up with gobbledygook. "The resulting optical character recognition was completely unusable," said team member and Duke senior Gabriel Guedes.

At a Data+ poster session in August, Guedes, Batzaya and history and computer science double major Lucian Li presented their initial results: a collection of colorful scatter plots, maps, flowcharts and line graphs.

Guedes pointed to clusters of dots on a network graph. "Here, the red editions are American, the blue editions are from the U.K.," Guedes said. "The network graph recognizes the similarity between all these editions and clumps them together."

Once they turned the scanned pages into machine-readable texts, the team fed them into a machine learning algorithm that measures the similarity between documents.

The algorithm takes in chunks of text (sentences, paragraphs, even entire novels) and converts them to high-dimensional vectors.
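The article does not name the algorithm the team used, so the following Python sketch substitutes one of the simplest possible stand-ins: counting words and comparing the counts with cosine similarity. The function names `to_vector` and `cosine_similarity` are illustrative assumptions, and a learned model would produce far richer vectors than raw word counts.

```python
import math
import re
from collections import Counter

def to_vector(text: str) -> Counter:
    """Represent a text as word counts, one dimension per distinct word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Return 1.0 for identical word proportions, 0.0 for no shared words."""
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two editions that share most of their wording score close to 1, while an abridged children's version or a loose imitation would score lower. Pairwise scores like these are the kind of input a network graph could use to clump similar editions together.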

Creating this numeric representation of each book, Guedes said, made it possible to perform mathematical operations on the texts. The team summed the vectors for all the books, divided by the number of books to get the mean, and looked to see which edition was closest to this "average" edition. It turned out to be a version of Robinson Crusoe published in Glasgow in 1875.
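The averaging step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's implementation: the function names are hypothetical, and the use of straight-line (Euclidean) distance is a guess, since the article does not say how closeness was measured.

```python
import math

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors dimension by dimension."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def closest_to_mean(editions: dict[str, list[float]]) -> str:
    """Return the label of the edition whose vector lies nearest the mean."""
    centroid = mean_vector(list(editions.values()))
    # Euclidean distance is an assumption; cosine distance would also be plausible.
    return min(editions, key=lambda label: math.dist(editions[label], centroid))
```

Given three made-up editions with vectors `[0.0, 0.0]`, `[1.0, 1.0]` and `[2.0, 2.0]`, the mean is `[1.0, 1.0]`, so the second edition would be picked as the most "average" one.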

They also analyzed the importance of specific plot points in determining a given edition's closeness to the "average" edition: what about the moment when Crusoe spots a footprint in the sand and realizes that he's not alone? Or the time when Crusoe and Friday, after leaving the island, battle hungry wolves in the Pyrenees?

The team's results might be jarring to those unaccustomed to seeing 300 years of publishing reduced to a bar chart. But by using computers to compare thousands of books at a time, "digital humanities" scholars say it's possible to trace large-scale patterns and trends that humans poring over individual books can't.

"This is really something only a computer can do," Guedes said, pointing to a time-lapse map showing how the Crusoe story spread across the globe, built from data on the place and date of publication for 15,000 editions.

"It's a form of 'distant reading'," Guedes said. "You use this massive amount of information to help draw conclusions about publication history, the movement of ideas, and knowledge in general across time."

More information: The results are available online: orgilbatzaya.github.io/pirating-texts-site/

Provided by Duke University

Citation: What happens when data scientists crunch through three centuries of Robinson Crusoe? (2018, September 7) retrieved 17 July 2024 from https://phys.org/news/2018-09-scientists-crunch-centuries-robinson-crusoe.html
