The digitization of older documents such as the Archimedes Palimpsest and Darwin’s logbooks allows people to explore ancient text and illustrations in a whole new way. But access to historical files is about to get even better: 14 million ways better.

The Internet Archive consists of about two million freely available, scanned and digitized books—up to 600 million pages, dating back 500 years. Previously, however, only the text of those books could be searched. Now, thanks to researcher Kalev Leetaru, people can search for the images in those books on Flickr. The archive will start with 2.6 million illustrations and is expected to eventually include as many as 14 million.

Leetaru, a fellow at Georgetown's Institute for the Study of Diplomacy, reportedly set out to create the database of historical images because he was frustrated that he couldn’t find pictures of the Old West online. What’s particularly amazing is that he did this work in his spare time, on nights and weekends. And this was not his first big data endeavor: Leetaru’s previous projects include analyzing news archives to predict future events and mapping Twitter sentiment in real time, according to the Washington Post.

As Leetaru took a closer look at the Internet Archive, he realized that when the books were scanned and digitized, the optical character recognition (OCR) software looked for everything that resembled a picture, so it could be discarded. In an approach similar to the New York Public Library’s Building Inspector project, Leetaru wrote a different program that went through the same files, looked for everything that did look like a picture, and saved each one to a separate JPEG file.
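To make the idea concrete, here is a minimal sketch of that inversion: keep the blocks a text-only pass would throw away. It is not Leetaru's actual program. The `Region` structure, the "text" and "picture" labels, the file-naming scheme and the sample layout are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """One block found on a scanned page (hypothetical structure)."""
    kind: str                        # assumed labels: "text" or "picture"
    page: int                        # page number within the book
    box: Tuple[int, int, int, int]   # left, top, right, bottom, in pixels


def picture_regions(regions: List[Region]) -> List[Region]:
    """Keep exactly the blocks a text-only pass would throw away."""
    return [r for r in regions if r.kind == "picture"]


def jpeg_name(book_id: str, region: Region, index: int) -> str:
    """Build a separate output file name for each extracted picture."""
    return f"{book_id}_p{region.page:04d}_{index:02d}.jpg"


# Made-up layout data for a single book, for illustration only.
layout = [
    Region("text", 12, (100, 80, 900, 400)),
    Region("picture", 12, (150, 420, 850, 1100)),
    Region("text", 13, (100, 80, 900, 1200)),
    Region("picture", 14, (120, 200, 880, 900)),
]

pictures = picture_regions(layout)
names = [jpeg_name("examplebook", r, i) for i, r in enumerate(pictures)]
print(names)
```

A real pipeline would then crop each `box` out of the page scan and write it to the matching file name; that step is left out here because it depends on the scan format and imaging library in use.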

Another major benefit is that the software created metadata for each picture, including the name of the book, the year it was published, its author, its publisher, the subject, the page number of the picture, and a number of tags describing it. In addition, the software saved the text before and after each image, to give it some context.
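A record like that could be sketched as follows. The field names, the `[IMAGE]` placeholder, the 200-character window and the sample values are assumptions for illustration, not the actual schema used on Flickr or in the Internet Archive.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ImageRecord:
    """Metadata kept alongside one extracted picture (hypothetical schema)."""
    book_title: str
    year: int
    author: str
    publisher: str
    subject: str
    page: int
    tags: List[str] = field(default_factory=list)
    text_before: str = ""
    text_after: str = ""


def surrounding_text(page_text: str, start: int, end: int,
                     window: int = 200) -> Tuple[str, str]:
    """Return up to `window` characters on each side of page_text[start:end]."""
    before = page_text[max(0, start - window):start].strip()
    after = page_text[end:end + window].strip()
    return before, after


# Made-up page text; "[IMAGE]" marks where the picture sat (assumed marker).
page_text = "A herd of bison crossed the plain. [IMAGE] The wagons followed at dusk."
start = page_text.index("[IMAGE]")
end = start + len("[IMAGE]")
before, after = surrounding_text(page_text, start, end)

record = ImageRecord(
    book_title="Example Title", year=1885, author="Example Author",
    publisher="Example Publisher", subject="Example Subject", page=14,
    tags=["bison", "plain"], text_before=before, text_after=after,
)
print(record)
```

Keeping the text on both sides of the image is what lets a search on a word in the nearby paragraph surface the picture, even when the tags themselves are sparse.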

Admittedly, some of the tagging is less than ideal: some very strange pictures come up when you search for “simplicity,” for example. But the pictures are all in the public domain and can be freely used, according to the Internet Archive.

What’s particularly noteworthy about the images is that they have been in the books all along, notes Flickr’s Kay Kremerskothen in a blog post describing the project. “Perhaps what is most remarkable about this collection is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books that have been resting in our digital libraries,” he writes. “Through the power of big data we are suddenly able to view the world’s books not as merely piles of text, but as individualized galleries of one of the richest and most diverse museums of imagery in the world.”

Leetaru now hopes that other libraries will pick up the software and use it on their book collections.

Needless to say, this being the Internet, everyone immediately started searching for cat pictures. And yes, there are dozens. Have fun.

