The digitization of older documents such as the Archimedes Palimpsest and Darwin’s logbooks allows people to explore ancient text and illustrations in a whole new way. But access to historical files is about to get even better: 14 million ways better.

The Internet Archive consists of about two million freely available, scanned and digitized books—up to 600 million pages, dating back 500 years. Previously, however, only the text of those books could be searched. Now, thanks to researcher Kalev Leetaru, people can search for the images in those books on Flickr. The archive will start with 2.6 million illustrations and is expected to eventually include as many as 14 million.

Leetaru, a fellow at Georgetown's Institute for the Study of Diplomacy, reportedly set out to create the database of historical images because he was frustrated that he couldn’t find pictures of the Old West online. What’s particularly amazing is that he did this work in his spare time, on nights and weekends. And this was not his first big data endeavor: Leetaru’s previous projects include analyzing news archives to predict future events and mapping Twitter sentiment in real time, according to the Washington Post.

As Leetaru took a closer look at the Internet Archive, he realized that when the books were scanned and digitized, the optical character recognition (OCR) software looked for everything that resembled a picture so that it could be discarded. In an approach similar to the New York Public Library’s Building Inspector project, Leetaru simply wrote a different program that went through the same files, looked for everything that did look like a picture, and saved each one to a separate JPEG file.
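To make the idea concrete, here is a minimal sketch in Python of how a program might pull picture regions out of OCR layout data. It is not Leetaru's actual code. The XML layout, the `type="Picture"` attribute, the coordinate names, and the function `find_picture_regions` are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified OCR layout output (an assumption, not a real format):
# one <page> per scanned page, each holding <block> elements that are tagged
# by type and carry pixel coordinates (left, top, right, bottom).
SAMPLE_OCR_XML = """
<document>
  <page number="1">
    <block type="Text" l="100" t="80" r="900" b="400"/>
    <block type="Picture" l="120" t="420" r="880" b="1100"/>
  </page>
  <page number="2">
    <block type="Text" l="100" t="80" r="900" b="1200"/>
  </page>
</document>
"""


def find_picture_regions(ocr_xml):
    """Return (page_number, (left, top, right, bottom)) for each picture block."""
    root = ET.fromstring(ocr_xml)
    regions = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for block in page.iter("block"):
            # Keep exactly the blocks a text-only pipeline would throw away.
            if block.get("type") != "Picture":
                continue
            box = tuple(int(block.get(key)) for key in ("l", "t", "r", "b"))
            regions.append((page_number, box))
    return regions


regions = find_picture_regions(SAMPLE_OCR_XML)
for page_number, box in regions:
    print(f"page {page_number}: picture at {box}")
```

In a fuller version, each bounding box would presumably be used to crop the matching page scan with an image library and write the result out as its own JPEG file.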

Another major benefit is that the software created metadata for each picture, including the name of the book, the year it was published, its author, its publisher, the subject, the page number of the picture, and a number of tags describing it. In addition, the software even saved the text before and after each image, to give it some context.
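A record like that could be modeled roughly as follows. This is only a sketch: the `ImageRecord` class, the `build_record` helper, the field names, and the sample book are hypothetical placeholders, not the schema the project actually uses.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ImageRecord:
    """One extracted picture plus the metadata the article lists (hypothetical schema)."""
    book_title: str
    year: int
    author: str
    publisher: str
    subject: str
    page_number: int
    tags: list = field(default_factory=list)
    text_before: str = ""
    text_after: str = ""


def build_record(book, page_number, blocks, picture_index, tags):
    """Build a record for the picture at blocks[picture_index].

    `blocks` is the ordered list of (kind, text) pairs on one page, where
    kind is "text" or "picture". The nearest text block on each side of the
    picture is kept as context.
    """
    before = [text for kind, text in blocks[:picture_index] if kind == "text"]
    after = [text for kind, text in blocks[picture_index + 1:] if kind == "text"]
    return ImageRecord(
        book_title=book["title"],
        year=book["year"],
        author=book["author"],
        publisher=book["publisher"],
        subject=book["subject"],
        page_number=page_number,
        tags=list(tags),
        text_before=before[-1] if before else "",
        text_after=after[0] if after else "",
    )


# Placeholder data, invented purely for illustration.
book = {
    "title": "Example Book",
    "year": 1890,
    "author": "A. N. Author",
    "publisher": "Example Press",
    "subject": "Example subject",
}
blocks = [
    ("text", "The wagon train set out at dawn."),
    ("picture", ""),
    ("text", "Fig. 3. Crossing the river."),
]
record = build_record(book, page_number=42, blocks=blocks, picture_index=1,
                      tags=["wagon", "river"])
print(asdict(record))
```

Storing the surrounding text next to each image is what would let a search match on a caption or a nearby sentence rather than on the tags alone.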

Admittedly, some of the tagging is less than ideal: some very strange pictures come up when you search for “simplicity,” for example. But the pictures are all in the public domain and can be freely used, according to the Internet Archive.

What’s particularly noteworthy about the images is that they have been in the books all along, notes Flickr’s Kay Kremerskothen in a blog post describing the project. “Perhaps what is most remarkable about this collection is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books that have been resting in our digital libraries,” he writes. “Through the power of big data we are suddenly able to view the world’s books not as merely piles of text, but as individualized galleries of one of the richest and most diverse museums of imagery in the world.”

Leetaru now hopes that other libraries will pick up the software and use it on their book collections.

Needless to say, this being the Internet, everyone immediately started searching for cat pictures. And yes, there are dozens. Have fun.
