If you’ve decided to digitize company files – whether a single project or the entire backlog – something you’ll need to figure out is how much disk storage they’ll take up and how to handle it.
Fortunately, there are rules of thumb for such things. Two file cabinets' worth of standard 8.5 x 11 inch documents (eight boxes) will require about one gigabyte. 2,000 boxes will require about a terabyte.
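To see how those rules of thumb translate into a number for your own project, here is a minimal sketch. The function name and the `boxes_per_gb` parameter are made up for illustration. The default uses the eight-boxes-per-gigabyte figure; the 2,000-boxes-per-terabyte figure works out to roughly two boxes per gigabyte, so the ratio is left adjustable rather than treated as exact.

```python
def estimate_storage_gb(boxes: float, boxes_per_gb: float = 8.0) -> float:
    """Rough gigabytes needed to hold scans of `boxes` boxes of paper.

    boxes_per_gb is a rule-of-thumb ratio, not a measured value:
    8.0 matches "eight boxes per gigabyte"; 2.0 matches
    "2,000 boxes per terabyte" (taking 1 TB as 1,000 GB).
    """
    if boxes_per_gb <= 0:
        raise ValueError("boxes_per_gb must be positive")
    return boxes / boxes_per_gb


if __name__ == "__main__":
    # Two file cabinets (eight boxes) at the default ratio.
    print(estimate_storage_gb(8))
    # 2,000 boxes at the ratio implied by the terabyte figure.
    print(estimate_storage_gb(2000, boxes_per_gb=2.0))
```

Running the estimate with both ratios gives a low and a high bound, which is probably the most honest way to use a rule of thumb like this.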
Naturally, the actual figure depends on factors such as the size of the paper, the resolution of the scan (check Appendix 2 of this United Nations report, Record-keeping Requirements for Digitization, for some examples), and whether the pages are scanned in color or black-and-white. And beyond the space the digitized data itself takes up, disk space is also required for backups, indexing, and so on.
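The effect of paper size, resolution, and color depth is easy to see with a little arithmetic. The sketch below computes the uncompressed size of a single scanned page; the function name is hypothetical, and real scans are normally stored compressed (often far smaller than this), so treat the result as an upper bound rather than what will land on disk.

```python
def uncompressed_page_bytes(width_in: float, height_in: float,
                            dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one scanned page.

    bits_per_pixel: 1 for black-and-white, 8 for grayscale,
    24 for full color. Compression is deliberately ignored.
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # A letter-size page at 300 dpi in three color modes.
    for label, depth in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
        size = uncompressed_page_bytes(8.5, 11, 300, depth)
        print(f"{label}: {size / 1_000_000:.2f} MB")
```

The same page is 24 times larger in full color than in black-and-white before compression, which is why scan settings matter more than almost anything else in the estimate.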
In the early days of document management, the cost of storage was a big limitation. But today, a few hundred dollars spent on a few terabytes of storage should take care of most document storage projects. In fact, your hardware costs will probably be the least expensive part of your document storage solution.
Of course, things get more complicated when you need petabytes of storage. (A petabyte is a thousand terabytes; the binary counterpart, 1,024 tebibytes, is properly called a pebibyte.) Few organizations know more about the storage requirements of digitized data than the Internet Archive, a non-profit digital library offering free access to books, movies, and music, as well as 267 billion archived web pages.
“The Internet Archive contains about 10 petabytes of unique content,” says Alexis Rossi, director of collections. “With backups, compute cluster, and all the machines that run our sites, I believe we have around 29 petabytes of spinning storage space. We have a staff of five who keep these machines running, the software updated, and our network blazing fast.”
Internet Archive staffers scanned about 1.9 million of the 4 million texts in the archive, and those scans account for about 1 petabyte of the data, Rossi says. “We have primary copies of these texts in our headquarters in San Francisco, and backup copies stored in one of our remote locations. The backups are updated every time a task runs on one of the primary items. We use our cluster to update texts en masse when we need to move to new access file formats, run improved OCR, and so on.” The staff uses a rule of thumb of about 500 MB per scanned book to estimate storage needs, she says.
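The 500 MB rule of thumb lines up with the petabyte figure, as a quick check shows. This is only a back-of-the-envelope sketch: the function name is invented for illustration, and it assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and a single copy of each book.

```python
MB = 10**6   # bytes, decimal megabyte (assumed)
PB = 10**15  # bytes, decimal petabyte (assumed)


def scanned_books_storage_pb(books: int, mb_per_book: float = 500) -> float:
    """Approximate petabytes for `books` scanned books, one copy each."""
    return books * mb_per_book * MB / PB


if __name__ == "__main__":
    # About 1.9 million scanned texts at roughly 500 MB each.
    print(f"{scanned_books_storage_pb(1_900_000):.2f} PB")
```

At 1.9 million books the estimate comes to a little under one petabyte, before counting the backup copies kept at a remote location.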
One of the biggest challenges is keeping all that storage running and cool – for an affordable price. “We focus on high-density storage with the lowest power consumption possible, as well as keeping the machines quiet enough to coexist with us in our building,” Rossi says. “We currently get about 1 petabyte of storage in each rack — though we’re in the process of moving from 3TB disks to 4TB disks so that will increase soon — and the power consumption is 4-5KW per rack. We do not use air conditioning in any of our data storage spaces, just fresh air and fans.” In fact, at the Internet Archive’s San Francisco headquarters, the excess heat from the servers is used to heat other parts of the building.
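The rack figures in that quote can be turned into a couple of rough planning numbers. The sketch below is an illustration only: the function names are hypothetical, it assumes decimal units (1 PB = 1,000 TB), and it ignores redundancy, spare disks, and any non-storage hardware in the rack, none of which the quote describes.

```python
import math


def disks_needed(rack_capacity_tb: float, disk_tb: float) -> int:
    """Whole disks needed to reach a raw rack capacity (no redundancy)."""
    return math.ceil(rack_capacity_tb / disk_tb)


def watts_per_tb(rack_kw: float, rack_capacity_tb: float) -> float:
    """Power draw per terabyte for a rack of the given capacity."""
    return rack_kw * 1000 / rack_capacity_tb


if __name__ == "__main__":
    rack_tb = 1000  # about 1 petabyte per rack, per the quote
    print(disks_needed(rack_tb, 3))   # with 3TB disks
    print(disks_needed(rack_tb, 4))   # with 4TB disks
    # 4-5 kW per rack expressed per terabyte.
    print(watts_per_tb(4.0, rack_tb), watts_per_tb(5.0, rack_tb))
```

Under these assumptions, moving to larger disks means either fewer disks for the same petabyte or more capacity in the same rack, and the quoted power draw works out to only a few watts per terabyte.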