Why You Need to Care About the ‘Data Lake’
Have people been telling you to go jump in the lake recently? This may not be a bad thing—if it’s a data lake.
A “data lake” is, by definition, the opposite of siloed storage and of the rigid structure of a data warehouse. Instead of taking the time to massage data so that it fits into a warehouse, a data lake simply takes in all the data in its raw form, structured or not.
“The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment,” writes data scientist Edd Dumbill in Forbes. “Applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.”
It sounds idyllic. And the data lake model does offer several advantages.
- Organizations don’t need to define structures for the data. This saves time and makes the data more flexible. It also avoids political issues, such as, “Which department’s definition of ‘customer’ should we use?”
- The data continues to be available in its pure form for any part of the organization that needs it. Who knows, after all, what sort of data future applications might need?
- Organizations don’t need to take the time to massage the data to put it into a warehouse. That also saves a lot of time—from 30 days to 20 minutes, in the case of one GE project.
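The first and third points are often described as “schema on read”: records are stored as they arrive, and each consumer applies its own structure when it reads them. Here is a minimal Python sketch of that idea. It assumes an in-memory list stands in for the lake, and the record fields and the two “customer” definitions are invented for illustration.

```python
import json

# Hypothetical stand-in for a data lake: raw records kept exactly as received.
lake = []

def ingest(raw_line):
    """Store the record untouched; no schema is enforced on write."""
    lake.append(raw_line)

def read_with(schema):
    """Apply a caller-supplied schema function at read time, skipping records it rejects."""
    results = []
    for raw_line in lake:
        record = json.loads(raw_line)
        shaped = schema(record)
        if shaped is not None:
            results.append(shaped)
    return results

# Two departments with different definitions of "customer" (both invented here).
def sales_customer(record):
    # Sales counts anyone who has placed an order.
    if record.get("orders", 0) > 0:
        return {"id": record["id"], "orders": record["orders"]}
    return None

def support_customer(record):
    # Support counts anyone who has opened a ticket.
    if record.get("tickets", 0) > 0:
        return {"id": record["id"], "tickets": record["tickets"]}
    return None

ingest('{"id": 1, "orders": 3}')
ingest('{"id": 2, "tickets": 1}')
ingest('{"id": 3, "orders": 1, "tickets": 2, "region": "EU"}')

sales_view = read_with(sales_customer)      # built from the same raw records
support_view = read_with(support_customer)  # different definition, same lake
```

Neither department had to win the argument over what a “customer” is before the data was stored. The cost is that every reader now has to know how to interpret the raw records.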
As a metaphor, “data lake” is appealing. (After all, it could have been “data landfill.”) It’s instantly understandable, and it lends itself to many other metaphors as well, ranging from “drowning” to “swamp.” The one problem is that, as with a real lake, people see different things when they look into it, depending on their vantage point.
The term “data lake” was coined in 2010 by Pentaho’s James Dixon. But it didn’t really start gaining traction until this year, particularly once Gartner weighed in as a naysayer. “While the marketing hype suggests audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata,” writes Gartner research director Nick Heudecker. He also warns that while vendors are marketing data lakes as an essential component for capitalizing on big data opportunities, there is little agreement among them about what constitutes a data lake or how to get value from it.
If your organization is getting involved in big data—and it has to be your entire organization, not just your department; that’s kind of the point—it’s worth considering a “data lake.” But as with any lake, there are potential hazards:
- Increasingly, organizations don’t own their data in the first place, which makes it challenging to have a single data lake. Organizations using Google Analytics, for example, can analyze the data, but they can’t bring it in-house, notes Alex Woodie in Datanami. “Eventually that Google Analytics data needs to meet other data to get the highest value from it,” he writes. “So either the company uploads other data into Google’s cloud, or Google lets the customer download some data to their premises. Or, more than likely, it’s all of the above.” That negates the whole notion of a single “data lake.”
- This isn’t “Lake of Dreams,” where you build it and they will come. “Blindly hoping that the data lake is filled with data from the various applications and systems that already exist or will be built/upgraded in the future and hoping that the data that exists in the data lake will be consumed by data driven application is a common mistake,” writes Kumar Srivastava, senior director of product management at ClearStory Data, in Forbes. “Enterprises need to ensure that the data lake is part of an overall big data strategy where application developers are trained and mandated to use the data lake as part of their application design and development process.”
- Earlier this year, Portland had to drain its 38-million-gallon reservoir when a teenager—well, let’s just say, spoiled the water. Similarly, “data lake” critics say there’s no mechanism to judge the quality of the data that goes in.
- One of Gartner’s criticisms is that “data lakes” don’t necessarily have metadata that explains the provenance of data. Dixon counters that by noting that there’s no reason a data lake couldn’t have that.
- There is still much to be determined regarding nuances such as information governance, e-discovery, and other regulatory issues. “Data dumped into a lake might have associated privacy or regulatory requirements and shouldn’t be exposed without oversight,” warns David Ramel in GCN. On the other hand, organizations such as Booz Allen Hamilton say the data lake makes regulatory compliance easier for banks, for example, because all the data is in one place.
- Does the data lake have a lifeguard? In other words, what security and access controls are there for the data? “Data can be placed into the data lake with no oversight of the contents,” Heudecker writes. “Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.”
- There are a number of other technical issues to watch for when you’re just starting out, especially if you think your lake might get much larger (and isn’t that the point?).
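Several of these hazards (quality, provenance, oversight) come down to what gets recorded when data enters the lake. As Dixon’s counterpoint suggests, nothing prevents a lake from carrying metadata alongside the raw data. The Python sketch below illustrates the idea under stated assumptions: the `ingest_with_metadata` and `readable_by` helpers, the catalog dictionary, and all field names are hypothetical, and they do not represent how any particular data lake product works.

```python
import hashlib
from datetime import datetime, timezone

def ingest_with_metadata(catalog, payload, source, contains_personal_data):
    """Store raw bytes unchanged and record where they came from (hypothetical helper)."""
    checksum = hashlib.sha256(payload).hexdigest()
    catalog[checksum] = {
        "payload": payload,                                 # raw data, not transformed
        "source": source,                                   # provenance: originating system
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": checksum,                                 # lets readers detect later changes
        "contains_personal_data": contains_personal_data,   # flag for privacy review
    }
    return checksum

def readable_by(catalog, key, cleared_for_personal_data):
    """Deliberately simple access check: flagged data requires clearance."""
    entry = catalog[key]
    return cleared_for_personal_data or not entry["contains_personal_data"]

catalog = {}
key = ingest_with_metadata(
    catalog,
    b"name,email\nAda,ada@example.com\n",
    source="crm-export",          # invented source name
    contains_personal_data=True,
)
```

The data itself is still stored untouched. The metadata only answers the questions the critics raise: where the data came from, when it arrived, and who should be allowed to see it.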
In short, the water in the data lake might be fine. But it’s best not to dive right in.