I heard a new phrase this week: the data lake. It comes from a Radar Podcast episode where Edd Dumbill talks about data as an asset that should be exposed everywhere in an organization. There’s a blog post on the subject as well.
The idea is somewhat centered on Hadoop, but it could apply elsewhere. Data comes into an organization and then tends to flow toward, and get stored with, other data in a large lake. Applications are just a way of accessing the data in the lake, but all the data really lives in a large Hadoop lake of information. In some sense, this isn’t far from the “single view of the truth” that I’ve seen plenty of organizations attempt. In a relational world this means all data moves from OLTP systems to a large data warehouse, is then moved to smaller data marts (really warehouse subsets), and is accessed from there.
It’s a good idea in theory, but in practice, trying to move data around with any velocity between users, with all the copying, cleaning, transforming, and more going on, doesn’t work. OLTP systems are needed because there are transactional actions that must be completed quickly and accurately. Moving this data to other systems becomes harder as data volumes and the sheer number of clients (whether users or other systems) increase. The idea that we can keep all of our processes working quickly enough that users won’t get frustrated is likely a dream. The more value a set of data holds, the more users will access it. The more it is accessed, the slower it often becomes, which starts a cycle of smaller subsets of data and applications that subsist on those small data puddles.
Excel is probably the most common example of a data puddle that exists in your organization: a set of data, perhaps out of date, but useful enough to base decisions on. It’s infinitely flexible, and convenient enough that updates, changes, and more often spawn more and more puddles, where the information never gets transferred back to the large lake of a database, whether that’s an RDBMS, a Hadoop cluster, or something else.
I think the idea of a large data lake is great, but in a practical sense, much of an organization’s data will never live in the lake. If it does, it will most likely be data that’s been superseded by information in a puddle somewhere on an employee’s laptop, tablet, or personal cloud.