Should the Data Lake be Immutable?

There’s a concept in computer science of immutability. At a high level, this means once something is set, it isn’t changed. Various programming languages do this with variables, where values don’t change, though variables can be destroyed and recreated.

In the PASS keynote, Dr. Ramakrishnan pointed out that we have silos of data, often in disparate systems where we keep our information. We want to query this together, so we transfer it to a data warehouse or data lake (the future view). The argument was that items in the data lake are immutable. They aren’t allowed to change in the way that we update values in our relational databases. We should just read the most recent version of any data, and if there is an update, just add a new set of data.
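To make the "add a new set of data and read the most recent version" idea concrete, here is a minimal Python sketch. It is only an illustration of the pattern, not anything shown in the keynote or how any particular data lake product works. The names `Record`, `AppendOnlyStore`, `append`, `latest`, and `history` are all made up for this example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)  # frozen: a record cannot be modified once created
class Record:
    key: str
    version: int
    value: Any


class AppendOnlyStore:
    """Hypothetical in-memory stand-in for an immutable zone of a data lake."""

    def __init__(self) -> None:
        self._log: List[Record] = []

    def append(self, key: str, value: Any) -> Record:
        # An "update" never touches existing records; it adds a new version.
        version = sum(1 for r in self._log if r.key == key) + 1
        record = Record(key, version, value)
        self._log.append(record)
        return record

    def latest(self, key: str) -> Optional[Record]:
        # Readers walk backwards and take the most recent version of the key.
        for record in reversed(self._log):
            if record.key == key:
                return record
        return None

    def history(self, key: str) -> List[Record]:
        # Older versions stay available because nothing is overwritten.
        return [r for r in self._log if r.key == key]


store = AppendOnlyStore()
store.append("customer:42", {"city": "Denver"})
store.append("customer:42", {"city": "Boulder"})  # a correction, added as new data
print(store.latest("customer:42"))
print(len(store.history("customer:42")))
```

The trade-off is that storage grows with every correction and readers have to work out which version is current, in exchange for never locking or rewriting existing data.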

That’s an interesting concept, but I’m not sure I agree. I think that while we might often want to use a simpler process, there are cases where we do need capabilities to edit. Imagine I had a large set of data, say GBs in a file. Would I want to download this and change a few values before uploading it again? Do we want a large ETL load process to repeat? Could we repeat the process and reload a file again? I don’t think so, but it’s hard to decide. After all, the lake isn’t the source of data; that is some other system.

Maybe that’s the simplest solution, and one that reduces complexity, downtime, or anything else that might be involved with locking and changing a file. After all, we wouldn’t want queries that could potentially read the data in between us deleting a value and adding back a new one.

If you’re a data warehouse or analysis person, what do you think? Does it make sense to keep the data lake as immutable and reload data that might not be clean? Let us know today.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.8MB) podcast or subscribe to the feed at iTunes and Libsyn.

This entry was posted in Editorial and tagged data lake.

2 Responses to Should the Data Lake be Immutable?

Jason Horner says:

February 28, 2019 at 12:00 pm

I think this is something that can often get taken out of context. Perhaps the point that was trying to be made was that as you ingest data, managing updates and deletes is expensive. So to optimize, just allow inserts. However, what is often left unsaid is that this assumes we are either getting only incremental data or landing a full snapshot of the entire data set in its current form. This data would typically live in the Raw zone of a data lake. The other zones would allow for updates and refreshed data based on these new changes. If you think through it a bit, it’s really similar to a TLOG in SQL Server. Instead of processing transactional updates, you are just getting a stream of changes that you are recording. This approach solves the velocity and the veracity of the 4 V’s in “Big Data”. That’s my take anyway 🙂

way0utwest says:

February 28, 2019 at 1:27 pm

Good points, and I do think that’s the idea. Updates are expensive, so avoid doing them. I’m not sure we should always be immutable, but I do think there is value in avoiding updates at times.


