There’s a concept in computer science called immutability. At a high level, this means that once something is set, it isn’t changed. Various programming languages do this with variables: a value doesn’t change after it is assigned, though a variable can be destroyed and recreated with a new value.
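As a small illustration of the idea, here is a minimal Python sketch. The `Reading` class and its fields are hypothetical names I made up for the example; the point is only that an "update" produces a new object and leaves the original alone.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


# Hypothetical record type; frozen=True makes instances immutable.
@dataclass(frozen=True)
class Reading:
    sensor: str
    value: float


original = Reading("s1", 20.5)

# Attempting to change a field in place raises an error.
try:
    original.value = 21.0
    mutated = True
except FrozenInstanceError:
    mutated = False

# An "update" builds a new object; the original is left untouched.
updated = replace(original, value=21.0)
```

After this runs, `original` still holds the old value and `updated` holds the new one, which is the destroy-and-recreate pattern in miniature.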
In the PASS keynote, Dr. Ramakrishnan pointed out that we have silos of data, often in disparate systems where we keep our information. We want to query this together, so we transfer it to a data warehouse or data lake (the future view). He also said that items in the data lake are immutable. They aren’t allowed to change in the way that we update values in our relational databases. We should just read the most recent version of any data, and if there is an update, just add a new set of data.
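To make the "read the most recent version, add a new set for updates" idea concrete, here is a rough sketch in Python. This is not how any particular data lake works; `AppendOnlyStore`, `Record`, and the in-memory list are assumptions invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical versioned record; frozen so a stored record can never be edited.
@dataclass(frozen=True)
class Record:
    key: str
    version: int
    value: str


class AppendOnlyStore:
    """Toy in-memory stand-in for an immutable store: no update, no delete."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: str) -> Record:
        # An "update" is just a new record with the next version number.
        last = max((r.version for r in self._records if r.key == key), default=0)
        record = Record(key, last + 1, value)
        self._records.append(record)
        return record

    def latest(self, key: str) -> Optional[Record]:
        # Readers only ever look at the most recent version of a key.
        matches = [r for r in self._records if r.key == key]
        return max(matches, key=lambda r: r.version, default=None)

    def history(self, key: str) -> List[Record]:
        # Older versions stay available because nothing is overwritten.
        return [r for r in self._records if r.key == key]


store = AppendOnlyStore()
store.append("customer-1", "Seattle")
store.append("customer-1", "Denver")  # a correction arrives as new data
current = store.latest("customer-1")
```

Here `current` is the Denver record, while the Seattle record is still sitting in the history. Nothing was ever changed in place, so a reader never sees a half-applied edit.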
That’s an interesting concept, but I’m not sure I agree. I think that while we might often want to use a simpler process, there are cases where we do need the capability to edit. Imagine I had a large set of data, say gigabytes in a single file. Would I want to download this and change a few values before uploading it again? Do we want a large ETL load process to repeat? Could we repeat the process and reload a file again? I don’t think so, but it’s hard to decide. After all, the lake isn’t the source of data; that is some other system.
Maybe reloading is the simplest solution, and one that reduces complexity, downtime, or anything else that might be involved with locking and changing a file. After all, we wouldn’t want queries that could read the data between the moment we delete a value and the moment we add back a new one.
If you’re a data warehouse or analysis person, what do you think? Does it make sense to keep the data lake as immutable and reload data that might not be clean? Let us know today.