This is part of a series on my preparation for the DP-900 exam. This is the Microsoft Azure Data Fundamentals exam, which is part of a number of certification paths. You can read the various posts I’ve created as part of this learning experience.
I don’t have an ELT tag, and I’m not likely to make one. I tend to think of ETL as loading data somewhere, even though I know it means more.
The important concepts for DP-900 here are that ELT is becoming more important and you need to understand what it means. I’ll cover these concepts and also touch on where the different Azure services fit in.
For most of my career, the pattern for loading data was Extract-Transform-Load. In this pattern we:
- grab data from a source
- make changes to clean/change/etc.
- write to a target (or sink)
It’s how tools like SSIS work. They connect a source to a target and have a bunch of tasks or transforms in the middle that change the data in some way.
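As a rough sketch of the pattern (all names and sample data here are made up for illustration, not any real pipeline), the three steps above might look like:

```python
# Minimal ETL sketch: extract from a source, transform in-flight, load to a
# target. In practice a tool like SSIS or ADF does this, not hand-rolled code.

def extract():
    # Pretend these rows came from a source system
    return [
        {"name": "  Alice ", "email": "ALICE@EXAMPLE.COM"},
        {"name": "Bob", "email": None},  # dirty row
    ]

def transform(rows):
    # Clean/change the data before it ever reaches the target
    cleaned = []
    for row in rows:
        if row["email"] is None:
            continue  # scrub incomplete rows in-flight
        cleaned.append({
            "name": row["name"].strip(),
            "email": row["email"].lower(),
        })
    return cleaned

def load(rows, target):
    # Write to the target (or sink)
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # only the cleaned row reaches the warehouse
```

Note that the dirty row never lands anywhere: the transform sits between source and target, which is exactly what makes this pattern slower to load.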
This is a good pattern for getting the work done when the target system is just built for querying data, such as a data warehouse. It is also good when you need to scrub some data, perhaps for privacy reasons.
This isn’t a good pattern when you are trying to load data quickly, because the transform process takes time.
ELT is the new way of doing things. In this pattern we Extract-Load-Transform, though really the name doesn’t quite fit, since the process of moving the data just moves it:
- grab data from a source
- write it to a target
Where’s the transform? Well, that happens on the target, often when someone queries the data. Modern analytic systems, like Snowflake and Synapse, can work with vast quantities of data, often stored in a data lake or blob system, and consume that with powerful computational capabilities. There could be some minor re-shaping of the data on write, but that’s not the idea.
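A hedged sketch of this idea, using SQLite as a stand-in for a real analytic engine (the table and data are invented for illustration): the load step writes rows exactly as received, and the transform only happens when the data is queried.

```python
import sqlite3

# ELT sketch: land the data unaltered, then transform at query time on the
# target's compute. SQLite stands in for Synapse/Snowflake here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, region TEXT)")

# E + L: write source rows exactly as received -- no in-flight cleanup
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " 10.50 ", "WEST"), (2, "7", "west"), (3, "bad", "EAST")],
)

# T: happens when someone queries; the raw landing zone stays untouched
cleaned = conn.execute("""
    SELECT id, CAST(TRIM(amount) AS REAL) AS amount, LOWER(region) AS region
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'
""").fetchall()
print(cleaned)
```

The raw table still holds the bad row, which matters later: anyone who needs to audit or debug can see exactly what the source sent.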
This is good when you might not read all the data. Why process (transform) what isn’t being read? Before you complain that you should know what is used: none of us knows whether all our data is being used, unless we write crappy SELECT * code with no WHERE clauses.
This is also good when we need to work at speed and privacy isn’t a concern. It’s great for the known formats of files sent to us, as the target system can project a table on top of a set of files.
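That table-over-files idea (schema-on-read, the way a Synapse serverless external table works) can be sketched in a few lines. Everything here is hypothetical: the in-memory file objects stand in for blobs in a lake.

```python
import csv
import io

# Schema-on-read sketch: "project" one logical table over several files of a
# known format. These StringIO objects stand in for CSV blobs in a data lake.
files = [
    io.StringIO("id,amount\n1,10\n2,20\n"),
    io.StringIO("id,amount\n3,30\n"),
]

def project_table(file_like_objects):
    # No data is copied or transformed up front; typed rows are produced
    # only when someone actually reads them
    for f in file_like_objects:
        for row in csv.DictReader(f):
            yield {"id": int(row["id"]), "amount": float(row["amount"])}

total = sum(r["amount"] for r in project_table(files))
print(total)  # 60.0
```

The files themselves are never rewritten; the "table" is just a view the query engine lays over them at read time.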
ELT seems to be the direction many analytical and warehouse systems are heading.
The ELT theory is not that you don’t use WHERE clauses for E, it’s that you don’t do T in-flight. The E and L should always be “incremental” or CDC-style extracts. The simple reason is auditing. We’ve all dealt with cases where the data producer says, “I dunno what you are talking about, the correct data is in the file I sent you last night”…when you go look…yep, it’s there, but someone fixed the file overnight when they spotted the error. Now you have no way to do auditing. Frankly, most of us have been doing ELT and didn’t know it.

Issue number 2: we tend to do in-flight T when we are looking to fix up DQ issues, perform key mgmt, aggregate the data, etc. Similar problem as auditing…I now have no way of seeing what the sender was really sending. We do the T in-process b/c we assume no one will ever want the lower-level, dirty data. But data scientists love to find signal in that noise. An example: if you stream data from SFDC to your data lake, you’ll see the creation of an oppy object is actually probably a dozen updates to the table. Seeing that order of updates, in real time…well, you can do some interesting predictive stuff with that.

Numbers 3 and 4: SRP and code obfuscation. The “movement” code should be separate from the “logic”. How many folks can understand dtsx or ADF JSON files and can go in and spot some logic that is transforming data wrongly? But if all the T logic is in a stored proc, Spark job, whatever, well…I’ve got a better chance that it is machine-searchable and lineage-capable. And in a git repo. That a business person can be taught to understand.

Number 5: this actually SOLVES the privacy problem. It’s likely impossible to extract/transform 3rd party data that doesn’t have a modicum of PII or sensitive data. We “land” that in “bronze”, unaltered, with the privacy data…then the next step is the key mgmt, DQ, aggregation, pseudonymization of data, etc. This becomes “silver”.
NO ONE (probably) should have read access to bronze, but the pseudonymized data in silver is now sanitized. Likely you’ll eventually find a bug in your “bronze to silver” code, especially around key mgmt…now you have a spot you can go back to (in bronze) for debugging.
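The bronze-to-silver step described above might look like the following sketch. The field names, salt, and records are all invented; a real implementation would use a managed key/secret, not a constant.

```python
import hashlib

# Bronze/silver sketch: land records unaltered (bronze), then pseudonymize
# PII on the way to silver. The salt here is a placeholder, not a real secret.
SALT = b"not-a-real-secret"

def to_silver(bronze_record):
    # Replace the raw email with a stable pseudonym so silver is sanitized,
    # while bronze keeps the original for (restricted) debugging.
    record = dict(bronze_record)
    pseudonym = hashlib.sha256(SALT + record["email"].encode()).hexdigest()[:12]
    record["email"] = pseudonym
    return record

bronze = [{"id": 1, "email": "alice@example.com", "amount": 10.5}]
silver = [to_silver(r) for r in bronze]
print(silver[0]["email"] != bronze[0]["email"])  # True: PII no longer visible
```

Because the hash is deterministic, the same email always maps to the same pseudonym, so joins across silver tables still work without exposing the raw PII.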
And last, whoever told you that there is “minor reshaping of the lake data” but the heavy lift is in the compute tier is…well…wrong. That’s a terrible approach. You still need to “model” data, transform it, and persist it.
If you don’t mind me being presumptuous…where most traditional RDBMS pros struggle with data lakes is the notion of “I’m copying a lot of data around to various spots in the lake, and it’s prolly not in 3NF. That feels icky and how am I gonna keep that data/code in sync.” You need to suspend disbelief. Yep, the data size will explode (wait until you learn there ain’t no indexes in a data lake and you’ll need to copy entire data sets for certain query patterns)…storage is cheap, get over it. And we are getting better in 2022 with this, it’s less black/white. Likewise, the “keeping everything in sync” is actually a breeze and will eliminate 50% of all of your ETL code when you do it right…but that’s too much for this comment. Just…suspend disbelief.
tldr: ELT helps with auditing.