I was reading a piece on ETL patterns and found it interesting that the author noted that one way to reduce the challenge of ETL processes keeping up with changing schemas is to move to an ELT process. That’s an interesting perspective, and I wonder what those who work with the relational model think. IS this a pattern you like?
In a traditional ETL pattern, we get data from a source (Extract), manipulate it to suit the target schema (Transform) and then Load it into the target. The hassles here, and the reason this is hard, is that the transformations are often complex. We deal with ragged rows, various delimiters and terminators, and more. Lots of consultants have made quite a bit of money by building these transformational applications and processes and maintaining them as requirements change.
The flip of the last two parts, moving from TL to LT, looks to defer some of the transform work until the data is in the destination. This means that the transformation could be a schema-on-read, as the article proposes. The idea here is that all data is ingested into the target system as is. This means that structured CSVs reside alongside spatial data and Word documents. Then, once you have the data, you can transform it as needed, to suit whatever requirements exist at that time.
Certainly this sounds appealing, but for many of us in organiztions, the challenge is still determining how to transform the data. This is often dependent on what the data is and how it can be used. We still need to do the work to understand how this data is structured and change something to make data available to the end user. Whether we do this in the middle or after the data is in a target system doesn’t matter. I know some products, like the MarkLogic one in the post and Azure Data Lake want to reduce must of the ETL effort, but I’m not sure that will work for really important data. For some, sure, we can collect it and sort out the transformations later. However, for useful data, data that the business wants regularly and quickly, most of us need to just get a flow working that will move data on a regular basis.
I’ll admit that I often prefer to transform data inside of a SQL Server target, OLTP or DW database. The reason is that mass changes are often easier with T-SQL, working sets of data rather than row by row. I recognize that doesn’t always work, but I know that for most data that business people do find useful, I’d rather have an ETL process build sooner rather than later.