Self-Healing ETL

The process of building ETL flows is cumbersome and time-consuming. Many consultants have made a lot of money building ETL processes that manage the movement of data in and out of databases, including into data warehouses. A lot of importance is placed on consistent, quick, and accurate data loading. In fact, ETL developer remains one of the most in-demand jobs these days. Even as data science grows in popularity, quite a bit of the work in data science is data preparation, which often involves ETL or ELT.

One of the struggles with an ETL process is the tedious nature of building lots of flows that are very similar. There may be a wide variety of ways to get data to move, especially given the crazy set of inconsistent formats in which data appears, but once we have a flow, we often repeat it for many different files or tables. I think Biml makes this better, but it’s still not simple, and plenty of issues can still arise as the source inputs change.

There was an interesting article describing the use of AI to assist in ETL flows. While there is some basic matching of source and target data right now, what if AI added a few more things? The idea of looking at the data you’ve matched and suggesting alterations that could potentially help with data quality is interesting. The system would have to learn over time what issues occur and what good data looks like, but perhaps this could even extend to cases where the source data gets slightly mangled in formatting. Perhaps it could detect fixed-width fields that have grown and shifted all subsequent columns, and even adjust the data flow.
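As a rough illustration of the fixed-width idea, here is a minimal sketch in Python. The column widths, sample data, and the `detect_shift` helper are all hypothetical assumptions for the example, not anything from the article; the point is just that when every record grows by the same constant amount, that is consistent with one column having been widened and all later columns shifting.

```python
# Hypothetical sketch: check whether every record in a fixed-width feed
# has grown by the same constant amount, which hints that one column was
# widened and shifted every field after it. Widths and data are made up.
EXPECTED_WIDTHS = [10, 8, 12]        # assumed column layout
EXPECTED_LEN = sum(EXPECTED_WIDTHS)  # 30 characters per record

def detect_shift(records):
    """Return the uniform extra width if all records grew by the same
    positive amount, otherwise None (mixed lengths need a human)."""
    extras = {len(r) - EXPECTED_LEN for r in records}
    if len(extras) == 1:
        extra = extras.pop()
        return extra if extra > 0 else None
    return None

# Every record is 4 characters longer than the layout predicts,
# consistent with one column growing from, say, 8 to 12 wide.
sample = ["A" * 34, "B" * 34, "C" * 34]
print(detect_shift(sample))  # 4
```

A real system would go further and try to locate *which* column grew by probing the data itself, but even this crude length check is enough to raise a flag before the load silently truncates or misaligns data.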

I am actually more interested in detecting PII and alerting developers, or even operations people, that the ETL process needs to be changed or secured in some way. AI should also be able to help operations people with scheduling, even pausing or moving ETL work to other machines to better scale the flow. My view is that any AI system ought to be looking for anomalies and detecting them, as pattern recognition is one thing AI is good at.
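The PII-alerting idea could start as simply as pattern matching over sampled rows. The sketch below is an assumption-laden toy, not a production scanner: the regexes (a US SSN shape and a loose email shape), the 80% threshold, and the `flag_pii_columns` helper are all invented for illustration, and the key design point matches the caution above: it only flags columns for a human, it never changes the data.

```python
import re

# Hedged sketch: flag columns whose sampled values mostly look like
# common PII shapes. Patterns and threshold are assumptions for the
# example, not a real PII scanner.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def flag_pii_columns(rows, threshold=0.8):
    """Return {column: pii_kind} for columns where at least `threshold`
    of non-empty sampled values match a PII pattern. Alert only; the
    data itself is left untouched."""
    flagged = {}
    if not rows:
        return flagged
    for col in rows[0]:
        values = [str(r[col]) for r in rows if r.get(col)]
        if not values:
            continue
        for kind, pat in PII_PATTERNS.items():
            hits = sum(bool(pat.match(v)) for v in values)
            if hits / len(values) >= threshold:
                flagged[col] = kind
    return flagged

sample = [
    {"id": "1", "contact": "alice@example.com"},
    {"id": "2", "contact": "bob@example.com"},
]
print(flag_pii_columns(sample))  # {'contact': 'email'}
```

An operations dashboard or alert hook could then tell someone the flow needs securing, which fits the alert-rather-than-adjust stance better than letting the system rewrite the pipeline on its own.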

I’d think this would be more of an alerting capability, as I’m not sure I trust an AI system to actually adjust data. ETL flows are often too important to have data mangled in an unexpected way. Perhaps that’s too cautious, as that’s what developers do all the time: they mangle data, and we find ways to recover. Ultimately, I think that having lots of logging on what a developer or AI system does is more important. That might be especially true if voice commands are used to build the flows. I could see lots of ambiguities and mistakes from business analysts trying to describe how to build an ETL flow.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.2MB) podcast or subscribe to the feed at iTunes and Libsyn.

About way0utwest

Editor, SQLServerCentral
