This is part of a series on my preparation for the DP-900 exam, Microsoft Azure Data Fundamentals, which is part of a number of certification paths. You can read the various posts I’ve created as part of this learning experience.
The first part of the DP-900 skills document has these items:
- describe batch data
- describe streaming data
- describe the difference between batch and streaming data
- describe the characteristics of relational data
These concepts are important to this exam. I mostly blew them off when I started studying, but everyone else’s study guides and the practice tests put a lot of focus here. I’m glad I spent time on them.
This post covers these concepts a bit. Note, these are more ETL/analytic concepts, not really
Most of my career has dealt with batch data, meaning a bunch of data that arrives at once and is imported into a system. This is different from a connection and query submitted to an OLTP system. The general idea is:
- Lots of data
- Processed periodically
- Latency doesn’t matter.
Think these key words:
- Not real-time
There is an MS Docs article on this. The general idea is that you want to think about a scheduled (or some periodic) processing of lots of data for a purpose.
Examples of where batch is used:
- Total up all hours worked last week for employees
- Load and transform log files from all web servers each day
- Import files from regional offices into a main database server
In the analytics space, you’d be using Azure Data Factory (ADF), HDInsight (Hive, Pig, Spark), and Azure Data Lake (ADLS, with U-SQL in Data Lake Analytics).
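To make the batch idea concrete, here is a minimal sketch of the "total up all hours worked last week" example in plain Python. The `TIMESHEET` rows, the `total_hours` function, and the dates are all made up for illustration. A real job would read from files or a database and run on a schedule, for instance through an ADF pipeline.

```python
from collections import defaultdict
from datetime import date

# Hypothetical timesheet rows: (employee, work_date, hours).
TIMESHEET = [
    ("ana", date(2021, 3, 1), 8.0),
    ("ana", date(2021, 3, 2), 7.5),
    ("bo", date(2021, 3, 1), 6.0),
    ("bo", date(2021, 3, 9), 8.0),  # outside the week processed below
]


def total_hours(rows, week_start, week_end):
    """Sum hours per employee for work dates in [week_start, week_end]."""
    totals = defaultdict(float)
    for employee, work_date, hours in rows:
        if week_start <= work_date <= week_end:
            totals[employee] += hours
    return dict(totals)


# The whole week is processed in one pass, after the fact.
weekly_totals = total_hours(TIMESHEET, date(2021, 3, 1), date(2021, 3, 7))
print(weekly_totals)
```

The point is the shape of the work: all the data for the period is already sitting there, and nobody is waiting on the answer while it runs.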
There is a course on this topic. When you think of streaming, think of these key words:
- data processed as soon as created
- few transactions
- monitoring or instant decision making
Streaming is really about time series, about tumbling windows, about data like a stock ticker that you need to constantly and/or quickly process.
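A tumbling window is a fixed-size, non-overlapping slice of time, and each event lands in exactly one window. The sketch below is my own illustration in plain Python, not how Azure Stream Analytics is implemented. The `tumbling_averages` name, the tick data, and the 5-second window are assumptions, and it assumes ticks arrive in timestamp order.

```python
def tumbling_averages(ticks, window_seconds):
    """Yield (window_start, average_price) each time a window closes.

    Assumes ticks are (timestamp_seconds, price) pairs in timestamp order.
    """
    current_start = None
    total = 0.0
    count = 0
    for ts, price in ticks:
        # Each tick belongs to exactly one fixed-size window.
        start = (ts // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # A tick for a later window arrived, so emit the finished one.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += price
        count += 1
    if count:
        # Flush the last, still-open window when the input ends.
        yield current_start, total / count


# Hypothetical stock ticks: (seconds since start, price).
TICKS = [(0, 10.0), (3, 12.0), (5, 20.0), (9, 22.0), (12, 30.0)]
results = list(tumbling_averages(TICKS, 5))
print(results)
```

Because this is a generator, each result comes out as soon as its window closes, rather than after all the data has been collected. That is the key contrast with batch.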
These items helped me:
- Lots of data – Batch
- Low latency – Stream
- Long latency, latency doesn’t matter, periodic work – Batch
- Small, constant sets of data – Stream
The workload here is that you are handling regular changes to data, lots of inserts, updates, and deletes, for a business process. Really this means some sort of CRUD application that gets data from and sends data to users interactively, but without strict low-latency requirements. Think of a web server or a data entry business app, something that operates on human time scales of seconds. This is not real-time, millisecond IoT work.
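As a rough sketch of that kind of workload, here are a few small insert/update/delete transactions against a relational table. I am using Python’s built-in `sqlite3` with an in-memory database as a stand-in for something like Azure SQL Database. The `customer` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, standing in for a real OLTP database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Each "with conn" block is one short transaction:
# it commits on success and rolls back if an error is raised.
with conn:
    conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Ana"))
    conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (2, "Bo"))

with conn:
    conn.execute("UPDATE customer SET name = ? WHERE id = ?", ("Ana M.", 1))

with conn:
    conn.execute("DELETE FROM customer WHERE id = ?", (2,))

rows = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
print(rows)
conn.close()
```

Each transaction touches a row or two and finishes quickly, which is the opposite of a batch job that works through a large set of data in one run.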