I recently read a piece about the hassles of copying data multiple times for different applications. In my experience, this hasn’t been the main problem with data. It isn’t often that we replicate data, in a general sense, across different data stores to support different applications. Certainly lots of ETL jobs exist to copy data to new stores for different purposes, which is perhaps what the author is implying.
Protecting data is becoming a greater concern for many organizations. In fact, I’d argue that a number of the high-profile data breaches in the last couple of years involved copying data from some RDBMS store to an Elasticsearch server that wasn’t secure. Any movement of sensitive data, whether to a warehouse or a Power BI report, should be done securely.
For years we’ve had minor issues with data security in Excel worksheets; a similar problem continues to exist with both data stores and reporting tools that might contain copies of data. In some sense, this is actually no different than the problems of losing paper reports in the distant past.
The solution given in the article is to share data from a single store among more applications. That’s been the practice in many places I’ve worked, though it brings the challenges of additional load and performance concerns on the data store. Modern distributed SQL Server deployments can use Availability Groups (AGs) or, after SQL Server 2017, Kubernetes to scale out and potentially handle the load, but those choices aren’t without their own resource costs and challenges.
Ultimately, we aren’t going to get away from moving data around. We certainly still need to handle dev/test environments, even if we don’t have any other data movement. While I do think the future of large data workloads will involve less movement, we aren’t going to eliminate it. We may build more applications that connect to a single data store, which becomes more likely as our platforms grow more powerful and enable scale-out capabilities to meet workload growth.
We also need to ensure that copies of data made for different purposes are well protected. Most businesses need to develop better skills and habits to limit sensitive data in dev and test environments, and to apply proper access controls to data copies that are used in production environments.
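As a rough illustration of limiting sensitive data in a dev or test copy, here is a minimal Python sketch that masks selected columns before rows leave production. The column names, the salt, and the `mask_row` helper are all hypothetical, chosen only for this example. It is a sketch of the habit, not a substitute for a vetted masking tool.

```python
import hashlib

# Hypothetical column names; a real list would come from a data classification effort.
SENSITIVE_COLUMNS = {"email", "ssn"}


def mask_value(value, salt):
    """Replace a value with a short, deterministic, one-way token."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def mask_row(row, salt, sensitive=SENSITIVE_COLUMNS):
    """Return a copy of the row with sensitive columns masked; None stays None."""
    return {
        col: mask_value(val, salt) if col in sensitive and val is not None else val
        for col, val in row.items()
    }


if __name__ == "__main__":
    # Made-up sample row, not real data.
    production_row = {
        "id": 42,
        "email": "pat@example.com",
        "ssn": "123-45-6789",
        "city": "Denver",
    }
    print(mask_row(production_row, salt="dev-refresh-secret"))
```

Because the token is deterministic for a given salt, the same input always maps to the same masked value, so joins across masked tables still line up. The salt should be kept out of the dev/test environment, since low-entropy values such as ID numbers could otherwise be guessed by brute force.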