Most of us feel that our data volumes are constantly growing. A lack of data isn’t something we worry about, at least not in production. In development, far too many organizations give developers copies of sensitive production data. That certainly ensures there is enough data, but it increases liability if the data is mishandled.
If we require developers to create their own data, they often don’t have enough of it, or they don’t have a representative set that helps ensure the software meets its specifications. This is a challenge, and while random data helps, it doesn’t always work well, because real data is rarely evenly distributed. This problem might be hard to solve with OLTP systems, but what about an AI or ML scenario, where we might need lots of data to train a system?
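As a rough illustration of why purely random test data can fall short, here is a minimal Python sketch. It assumes a hypothetical orders table where each row references a customer id. The function names, row counts, and the Zipf-like weighting are my own assumptions for illustration, not something taken from a real system.

```python
import random
from collections import Counter


def uniform_customers(n_rows, n_customers, rng):
    """Pick a customer id for each row with equal probability."""
    return [rng.randint(1, n_customers) for _ in range(n_rows)]


def skewed_customers(n_rows, n_customers, rng):
    """Pick customer ids with Zipf-like weights (1/rank), so a few customers dominate."""
    ids = list(range(1, n_customers + 1))
    weights = [1.0 / i for i in ids]
    return rng.choices(ids, weights=weights, k=n_rows)


def top_share(customer_ids, top_n=10):
    """Fraction of all rows that belong to the top_n busiest customers."""
    counts = Counter(customer_ids)
    top_rows = sum(count for _, count in counts.most_common(top_n))
    return top_rows / len(customer_ids)


# Fixed seed so the generated test data is repeatable between runs.
rng = random.Random(42)
uniform = uniform_customers(10_000, 1_000, rng)
skewed = skewed_customers(10_000, 1_000, rng)

print(f"uniform: top 10 customers own {top_share(uniform):.1%} of rows")
print(f"skewed:  top 10 customers own {top_share(skewed):.1%} of rows")
```

With uniform random ids, no customer stands out, so queries and indexes never see the hot spots that production data tends to have. The weighted version is only one guess at a realistic shape. The right distribution depends on the system being tested.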
Microsoft Research is working on ways to solve this problem. They presented a paper on Icebreaker, a technique that uses minimal data to train a model. I’m not sure I completely understand how this works, but the idea is to start with very little training data and still end up with a usable model. I’m guessing there is some ML inside of the process itself.
There are all sorts of downsides to using existing data to train models. Sometimes the data carries inherent bias, or is otherwise skewed. Allowing a model to work with less data, and then adjusting, or even deliberately skewing, that data to meet our goals might help. This likely requires input and feedback from a data scientist of some sort, but that might be exactly where we take advantage of the skills and knowledge of that staff.
There will be more use of AI/ML technologies in the future, if for no other reason than that people are very interested in how they can improve the way our systems analyze data. Techniques like this might help us deal with the challenge of building a model when we don’t have all the data we would like to have.