One of the trends of the last ten years has been for many developers to try to avoid using a relational database where possible. Some look to NoSQL data stores, and others even consider flat-file stores of JSON or other formats that let them work with speed and agility. Quite often, though, applications grow to require some sort of relational store, often as an additional data store.
I ran across an article from a data science and analysis developer who often works in R or Python on datasets. The post notes that once your datasets grow larger than memory, you might want to consider using a local database of some sort.
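As a sketch of what that might look like, here is one way to stream a CSV into a local SQLite file in batches, so the full dataset never has to fit in memory. This is my own illustration, not the article author's code; the function and parameter names are hypothetical.

```python
import csv
import sqlite3

def load_csv_to_sqlite(csv_path, db_path, table, batch_size=10_000):
    """Stream a CSV into a SQLite table in batches, so only
    batch_size rows are held in memory at any one time."""
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        insert = f'INSERT INTO "{table}" VALUES ({placeholders})'
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany(insert, batch)  # flush a full batch
                batch = []
        if batch:
            conn.executemany(insert, batch)  # flush the remainder
    conn.commit()
    conn.close()
```

From there, the heavy lifting (filtering, grouping, joining) can be pushed down to SQL queries against the file rather than done in RAM.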
Actually, the first question the author asked was “when is your data too big?” Their answer: when operations take a long time, and for the author that threshold was 20 seconds. I tend to agree, as I am looking for Notepad-like startup performance for apps, and query results within the low tens of seconds.
Most people who perform some sort of data analysis understand tables. Whether this is in R, Python, or even Excel, the table structure for data is familiar and easy to work with. Some analysts might not be overly concerned about normalization, but that isn’t necessarily a problem when data is loaded into systems and rarely (if ever) updated. In these cases, just having a database of some sort could speed up your work.
I think you ought to use a database early, if for no other reason than that it is good practice for loading and storing data in a form that is persistent, scalable, and often performs better over time as queries and data manipulation grow more varied. While in-memory tools make quick experimentation easy, I think a database is better suited to queries across time.
I know I’m biased, but if you find data scientists and other analysts struggling with datasets, offer them a database. They can easily share data, you can protect it with backups, and you might find that you both learn a few things from working together.