Is unstructured data a bad term? I saw some data professionals complaining about this, saying all data is structured. That’s usually true. A CSV, even a ragged one, has structure. XML and JSON have structure, even if it might vary node to node. Certainly our relational tables are structured, and some formats can be rigidly mandated between organizations (like EDI). Even data in PDF, Word, MP3, MP4, or other audio/video formats is structured in that we know the format.
Given that, is it a misnomer to use the term unstructured when describing flexible formats, such as XML? Is it OK for a PDF? I have a presentation called Unstructured Data in SQL Server. This is primarily about FILESTREAM, FileTable, and searching those objects. In the talk, I classify data in known formats as structured. These would be SQL Server tables and similar objects. At any point in time, we know what all data in the table looks like, even though we can have NULLs or missing data in rows.
I call XML and JSON semi-structured formats. We can certainly determine the format for any node or section, but we wouldn’t know it without querying or examining the data. It’s semi-structured in that there is a hierarchy, but the structure from section to section (essentially row to row) can vary. Even the depth of the hierarchy can vary. In many ways, that makes these great formats for flexibility in data exchange.
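To make the "vary from row to row" point concrete, here is a minimal Python sketch. The records and field names are made up for illustration, and the helper functions `paths` and `depth` are hypothetical, not part of any library. It simply shows that the shape of each record is only known after examining it.

```python
import json

# Hypothetical records; every field name here is invented for illustration.
raw = '''[
  {"id": 1, "name": "Ann", "email": "ann@example.com"},
  {"id": 2, "name": "Bob",
   "address": {"city": "Denver", "geo": {"lat": 39.7, "lon": -105.0}}},
  {"id": 3}
]'''

def paths(node, prefix=""):
    """Return the set of dotted key paths found in one JSON object.

    Only nested objects are walked; arrays are ignored in this sketch.
    """
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            found.add(path)
            found |= paths(value, path)
    return found

def depth(node):
    """Nesting depth of a JSON value: scalars are 0, each object level adds 1."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    return 0

records = json.loads(raw)
shapes = [sorted(paths(r)) for r in records]   # the fields each record has
depths = [depth(r) for r in records]           # how deep each record nests

for record, shape, d in zip(records, shapes, depths):
    print(f"id={record['id']} depth={d} fields={shape}")
```

Each record parses fine, so there is structure, but the three records have three different sets of fields and two different nesting depths. A relational table would have told us the columns up front; here we have to look.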
I tend to view data in Word, PDF, and MP4 files as unstructured. We don’t necessarily know where the data is, or how to separate it. We can get pages in Word or PDF, but those can vary and don’t necessarily help us extract information. Word documents (.docx) are XML under the hood, but the XML tags don’t relate to the content, unlike many other XML documents. Scenes or tracks in audio/video files might be separators, but those aren’t necessarily helpful in gathering information. Instead, we need other tools that can help deal with that data, finding words, concepts, or more inside of the binary stream.
I like the term unstructured data because it helps me understand where the information is. While the tables in a database might be full of nonsensical information in some rows, or be poorly designed with data combined into text fields, at least I know where the fields are. Actually, in that case, I’d argue the data in varchar(max) text fields is really unstructured. You might disagree, but give me a better term to describe where the information is stored in a data format.