Unstructured Data

Is unstructured data a bad term? I saw some data professionals complaining about this, saying all data is structured. That’s usually true. A CSV, even a ragged one, has structure. XML and JSON have structure, even if it might vary node to node. Certainly our relational tables are structured, and some formats can be rigidly mandated between organizations (like EDI). Even data in PDF, Word, MP3, MP4, or other audio/video formats is structured in that we know the format.

Given that, is it a misnomer to use the term unstructured when describing flexible formats, such as XML? Is it OK for a PDF? I have a presentation called Unstructured Data in SQL Server. It is primarily about FileStream, FileTable, and searching those objects. In the talk, I classify data in known formats as structured. These would be SQL Server tables and similar objects. At any point in time, we know what all data in the table looks like, even though we can have NULLs or missing data in rows.

I call XML and JSON semi-structured formats. We can certainly determine the format for any node or section, but we wouldn’t know it without querying or examining the data. It’s semi-structured in that there is a hierarchy, but the structure from section to section (essentially row to row) can vary. Even the depth of the hierarchy can vary. In many ways, that makes these great formats for flexibility in data exchange.
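As a small illustration of structure that varies from node to node, here is a sketch in Python. The records and field names are made up for this example, and the shape helper is hypothetical; the only point is that the shape of each node has to be discovered by examining the data.

```python
import json

# Hypothetical payload: three "rows" whose keys and nesting depth differ.
payload = """
[
  {"id": 1, "name": "Ann"},
  {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "contact": {"email": {"work": "cy@example.com"}}}
]
"""


def shape(node, prefix=""):
    """Return the set of key paths found in one parsed JSON node."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= shape(value, path)
    elif isinstance(node, list):
        for item in node:
            paths |= shape(item, prefix + "[]")
    return paths


records = json.loads(payload)
# One sorted list of key paths per record; no two records share a shape.
shapes = [sorted(shape(record)) for record in records]
for record, paths in zip(records, shapes):
    print(record["id"], paths)
```

Every record parses cleanly, so there is structure, but nothing tells us ahead of time which paths a given record will contain.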

I tend to view data in Word documents, PDFs, and MP4s as unstructured. We don’t necessarily know where the data is, or how to separate it. We can get pages in Word or PDF, but those can vary and don’t necessarily help us extract information. Modern Word files are XML underneath, but the XML tags don’t relate to the content, unlike many other XML documents. Scenes or tracks in audio/video files might be separators, but those aren’t necessarily helpful in gathering information. Instead, we need other tools that can help deal with that data, finding words, concepts, or more inside of the binary stream.

I like the term unstructured data because it helps me understand where the information is. While the tables in a database might be full of nonsensical information in some rows, or be poorly designed with data combined into text fields, at least I know where the fields are. Actually, in that case, I’d argue the data in varchar(max) text fields is really unstructured. You might disagree, but give me a better term to describe where the information is stored in a data format.
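To make the contrast concrete, here is a rough sketch in Python. It uses an in-memory SQLite database as a stand-in for SQL Server, and the tickets table, its rows, and the regular expression are all invented for illustration.

```python
import re
import sqlite3

# SQLite stands in for SQL Server here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, customer TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [
        (1, "Ann", "Called 2019-03-04, wants refund of $25"),
        (2, "Bob", "refund?? maybe 40 dollars, spoke last Tues"),
    ],
)

# Structured: we know exactly where the customer name lives.
customers = [
    row[0] for row in conn.execute("SELECT customer FROM tickets ORDER BY id")
]

# Unstructured: a refund amount buried in free text has to be guessed
# with a pattern, and the guess misses anything phrased differently.
amount_pattern = re.compile(r"\$(\d+)")
amounts = []
for (notes,) in conn.execute("SELECT notes FROM tickets ORDER BY id"):
    match = amount_pattern.search(notes)
    amounts.append(int(match.group(1)) if match else None)

print(customers)
print(amounts)
```

The customer column gives an answer for every row, while the pattern finds an amount only in the first note, even though both notes mention one.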

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.7MB) podcast or subscribe to the feed at iTunes and Libsyn.

About way0utwest

Editor, SQLServerCentral

1 Response to Unstructured Data

  1. pianorayk says:

    In one of my SQL Saturday presentations about writing and documentation, I use the following analogy: as data professionals, we know there’s a difference between data and information. Likewise, when it comes to documentation, there’s a difference between notes and documentation. Scratch notes are NOT documentation. Too many people don’t understand that distinction. So even within what is considered “unstructured” data, there is still something that can be called “structure.”

