Data, whether structured or unstructured, is the lifeblood of business and is, or should be, at the heart of every decision your company makes. The term “big data” has become commonplace not only in the tech industry but also in everyday speech. Like many technology terms, big data has no single definition, but the common denominator is that it is data that arrives in large volumes and at high speed, which makes it difficult to analyze with traditional tools. To put that in a real-world context, think about the large volumes of real-time data produced by everything from your car to an offshore oil rig.
Before you use big data to drive business outcomes, you need to understand where it’s coming from and how to recognize and capture it so you can build an efficient data model. The more organization or structure you can give your data, the easier it will be to record and analyze it. For that reason, structured data, data born to be analyzed, is the backbone of big data.
What is structured data?
Data falls into two categories: structured and unstructured. As with yin and yang, the two are opposites that belong together, and the names are largely self-explanatory. Unstructured data includes content such as video, email, images, podcasts, social media posts, and PDF files. In short, unstructured data does not have an internal identifier that allows search functions to recognize it. A commonly cited estimate is that it also makes up about 80 percent of the data generated.
Structured data exists in a format created to be captured, stored, organized, and analyzed. It is perfectly organized for easy access. If structured data were an office, it would contain many filing cabinets that are efficiently configured, clearly labeled, and easily accessible. For that reason, structured data brings inherent benefits when it comes to large volumes of information.
However, structured data vs. unstructured data is not a zero-sum game. The two complement each other, and your unstructured data sets hold insights of their own. For example, structured data records can contain unstructured data within them. Consider a form that offers questions with a list of answers available in a drop-down menu, but also allows users to add comments freely. The responses generated from the pick list are structured data, but the comment field produces unstructured data.
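The form described above can be sketched in Python. This is only an illustration: the field names (`satisfaction`, `comment`) and the pick-list values are hypothetical and not taken from any particular form system.

```python
from dataclasses import dataclass

# Hypothetical pick-list values for a drop-down question (assumed for illustration).
ALLOWED_SATISFACTION = ("low", "medium", "high")


@dataclass
class FormResponse:
    satisfaction: str  # structured: must be one of the pick-list values
    comment: str       # unstructured: any free text the user types

    def __post_init__(self):
        # Reject anything that is not on the pick list, as a drop-down would.
        if self.satisfaction not in ALLOWED_SATISFACTION:
            raise ValueError(f"unknown answer: {self.satisfaction!r}")


def count_by_satisfaction(responses):
    """Tally the structured field; no text analysis is needed."""
    counts = {value: 0 for value in ALLOWED_SATISFACTION}
    for response in responses:
        counts[response.satisfaction] += 1
    return counts


responses = [
    FormResponse("high", "Checkout was quick, but the search page felt slow."),
    FormResponse("low", "Could not find the return policy anywhere."),
    FormResponse("high", ""),
]
print(count_by_satisfaction(responses))
```

The pick-list field can be counted directly, while the comments would need separate text analysis before they could be summarized.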
Most data is hybrid to some extent. For that reason, you may also see the term semi-structured data, which is a loosely defined subset of structured data. This format includes the ability to add tags, keywords, and metadata to data types that were previously considered unstructured data. Images, email, and word processing files that carry such descriptive elements are examples of semi-structured data. Markup languages such as XML are often used to manage semi-structured data.
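As a rough sketch of what semi-structured data can look like, the snippet below parses a small XML record with Python's standard library. The element and attribute names (`email`, `tags`, `body`, and so on) are invented for this example and do not follow any specific standard.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML record: the tags and attributes are structured metadata,
# while the <body> text is free-form content.
DOCUMENT = """
<email id="42" sent="2020-12-07">
  <from>alice@example.com</from>
  <tags>
    <tag>invoice</tag>
    <tag>urgent</tag>
  </tags>
  <body>Hi Bob, the invoice we discussed is attached. Thanks!</body>
</email>
""".strip()

root = ET.fromstring(DOCUMENT)

# The metadata can be queried directly, like fields in a structured record.
metadata = {
    "id": root.get("id"),
    "sent": root.get("sent"),
    "from": root.findtext("from"),
    "tags": [tag.text for tag in root.iter("tag")],
}
print(metadata)

# The body remains unstructured text that would need separate analysis.
body = root.findtext("body")
print(body)
```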
Structured data, unlike unstructured data, tends to be a more natural fit for the data mining processes of traditional big data applications.
Where does structured data come from?
The two main places where structured data is generated are databases and search algorithms.
The term structured data is often associated with relational database management systems, which date back to the 1970s and a mathematical theory developed by Edgar Codd at IBM’s San Jose Research Laboratory. The Codd model organizes data into one or more tables (also known as relations) of columns and rows. A few years later, IBM colleagues Donald D. Chamberlin and Raymond F. Boyce designed the Structured Query Language (SQL), which is used with the vast majority of relational databases.
In addition to relational databases, spreadsheets are also common sources of structured data. Whether you use a complex SQL database or an Excel spreadsheet, structured data depends on a data model, so you need to plan how you’ll capture, store, and access the data. For example, will it store numerical, monetary, or alphabetical data?
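To make the idea of a data model concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are all hypothetical, and storing money as whole cents is just one common convention, not a requirement.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# The data model is decided up front: each column has a name and a type.
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer    TEXT    NOT NULL,   -- alphabetical data
        quantity    INTEGER NOT NULL,   -- numerical data
        price_cents INTEGER NOT NULL    -- monetary data, stored as whole cents
    )
    """
)

conn.executemany(
    "INSERT INTO orders (customer, quantity, price_cents) VALUES (?, ?, ?)",
    [("Acme", 3, 1999), ("Globex", 1, 4500), ("Acme", 2, 1999)],
)

# Because every row follows the same model, SQL can aggregate it directly.
rows = conn.execute(
    """
    SELECT customer, SUM(quantity * price_cents) AS total_cents
    FROM orders
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

for customer, total_cents in rows:
    print(f"{customer}: {total_cents / 100:.2f}")

conn.close()
```

Deciding the column types before any data arrives is what lets a single query total the orders per customer without cleaning or interpreting the values first.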
While relational databases and SQL have a long history, structured data has more recently come to play an important role in internet searches as well, where it offers benefits for organic search. According to Google’s Introduction to Structured Data, “When information is highly structured and predictable, search engines can more easily organize and display it creatively.” Google says that structured data markup makes it possible for your content to appear in rich results and Knowledge Graph cards.
To create a structured data standard for web-based applications, email messages, and other forms of internet content, Google, Microsoft, Yahoo, and Yandex founded Schema.org, an open community. The Schema.org vocabulary can be used with encodings such as RDFa (an HTML5 extension used in the head and body sections of an HTML page), Microdata (an open HTML specification used to include structured data in HTML content), and JSON-LD (JavaScript Object Notation for Linked Data).
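As an illustration of the JSON-LD encoding, the sketch below builds a small Schema.org `Article` description in Python and wraps it in the script tag that pages typically use to embed JSON-LD. The headline, date, and author are placeholder values, not details of any real page.

```python
import json

# Hypothetical article details, used only for illustration.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "What is structured data?",
    "datePublished": "2020-12-07",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

# JSON-LD is typically embedded in a page inside a script tag of this type.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(article, indent=2)
    + "\n</script>"
)
print(snippet)
```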
Unlike unstructured data, which grows organically (and unchecked) and comes from a wide range of sources, structured data is created in two ways. The first is data generated by machines, using devices or sensors without human intervention. According to IDC, 80 billion devices will be connected to the internet by 2025, versus approximately 11 billion devices connected now. That means a lot more devices producing a lot more data.
Examples of machine-generated data include sensor data such as GPS, RFID tags, medical devices, network and web log data, and retail and e-commerce data, to name just a few.
The second way is data generated by people to feed databases and spreadsheets. This typical structured data is created by humans interacting with computers and other devices. Examples include data (other than free-form text) generated through interaction with online forms, kiosks, and games.