Introduction to Big Data
February 28, 2018
What is Big Data?
With the advancements in technology, humans and machines now create, by one widely quoted estimate, more data every 2 days than the world created from the beginning of time up until 2003.
Think about that for a second. Every 2 days, we create more data than in all the time from the beginning of history until 2003. That’s a lot of data.
Big Data is the notion of very large amounts of data that a person or organization uses to gather some information or insight. When talking about Big Data and its definition, the 5 V’s are very important: Volume, Velocity, Variety, Veracity, and Value.
The Five V’s — Volume, Velocity, Variety, Veracity, and Value
Volume
Volume is the amount of data that you are consuming. It goes without saying that for something to be considered Big Data, the volume needs to be huge. How huge? It’s a judgement call. It could be 100 GB, 100 TB, or 100 PB.
Velocity
Velocity is how fast your data is coming into the system. Data nowadays can be created super quickly. For example, imagine an airplane that has 1000 sensors inside the engines, on the wings, on the landing gear, and even in the lavatories. There has to be a system somewhere that takes that data to understand whether the plane needs maintenance, new landing tires, or just a quick clean of the lavatories. That data streams in every second to help keep the passengers on the plane safe, so it has high velocity.
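To put a rough number on that velocity, here is a back-of-the-envelope sketch. Only the 1000 sensors and the once-per-second rate come from the example above. The size of each reading and the length of the flight are assumptions made up for illustration, so treat the result as an order-of-magnitude guess rather than a real figure.

```python
# Back-of-the-envelope velocity estimate for the airplane example.
SENSORS = 1000             # from the example above
READINGS_PER_SECOND = 1    # "streams in every second"
BYTES_PER_READING = 16     # assumption: sensor id + timestamp + value
FLIGHT_HOURS = 5           # assumption: a medium-length flight


def bytes_per_second(sensors, readings_per_second, bytes_per_reading):
    """Raw ingest rate the receiving system has to keep up with."""
    return sensors * readings_per_second * bytes_per_reading


def bytes_per_flight(rate_bytes_per_second, hours):
    """Total raw data produced over one flight."""
    return rate_bytes_per_second * hours * 3600


rate = bytes_per_second(SENSORS, READINGS_PER_SECOND, BYTES_PER_READING)
total = bytes_per_flight(rate, FLIGHT_HOURS)

print(f"{rate:,} bytes per second")
print(f"{total / 1e6:,.0f} MB per flight")
```

Under these assumptions a single plane is modest. The numbers grow quickly once you multiply by a whole fleet, more frequent readings, or larger payloads.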
Variety
Variety is the different types of data that we have. Imagine Facebook, where you can have text, post metrics, emojis, pictures, videos, and advertising metrics. These represent a lot of different types of data that can be collected, and they might not even scratch the surface of everything Facebook collects. In the plane example above, the data collected from the landing gear is different from the data collected from the plane wings.
Think of data in three main categories: structured, semi-structured, and unstructured. Structured data is like the old-school relational database, where there is a clear schema or structure to the data. Unstructured data has no clear schema or even pattern, such as free text, pictures, or videos. Semi-structured data has a partial schema, but parts of it don’t follow a fixed pattern; JSON documents whose fields vary from record to record are a common example.
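A small sketch can make the three categories concrete. The sample values below (sensor names, users, the review text) are made up for illustration and only use Python's standard library.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema, like a relational table.
structured = io.StringIO("sensor_id,reading\nwing_left,0.42\nwing_right,0.40\n")
rows = list(csv.DictReader(structured))

# Semi-structured: records share some keys, but not all of them.
semi_structured = [
    json.loads('{"user": "ann", "text": "hello", "likes": 3}'),
    json.loads('{"user": "bob", "photo": "cat.jpg"}'),
]
common_keys = set(semi_structured[0]) & set(semi_structured[1])

# Unstructured: free text with no schema at all.
unstructured = "Flight was fine but the landing felt rough."

print(rows[0])          # every row has the same two columns
print(common_keys)      # the only key both records agree on
print(unstructured)     # nothing to parse without extra work
```

Notice that the structured rows can be queried by column name right away, the semi-structured records need a check for which keys exist, and the unstructured text needs some other kind of processing before it yields anything.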
Veracity
Veracity is the ability to trust your data. Think of the world of Twitter where everyone has an opinion on a certain topic. It’s hard to distinguish facts from fiction.
This phenomenon is the same with data. Let’s go back to the airplane example. Imagine if one of the sensors in the landing gear was faulty and was sending data that there were no tires on the landing gear. This could be true or not. Chances are there are tires on the landing gear, but is there a way to check to see if the sensor is telling the truth or just sending a false reading?
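One possible way to check a suspicious reading is to compare it against other sensors that measure the same thing. The sketch below assumes the landing gear has redundant sensors, which the example above does not say, and the sensor names and values are hypothetical. It is a simple majority vote, not a description of how real aircraft systems work.

```python
from collections import Counter


def majority_reading(readings):
    """Return the value most sensors agree on and the sensors that disagree.

    If no value has a strict majority, return None and flag every sensor.
    """
    counts = Counter(readings.values())
    value, votes = counts.most_common(1)[0]
    if votes <= len(readings) / 2:
        return None, sorted(readings)
    suspects = sorted(name for name, r in readings.items() if r != value)
    return value, suspects


# Hypothetical redundant sensors on the same landing gear.
readings = {
    "gear_sensor_a": "tires_present",
    "gear_sensor_b": "tires_present",
    "gear_sensor_c": "no_tires",
}

value, suspects = majority_reading(readings)
print(value)     # the reading most sensors agree on
print(suspects)  # sensors worth inspecting
```

A vote like this only tells you which sensor disagrees with the others. Someone who knows the system still has to decide whether the odd reading is a faulty sensor or a real problem.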
Value
Value is the last V. If we are going to be collecting massive amounts of data, that data has to be stored somewhere, which costs money. While the cost of storage has gone down, storing petabytes of data still costs money. So if you can determine what data doesn’t add value, you can get rid of it, saving you time and money. If you decide that the lavatory sensors on the airplane don’t add value because you clean and empty the lavatories after every flight, you don’t have to worry about collecting that data. Be sure to understand your data, because you don’t want to get rid of data that might actually have some hidden value to you or your organization.
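Here is a rough sketch of that trade-off. The price per gigabyte and the stored volumes are placeholder numbers invented for illustration, not real quotes or measurements, so plug in your own figures.

```python
PRICE_PER_GB_MONTH = 0.02   # assumption: placeholder price, not a real quote


def monthly_storage_cost(gigabytes, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Monthly cost of keeping the given amount of data in storage."""
    return gigabytes * price_per_gb_month


# Assumption: stored volume per data source, in GB.
stored_gb = {"engine": 40_000, "landing_gear": 8_000, "lavatory": 2_000}

before = monthly_storage_cost(sum(stored_gb.values()))
after = monthly_storage_cost(
    sum(gb for source, gb in stored_gb.items() if source != "lavatory")
)

print(f"Keeping everything:      {before:,.2f} per month")
print(f"Without lavatory data:   {after:,.2f} per month")
print(f"Saved by dropping it:    {before - after:,.2f} per month")
```

The arithmetic is the easy part. Deciding that a data source really has no hidden value is the judgement call.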
Where does Big Data come from?
Data is everywhere around us today. It is very hard to pinpoint where it is generated from because it is so omnipresent in our lives. Humans create a ton of data! Let’s take a look at a few examples:
- Cars have a ton of sensors such as proximity sensors, back-up cameras, side cameras, tire pressure gauges, etc. that monitor the car and its surroundings.
- Your fitness tracker tracks your sleep, food intake, and physical activity. That’s a lot of data to collect every second of every day!
- Twitter and Facebook generate so much data: not only the kind that we see (posts) but also the kind on the backend (post clicks, time spent on every page, etc.). This data is used to personalize our home feeds as well as our interactions on the site.
- Roomba vacuums have sensors that report to the owner on the app.
- People do online searches every day, and search engines like Google are constantly storing popular queries, website visits, etc.
- Satellites, airplanes, etc. send back a lot of telemetry data.
This data is collected and used for many different purposes such as targeting ads, predicting anomalies, making us healthier, figuring out how to increase revenue, informing business decisions, etc. It is quite amazing to think about how much data is around us and how deeply it affects our lives.
Why Big Data? Is there a problem?
Look at Big Data as a challenge rather than a problem. You need to understand the specifics of the 5 V’s of your data and create a system that gives you the information that you need. You need to understand the volume of your data to provision the correct amount of storage, the velocity of the data to choose a technology stack that can handle that amount of throughput, and the variety of the data to pick the correct type of database or web application to retrieve and store it. But these are the easy parts.
Understanding the veracity and value of the data is the most challenging part. You need to be able to tell which data you can trust and which data you can’t.
Perhaps you take in the airplane sensor data. Is there some type of calibration that needs to happen to understand which way the airplane is moving because there is a delay in the data?
Or maybe your system has to understand the way consumers feel about a certain product via Facebook posts. Some people get paid to sabotage reviews, so you may need to use some judgement to decide whether a review is real or not.
Veracity is more of a judgement call than velocity. We can measure velocity easily, but it is generally much more work to measure veracity. Value is similar: a judgement call has to be made about whether or not the data is needed or wanted. Companies pay people a lot of money to understand the data that they deal with. These people become subject matter experts in this data. If you want to know whether or not you need to keep some data, you seek their advice since they are experienced with the data and its usage. Companies use these people and their judgement to tackle both veracity and value.
Each of the 5 V’s can affect your system in a negative way if not understood or tamed. Unfortunately, there will never be an unlimited amount of storage for data that has no value. Your data collection system might not be able to handle all the data that you want to collect, so you may have to trim down the data where you can to keep the velocity in check. And most importantly, you want to make sure that your analysis and reporting processes are giving the correct information based on valuable data.
Conclusion
Data is all around us. It can be harnessed, but understanding the data you wish to collect is a must. Applying the 5 V’s to understand how the data works is a great strategy to get you started on your Big Data path. Big Data isn’t going away any time soon. It will probably get bigger and bigger. It is your responsibility to take Big Data and convert it to useful information for yourself or for your organization.