The term Big Data refers to all the methods used to analyze and automatically extract information from datasets that are too massive or complex to be processed by conventional data processing tools.
A Data Explosion
The volume of data stored since the advent of digital technology keeps growing: an estimated 90% of all the data collected since the beginning of humanity has been produced in the last two years.
Big Data: Definition
The term Big Data literally means large volumes of data, also called massive data.
Big Data thus refers to sets of digital data so voluminous that no traditional database management or information management tool can process them effectively.
By extension, the term Big Data also refers to the technologies used to process this data. In other words, we use Big Data (the technologies) to process Big Data (the data), which partly explains the great confusion this term generates!
The Data Source
This information comes from many sources: the messages we exchange, published videos, GPS signals, sounds, texts, images of e-commerce transactions, exchanges on social networks, data transmitted by connected objects, and many others.
According to IBM, we currently produce approximately 2.5 quintillion bytes (2.5 exabytes) of data daily through new technologies, for personal or professional purposes.
This data has come to be called Big Data, or massive data, because of its ever-growing volume. Digital giants like Google and Facebook were the first to develop technologies to process it.
Big Data is a complex and polymorphic concept, which is why there is no precise or universal definition. Its definition varies according to the communities interested in it, whether as users or as service providers.
The Characteristics Of Big Data
Characterizing Big Data is the most practical way to define it. According to Gartner’s definition, the characteristics of Big Data are broken down into three simple criteria (the 3Vs): Volume, Velocity and Variety.
Data Volume
Taking data volume into account is an essential part of defining Big Data. According to Wikipedia, “the digital data created in the world is estimated to have grown from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011, then 2.8 zettabytes in 2012, and was projected to reach 47 zettabytes in 2020 and 2,142 zettabytes in 2035. For example, in January 2013, Twitter generated seven terabytes of data daily and Facebook 10 terabytes. In 2014, Facebook Hive generated 4,000 TB of data per day.”
The Velocity or Speed of Data Generation
Velocity refers to the fact that digital data is produced in near real-time: only a few thousandths of a second pass between your Like on Facebook and the storage of that information on a server, whereas traditional databases were typically refreshed only weekly or monthly.
This high generation speed also implies a higher processing speed: new information must be used within a few seconds to trigger an individualized promotion on an e-commerce site, within a few hours to flag a risk of equipment failure, or within a few days to manage inventory. This need for rapid and continuously repeated data analysis leads to the use of artificial intelligence methods.
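To make this more concrete, here is a minimal Python sketch (not taken from any particular platform) of a near-real-time pipeline: hypothetical “Like” events arrive continuously, and each one is handled within a fraction of a second of being generated. The event fields and the handle_event logic are illustrative only.

```python
import time
from datetime import datetime, timezone

def event_stream():
    """Simulate a continuous stream of 'Like' events (hypothetical data)."""
    for user_id in range(5):
        yield {"user": user_id, "action": "like", "ts": datetime.now(timezone.utc)}
        time.sleep(0.2)  # new events arrive several times per second

def handle_event(event):
    """React almost immediately, e.g. update a counter or trigger a promotion."""
    latency = (datetime.now(timezone.utc) - event["ts"]).total_seconds()
    print(f"user {event['user']}: processed {latency * 1000:.1f} ms after creation")

for event in event_stream():
    handle_event(event)  # processing keeps pace with the incoming stream
```

In a real system, this loop would be replaced by a stream-processing framework, but the principle is the same: data is consumed and acted upon as it arrives rather than in weekly or monthly batches.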
The Variety of Data
The variety of data refers to heterogeneous sources and the nature of the data. We detail these different types of data in the next section.
Previously, databases and spreadsheets were the only data sources considered by most applications. But digital data now takes many forms: letters, photos, videos, surveillance feeds, PDFs, audio, etc. It is not easy to store, extract and analyze data that comes from such different sources, and this variety is one of the challenges of Big Data.
What Are The Types Of Big Data?
Big Data is divided into three types, each stored and used in different ways.
Structured Data
Structured data is the data that comes to mind most spontaneously. Quickly processed by machines, it encompasses information the organization already manages in databases and spreadsheets, stored in SQL databases, data lakes and data warehouses. In short, all data that has been predefined and formatted according to a specific structure is called “structured” data.
This includes, for example, data from financial systems, data you enter into forms, and also data from your smartwatch or computer logs. Structured data represents about 20% of Big Data.
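To illustrate, here is a minimal Python sketch of structured data: records that follow a predefined schema and live in a relational (SQL) table. The table name and columns are hypothetical.

```python
import sqlite3

# A predefined, fixed schema: every record has the same typed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, ts TEXT)"
)
conn.execute(
    "INSERT INTO transactions (customer, amount, ts) VALUES (?, ?, ?)",
    ("alice", 42.50, "2023-01-15T10:00:00"),
)

# Because the structure is known in advance, querying is straightforward.
for row in conn.execute("SELECT customer, amount FROM transactions"):
    print(row)  # ('alice', 42.5)
```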
Unstructured Data
Unstructured data is unorganized information with no predetermined format: it can be anything from reports, audio files, images, video files and text files to comments and opinions on social networks, emails, etc. It represents nearly 80% of Big Data.
Semi-Structured Data
Semi-structured data is an intermediary between structured and unstructured data. It has not been organized in a specialized repository such as a database, but it carries associated information (metadata) that makes it easier to process than raw data.
For example, your stored emails are semi-structured data: a free-text field (the body of the email) combined with standardized metadata (recipient’s address, sender’s address, sending time, etc.).
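A small illustration of that email example (the field names are invented for the purpose): the standardized metadata can be represented as key-value pairs, here in JSON, while the body remains free text.

```python
import json

# Semi-structured record: standardized metadata plus a free-text body.
email = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "sent_at": "2023-01-15T10:00:00Z",
    "subject": "Quarterly report",
    "body": "Hi Bob, please find the latest figures below...",  # unstructured part
}

print(json.dumps(email, indent=2))
```

The metadata keys can be indexed and queried like structured data, while the body still calls for text-analysis techniques.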
How Does Big Data Work?
Big Data technology answers an immense technological challenge: storing an enormous quantity of data from different sources on what amounts to one “large hard drive”, easily accessible from anywhere on the planet. The data is stored safely and can be retrieved at any time.
To achieve this, files are cut into several fragments called “chunks”. These fragments are then distributed across several computers, so there are several ways to reconstitute them. If a breakdown occurs, another machine takes over via a different path. In this way, the data remains constantly available.
Mass duplication of data is one of the critical factors in Big Data architecture. Cloud computing, hybrid supercomputers, and file systems are some of the primary storage models currently available.
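The sketch below illustrates this principle in Python, without modeling any specific system (such as HDFS): a file is cut into fixed-size chunks, and each chunk is copied to several nodes so that the failure of one machine does not make the data unavailable. The chunk size, node names and replication factor are arbitrary.

```python
CHUNK_SIZE = 8          # bytes per chunk here; real systems use tens of megabytes
REPLICAS = 2            # how many copies of each chunk are kept on different nodes
NODES = ["node-a", "node-b", "node-c"]

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the data into fixed-size fragments ('chunks')."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(chunks: list[bytes]) -> dict[int, list[str]]:
    """Assign each chunk to several nodes (round-robin) so one failure is survivable."""
    return {
        idx: [NODES[(idx + r) % len(NODES)] for r in range(REPLICAS)]
        for idx in range(len(chunks))
    }

data = b"some file content to store across the cluster"
chunks = split_into_chunks(data)
for idx, holders in place_replicas(chunks).items():
    print(f"chunk {idx} ({chunks[idx]!r}) -> {holders}")

# If 'node-a' fails, every chunk it held still exists on another node,
# so the file can be reconstituted by another path.
```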
Data Challenges
Companies have very different degrees of maturity when it comes to understanding the issues and the potential for exploiting their data, particularly unstructured data.
Ensuring the integrity of this data, through sound data management techniques and associated governance, is the first step in keeping it a reliable source. Only then can predictive analytics and artificial intelligence methods fully bear fruit and enable improved customer service, operational efficiency, and decision-making.