Distributed Storage Clusters and Hadoop
- What is data?
- What is big data?
- The challenge?
- The solution (Hadoop)?
1. What is data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
In simple words: whatever we have and whatever we do, everything is data.
2. What is big data?
We keep hearing statistics about the growth of data. For instance:
⦁ The New York Stock Exchange generates about one terabyte of new trade data per day.
⦁ Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, comments, etc.
⦁ A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
⦁ Data volume in the enterprise was projected to grow 50x year-over-year through 2020.
⦁ The volume of business data worldwide, across all companies, doubles every 1.2 years.
⦁ Back in 2010, Eric Schmidt famously stated that every 2 days, we create as much information as we did from the dawn of civilization up until 2003.
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
3. The challenge?
The question that now comes to mind is: how can you use this data to your advantage?
But before that, the challenge we have is how to store this data in the first place, because it is huge.
The three Vs of big data
⦁ Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or readings from sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.
⦁ Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near real time and will require real-time evaluation and action.
⦁ Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data arrives in new unstructured and semi-structured types, such as text, audio, and video, which require additional preprocessing to derive meaning and support metadata.
So big data is the problem we have, and Hadoop can help us solve it.
4. What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop is a solution to Big Data problems such as storing, accessing, and processing data. It provides a distributed way to store your data: each data node stores data in blocks, and the size of these blocks can be configured by the user.
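As a rough illustration of the block idea (not Hadoop's actual API), splitting a byte stream into fixed-size blocks can be sketched in a few lines; the `split_into_blocks` helper is hypothetical, though HDFS's real default block size is 128 MB:

```python
# Hypothetical sketch: split data into fixed-size blocks, the way HDFS
# conceptually divides an incoming file. HDFS's real default block size
# is 128 MB; a tiny block size is used here so the example runs instantly.

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Return the data as a list of blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"x" * 300          # a 300-byte "file"
blocks = split_into_blocks(file_bytes, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44] - the last block is partial
```

Note that only the last block can be smaller than the configured block size, which is also how HDFS stores files whose length is not a multiple of the block size.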
The two obvious benefits of using Hadoop are that it provides storage for any kind of data from various sources, and that it provides a platform for efficient, low-latency analytics on that data. Hadoop is well known as a distributed, scalable, and fault-tolerant system. It can store petabytes with relatively low infrastructure investment, running on clusters of commodity servers, each with local storage and CPU and able to hold a few terabytes on its local disk.
Hadoop has two critical components:
A). Hadoop Distributed File System (HDFS):
The storage system for Hadoop is known as HDFS. HDFS breaks incoming data into blocks and distributes them among the different servers in the cluster. That way every server stores a fragment of the entire data set, and each fragment is replicated on more than one server to achieve fault tolerance.
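The replication idea above can be sketched as a toy placement policy. This is a minimal illustration, assuming made-up node names and a simple round-robin assignment; real HDFS uses a rack-aware placement policy, but the invariant is the same: every block lives on several distinct nodes (3 by default), so losing one server loses no data.

```python
# Hypothetical sketch of block placement with replication: each block is
# assigned to `replication` distinct nodes so that the failure of any one
# node leaves every block recoverable from its other replicas.
# Round-robin placement is illustrative, not HDFS's actual policy.

def place_blocks(num_blocks: int, nodes: list[str], replication: int = 3):
    """Assign each block index to `replication` distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_blocks(3, nodes))
# block 0 -> node1, node2, node3; block 1 -> node2, node3, node4; ...
```

If any single node in this sketch goes down, every block still has two surviving replicas, which is the fault-tolerance property the paragraph above describes.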
B). Hadoop MapReduce:
MapReduce is a distributed data processing framework. While HDFS distributes a dataset across different servers, Hadoop MapReduce is the connecting framework responsible for distributing the work and aggregating the results of the data processing.
Apache Hadoop provides a solution to the problems caused by large volumes of complex data. As data grows, additional servers can be added to store and analyse it at low cost, and MapReduce harnesses the combined processing power of the servers in the cluster.
These two components define Hadoop, which gained importance for data storage and analysis over legacy systems because of its distributed processing framework.