Big Data – Hadoop
Hadoop is an open source software framework that supports data-intensive distributed applications available through the Apache Open Source community. It consists of a distributed file system HDFS, the Hadoop Distributed File System and an approach to distributed processing of analysis called MapReduce. It is written in Java and based on the Linux/Unix platform.
The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. It provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework. It enables applications to work with thousands of computation-independent computers and petabytes of data. The entire Apache Hadoop platform is commonly considered to consist of the Hadoop kernel, MapReduce and Hadoop Distributed File System (HDFS), and number of related projects including Apache Hive, Apache HBase, Apache Pig, Zookeeper etc.
The real magic of Hadoop is its ability to move the processing or computing logic to the data where it resides as opposed to traditional systems, which focus on a scaled-up single server, move the data to that central processing unit and process the data there.
This model does not work on the volume, velocity, and variety of data that present day industry is looking to mine for business intelligence. Hence, Hadoop with its powerful fault tolerant and reliable file system and highly optimized distributed computing model, is one of the leaders in the Big Data world.
Hadoop is it’s storage system and it’s distributed computing model
Hadoop Distributed File System is a program level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.
MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: dividing the data across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set called Reduce.
This is also called the Head Node/Master Node of the cluster.
This is an optional node that you can have in your cluster to back up the NameNode if it goes down. If a secondary NameNode is configured, it keeps a periodic snapshot of the NameNode configuration to serve as a backup when needed.
These are the systems across the cluster which store the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault tolerant and high availability solutions.
This is a service running on the NameNode, which manages MapReduce jobs and distributes individual tasks.
This is a service running on the DataNodes, which instantiates and monitors individual Map and Reduce tasks that are submitted.
Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce, which allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) to fetch data from Hive store.
Pig is an alternative abstraction on MapReduce, which uses dataflow scripting language called PigLatin. This is favored by programmers who already have scripting skills.
Flume is another open source implementation on top of Hadoop, which provides a data-ingestion mechanism for data into HDFS as data is generated.
Sqoop provides a way to import and export data to and from relational database tables (for example, SQL Server) and HDFS.
Oozie allows creation of workflow of MapReduce jobs. This is familiar with developers who have worked on Workflow and communication foundation based solutions.
HBase is Hadoop database, a NoSQL database. It is another abstraction on top of Hadoop, which provides a near real-time query mechanisms to HDFS data.
HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column-orientation gives extreme flexibility in storing data and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of data, large indices, and maintain the flexibility to scale out quickly. HBase is now being used by many other workloads internally at Facebook and many other company using this amazing big data technolog.
Mahout is a machine-learning library that contains algorithms for clustering and classification.
Good to know Apache Cassandra Terms:
CQL – Cassandra Query Language
RP – Random Partitioner
OPP – Order Preserving Partitioner
BOP – Byte Ordered Partitioner
RF – Replication Factor
CF – Column Family
JSON – Java Script Object Notation
BSON – Binary JSON
TTL – Time To Live
HDFS – Hadoop Distributed File System
CFS – Cassandra File System
UUID – Universal Unique IDentifier
DSE – Datastax Enterprise
AMI – Amazon Machine Image
OOM – Out Of Memory
SSTables – Sorted String Table
SEDA – Staged Event-Driven Architecture
CRUD – Create Read Update Delete
Big Data Architecture :
Big Data architecture is premised on a skill set for developing reliable, scalable, completely automated data pipelines. That skill set requires profound knowledge of every layer in the stack, beginning with cluster design and spanning everything from Hadoop tuning to setting up the top chain responsible for processing the data.
The main detail here is that data pipelines take raw data and convert it into insight (or value). Along the way, the Big Data engineer has to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally, what tools to use to process the data, and eventually the manner of providing access to the outside world. The latter could be BI or other analytic tools, the former (for the processing) are likely tools such as Impala or Apache Spark. The people who design and/or implement such architecture I refer to as Big Data engineers.
Facebook is starting to use Apache Hadoop technologies to serve realtime workloads. why facebook decided to use Hadoop technologies, the workloads that facebook have on realtime Hadoop, the enhancements that facebook did to Hadoop for supporting our workloads and the processes and methodologies facebook have adopted to deploy these workloads successfully.
Let me know if you have any further question and your comments will be learning point. Back to top
Microsoft Certified Solutions Associate (MCSA)