- Much of this information was gathered from an HP and Intel sponsored PDF http://www.infoworld.com/d/big-data
- At this time, this is meant for personal learning only, so don't assume all of this is 100% correct!
- Hadoop is a distributed, reliable processing and storage framework for very large data sets.
- The phrase “big data” applies to sets of data that are too costly or large to manage with traditional technologies and approaches. Big data is generally considered to be at least terabytes of data, usually unstructured or semistructured.
- Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and MapReduce.
- Hadoop has built-in resiliency to hardware failure, and it spreads its tasks across a cluster of servers called nodes.
Examples of how this works
- An acronym for a Big Data stack is SMAQ (Storage, MapReduce and Query).
- An example of MapReduce would be counting the occurrences of each unique word in a document. In the Map phase, each word is emitted as a key with a value of 1. In the Reduce phase, the values for each word are summed, giving the total count per word. This is obviously oversimplified, but it captures the basic idea of MapReduce and of analyzing Big Data in general.
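The word-count example above can be sketched in plain Python. This is not Hadoop code, just a minimal illustration of the model: the map step emits (word, 1) pairs, and the reduce step sums the values per key.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each key (word)."""
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)

doc = "the quick brown fox jumps over the lazy dog the end"
counts = reduce_phase(map_phase(doc))
print(counts["the"])  # 3
```

In real Hadoop, the map and reduce functions run on different nodes, and the framework handles grouping the pairs by key between the two phases.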
- For most MapReduce-based systems, data must first be extracted, transformed, and loaded (ETL). This means reading data from one or more files, transforming it into a structure that MapReduce can process, and loading it into the storage layer.
- MapReduce then processes this data and returns the results to storage, from which the information can be extracted in a usable form.
- The interesting part is that a previous data set, or the results derived from it, can be added to another data set and run through MapReduce again and again; the refining process is essentially limitless.
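The ETL steps described above can be sketched as a tiny Python pipeline. The field names and the in-memory "storage layer" here are made up for illustration; a real system would read from actual files and load into HDFS.

```python
import csv
import io

# Stand-in for a raw source file (hypothetical data)
raw = "user,action\nalice,click\nbob,click\nalice,view\n"

# Extract: read the raw text from the source
reader = csv.DictReader(io.StringIO(raw))

# Transform: normalize each row into a structured record
records = [{"user": row["user"], "action": row["action"].upper()}
           for row in reader]

# Load: write the structured records into the storage layer
# (a list here; HDFS in a real deployment)
storage = list(records)
print(len(storage))  # 3
```

Once loaded, the records are in a form that map and reduce functions can consume as key-value pairs.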
- HDFS is where data is stored, serving as the file system for data warehousing. Meanwhile, MapReduce can act as both the data warehouse analytic engine and the ETL engine.
- HDFS is a file system designed for efficiently storing and processing large files on one or more clusters of servers. Its design is centered on the philosophy that "write once, read many times" is the most efficient computing approach. One of the biggest benefits of HDFS is its ability to tolerate faults without losing data.
- Blocks are the unit of space used by a physical disk and a file system. The HDFS block is 64MB by default but can be changed. HDFS files are broken into and stored as block-size units. However, unlike a file system for a single disk, a file in HDFS that is smaller than a single HDFS block does not take up a whole block.
- HDFS benefits:
- A file on HDFS can be larger than any single disk in the cluster.
- Blocks simplify the storage subsystem in terms of metadata management of individual files.
- Blocks make replication easier, providing fault tolerance and availability. Each block is replicated to other machines (typically three). If a block becomes unavailable due to corruption or machine failure, a copy can be read from another location in a way that is transparent to the client.
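The block behavior described above can be sketched with some simple arithmetic. The block size and replication factor below are the defaults mentioned in these notes; the function itself is just an illustration, not HDFS code.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size
REPLICATION = 3                # typical replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 150 MB file occupies two full 64 MB blocks plus one 22 MB block;
# the 22 MB tail does NOT consume a whole 64 MB of disk.
blocks = split_into_blocks(150 * 1024 * 1024)
print(len(blocks))                  # 3
print(blocks[-1] // (1024 * 1024))  # 22

# With replication, the cluster stores 3 copies of each block
total_copies = len(blocks) * REPLICATION
print(total_copies)                 # 9
```

This also shows why a file can be larger than any single disk: its blocks can be scattered across every machine in the cluster.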
Namenodes and Datanodes
- An HDFS cluster has a master-slave architecture with a single namenode (the master) and a number of datanodes (slaves). The namenode manages the filesystem namespace: the filesystem tree and the metadata for all files and directories. This information persists on the local disk as two files: the namespace image and the edit log. The namenode also keeps track of the datanodes on which all the blocks for a given file are located.
- Datanodes store and retrieve blocks when directed by clients or the namenode, and they report back to the namenode with lists of the blocks they are storing. The namenode represents a single point of failure, and therefore, to achieve resiliency, the system manages a backup strategy in which the namenode writes its persistent state to multiple filesystems.
- MapReduce is a programming model for parallel processing. It works by breaking the processing into two phases:
- Map phase: input is divided into smaller subproblems and processed.
- Reduce phase: the answers from the map phase are collected and combined to form the output of the original problem. The phases have corresponding map and reduce functions defined by the developer. Each function takes key-value pairs as input and produces key-value pairs as output.
- A MapReduce job is a unit of work consisting of the input data, the MapReduce program, and configuration details. Hadoop runs a job by dividing it into two types of tasks: map tasks and reduce tasks. The map task invokes a user-defined map function that processes the input key-value pair into a different key-value pair as output. When processing demands grow more complex, the best strategy is usually to add more MapReduce jobs rather than to write more complex map and reduce functions.
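The idea of chaining jobs instead of writing more complex functions can be sketched in Python. The `run_job` helper below is a made-up, single-process stand-in for the Hadoop framework: it maps every input pair, groups the outputs by key, and reduces each group. Job 1 counts words; job 2 takes job 1's output and finds the most frequent word.

```python
from collections import defaultdict

def run_job(pairs, map_fn, reduce_fn):
    """Minimal MapReduce runner: map every pair, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [(k, reduce_fn(k, vs)) for k, vs in groups.items()]

# Job 1: word count
def count_map(_, line):
    for word in line.split():
        yield word, 1

def count_reduce(word, ones):
    return sum(ones)

# Job 2: feed job 1's output into a second job to find the top word
def top_map(word, count):
    yield "top", (count, word)

def top_reduce(_, pairs):
    return max(pairs)

lines = [(0, "a b a"), (1, "b a")]
counts = run_job(lines, count_map, count_reduce)  # word counts
top = run_job(counts, top_map, top_reduce)        # most frequent word
print(top)  # [('top', (3, 'a'))]
```

Each job stays trivially simple; the complexity lives in the pipeline, which is exactly the strategy the note above recommends.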