The Hadoop framework generates a lot of buzz in the data science field these days. What is it?
Hadoop is a computational platform built to tackle big data problems, with data that may be structured or unstructured. The main idea is to bring computation to the data rather than bringing data to the computation. Its file system (the Hadoop Distributed File System, HDFS) breaks data into small chunks, then stores and replicates them across clustered data nodes built from low-cost computers and disks, giving high data throughput and room for extensive computation. Such tasks were once possible only on expensive supercomputers.
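The block-splitting and replication idea can be sketched in a few lines of plain Python. This is a toy simulation, not real HDFS code: the block size, node names, and round-robin placement below are illustrative assumptions (real HDFS uses 128 MB blocks and rack-aware placement).

```python
BLOCK_SIZE = 8          # toy value; real HDFS defaults to 128 MB blocks
REPLICATION = 3         # HDFS's default replication factor
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break raw bytes into fixed-size chunks, like HDFS blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATA_NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.

    Round-robin here for simplicity; real HDFS placement is rack-aware.
    """
    placement = {}
    for b, block in enumerate(blocks):
        chosen = [nodes[(b + r) % len(nodes)] for r in range(replication)]
        placement[b] = {"data": block, "nodes": chosen}
    return placement

blocks = split_into_blocks(b"bring computation to the data")
layout = place_replicas(blocks)
for idx, info in layout.items():
    print(idx, info["nodes"])
```

Because every block lives on several nodes, losing one cheap machine loses no data, and computation can be scheduled on whichever node already holds a local copy.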
What are the key components in Hadoop?
HBase - Hadoop's non-relational database. Data is stored at large scale as key-value pairs.
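To make the key-value model concrete: an HBase cell is addressed by a row key and a column (family:qualifier) and versioned by timestamp. The class below is a plain-Python sketch of that data model, not the real HBase client API; the names `ToyHTable`, `put`, and `get` only mimic the shape of the real thing.

```python
import time

class ToyHTable:
    """Toy sketch of HBase's data model: row key -> column -> versions."""

    def __init__(self):
        self.rows = {}  # row key -> {column: [(timestamp, value), ...]}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    def get(self, row, column):
        """Return the newest version of a cell, like a default HBase Get."""
        versions = self.rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

t = ToyHTable()
t.put("user42", "info:name", "Ada", ts=1)
t.put("user42", "info:name", "Ada L.", ts=2)
print(t.get("user42", "info:name"))  # newest version wins
```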
Sqoop - a data transfer tool that moves bulk data between relational databases and Hadoop.
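Conceptually, a Sqoop import reads a relational table in parallel and writes it out as delimited "part" files in HDFS. The snippet below is a toy analogue of that flow using stdlib sqlite3 as the stand-in source database; the real Sqoop is a command-line tool, and the `toy_import` function and table here are invented for illustration.

```python
import sqlite3

# Stand-in source database with a small relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "disk", 4), (2, "cpu", 2), (3, "ram", 8), (4, "nic", 1)])

def toy_import(conn, table, num_mappers=2):
    """Split a table's rows across part files, the way Sqoop's parallel
    mappers produce part-m-00000, part-m-00001, ... in HDFS."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()
    parts = [[] for _ in range(num_mappers)]
    for i, row in enumerate(rows):
        parts[i % num_mappers].append(",".join(str(v) for v in row))
    return {f"part-m-{m:05d}": parts[m] for m in range(num_mappers)}

files = toy_import(conn, "orders")
for name, lines in files.items():
    print(name, lines)
```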
Pig / Hive - Pig is a high-level data flow language that runs on top of MapReduce. Hive uses SQL-like syntax (HiveQL) for data summarization and ad-hoc querying.
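To show what "SQL-like syntax for ad-hoc querying" looks like in practice, the query below is plain SQL run on stdlib sqlite3, which stands in for a Hive table here. The syntax is close to the equivalent HiveQL; on a real cluster, Hive would compile such a query into MapReduce jobs over data in HDFS. The `page_views` table and its data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("alice", "/home"), ("alice", "/docs"), ("bob", "/home"), ("bob", "/home"),
])

# Ad-hoc summarization, HiveQL-style: count views per URL.
summary = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
print(summary)  # [('/home', 3), ('/docs', 1)]
```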
MapReduce - the execution engine that maps data, reduces it, and returns results. One drawback of MapReduce is that it reads from and writes to disk between processing steps. MapReduce can be programmed through its native Java API or driven through REST APIs.
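The map/reduce pattern itself fits in a few lines. Below is the classic word-count example in pure Python: a map step emitting (key, value) pairs, a shuffle that sorts and groups by key, and a reduce step that sums each group. On a real cluster these phases run as separate processes on the data nodes; here they run in one process purely to illustrate the data flow.

```python
from itertools import groupby
from operator import itemgetter

def map_step(line):
    """Emit (word, 1) for every word in a line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Sort and group mapper output by key, as the framework does
    between the map and reduce phases."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_step(word, counts):
    """Sum all the 1s for one word."""
    yield (word, sum(c for _, c in counts))

lines = ["to be or not to be"]
pairs = [p for line in lines for p in map_step(line)]
result = dict(kv for word, group in shuffle(pairs)
              for kv in reduce_step(word, group))
print(result)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The disk-bound drawback shows up between those phases: a real MapReduce job writes the shuffled intermediate pairs to disk before reducing, which is exactly the cost Spark's in-memory approach avoids.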
Spark - an enhanced MapReduce-style engine that uses in-memory technology to cache data. It has a wide range of applications in ETL, machine learning, and data streaming. Spark can be programmed in Scala, Java, Python, and R.
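The payoff of in-memory caching can be sketched with a toy class. This is not PySpark: `ToyRDD`, `parallelize`, `cache`, and `collect` only loosely mimic the shape of Spark's RDD API. The point is the behavior: transformations are lazy, an action triggers computation, and a cached dataset is computed once and then served from memory.

```python
class ToyRDD:
    """Toy lazy dataset: chained transformations, optional in-memory cache."""

    def __init__(self, compute):
        self._compute = compute      # thunk that produces the data
        self._cache = None
        self._cached = False
        self.computations = 0        # visible counter, for illustration only

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, f):                # lazy: nothing runs yet
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):          # lazy as well
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._cached = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache       # served from memory, no recomputation
        self.computations += 1
        data = self._compute()
        if self._cached:
            self._cache = data
        return data

    def collect(self):               # action: triggers the whole chain
        return self._materialize()

rdd = ToyRDD.parallelize(range(5)).map(lambda x: x * x).cache()
first = rdd.collect()
second = rdd.collect()               # cache hit: the chain runs only once
print(first, rdd.computations)       # [0, 1, 4, 9, 16] 1
```

A plain MapReduce pipeline would re-read its input from disk for every pass over the data; caching the intermediate result in memory is what makes Spark fast for iterative workloads like machine learning.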
Cloudera - a software company that provides a broad array of Hadoop-based big data tools. Its single-node VM can give you a jump start on testing, demoing, or learning the Hadoop framework and the tools mentioned above.