Tuesday, November 1, 2016

Hadoop in a nutshell

The Hadoop framework creates a lot of buzz in the data science field nowadays. What is it? 

Hadoop is a computational platform built to solve big data problems, where the data can be both structured and unstructured. The main idea is to bring computation to the data instead of bringing data to the computation. Its file system (Hadoop Distributed File System - HDFS) breaks data into small chunks, then saves and replicates them across clustered data nodes on low-cost computers and disks for high data throughput and large-scale computation. Such tasks were once possible only on expensive supercomputers.
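The splitting and replication idea can be sketched in a few lines of plain Python. This is a toy illustration of the concept, not HDFS itself; the block size and node names are made up (real HDFS defaults to 128 MB blocks and a replication factor of 3).

```python
# Toy sketch of the HDFS idea: split a byte stream into fixed-size
# blocks and assign each block to several data nodes for redundancy.
# Block size and node names here are illustrative, not real HDFS values.

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Break a byte stream into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 1000, block_size=256)  # four chunks
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

If any single node dies, every block it held still exists on two other nodes, which is what lets Hadoop run reliably on cheap commodity hardware.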


What are the key components in Hadoop?

HBase - Hadoop's non-relational database. Data is stored as key-value pairs at large scale. 
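A conceptual sketch of that storage model, in plain Python rather than the real HBase client API: cells are addressed by a row key and a column, rows are kept in sorted key order, and that ordering is what makes prefix scans cheap. The row keys and column names below are made up for illustration.

```python
# Conceptual sketch of HBase-style key-value storage (NOT the real
# HBase API): values are addressed by (row key, column), and rows are
# scanned in sorted key order, so prefix scans are efficient.

table = {}  # {row_key: {column: value}}

def put(row: str, column: str, value: str) -> None:
    table.setdefault(row, {})[column] = value

def get(row: str, column: str):
    return table.get(row, {}).get(column)

def scan(prefix: str):
    """Yield rows whose key starts with `prefix`, in sorted key order."""
    for key in sorted(table):
        if key.startswith(prefix):
            yield key, table[key]

put("user#1001", "info:name", "Ada")
put("user#1002", "info:name", "Grace")
put("order#5001", "data:total", "42.50")
matched = [key for key, _ in scan("user#")]
```

Because everything hangs off the row key, designing good row keys is the main schema decision in this kind of store.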


Sqoop - a data transfer tool that moves data between relational databases and Hadoop. 


Pig / Hive - Pig is a high-level data flow language that runs on top of MapReduce. Hive uses SQL-like syntax for data summarization and ad-hoc querying. 


MapReduce - the execution engine for mapping and reducing data and returning results. One of the drawbacks of MapReduce is that it reads data from disk between stages. MapReduce can be interfaced with native Java APIs or REST APIs. 
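The map/reduce pattern itself is easy to show with the classic word count, sketched here in plain Python: a map step emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce step sums each group. In real MapReduce these three phases run in parallel across the cluster.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs,
# shuffle groups them by key, reduce sums each group. A real cluster
# runs these phases distributed; the logic is the same.

from collections import defaultdict

def map_phase(line: str):
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

The same three-phase shape applies whether you count words or aggregate terabytes of logs; only the mapper and reducer bodies change.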


Spark - an enhanced MapReduce-style engine that utilizes in-memory technology to cache data. It has a wide range of applications for ETL, machine learning, and data streaming. Spark can be interfaced with Java, Scala, Python, and R (via SparkR). 
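To see why in-memory caching matters, here is a hypothetical pure-Python sketch (not the PySpark API): once a dataset has been computed, repeated use is served from memory instead of triggering another expensive "disk read", which is the behavior Spark's `cache()`/`persist()` gives you on real RDDs and DataFrames.

```python
# Toy illustration of in-memory caching, the idea behind Spark's
# cache()/persist(). This is a made-up sketch, NOT the PySpark API.

class CachedDataset:
    def __init__(self, load_fn):
        self._load_fn = load_fn  # stands in for an expensive disk read
        self._cache = None
        self.loads = 0           # counts how often we hit "disk"

    def collect(self):
        if self._cache is None:
            self.loads += 1
            self._cache = self._load_fn()
        return self._cache

ds = CachedDataset(lambda: [x * x for x in range(5)])
first = ds.collect()   # triggers the one "disk read"
second = ds.collect()  # served from the in-memory copy
```

Iterative workloads such as machine-learning training loops touch the same data many times, which is exactly where this pays off over disk-bound MapReduce.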


Cloudera - a software company that provides a broad array of Hadoop-based big data tools. Its single-node VM can give you a jump start for testing, demoing, or learning the Hadoop framework and the tools mentioned above. 

Friday, October 28, 2016

Data Lake, Data Warehouse and Data Mart

Getting dizzy with these big terms? Let's take a closer look at them and hopefully clear up some of the confusion. 

Data Lake is the newest term among the three. It is a storage architecture for BIG data. All kinds of raw data are stored in it as blobs or objects with unique keys. Data modeling, cleansing, and transformation happen only when a need arises, and are applied only to the relevant subset of data objects. In technical terms, this modeling method is called schema-on-read. A data lake serves a broad range of users, who can sample and dive into the lake for their specific needs whenever they see fit. 
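A minimal sketch of schema-on-read, with made-up keys and field names: raw payloads sit in the lake untouched, and a schema is imposed only at read time, only on the objects a given analysis needs.

```python
# Schema-on-read sketch: raw objects are stored as-is under unique
# keys; parsing and typing happen at read time, per use case.
# All keys and field names here are illustrative.

import json

lake = {
    "events/2016/10/e1.json": '{"user": "ada", "amount": "19.99"}',
    "events/2016/10/e2.json": '{"user": "grace", "amount": "5.00"}',
    "logs/2016/10/app.log": "GET /index.html 200",  # different shape, stored as-is
}

def read_with_schema(key: str) -> dict:
    """Apply a schema to one raw object on demand (schema-on-read)."""
    raw = json.loads(lake[key])
    return {"user": raw["user"], "amount": float(raw["amount"])}

event = read_with_schema("events/2016/10/e1.json")
```

Note that the log file can sit alongside the JSON events with no up-front modeling; it only needs a parser if and when someone wants to analyze it.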

Data warehouse has been around for decades. It is almost the opposite of a data lake in terms of how data is stored. A laborious data modeling and ETL process must happen before data is loaded in. The data modeling is tailored to answer particular questions and target specific audiences. Because of this up-front investment, the data is usually well formatted and ready for querying, slicing, and dicing. This modeling technique is called schema-on-write. I was involved in a well-funded enterprise data warehouse initiative as a data modeler. The magnitude of effort was impressive, and documentation played a huge role in the process. 
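The contrast with schema-on-read can be made concrete with a small ETL sketch: every source record is validated and transformed to a fixed schema *before* it is loaded, so everything in the target table is already clean and queryable. The field names and cleaning rules below are illustrative, not from any particular warehouse.

```python
# Schema-on-write sketch: rows are cleaned and conformed to a fixed
# schema before loading, so the warehouse table holds only ready-to-
# query rows. Field names and rules here are made up for illustration.

def transform(raw: dict) -> dict:
    """Clean one source record to the target schema; reject bad rows."""
    if not raw.get("user"):
        raise ValueError("missing user")
    return {
        "user": raw["user"].strip().lower(),
        "amount": round(float(raw["amount"]), 2),
    }

warehouse_table = []  # the pre-modeled target table

for raw in [{"user": " Ada ", "amount": "19.99"},
            {"user": "GRACE", "amount": "5"}]:
    warehouse_table.append(transform(raw))  # transform first, then load
```

The cost is paid once, during ETL; every downstream query then benefits from consistent casing, types, and rounding.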

Data Mart is a small version of a data warehouse. It is smaller in size and more agile to implement, and the targeted audience is consequently smaller as well. Data in a data mart is also pre-transformed, cleansed, and well structured. Compared to a data warehouse, a data mart is a better fit for small to medium businesses without a big IT budget. With a few capable hands, a business can benefit by answering critical, specific questions at a much faster pace. The downside is that data marts are often disconnected, lacking the keys needed to link them together into a holistic view of your organization's data.