Data Lake is the newest term among the three. It is a storage architecture for big data. All kinds of raw data are stored in it as blobs or objects with unique keys. Data modeling, cleansing and transformation steps are taken only when the need arises, and are applied only to the subset of relevant data objects. In technical terms, this approach is called schema-on-read. A Data Lake serves a broad range of users, who can sample and dive into the lake for their specific needs whenever they see fit.
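To make schema-on-read concrete, here is a minimal sketch in Python. It assumes a toy in-memory dictionary standing in for the lake; the names `lake`, `put_object` and `read_orders`, the key layout and the order fields are all made up for illustration and do not come from any real data lake product.

```python
import json

# Hypothetical in-memory "lake": raw bytes stored under unique keys.
# No schema is enforced when data goes in.
lake = {}

def put_object(key, raw_bytes):
    """Store raw data as-is; nothing is validated or transformed on write."""
    lake[key] = raw_bytes

def read_orders(prefix):
    """Apply a schema at read time, only to objects under the given key prefix."""
    rows = []
    for key, raw in lake.items():
        if not key.startswith(prefix):
            continue  # irrelevant objects are never parsed
        record = json.loads(raw)
        rows.append({
            "order_id": int(record["id"]),                # cast on read
            "amount": float(record.get("amount", 0.0)),   # default on read
        })
    return rows

# Anything can be written, including data that is not JSON at all.
put_object("orders/1.json", b'{"id": "1", "amount": "19.90", "note": "gift"}')
put_object("orders/2.json", b'{"id": 2}')
put_object("logs/app.txt", b"not json at all")

orders = read_orders("orders/")
print(orders)
```

The log file sits in the lake untouched and never causes an error, because modeling only happens for the subset of objects that a particular reader asks for.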
The data warehouse has been around for decades. It is almost the opposite of a Data Lake in terms of how data is stored. A laborious data modeling and ETL process has to happen before data is loaded. The data model is tailored to answer particular questions and to target specific audiences. Because of the effort invested up front, data is usually well formatted and ready for querying, slicing and dicing. This approach is called schema-on-write. I was involved in a well-funded enterprise data warehouse initiative as a data modeler. The magnitude of the effort was impressive, and documentation played a huge role in the process.
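For contrast, schema-on-write can be sketched with Python's built-in `sqlite3` module standing in for a warehouse. The table `fact_sales`, its columns and the `transform` helper are hypothetical; a real warehouse and ETL pipeline would be far larger, but the order of operations is the same: model first, transform, then load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The model is defined up front; rows that do not fit are rejected on write.
conn.execute(
    """
    CREATE TABLE fact_sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

def transform(raw):
    """Cleanse and cast a raw record into the shape the table expects."""
    return (int(raw["id"]), raw["region"].strip().upper(), float(raw["amount"]))

raw_rows = [
    {"id": "1", "region": " east ", "amount": "19.90"},
    {"id": "2", "region": "west", "amount": "5"},
    {"id": "3", "region": "west", "amount": "-4"},  # violates the CHECK constraint
]

rejected = []
for raw in raw_rows:
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", transform(raw))
    except sqlite3.IntegrityError:
        rejected.append(raw["id"])

# Once loaded, the data is ready for slicing and dicing without further cleanup.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
).fetchall()
print(totals, rejected)
```

The cost is paid at load time, which is why the up-front modeling effort is so large, but every later query can trust the shape of the data.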
A Data Mart is a small version of a data warehouse. It is smaller in size and more agile to implement, and its target audience is correspondingly smaller. Data in a data mart is also pre-transformed, cleansed and well structured. Compared to a data warehouse, a data mart is a better fit for a small to medium business without a big IT budget. With a few capable hands, the business can answer some critical, specific questions at a much faster pace. The downside is that data marts are often disconnected, lacking the shared keys needed to link them together into a holistic view of your organization's data.
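The missing-keys problem can be illustrated with a small Python sketch. The two marts, their column names and the `customer_keys` mapping are all invented for the example; the point is only that two marts built independently may have no column in common until someone supplies a shared key.

```python
# Hypothetical sales mart, keyed by its own customer code.
sales_mart = [
    {"cust_code": "C-001", "revenue": 120.0},
    {"cust_code": "C-002", "revenue": 80.0},
]

# Hypothetical support mart, built separately and keyed by email.
support_mart = [
    {"email": "ann@example.com", "tickets": 3},
    {"email": "bob@example.com", "tickets": 1},
]

def shared_columns(left, right):
    """Return the column names two marts have in common (candidate join keys)."""
    return set(left[0]) & set(right[0])

# Hypothetical mapping table that supplies the missing link between the marts.
customer_keys = {"C-001": "ann@example.com", "C-002": "bob@example.com"}

def combined_view(sales, support, key_map):
    """Join the two marts through the mapping table."""
    tickets_by_email = {row["email"]: row["tickets"] for row in support}
    return [
        {
            "cust_code": row["cust_code"],
            "revenue": row["revenue"],
            "tickets": tickets_by_email.get(key_map.get(row["cust_code"])),
        }
        for row in sales
    ]

print(shared_columns(sales_mart, support_mart))  # no common key to join on
view = combined_view(sales_mart, support_mart, customer_keys)
print(view)
```

Without something like the mapping table, each mart answers its own questions well but the two cannot be combined into one picture of the customer.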
Great article to read.