Big Love: EDW and Big Data

For decades, the Enterprise Data Warehouse (EDW) has powered much of enterprise analytics. Its ability to bring in data from various sources and transform it into the models needed by every layer of the company became the main justification for making the EDW an imperative element of business analytics. When Big Data was introduced as a new analytics platform, the totally different characteristics of the two technologies made people think EDW and Big Data could not stand side by side … or could they?

Despite differing at almost every step of how they perform analytics, EDW and Big Data can actually exist side by side and complement each other. There could be big love in the air between these two platforms, as long as it is done correctly.

EDW Fiasco

The EDW is designed to be rigid. It brings in data and forces it into the agreed models required by the business owners. Once the models are agreed and rolled out, they are difficult to change. Not impossible, but a lot of effort will be spent, especially in the Transform phase of the ETL (Extract, Transform & Load) paradigm that drives the EDW. Rigidity is the first issue with the EDW.

As the business grows, data grows too, and this affects the time needed to complete the EDW's ETL process. Unfortunately this increase is rarely linear; it tends to grow exponentially. If one hour was enough to complete the daily ETL process in the first year of an EDW implementation, then after ten years of data the number is no longer ten hours but could be thirty. This creates a rat race between data and analytics. Ironically, the company does not really need those ten years of data for its day-to-day activity. It probably needs only the last two years, but it has no option except to keep storing and processing all ten, whether because of data-retention rules or as an archive in case historical analysis is ever needed. Time to complete is another EDW problem.

Big Data

Rigidity and time to complete have never been issues for Big Data. The distributed file systems and NoSQL stores that Big Data embraces solve the rigidity problem: Big Data allows us to save data in any format without needing to define a schema first.
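As a minimal sketch of this schema-on-read idea, the PySpark fragment below lands raw JSON as-is and lets Spark infer the structure only at query time; the paths and field names are illustrative assumptions, not part of any real pipeline:

    # Sketch: schema-on-read with PySpark (paths and field names are hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

    # Land raw events as-is; HDFS just stores the bytes, no table DDL needed first.
    raw = spark.read.json("hdfs:///landing/events/")

    # The structure is inferred at read time, so it can evolve with the data.
    raw.printSchema()

    # The data is queryable immediately, even though no model was agreed up front.
    raw.groupBy("event_type").count().show()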

Big Data also lets us run logic as soon as data streams into the Big Data repository. This logic can be anything: sending an alert, performing a computation, building a model incrementally, or anything else we can think of. Creativity is the only limit.
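As a hedged illustration of acting on data as it arrives, the sketch below uses Spark Structured Streaming to watch a landing directory and flag records the moment they land; the schema, threshold, and paths are all assumptions made for the example:

    # Sketch: run logic on records as they stream in (all names are illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("stream-alert-demo").getOrCreate()

    schema = StructType([StructField("sensor_id", StringType()),
                         StructField("reading", DoubleType())])

    # Pick up new files as soon as they land in the repository.
    events = spark.readStream.schema(schema).json("hdfs:///landing/sensor-events/")

    # Example logic: flag readings above a threshold the moment they arrive.
    alerts = events.where(F.col("reading") > 100.0)

    # Write alerts out continuously; a real pipeline might notify a downstream system.
    query = (alerts.writeStream
             .format("parquet")
             .option("path", "hdfs:///alerts/")
             .option("checkpointLocation", "hdfs:///checkpoints/alerts/")
             .start())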

Working with a Big Data platform also lets us leverage the power of data held in memory. Kept in memory, data is processed far faster than when it lives only on physical storage.
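A tiny PySpark sketch of that point: cache a working set once so repeated passes skip the disk entirely (the path and groupings are hypothetical):

    # Sketch: pin a working set in memory so repeated passes avoid disk I/O.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    sales = spark.read.parquet("hdfs:///warehouse/sales/")  # hypothetical path
    sales.cache()  # materialised in executor memory on the first action

    # Both passes below reuse the in-memory copy instead of re-reading the files.
    sales.groupBy("region").count().show()
    sales.groupBy("product").count().show()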

The ability to scale out, by simply adding more nodes of commodity hardware to increase either storage capacity or computation power, is Big Data's answer to time to complete.

Do we still need EDW?

But if Big Data is soooo greeeaaatttt, why do companies still need to preserve their EDW? Why not simply move to Big Data?

Well, rigidity is not always bad. In some cases, rigidity is required to define standards and guidelines. That is why much statutory reporting consumes data from EDW models. Thus, the EDW still needs to exist, especially in large companies where adopting new technology is not as easy as 1, 2, 3.

Moving every model to a Big Data platform at once in a big organisation will certainly involve a steep learning curve and significant effort.

The reasons above justify why the EDW still needs to exist, but its problems need to be resolved as well; if they are not, the EDW will not work as expected and will hurt the business process. This is where the EDW needs to love Hadoop.

EDW loves Hadoop

Before we discuss further, let's see how an EDW is commonly implemented in the diagram below.

[Diagram: a common EDW implementation]

Below are some scenarios for how an EDW can embrace Hadoop in its ecosystem to boost productivity.

Scenario 1 – PreProcess ETL

[Diagram: Scenario 1 – ETL pre-processing on Hadoop feeding the EDW]

In this scenario we move the ETL process from the EDW staging layer to Hadoop and push the output to the EDW, where it is later used by dashboards and reporting tools. This approach brings Hadoop's flexible data schemas and fast parallel processing, and also shifts work from high-cost data-warehouse capacity to lower-cost Hadoop clusters.
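As a rough sketch of this handoff, a Hadoop-side job might do the heavy transformation and then push only the conformed result into the warehouse over JDBC; the connection URL, table names, and transformations here are all hypothetical:

    # Sketch: transform on Hadoop, then load the conformed output into the EDW via JDBC.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("preprocess-etl-demo").getOrCreate()

    # Extract: raw files landed on HDFS in whatever shape the sources produced.
    raw = spark.read.json("hdfs:///landing/orders/")  # hypothetical path

    # Transform: the expensive cleansing and aggregation now run on cheap Hadoop nodes.
    conformed = (raw
                 .where(F.col("status") == "COMPLETED")
                 .groupBy("customer_id", "order_date")
                 .agg(F.sum("amount").alias("daily_total")))

    # Load: push only the finished model into the EDW for dashboards and reports.
    (conformed.write
     .format("jdbc")
     .option("url", "jdbc:oracle:thin:@edw-host:1521/EDW")  # hypothetical endpoint
     .option("dbtable", "MART.DAILY_CUSTOMER_SALES")
     .option("user", "etl_user")
     .option("password", "***")
     .mode("append")
     .save())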

Scenario 2 – Hot and Cold Storage

[Diagram: Scenario 2 – hot data in the EDW, cold data archived on Hadoop]

This scenario splits data into two types:

Hot data – data frequently used for analysis.

Cold data – data that does not need to be accessed on a daily basis but must be archived, either for occasional analysis or to meet government data-retention policy.

In this scenario we offload the large volume of historical data into cold storage on Hadoop, keeping the data warehouse for hot data only. Whenever data from cold storage is needed, it can either be moved back into the EDW or queried directly and combined with EDW results.
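A minimal sketch of this offload-and-recall flow, assuming Parquet on HDFS as the cold tier and a JDBC-reachable EDW as the hot tier; every URL, table name, and date cutoff here is an illustrative assumption:

    # Sketch: archive old EDW data to Hadoop, then recall it on demand.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hot-cold-demo").getOrCreate()
    edw_url = "jdbc:oracle:thin:@edw-host:1521/EDW"  # hypothetical EDW endpoint

    # Offload: pull old transactions out of the EDW into cheap cold storage.
    cold = (spark.read.format("jdbc")
            .option("url", edw_url)
            .option("dbtable",
                    "(SELECT * FROM FACT_SALES WHERE sale_date < DATE '2014-01-01') old")
            .load())
    cold.write.mode("overwrite").parquet("hdfs:///archive/fact_sales/")

    # Recall: for a rare historical question, read the archive back and
    # combine it with the hot data still living in the EDW.
    archived = spark.read.parquet("hdfs:///archive/fact_sales/")
    hot = (spark.read.format("jdbc")
           .option("url", edw_url)
           .option("dbtable", "FACT_SALES")
           .load())
    full_history = archived.unionByName(hot)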

This approach gives the EDW room to process less data, and consequently it requires less time to complete and costs less.

*All images are taken from Microsoft Azure's Big Data Analytics with HDInsight material.