Monday, January 20, 2014

Big Data Analytics in Cloud - Patterns and Use-Cases

Introduction


Many Big Data Analytics solutions have been implemented by now, and a few macro patterns have emerged from them.  You can find them below, along with a short explanation of each.

I have also mentioned a few implementation possibilities next to each pattern.   Of course, there are other options besides my examples (I decided to refer mostly to IBM software).

1. Landing zone warehouse (HDFS -> ETL DW)


This pattern is composed of a landing zone (Big Data Warehouse), handled using InfoSphere BigInsights (based on Hadoop), which reads data from various sources and stores it on HDFS.  This can be done through batch ETL processes.  The data may be unstructured, semi-structured, or structured, and it generally needs to be processed and organized.

From there, the data can be loaded into a Big Data Report Mart via batch ETL.  The main advantage is that the data becomes structured and organized according to customer needs.  The mart could be built on Cognos, Netezza, DB2, etc.

The mart can then be queried with SQL or through the reporting features of the above tools.
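
To make the batch flow concrete, here is a minimal sketch of the landing-zone-to-mart step, expressed as HiveQL submitted from Python.  The table names, HDFS path, and schema are all hypothetical, and it assumes the hive CLI that ships with Hadoop/BigInsights installs is on the PATH.

    import subprocess

    # Hypothetical layout -- adjust the path and schema to your landing zone.
    hql = """
    -- Expose the raw files already sitting on HDFS as a Hive table.
    CREATE EXTERNAL TABLE IF NOT EXISTS landing_events (raw_line STRING)
    LOCATION '/data/landing/events';

    -- Batch ETL step: structure the raw lines and load them into the mart.
    CREATE TABLE IF NOT EXISTS report_mart_events (
        event_time  STRING,
        customer_id STRING,
        amount      DOUBLE
    );

    INSERT INTO TABLE report_mart_events
    SELECT split(raw_line, ',')[0],
           split(raw_line, ',')[1],
           CAST(split(raw_line, ',')[2] AS DOUBLE)
    FROM landing_events;
    """

    subprocess.run(["hive", "-e", hql], check=True)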


2. Streams dynamic warehouse (RT Filter -> HDFS/DW)


The data in this pattern is organized much like in the one above.  It too has a Landing Zone (Big Data Warehouse), handled with InfoSphere BigInsights for instance, and a Big Data Summary Mart (handled with IBM PureData for Analytics, for example).  Reports can be extracted from this Summary Mart using SQL.

The main difference is that the data stream coming into the cloud is first processed in real time using InfoSphere Streams. It is filtered and analysed, then stored in the Big Data Warehouse (landing zone). This data will be mostly structured.  Part of it is stored directly into the Big Data Summary Mart.

The upside is that while the data mart still contains structured information, the processing happens in real time as the data arrives, with the help of InfoSphere Streams.  The Data Warehouse will therefore hold real-time, structured data.
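
Real InfoSphere Streams applications are written in SPL; purely to illustrate the filter-and-route logic of this pattern, here is a minimal Python sketch in which the feed format, the threshold, and the two sinks are all hypothetical.

    import json

    def process_stream(feed, landing_zone, summary_mart):
        """Filter and route records in flight, before anything is stored."""
        for line in feed:
            record = json.loads(line)
            # Real-time filtering: drop the noise as the data arrives.
            if record.get("score", 0) < 0.5:
                continue
            # Everything that survives lands in the warehouse (landing zone)...
            landing_zone.append(record)
            # ...and a subset goes straight into the Big Data Summary Mart.
            if record.get("category") == "sales":
                summary_mart.append(record)

    landing_zone, summary_mart = [], []
    feed = ['{"score": 0.9, "category": "sales"}', '{"score": 0.1}']
    process_stream(feed, landing_zone, summary_mart)  # -> one record per sink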


3. Streams detail with update (combination 2 & 1)


This is a combination of the first two.  The data streams are processed in real time as they arrive, using InfoSphere Streams, and are then stored in both the Landing Zone and the Big Data Summary Mart, as in #2.  Additionally, there is a Detail Data Mart, loaded through ETL processes from the Landing Zone (Big Data Warehouse).  This allows additional processing that requires analysis over large data sets, for instance, or that makes use of data not available in real time.

Applications access both the Big Data Summary Mart, loaded with data processed in real time, and the Detail Data Mart, filled by the batch ETL processes.
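
On the consuming side, an application merges the fresh, stream-fed summary with the richer, batch-fed detail.  A minimal sketch, with hypothetical row shapes keyed by customer:

    def build_report(summary_mart, detail_mart):
        """Join real-time summary rows with batch-loaded detail by customer."""
        details = {row["customer_id"]: row for row in detail_mart}
        report = []
        for row in summary_mart:
            merged = dict(row)                                  # real-time figures...
            merged.update(details.get(row["customer_id"], {}))  # ...plus history
            report.append(merged)
        return report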


4. Direct augmentation (HDFS -> augment -> DW)


The incoming data is loaded into a Big Data Warehouse (Landing Zone) via batch ETL and consists mostly of unstructured data. It can then be accessed directly from HDFS through Hive (via a virtual database).

Additionally, there is an existing Data Mart containing transactional data acquired previously from other sources (internal ones, perhaps).

Applications make use of both the landing zone and the data mart simultaneously.
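
A minimal sketch of the direct access, again with hypothetical names: the raw files are exposed through a Hive external table and queried in place, with no ETL copy into a mart.

    import subprocess

    hql = """
    -- Virtual database over the landing zone: no data is moved or copied.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_tweets (body STRING)
    LOCATION '/data/landing/tweets';

    -- Query the unstructured data directly from HDFS through Hive.
    SELECT body FROM raw_tweets WHERE body LIKE '%product_x%';
    """

    subprocess.run(["hive", "-e", hql], check=True)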


5. Warehouse augmentation (HDFS -> augment analytics)


This is basically the same as #1, with the addition that the data acquired in the Big Data Warehouse is subsequently enhanced using an Analytics Engine.  It is then loaded into the Summary Data Mart using the same batch ETL processes.
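
A sketch of the augmentation step, with a deliberately naive word-list scorer standing in for the Analytics Engine (a real deployment would call out to a text-analytics or scoring service):

    POSITIVE = {"good", "great", "excellent"}
    NEGATIVE = {"bad", "poor", "terrible"}

    def augment(record):
        """Naive sentiment score standing in for the Analytics Engine."""
        words = set(record.get("text", "").lower().split())
        record["sentiment"] = len(words & POSITIVE) - len(words & NEGATIVE)
        return record

    def etl_to_summary_mart(landing_zone):
        # Same batch ETL flow as in #1, but each record is enhanced first.
        return [augment(dict(record)) for record in landing_zone]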


6. Streams augmentation (augment & filter RT -> HDFS/DW)


This pattern also makes use of the Analytics Engine, which enhances/augments the data coming into InfoSphere Streams (processed in real time).  From there, the filtered data is saved to the Big Data Warehouse, handled by InfoSphere BigInsights.   Batch ETL processes read and process this data and load it into the Data Summary Mart, where it can then be accessed through SQL queries, for instance.
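
Compared with #2, the augmentation now happens inside the stream, before the filter.  A compact sketch, reusing the kind of scorer from #5 (all names hypothetical):

    def augment(record):
        # Stand-in for the Analytics Engine (see the scorer sketched in #5).
        record["sentiment"] = 1 if "good" in record.get("text", "") else -1
        return record

    def process_stream(feed, warehouse):
        for record in feed:
            record = augment(record)       # enrich in flight, in real time
            if record["sentiment"] <= 0:   # then filter on the enriched field
                continue
            warehouse.append(record)       # batch ETL later moves this to the mart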


7. Dynamic cube (HDFS -> search indexes)


This is the most complex pattern of all :).  The data arrives in the Big Data Warehouse, handled by InfoSphere BigInsights for example.

It is subsequently processed by an index crawler, indexed using a big data index, and then accessed through a Virtual Data Mart.

They say that through this pattern you're building your own Google (search engine) ;)
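
To see why this amounts to building your own search engine, here is a toy inverted index over landing-zone documents; at real scale both the crawl and the index build would run as distributed jobs on the cluster.

    from collections import defaultdict

    def build_index(documents):
        """Crawl the documents and build a word -> document-ids inverted index."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    def search(index, query):
        """Return the ids of documents containing every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = set(index.get(words[0], set()))
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    docs = {1: "big data analytics in cloud", 2: "cloud data patterns"}
    print(search(build_index(docs), "cloud data"))  # -> {1, 2}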



Primary Big Data Use-Cases


1. Big Data Exploration - find, visualize and understand the data to improve decision making.

2. Enhanced 360-degree view of the Customer - all customer data in one place by incorporating all sources.

3. Security/Intelligence Extension - monitor and detect in real time.

4. Operations Analysis - analyze a variety of machine and operation data for improved business results.

5. Data Warehouse Augmentation - integrate big data and data warehouse capabilities for improved business results.  Optimize your data warehouse to enable new types of analysis.
