F1, Spark and Blu-ray Player

What do F1, Spark, and a Blu-ray player have in common? Before you start browsing your “intellectual” thoughts, let me state the fact: these just happen to be a few regular events from last week. I hosted a session on Apache Spark, attended the SIA F1 weekend event, and picked up my free Blu-ray player (a gift with my TV upgrade).
Back to work, and while reflecting on various advice and feedback on Big Data analytics deployments, it struck me that these disjoint events actually represent today’s analytics ecosystem and processes.
While F1 represents the ultimate speed and agility, Apache Spark promises to bring the same “speed and agility” to Big Data analytics.
F1 relies on discipline and rigor, and the key to winning is to adjust, adapt, and realign during the race (the execution) itself. The success factors for Big Data analytics are the same. It’s not about starting with a KPI-driven big bang and a rigid data governance approach. The key to Big Data analytics is to start with a minimal investment and a focused, business-aligned goal, then adjust, adapt, and realign during the development life cycle itself.
Apache Spark promises great agility in the Big Data analytics development life cycle. It provides the ability to create a complete data science workflow — ingest, transform, and prepare data, execute analytic algorithms, then analyze and visualize — all on a single platform. A unified platform for such development allows teams to rapidly adjust, adapt, and realign, and thus promises to provide the business with the insights and agility it has been seeking.
What about the speed? Spark holds the record for quickly sorting 100 TB of data (1 trillion records), and like the Mercedes engine in the F1 cars, it keeps improving with each release.


What about the Blu-ray player? While Blu-ray is an excellent technology, I have been struggling to understand its relevance in my house. I watch movies on Apple TV; it’s agile (I can decide at any time what to watch, change my preference, pay, and enjoy). I use a USB drive or external drive for any of my existing content. I don’t see why I should pay for costly Blu-ray discs, which force me to limit my choices and lose flexibility.

That last statement simply reflects the comments I have been hearing from business leaders about the value they see from their “traditional data warehouse” approach.
Add to this the disc and Blu-ray region code map (data governance gone wrong), which again limits what I can play: excellent technology, but irrelevant to me today.

So what represents your analytics ecosystem: “speed and agility” or “high cost and rigidity”?

Data Lake for the Enterprise: Using the Elasticity of the Cloud and the Safety of On-Premises Deployment

We have experienced on every analytics/data warehousing project that a disproportionate portion of the time is spent on data preparation, i.e., acquiring, preparing, formatting, and normalizing the data. As the use of analytics matures in an organisation, dimensional modelling and KPI-based reporting become old-fashioned and of limited use. Subject matter experts want access to their organisation’s data to explore the content and to select, control, annotate, and access information using their own terminology within a framework of data protection and governance. We saw the conflict between the business’s push for democratisation of data and IT’s need for operational control. Thus the concept of the data lake started evolving.

A data lake, as understood by most enterprises, is a big data repository that provides data to an organization for a variety of analytics processing, including:

  • Discovery and exploration of data.
  • Simple ad hoc analytics.
  • Complex analysis for business decisions.
  • Real-time analytics and reporting.

It is also possible to deploy analytics into the data lake to generate additional insight from the data loaded into it. There are two aspects of building an effective data lake: the platform infrastructure and the data flow (which includes governance and control). While I am not touching on the data flow here, I would just like to caution that imposing a rigidly structured multi-dimensional model on a data lake defeats its very objective.

On the platform side, Hadoop is becoming synonymous with data lake projects, and in fact the data lake concept evolved because of it. Most vendors today offer either a cloud solution for Hadoop or an appliance-based approach. While the cloud provides elasticity, it is still a challenge for most enterprises to move all sorts of data freely to the cloud. The on-premises, appliance-based approach provides simplicity but loses the elasticity required for a data lake.

Here I suggest a hybrid model that provides both the elasticity and the safety net required by most enterprises. A small on-premises Hadoop cluster provides a safe place to collect and store all data assets, specifically those from your in-house applications. It can serve as a staging area for large volumes of data and as the analytics area for your confidential data. A cloud Hadoop deployment can then be used for massive transformations and deeper analytical algorithms, scaling (shrinking or growing) as required. To maintain integrity and data security, you need secure data movement, encryption, and access control.
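The core decision in such a hybrid model — which workloads stay on the in-house cluster and which move to the cloud — can be expressed as a simple placement rule. The sketch below is purely illustrative (the dataset names, classification tags, and function are all hypothetical), showing one way to keep confidential data on-premises while sending non-sensitive, scale-hungry workloads to the cloud.

```python
# Illustrative placement rule for a hybrid data lake.
# All names (datasets, tags, function) are hypothetical.

def choose_target(dataset):
    """Return 'on_premises' for confidential data, 'cloud' otherwise."""
    if dataset.get("classification") == "confidential":
        # Confidential assets never leave the in-house cluster.
        return "on_premises"
    # Non-sensitive workloads benefit from cloud elasticity.
    return "cloud"

datasets = [
    {"name": "hr_records", "classification": "confidential"},
    {"name": "clickstream", "classification": "public"},
]

placement = {d["name"]: choose_target(d) for d in datasets}
```

In practice this rule would live in the governance layer, and any movement to the cloud would go through the secure, encrypted channel mentioned above — the point here is only that the routing decision itself can be made explicit and auditable.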

Here is a suggested architecture based on the IBM Big Data portfolio.

Hybrid Enterprise Data Lake Architecture

The IBM reference architecture for BigInsights (Hadoop) provides the flexibility to deploy a secure on-premises cluster of any size, while the cloud offerings for BigInsights (as PaaS, SaaS, or IaaS) provide the elasticity.

The above architecture opens up unexplored opportunities without a large up-front investment, copes with changing business dynamics and analytics requirements, and, foremost, does so without compromising on control. A business can choose to experiment by setting up a data lake on an on-premises cluster and let its use of analytics evolve. As and when it is ready, it can provision a cloud cluster for any specific analysis it prefers to run there (for example, clickstream analysis of retailers’ web traffic during the holiday season, or multi-faceted clustering for a new promotion by a financial hub).

While SQL on Hadoop has evolved for structured data, the use of existing MPP/in-memory technologies as accelerators for operational reporting remains optional. I will publish my thoughts on the data flow and governance model for the data lake in my next blog.