Understanding the “Big” in Big Data

Yesterday, January 16, 2013, I had the honor of having lunch with Dr. Konstantin V. Shvachko. Among other things, Konstantin was the lead developer and committer for the Apache Hadoop Framework project. He is responsible for the name node portion of the project (yes, he is the name node guy!).

We talked about many things, but one of the most interesting was the work he is doing at WANdisco (where he now works) on making the name node scalable.

We also discussed whether a utility exists to discover how many Hadoop instances are running in an enterprise. The simple answer is NO! Yet I’m pretty sure that larger enterprises have multiple small instances distributed throughout the organization.

A tool such as this would help IT better understand who is looking at Big Data. It would also allow Hadoop clusters to be consolidated into the traditional IT environment, supporting corporate and regulatory compliance and giving IT the controls it needs to centralize management and coordination.

This is important. My research background suggests that most larger organizations are likely to have multiple “skunkworks” Hadoop clusters deployed. IT organizations need to not only provide Infrastructure-as-a-Service, but also consider evolving into a provider of Information-as-a-Service for the enterprise.

Neuralytix, the company I serve, believes that IT should really be a provider of data and information, not a support and maintenance organization for infrastructure.

What do you think?
