Big data software has emerged from a number of sources, especially the open source community and, more specifically, the Apache Software Foundation. While a number of for-profit companies have taken up the mantle of Apache Hadoop and Apache Spark, there is still a certain DIY attitude about big data.
For the IT professional, this means a shortage of data management tools for Hadoop clusters that perform the same functions as those used for more traditional data sources such as SQL databases. Professional data management equivalents of data warehousing and data lakes, as well as SQL RDBMS master data management, are only now emerging. IBM, for example, recently released MDM 11.4 with support for Hadoop and InfoSphere Information Integration 11.3, bringing familiar data management to big data.
Informatica Big Data Management
Data management software company Informatica has just announced its version of data management for Hadoop clusters. Informatica Big Data Management brings the well-known best practices in data management found in PowerCenter and other Informatica products into the big data and unstructured data space. It allows companies to manage and integrate data in Hadoop and NoSQL stores in the same way they manage SQL data now.
Some of the high points of Informatica Big Data Management are:
- Data validation. Big Data Management validates data in unstructured and semi-structured data stores. Unstructured data is often messy, with inconsistent and incomplete values across different data sources. Informatica finds the errors and creates a clean data glossary.
- Masking and Anonymizing. Data in Hadoop has the same masking problems as SQL data. Up until recently, you couldn’t do anything about it. An end-user could receive a dashboard or report generated from a Hadoop cluster that displayed information they did not have the right, by policy, to see. That was a serious problem for customers in highly regulated industries. They either had to build difficult workarounds, usually by replicating the ‘allowed’ portion of the data, or not use critical information. Neither was a good option. Masking in Big Data Management fixes that problem and enables privacy-sensitive organizations to make better use of Hadoop clusters.
- Live Data Map. Unstructured data is especially hard to understand because there is no set schema, hence no natural map. Further, given the explosion of types of unstructured data, any map that an enterprise created for unstructured data would quickly become obsolete as new data sources emerged. The Live Data Map builds the catalog of metadata and relationships vital to performing analysis efficiently, and it enables customers to update the mapping as new data sources are included.
- Integration to and across Big Data stores using automation and visual tools. Informatica Big Data Management enhances two common forms of data integration for Hadoop clusters. First, it automates queries across diverse types of SQL and NoSQL databases and loads the results into Hadoop clusters. This automates ETL in heterogeneous database environments, making the process more efficient. Second, it does the same for extracting information from a variety of structured and unstructured data stores. This makes queries more efficient and analysis easier.
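To make the masking point in the list above concrete, here is a minimal sketch of policy-based masking. It is not Informatica's API or implementation. The role names, the `POLICY` table, the salt, and the helper functions are all hypothetical, chosen only to illustrate the idea that a report can be built from the full data set while each user sees only the fields their role allows.

```python
import hashlib

# Hypothetical policy: fields each role may see in clear text.
POLICY = {
    "analyst": {"region", "purchase_total"},
    "compliance": {"region", "purchase_total", "customer_name", "ssn"},
}


def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a stable token derived from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_record(record: dict, role: str) -> dict:
    """Return a copy of the record with every field the role may not see masked."""
    allowed = POLICY.get(role, set())
    return {
        field: value if field in allowed else pseudonymize(str(value))
        for field, value in record.items()
    }


# Made-up sample record, for illustration only.
record = {
    "customer_name": "Ada Example",
    "ssn": "000-00-0000",
    "region": "EMEA",
    "purchase_total": 125.5,
}

masked = mask_record(record, "analyst")
print(masked)
```

Because the token is deterministic, the same customer maps to the same masked value across records, so counts and joins still work on masked data. That is the property that removes the need to replicate an ‘allowed’ copy of the data for each audience.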
The Big News
The reason Informatica’s Big Data Management matters to IT professionals and data scientists alike is that it applies existing, well-known, and well-understood data management practices, common in managing SQL data, to non-SQL data types. The same type of automation and visual tools that Informatica customers are used to in PowerCenter, Data Integration Hub, and other Informatica products have been extended to the Hadoop and NoSQL world. Productivity is maintained through familiarity, and efficiency is gained through automation and visual tools.
These recent announcements show the maturation and acceptance of big data in large enterprises. Large companies are now asking for the type of tools that they are used to for managing other types of data stores. They want these tools to be familiar and to incorporate the best practices they have come to rely on for managing data within their complex environments. Both Informatica and its rivals are giving these customers what they need to take big data to the next level.