Analytics is not performed in a vacuum. It requires the participation of business users, business analysts, IT analysts, data scientists, and even business unit leaders.
Analytics is also performed over time:
- Development of the actual problem or opportunity;
- Creation of the algorithm(s) to solve the problem;
- Testing and validation of the algorithm(s);
- Refinement of the algorithm(s); and
- Execution of the solution.
Data analysts, like other employees, resign or are terminated. As data analysts come and go, they may leave behind or take with them their skills and intellectual property (which may legally be proprietary or confidential to the company).
These factors, along with others not listed, contribute to the need for better communication, collaboration, and security of the work product associated with analysis.
Greenplum, a division of EMC, announced Chorus (a collaborative data science platform) back in March. The recent announcement at Strata-HadoopWorld delivers on the promise Greenplum made to release the Chorus code as open source. The platform allows stakeholders in the analytics process to create, share, and publish, within a designated community of users, insights, datasets, work product and, arguably most importantly, the thought streams associated with those insights.
Chorus also acts as a centralized repository of work products and insights. Models created earlier can be retrieved, and new analysts can improve them. Thoughts about an improvement, comments on it, and knowledge from previous modelers can all be captured within the platform. All of this is done on a secure platform.
Data that persists in relational database management systems (RDBMS), such as Microsoft SQL Server, can be brought together with data stored in Greenplum databases and Hadoop databases to broaden the datasets available and to improve both the accuracy of insights and the time to insight.
As part of this announcement, Greenplum released its code for Chorus as open source and has already established new partnerships that extend its ecosystem beyond the Chorus product (Greenplum initially announced Alpine as a partner). Greenplum announced that it has established partnerships with Kaggle, GNIP, and Tableau.
One of the most challenging parts of the Big Data ecosystem is human resources. There are not enough “data scientists”. To address this, Kaggle.com was established to provide a LinkedIn-like experience for data scientists. Kaggle and EMC have partnered to deliver an “on-demand data scientist workforce”. The partnership enables Chorus to be a broker between a pool of 57,000 data scientists and organizations that are unable to quickly staff their predictive analytics projects. Through the relationship with Kaggle, enterprises can draw on data scientist resources as needed, creating a very agile, on-demand data science ecosystem.
The second partnership Greenplum established is with GNIP. GNIP provides out-of-the-box integration of filtered social feeds (such as Twitter) that Chorus users can leverage as yet another data source, one that integrates seamlessly with the Greenplum database.
Lastly, Greenplum also announced a partnership with Tableau, a very popular data visualization tool. The partnership allows users to provision Tableau Workbooks from Chorus data sources, link and co-author Tableau-hosted work files, and tag and annotate Tableau assets from within Chorus. Data scientists, or anyone working with data, no longer need to go to different tools to complete their analytics project. The work done for advanced visualization can now be documented in Chorus, ensuring that the value of data science assets stays within the organization even as analysts come and go. Moreover, organizations can start to leverage existing investments and create value out of the synergies between the different analytics solutions.
neuraspective™ & Business Value Assessment
The business value of Chorus is vast. Too often organizations are “reinventing the wheel” when it comes to analytics, because simple preliminary models held by an individual analyst are no longer accessible once that analyst has left the enterprise. In other cases, preliminary models, code, data, work product, and insights are not shared simply because business users in different departments or divisions are not aware that this output already exists.
The duplication of work, the redundancy of the analytics, and the lack of collaboration (which is neither malicious nor deliberate) cost enterprises valuable opportunities to create strategic value.
The ideas and hypotheses of one department or division could be equally or even more valuable when applied across an enterprise.
What Greenplum Chorus highlights is a problem that is all too common. The goal of open-sourcing Chorus, as a project named OpenChorus (openchorus.org), is to make it a platform driven by the data science community, so that customers can also modify Chorus in ways meaningful to their own business and either preserve those changes in-house or contribute them back.
Neuralytix believes that Big Data collaboration platforms like Chorus will significantly improve the productivity of data analytics personnel and open a channel for cross-functional and cross-divisional sharing of insights that can be leveraged enterprise-wide.
Greenplum has demonstrated with Chorus that it is more than a database company and more than a Big Data company. It is developing and contributing to an ecosystem that will allow broader use and acceptance of Big Data activities across an enterprise.
The announced partnerships extend Greenplum even further. They demonstrate that EMC and Greenplum are interested in supporting the Big Data community and extending the Big Data ecosystem.