7 Observations from the Big Data Innovation Summit

The Innovation Enterprise events group recently put together the second annual Big Data Innovation Summit in Boston. Here are a few highlights from the first day:

Banana production in Central America is twice the rate of trash production in New York City

This and similar “fun facts” from Wolfram Alpha would have pleased Oscar Wilde, who said “it is a very sad thing that nowadays there is so little useless information.” When you search Wolfram Alpha for the quantity of bananas produced in Central America, you get a useful answer along with related facts and figures, and this “computational knowledge engine” also whimsically suggests “comparisons” such as trash production. In the age of big data, all data is useful depending on the context. Maybe Oscar Wilde would not have enjoyed the fun facts after all.

But the audience did enjoy the dazzling display of facts and computations, delivered by none other than Stephen Wolfram himself, who proudly demonstrated the growing power of his knowledge engine. Its long-term goal is to make “all systematic knowledge immediately computable and accessible to everyone.”

Wolfram Alpha now boasts a 93% success rate in understanding natural language queries and, for a fee, will tell you what’s in your data. Coming soon, Wolfram promised, will be the public release of the Wolfram Alpha ontologies, which go beyond traditional ontologies by representing relations between terms and concepts. Also soon, any website will be able to embed the Wolfram Language, which in turn embeds all the algorithms and the data in Wolfram Alpha.

Help wanted: digital-mechanical engineers

Industrial infrastructure accounts for 46% of the global economy, and there is increasing demand for technicians who combine digital and mechanical skills to service it. Peter Evans, director of global strategy and analytics at GE, presented the data: 52 million man-hours per year at a cost of $7 billion to service steam and gas turbines; 205 million man-hours at a cost of $10 billion to service aircraft engines; 52 million man-hours at a cost of $3 billion to service freight trains; and 4 million man-hours per year at a cost of $250 million to service CT and MRI scanners.
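To put those figures on a common scale, here is a rough back-of-the-envelope sketch in Python that divides each annual cost by the corresponding man-hours. The numbers are the ones quoted above; the variable and function names are purely illustrative, and the result is only an implied average cost per man-hour (the presentation did not break the costs down into labor, parts, and downtime).

```python
# Figures as quoted from the GE presentation:
# (man-hours per year, annual service cost in US dollars).
service_data = {
    "steam and gas turbines": (52_000_000, 7_000_000_000),
    "aircraft engines": (205_000_000, 10_000_000_000),
    "freight trains": (52_000_000, 3_000_000_000),
    "CT and MRI scanners": (4_000_000, 250_000_000),
}


def cost_per_man_hour(man_hours, annual_cost):
    """Implied average cost per man-hour (illustrative helper, not from the talk)."""
    return annual_cost / man_hours


# Implied average cost per man-hour for each equipment category.
rates = {
    name: cost_per_man_hour(hours, cost)
    for name, (hours, cost) in service_data.items()
}

# Totals across the four categories.
total_hours = sum(hours for hours, _ in service_data.values())
total_cost = sum(cost for _, cost in service_data.values())

for name, rate in rates.items():
    print(f"{name}: ${rate:,.2f} per man-hour")
print(f"total: {total_hours:,} man-hours, ${total_cost:,} per year")
```

By this measure turbines come out as the most expensive category per hour of service work, and the four categories together add up to 313 million man-hours and a little over $20 billion a year.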

All of this maintenance work is supported and augmented today by digital tools and data. For example, the 1,150 pages of field procedures for gas turbine maintenance no longer come in book form, as you can imagine. The work is webified and supported by big data analysis. Which is why we will need more digital-mechanical engineers, who will “race with the machines,” as Evans said.

Oracle’s Mark Hurd is the 2nd most important executive to influence perception of HP

This interesting factoid was provided by Jeff Veis, VP at HP Autonomy, as proof that old-fashioned Business Intelligence (BI), with its insistence on pre-defined Key Performance Indicators (KPIs), is no match for big data analytics, which does not require a pre-defined “schema.” Turns out HP was tracking the “share of voice” of its current 10 most influential senior executives (the pre-defined schema), neglecting to track a highly visible executive like Hurd, frequently mentioned in the media as the former HP CEO.

This observation may reflect big data linear thinking: aren’t “Hurd” and “HP” just two highly correlated words that have no bearing on anyone’s perception of HP today? But Veis is convinced that big data delivers “a command center that shows you what’s happening, not a dashboard with 40 KPI.” The big data command center could help make sense of a flood of data that will continue to rise rapidly, mostly because of the Internet of Things. HP estimates that by 2030 there will be 1 trillion sensors in the world, one sensor every 10 square feet, requiring one million times more storage and processing.

Tape storage is alive and well and improves data housekeeping

We’ve heard for years that “tape is dead,” but it hasn’t gone away. At CERN, where they have already moved from big data to humongous data (15 petabytes of new data each year; one planned project will collect a terabyte a second), most of the infrastructure is open source except for tape storage.

“Tapes are good,” says Sverre Jarp, CTO at CERN OpenLab, “because unused data ends up on cartridges that can be discarded after a given time of inactivity.” Tape is still around because it facilitates (enforces?) the one activity that we don’t want to think about in the big data era: deleting data. Even if the data is WORN (written once, read never), to quote Igor Elbert, another presenter at the event and a principal data scientist at Gilt Groupe.

Big Data is a misleading term

Volume is often the least important issue for big data analysis. Bigger challenges are data variety (format, source, resolution) and velocity, the speed at which data is generated and consumed. Possibly most important is veracity: the varying levels of noise, uncertainty, bias, and processing errors.

This observation is based on the extensive experience of Nikunj Oza, data science leader at NASA. Oza talked about flight anomaly detection and how data analysis helps improve the safety of civil aviation. For more, see the Multiple Kernel Anomaly Detection (MKAD) algorithm and DASHlink, where NASA shares its data, research, and related research by others.

“Reach out and touch someone” has been empirically and analytically proven

Paychex provides payroll, human resources, and employee benefits services, primarily to small businesses. Paychex loses about 20% of its customer base each year, as not all small businesses find it easy to survive. Erika McBride, manager of predictive analytics at Paychex, explained how they developed a model that predicts high-risk customers and can track what the Paychex branches are doing (or not doing) to increase customer retention.

Based on the model, some branches developed a year-end retention program, targeting the clients most likely to leave with free payrolls and loyalty discounts. Turns out that when the retention strategy was applied, the customer loss rate was 6.7%, as opposed to a 25.2% loss rate when nothing was done. The analysis also significantly helped the bottom line by curbing the branches’ eagerness to touch all customers, which had meant offering discounts even to customers likely to stay with Paychex, rather than targeting only those the model predicted were most likely to leave.
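For a sense of scale, here is a small Python sketch of what the two loss rates imply. It assumes the 6.7% and 25.2% figures were measured on comparable groups of at-risk clients, which the presentation did not spell out, and the 1,000-client group at the end is a hypothetical illustration, not a Paychex number.

```python
# Loss rates as quoted in the talk.
loss_with_program = 0.067     # retention strategy applied
loss_without_program = 0.252  # nothing done

# Absolute reduction in the loss rate, in percentage points (as a fraction).
absolute_reduction = loss_without_program - loss_with_program

# Relative reduction: share of the would-be losses that were avoided.
relative_reduction = absolute_reduction / loss_without_program


def extra_clients_retained(n_clients):
    """Expected additional clients kept in a group of n_clients.

    Assumes both quoted rates apply to the same kind of client (an assumption).
    """
    return n_clients * absolute_reduction


print(f"absolute reduction: {absolute_reduction * 100:.1f} percentage points")
print(f"relative reduction: {relative_reduction * 100:.1f}%")
# Hypothetical group size, for illustration only.
print(f"extra clients retained per 1,000 targeted: {extra_clients_retained(1000):.0f}")
```

Under those assumptions, the program cuts losses among targeted clients by roughly three quarters, which is why limiting discounts to the clients the model flags matters so much for the bottom line.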

Netflix is an IT infrastructure company that happens to stream movies

Netflix processes 1.5 million log events every second and manages 150,000 IT “events” during normal hours, according to Jae Hyeon Bae, senior platform engineer with Netflix. In my view, Netflix is developing a core competency in one of the most important technological challenges of the big data era: how to move the data. If you think Netflix’s competitive advantage comes from hiring Kevin Spacey, ask yourself: who knows more about moving data globally over the Internet and into our homes?

From managing lots of data in all its forms and messiness to analyzing small batches of data to inform business decision making and predict events, the Big Data Innovation Summit brought together the people who make data useful.

[Originally published on Forbes.com]