To be human and living in the 21st century is to rely on data, whether it is generated organically through some miracle of neurotransmission or produced by some electronic device. The most ardent technophiles among us believe – and they may be right – that for every problem in life there exists an electronically enhanced solution. Both people and computers need data to make decisions, and the collection, management and parsing of that data, along with the symbiotic relationship that exists between man and machine, are what make the big data space so compelling as well as complex.
The GigaOM Structure Data conference series, which kicked off in New York City March 20th and 21st, provided ample opportunity for attendees to catch more than 30 presentations and panel discussions, mix with other attendees and talk with a variety of Big Data solution providers. While there were too many presentations, vendor booths and opportunities for serendipitous interactions for one individual to absorb, I’ve tried to encapsulate what were, for me, some of the more meaningful moments at the event.
To quote GigaOM’s event recap, “Big Data needs people, leaders and real-time analytics.” I would add that the Big Data space also needs vendors and solution providers to improve their ability to engage aspirational users – not just work with leading edge organizations – and improve on their data integration capabilities. I’d also like to see more Big Data vendors tackling harder problems than trying to figure out, for instance, our purchasing habits or which college basketball team will go to the Final Four.
Big Data Thought Leadership on Display
Indeed there were several presentations that showcased Big Ideas for Big Data. Eric Berlow, TED Fellow and founder of Vibrant Data Labs, in his GigaOM presentation, asked the question, “Are algorithms actually making society dumber?” Vibrant is an example of collective creativity, which fosters collaboration among scientists, artists, designers and social investors developing “data-driven approaches to navigate the complex landscape of our most pressing problems and convert limited resources into problem solving engines for positive change.”
Berlow also talked about the challenges inherent in democratizing data and posed the question, “Can we do more?” Vibrant recently launched We The Data, a non-profit whose mission is to improve access to data for the “underserved,” support openness and trust by giving people both access to and control over their own data, and dramatically increase data literacy. Although he did not mention it, access to our personal health information (PHI) comes to mind – an issue Obamacare is tackling with some initial success, for example, vis-à-vis Medicare’s Blue Button.
Human and Machine Collaboration
A panel titled What Does Collaboration Among Humans and Machines Really Look Like? brought together Scott Brave, founder and CTO of Baynote; Timothy Estes, founder and CEO of Digital Reasoning; and Jan Puzicha, co-founder and CTO at Recommind. They discussed the importance of the interaction between people and thinking machines, and what machines and people each do best. Brave pointed out that, for example, during an online buying experience, organizations often lose sight of the fact that there is a person at the other end of the computer. Those people are queryable and can be much more involved in improving the experience and helping the supplier understand “What’s the right data?”
Estes added that machines have three key advantages over humans: (1) Scale, such as the ability to read billions of documents quickly, “unless humans have some Ray Kurzweil–style implant, which I hope doesn’t happen”; (2) Speed: although humans can take in thousands of sensory inputs to form a single visual object, we can’t process thousands of separate inputs simultaneously; and (3) Synthesis: unifying data at that scale and speed will never be achieved by humans, so the data and judgments coming out of that synthesis can only come from machines.
Estes also speculated about the future of the machine learning model and what data access might evolve to over the next decade. “We’re going to have a debate in culture and society about what to do with that. Will we have a Google-like model where it tells us what to do next and the data is all in one place? Or will we have software and technology that we own that does it for us as an extension of us?”
Estes believes that we need not only to figure out what information and insights we want to extract from data, but also to push ourselves to solve harder problems. “We have a moral obligation to exploit and leverage data to the extent possible to fight global terrorism, mitigate financial risk or lower healthcare costs and improve health outcomes.”
Puzicha talked about the importance of integrating solutions with a workflow that feels natural for users. “It’s not so much about human vs. machine or the importance of the algorithm. It’s about how you synthesize the two in a system or module that allows the user to forget they are working with a machine learning system. It becomes an assistant, something they can iterate with. This requires a much deeper understanding of the use cases and a deep understanding of how that interaction happens. Creating a system that can be applied to multiple use cases within an organization is critical. From a human interaction perspective, it’s fundamentally the same usability problem applied across the enterprise.”
In a later breakout session, Recommind’s CEO Bob Tennant elaborated on the man and machine collaboration discussion. “Machine learning solutions are really good at precision and recall. You want to be able to easily ask and answer a question. But how do you surface that question? When a real-time search requires much more precision than Google delivers, it should not be necessary to have a team of data scientists perform that function. The technology has to be easy enough for the mass of users to derive value from the solution, whether it be for enabling a business process or an evidence-based healthcare application.”
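For readers less familiar with the two terms Tennant mentions: precision is the share of returned results that are actually relevant, and recall is the share of relevant items that were actually returned. The sketch below is my own illustration in Python, not anything shown at the conference or drawn from Recommind's product; the document IDs and the function name are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: items the search system returned (hypothetical IDs)
    relevant:  items a human reviewer judged relevant (hypothetical IDs)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so we never divide by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: four documents returned, three truly relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

In this toy case half of what came back was useful, and one relevant document was missed. Tuning a system for one measure usually costs something on the other, which is part of why the question of how to surface the right query matters.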
Big Data Heuristics
CTO and Quid founder Sean Gourley, during his Where Is the Big Data Industry Going? presentation, challenged attendees to re-imagine the move from science to Data Intelligence. He used the example of how some medical informatics scientists are “obsessed with predicting illness rather than trying to figure out how to avoid illness.” Gourley proposes the creation of data sharing ecosystems that curate much richer data than can be found on social media sites.
Gourley laid out his Five Heuristics for Big Data:
- Data has to be designed for humans with human-centered UIs.
- We have to understand the limitations of the human brain.
- Understand that data is messy, incomplete and biased.
- Data needs a theory, so building models that understand theory is important.
- Data needs stories, stories need data, so combine stories with data to make better decisions.
Gourley used the analogy of the centaur, half-man and half-horse, to argue that systems need to be designed to interface seamlessly between human and machine.
Databases of the Future: Where’s the Data Integration?
Thanks in large part to open source software guru Doug Cutting, Hadoop and the Hadoop Distributed File System (HDFS) demonstrate that the constraints of traditional relational database management systems (RDBMS) can be overcome relatively easily and cost effectively. For example, users can address petabytes of data with a single query. (Cutting also developed Lucene, the open-source library widely used in text-based search engines.)
Based on software innovations developed at Yahoo! and Google, Hadoop has helped spawn a whole new category of application development tools. The last panel of the conference, Four for the Future: Upcoming Database Technologies that Are Not Hadoop, was moderated by GigaOM analyst David Linthicum. Much to his credit, he tactfully pressed the panelists on the role their companies play in providing data integration solutions or capabilities.
All four panelists agreed that the data integration challenge was largely up to the client to solve. The problem for most users is that data integration is messy and costly, and lack of data integration is generally the biggest barrier to deriving new insights or business value from existing information sources. It’s understood that no two organizations have the same data profile or integration requirements. But as long as Linthicum kept bringing up the topic, I was looking for at least one panelist to address the integration issue. None did.
Damian Black, CEO of SQLstream, talked about the value and increasingly cost-effective use of in-memory processing – whether using Hadoop or not. Black is an enthusiastic advocate for leveraging SQL as a de facto query language standard. “SQL is a lingua franca for queries. Enterprises like SQL as do developers who want structure and a framework. Streaming technology has moved to real-time and will become more prevalent and affordable.”
Emil Eifrem, CEO and founder of Neo Technology, stated he did not enjoy panels where everyone agreed. “Queries don’t have to run in-memory and premature standardization kills innovation. Easy to use APIs and things you can’t express in SQL are critical, and modern developers are happy to learn new things.” Eifrem believes that the NoSQL market is headed for more consolidation. “A few years ago we had twenty NoSQL database companies. Now we have five or so, including open-source DB players like Basho, Cassandra, Couchbase, MongoDB and Neo4j. The key value store and document store databases will merge and compete.”
Ryan Garrett, VP of Product for MemSQL, agreed with Black that SQL will be the standard for non-relational database queries. He added that “there will be much wider adoption of in-memory DB systems as both memory and storage continue to become more affordable; and the use cases are going to be more widely recognized, while data volumes will continue to grow.”
TempoDB CEO Andrew Cronk weighed in on the future of time-series data and the impact of mobile devices. “If you believe GE’s marketing of the industrial Internet or Cisco’s view that the Internet is everything, and you believe there will be more connected devices and more stuff about our sales environment, then you’ve got to believe in the explosion of connected devices and time-series data. And,” he added, smiling, “it will all be in TempoDB.”
Solution Providers of Interest
No doubt I have missed several vendor sponsors that would have proved most interesting had there been more time to meet them. At any rate, the following is a short, alphabetical list of the solution providers I did speak with at the event and the vendors of which I have some working knowledge.
Cleversafe has a solution that allows limitless data storage, whether the data is stored locally in a data center, at remote locations or with cloud service providers. The obvious appeal of geographically dispersed information is data protection, but it also helps control storage costs and makes it easier to scale for big data projects.
CloudSigma is a cloud service provider based in Switzerland that has moved entirely to Flash storage. CEO Robert Jenkins, who was a panelist at the conference, says the move to all SSD or Flash storage is justified because the service he provides to customers is now so much faster than it had been. CloudSigma is planning to open up two facilities in the U.S. later this year.
Digital Reasoning enables enterprises to detect fraud, uncover market trends, gain better insight into customer behavior, gain competitive advantage, and mitigate risk. Its flagship product Synthesys leverages entity analytics to uncover critical facts and relationships from very large and diverse data stores. The company works with government, financial and healthcare organizations offering both on-premise and cloud-based solutions.
Hortonworks spun off from Yahoo! and offers one of the leading open source Hadoop distributions. The Hortonworks Data Platform (HDP) is targeted at organizations that want to combine the power and cost-effectiveness of Apache Hadoop with the advanced services and reliability required for enterprise deployments.
IBM has perhaps the most comprehensive portfolio of Big Data–related solutions and services, including major upgrades to their legacy databases and newer Hadoop-based offerings. Unlike many Big Data vendors, IBM also offers information integration, governance and migration services and solutions as well as the hardware and storage assets to support Big Data projects – not to mention cloud-based solutions.
LucidWorks offers an open source enterprise search platform based on Lucene. It recently announced a near real-time indexing capability, and its Big Data solution offers an application development platform that enables comprehensive search, discovery and analysis of an organization’s content and user interactions.
MapR offers an enterprise-grade Hadoop platform that supports a broad set of mission-critical and real-time use cases. Well known for a partner ecosystem of big names that includes Amazon, Cisco and Google, the company just this month closed a new $30 million round of funding. MapR says it brings dependability, ease of use and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform.
MarkLogic provides an enterprise-grade NoSQL document-centric database that enables organizations to manage a variety of data types. Key features include ACID transactions, horizontal scaling, real-time indexing, high availability, disaster recovery and government-grade security, while offering search across disparate data types, including text, images, date/time, geospatial and currency culled from multiple repositories or systems.
Recommind has made a name for itself in the legal ediscovery space but is now leveraging its CORE analytics platform to help customers solve Big Data challenges. It offers customers the ability to access and understand unstructured data, without the need for taxonomies or natural language processing, by using machine learning techniques.
Tableau is one of the more popular data visualization and business intelligence tools in the market. Its solution allows non-programmers to query data relatively quickly and easily. Users can view data from integrated dashboards and share mobile and browser-based interactive analytics and ad hoc reports.
Impressions and Conclusion
The Big Data space is packed with large incumbent vendors like EMC, HP, IBM, Microsoft and Oracle who are busy acquiring companies, trying to innovate and also protecting their turf.
The plethora of smaller companies, many of which are well funded and/or well regarded, are pushing the envelope in many ways, especially in the database space where even the largest enterprises have felt compelled to adopt Hadoop or other NoSQL solutions that offer much more scale and flexibility at a lower cost than traditional RDBMSs.
The search and analytics space is also wide open for new players, especially those that can handle a variety of Big Data sources of both structured and unstructured data including text and images. Vendor consolidation down the road is a given – similar to other markets when they begin to mature.
I look forward to the day when any of these products can be used easily – as easily as a Google search – by non-technical people and any organization of any size can afford these capabilities. For now, early adopters with deep pockets and tech-savvy individuals are the primary beneficiaries of all this incredible intellectual property.