In the ever-evolving world of enterprise IT, choice is generally considered a good thing, although too many choices can create confusion and uncertainty. Application owners, database administrators and IT directors may pine for the good old days, when one could count the number of enterprise-class databases (DBs) on one or two hands, but the relational-database-solves-all-our-data-management-requirements days are long gone.
Thanks to the explosion of Big Data throughout every industry sector and requirements for real-time, predictive and other forms of now indispensable transactions and analytics to drive revenue and business outcomes, today there are more than 50 DBs in a variety of categories that address different aspects of the Big Data conundrum. Welcome to the new normal world of NoSQL – or, Not only Structured Query Language – a term used to designate databases which differ from classic relational databases in some way.
In August, more than 20 NoSQL solution providers and 100-plus experts gathered at the San Jose Convention Center for the 2013 edition of NoSQL Now! Exhibitors and speakers included familiar names such as Oracle along with a score of venture-backed NoSQL solution providers eager to disseminate their message and demonstrate that the time has come for enterprises of every ilk to adopt innovative database solutions to tackle Big Data challenges. More than a dozen sponsors were interviewed at the event and are profiled in this research note.
Evolution of NoSQL
In the beginning, there was SQL (structured query language). Developed by IBM computer scientists in the 1970s as a special-purpose programming language, SQL was designed to manage data held within a relational database management system (RDBMS). Originally based on relational algebra and tuple relational calculus, SQL consists of a data definition language and a data manipulation language. Subsequently, SQL has become the most widely used database language largely due to the popularity of IBM, Microsoft and Oracle RDBMSs.
NoSQL DBs started to emerge and become enterprise-relevant in the wake of the open-source movement of the late 1990s. Use of NoSQL databases has since mushroomed, aided by the movement toward Internet-enabled online transaction processing (OLTP), distributed processing leveraging the cloud and the inherent limitations of relational DBs, including high cost and a lack of horizontal scale, flexibility, availability and findability.
Amazon’s instantiation of DynamoDB is considered by many to be the first large-scale, or web-scale, production NoSQL database. To quote author Joe Brockmeier, who now works for Red Hat, “Amazon’s Dynamo paper is the paper that launched a thousand NoSQL databases.” Brockmeier suggests that the “paper inspired, at least in part, Apache Cassandra, Voldemort, Riak and other projects.”
According to Amazon CTO Werner Vogels, who co-authored the paper entitled Dynamo: Amazon’s Highly Available Key-value Store, “DynamoDB is based on the principles of Dynamo, a progenitor of NoSQL, and brings the power of the cloud to the NoSQL database world. It offers customers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table.” DynamoDB is the primary DB behind the wildly successful Amazon Web Services business and its shopping cart service that handles over 3 million “checkouts” a day during the peak shopping season.
As a result of the Amazon DynamoDB and other enterprise-class NoSQL database proof points, it is not uncommon for an enterprise IT organization to support multiple NoSQL DBs alongside legacy RDBMSs. Indeed, there are single applications that often deploy two or more NoSQL solutions, e.g., pairing a document-oriented DB with a graph DB for an analytics solution. Perhaps the primary reason for the proliferation of NoSQL DBs is the realization that one database design cannot possibly meet all the requirements of most modern-day enterprises – regardless of the company size or the industry.
The CAP Theorem
In 2000, Eric Brewer, a researcher at the University of California, Berkeley, presented his now foundational CAP Theorem (consistency, availability and partition tolerance), which states that it is impossible for a distributed computer system to simultaneously provide all three CAP guarantees. In May 2012, Brewer clarified some of his positions on the oft-used “two out of three” concept. The three guarantees are:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system).
According to Peter Mell, a senior computer scientist for the National Institute of Standards and Technology, “In the database world, they can give you perfect consistency, but that limits your availability or scalability. It’s interesting, you are actually allowed to relax the consistency just a little bit, not a lot, to achieve greater scalability. Well, the Big Data vendors took this to a whole new extreme. They just went to the other side of the Venn diagram, and they said we are going to offer amazing availability or scalability, knowing that the data is going to be consistent eventually, usually. That was great for many things.”
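The trade-off can be made concrete with a minimal Python sketch. It assumes a toy store with exactly two replicas; the names `Replica`, `ToyCluster` and `heal` are hypothetical and do not correspond to any product mentioned in this note. In "CP" mode a write is refused while the replicas cannot reach each other; in "AP" mode the write is accepted locally and the replicas converge only after the partition heals. Last-write-wins is used for reconciliation here purely because it is simple, not because any particular database works this way.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    # key -> (version, value); the version is a simple logical counter
    data: dict = field(default_factory=dict)


class ToyCluster:
    """Two replicas that may be separated by a network partition."""

    def __init__(self, mode: str):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.clock = 0  # logical clock used to order writes

    def write(self, replica_id: int, key: str, value: str) -> bool:
        """Return True if the write was accepted."""
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than let replicas diverge.
            return False
        self.clock += 1
        record = (self.clock, value)
        self.replicas[replica_id].data[key] = record
        if not self.partitioned:
            # Healthy network: replicate synchronously to the peer.
            self.replicas[1 - replica_id].data[key] = record
        return True

    def read(self, replica_id: int, key: str):
        record = self.replicas[replica_id].data.get(key)
        return None if record is None else record[1]

    def heal(self) -> None:
        """End the partition and reconcile with last-write-wins."""
        self.partitioned = False
        a, b = self.replicas
        for key in set(a.data) | set(b.data):
            candidates = [r.data[key] for r in (a, b) if key in r.data]
            winner = max(candidates)  # highest version wins
            a.data[key] = b.data[key] = winner


ap = ToyCluster("AP")
ap.write(0, "cart", "book")
ap.partitioned = True
ap.write(1, "cart", "book+pen")   # accepted despite the partition
stale = ap.read(0, "cart")        # replica 0 still returns "book"
ap.heal()
converged = ap.read(0, "cart")    # now "book+pen" on both replicas
```

The stale read between the second write and `heal()` is the "consistent eventually" behavior Mell describes; the same sequence against `ToyCluster("CP")` would instead reject the second write.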
ACID vs. BASE
In most organizations, upwards of 80% of Big Data is in the form of “unstructured” text or content, including documents, emails, images, instant messages, video and voice clips. RDBMSs were designed to manage “structured” data in manageable fields, rows and columns such as dates, social security numbers, addresses and transaction amounts. ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantees database transactions are processed reliably; it is a necessity for financial transactions and other applications where precision is a requirement.
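As an illustration of the atomicity guarantee, the following sketch uses the `sqlite3` module from Python's standard library. The choice of SQLite, the `accounts` table and the `transfer` helper are assumptions made for this example only. The transfer credits the destination first and then debits the source, so that an overdraft fails halfway through and the rollback has something to undo.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` between accounts atomically; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit
            # above is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)        # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # overdraft, rolled back
```

After the rejected transfer, neither balance has changed: the transaction either applies both updates or neither.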
Conversely, most NoSQL DBs tout their schema-less capability, which ostensibly allows for the ingestion of unstructured data without conforming to a traditional RDBMS data format or structure. This works especially well for documents and metadata associated with a variety of unstructured data types as managing text-based objects is not considered a transaction in the traditional sense. BASE (basically available, soft state, eventually consistent) implies the DB will, at some point, classify and index the content to improve the findability of data or information contained in the text or the object.
Increasingly, a number of database cognoscenti believe NoSQL solutions will overcome, or already have overcome, the “ACID test” as availability is said to trump consistency – especially in the vast majority of online transaction use cases. Even Eric Brewer argued recently that bank transactions are BASE not ACID because availability = $.
NoSQL Database Categories
As will be seen in the following section, NoSQL DBs simultaneously defy description and define new categories for NoSQL databases. Indeed, many NoSQL vendors possess capabilities and characteristics associated with more than one category, making it even more difficult for users to differentiate between solutions. A good example is the following taxonomy provided by Cloud Service Provider (CSP) Rackspace, which classifies NoSQL DBs by their data model.
Note: In the original slide, Riak is depicted as a “Document” data model. According to Riak developer Basho, Riak is actually a key-value data model and its query API (application programming interface) is the popular web REST API as well as protocol buffers.
The chart above represents the five major NoSQL data models: Collection, Columnar, Document-oriented, Graph and Key-value. Redis is often referred to as a Column or Key-value DB, and Cassandra is often considered a Collection. According to Techopedia, a Key-Value Pair (KVP) is “an abstract data type that includes a group of key identifiers and a set of associated values. Key-value pairs are frequently used in lookup tables, hash tables and configuration files.” Collection implies a way documents can be organized and/or grouped.
Yet another view, courtesy of Beany Blog, describes the database space as follows:
“In addition to CAP configurations, another significant way data management systems vary is by the data model they use: relational, key-value, column-oriented, or document-oriented (there are others, but these are the main ones).
- Relational systems are the databases we’ve been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
- Key-value systems basically support get, put, and delete operations based on a primary key.
- Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
- Document-oriented systems store structured ‘documents’ such as JSON or XML but have no joins (joins must be handled within your application). It’s very easy to map data from object-oriented software to these systems.”
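The three non-relational models in the quote above can be sketched with plain Python data structures. This is only an illustration of the shapes involved, not how any particular product stores data; the class `KeyValueStore`, the helper `orders_with_customer` and the sample records are all hypothetical.

```python
class KeyValueStore:
    """Key-value model: only get, put and delete by primary key."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        self._items.pop(key, None)


# Row-oriented layout: one record per row.
rows = [
    {"order_id": 1, "customer_id": "c1", "total": 40},
    {"order_id": 2, "customer_id": "c2", "total": 25},
    {"order_id": 3, "customer_id": "c1", "total": 35},
]

# Column-oriented layout: one list per column, so an aggregate reads
# only the column it needs.
columns = {name: [row[name] for row in rows] for name in rows[0]}
total_revenue = sum(columns["total"])

# Document-oriented layout: self-describing documents, no server-side join.
customers = {
    "c1": {"name": "Ada", "city": "London"},
    "c2": {"name": "Lin", "city": "Taipei"},
}


def orders_with_customer(orders, customers_by_id):
    """Application-side 'join': look up each order's customer document."""
    return [
        {**order, "customer": customers_by_id[order["customer_id"]]}
        for order in orders
    ]


joined = orders_with_customer(rows, customers)
```

Note that the "join" is ordinary application code: the store returns documents by key, and combining them is the caller's job.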
Beany Blog omits the Graph database category, which has a growing number of entrants in the space, including Franz Inc., Neo4j, Objectivity and YarcData. Graph databases are designed for data whose relations are well represented as a graph, e.g., visual representations of social relationships, road maps or network topologies and representation of “ownership” for documents within an enterprise for legal or e-discovery purposes.
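For a sense of why such data suits a graph model, here is a small Python sketch that stores hypothetical "knows" relationships as an adjacency list and answers a typical graph question, the number of hops between two people, with a breadth-first search. The names and the function `degrees_of_separation` are invented for this example; a graph database would expose this kind of traversal through its own query language.

```python
from collections import deque

# Hypothetical directed "knows" relationships stored as an adjacency list.
knows = {
    "ana": ["ben", "cho"],
    "ben": ["dev"],
    "cho": ["dev"],
    "dev": ["eli"],
    "eli": [],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: number of hops from start to goal, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


hops = degrees_of_separation(knows, "ana", "eli")
```

Expressing the same question against relational tables would require one self-join per hop, which is the kind of query graph stores are built to avoid.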
Hadoop and NoSQL
The Hadoop Distributed File System (HDFS) is an Apache open-source platform that enables applications, such as petabyte-scale Big Data analytics projects, to potentially scale across thousands of commodity servers such as Intel standard x86 servers, dividing up the workload.
Hadoop, the project to which HDFS belongs, includes components derived from Google’s MapReduce and Google File System (GFS) papers as well as related open-source projects. These include Apache Hive, a data warehouse infrastructure initially developed by Facebook and built on top of Hadoop to provide data summarization, query and analysis support, and Apache HBase and Apache Accumulo, two open-source NoSQL DBs that are modeled after the BigTable DB developed by Google and that, in the parlance of the CAP Theorem, are CP DBs. Facebook purportedly uses HBase to support its data-driven messaging platform while the National Security Agency (NSA) supposedly uses Accumulo for its data cloud and analytics infrastructure.
In addition to the native HDFS integrations offered by HBase, MarkLogic 7 and Accumulo, several NoSQL DBs can be used in conjunction with HDFS, whether they are open source and community supported or proprietary in nature, including Couchbase, MarkLogic, MongoDB and Oracle’s version of NoSQL based on the Berkeley open-source DB. As Hadoop is inherently a batch-oriented paradigm, additional DBs are needed to handle in-memory processing or real-time analysis. Therefore, NoSQL – as well as RDBMS – solution providers have developed connectors that allow data to be passed between HDFS and their DBs.
The slide above, courtesy of DataStax, illustrates how NoSQL and Hadoop solutions are transforming the way both transactional and analytic data are handled within enterprises that must manage large volumes of data both in real time, or near real time, and in post-processing, after data is updated or archived.
NoSQL DB Funding and Growth
A recent note written by Wikibon’s Jeff Kelly, Hadoop-NoSQL Software and Services Market Forecast 2012-2017, gives a good indication of how well funded and fast growing the market for RDBMS alternatives has become.
“The Hadoop/NoSQL software and services market reached $542 million in 2012 as measured by vendor revenue. This includes revenue from Hadoop and NoSQL pure-play vendors – companies such as Cloudera and MongoDB – as well as Hadoop and NoSQL revenue from larger vendors such as IBM, EMC (now Pivotal) and Amazon Web Services. Wikibon forecasts this market to grow to $3.48 billion in 2017, a 45% CAGR [compound annual growth rate] during this five-year period.” Kelly forecasts the NoSQL portion of the market to reach nearly $2 billion by 2017.
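The quoted growth rate can be checked from the two revenue figures. The short Python sketch below applies the standard CAGR formula, (end / start) ^ (1 / years) - 1, to the Wikibon numbers; the function name `cagr` is simply a label chosen for this example.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.45 means 45%)."""
    return (end_value / start_value) ** (1 / years) - 1


# $542 million in 2012 growing to $3.48 billion in 2017 (five years).
rate = cagr(542e6, 3.48e9, 5)
print(f"Implied CAGR: {rate:.1%}")
```

The result comes out at roughly 45%, in line with the figure Kelly cites.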
Kelly’s research also indicates that the top ten companies in the space, measured in amount of funding dollars, received more than $600 million over the last 5 years, with funding increasing dramatically over the last 3 years, including $177 million for 2013 thus far. The top-funded NoSQL DB companies – in order of total funding amount – include DataStax (Cassandra), MongoDB, MarkLogic, MapR, Couchbase, Basho (creator of Riak), Neo Technology (creator of Neo4j) and Aerospike.
Note: On October 4, 2013, MongoDB announced it had secured $150 million in additional funding, which would now make it the top-funded company in the space.
21 for 2020: NoSQL Innovators
As previously mentioned, there are now more than 50 vendors that have entered the NoSQL DB software and services space. As is the case with most nascent technology markets, more companies will emerge and others will buy their way into the market, fueling the inevitable surge of consolidation.
Oracle has publicly committed to its Berkeley DB open-source version of NoSQL, while IBM offers support for Hadoop and MongoDB solutions as part of its InfoSphere information management platform as well as Hadoop enhancements for its PureData System, and Microsoft supports a variety of NoSQL solutions on its Windows Azure cloud-based storage solution. Suffice it to say, the big three RDBMS vendors are pragmatic about the future of databases. Sooner or later, expect them all to make NoSQL acquisitions.
Meanwhile, here is a short list of companies anticipated to disrupt the database space over the next 5 to 7 years. They are arranged in somewhat different categories from the above NoSQL taxonomies, based more on use case within the enterprise than on data model.
This group is also distinguished by added capabilities or functionality beyond just providing a simple data store with the inclusion of analytics, connectors (interoperability with other DBs and applications), data replication and scaling across commodity servers or cloud instances.
Follow this link for brief profiles of these 21 NoSQL Innovators.
Note: Not all of these solutions are strictly NoSQL-based, including NuoDB and Starcounter, two providers that refer to their databases as “NewSQL”; and Virtue-Desk, which refers to its DB as “Associative.” All three get lumped into the NoSQL category because they offer alternatives to traditional RDBMS solutions.
Note: One could argue that other categories such as [http://en.wikipedia.org/wiki/Embedded_database Embedded Databases] could also be included. In over 20 hours of interviews, only 2 NoSQL solution providers, Oracle Berkeley DB and Virtue-Desk, mentioned embedding their databases within applications. In the case of Virtue-Desk, its solution is written entirely in Assembler and can be embedded in “any” device that has more than 1MB of memory – the DB is only 600k installed.
Note: The clear trend for non-relational database deployment is for enterprises to acquire multiple DBs based on application-specific needs, what could be referred to as software-defined database adoption.