Cloud Innovators Create a Body and Soul for Big Data

“It’s alive! It’s alive!” Thus spoke Dr. Henry Frankenstein of his creation in the iconic horror film classic from 1931, as does the main character in Mel Brooks’ 1974 comedy Young Frankenstein. It’s Alive is also the title of a little-known 1974 B horror movie about a murderous infant. Can the same be said of computers or the data and information that inhabit them?

In June 2014, a computer program, through text-based conversations, is purported to have become the first to pass a Turing Test by fooling 33 percent of humans into thinking it was a 13-year-old Ukrainian boy. The test was devised in 1950 by computer-science pioneer and World War II code breaker Alan Turing, who said that if a machine was indistinguishable from a human, then it was “thinking.” Granted, some doubt the efficacy of these test results.

Of course, the idea that humankind, by dint of desire, intellect and ingenuity, can create life from inanimate objects is not a new concept. Almost 200 years ago, Mary Shelley, a talented young English lass barely out of her teens, bestowed life on the original Frankenstein monster and its overzealous creator, spawning a series of spinoffs and imitations as a result.

An illuminating quote from Shelley’s book reads, “So much has been done, exclaimed the soul of Frankenstein – more, far more, will I achieve; treading in the steps already marked, I will pioneer a new way, explore unknown powers, and unfold to the world the deepest mysteries of creation.”

In his Pulitzer Prize–winning book, The Soul of a New Machine, Tracy Kidder refers admiringly to one of his subjects, Carl Alsing: “That was what made it fun; he could actually touch the machine and make it obey him.” Said Alsing, “I’d run a little program and when it worked, I’d get a little high, and then I’d do another. It was neat. I loved writing programs. I could control the machine. I could make it express my own thoughts. It was an expansion of the mind to have a computer.”

One of Think and Grow Rich author Napoleon Hill’s most enduring quotes is: “Whatever the mind can conceive and believe, it can achieve.” Couple that thought with man’s tendency to anthropomorphize any other living or lifeless entity, and it is not much of a stretch to imagine computers, software or data imbued with a “living” body – and a soul.

Anatomy of Big Data: Body and Soul, Quantitative and Qualitative

Perhaps it is man’s destiny, or greatest conceit, to remake virtually everything in his/her own image and, along the way, define the essence or animating principle of life. Aristotle in Book II of De Anima (On the Soul) describes soul as “the essential whatness of a body.” He adds, “The soul is inseparable from its [whole, living] body”; and “What has soul in it differs from what has not in that the former displays life.” He defines “life” as including thinking, perception, movement, self-nutrition, decay, growth and the power of sensation.

Almost all of the world’s religions recognize soul or anima – Latin for the animating principle, or psyche in Greek – regardless of whether they view only humans as possessing immortal souls or they believe, as do pantheists, that even inanimate objects such as rivers and mountains have souls and are “living.”

Tu Weiming is Director of the Institute for Advanced Humanistic Studies at Peking University and Research Professor and Senior Fellow of Asia Center at Harvard University. On the subject of “human rootedness” of Confucian thought, he identifies three characteristics, including Cheng (juhng), the state of absolute quiet and inactivity; Shen (shen), which concerns the “heavenly aspect of the soul” and its development; and Chi (jee), an “originating power, an inward spring of activity … a critical point at which one’s direction toward good or evil is set” can be identified and used to further “flourish the soul.”

Qualtism or Kwalitisme is defined as “the search for the soul in everything that surrounds us in this over and over quantified society.” The ever-expanding universe of Big Data is akin to a Mulligan stew of ingredients culled from a vast variety of sources – most are accessible through the web – from which we seek to derive structure, meaning and insight, limited only by our lack of imagination and our unwillingness to change.

Descriptive terms such as agile, aware, elastic, innovative, intelligent, intuitive and sensitive are attributed to both hardware and software solutions, further blurring the lines between animate and inanimate objects. In the recently released Spike Jonze film Her, Joaquin Phoenix’s character falls in love with an intelligent OS (operating system) that simulates a woman’s voice – no doubt, Apple’s Siri was an inspiration.

Biomedical devices replace lost limbs and organs or, in the case of cardiac pacemakers, provide electrical impulses to keep the heart pumping while simultaneously collecting and sending data to care providers. The MoMeTM System from Infobionic is “the first Cloud-based, universal patient-monitoring solution with unprecedented analytics that allows physicians to quickly and accurately diagnose and treat patients.”

Wearable “intelligent” devices are flooding the market, including smart watches that track fitness; medical devices that keep track of glucose levels and blood pressure; smart clothing and eyewear that gather and aggregate data; wearables for babies, kids and pets, all available online through Amazon and other sources.

Meanwhile, an article, Soul Searching or Data Mining; Distinctive Pathways to Health, written for a Christian Science publication questions the overuse of technology and its implications for spirituality in healthcare. “As healthcare continues to be increasingly linked with technology, data, and endless quantification, staying connected to our soulful self and reflecting upon our inner identity can bring healthy benefits. After all, all the numbers in the world don’t begin to tell the whole story of you.”

Just as digital networks and power grids feed trillions of data points to centralized computers, the body feeds data to the brain – and now increasingly to external computing devices via the Internet. Humans are truly becoming part of the Internet of Things as our bodies are in the process of becoming primary contributors to the Big Data pool. Progressively, we are being connected to and relying on intelligent devices for a variety of purposes – despite concerns about the loss of our humanity. But isn’t it human to want to create humanlike things?

Big Data and the Emergence of Cognitive Computing

Early on, computers mastered deterministic or predictive computing, commonly defined as the ability to objectively predict an outcome or result of a process due to knowledge of a cause-and-effect relationship – such as adding or subtracting numbers. In the 1970s, relational databases emerged as a tool to manage text in a much more predictable fashion, using rows and columns containing, for instance, customer names, addresses, age, sex and other structured demographic information.
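As a minimal sketch of the rows-and-columns model described above, the following uses Python’s built-in sqlite3 module with an in-memory database. The table name, column names and sample records are hypothetical and chosen only for illustration. The same query against the same rows always returns the same answer, which is the deterministic behavior the paragraph refers to.

```python
import sqlite3


def build_customer_db():
    """Create an in-memory database holding a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
    )
    # Made-up sample rows; each tuple is one row, each position one column.
    rows = [
        ("Ada Example", "Chicago", 34),
        ("Bob Sample", "Boston", 58),
        ("Cy Placeholder", "Chicago", 41),
    ]
    conn.executemany(
        "INSERT INTO customers (name, city, age) VALUES (?, ?, ?)", rows
    )
    return conn


def customers_in_city(conn, city):
    """Return the names of customers in a city, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in cursor]


conn = build_customer_db()
print(customers_in_city(conn, "Chicago"))
```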

A friend and former colleague of mine, Dr. Adrian Bowles, an expert on artificial intelligence and founder of STORM Insights, has posted a six-minute YouTube video describing neuromorphic architectures and defining cognitive computing, which is how computers “learn” to think like humans.

Below is one of Bowles’ presentation slides.

Adrian Bowles Storm Insight Cognitive Computing Framework

According to Bowles as well as an article published earlier this year by MIT Technology Review, chipmaker Qualcomm is planning to deliver a neuromorphic chip next year that will enable even smarter smart phones with sensory inputs to see, feel, read and predict outcomes.

The MIT article claims, “These ‘neuromorphic’ chips – so named because they are modeled on biological brains – will be designed to process sensory data such as images and sound and to respond to changes in that data in ways not specifically programmed. They promise to accelerate decades of fitful progress in artificial intelligence and lead to machines that are able to understand and interact with the world in humanlike ways.”

In a 2011 presentation given by two undergraduate students, James Kempsell and Chris Radnovich from Rochester Institute of Technology (RIT), on Neuromorphic Architectures, the pair concludes: “Neuromorphic Architectures will be the next major step after Von Neumann. These architectures will help realize how to create parallel locality-driven architectures – used for what the brain is good at: compressing data into information.”

Here is one of the more interesting slides from the RIT Students’ presentation.

Neuromorphic Architectures explained slide

The students also quote Dr. Leon Chua of UC Berkeley, who more than 40 years ago first envisioned the existence of the memristor, or “memory resistor,” which digitally and mechanically mimics the biology of the human brain. “Since our brains are made of memristors, the flood gate is now open for commercialization of computers that would compute like human brains, which is totally different from the Von Neumann architecture underpinning all digital computers.”

As stated on the IBM Research website, in August 2011, as part of IBM’s SyNAPSE (Systems of Neuromorphic Adaptive Plastic Scalable Electronics) project, “IBM researchers led by Dharmendra S. Modha successfully demonstrated a building block of a novel brain-inspired chip architecture based on a scalable, interconnected, configurable network of ‘neurosynaptic cores’ that brought memory, processors and communication into close proximity. These new silicon, neurosynaptic chips allow for computing systems that emulate the brain’s computing efficiency, size and power usage.”

Not to be outdone, HP unveiled the Machine at its Discover 2014 Summit in June. The genesis of the Machine is a project led by Dr. Stan Williams, an HP Senior Fellow and director of the Memristor Research group at HP Labs, which began in 2008. As reported by HP, “Williams is currently focused on developing technology that supports the concept of CeNSE: The Central Nervous System for the Earth. The idea is that nanotechnology has the potential to revolutionize human interaction with the earth as profoundly as the Internet has revolutionized personal and business interaction.”

While Qualcomm expects its neuromorphic chips to be generally available sometime in 2015, IBM’s chip and associated “new programming language to enable the development of new sensory-based cognitive computing applications” are still a few years away from mainstream use, as is HP’s Machine. In the interim, the next 5 years or so should bring some potentially transformational changes in chip fabrication and in the standard computing architecture model, aka Von Neumann’s approach.

Consider that neuroscientist Christof Koch lists the total synapses in the cerebral cortex at 240 trillion (Biophysics of Computation: Information Processing in Single Neurons, New York: Oxford Univ. Press, 1999, page 87), while IBM’s SyNAPSE project hopes to reach 100 trillion with its “human-scale” simulation at some point in the near future.

“From a pure energy perspective, the brain is hard to match,” says Stanford University bioengineering professor Kwabena Boahen who has developed Neurogrid chips to simulate brain function. “The human brain, with 80,000 times more neurons than Neurogrid, consumes only three times as much power. Achieving this level of energy efficiency while offering greater configurability and scale is the ultimate challenge neuromorphic engineers face.”
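To put Boahen’s two ratios side by side, here is a back-of-the-envelope sketch. It treats neurons per unit of power as a crude proxy for efficiency, which is an assumption made here for illustration and not a measure taken from the quote; the function name is hypothetical.

```python
def relative_efficiency(neuron_ratio, power_ratio):
    """Neurons-per-unit-of-power advantage implied by the two ratios."""
    return neuron_ratio / power_ratio


# Figures from the quote: the brain has 80,000 times the neurons of
# Neurogrid and draws three times the power.
brain_advantage = relative_efficiency(neuron_ratio=80_000, power_ratio=3)
print(f"Roughly {brain_advantage:,.0f} times more neurons per unit of power")
```

On those figures, the brain supports roughly 26,667 times as many neurons per unit of power as Neurogrid.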

Big Data Leaving the Predictive Past Behind

The advent of Big Data, much of it in less structured formats such as text in documents, emails and webpages or data derived from images, video, voice or machine data, has been accompanied by a strong desire to plumb the depths of that data for new insights. Hence, computer scientists and engineers are redoubling their efforts to reformulate the predominant computer model to replicate the way people think or, at the very least, predict how people will behave.

Most predictive models rely on past behavior or a retrospective approach. It’s ingrained in us that “People don’t change” or that “Change is hard.” And so it is with predictive models. However, suppose I made 30 roundtrips on the same airline to Chicago in 18 months but have not flown on that airline once in the last 6 months. A model that relies only on my past behavior is left confused. Was I unhappy so I changed my carrier? Was I visiting a friend or relative who has since moved? Did I switch jobs and now have a new territory? Have I retired or expired?
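The airline example can be sketched in a few lines. The monthly counts below are hypothetical, and the split (30 roundtrips over the first 12 months, then 6 idle months) is one assumed reading of the scenario. The “model” is a plain average, standing in for any retrospective approach.

```python
def average_rate(monthly_trips):
    """Average roundtrips per month over the given window."""
    return sum(monthly_trips) / len(monthly_trips)


# Hypothetical history: 30 roundtrips in 12 months, then 6 months of nothing.
history = [3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2] + [0] * 6

long_run = average_rate(history)      # looks at all 18 months
recent = average_rate(history[-6:])   # looks only at the last 6 months

print(f"Forecast from full history: {long_run:.2f} trips per month")
print(f"Forecast from recent window: {recent:.2f} trips per month")
```

The two windows give contradictory forecasts, and neither says why the behavior changed. That missing “why” is exactly what the questions above are asking.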

Companies such as Acxiom sell data about millions of people to thousands of companies looking to cross-reference data points and answer questions like those raised above. Healthcare Insurance and Financial Services companies buy data in order to fill out customer profiles – what they refer to as a “Customer 360” view. However, as many data wonks have come to realize, correlation does not imply causation.

Meanwhile, there are scores of new solutions that support cognitive and probabilistic computing models. Like humans, cognitive computing applications, also referred to as artificial intelligence, learn by experience and instruction, then they adapt. For instance, take a solution such as IBM’s Watson, which functions “by understanding natural language, generating hypotheses based on evidence and learning as it goes.”

Like a human, Watson and other cognitive solutions have intent, memory and foreknowledge, while they mimic human reasoning within specific specialized domains with variable outcomes – if trained by experts in the field. Healthcare providers use Watson to support physicians to arrive at difficult diagnoses. Unlike humans, Watson can quickly search through millions of records to find patient records that refer to common symptoms – called similarity analytics – and search through medical ontologies and thousands of research papers to provide physicians with weighted recommendations and several probable diagnoses.
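To make “similarity analytics” concrete, here is a toy sketch that ranks patient records by how many symptoms they share with a query. It uses Jaccard similarity (shared symptoms divided by all symptoms mentioned), a simple stand-in chosen for illustration. Nothing here reflects how Watson is actually implemented, and the records, symptom names and function names are all hypothetical.

```python
def jaccard(a, b):
    """Share of symptoms two records have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(query, records):
    """Return (record_id, score) pairs, most similar first."""
    scored = [(jaccard(query, symptoms), rid) for rid, symptoms in records.items()]
    # Sort by descending score, then by record id to break ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(rid, round(score, 2)) for score, rid in scored]


# Made-up records for illustration only.
records = {
    "patient_a": {"fever", "cough", "fatigue"},
    "patient_b": {"rash", "itching"},
    "patient_c": {"fever", "cough"},
}
query = {"fever", "cough", "headache"}

ranking = rank_similar(query, records)
for record_id, score in ranking:
    print(record_id, score)
```

The scores play the role of the “weighted recommendations” mentioned above: the physician sees the closest matches first, with a number attached to each, and still makes the call.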

Wall Street uses Monte Carlo methods to value and analyze complex investments by simulating thousands of random or “probable” sources of uncertainty and then determining the average value based on the range of results. Monte Carlo simulations are often used for a variety of other types of institutional and personal investments, such as modeling the future performance of a 401k investment or fixed-income instruments. Yet, Monte Carlo simulations are limited for predicting human behavior – unless you count automobile traffic-safety assessments and perhaps some other niche applications.
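A minimal Monte Carlo sketch of the 401k case follows. Every input is hypothetical: the starting balance, contribution, horizon, and the assumption that annual returns are independent draws from a normal distribution with a 6% mean and 15% volatility. Real valuation models are considerably more elaborate; this only shows the simulate-many-paths-then-summarize idea.

```python
import random
import statistics


def simulate_final_balance(rng, start, annual_contribution, years,
                           mean_return, volatility):
    """Run one random path and return the ending balance."""
    balance = start
    for _ in range(years):
        # Assumption: each year's return is an independent normal draw.
        annual_return = rng.gauss(mean_return, volatility)
        balance = balance * (1 + annual_return) + annual_contribution
    return balance


def monte_carlo(trials, seed=0, **params):
    """Simulate many paths with a seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return [simulate_final_balance(rng, **params) for _ in range(trials)]


results = monte_carlo(
    trials=5000, seed=42,
    start=50_000, annual_contribution=6_000, years=20,
    mean_return=0.06, volatility=0.15,
)

mean_balance = statistics.mean(results)
ordered = sorted(results)
low = ordered[int(0.05 * len(ordered))]    # roughly the 5th percentile
high = ordered[int(0.95 * len(ordered))]   # roughly the 95th percentile

print(f"Average ending balance: {mean_balance:,.0f}")
print(f"5th to 95th percentile: {low:,.0f} to {high:,.0f}")
```

The average is the headline number, but the spread between the low and high percentiles is usually the more useful output, since it shows how wide the range of plausible outcomes is.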

Big Data Needs a Soul

“Data without a soul is meaningless,” writes Om Malik of GigaOm in a post from March 2013. “Empathy, emotion and storytelling – these are as much a part of the business as they are of life. The problem with data is that the way it is used today, it lacks empathy and emotion. Data is used like a blunt instrument, a scythe trying to cut and tailor a cashmere sweater.”

Malik adds, “The idea of combining data, emotion and empathy as part of a narrative is something every company – old, new, young and mature – has to internalize. If they don’t, they will find themselves on the wrong side of history. What will it take to build emotive-and-empathic data experiences?”

Malik and others suggest that the ability to ask human questions, to be data-aware and data-intelligent, and to make correlations between data in context is where the future of computing ultimately needs to head.

In his recent post, The Soul-Crushing Problem with Data, Andrew Eklund makes the case that “the more and more we rely upon data to inform decisions, the less we often trust our most basic of human instincts – our gut.”

Eklund adds, “We’ve become so heavily dependent upon data that the creative process becomes hijacked from the outset. For example, there’s a big difference between ‘What does the data say about a consumer?’ and ‘What do we think the consumer needs or wants?’ The problem is that the data will rarely answer the second question, even though that is our greatest challenge.”

Despite his concerns, Eklund is a strong advocate for data in general: “Don’t get me wrong. I love data. Data is the gift that keeps on giving with regards to ongoing insights, measurement of performance, targeting and re-targeting, campaign optimization, and a thousand other amazing things wrapped up in jargon we love and overuse. But data isn’t an idea. An idea comes from within your soul.”

Matt Ballantine’s equally emotionally charged blog, Big Data: The Tyranny of the Past, suggests that “Big Data might be quite good at predicting the future (most of the time) in comparison to humans because Big Data will predict based on trends. If we could put the issue of selection bias aside, that then might make for better decision making. Except when the unexpected occurs, at which point Big Data extrapolating trends will fall on its backside. And here’s the other big challenge for Big Data – that it’s great at spotting correlations but does nothing to understand causality.”

Re-imagining Big Data

Sean Gourley, Co-Founder and CTO of Quid, an intelligence amplification platform, states: “Data science, I believe, we need to re-imagine it because data is incredibly powerful. We need to step back from the scientific notations and start thinking of it as data intelligence. Data intelligence has a slightly different philosophy that embraces some of the messy and unstructured nature of the world that we do live in.”

As Malik states in his aforementioned post, “Data needs stories. The symbiotic relationship between data and storytelling is going to be one of the more prevalent themes for the next few years, starting perhaps inside some apps and in the news media. In a future where we have tablets and phones packed with sensors, the data-driven narratives could take on an entirely different and emotional hue.”

Data in and of itself has no meaning or purpose. Data needs a human, qualitative, soulful element to enrich and enhance the quantitative component that is inherent in all static, lifeless forms of data.

Cloud: The Fundamental Building Block for Humanizing Big Data

Big Data also needs an ample, expandable body or elastic infrastructure to contain zettabytes (for now) of data. The closest thing we have today is the Cloud, aka the Internet that, in theory, encompasses all publicly available electronic data as well as data behind firewalls and contained in private networks, or private Clouds, often available to users via the Cloud.

NIST defines Cloud computing as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

Research firm IHS predicts worldwide spending on Cloud computing will reach nearly $250 billion by 2017, while in 2014, global business spending for infrastructure and services related to the Cloud will reach an estimated $174.2 billion. IDC, in its Cloud Forecast for 2014, predicts, “the Cloud software market will surpass $75B by 2017, attaining a five-year compound annual growth rate of 22% in the forecast period.”
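For readers who want to check what a 22% five-year compound annual growth rate implies, here is a small sketch of the standard compounding formula, end = start × (1 + rate)^years. It assumes the five-year period ends in 2017 and treats $75 billion as the endpoint, although IDC says the market will “surpass” that figure, so the result is only a rough floor. The function names are illustrative.

```python
def future_value(start, rate, years):
    """Value after compounding `start` at `rate` for `years` periods."""
    return start * (1 + rate) ** years


def implied_start(end, rate, years):
    """Starting value that would grow to `end` at the given rate."""
    return end / (1 + rate) ** years


# Assumption: $75B is the 2017 endpoint of a five-year period at 22% CAGR.
base = implied_start(end=75.0, rate=0.22, years=5)
print(f"Implied starting market size: ${base:.2f}B")
print(f"Check: ${future_value(base, 0.22, 5):.2f}B after five years")
```

Under those assumptions, the forecast implies a Cloud software market of a little under $28 billion at the start of the period, meaning the market would grow to about 2.7 times its size in five years.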


The Cloud phenomenon touches every business, regardless of size, and virtually every smart phone, personal tablet and device. Until such time that personal devices can store trillions of data points, the Cloud is our best route to humanizing big data. As such, the Cloud, like man, is evolving, requires care and tweaking, and needs to learn, grow and be protected from those that would do it harm.

Bottom Line

The primary Cloud infrastructure components (the Body) are Compute, Storage and Networks – and, one could argue, Databases, various forms of Security and the ability to self-replicate. Extending the metaphor, software applications orchestrate the movement of data and, more and more, re-purpose data to learn, think, reason, perceive and motivate – which some might define as the Soul, albeit one with mostly digital roots.

Regardless of one’s beliefs, the fact is that machines are getting smarter in ways that people have traditionally measured human intelligence – from a quantitative perspective. Adding qualitative, cognitive computing capabilities changes the entire thinking around the usefulness of computers – for better or worse.

Big Data, loosely defined as the sum of all electronic data and information, together with advances in Cloud computing, chip fabrication, natural language processing and machine learning, is transforming the utility of computing and challenging man’s traditional view of what constitutes “life” as we know it.


Appendix: 38 Big Data and Cloud Innovators to Look for in 2015

To be sure, there are literally hundreds of companies and thousands of individuals contributing to the evolution of the Cloud in some way, shape or form, accompanied by those delivering innovative solutions for the management of Big Data.

What follows are several examples of Big Data and Cloud innovators as well as a few taxonomy slides to help define different categories of solution providers. Representative solution and service providers include large multinationals, established midsize vendors and veritable start-ups, with a primary focus on Big Data and Cloud Computing.

* Readers interested in Cloud implementation approaches being tested and implemented by large global banks and financial services firms, which will hopefully benefit the entire FS industry and perhaps other industries as well, can review IaaS and PaaS: Mature Enough for Financial Services Firms? (FYI, Parity Research is a co-founder of the Cloud Open API Forum for Financial Services.) Vendor sponsors for the FS Cloud Forum include ActiveState, Citrix, Cloudscaling, Canonical, GigaSpaces, IBM/Softlayer, Mirantis, MuleSoft, SolidFire, SunGard Availability Services and SwiftStack.

A high-level taxonomy for Cloud Infrastructure as envisioned by IBM follows:

cloud taxonomy from IBM

NIST’s visual model or taxonomy for Cloud infrastructure follows:

cloud taxonomy from NIST

* For additional Cloud taxonomies, readers can review the Cloud Blueprint Blog authored by Ravi Kalakota.

IaaS – Infrastructure as a Service Providers

The IaaS category is packed with solution providers, from regional players that focus on specific industries such as midsize banks to large multinational providers with services that literally span the globe. Other providers specialize in supporting specific application types such as Microsoft applications and databases. Below are several representative examples.

IaaS appeals to companies of all sizes who are looking to lower the cost of their IT infrastructure. However, most large companies are reluctant to wholly embrace IaaS due to concerns that include security, availability, compliance and vendor lock-in. Vendors such as Amazon and IBM contend their IaaS offerings are more secure than their customers’ data centers. From an infrastructure standpoint, this very well may be true.

Meanwhile, most large organizations are experimenting with several IaaS providers at once, if even on a limited basis, because the potential cost, backup and disaster recovery, elasticity and, perhaps at some point, the performance and agility aspects of Cloud computing are just too compelling to ignore. Competition also induces providers to lower prices and offer incentives.

Note: Many larger IaaS solution providers offer services and capabilities across the spectrum of Cloud products, while other IaaS providers (e.g. Canonical and Cloudscaling) do not maintain their own data centers but leverage the expertise and assets of other providers.

AWS is Amazon’s industry-leading web services solution set which provides a collection of computing services that together make up their Cloud computing platform, including Amazon EC2 and S3 services. AWS has a head start on every other provider in the industry, having had the vision to begin development earlier than most of the competition. Beginning with small companies or development groups within large corporations, AWS now is focused on embracing the needs of large organizations within specific industries such as banking and financial services. While virtually every financial services company has, at minimum, a limited relationship with AWS, Amazon will need to convince the industry that it can meet their needs without sacrificing security, performance and platform agnosticism. At this point, AWS also has the largest partner ecosystem.

AWS Global Infrastructure


Canonical believes in the power of open source to change the world. “Canonical was created alongside Ubuntu to help it reach a wider market. Our services help governments and businesses the world over with migrations, management and support for their Ubuntu deployments. Together with our partners, we ensure that Ubuntu runs reliably on every platform, from the PC and the smartphone to the server and, crucially, the Cloud.” Ubuntu could not exist without its worldwide community of voluntary developers who also believe open source is critical to Cloud development. “We are committed to creating it, refining it, certifying it for reliability and promoting its use. Ubuntu Server 14.04 is the ultimate enterprise Cloud platform, both for building OpenStack Clouds and for running on public Clouds.”



Cloudscaling believes “companies across industries will require an open source IaaS solution to cost-effectively support a new generation of Cloud-native workloads and enable the DevOps model – in a multi-Cloud world. That means forward-looking organizations who are building out their Cloud computing footprint require an Infrastructure-as-a-Service solution (IaaS) that is fully interoperable – not just compatible – with the leading public Cloud services.” Its Elastic Cloud Infrastructure, built on OpenStack, “enables any IT group to deploy Cloud services comparable to the capabilities of the world’s largest and most successful public Clouds.” Cloudscaling offers its clients increased agility, less complexity and improved time to market, while promoting business and IT alignment.

Cloudscaling ocs-overview


HP, through its Helion portfolio of Cloud products and services, provides an open ecosystem with a common management structure and “integrates easily into your business through a wide range of delivery models offering an open, secure, scalable and agile Cloud environment.” Based on OpenStack, Helion’s portfolio covers the entire spectrum of hardware, software, and professional services. “Architected to work together, the stack is comprehensive no matter how you choose to build or consume your Cloud.” HP has the advantage of a large installed base of solutions that should translate well to the Cloud environment, including its Vertica analytics database, which is “fully provisioned on the Amazon EC2 or any VMware-supported platform and ready for loading within minutes.”

HP Helion Stack


IBM, since 2007 when it began working with clients on Cloud computing, has been focused squarely on making the model viable for enterprise and government clients that cannot compromise on security, compliance and availability. “IBM’s strategy for Cloud is clear: We will build Clouds for enterprise clients, and we will provide Cloud services where there are gaps we can fill.” IBM is rounding out its Cloud products and services portfolio through internal R&D and key acquisitions such as IaaS provider Softlayer, its PaaS BlueMix and DBaaS Cloudant, committing over $2 billion in the process. IBM’s recently announced Watson Developer Cloud will offer “the technology, tools, and APIs that companies need to develop and test their own cognitive applications, powered by IBM Watson’s cognitive computing capabilities.”

IBM smartcloud


Microsoft Azure is winning Cloud-based business, especially with its install base, due to effective marketing and “extremely generous” discounts and incentives. According to a recent Gartner post, “Microsoft’s comprehensive hybrid story, which spans applications and platforms as well as infrastructure, is highly attractive to many companies, drawing them toward the Cloud in general.” Meanwhile, Microsoft runs many of its own popular applications on Azure, including Skype, Office 365, Bing and Xbox. “Azure enables you to build and deploy a wide variety of applications – including web, mobile, media and line-of-business solutions. Built-in Auto Scale features enable you to dynamically scale up and down to meet any needs.” Azure also provides managed SQL and NoSQL data services and support for analytics.

Microsoft Cloud Azure


SunGard Availability Services is trusted by companies around the world to keep their IT environments continuously available – including 50% of the Fortune 500, 70% of the Fortune 50 and many others of all sizes and in all industries. “Though many know SunGard Availability Services as the pioneer of disaster recovery services, we actually provide much more than that today, from highly resilient managed Cloud and hosting for production applications and infrastructure to managed backup, recovery and business continuity services.” Split off from parent SunGard in March 2014, SunGard Availability Services has revenues of $1.5 billion and 3,000 employees worldwide, while introducing data management and Cloud-recovery services that “significantly reduce recovery times and costs for organizations managing complex, disparate, hybrid environments.”

SunGard AS Enterprise-Cloud


Xand provides Hybrid Cloud and co-location services to mostly midsize companies in the Northeastern U.S. “With Xand’s Hybrid Cloud, you can provision Cloud (virtual) and Dedicated (physical) servers on the same network. Our scalable and flexible solution gives you a pay-per-use service that can be quickly added to your environment, giving you scalable infrastructure (servers, storage, security, load balancing and network).” Xand also offers Private xCloud to customers in various industries, including finance and healthcare, with “highly-available facilities, strong levels of compliance, vast engineering resources and 24×7 monitoring. Private xCloud will take your Cloud to new levels of security and stability.” Technology partners include Cisco, VMware, Dell and Red Hat.

Xand Hybrid Cloud


Advanced Storage Solutions

This category includes dozens of suppliers of Hybrid Storage as well as solutions that leverage the speed and productivity gains inherent in all-Flash Arrays or Solid State Drives delivered by tech stalwarts Cisco, EMC, IBM and NetApp as well as relative newcomers such as PureStorage, Tegile and Violin. The two examples below are newcomers that have made a commitment to open source Cloud solutions, such as OpenStack and Apache CloudStack, and are focused on the Cloud Service Provider (CSP) market and the enterprise. (For additional information on advanced storage solutions, review Parity’s Flash Memory Summit Blog.)

SolidFire offers the “deepest Cloud integration of any storage vendor. When building large-scale public or private Cloud infrastructures, there is only one choice for block storage. SolidFire delivers the most comprehensive block storage integration with all of the industry leading orchestration software. Each integration surfaces SolidFire’s patent-pending Quality of Service (QoS) controls, allowing for complete performance automation and the development of end-user self-service tools.” SolidFire all-Flash block-based storage arrays are optimized for many open source Cloud solutions, including Citrix, CloudStack and OpenStack. It supplies IaaS providers, including SunGard AS, eBay, Endicia and CenturyLink, which are no doubt fans of SolidFire’s QoS, scale-out architecture and low total cost of ownership (TCO) compared with other all-flash solutions.

SolidFire CloudStack


SwiftStack focuses on object-based storage applications and is “built on the world’s most popular object store, OpenStack Swift, which powers the largest storage Clouds in the world. SwiftStack already powers the web’s most popular applications that you use every day – and can supercharge your enterprise private Cloud, content storage/distribution, and active archiving applications. SwiftStack places responsibility for the storage system in the software, not in specific hardware components. The SwiftStack Controller manages multiple object storage clusters and removes the heavy lifting from configuration, authentication, cluster management and capacity management. Regular alerts, reports and system stats keep you constantly updated on your storage needs.” SwiftStack collaborates with SolidFire, MongoDB and other open source-centric solution providers on enterprise projects.



PaaS – Platform as a Service Providers

PaaS is a very muddled category, as it includes every Cloud middleware, orchestration, automation or enablement solution that cannot logically or easily be called a SaaS (Software as a Service) solution or an IaaS provider. Indeed, virtually every IaaS provider has developed its own PaaS layer or is partnering with, reselling or acting as an OEM for an existing PaaS solution. AWS offers its Elastic Beanstalk, and IBM has invested heavily in BlueMix, while HP partners with a variety of suppliers, including open source–based Cloudify and ActiveState, and Cisco with SunGard AS.

A January 2014 Gartner iPaaS MQ specifically focused on enterprise integration PaaS providers includes 17 solution providers, with Dell, IBM, Informatica and MuleSoft among the top rated. In a previous research note entitled Platform as a Service: Definition, Taxonomy and Vendor Landscape, 2013, Gartner identified no fewer than 15 classes of PaaS, “each roughly mirroring a corresponding class of on-premises middleware products.” By definition, this list would not include PaaS developed exclusively for use by public CSPs such as AWS, Microsoft and others.

Gartner predicts, “Dramatic growth in the iPaaS market over the next five years due to several factors, including:

  • The explosion of CSI [Cloud services integration], MAI [mobile app integration], API and Internet of Things requirements;
  • The emergence of the agile integration approach and citizen integrators, for which traditional integration platforms are unsuitable; and
  • Adoption by SMBs so far often unwilling to embrace integration middleware because of its high cost and complexity, but now interested in iPaaS offerings due to their low entry cost and ease of use.”

The breadth of PaaS offerings is staggering, creating more confusion and uncertainty for enterprises trying to develop their own private Cloud solutions or those enterprises evaluating CSPs to align with for bursting, backup, disaster recovery or Tier 1 application hosting services – all in the name of business agility, application enablement and lower cost.

As with the other categories, the following vendors are a representative sample of PaaS solution providers, not an exhaustive list.

ActiveState is the parent company for Stackato, based on Cloud Foundry, which “makes it easy to develop, deploy, migrate, scale, manage, and monitor applications on any Cloud,” available in Enterprise, Micro Cloud, and Sandbox editions. HP Helion began an OEM relationship with ActiveState in late 2012 including Stackato as part of its PaaS portfolio for “building applications for creating private PaaS using any language on any stack on any Cloud. Additionally, enterprise IT can achieve new levels of data security, reduce time to market, save money, ensure compliance, and gain greater control over the Cloud.” Ease of use and agility are achieved by narrowing the gap between development, test and production.



Citrix CloudPlatform offers “simple, turn-key Cloud orchestration. CloudPlatform is the only Cloud orchestration platform that enables you to quickly and efficiently build a future-proofed Cloud. It is a turn-key solution based on an open and flexible architecture that is designed to run every application workload at scale and with simplicity.” In July 2011, Citrix acquired the developer of CloudStack, which is now available under an Apache open source license. Considered one of the early Cloud orchestration innovators, Citrix has morphed into CloudPlatform and claims more than 2,500 Cloud providers as users, including AWS. CloudPlatform also leverages standard AWS APIs and a rich partner ecosystem as well as offering its own Turnkey IaaS.

Citrix Cloud Architecture


Docker is an “open platform for developers and sysadmins to build, ship and run distributed applications. Consisting of Docker Engine, a portable, lightweight run-time and packaging tool, and Docker Hub, a Cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any Cloud.” Whether on-premise bare metal or data center VMs or public Clouds, workload deployment is less constrained by infrastructure technology and is instead driven by business priorities and policies. The Docker Engine container comprises just the application and its dependencies.

Docker VM vs Containers


GigaSpaces Technologies provides software middleware for deployment, management and scaling of mission-critical applications on Cloud environments through two main product lines, XAP In-Memory Computing and Cloudify. “Hundreds of Tier-1 organizations worldwide are leveraging GigaSpaces’ technology to enhance IT efficiency and performance, from top financial firms, e-commerce companies, online gaming providers, healthcare organizations and telecom carriers.” GigaSpaces recently released its 3.0 version of Cloudify with an open source community version along with a much-improved premium edition that includes an enhanced UI, plug-ins for standard tools, advanced blueprints, elastic caching and Replication as a Service, making it “even easier for enterprises to automate and manage apps and take Cloud orchestration to the next level.”

GigaSpaces Product Portfolio slide

Informatica Cloud’s integration platform-as-a-service (iPaaS) allows enterprises to “extend the capabilities of Informatica Cloud through a variety of development tools, including a REST API and a Java-powered Cloud Connector SDK. Its multi-tenant iPaaS offers benefits for enterprises, including breaking down the walls of data, process and service integration by having a common set of artifacts; increasing collaboration between integration and application developers, as well as ETL and EAI architects; and rapidly connecting to homegrown, proprietary, or legacy systems through REST or SOAP web services, as well as emerging data standards such as JSON or OData.” Informatica iPaaS also allows independent software vendors and system integrators to quickly onboard customer apps and cut the time and cost of integration projects.

Informatica iPaaS


Jelastic offers “Platform-as-Infrastructure, the integration of Infrastructure-as-a-Service and Platform-as-a-Service, delivering a scalable, manageable and highly available Cloud.” DevOps can deliver a “private or hybrid enterprise Cloud that drives down costs and increases agility. Provide your developers with an enterprise-class private PaaS that enables rapid application development and deployment without coding to proprietary APIs.” Jelastic automates horizontal and vertical scaling and provides denser packing of applications hosted by CSPs, while also facilitating dynamic sizing and live migration of containers such as Docker. Jelastic “simplifies” Private Cloud by providing an infrastructure that is “easy to deploy and simple to manage the entire enterprise stack and can install on ‘bare metal’ servers.”

Jelastic Simplified-Private-Cloud


Mirantis is the “number one pure-play OpenStack solutions provider, the most progressive, flexible, open distribution of OpenStack, combining the latest innovations from the open source community with the testing and reliability customers expect. More customers rely on Mirantis than any other company for the software, services, training and support needed for running OpenStack Cloud.” Mirantis is the only pure-play OpenStack contributor in the top five companies contributing open source software to OpenStack and supports many Global 1,000 companies such as AT&T, Cisco WebEx, Comcast, Dell, The Gap, NASA, NTT Docomo and PayPal to build and deploy production-grade OpenStack. Mirantis OpenStack is a “zero lock-in distro that makes deploying Cloud easier, more flexible and more reliable,” while OpenStack Express is an “on-demand Private-Cloud-as-a-Service” that allows users to deploy workloads “immediately.”



MuleSoft provides the most widely used integration platform for connecting any application, data source or API, whether in the Cloud or on-premises. With Anypoint™ Platform, MuleSoft delivers a complete integration experience built on proven open source technology, eliminating the pain and cost of point-to-point integration. Anypoint Platform includes CloudHub™ iPaaS, Mule ESB™ and a unified solution for API management, design and publishing. “CloudHub is the fastest way to integrate SaaS applications reliably and securely. The platform of choice for enterprise integration, CloudHub offers global availability and 99.99% up-time. Compliance with the highest security standards ensures integrations are protected wherever they run.” MuleSoft has more than 170 enterprise customers.



Managed Cloud Services Providers

Many of the above-mentioned IaaS solution providers also offer Managed Cloud Services, including IBM and SunGard Availability Services. Telcos such as AT&T, NTT and Verizon also have viable offerings in this space, which is crowded with hundreds of service providers of every ilk. The primary reason for this is that MCSPs make the transition to the Cloud much easier for most organizations.

The two examples below are MCSPs that leverage other provider networks/data centers and specialize – although not exclusively – within specific application areas (e.g. Avanade for Microsoft applications) or within specific industries (e.g. Datapipe in financial services).

Avanade’s “Private Cloud solution is built on the latest Microsoft technologies, including Windows Server 2012, System Center 2012 and Avanade’s own Cloud Services Manager software. These technologies enable IT to improve productivity, optimize operations, lower costs, tighten security and reduce your environmental footprint. Avanade also offers scalable, infrastructure-managed services that can help you rapidly cut costs. We offer a comprehensive selection of managed services including Service Desk coverage, Workplace Services, Data Center Services, Network Services and Security Services.” A joint venture between Accenture and Microsoft, Avanade seeks to accelerate the business value of Private Cloud with tailored managed services while providing clients with a flexible, agile business approach.

Avanade Cloud


Datapipe offers a single provider solution for managing and securing mission-critical IT services, including Cloud computing, infrastructure as a service, platform as a service, colocation and data centers. “Datapipe delivers those services from the world’s most influential technical and financial markets, including New York metro; Ashburn, VA; Silicon Valley; Iceland; London; Tel Aviv; Hong Kong; Shanghai and Singapore. Datapipe Managed Cloud for Amazon Web Services (MAWS) is a unique offering that combines the flexibility, scalability and power of Amazon’s world-leading Cloud platform with Datapipe’s award-winning support and managed services to power hybrid Cloud solutions with Datapipe High performance Oracle, SQL and MySQL database clusters delivered as a service over a direct connection between Amazon and Datapipe networks.” According to Netcraft, “Datapipe’s impressive connect time, 16ms, is evidence of the benefits of their globally dispersed hosting platform.” Datapipe offers a comprehensive suite of enterprise-grade managed security and compliance solutions.

Datapipe Managed Hosting


NoSQL Databases in the Cloud and Big Data

The popularity and viability of the Cloud, coupled with the explosion of Big Data, have engendered a variety of new databases that have broken the mold and may soon break the grip of the traditional relational database management systems (RDBMS) that have dominated enterprise IT shops for decades, not to mention enabled new business models and initiatives that rely on fast access to ever-larger amounts of data.

In the following graphic, Big Data stores, better known as databases, are arranged in three categories: SQL, NewSQL and NoSQL, which stands for Not Only SQL. The SQL space is dominated by IBM, Microsoft and Oracle traditional relational databases, whereas NewSQL and NoSQL have scores of newer entrants – including offerings from IBM (Cloudant) and Oracle (based on Berkeley DB).

Taxonomy of Big Data Stores_Tony Shan

Many of today’s more popular NoSQL DBs are offshoots or hybrids of DB initiatives developed by web-scale companies such as Amazon (Dynamo) and Google (Big Table) where traditional RDBMS were not suitable for “quickly” accessing large amounts of unstructured or semi-structured data such as documents, images, webpages, wikis, emails and other forms of “free” text.

As described in 21 NoSQL Innovators to Look for in 2020, there are at least four primary categories of NoSQL DBs, including Distributed, Document-Oriented, Graph DBs and In-Memory, with variances within each category. Many NoSQL DB solutions are based on open source solutions such as HBase, Cassandra, Redis or CouchDB along with Dynamo and Big Table.

Many of the almost 100 commercially available New or NoSQL DBs are optimized for the Cloud and offered by CSPs as DBaaS platforms for small companies or can be scaled out for very large private Cloud instances with multiple nodes and locations.

The following is a small sample of NoSQL DBs, including two of the most popular solutions and one relative start-up.

AtomicDB offers an “associative” DB approach that holds the promise of being able to manage exabytes of data, far exceeding the capacity of conventional DB approaches or even Hadoop and HDFS. Developed for the Navy in the 1990s for data warehousing applications, AtomicDB is now being commercialized for government, financial and healthcare use cases where the need to access exponentially growing Big Data is paramount. “The underlying AtomicDB storage engine has patent pending technology for compression, encryption and obfuscation. The technology enables very high security and massive scalability to exabyte levels. This is possible with commodity hardware and Cloud platforms not requiring expensive proprietary appliances. Parallel processing and associative processing enables complex queries over large secure data sets up to two orders of magnitude faster than alternatives.”



MarkLogic is the market share leader for enterprise NoSQL implementations and now leverages AWS to offer clients additional resources to dynamically scale, “enabling the right resources to be brought to bear even during peak times. A combination of advanced performance monitoring, programmatic control of cluster size, sophisticated data re-balancing and granular control of resources bring agility and elasticity to the data layer, whether it is on-premise or in the Cloud.” With increased interest in accessing large data sets stored in the Cloud, MarkLogic is poised to deliver Semantic Technologies, “unleashing the power of documents, data and triples so you can understand, discover and make decisions based on all kinds of information available from many sources in a single, horizontally scalable database to handle XML, text, RDF, JSON and binaries, and to query across all your information – eliminating redundancy, complexity and latency.”



MongoDB has the largest user community and partner ecosystem in the NoSQL space due to its popular open source distribution. Its DBaaS is hosted by AWS, IBM/Softlayer (a big partner and supporter), Rackspace, Cumulogic and many others. “MongoDB stores data as documents in a binary representation called BSON (Binary JSON). MongoDB Management Service (MMS) is the application for managing MongoDB, created by the engineers who develop MongoDB. Using a simple yet sophisticated user interface, MMS makes it easy and reliable to run MongoDB at scale, providing the key capabilities you need to ensure a great experience for your customers, offering automated deployment and zero-downtime upgrades, disaster recovery and continuous monitoring in the Cloud.”

MongoDB Cluster
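
To make the document model concrete, here is a minimal sketch in Python. It does not use MongoDB or a BSON encoder; JSON from the standard library stands in for BSON, and the record, field names and values are all hypothetical.

```python
import json

# Hypothetical customer record, kept as one self-contained document
# rather than as rows spread across several relational tables.
customer = {
    "_id": 1001,
    "name": "Acme Corp",
    "contacts": [
        {"type": "email", "value": "ops@example.com"},
        {"type": "phone", "value": "555-0100"},
    ],
    "tags": ["finance", "enterprise"],
}

# JSON text stands in for BSON here (an assumption for illustration);
# a real driver would encode the document to a binary form instead.
encoded = json.dumps(customer, sort_keys=True)
decoded = json.loads(encoded)

# Nested fields come back intact, with no joins needed to reassemble them.
emails = [c["value"] for c in decoded["contacts"] if c["type"] == "email"]
print(emails)
```

The point of the sketch is only that nested contacts and tags travel with the record they belong to, which is what distinguishes a document store from a row-and-table layout.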


Cloud (In)Security

Security concerns, followed by compliance and regulatory considerations, are typically the top reason enterprises give for not moving data or Tier 1 apps to the Cloud. As CSPs, PaaS and SaaS vendors point out, security solutions and policy enforcement are required at each layer and there is no one product or service provider that Parity can find today that addresses the entire stack and all of the potential gaps and vulnerabilities.

CSPs such as AWS insist their IaaS and PaaS layers are highly secure and are quick to point out that any breaches their clients have experienced were due to failures on the customer’s end, whether it be poor data encryption policies, network breaches or identity management issues. Moreover, there are few if any security policy standards for the Cloud industry for encryption or Common Criteria (NIST and ISO are exceptions but not complete frameworks for Cloud) across the entire stack as each CSP usually offers different levels of services – even within their own portfolio of products and services.

According to a senior IBM security executive Parity interviewed, “Everyone needs to start over and rethink security architecture and controls, especially secondary and tertiary levels. The risks have changed and are changing more rapidly as time goes on. Therefore, auditable controls and security must be built in upfront, not as an afterthought.” As is the case with two other giants in the enterprise security space, RSA and Symantec, IBM is busy developing and acquiring security solutions to address gaps in Cloud security.

Very recently, IBM announced the acquisition of Lighthouse Cloud Security Services to address major Cloud security gaps in Identity and Access Management (IAM) and the growing need to address global IAM deployment and management challenges as well as combining Managed Security Services and IAM. At the same time, the AWS ecosystem is attracting a number of security point solutions to address the gaps and vulnerabilities that enterprises are confronted with on their end.

Despite the fact that every Cloud-enabled solution provider addresses security at some level in the Cloud stack, enterprises need to understand where their responsibilities lie – especially at the policy level.

The following is a short list of mostly smaller Cloud security vendors, all of whom provide security analytics, assessment and/or remediation capabilities.

Evident offers a “fast, safe and simple solution for AWS security. The Evident Security Platform (ESP) performs continuous AWS security monitoring as a service and can identify and assist you in correcting problems in as little as 5 minutes. ESP for AWS identifies over 100 critical AWS security vulnerabilities across all of your AWS accounts. Security risks are color-coded and displayed on the ESP Risk Assessment Dashboard intuitively, facilitating rapid problem identification in minutes rather than days or weeks. Click on any security issue to get a comprehensive overview and expert instruction on how to remediate the problem as quickly as possible. The Evident Security Platform offers a cost-effective solution that dramatically reduces time spent finding and fixing AWS security risks, resulting in a significantly safer Amazon Cloud experience.”


Porticor offers a rich variety of Cloud encryption capabilities. “Your project’s needs and characteristics will determine the right choices for your application. The Porticor Virtual Private Data solution includes two or three major components: Porticor’s Virtual Key Management Service (PVKM) – a unique and patented key management technology provided as a service. PVKM is stronger than hardware, thanks to patented technologies such as Split-Key Encryption and Homomorphic Key Management; a Porticor Virtual Appliance (one or more for high availability), implemented inside your Cloud account; and an (optional) Porticor Encryption Agent, which may be installed and used on one or more of your Virtual Machines (your servers).” Porticor has Cloud encryption and key management solutions for AWS and VMware and supports compliance requirements such as HIPAA. The diagram below represents an overview of the deployment options.

Porticor cloud-encryption-deployment-scenarios
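
The general idea behind splitting a key between two parties can be sketched in a few lines of Python. This is a generic XOR-based secret split, offered only as an illustration; it is not Porticor's patented Split-Key Encryption or Homomorphic Key Management, and the function and variable names are hypothetical.

```python
import secrets
from typing import Tuple


def split_key(master_key: bytes) -> Tuple[bytes, bytes]:
    """Split a key into two shares; neither share alone reveals the key."""
    share_a = secrets.token_bytes(len(master_key))  # uniformly random share
    share_b = bytes(m ^ a for m, a in zip(master_key, share_a))
    return share_a, share_b


def combine_shares(share_a: bytes, share_b: bytes) -> bytes:
    """Recover the original key by XOR-ing the two shares together."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))


# Hypothetical usage: one share stays with the customer, the other with
# a key management service, so both are needed to rebuild the key.
master_key = secrets.token_bytes(32)
customer_share, service_share = split_key(master_key)
recovered = combine_shares(customer_share, service_share)
print(recovered == master_key)
```

Because one share is random and the other is the key XOR-ed with it, holding a single share gives no information about the key, which is the property that makes a split between customer and provider attractive.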


Splunk provides operational intelligence apps for AWS Cloud that “ensure adherence to security and compliance standards and [allows users to] gain visibility into AWS billing and usage. Splunk Enterprise transforms machine data into real-time operational intelligence. It enables organizations to monitor, search, analyze, visualize and act on the massive streams of machine data generated by the websites, applications, servers, networks, mobile and other devices that power the business. Splunk Enterprise is available as a free download on your laptop; then deploy it to your data center or Cloud environment. This machine-generated data holds critical information on user behavior, security risks, capacity consumption, service levels, fraudulent activity, customer experience and much more. This is why it’s the fastest-growing, most complex and most valuable segment of Big Data.”



Analytics, Data Integration and Visualization Tools

Analytics and data integration capabilities are core offerings of virtually every Big Data vendor and CSP, whether those offerings are inherently homegrown and proprietary, based on open source solutions or a hybrid approach that provides a combination of tools and relies on partners with specialized capabilities. The hybrid approach is certainly in play, more or less, with all the major enterprise grade CSPs, including AWS, HP, IBM and Microsoft.

With the advent of the Internet of Things, Social Media, various forms of electronic messaging (e.g. documents, email, IM), streaming media and video, images, voice, web pages, wikis and other forms of unstructured or semi-structured data combined with structured data from traditional RDBMSs and core systems such as ERP (e.g. SAP and Oracle) or CRM solutions, data sources have become much more diverse.

All of this data diversity as well as large volumes of Big Data pose challenges and opportunities for organizations of all sizes. The promise of the Cloud and the thousands of solution providers who create the “magic” is access to a vast ocean of information and potential insight, whether publicly available or contained in an organization’s own private “Data Lake.” This democratization of data access is achieved through the combined efforts of a diverse Cloud ecosystem that enables small companies or divisions of large organizations to quickly adopt IT services to meet new and evolving business requirements.

What follows is a small sampling of vendors that offer analytics, data integration and visualization tools to help users manage and make sense of large volumes of data in the Cloud:

Cloudera Enterprise allows “leading organizations to put their data at the center of their operations, to increase business visibility and reduce costs, while successfully managing risk and compliance requirements along with robust security, governance, data protection, and management that enterprises require.” According to senior execs at Cloudera, “Customers have been running Cloudera’s Hadoop distribution (HDFS) in mission-critical environments, 24/7 for the past 3 years, thus proving that Hadoop is enterprise ready.” Cloudera offers “one massively scalable platform to store any amount or type of data, in its original form, for as long as desired or required; integrated with your existing infrastructure and tools; and flexible to run a variety of enterprise workloads – including batch processing, interactive SQL, enterprise search and advanced analytics.” Cloudera is the primary contributor to Apache-licensed, open source Impala, “the industry’s leading massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop.”

Cloudera diagram-edh-2014-620x361


Crawford Technologies transforms print stream data that contains vital customer information, providing additional source data for enterprise data warehouses and analytics solutions. Crawford’s suite of software solutions and services “enable clients to meet even the most rigorous demands for instantaneous access to information, irrespective of current, legacy or future standards in infrastructure or document output. High-value document solutions streamline, improve and manage enterprise content, document archive and production and advanced output management. Our Document Accessibility Services (DAS) produces alternate customer communications formats for blind and partially sighted customers, as well as those unable to read traditional print. Conversions include PDF/UA, Braille, large print, audio and e-text. Organizations can quickly and automatically transform monthly banking, credit card or investment statements to meet the ever-growing needs of an under-served portion of the population.” Crawford solutions extract customer-facing data to support financial services compliance requirements. Key partners include EMC and IBM.

Crawford Insurance example


Hortonworks is “architected, developed and built completely in the open. Hortonworks Data Platform (HDP) provides Hadoop designed to meet the needs of enterprise data processing. HDP is a platform for multi-workload data processing across an array of processing methods – from batch through interactive to real-time – all supported with solutions for governance, integration, security and operations. As the data operating system of Hadoop 2, Apache Hadoop YARN breaks down silos and enables you to process data simultaneously across batch, interactive and real-time methods. YARN has opened the Data Lake to rapid innovation by the ISV community and delivered differentiated insight to the enterprise. Increasingly, enterprises are looking for YARN-ready applications to maximize the value of its data. As key architects of YARN, we cherish open architecture and are committed to collaborate with ISVs and partners to onboard their applications to YARN and Hadoop.”


LogiAnalytics’ mission is to bring a self-service, Google-like experience to non-programmers by providing a code-free development environment that includes “our unique ‘Elemental Design’ approach, which provides hundreds of pre-built modules to speed design and development of your BI applications. This provides increased agility and flexibility, freeing your organization from long implementation cycles and extensive re-coding. Our architecture makes it easy to build tailored applications to deliver views of any kind of information, customized to each user. Drive interactivity with employees, partners, customers or the public in a way that serves your business needs. The design of our modular architecture and flexible API allows enterprise business intelligence functions to be embedded directly inside of existing enterprise applications, delivering critical information where users can act on it, dramatically speeding both adoption and deployment. Our write-back functionality integrates BI completely within your operational infrastructure.” Major partners include HP Vertica and MongoDB.

LogiAnalytics Infused Analytics


Pentaho “data integration prepares and blends data to create a complete picture of your business that drives actionable insights. The complete data integration platform delivers accurate, ‘analytics-ready’ data to end-users from any source. With visual tools to eliminate coding and complexity, Pentaho puts Big Data and all data sources at the fingertips of business and IT users alike. Within a single platform, our solution provides visual Big Data analytics tools to extract, prepare and blend your data, plus the visualizations and analytics that will change the way you run your business. Pentaho’s flexible Cloud-ready platform is purpose-built for embedding into and integrating with your applications. Our powerful embedded analytics combined with flexible licensing and a strong partner program [e.g. Cloudera, MongoDB, HP Vertica, Hortonworks] ensures that you can get to market quickly, drive new revenue streams and delight your customers.”

Pentaho evolving-big-data-architectures


SmartLogic’s “content intelligence platform, Semaphore, uses semantics, text analytics and visualization software technologies to perform five primary functions: Ontology and taxonomy management; Auto-classification of unstructured data; Text analysis (including entity, fact and sentiment extraction); Metadata management; and Content visualization. The resulting metadata drives an array of business-critical tasks, including semantic user experience for search, text analytics, workflow processes driven by meaning, regulatory compliance involving unstructured content, automatic classification for content management, decision support using information locked-up in content. Semaphore classifies content using semantic models such as taxonomies and ontologies. It automatically generates metadata that represents each classification decision. Semaphore is tightly integrated with content management systems, business intelligence systems and enterprise search engines and offers interoperability with graph databases and triple stores.” Semaphore integrates with many existing systems and can be customized to work with enterprise solutions such as SharePoint and MarkLogic.
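
As a rough illustration of taxonomy-driven auto-classification, the Python sketch below tags a piece of text with concepts and records which terms triggered each tag. It is plain keyword matching, far simpler than the semantic models described above, and it is not Semaphore's method or API; the taxonomy, terms and function name are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical two-concept taxonomy: concept -> trigger terms.
TAXONOMY: Dict[str, List[str]] = {
    "Compliance": ["hipaa", "audit", "regulation"],
    "Storage": ["flash", "array", "archive"],
}


def classify(text: str) -> Dict[str, List[str]]:
    """Return metadata mapping each matched concept to the terms found."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    metadata: Dict[str, List[str]] = {}
    for concept, terms in TAXONOMY.items():
        matched = [term for term in terms if term in words]
        if matched:
            # Keep the evidence so each classification decision is traceable.
            metadata[concept] = matched
    return metadata


doc = "The audit found that the flash array met HIPAA requirements."
tags = classify(doc)
print(tags)
```

Keeping the matched terms alongside each concept mirrors the idea of metadata that represents each classification decision, so a downstream search or compliance workflow can see why a document was tagged.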



Syncsort has effectively leveraged its legacy mainframe solutions business to provide mainframe users with the ability to extract Big Data for ingestion into Hadoop EDW environments. While mainframes are a good source of Big Data, mainframes do a lot of sorting at a comparatively high cost. “Syncsort maximizes Hadoop’s full potential by tapping into any data source and target, including mainframes. No coding, no scripting, just faster connectivity. Syncsort DMX is full-featured data integration software that helps organizations extract, transform and load more data in less time, with less money and fewer resources. DMX processes data 10x faster than other data integration solutions, easily scales to support growing data volumes, eliminates manual coding and performance tuning and lowers data integration TCO by up to 65%. Smarter Data Integration means faster data processing with less hardware and IT resources to harness the opportunities of Big Data.”

Syncsort DMX Architecture


Agile Application Transformation and Development Tools

The two examples listed below are arguably very similar to several of the vendors in the prior Analytics, Data Integration and Visualization category and share some elements of PaaS. What they have in common is that they allow organizations to migrate legacy applications, along with the associated business logic, to the Cloud. This class of solution offers a higher level of self-service capabilities for both IT and business users by limiting the amount of hand coding necessary, thereby shrinking application development and deployment cycles by several weeks.

Appian modern work platform combines the “best of business process management, social business, mobile access and Cloud deployment. Business Process Management Software in the Cloud enables strategic process improvement, reduced technology cost, and better alignment of IT with business goals. The new IT paradigm and business model can drive new growth opportunities, increase profit margins for the private sector and achieve more efficient and effective missions for federal agencies. The entire Appian BPM Suite is available as either an on-premise or Cloud offering. The Appian BPM Suite delivered in the Cloud is a secure, scalable and reliable way to create a more nimble and adaptive organization. The benefits of deploying Appian in the Cloud include low startup costs, fast deployment with no manual maintenance, predictable costs during the life of the application and fast return-on-investment.”

Appian Mphasis-Hybrid-Agile-Method


OutSystems is designed to appeal to the “citizen programmer” and provides the “only open, high-productivity application platform (PaaS) that makes it easy to create, deploy and manage enterprise-mobile and web applications – helping IT deliver innovative business solutions fast. OutSystems Platform enables rapid delivery of beautiful applications for all devices and empowers IT to attack changing business requirements.” The “High Productivity Platform” can run on a desktop, tablet or smart phone and is available as a public Cloud, private Cloud and on-premises solution. The OutSystems Platform allows programmers to “focus on building a product, not scaffolding, helping organizations get to market faster and better” by generating C# (Sharp), HTML, referential database code or any other code or artifacts necessary to make applications work while also tying data to existing data sources.



Cognitive Computing in the Cloud

This is yet another emerging class of solutions for building applications in the Cloud that take advantage of neural networks and other advances in artificial intelligence (AI) and cognitive computing technologies. Suffice it to say that virtually all the major application development vendors, including HP (IDOL) and IBM (Watson), are using some form of cognitive computing. The three vendor examples here are all very small – which is where much of the innovation is expected to come from in the future.

Cortica is an intelligent visual search tool for the web founded in 2007 by a team of Technion researchers who “amazed by the computational capabilities of biological systems, reverse-engineered the Cortex to enable a new generation of biologically inspired computational technologies. Despite the major technological breakthroughs of recent decades, basic tasks such as image recognition, movement in the physical world and natural language understanding – easily performed by toddlers – are practically impossible for even the most advanced computers. The essence of Cortica’s Image2Text® technology lies in its ability to automatically extract the core concepts in photographs, illustrations and video and map these concepts to keywords and textual taxonomies. Images drive interest, sentiment and commercial intent, and Cortica’s technology reads and automatically associates images with relevant content in real time. This groundbreaking model gives our partners a completely new way to engage a highly targeted mass audience.” Cortica works with phones and wearables.

Cortica Image recognition


Ersatz Labs “aims to make deep neural network technology available to the masses. Ersatz is the first commercial platform that packages state-of-the-art, deep-learning algorithms with high-end GPU number crunchers for a highly performant machine-learning environment. We sell it in the Cloud or in a box. Ersatz is the easiest way for anyone to start building applications that take advantage of deep neural networks. With Ersatz, you can focus less on feature engineering and more on collecting as much data as you can, even if it is unlabeled. One of the most powerful benefits of deep neural networks is their apparent ability to extract powerful features from your data automatically. The formatting of your data may be task specific, but in general, the following tasks are supported: classification, clustering, feature extraction, dimensionality reduction, time series prediction, regression, and sample generation.”



Narrative Science introduced Narrative Analytics, “a new approach to automated communication that starts with the story. The story drives a set of communication goals that reflect what needs to be said. These, in turn, define the analysis that needs to be performed in order to get the facts that support the story. The result is a complete set of requirements to drive the creation of the narrative. This linkage between story, analytics and data – the core tenet of Narrative Analytics – is fundamental in making sure you say the right thing at the right time. Quill is a patented AI platform that goes beyond merely reporting the numbers – it gives you true insight on your data. Quill uncovers the key facts and interesting insights in the data and transforms them into natural language produced at a scale, speed and quality only possible with automation.”

Narrative Science: What Is Quill


End Note:  For a peek into how large Banks and Financials are looking to collaborate on Cloud de facto standards, reference architectures and taxonomies click on the following link. IaaS and PaaS: Mature Enough for Financial Services Firms?


Posted in Big Data, Cloud Computing, Information Management Thought Leadership

Meeting Cloud Automation and Orchestration Challenges with Cloudify from GigaSpaces

A recent Gartner study projected that through 2015, “80 percent of outages impacting mission-critical services will be caused by people and process issues, and more than 50 percent of those outages will be caused by change, configuration, release integration and handoff issues.”[i]

In addition, a recent survey conducted by the Ponemon Institute determined that “the average cost of data center downtime across industries was approximately $7,900 per minute (a 41-percent increase from the $5,600 in 2010).”[ii] VentureBeat editor-in-chief Dylan Tweney calculated that Google lost $545,000 when an outage took all of its online services down for 5 minutes in August 2013.[iii]
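The arithmetic behind these downtime figures can be checked in a few lines. This is only a worked example using the numbers quoted above; the function and variable names are illustrative.

```python
def percent_increase(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


def downtime_cost(cost_per_minute: float, minutes: float) -> float:
    """Total cost of an outage at a flat per-minute rate."""
    return cost_per_minute * minutes


cost_2010 = 5_600    # Ponemon average cost per minute in 2010
cost_recent = 7_900  # Ponemon average cost per minute in the more recent survey

# Confirms the quoted 41-percent increase.
print(round(percent_increase(cost_2010, cost_recent)))  # 41

# One hour of downtime at the average rate.
print(downtime_cost(cost_recent, 60))  # 474000

# Google's reported loss, expressed per minute.
google_per_minute = 545_000 / 5
print(google_per_minute)  # 109000.0
```

By these figures, the Google outage cost roughly 14 times the cross-industry average per minute.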

With an increasingly large portion of business being transacted online, the impact of downtime on a company’s bottom line can be significant. With the move to Cloud and SaaS delivery models, both customer-facing applications and an organization’s entire IT infrastructure are at risk. Moreover, a worst-case scenario would be different teams running different tools for each layer – a sort of “anti-DevOps” approach. No doubt, there are still siloed IT organizations where multiple, redundant and incompatible tools are being deployed.

Chris Wolf, a research vice president at Gartner, in a recent posting states, “One of the great myths today is that there is all of this centralized hybrid Cloud management happening – for the most part, it doesn’t exist in terms of what folks are actually doing. In nearly every case where we see a hybrid Cloud environment, the customer is using separate sets of tools to manage its public and private Cloud environments. There just truly isn’t a ‘single pane of glass’ today; that’s the problem.”[iv]

While Wolf’s analysis might well reflect the current state of enterprise Cloud, there is no doubt that organizations building their own Cloud environments will surely look to standardize on automation and orchestration tools and processes that offer the most flexibility and speed of deployment capabilities while maintaining alignment with business goals. GigaSpaces developed Cloudify specifically to address these critical end-user requirements.

Cloud Journey

Cloudify enables customers to onboard and scale any app, on any Cloud, with no code changes, while maintaining full visibility and control.

GigaSpaces CTO Nati Shalom, in a recent blog post entitled “Eight Cloud and Big Data Predictions for 2014”, suggests, “Orchestration and automation will be the next big thing in 2014. Having said all that, the remaining challenge of enterprises is to break the IT bottleneck. This bottleneck is created by IT-centric decision-making processes, a.k.a. ‘IaaS First Approach,’ in which IT is focused on building a private Cloud infrastructure – a process that takes much longer than anticipated when compared with a more business/application-centric approach.”

In addition, Shalom states, “One of the ways to overcome that challenge is to abstract the infrastructure and allow other departments within the organization to take a parallel path towards the Cloud, while ensuring future compatibility with new development in the IT-led infrastructure. Configuration management, orchestration and workflow automation become key enablers in enterprise transition to Cloud, and will gain much attention in 2014.”

The Emergence of DevOps

Shalom also points to the emergence of DevOps, a combination of development and IT operations, as a “key example for a business-led initiative that determines the speed of innovation and, thus, competitiveness of many organizations. The move to DevOps forces many organizations to go through both cultural and technology changes in making the business and application more closely aligned, not just in goals, but in processes and tools as well.”

Cloudify Automation Slide

Cloudify Unifies the Cloud Stack

Cloudify is a Cloud orchestration platform developed by GigaSpaces that allows any application to run on any Cloud, public or private, with no code changes. More than three years ago, GigaSpaces anticipated the need for higher level tools to accelerate the onboarding and migration of mission-critical applications to the Cloud.

Since its release, Cloudify has been adopted by many large organizations – including multi-national financial institutions – as a de facto standard for bridging the gap between the IaaS layer and the application layer. With Cloudify, it is now possible to adopt a Cloud automation and orchestration framework that provides IT organizations, systems integrators, software developers and application owners with the ability to quickly deploy applications securely with standard interfaces (APIs).

Cloudify is designed to bring any app to any Cloud, enabling enterprises, ISVs, and managed service providers alike to quickly benefit from the Cloud automation and elasticity that organizations need today. Cloudify helps users maximize application onboarding and automation by externally orchestrating the application deployment and runtime.

Cloudify’s DevOps approach treats infrastructure as code, enabling users to describe deployment and post-deployment steps for any application through an external blueprint, which users can then take from Cloud to Cloud, unchanged. 

Cloudify is now available as an open source solution under the Apache license or through GigaSpaces directly for those organizations looking for a premium services package. Cloudify and GigaSpaces work with many other open source Cloud automation and orchestration tools such as OpenStack’s Heat as well as Chef and Puppet. Cloudify also enables applications to migrate to and from OpenStack, HP’s Cloud Services, Rackspace, AWS, CloudStack, Microsoft Azure and VMware.

TOSCA (Topology and Orchestration Specification for Cloud Applications) from OASIS is an open standard that works to “enhance the portability of Cloud applications and services.” The goal of TOSCA is to enable cross-Cloud, cross-tools orchestration of applications on the Cloud. In the 3.0 version of Cloudify, GigaSpaces is working on putting TOSCA into the mix and using its concepts as a canonical application model.

Cloudify uses an orchestration plan, or blueprint, that is inspired by TOSCA. The blueprint contains an application topology model: IaaS components, middleware components and application components. For each of these elements, Cloudify describes the component lifecycle and dependencies with other components (dubbed relationships). In addition, each node defines a set of policies that allow Cloudify to enforce application availability and health.

Cloudify translates these topologies into real, managed installations by running automation processes described in the blueprint workflows. These workflows trigger the lifecycle operations implemented by the Cloudify plugin, which uses different Cloud APIs as well as tools such as Chef, Puppet and others.
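The relationship between a topology model and an install workflow can be sketched roughly as follows. This is not Cloudify's blueprint syntax or plugin API: the dictionary layout, the node names, the `LIFECYCLE` operations and the `install_plan` function are all assumptions made for illustration. The sketch only shows how dependencies between components determine the order in which lifecycle operations run.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical blueprint: each node names its layer and the nodes it depends on.
blueprint = {
    "vm": {"layer": "iaas", "depends_on": []},
    "database": {"layer": "middleware", "depends_on": ["vm"]},
    "app_server": {"layer": "middleware", "depends_on": ["vm"]},
    "web_app": {"layer": "application", "depends_on": ["app_server", "database"]},
}

# Assumed lifecycle operations, run in this order for every node.
LIFECYCLE = ("create", "configure", "start")


def install_plan(blueprint: dict) -> list[tuple[str, str]]:
    """Return (node, operation) steps so every node is installed after its dependencies."""
    graph = {name: set(node["depends_on"]) for name, node in blueprint.items()}
    order = TopologicalSorter(graph).static_order()
    return [(name, op) for name in order for op in LIFECYCLE]


for node, operation in install_plan(blueprint):
    # A real orchestrator would call a Cloud API or a tool such as Chef or Puppet here.
    print(f"{operation:>9}  {node}")
```

Because the plan is derived from the model rather than hand-written, the same description can in principle be replayed against a different Cloud by swapping what each operation calls.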

OpenStack: The Open Source Alternative to AWS

“Deploy, manage and scale” is a mantra for rapid application delivery in the Cloud. As previously mentioned, AWS and other CSPs accomplish this through standardization on a single IaaS platform. Three years ago, AWS competitor Rackspace, along with NASA, introduced the OpenStack initiative – essentially as an open source alternative to AWS – which now has more than 50 IT vendors and CSPs actively supporting and participating in the community, including AT&T, Cisco, Dell, HP, IBM, Intel, NetApp, Red Hat, SUSE, VMware and Yahoo!.

OpenStack is a Cloud operating system acting as an IaaS platform to control large pools of compute, storage and networking resources throughout a data center. Released in September 2012 as open source software under the Apache license, OpenStack has become the most deployed IaaS platform in the world, embraced by thousands of service providers, government agencies, non-profit organizations and multinational corporations. The OpenStack community just announced Havana, the 8th release of its software for building and supporting public, private and hybrid Cloud application infrastructure.

A recent survey conducted for Red Hat by IDG Connect shows 84 percent of enterprise IT decision makers surveyed say that OpenStack is part of future private Cloud plans. “Sixty percent of survey respondents indicated they are in the early stages of their OpenStack deployments, and have not yet either completed the implementation stage or are early in the process. Survey respondents cited management visibility (73 %); deployment speed (72 %); platform flexibility (69 %); better agility (69 %); and competitive advantage (67 %) as the unique benefits offered by OpenStack over private Cloud alternatives.”[v]

In early 2013, to keep pace with the innovation coming out of the open source Cloud community, Amazon introduced OpsWorks to provide AWS customers with a “powerful end-to-end platform that gives you an easy way to manage applications of nearly any scale and complexity without sacrificing control.” While OpsWorks is not an open source product, so far Amazon does not charge extra for it. OpsWorks does however support a few open source solutions such as Chef version 11 and open source development languages such as Java, PHP and Ruby on Rails along with open source databases MySQL and Memcached.

It remains to be seen whether a wide swath of customers will embrace OpsWorks – even within the AWS framework – because a dedicated, proprietary AWS solution that locks companies and their applications into the Amazon Cloud will likely have limited appeal for many organizations. Large enterprises and some SMBs not only need a solution that manages Cloud infrastructure resources and interactions with users, they also need portability across private and hybrid (private/public) Cloud platforms.

Cloudify User Insights

A large multinational financial services company that has implemented Cloudify offers the following insights.

  • Setup usually takes quite a bit of time when there are no best practices and no structure to follow. Cloudify allowed us to model this process in a very well defined way using application blueprints. When dealing with hundreds or even thousands of applications that we might want to migrate to the Cloud, this can be quite a timesaver.
  • Regarding ongoing operations, when an application or component fails, there is the cost of downtime and the cost of bringing the application back up again. Cloudify does a few things for us to avoid such situations.
  • Cloudify supports proactively scaling out an application when load increases, thus avoiding downtimes caused by overload.
  • Cloudify enables auto-healing application components upon failure, thus minimizing the Mean Time to Recovery (MTTR) when an application fails.
  • For enterprises like ours, with private Clouds, the ability for each application to consume only the amount of resources that it needs at any given point in time means significant improvement in our data center utilization and cost savings.
  • Development and testing in many cases requires replicating complete environments, which is very costly without a good automation framework. Cloudify allowed us to set up an entire application with all of its supporting resources at the click of a button. In fact, we have a few users that are using Cloudify only for that.
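The scale-out and auto-healing behavior described in these insights can be pictured as a simple policy check. The sketch below is not Cloudify's policy engine: the `Instance` class, the `evaluate` function and the threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One running application component (hypothetical model)."""
    name: str
    healthy: bool
    load: float  # utilization between 0.0 and 1.0


def evaluate(instances, scale_out_at=0.8, scale_in_at=0.3, min_instances=1):
    """Return the actions a policy would request for the current state."""
    # Auto-healing: every failed instance gets a heal action.
    actions = [("heal", i.name) for i in instances if not i.healthy]
    live = [i for i in instances if i.healthy]
    if not live:
        return actions
    average = sum(i.load for i in live) / len(live)
    if average > scale_out_at:
        # Add capacity before overload causes downtime.
        actions.append(("scale_out", 1))
    elif average < scale_in_at and len(live) > min_instances:
        # Release capacity the application no longer needs.
        actions.append(("scale_in", 1))
    return actions


state = [
    Instance("web-1", healthy=True, load=0.90),
    Instance("web-2", healthy=False, load=0.0),
    Instance("web-3", healthy=True, load=0.85),
]
print(evaluate(state))  # [('heal', 'web-2'), ('scale_out', 1)]
```

The scale-in branch reflects the utilization point above: an application that consumes only what it needs at a given moment frees data center resources for other workloads.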


There is no doubt that enterprises are looking for additional deployment support when it comes to implementing private and public Cloud solutions. The benefits of freeing up internal infrastructure and enabling DevOps to bring applications to market more quickly are unquestionably real. What is not so clear is which path to the Cloud will provide the least friction and the fastest time to value.

Cloudify is winning converts in the banking, finance, retail, hospitality and telecom industries as well as gaining credibility with partners such as IBM, HP and Alcatel-Lucent when customers require native OpenStack and multi-Cloud support. A framework and approach that touches all the Cloud layers unifying the Cloud stack while simultaneously simplifying DevOps is a compelling combination of capabilities.

Enterprises looking to “Cloudify” their mission-critical applications will be hard pressed to find a more comprehensive, innovative, intuitive, open-source, community-supported solution than GigaSpaces has created in Cloudify.

This post is an excerpt from a whitepaper entitled Cloudify from GigaSpaces: Delivering Mission-Critical Applications to the Cloud at the Speed of Business. The full report can be viewed through this link

Posted in Big Data, Cloud Computing, Information Management Thought Leadership

Legal Tech Memo 2014: Driving Efficiencies and Mitigating Risk in the Era of Big Data

Now that the Information Governance message has finally gotten through to most every CIO and General Counsel, not to mention the vendor community, what’s next?

Over successive Legal Techs (this year’s LTNY 2014 was my sixth in a row), information management related technology and services have gotten progressively more sophisticated. Some might argue they have also become easier to use and integrate into the enterprise.

Meanwhile Technology Assisted Review (TAR), Predictive Coding and Early Case Assessment (ECA) have all become standard tools in the ediscovery arsenal – and accepted by most courts today as viable options to a heretofore much more labor intensive and costly manual review process.

In the right hands, this new generation of ediscovery/information governance tools can be leveraged throughout an enterprise corporate legal department and beyond to other departments including compliance, records management, risk management, marketing, finance and lines of business.

However, one of the great challenges for any large corporation or organization today remains how best to address the deluge of Big Data. With a variety of technology and service provider offerings and methodologies to choose from, what is the best approach?

Big Data Comes of Age

In 2009, Analytics, Big Data and the Cloud were emerging trends. Five years later, Big Data is an integral part of the business fabric of every information-centric enterprise. Analytic tools and their variants are essential to managing Big Data.

The Cloud is fast becoming the primary Big Data transportation vehicle connecting individuals and organizations to vast amounts of compute power and data storage – all optimized to support a myriad of popular applications from Amazon, Facebook, LinkedIn and Twitter to so-called Software as a Service (SaaS) applications such as Google Analytics.

At the same time, eDiscovery requests have increased dramatically. Better tools along with more data are a recipe for a spike in data access requests, whether triggered by outside regulatory agencies, an increased number of lawsuits or internal requests from across the enterprise to leverage electronic data originally captured to meet regulatory compliance requirements.

Big Data Spike in Financial Services: Example

A Fortune 500 financial services firm I am familiar with has seen the number of internal ediscovery requests jump from 400 a year, in 2004, to 400 a month in 2013 – and counting. While changes to FRCP helped to accelerate the pace of ediscovery activities, the firm offers variable annuities and mutual funds; therefore, the SEC also regulates it.

Thus, the company is required by Rule 17a-4 to retain every electronic interaction between its 12,000 agents and their clients and prospects. Every month or so, it adds 100 million more “objects” to its content archive, which now exceeds 3 billion emails, documents, instant messages, tweets and other forms of social media.

From this collection of Big Data, larger highly regulated firms, in particular, are compelled to carve out portions of data for a variety of purposes. Most of these larger organizations have learned to rely on a mix of in-house technologies and service providers to analyze, categorize, collect, cull, move, secure and store mostly unstructured or semi-structured data, such as documents and emails, that do not fit neatly into the structured world of traditional relational database management systems.

People and Process First, Then Technology

A recent ZDNet article on Analytics quoting Gartner analysts suggests, “Easy-to-use tools does not mean it leads to better decisions; if you don’t know what you’re doing, it can lead to spectacular failure. Supporting this point, Gartner predicts that by 2016, 25 percent of organizations using consumer data will face reputation damage due to inadequate understanding of information trust issues.”

In other words, great technology in inexperienced hands can lead to big trouble including privacy breaches, brand damage, litigation exposure and higher costs. When meeting ediscovery demands is the primary goal, most large organizations have concluded that acquiring the right combination of in-house technologies and outside services that offer specific subject matter expertise and proven process skills is the best strategy for reducing costs and data volumes.

Responsive Data vs. Data Volume: Paying by the Gig

The time-honored practice of paying to store ESI (electronically stored information) by volume in the age of Big Data is a budget buster and non-starter for many organizations. Most on-premise ediscovery tools and appliances as well as Cloud-based solutions have a data-by-volume pricing component. Therefore, it makes perfect sense to lower ESI or data volumes to lower costs.

However, organizations run the risk of deleting potentially valuable information or spoliating potentially responsive data relevant to a litigation or regulatory inquiry. This retain-or-delete dilemma favors service and solution providers whose primary focus is offering tactical, put-out-the-fire approaches to data management that worked well five or more years ago but not in today’s Big Data environment.

Big Data Volumes Driving Management Innovation

The pulse benchmarks launched last November by Kroll Ontrack indicate, “Since 2008, the average number of source gigabytes per project has declined by nearly 50 gigabytes due to more robust tools to select what is collected for ediscovery processing and review.”  In Kroll’s words, the decline is the result of “Smarter technology, savvier litigants and smaller collections.”

But technology innovation and smarter litigants tell only a portion of the story. The larger picture was revealed during LTNY’s Managing Big Data Track panel sessions moderated by UnitedLex President Dave Deppe and Jason Straight, UnitedLex’s Chief Privacy Officer.

Deppe’s panel included senior litigation support staff from GE and Google. Several key takeaways from the session include:

  • A long-term strategy for managing the litigation lifecycle is critical and goes well beyond deploying ediscovery tools and services. Acquiring outside subject matter expertise to mitigate internal litigation support bandwidth issues and provide defensibility through proven processes is key.
  • Price is not the deciding factor in selecting service providers. Efficiency, quality and a comprehensive, end-to-end approach to litigation lifecycle management are more important.
  • The relationship between inside and outside counsel and providers needs to be managed effectively. Good people (SMEs) and process can move the dial farther than technology, which often does not make a difference – especially in the wrong hands.
  • Building a litigation support team that includes a panel of key partners, directed by a senior member of the internal legal staff can dramatically influence ediscovery outcomes and help lower downstream costs.  Aggressively negotiating search terms and, as a result, influencing outside counsel is just one example.
  • When not enough attention is paid to search terms, 60 to 95 percent of documents are responsive. Litigation support teams must stop reacting and arm outside counsel with the right search terms to obtain the right results.
  • Over-collecting creates problems downstream. Conduct interviews with data owners and custodians. Ask them what data should be on hold. They will likely know.
  • Define, refine and repeat the process, train others and keep it in place.
  • Develop litigation readiness protocols, come up with a plan and play by the rules.

Merging of eDiscovery and Data Security

The focal point of Straight’s panel was on why and how organizations should develop a “Risk-based approach to cyber security threats.” With the advent of Cloud computing, sensitive data is at risk from both internal and external threats.

Panelist Ted Kobus, National Co-leader of Privacy and Data Protection at law firm Baker Hostetler, shared his concerns about the possibility of cyber-terrorists shutting down critical infrastructure such as power plants. With regard to ediscovery, law firms often have digital links to client file systems which, if compromised, could leak sensitive data or intellectual property, or give hackers access to customer records.

In light of recent very high profile cyber breaches suffered by Target Stores and others, Kobus and other panelists emphasized the need to develop an “Incident Response Plan” that includes stakeholders from across the enterprise beyond just IT, including legal, compliance, HR, operations, brand management, marketing and sales.

Kobus emphasized that management needs to “embrace a culture of securing data” as a key component of an enterprise’s Big Data strategy. As the slide below indicates, a risk-based approach to managing corporate data addresses simple but critical questions.

UnitedLex LTNY 2014 Cyber Risk Panel Slide

Many organizations have created a Chief Data or Information Governance Officer position responsible for determining how various stakeholders throughout the enterprise are using data, and cyber insurance is becoming much more popular. Big Data management, compliance and data security are intrinsically connected. Moreover, the development of a data plan is critical to the survival of many corporations and its importance must not be overlooked or diminished.

Big Data Innovators to Watch in 2014 – and Beyond 

UnitedLex:  It is relatively easy to make the argument that it takes more than technology to innovate. UnitedLex has grown rapidly on the merits of its “Complete Litigation Lifecycle Management” solution, which provides a “unified litigation solution that is consultative in nature, provides a legally defensible, high quality process and can drive low cost through technology and global delivery centers.”

UnitedLex Domain Experts leverage best of breed ediscovery tools such as kCura’s Relativity for document review as well as their own Questio consultant led, technology-enabled service that “combines targeted automation and data analysis expertise to intelligently reduce data and significantly reduce cost and risk while dramatically improving the timeliness and efficiency of data analysis and document review.”

UnitedLex has reason to believe its Questio service is “materially changing the way eDiscovery is practiced” because UnitedLex SMEs help to significantly reduce risk and, on average, reduce customers’ total project cost (TPC) by 40 to 50% or more. This “change” is primarily achieved by reducing data volumes, avoiding legal sanctions, securely hosting data, strictly adhering to jurisdictional and ethics requirements and, most of all, developing a true litigation and risk partnership with its customers.

How Questio Reduces Risk

Questio chart_collection

61% of eDiscovery-related sanctions are caused by a failure to identify and preserve potentially responsive data. (UnitedLex)

Vound:  In the words of CTO Peter Mercer, Vound provides a “Forensic Search tool that allows corporations to focus on the 99 percent of cases not going to court.” Targeting corporate counsel and internal audit, Vound’s Intella is “a powerful process, search and analysis tool that enables customers to easily find critical data. All products feature our unique ‘cluster map’ to visualize relevant relationships and drill down to the most pertinent evidence.” Intella can be installed on a laptop or in the cloud.

Intella works almost exclusively with unstructured and semi-structured data such as documents, emails and metadata, dividing data into “facets” using a predefined, in-line multi-classification scoring scheme. The solution is used by forensic and crime analysts as well as ediscovery experts to “put the big picture together” and bridge the gap that exists between Big Data (too much data) and ediscovery (reduce size of relevant data sets) to manage risk by identifying fraud patterns, improve efficiency by enabling early assessments and lower cost.

Intella slide


Catalyst:  Insight is a “revolutionary new ediscovery platform from Catalyst, a pioneer in secure, cloud-based discovery. Engineered from the ground up for the demanding requirements of even the biggest legal matters, Insight is the first to harness the power of an XML engine—combining metadata, tags and text in a unified, searchable data store.”

According to Founder and CEO John Tredennick, Catalyst has deployed at least three NoSQL databases on the backend to offer Insight users “unprecedented speed, visual analytics and ‘no limits’ scalability, all delivered securely from the cloud in an elegant and simple-to-use interface—without the costs, complications or compromises of an e-discovery appliance.”

In addition, Catalyst has engineered ACID transactions, dynamic faceted search, specialized caching techniques and relational-like join capabilities into Insight in order to deliver a reliable, fast and easy-to-use solution that enables results, from raw data to review, “in minutes not days” – even with large document sets exceeding 10 million.




The above-mentioned trio of services/solution providers represents a sea change in the way corporations big and small will strategically approach ediscovery in the era of Big Data and Cloud computing. Deploying tactical solutions that meet short-term goals leads to higher costs and increased risk.

Deploying services and technology solutions that can be leveraged across the enterprise by a variety of data stakeholders is a much more logical and cost-effective approach for today’s business climate than deploying point solutions that meet only departmental needs.

A risk-based approach to managing Big Data assets throughout the enterprise – including customer and partner facing data – is not only a good idea; an organization’s survival may depend on it.


Making the Case for Affordable, Integrated Healthcare Data Repositories and PHRs

Tackling the rising cost and complexity of healthcare delivery in the U.S. and, increasingly, around the world while improving health outcomes is one of the great challenges of modern civilization. This is the primary mission of the Affordable Care Act (ACA), better known as ObamaCare, which is funded by ARRA, The American Recovery and Reinvestment Act of 2009.

Far from being an exact science, the practice of medicine is highly specialized and compartmentalized – unlike human beings who are an amalgam of interconnected and related physical systems, emotions and thoughts. Today, most of us humans find ourselves at the nexus of the healthcare delivery and management debate, and healthcare data is an integral part of the discussion.

Navigating today’s complicated healthcare ecosystem and the nuances of ACA demands that individuals take more responsibility for managing their own and their family’s healthcare services. This includes selecting a variety of healthcare professional partners who will help guide us through our health and wellness journey so we can receive the best possible health care advice and services. Healthcare data is undeniably one of those partners.

ACA promises to increase access for individuals to higher quality information regarding the efficacy of providers, procedures, medical research, case studies, outcomes data and comparative cost data previously reserved for “experts” only.

Now, due to changes in laws governing the ownership and access to healthcare records and thanks to advances in electronic data collection and analytics, laymen have the right, and the means, to review and manage their own and their family’s personal healthcare records (PHR) and view aggregated or “cleansed” healthcare data that may support better care and help improve outcomes.

Unfortunately, much of our collective potentially useful healthcare data is still locked away in paper records or inaccessible data formats within provider archives and siloed computer systems despite the fact that technology to access or “crack” these formats has been commercially available for more than a decade.

In addition, the vast majority of hospitals, physician groups and other providers have been slow to adopt these solutions or they have invested in older technology that makes the data extraction problematic or prohibitively expensive. Privacy and security concerns are also cited by those who hold or “curate” our personal health data as justification for delays in promoting potentially useful data to individuals and researchers.

At the same time, personal healthcare records advocates such as Patient Privacy Rights.Org decry the loss of individual anonymity as PHRs are legally, and illegally, resold to “thousands” of healthcare analytics companies for purposes far beyond improving health outcomes.

In addition, studies of electronic health records (EHR) solutions, including a withering article in the New England Journal of Medicine entitled Escaping the EHR Trap, suggest that EHR solutions are overpriced and inefficient, and that EHR vendors are selfishly fostering stagnation in healthcare IT innovation.

Meanwhile, ACA is allocating roughly $19 billion for hospitals to modernize their medical records systems encouraging the adoption of technologies that are often 20 or more years out of date compared with technology adoption curves in several other industries including finance, ecommerce, manufacturing and even government agencies.

Whether or not individuals and providers are fully aware of the ramifications born of the healthcare Big Data explosion, the industry is crying out for help to resolve critical issues at the center of the controversy, including: tackling security, fraud and transparency concerns; ensuring data portability and ownership; using healthcare data exclusively to improve outcomes; and applying the brakes to escalating costs.

The Changing Healthcare Landscape

Dr. Toby Cosgrove, CEO of the Cleveland Clinic, recently remarked at the 2014 World Economic Forum in Davos, Switzerland, “Now, healthcare is more of a team sport than an individual sport. Between doctors, nurses, technicians, IT people and others, these days it takes a whole team of people working together across specialties.” (Huff Post Live at Davos)

At the center of the team is the individual or the patient advocate – most often a family member such as a parent, child or sibling – who is responsible for orchestrating a legion of healthcare providers and technicians that might also include nutritionists, physical therapists, wellness advisors, and alternative medicine practitioners such as herbalists or chiropractors.

Also weighing in from Davos, Mark Bertolini, CEO of Aetna, the third largest health insurer in the US, says, “Healthcare costs are out of control. We really need to look at how health care is delivered and how we pay for it. Today, we pay for each piece of work done and so we get a lot of pieces of work done.” Bertolini points out that Americans are gaining more control and more responsibility for their medical bills with individuals paying about 40% of costs through premiums, deductibles and other charges.

Today, healthcare providers are relying more than ever on data derived from multiple sources to supplement traditional modes of diagnoses and care. Much of this data, useful to providers, payers and individuals, is stored on paper records but also increasingly in electronic form in a variety of “data silos” across the healthcare continuum.

The integration of these data silos to gain a holistic view of each individual’s historical healthcare record while, in the process, also achieving an aggregated view of health populations holds great promise for contributing to improved healthcare outcomes and overall lower costs. This integration and secure portability of health records is one of the primary challenges for the ACA.

The aforementioned $19 billion is being allocated for Medicare and Medicaid electronic health records (EHR) Incentive Programs to encourage eligible providers (EPs) to update their computer systems in order to demonstrate “meaningful use” (MU) of healthcare technology that meets a variety of “core objectives,” including: keeping up-to-date patient medication and allergy histories; consolidating personal health records; and demonstrating the ability to securely transmit EHRs to patients, other providers and health information exchanges (HIEs).

The MU program, created under the HITECH Act, is administered by the Office of the National Coordinator for Health Information Technology (ONC). Most of the money is earmarked for EPs such as hospitals, physician groups and HIEs.

One of the 17 Stage 2 core objectives of the MU EHR incentive program for 2014-15 is to “Provide patients with an electronic copy of their health information (including diagnostic test results, a problem list, medication lists, and medication allergies) upon request.”

On the face of it, the objective of Measure 12 is simple. In practice, very few hospitals today can comply with the letter of the law, and those that cannot are in jeopardy of not meeting HIPAA requirements and losing future meaningful use incentive dollars – allocated in stages over several years.

Here is a link to an ONC document that outlines all of the Core Objectives for Stage 2 MU.

What is the Law?

HIPAA laws have been strengthened over the last decade to enforce the rights of individuals and strongly encourage HIPAA compliance from providers, payers, employers and other covered entities where HIPAA compliance is required. The following link HIPAA PHR and Privacy Rules outlines an individual’s right to access their electronic records.

Later this year, amendments to HIPAA privacy rules will go into effect that provide individuals greater ability to access lab reports, further “empowering them to take a more active role in managing their health and health care,” according to the rule.

HIPAA rules have also been amended to provide individuals and government agencies with recourse if HIPAA security is breached or if personal health records are not made available in a reasonable timeframe, usually within several business days. A California based privacy group details the types of PHRs, what laws protect individual rights and examples of fines that have been recently levied on health providers that do not comply with HIPAA regulations. One health insurer incurred a $1.5 million fine while a cardiac surgery group was fined $100,000 for not properly implementing HIPAA safeguards.

On the flip side of the argument, Dr. Deborah C. Peel, Founder and Chair of Patient Privacy Rights, believes HIPAA rules have actually been weakened. In recent “testimony” addressed to Jacob Reider, MD, National Coordinator for Health Information Technology at the ONC, Dr. Peel articulates her concerns about the widespread practice by analytics companies, payers and providers of Patient Matching – a technique used to exchange U.S. health data without patient involvement or consent.

In part, Dr. Peel asks, “How can institutions exchange sensitive health data without patient participation or knowledge?” Apparently relatively easily.

Healthcare Data Map

A “live” Data Map of the above graphic was developed in cooperation with Harvard University.

At present, billions of MU dollars are still available for EPs to support adoption of EHR and electronic medical records (EMR) solutions. However, by 2016, The Centers for Medicare & Medicaid Services (CMS) will withhold money from providers who do not comply with MU requirements. CMS has already stopped paying for what it characterizes as avoidable readmissions for congestive heart failure (CHF), acute myocardial infarction (AMI) and pneumonia (PN).

CMS is also finalizing the expansion of the applicable conditions for 2015 to include acute exacerbation of chronic obstructive pulmonary disease (COPD) and admissions for elective total hip arthroplasty (THA) and total knee arthroplasty (TKA).

Value of EHRs and PHRs

As indicated in this health records infographic (also seen below), EHRs and personal health records (PHRs) are becoming more valuable to providers and individuals with more healthcare data available online and technology advances to leverage the data in multiple ways.

Healthcare Infographic

More than 10% of smart phone users have downloaded an app to help them track or manage their healthcare services and 2 out of 3 people said they would consider switching to providers who offered access to their health records through the Internet.  (Even more reason to place control of data in the hands of those that will use it to benefit the individual.)

Better access to health information has many benefits, including less paperwork and easy access to records, better coordination of care across providers, faster and more accurate filling of prescriptions, and fewer unnecessary and duplicative tests that inflate costs or involve some risk.

Ultimately, EHRs and PHRs offer the individual better control over their healthcare experience and give caregivers additional information to improve their quality of service.

PHR services such as Microsoft’s HealthVault – along with 50-plus other non-profit and for-profit PHR related services including Medicare’s own Blue Button PHR service – offer users a way to collect, update, store and selectively transmit medical records to providers or anywhere they choose. Medicare (CMS) has opened its Blue Button format to encourage “data holders” and software developers to adopt its Blue Button Plus (BB+) framework, which offers a “human-readable format and machine-readable format.”

Despite security concerns and potential loss of anonymity, millions of Americans have reasoned that capturing and sharing their personal health data has some value. Analytics firm IMS Health evaluated over 43,000 health related apps available for Apple smart phones alone. While IMS concluded that only a small number of apps were “useful” or engaging, the explosion in mobile health apps is only one indication of consumer interest in using health data to modify lifestyles in order to improve health.

While large health insurers such as Aetna and UnitedHealth Group have, on the surface, bought into the BB+ initiative, early indications are that usage is less than 1% of a potential pool of roughly 100 million U.S. citizens. That pool includes those served by health insurance companies, health information exchanges (HIEs) and the Veterans Administration.

One could argue the BB+ program is new and not yet well publicized or understood. For those of us who have actually signed up for PHR services and tried to use them, the bigger problem is likely poor user interfaces and an overall lackluster customer experience leading to little motivation for engagement.

Barriers to EHR and PHR Adoption

There are several factors slowing the widespread adoption of electronic healthcare records by providers and individuals including politics, education, cost, transparency, flawed technology and workflow, and lack of engagement and innovation.


Politics

Until recently, there was no clear statement from lawmakers on ownership of personal health records or any teeth to enforce HIPAA security standards. Providers, payers, pharmaceutical companies, analytics firms as well as government agencies who possessed EHR data “owned” the data. Even as the government has declared individuals own their data, the value of EHRs to the entire healthcare ecosystem has increased exponentially.

The more providers move to EHR solutions, the more coveted that data has become to support research, drug trials, population health, fraud detection, supply chain management, clinical informatics start-ups, venture capitalists and other lucrative data analytics related businesses.


Education

Those members of the healthcare ecosystem who are benefiting financially from reselling de-identified, aggregated or enhanced EHR data usually prefer not to publicize their windfall. Publicly, the ecosystem prefers to focus on the value to individuals – of which there is potentially much benefit. However, the financial benefit does not easily trickle down to individuals or even to most caregivers.

With the individual in control of their own data, the potential for dramatically improving the accuracy and efficacy of individual and aggregated data is enormous. There should be some financial benefit to the individual in the form of lower healthcare costs, reduced prescription drug costs and healthcare insurance rebates for accurately gathering and managing PHR data.


Cost

Most hospitals are spending millions of dollars on migrating older EHR or electronic medical records (EMR) systems to MU “certified” solutions. However, MU incentives pay only a fraction of the cost of these systems. For example, a single practitioner or EP will get less than $50,000 for complying with MU, which is likely less than the cost of an MU-certified solution for the first year or so.

In comparison, Kaiser Permanente, one of the country’s largest providers, has 17,000 doctors. With Kaiser receiving $50,000 for each EP, the total MU incentive reimbursement would be a whopping $850 million. Unfortunately, Kaiser has purportedly spent over $3 billion on their EHR upgrade – and counting.
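The arithmetic behind that comparison can be checked with a short sketch. The figures are the ones quoted above (17,000 doctors, up to $50,000 per EP, a purported $3 billion in spending); the variable names are illustrative only, and the per-EP amount is treated as an upper bound.

```python
# Figures quoted in the text; names are illustrative.
ELIGIBLE_PROVIDERS = 17_000          # Kaiser doctors counted as EPs
INCENTIVE_PER_EP = 50_000            # upper-bound MU payment per EP, in dollars
REPORTED_EHR_SPEND = 3_000_000_000   # purported cost of the EHR upgrade, in dollars

total_incentive = ELIGIBLE_PROVIDERS * INCENTIVE_PER_EP
coverage = total_incentive / REPORTED_EHR_SPEND
shortfall = REPORTED_EHR_SPEND - total_incentive

print(f"Total MU incentive: ${total_incentive:,}")
print(f"Share of spend covered: {coverage:.0%}")
print(f"Shortfall: ${shortfall:,}")
```

Even on these generous assumptions, the incentive covers well under a third of the reported spend.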

Up to this point, MU incentives have not required providers to link or integrate all of their internal IT and hospital systems. Given that most providers are falling behind with existing MU incentive objectives, total integration is not yet on most providers’ to-do lists. Kaiser and other leading-edge health systems such as Intermountain and the Mayo Clinic have largely completed their integration but at a very high cost.


Transparency

Despite MU requirements that demand more transparency to meet quality objectives for reimbursements, some providers may be reluctant to disclose all patient information to individuals including detailed, consolidated procedure and billing information. This level of detail may expose overuse of certain procedures or a pattern of overcharging for services.

Individuals may also not want certain procedures (liposuction or AIDS testing) or lifestyle choices (drinking in excess or drug abuse) recorded in their charts for posterity. Security in general is an issue for most providers as their systems and IT managers may not have access to the most up-to-date security solutions. Most providers are far from leading edge when it comes to data security.

Flawed Healthcare Technology and Workflow

As the saying goes, “In the land of the blind, the one eyed man is king.” The scramble to adopt certified EHR solutions to qualify for MU incentives has an unfortunate consequence: the bulk of the EHR solutions have severe limitations, starting with the lack of interoperability.

For instance, a patient may be admitted to the emergency room, then be sent to intensive care, followed by x-rays, go for surgery and then follow up a week later as an outpatient in the doctor’s office. In most cases, the EHRs are supplied and supported by different software vendors supporting different record formats. If the patient has a nutritional component or rehab is required, those visits might also be recorded in different systems.
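To make the interoperability problem concrete, here is a minimal sketch of what consolidating one patient's visit looks like when each department's system exports a different record layout. The system names, field names, identifiers and dates are hypothetical, invented purely for illustration; real EHR exports are far messier.

```python
from datetime import date

# Hypothetical exports from three departmental systems, each with its own layout.
er_record = {"pt_id": "12345", "admit_dt": "2014-03-01", "note": "Chest pain"}
icu_record = {"patientId": "12345", "date": "03/02/2014", "summary": "Monitoring"}
clinic_record = {"mrn": "12345", "visit": (2014, 3, 9), "text": "Follow-up"}


def from_er(rec):
    # ER system uses ISO-style dates (YYYY-MM-DD).
    y, m, d = (int(part) for part in rec["admit_dt"].split("-"))
    return {"patient": rec["pt_id"], "when": date(y, m, d), "source": "ER", "note": rec["note"]}


def from_icu(rec):
    # ICU system uses US-style dates (MM/DD/YYYY).
    m, d, y = (int(part) for part in rec["date"].split("/"))
    return {"patient": rec["patientId"], "when": date(y, m, d), "source": "ICU", "note": rec["summary"]}


def from_clinic(rec):
    # Outpatient system stores the date as a (year, month, day) tuple.
    return {"patient": rec["mrn"], "when": date(*rec["visit"]), "source": "Clinic", "note": rec["text"]}


def build_timeline(events):
    """Merge normalized events into one chronological record."""
    return sorted(events, key=lambda e: e["when"])


timeline = build_timeline([from_clinic(clinic_record), from_er(er_record), from_icu(icu_record)])
for event in timeline:
    print(event["when"].isoformat(), event["source"], event["note"])
```

Every additional source system needs its own adapter like the three above, which is one reason integration costs climb so quickly.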

Doctors complain that using EMR solutions has turned them into data entry clerks. Beyond diagnostic and billing codes, there is often no standard nomenclature for some diseases or ailments. Meanwhile, EHR vendors claim their solutions are capable of being the “System of Record” or the central repository for all of a patient’s consolidated records.

This recent article from Medical Economics makes the case for why there is such an outcry from physicians over the poor functionality and high costs of EHRs. A study referenced in the article found that two-thirds of doctors would not purchase the same EHR again. Doctors also complained of lost efficiency and the need for additional staff just to manage the new EHRs as well as negative impacts on the quality of patient care.

Experience has demonstrated that the more popular EMRs are good at billing, supporting some workflows, basic reporting and collecting data from some other systems to include in or attach to a patient’s chart. However, no EMR solution has adequately demonstrated that it can function as an integrated data repository while sustaining the high speed and volumes required for clinical decision support systems, analytics and medical informatics solutions that are fast approaching the multiple hundreds of terabytes range and require sub-second response times.

Dr. John Halamka, who writes the Geek Doctor Blog and is the CIO of one of the few remaining hospitals in Eastern Massachusetts that is not migrating its EHR/EMR to Epic (the previously mentioned “one eyed man”), compared his plight to the final scene in the movie Invasion of the Body Snatchers: “At times, in the era of Epic, I feel that screams to join the Epic bandwagon are directed at me.”

Halamka adds, “The next few years will be interesting to watch. Will a competitor to Epic emerge with agile, cloud hosted, thin client features such as Athenahealth?  Will Epic’s total cost of ownership become an issue for struggling hospitals?  Will the fact that Epic uses Visual Basic and has been slow to adopt mobile and web-based approaches prove to be a liability?”

On its “about” page, Epic touts its “One Database” approach: “All Epic software was developed in-house and shares a single patient-centric database.” That one database, referred to by its unfortunate acronym, MUMPS, was developed in the 1960s at Mass General. MUMPS and Epic have many critics who bristle at the idea of using “patient-centric” in the same sentence.

Blogs such as Power Your Practice contend that Epic and MUMPS are stifling innovation: “MUMPS and Interoperability: A number of industry professionals believe MUMPS will be weeded out as doctors and hospitals continue to implement electronic health records, namely because MUMPS-based systems don’t play nice with EHRs written in other languages. There is a reason why the Silicon Valley folks aren’t too fond of the language.

If MUMPS truncates communication between systems, then it hinders interoperability, a cornerstone of EHR adoption. One of the goals of health IT is to avoid insularity, so unless your practice or hospital’s goal is to adopt a client-server enterprise system with limited scalability – and you don’t care much for interoperability – MUMPS may be an option for you.”

Epic and MUMPS have their proponents – primarily the buyers and users at almost 300 hospitals that have collectively forked over many billions of dollars to help make Epic a healthcare EHR/EMR juggernaut. This Google+ thread started by Brian Ahier is replete with heated exchanges about MUMPS and Epic’s lack of interoperability.

Lack of Engagement and Innovation

Standards such as HL7 are only partly working as they still do not handle unstructured data very well, and the mania of EMR vendors, some physician organizations and HIEs to structure all EHRs is unrealistic. Worse yet, the idea that each individual’s health narrative can be reduced to a collection of stock answers and check boxes is just not based in reality.

Managing your health records should not be like pulling teeth. The PHR “experience” is tedious, lengthy and boring, akin to filling out an application for health insurance or a medical history at the doctor’s office. It is heavy data entry with little interaction from the app itself.

In addition, most PHR solutions offer only static results with no intelligent mapping across populations or help to gather an individual’s “publicly available” or private records. Even CMS’ Blue Button has provider and billing codes that do not translate well for non-healthcare professionals, and BB plans to keep records for only 3 years – not nearly long enough to track certain individual or health population trends and services.

As pointed out in the NEJM article referenced above, “Health IT vendors should adapt modern technologies wherever possible. Clinicians choosing products in order to participate in the Medicare and Medicaid EHR Incentive Programs should not be held hostage to EHRs that reduce their efficiency and strangle innovation.”

Time for Affordable Healthcare Big Data Management

Making healthcare affordable while improving quality is the primary goal of ACA. Affordable healthcare data management should also be a top priority as providers and individuals need access to timely information.

Smaller providers and hospitals are already pressed for funds to meet new ACA requirements, and existing solution providers seem intent on exacerbating the problem with overly expensive, poorly functioning solutions and services.

Individuals also need access to affordable or, one could argue, free data to support their own healthcare journey and the healthcare services needs of their extended families.

Healthcare solutions and services vendors as a whole – compared with solution providers in other industries – seem less engaged with newer technology advancements that could help drive dramatic cost reductions in IT services and solutions adoption or, for that matter, vastly improve performance. It appears healthcare solutions buyers are too willing to settle for less.

Too many healthcare data management solutions that claim to tackle big data are doing so with old technology. While the financial industry has widely adopted standards such as XML, SWIFT for inter-bank financial transactions and Check 21 for imaging – not to mention embraced the Linux and open source community – healthcare still struggles with HL7 standards begun over 25 years ago.

Yes, XML is used in healthcare to allow some basic document transfer interoperability between EHR/EMR systems. However, vendors such as Epic use proprietary extensions to make the transfer more difficult outside their system customer base. And yes, some older technologies still work well. COBOL programs are still integral to many mainframe systems. Nevertheless, most of the newer, innovative web-scale systems were developed using open source tools developed or refined in the last decade or so.

Web retailers such as Amazon have bypassed traditional relational database technologies for open source based NoSQL databases that are more scalable, available and affordable. Airline reservation systems have run into relational database bottlenecks and are deploying real-time, in-memory databases. Traditional brick and mortar businesses such as banks and retailers have embraced cloud computing and security standards such as Kerberos, SAML and OpenID.

According to Shahid Shah, The Healthcare IT Guy, healthcare IT needs to consider industry neutral protocols and semantic data formats like RDF and RDFa. “Pulling data is easier. Semantic markup and tagging is easier than trying to deal with data trapped in legacy systems not built to share data.”
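The appeal of a semantic format like RDF is that every fact is a subject-predicate-object triple, so data from different systems can be pooled and queried without first agreeing on a table layout. The sketch below mimics that idea with plain Python tuples rather than a real RDF library; the identifiers and predicates are made up for illustration and do not come from any healthcare vocabulary.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All identifiers below are hypothetical.
triples = [
    ("patient:12345", "hasAllergy", "penicillin"),
    ("patient:12345", "takesMedication", "lisinopril"),
    ("patient:12345", "seenBy", "provider:er-01"),
    ("patient:67890", "hasAllergy", "latex"),
]


def query(store, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Facts from a second system can simply be appended; no schema change is needed.
triples.append(("patient:12345", "hasLabResult", "lab:hba1c-2014-03"))

allergies = query(triples, predicate="hasAllergy")
patient_facts = query(triples, subject="patient:12345")
print(allergies)
print(len(patient_facts))
```

The same query function serves both a per-patient view and a cross-population view, which is the kind of "pulling data is easier" property Shah describes.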

Shah is not alone in his assessment regarding the cost of maintaining legacy data management systems. Many studies suggest older technology costs users more money in the long run. Here are a few examples: Big Data and the problem with Relational Databases, Understanding Technology Costs, and Time to Pull the Plug on Relational Databases?

In addition, there is at least one non-profit organization that views healthcare technology interoperability as a priority. The West Health Institute believes improving healthcare device interoperability alone can save the industry $30 billion per year.

Excerpted from the “pull the plug” link above, “Former Federal CIO Vivek Kundra recently said, ‘This notion of thinking about data in a structured, relational database is dead. Some of the most valuable information is going to live in video, blogs, and audio, and it is going to be unstructured inherently.’ Modern, 21st century tools have evolved to tackle unstructured information, yet a huge majority of federal organizations continue to try and use relational databases to solve modern information challenges.”

The same can be said of the healthcare industry. Despite industry efforts to structure as much healthcare data as possible, the bulk of healthcare data will remain unstructured and the narrative of each individual’s healthcare record will be the richer for it.

The Way Forward for Healthcare Big Data Integration

The development of affordable healthcare solutions needs to be focused on supporting the two primary partners in the healthcare ecosystem: Individuals and providers.

Clearly, tackling the problem of integrating disparate data types gathered from multiple sources and organizing that data into a cogent, human readable format is a Big Data challenge that demands 21st century Big Data handling solutions.

It should be understood that EHRs were not designed to manage a variety of healthcare data formats. In addition, massive RDBMS multi-year data warehouse projects utilizing limited, structured data sets were fine for retrospective or financial reporting but do not work well with large volumes of unstructured data needed for real-time and predictive analytics.

The EHR and PHR are intrinsically interconnected and inseparable. Making it easier for providers and individuals to develop rich healthcare narratives and securely share information should be a top priority for the industry.

As in the early days of the open source movement, most solution providers have been slow to envision how to monetize a free software product. EHR integration solutions are expensive and PHR solutions are unwieldy and lack imagination.

The open source model is built around a common community goal. The common goal for EHRs and PHRs is to lower costs and improve healthcare outcomes. Improving the quality of healthcare data by encouraging the primary sources of that data to collaborate greatly benefits both parties – partners in healthcare data management and ownership.

HDR and PHR Sample Scenarios

The following scenarios are examples of how providers and individuals can play a major role in gathering and enhancing healthcare data for the mutual benefit of both parties. Implementing available, tested, secure, scalable and affordable 21st century technology also plays a key role.

Integrated Healthcare Data Repository for Hospitals

The Challenge

Changes in the economics of healthcare delivery brought on by new laws enacted through the ACA/ObamaCare rollout and the advent of Big Data technologies are forcing hospitals to rethink their funding, workflow, supply chain management, services, staffing and IT strategies.

The result is most hospitals are struggling to implement newer technologies that can help lower costs, improve patient outcomes and buoy employee job satisfaction.

Larger non-profit and for-profit hospital groups have an advantage over stand-alone or smaller hospitals and physicians’ groups due to economies of scale and efficiency, including: increased purchasing power and negotiation leverage; standardization across hospital IT systems; flexibility of a larger workforce; broader service offerings; and more revenue to justify healthcare IT upgrades, capital expenditures and recruitment or retention of key staff.

Also critical is the ability to access larger pools of data to help meet ACA performance metrics and CMS reimbursement requirements, and to determine the best treatment options for individuals and larger health population groups. Smaller provider groups and hospitals can access larger pools of data offered – for a price – by newly formed analytics groups owned by large providers, payers or analytics firms, including Optum Health and Verisk Health.

Data “curated” by providers and hospitals is virtually owned by the application vendors. To paraphrase Shah, “Never build your data integration strategy with the EHR in the center. Create it with the EHR as a first class citizen. Focus on the real customer: the patient.”

The Solution: An Integrated HDR

Using mostly open source components including a NoSQL database, analytics and cloud orchestration software, and also leveraging existing underutilized network, hardware and storage assets, even a smaller hospital group can affordably create an integrated healthcare data repository (HDR) to help meet compliance, regulatory, internal reporting and clinical information requirements – and much more.
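As a rough illustration of why a schema-less store suits this job, the sketch below uses a toy in-memory class in place of a real NoSQL database. It is not any vendor's API; the class, its methods and the sample records are all hypothetical. The point is only that documents with different shapes (a coded EHR entry, a lab result, a free-text note) can sit side by side and still support both a per-patient view and a population-level count.

```python
from collections import defaultdict


class DocumentStore:
    """Toy in-memory stand-in for a schema-less (NoSQL-style) document store."""

    def __init__(self):
        self._docs = []
        self._by_patient = defaultdict(list)  # simple index: patient id -> doc positions

    def insert(self, doc):
        # Documents may have any fields; only "patient_id" is assumed here.
        self._by_patient[doc["patient_id"]].append(len(self._docs))
        self._docs.append(doc)

    def patient_record(self, patient_id):
        """Return every document held for one patient, whatever its source."""
        return [self._docs[i] for i in self._by_patient.get(patient_id, [])]

    def count_by(self, field):
        """Aggregate across all patients, e.g. for population-level reporting."""
        counts = defaultdict(int)
        for doc in self._docs:
            if field in doc:
                counts[doc[field]] += 1
        return dict(counts)


hdr = DocumentStore()
hdr.insert({"patient_id": "12345", "source": "EHR", "diagnosis": "CHF"})
hdr.insert({"patient_id": "12345", "source": "lab", "test": "BNP", "value": 900})
hdr.insert({"patient_id": "67890", "source": "EHR", "diagnosis": "CHF"})
hdr.insert({"patient_id": "67890", "source": "notes", "text": "Patient reports fatigue."})

print(hdr.patient_record("12345"))
print(hdr.count_by("diagnosis"))
```

A production repository would add persistence, access control and audit logging to satisfy HIPAA, none of which is attempted here.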

Integrated Healthcare Data Repository HDR

Several established and emerging vendors are eager to conduct proofs of concept (POCs) and partner with motivated hospitals that would prefer to stay independent and are struggling to keep expenses under control. Even larger hospital groups struggle with integration issues.

Here is a link to a report on 21 NoSQL Innovators, many of whom have a footprint in healthcare and offer open source or affordable and proven database alternatives to traditional, expensive relational databases. Examples include NoSQL segment market share leader MarkLogic 7, open source DB community leader MongoDB and Virtue-Desk’s “associative” database AtomicDB.

HDR Benefits

Savings and Cost Avoidance

  • Spend thousands not millions on integrating healthcare data
    • Liberate your data from HIT solutions silos, make data analytics-ready
    • Lower legacy IT infrastructure costs (data, storage and compute)
  • Avoid expensive software licenses, upgrades and maintenance costs
    • Offload data from expensive data warehousing systems
    • Improve efficiency of existing software systems
  • Help meet ACA and MU compliance requirements
    • Integrate EHR data and deliver HIPAA compliant PHRs to patients
    • Support clinical decision support systems and analyze outcomes data
    • Enable predictive analytics to avoid CMS penalties
    • Avoid HIPAA penalties by avoiding unauthorized information releases
    • Maintain patient privacy of sensitive data, e.g. psychiatric progress notes
  • Avoid time consuming manual workflows
    • Automatically access data for operations and supply chain management
    • Support clinical, research efforts with holistic data views, visualization
  • Recruit and retain high quality staff by implementing state of the art technology

Revenue Generation

  • Analyze data to determine best treatment modalities for patient population
    • Run analytics against holistic data sets to improve treatment choices
    • Discover and predict needs of local patient populations
    • Focus on higher value services within the community
  • Partner with other providers, payers, pharma on aggregated, anonymized data
    • Data to supplement clinical trials
    • Data to supplement payer wellness programs
    • Data to supplement academic medical center research
  • Recruit local clinics, providers, employers, HIEs with state of the art solution
    • Share/defer system costs with local health provider participation
    • Embrace clinically useful information such as cancer genomics
  • Develop a true data partnership with patients and their families
    • Encourage individuals to return to your facility when seeking care
    • Support patient-centric wellness and care programs to local employers

Personal Healthcare Records

The Challenge

The U.S. and most of the rest of the world are in the midst of an elemental transformation in the way healthcare services are delivered to individuals and their families. The cost of healthcare is rising in real terms, along with the percentage of the total cost for which individuals in the U.S. are responsible – due to higher insurance premiums and deductibles, and to vital health services and treatments falling outside of insurance plans’ basic coverage.

Now that we have established the inevitability of personal healthcare information being created and used, for better or worse, as well as resold in a variety of legal and illegal forms, there are three fundamental questions that need to be answered:

1-      What role do individuals play in helping to reform healthcare delivery?

2-      How do individuals move from passive to active participants in their care?

3-      How can individuals leverage or monetize their personal health information?

Answers to the questions above include:

  • Request or demand more information about your care including treatment options, cost prior to agreeing to procedures and electronic copies of all your records. Store PHR in a central location e.g., spreadsheet, email with attached images or with a PHR service.
  • Individuals need to assume an active role in managing their own and their family’s care. In an age of hyper-specialization, individuals should not be surprised if they develop knowledge of an illness or treatment more comprehensive than many caregivers possess. Look for ways to share the knowledge with others who can benefit from it e.g., through online forums or information exchanges.
  • Seek out new web-based services that exchange healthcare information for individuals. There are more than 50 PHR services available today that range in price from free to “concierge” level services that charge a monthly fee. Business models that allow individuals to monetize their own PHR at this point are a bit early in the game but not at all farfetched. Business models and services are changing rapidly. Expect changes and new services to appear – services that are engaging and have a quantifiable value proposition.

In addition, a few assumptions need to be made in order to effect change:

1-      The individual (or a family member or duly appointed health advocate) needs to assume responsibility for their healthcare delivery and wellness strategy.

2-      More accurate, richer personal healthcare records along with access to anonymized, aggregated healthcare information and data can increase an individual’s chances for improved health outcomes at a lower cost.

3-      Data has value. Some say data is the oil of the 21st century. Payers, pharma, healthcare analytics firms and other for profit consumers of healthcare records data need to compensate individuals directly for their anonymized data – pay for it through a PHR broker, offer an incentive through lower premiums, lower cost of prescription drugs or provide individual access to outcomes data and wellness information.

Individuals also need to understand that much of the data being collected today is incomplete or inaccurate and is now “owned” by companies who have their corporate customers’, not the individual’s, best interests in mind.

Imagine following a recipe that has key ingredients missing.

Imagine all of the incorrect data being used by researchers to make multi-billion dollar decisions on treatment modalities and future clinical trial investments.

According to another blog post by Shah (The Healthcare IT Guy) entitled Causes of digital patient privacy loss in EHRs and other Health IT systems, “Business models that favor privacy loss tend to be more profitable. Data aggregation and homogenization, resale, secondary use, and related business models tend to be quite profitable. The only way they will remain profitable is to have easy and unfettered (low friction) ways of sharing and aggregating data. Because enhanced privacy through opt-in processes, disclosures, and notifications would end up reducing data sharing and potentially reducing revenues and profit, we see that privacy loss is going to happen with the inevitable rise of EHRs.”

During a podcast conducted last month by KCRW entitled, Big Data for Healthcare: What about patient privacy?, Dr. Deborah Peel noted, “Powerful data mining companies are collecting intimate data on us. We want the choice of sharing data only with those we know and trust to collect our data. Patients have more interest and stake in data integrity and patient safety than any other stakeholders.”

Dr. Peel calls this “Partnership with consent”. Unencumbered by the “expensive legal and contractual processes and burdens of institutions, and without the need for expensive, complex technologies and processes to verify identity, patients can move PHI easily, more cheaply, and faster than institutions. The lack of ability to conveniently and efficiently update demographic data is one of the top complaints the public has about Healthcare IT systems.

Health technology systems violate our federal rights to see who used our data and why. Despite the federal right to an Accounting of Disclosures (AODs) – the lists of who accessed our health data and why – technology systems violate this right to accountability and transparency.”

No doubt, Dr. Peel and Shah are correct. Yet, as healthcare data collection, management and analysis tools mature and become more mainstream, it is clear that buried within the ever-increasing tower of electronic rubble that is Big Data are insights waiting to be liberated.

It is becoming increasingly rare to find a clinician who would not agree that access to machine derived data and information has and will continue to have a positive impact on a variety of health outcomes.

To paraphrase Iya Khalil, co-founder of healthcare Big Data analytics company GNS Healthcare, who was also a panelist on the above-mentioned KCRW podcast: “What-if scenarios for the future are more predictive and will rely more on precision medicine, not on an ‘average’ patient. Big Data gives this insight. Painting a more accurate picture of an individual’s health through more accurate data enables patients and doctors.”

The descriptions of illnesses change over time, as does the way individuals talk about their lifestyle or detail their personal narrative. Rather than structuring as much data as possible, as present-day EHR solutions and billing systems would have us do, it is critical to capture each individual’s personal health narrative in their own “voice.”

Moreover, individuals are more likely to recognize errors, mismatches and omissions – some of them potentially harmful – than institutions or machines. Individuals can and would be willing to verify information if engaged and properly incented.

The Solution: a True PHR System of Engagement

As mentioned in the Integrated Healthcare Data Repository (HDR) section above, there are at least 21 NoSQL Solutions Innovators that offer proven, scalable, secure, affordable solutions to augment or replace expensive, less flexible traditional databases that primarily rely on structured query languages (SQL) to extract information.

Example: PHR and Rare Illness Data Exchange Repository (RIDER)

PHR and RIDER Data Flows

The PHR/RIDER is a concept that combines at least two existing business models: a place for individuals to store and compare their PHRs, and a secure database for exchanging multiple levels of information with other individuals, providers or other entities. The solution affords the individual protection, choice and management tools at no cost other than their time.

The specific focus on rare illnesses addresses a need expressed by the NIH and others for an increased effort on collecting data from smaller health populations both in the U.S. and abroad where increasing and combining data sets may support additional insights and healthcare breakthroughs.

In part, the NIH states, “Rare diseases comprise a clinically heterogeneous group of approximately 6,500 disorders each occurring in fewer than 200,000 persons in the USA. They are commonly diagnosed during childhood, frequently genetic in origin, and can have deleterious effects on long-term health and well-being. Although any given condition is rare, their cumulative public health burden is significant with an estimated 6-8% of individuals experiencing a rare disease at some point during their lives.”

Fewer than 20% of rare diseases have registries – an indication that most pharmaceutical and biotech firms see little financial return in pursuing cures for rare or so called “orphaned” diseases.

The intent of RIDER is to manage multiple data sets of rare illnesses, or diseases, letting the machine do the heavy work of data abstraction and analytics to determine if any significant correlations exist between the data sets, and then making aggregate data available to the individual contributors as well as carefully selected healthcare providers or other members of the healthcare community.
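To make "significant correlations between data sets" a little more concrete, here is a minimal sketch of the Pearson correlation coefficient computed over two aligned series. It is not taken from any RIDER design; the function name `pearson` and the sample values are made up for illustration, and a real analysis would also need significance testing and far more data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values standing in for two measurements from contributed records.
dataset_a = [1.0, 2.0, 3.0, 4.0, 5.0]
dataset_b = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson(dataset_a, dataset_b)
print(f"correlation: {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship; a value near 0 suggests none. Correlation alone does not establish cause.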

Using mostly open source components including a NoSQL database, analytics and cloud computing infrastructure, small or large non-profit and for-profit groups can more easily afford to implement a true, patient-focused portal that has lowering the individual’s costs and improving health outcomes as its primary mission.

PHR/RIDER Benefits

Savings and Cost Avoidance

  • Spend thousands not millions on integrating healthcare data
    • Provide individuals with free access to PHI and health information
    • Lower clinical trial costs with supplemental data
    • Support and accelerate research efforts
      • Avoid time consuming manual workflows
  • Lower cost of care with supporting data
    • More quickly determine top treatment options
    • Eliminate need to gather PHI from multiple sources
    • More easily compare provider costs and efficacy

Improve Outcomes

  • Support clinical, research efforts with holistic data views, visualization
    • Run analytics against holistic data sets to improve treatment choices
    • Discover and predict needs for orphaned patient populations
  • Partner with providers, payers, pharma, HIEs on aggregated, anonymized data
    • Data to supplement clinical trials
    • Data to supplement payer wellness programs
    • Data to supplement academic medical center research
    • Data to supplement approved health informatics programs
  • Develop a true data partnership with your patients and their families


According to survey data recently released by the analytics division of HIMSS, patient portals and clinical data warehousing/mining are two of the top three new applications poised for growth among hospitals over the next five years. “The findings presented in this report suggest there is an opportunity for vendors and consultants to assist hospital leaders in their efforts to improve care by helping them realize their full EMR capabilities. Of the three applications predicted to dominate first-time sales in hospitals, the patient portal market opportunity is the only one clearly tied to the Meaningful Use Stage 2 requirements.”

Truly patient-centric solutions that support MU core objectives intended to improve patient communication, enable secure, authorized individual access to and transfer of personal health records (PHR), and extend functionality beyond the limits of today’s EHR/EMR solutions to support dramatically enhanced clinical informatics and MU mandated compliance capabilities will dominate the new healthcare IT systems sales beginning in 2014.

Newer approaches to lowering solution costs while improving functionality and ameliorating security concerns can be achieved by leveraging open source and NoSQL database and query technologies. In addition, standards such as JSON for documents and unstructured text, which have been widely adopted outside of the healthcare industry, need to be strongly considered, as do generally available, secure and maturing cloud services.
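As a rough illustration of why JSON suits mixed structured and free-text health data, the sketch below serializes a single record and reads it back. Every field name here is hypothetical and not drawn from any healthcare standard or vendor schema.

```python
import json

# Hypothetical personal health record (PHR) entry; all field names are illustrative.
phr_entry = {
    "patient_id": "example-001",
    "recorded": "2014-01-15",
    "structured": {"systolic_mm_hg": 120, "diastolic_mm_hg": 80},
    "narrative": "Felt dizzy after the morning walk; better after resting.",
}

# Serialize to JSON text, then parse it back to confirm nothing is lost.
as_text = json.dumps(phr_entry, indent=2, sort_keys=True)
round_trip = json.loads(as_text)
print(as_text)
```

Structured measurements and a free-text narrative sit side by side in one document, with no table schema defined in advance.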

Finally, senior hospital management and providers of all sizes need to understand the inherent limitations of this generation’s crop of EHR/EMR solutions. Billing, practice management, quality reporting and meeting a number of other meaningful use requirements are extremely important objectives. However, buying into the hype that EMRs can properly function as the central repository for critical data consolidated from a dozen or more systems within the provider universe is misguided – at best.

Individuals and providers must work together to update, manage and secure healthcare data to increase the likelihood that information contained in medical records is used to improve individual wellness or aggregated to drive insights that benefit health populations. To that end, there are indeed viable, innovative and affordable solutions to pursue.


Posted in Big Data, Healthcare Informatics, Information Governance, Information Management Best Practices, Information Management Thought Leadership, Strategic Information Management

21 NoSQL Innovators to Look for in 2020

In the ever-evolving world of enterprise IT, choice is generally considered a good thing – albeit having too many choices can create confusion and uncertainty. For those application owners, database administrators and IT directors who pine for the good old days when one could count the number of enterprise-class databases (DBs) on one or two hands, the relational-database-solves-all-our-data-management-requirements days are long gone.

Thanks to the explosion of Big Data throughout every industry sector and requirements for real-time, predictive and other forms of now indispensable transactions and analytics to drive revenue and business outcomes, today there are more than 50 DBs in a variety of categories that address different aspects of the Big Data conundrum. Welcome to the new normal world of NoSQL – or, Not only Structured Query Language – a term used to designate databases which differ from classic relational databases in some way.

In August, more than 20 NoSQL solution providers and 100-plus experts gathered at the San Jose Convention Center for the 2013 edition of NoSQL Now! Exhibitors and speakers included familiar names such as Oracle along with a score of venture-backed NoSQL solution providers eager to disseminate their message and demonstrate that the time has come for enterprises of every ilk to adopt innovative database solutions to tackle Big Data challenges. More than a dozen sponsors were interviewed at the event and profiled in this research note.

Evolution of NoSQL

In the beginning, there was SQL (structured query language). Developed by IBM computer scientists in the 1970s as a special-purpose programming language, SQL was designed to manage data held within a relational database management system (RDBMS). Originally based on relational algebra and tuple relational calculus, SQL consists of a data definition language and a data manipulation language. Subsequently, SQL has become the most widely used database language largely due to the popularity of IBM, Microsoft and Oracle RDBMSs.

NoSQL DBs started to emerge and become enterprise-relevant in the wake of the open-source movement of the late 1990s. Use of NoSQL databases has mushroomed, aided by the movement toward Internet-enabled online transaction processing (OLTP), by distributed processing leveraging the cloud, and by the inherent limitations of relational DBs, including high cost and a lack of horizontal scale, flexibility, availability and findability.

Amazon’s Dynamo, the internal system whose principles underpin DynamoDB, is considered by many to be the first large-scale, or web-scale, production NoSQL database. To quote author Joe Brockmeier, who now works for Red Hat, “Amazon’s Dynamo paper is the paper that launched a thousand NoSQL databases.” Brockmeier suggests that the “paper inspired, at least in part, Apache Cassandra, Voldemort, Riak and other projects.”

According to Amazon CTO Werner Vogels, who co-authored the paper entitled Dynamo: Amazon’s Highly Available Key-value Store, “DynamoDB is based on the principles of Dynamo, a progenitor of NoSQL, and brings the power of the cloud to the NoSQL database world. It offers customers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table.” DynamoDB is the primary DB behind the wildly successful Amazon Web Services business and its shopping cart service that handles over 3 million “checkouts” a day during the peak shopping season.

As a result of the Amazon DynamoDB and other enterprise-class NoSQL database proof points, it is not uncommon for an enterprise IT organization to support multiple NoSQL DBs alongside legacy RDBMSs. Indeed, there are single applications that often deploy two or more NoSQL solutions, e.g., pairing a document-oriented DB with a graph DB for an analytics solution. Perhaps the primary reason for the proliferation of NoSQL DBs is the realization that one database design cannot possibly meet all the requirements of most modern-day enterprises – regardless of the company size or the industry. 

The CAP Theorem

In 2000, Berkeley, CA, researcher Eric Brewer published his now-foundational CAP Theorem (consistency, availability and partition tolerance), which states that it is impossible for a distributed computer system to simultaneously provide all three CAP guarantees. In May 2012, Brewer clarified some of his positions on the oft-used “two out of three” concept. The three guarantees are:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system).

NIST CAP Slide with Big Data

According to Peter Mell, a senior computer scientist for the National Institute of Standards and Technology, “In the database world, they can give you perfect consistency, but that limits your availability or scalability. It’s interesting, you are actually allowed to relax the consistency just a little bit, not a lot, to achieve greater scalability. Well, the Big Data vendors took this to a whole new extreme. They just went to the other side of the Venn diagram, and they said we are going to offer amazing availability or scalability, knowing that the data is going to be consistent eventually, usually. That was great for many things.” 
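The "consistent eventually" trade-off Mell describes can be sketched with a toy model: two replicas that accept writes independently and reconcile later. This is a simplified illustration, not how any particular product works. The `Replica` class, the `sync` function and the last-write-wins rule based on a version number are all assumptions chosen for brevity; real systems use more careful conflict resolution.

```python
class Replica:
    """Toy replica storing key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep the entry with the highest version (last-write-wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


east, west = Replica(), Replica()
east.write("cart", ["book"], version=1)
sync(east, west)

# Simulated network partition: only the east replica sees the update.
east.write("cart", ["book", "pen"], version=2)
stale = west.read("cart")   # west stays available but serves older data

# The partition heals and the replicas reconcile.
sync(east, west)
fresh = west.read("cart")
print(stale, fresh)
```

During the partition both replicas keep answering requests (availability) at the cost of one of them returning stale data (relaxed consistency). Once they can talk again, they converge on the same value.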


In most organizations, upwards of 80% of Big Data is in the form of “unstructured” text or content, including documents, emails, images, instant messages, video and voice clips. RDBMSs were designed to manage “structured” data in manageable fields, rows and columns such as dates, social security numbers, addresses and transaction amounts. ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantees database transactions are processed reliably and is a necessity for financial transactions and other applications where precision is a requirement.
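A small sketch of the atomicity part of ACID, using the SQLite engine bundled with Python's standard library. The `accounts` table, the names and the amounts are invented for illustration. The point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is undone
print(ok, failed, balances(conn))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterward because the whole transaction is rolled back.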

Conversely, most NoSQL DBs tout their schema-less capability, which ostensibly allows for the ingestion of unstructured data without conforming to a traditional RDBMS data format or structure. This works especially well for documents and metadata associated with a variety of unstructured data types as managing text-based objects is not considered a transaction in the traditional sense. BASE (basically available, soft state, eventually consistent) implies the DB will, at some point, classify and index the content to improve the findability of data or information contained in the text or the object.

Increasingly, a number of database cognoscenti believe NoSQL solutions have overcome, or will overcome, the “ACID test” as availability is said to trump consistency – especially in the vast majority of online transaction use cases. Even Eric Brewer argued recently that bank transactions are BASE, not ACID, because availability = $.

NoSQL Database Categories

As will be seen in the following section, NoSQL DBs simultaneously defy description and define new categories for NoSQL databases. Indeed, many NoSQL vendors possess capabilities and characteristics associated with more than one category, making it even more difficult for users to differentiate between solutions. A good example is the following taxonomy provided by Cloud Service Provider (CSP) Rackspace, which classifies NoSQL DBs by their data model.

NoSQL Data Models from Rackspace Updated

Note: In the original slide, Riak is depicted as a “Document” data model. According to Riak developer Basho, Riak is actually a key-value data model and its query API (application programming interface) is the popular web REST API as well as protocol buffers.

The chart above represents the five major NoSQL data models: Collection, Columnar, Document-oriented, Graph and Key-value. Redis is often referred to as a Column or Key-value DB, and Cassandra is often considered a Collection. According to Techopedia, a Key-Value Pair (KVP) is “an abstract data type that includes a group of key identifiers and a set of associated values. Key-value pairs are frequently used in lookup tables, hash tables and configuration files.” Collection implies a way documents can be organized and/or grouped.

Yet another view, courtesy of Beany Blog, describes the database space as follows:

CAP Theorem diagram

“In addition to CAP configurations, another significant way data management systems vary is by the data model they use: relational, key-value, column-oriented, or document-oriented (there are others, but these are the main ones).

  • Relational systems are the databases we’ve been using for a while now. RDBMSs and systems that support ACIDity and joins are considered relational.
  • Key-value systems basically support get, put, and delete operations based on a primary key.
  • Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
  • Document-oriented systems store structured ‘documents’ such as JSON or XML but have no joins (joins must be handled within your application). It’s very easy to map data from object-oriented software to these systems.”
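Two points from the quoted list, the get/put/delete interface of a key-value system and joins handled within the application, can be sketched together. The `KeyValueStore` class, the `join_visits` function and the sample documents are hypothetical and do not correspond to any vendor's API.

```python
class KeyValueStore:
    """Minimal in-memory key-value store with get, put and delete."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        """Remove key; return True if it existed."""
        if key in self._items:
            del self._items[key]
            return True
        return False


patients = KeyValueStore()
patients.put("p1", {"name": "Patient One"})
patients.put("p2", {"name": "Patient Two"})

# Visit documents refer to patients by key; the store itself offers no join.
visits = [
    {"patient_id": "p1", "reason": "checkup"},
    {"patient_id": "p2", "reason": "follow-up"},
    {"patient_id": "p1", "reason": "lab work"},
]


def join_visits(visits, patients):
    """Application-side join: look up each visit's patient by primary key."""
    joined = []
    for visit in visits:
        patient = patients.get(visit["patient_id"])
        if patient is not None:
            joined.append({"name": patient["name"], "reason": visit["reason"]})
    return joined


joined = join_visits(visits, patients)
print(joined)
```

The work a relational engine would do in a JOIN clause moves into application code, which is the trade-off the quoted passage points to.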

Beany Blog omits the Graph database category, which has a growing number of entrants, including Franz Inc., Neo4j, Objectivity and YarcData. Graph databases are designed for data whose relations are well represented as a graph, e.g., visual representations of social relationships, road maps or network topologies, and representations of “ownership” of documents within an enterprise for legal or e-discovery purposes.
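To show the kind of question graph-shaped data answers naturally, here is a minimal sketch that stores social relationships as an adjacency map and counts the hops between two people with a breadth-first search. The names and the `hops` function are invented; dedicated graph databases provide their own query languages for this.

```python
from collections import deque

# Hypothetical social graph: each person maps to the set of people they know.
graph = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "dev"},
    "cara": {"ann"},
    "dev": {"bob", "eli"},
    "eli": {"dev"},
    "fay": set(),
}


def hops(graph, start, goal):
    """Return the fewest relationship hops from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbor in graph[person]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None  # no path connects the two people


print(hops(graph, "ann", "eli"))
```

Expressing the same traversal in a relational model typically takes one self-join per hop, which is one reason graph-specific stores exist.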

Hadoop and NoSQL 

The Hadoop Distributed File System (HDFS) is an Apache open-source platform that enables applications, such as petabyte-scale Big Data analytics projects, to potentially scale across thousands of commodity servers such as Intel standard x86 servers, dividing up the workload.

Hadoop includes components derived from Google’s MapReduce and Google File System (GFS) papers as well as related open-source projects, including Apache Hive, a data warehouse infrastructure initially developed by Facebook and built on top of Hadoop to provide data summarization, query and analysis support; and Apache HBase and Apache Accumulo, both open-source NoSQL DBs, which, in the parlance of the CAP Theorem, are CP DBs and are modeled after the BigTable DB developed by Google. Facebook purportedly uses HBase to support its data-driven messaging platform while the National Security Agency (NSA) supposedly uses Accumulo for its data cloud and analytics infrastructure.
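The MapReduce idea can be sketched in a few lines on a single machine. This is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and a real cluster would run the map and reduce steps in parallel across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big deal"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the workload divides cleanly across commodity servers.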

In addition to the HBase, MarkLogic 7 and Accumulo native integrations with HDFS, several NoSQL DBs can be used in conjunction with HDFS, whether they are open source and community supported or proprietary in nature, including Couchbase, MarkLogic, MongoDB or Oracle’s version of NoSQL based on the Berkeley open-source DB. As Hadoop is inherently a batch-oriented paradigm, additional DBs to handle in-memory processing or real-time analysis are needed. Therefore, NoSQL – as well as RDBMS – solution providers have developed connectors for allowing data to be passed between HDFS and their DBs.

Datastax NoSQL and Hadoop slide

The slide above, courtesy of DataStax, illustrates how NoSQL and Hadoop solutions are transforming the way both transactional and analytic data are handled within enterprises with large volumes of data to manage both in real-time, or near real-time, and post-processing or after data is updated or archived.

NoSQL DB Funding and Growth 

A recent note written by Wikibon’s Jeff Kelly, Hadoop-NoSQL Software and Services Market Forecast 2012-2017, gives a good indication of how well funded and fast growing the market for RDBMS alternatives has become.

“The Hadoop/NoSQL software and services market reached $542 million in 2012 as measured by vendor revenue. This includes revenue from Hadoop and NoSQL pure-play vendors – companies such as Cloudera and MongoDB – as well as Hadoop and NoSQL revenue from larger vendors such as IBM, EMC (now Pivotal) and Amazon Web Services. Wikibon forecasts this market to grow to $3.48 billion in 2017, a 45% CAGR [compound annual growth rate] during this five-year period.” Kelly forecasts the NoSQL portion of the market to reach nearly $2 billion by 2017.
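The growth rate in the quote can be checked with the standard compound annual growth rate formula, (end / start) ^ (1 / years) - 1. The helper name `cagr` is just for illustration; the dollar figures are the ones quoted above.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.45 means 45%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the Wikibon forecast: $542 million (2012) to $3.48 billion (2017).
rate = cagr(542e6, 3.48e9, 5)
print(f"{rate:.1%}")
```

The result comes out at roughly 45%, in line with the forecast as quoted.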

Kelly’s research also indicates that the top ten companies in the space, measured by total funding, received more than $600 million over the last 5 years, with funding increasing dramatically over the last 3 years, including $177 million for 2013 thus far. The top-funded NoSQL DB companies – in order of total funding amount – include DataStax (Cassandra), MongoDB, MarkLogic, MapR, Couchbase, Basho (creator of Riak), Neo Technology (creator of Neo4j) and Aerospike.

Note:  On October 4th 2013, MongoDB announced it had secured $150 million in additional funding which would now make it the top-funded company in the space.

21 for 2020: NoSQL Innovators

As previously mentioned, there are now more than 50 vendors that have entered the NoSQL DB software and services space. As is the case with most nascent technology markets, more companies will emerge and others will buy their way into the market, fueling the inevitable surge of consolidation.

Oracle has publicly committed to its Berkeley DB open-source version of NoSQL, while IBM offers support for Hadoop and MongoDB solutions as part of its InfoSphere information management platform as well as Hadoop enhancements for its PureData System, and Microsoft supports a variety of NoSQL solutions on its Windows Azure cloud-based storage solution. Suffice it to say, the big three RDBMS vendors are pragmatic about the future of databases. Sooner or later, expect them all to make NoSQL acquisitions.

Meanwhile, here is a short list of companies anticipated to disrupt the database space over the next 5 to 7 years arranged in somewhat different categories from the above NoSQL taxonomies and based more on use case within the enterprise than on data model.

This group is also distinguished by added capabilities or functionality beyond just providing a simple data store with the inclusion of analytics, connectors (interoperability with other DBs and applications), data replication and scaling across commodity servers or cloud instances.

Disruptive NoSQL Database Solutions



Follow this link for brief profiles of these 21 NoSQL Innovators.



Note: Not all of these solutions are strictly NoSQL-based, including NuoDB and Starcounter, two providers that refer to their databases as “NewSQL”; and Virtue-Desk, which refers to its DB as “Associative.” All three get lumped into the NoSQL category because they offer alternatives to traditional RDBMS solutions.

Note: One could argue that other categories such as Embedded Databases could also be included. In over 20 hours of interviews, only 2 NoSQL solution providers, Oracle Berkeley DB and Virtue-Desk, mentioned embedding their databases within applications. In the case of Virtue-Desk, its solution is written entirely in Assembler and can be embedded in “any” device that has more than 1MB of memory – the DB is only 600k installed.

Note: The clear trend for non-relational database deployment is for enterprises to acquire multiple DBs based on application-specific needs, what could be referred to as software-defined database adoption.




Posted in Big Data, Information Management Thought Leadership