In previous years, we saw several technologies rise with the big-data wave to fulfill the need for analytics on Hadoop. But enterprises with complex, heterogeneous environments no longer want to adopt a siloed BI access point just for one data source (Hadoop). Answers to their questions are buried in a host of sources, ranging from systems of record to cloud warehouses, to structured and unstructured data from both Hadoop and non-Hadoop sources. (Incidentally, even relational databases are becoming big-data-ready. SQL Server 2016, for instance, recently added JSON support.)
LEARN HADOOP NOW
The terms “Big Data” and “Hadoop” have gained favour in recent times. Hadoop has made it fairly easy for programmers to take any embarrassingly parallel problem and quickly spread it across large clusters. Big Data, on the other hand, is to me just the fuel that Hadoop converts into a form amenable to analysis. A person who can write code using Hadoop and its associated frameworks is not necessarily someone who can understand the underlying patterns in that data and come up with actionable insights. That is what a data scientist is supposed to do. Equally, data scientists might not be able to write the code to convert “Big Data” into “actionable” data. That is what a Hadoop practitioner does. These are very distinct job descriptions.
Big Data, too, has its own interpretations. People typically characterize Big Data using the four Vs: Volume of data; Velocity, the frequency with which data arrives; Variety of data types; and Veracity, the goodness of the data. But one of the best definitions I have heard of the term is this: “Big data is one byte more data than your system has”. For example, HR data has a wide variety of data with very low veracity (since the data is quite noisy), yet compared to the streaming data coming in from the likes of e-commerce, its volume and velocity are low. However, given the limited computing power of typical HR systems, even a few gigabytes of data can seem like big data to its practitioners.
Thus “big data” itself is a relative term that I believe has outlived its usefulness. Perhaps “smart data” ought to replace “big data” for most analytical applications!
LOGICAL DATA WAREHOUSES AND FEDERATION TECHNOLOGY
One of the most mature solutions to the problem of aggregating data from disparate data sources is the logical data warehouse. To the user, a logical data warehouse looks just like an enterprise data warehouse. But in fact, it is a single-query interface that assembles a logical view “on top of” a set of heterogeneous and dispersed data sources.
In order to look like an enterprise data warehouse, a logical warehouse has to transform and standardize the data in real time, as it is queried. The great strength of the logical warehouse is that it is possible to do just enough standardization and transformation to meet an immediate business need just in time, rather than having to standardize and transform every piece of data for every possible query up front.
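As a minimal sketch of this query-time standardization, consider two tiny in-memory stand-ins for real sources, each with its own invented schema and units. All names and values here are hypothetical; the point is that the transformation happens as the query runs, and only for the data the query touches.

```python
# Source A: a system of record storing amounts in cents and terse column names.
SOURCE_A = [
    {"cust": "C1", "amount_cents": 1250, "date": "2016-01-05"},
    {"cust": "C2", "amount_cents": 3300, "date": "2016-01-06"},
]

# Source B: a cloud warehouse storing amounts in dollars, customer IDs lowercase.
SOURCE_B = [
    {"customer_id": "c1", "amount_usd": 99.0, "order_date": "2016-01-07"},
]

def standardize_a(row):
    """Transform a Source A row into the logical schema at query time."""
    return {"customer": row["cust"], "amount": row["amount_cents"] / 100.0,
            "date": row["date"]}

def standardize_b(row):
    """Transform a Source B row into the same logical schema."""
    return {"customer": row["customer_id"].upper(), "amount": row["amount_usd"],
            "date": row["order_date"]}

def query_sales(customer):
    """Single-query interface: the caller never sees the source schemas."""
    rows = ([standardize_a(r) for r in SOURCE_A] +
            [standardize_b(r) for r in SOURCE_B])
    return [r for r in rows if r["customer"] == customer]

result = query_sales("C1")
```

Note that only the rows needed to answer this one query are transformed; nothing is standardized up front.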
However, if the data being queried is stored in multiple formats and in multiple physical locations, this is both technically difficult and inefficient; this is why the logical warehouse has not simply replaced physical data warehouses.
Logical data warehouses that span enterprises—an approach often described as “federation”—are particularly inefficient, because the problems of data being in different formats, and in physically distant locations connected by limited bandwidth, are magnified by the physically and technologically distinct corporations being bridged together. Today’s solutions are the data lake and APIs. But both suffer challenges of their own.
THE DATA LAKE
A data lake is a physical instantiation of a logical data warehouse: data is copied from wherever it normally resides into a centralized big data file system, thereby solving the problem of data being physically dispersed. This is not any kind of return to the traditional data warehouse—the data lake is designed for far greater agility, scalability, and diversity of data sources than a traditional data warehouse.
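The copy-into-one-place idea can be sketched in a few lines. This is an illustrative toy, not any particular lake product: the directory layout (source name, then load date) and the file names are invented, and a real lake would sit on a distributed file system rather than a local disk.

```python
import os
import shutil
import tempfile

def ingest(src_file, lake_root, source_name, load_date):
    """Copy a source file into the lake under /<source>/<date>/, unchanged.
    The raw data is preserved as-is; interpretation is deferred to read time."""
    dest_dir = os.path.join(lake_root, source_name, load_date)
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src_file))
    shutil.copy2(src_file, dest)
    return dest

# Demonstrate with a temporary "source system" file and lake directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "orders.csv")
with open(src, "w") as f:
    f.write("order_id,amount\n1,12.50\n")

dest = ingest(src, os.path.join(workdir, "lake"), "crm", "2016-06-01")
```

The design choice worth noting is that ingestion does no transformation at all; that agility is what distinguishes the lake from a traditional warehouse load.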
It is relatively easy for an enterprise to add its own data to a lake, but there are many datasets of critical importance outside of the enterprise. It may be possible to copy low volume, stable, third-party datasets into the lake, but this will not always be a viable solution—whether for reasons of data confidentiality or data volume, or because the data is volatile and requires excessive change management.
In the case of external data in the cloud, an enterprise might be able to extend its private network into the cloud to encompass the dataset and federate it with its on-premise lake. This is undoubtedly technically feasible; the only issues are the willingness of the data owner to agree to the arrangement, and performance (depending on the bandwidth of the connection between the cloud and the on-premise data center).
Rather than lifting and shifting data into a lake, a big data warehouse can federate with external data sources by means of their published APIs.
It is often far more effective for an enterprise to consume APIs that answer its business questions (“What is my top-performing brand?”) than to amass the data required to answer these questions for itself.
Remote APIs are an appropriate solution when the enterprise knows what it needs to know and can ask for it. They are less effective for advanced analytics and data discovery—where the user doesn’t know what they don’t know, and as a result is obliged to make API calls that move large volumes of data across a wide area network. This has traditionally been a poorly performing approach, mainly because of the bandwidth cost of moving so much data. Advanced analytics is one of the main uses of big data; given that big data is inherently distributed, the problem of how to run discovery-type processes in this environment has received serious attention. The most promising approach is to implement APIs designed from the ground up to support big data. These APIs overcome the problem of moving large amounts of unstructured data across a network by transmitting it in a highly compressed form, and by having the data describe itself so that the lack of a pre-defined structure is not an issue.
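The two properties named above—compression and self-description—can be illustrated with a toy payload format. This is an assumed sketch, not a real wire format: gzip and JSON stand in for whatever codec and schema encoding a production API would use, and the field names are invented.

```python
import gzip
import json

def pack(records):
    """Bundle a schema description with the rows, then compress the payload."""
    schema = sorted({key for r in records for key in r})
    payload = {"schema": schema, "rows": records}
    return gzip.compress(json.dumps(payload).encode("utf-8"))

def unpack(blob):
    """The receiver recovers both the schema and the data from the payload
    itself; no pre-agreed structure is needed."""
    payload = json.loads(gzip.decompress(blob).decode("utf-8"))
    return payload["schema"], payload["rows"]

records = [{"brand": "Acme", "units": 1200}, {"brand": "Globex", "units": 950}]
blob = pack(records)
schema, rows = unpack(blob)
```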
INTEGRATING DATA IN A BIG DATA WORLD
Physically co-locating data in a data lake, or logically through APIs or a form of federation, solves the problem of data dispersal. It does not address the issue that the data is in many different formats and is un-harmonized. The enterprise data warehouse solves the format problem by brute force: extracting all of the data from source and loading it into a single database. It solves the integration problem by using master data management software to apply a consistent set of descriptive metadata and identifying keys across the whole dataset.
Although big data technologies and the data lake approach have a major role to play in the future of data warehousing, the many different types of data the warehouse needs to contain (including images, video, documents, associations, key-value pairs, and plain old relational data) mean that there is no one physical format that is optimal for storing and querying all of it. As a result, many people are strong proponents of a polyglot persistence approach: data is stored in the most appropriate form, and an overarching access layer provides a single interface and query language to interrogate the data. The access layer takes responsibility for translating the query into a form the underlying data stores can understand, combining results across stores, and presenting the results back to the user.
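A minimal sketch of such an access layer, with two in-memory dictionaries standing in for a key-value store and a relational store: the store contents, the query shape, and the class name are all invented for illustration.

```python
# Two "stores" in different physical formats (faked in memory).
KV_STORE = {"session:42": {"user": "alice", "page": "home"}}
RELATIONAL_STORE = [
    {"user": "alice", "country": "US"},
    {"user": "bob", "country": "DE"},
]

class AccessLayer:
    """Single interface over polyglot storage: callers issue one logical
    query and never address the underlying stores directly."""

    def query(self, user):
        # Translate the logical query into per-store lookups...
        profile = next((r for r in RELATIONAL_STORE if r["user"] == user), {})
        sessions = [v for v in KV_STORE.values() if v.get("user") == user]
        # ...then combine the results into one answer for the caller.
        return {"profile": profile, "sessions": sessions}

layer = AccessLayer()
result = layer.query("alice")
```

The design point is that each store keeps its optimal physical format; only the access layer knows how to talk to both.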
There are already many interfaces that allow developers to query big data in a non-relational format using SQL. Although it may take some time for comprehensive, fully functional and efficient solutions to the complications of polyglot persistence to become mainstream, it is an eminently solvable problem. The problem of data integration and harmonization is much more challenging because it is one of content, not technology. One way of looking at this is to recognize that polyglot persistence gives you the grammar, but no vocabulary. Grammar has just a small number of rules, but the language it governs will have hundreds of thousands, if not millions, of words.
Unless the disparate datasets in a data lake are aligned and harmonized, they cannot be joined or co-analyzed. The techniques used to do this in an enterprise data warehouse are manual and rely on a limited number of experts—they don’t scale to big-data volumes.
Data discovery tools have provided a partial solution to the problem by democratizing integration. An analysis may require a business user to combine several disparate datasets. To support this, discovery tools have built-in, lightweight, high-productivity integration capabilities, generally known as “data wrangling” to distinguish them from the heavyweight extract, transform and load (ETL) processes of the data warehouse. This basic and very user-friendly functionality removes the expert bottleneck from integration: users can do integration for themselves. The downside of this approach is that integration tends to be done by users in isolation, and the integration process is not repeatable, shareable, or cataloged. It results in data puddles rather than a data lake. This may be the best approach to providing a fast answer to a tactical question: it allows the flexibility to apply just enough integration to meet the business need if, for example, all that is required is a quick directional read on a trend. The one-size-fits-all enterprise data warehouse approach is inflexible and slow in comparison. Nevertheless, the data puddle has obvious issues if a consistent or enterprise-wide view is required.
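The "just enough integration" pattern, and why it yields a puddle, can be seen in a few lines. The datasets and field names below are invented; the mapping decisions (lowercasing one key, treating `units` and `qty` as the same measure) live only in this one script and are neither cataloged nor reusable.

```python
# Two small datasets a business user wants to combine for one question.
web_sales = [{"region": "north", "units": 120}, {"region": "south", "units": 80}]
store_sales = [{"REGION": "North", "qty": 200}, {"REGION": "South", "qty": 150}]

# "Just enough" integration: normalize the key and the measure name inline.
combined = {}
for row in web_sales:
    combined[row["region"]] = combined.get(row["region"], 0) + row["units"]
for row in store_sales:
    key = row["REGION"].lower()  # ad-hoc harmonization decision
    combined[key] = combined.get(key, 0) + row["qty"]

# `combined` now answers one tactical question (total units by region), but
# the harmonization logic is trapped in this script: a data puddle.
```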
Companies have often spent a great deal of time and money curating their enterprise master data. The data warehousing guru Ralph Kimball argues that the dimensions identified in the enterprise warehouse model and the master data associated with them are valuable enterprise assets that should be reused in a big data world. Matching dimensions between internal and external datasets and identifying common items on those dimensions allow data to be snapped together and years of master data investment to be leveraged.
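As a sketch of that "snapping together", suppose the enterprise has a curated brand dimension and receives an external dataset keyed by brand name rather than by master key. The dimension rows, keys, and names below are all invented for illustration.

```python
# Curated enterprise master data: the brand dimension.
MASTER_BRAND_DIM = {
    "B001": {"brand": "Acme", "category": "Tools"},
    "B002": {"brand": "Globex", "category": "Energy"},
}

# External (e.g. third-party) data uses brand names, not master keys.
EXTERNAL_SALES = [{"brand_name": "acme", "units": 500},
                  {"brand_name": "globex", "units": 300}]

# Build a lookup from normalized brand name to master key.
NAME_TO_KEY = {v["brand"].lower(): k for k, v in MASTER_BRAND_DIM.items()}

def conform(row):
    """Attach the master key and dimension attributes, so the external row
    joins cleanly to anything else keyed on the brand dimension."""
    key = NAME_TO_KEY.get(row["brand_name"].lower())
    category = MASTER_BRAND_DIM[key]["category"] if key else None
    return {**row, "brand_key": key, "category": category}

conformed = [conform(r) for r in EXTERNAL_SALES]
```

Once conformed, years of investment in the dimension (hierarchies, attributes, history) are available to the external data for free.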
A problem for both traditional and democratized data integration is that they rely on people, albeit a much larger pool of people in the case of democratized integration. Big data is not only vast, it is also fast: if the sheer amount of data needing integration does not overwhelm individuals, the speed at which it needs to be integrated before it becomes stale will. That is why the common thread linking the tools attempting to scale data integration for the digital world is the use of machine learning and statistical matching techniques. After appropriate learning, these tools fully automate simple integration tasks and only promote complex cases or exceptions to humans for adjudication. For some analyses, fully automated matching may give a good enough result. If users need real-time or near-real-time directional reporting, it is the only way to go.
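A toy version of this automate-and-escalate pattern: here a simple string-similarity score from the standard library stands in for a trained matching model, and the master list, input names, and threshold value are all invented for illustration.

```python
from difflib import SequenceMatcher

# Curated master list of company names (invented).
MASTER = ["International Business Machines", "Microsoft Corporation"]

def best_match(name, candidates):
    """Score `name` against every candidate; return the best (score, candidate)."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return max(scored)

def integrate(names, auto_threshold=0.75):
    """Fully automate clear matches; queue ambiguous cases for a human."""
    matched, for_review = {}, []
    for name in names:
        score, candidate = best_match(name, MASTER)
        if score >= auto_threshold:
            matched[name] = candidate      # confident: auto-accept
        else:
            for_review.append(name)        # ambiguous: escalate to a person
    return matched, for_review

matched, for_review = integrate(["Microsoft Corp.", "Intl Bus Machines"])
```

The threshold is the dial between speed and accuracy: raise it and more cases go to humans; lower it and more matching is fully automatic, which may be good enough for a directional read.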
Given the current state of technology, there is no single solution for creating a big data warehouse to replace the now outdated enterprise data warehouse. In the short term, the enterprise data warehouse will remain the golden nugget at the heart of a company’s data assets. It will be supplemented by, or subsumed into, a data lake, which contains the many and various big data sources the company is able to co-locate. Data that cannot be co-located in the lake will be accessed through APIs and federation technologies, such as logical data warehouses. Data harmonization will take on even greater importance than it does now, but will transform from the clerically intensive, one-size-fits-all approach of the enterprise warehouse to a highly automated and need-driven approach.