The Real-Time Data Warehouse: The Next Stage in Data Warehouse Evolution


By Michael Haisten

Note: This article originally appeared in DM Review (August and September, 1999) as three installments of Michael Haisten's online column "The Real-Time Enterprise." Parts of this article were voted by DM Review readers to the Hall of Fame and Top 10 List as among the most popular writing to have appeared in DM Review.


The concept of data warehousing is older than most people realize. The road from there to here has been rough, to say the least. The initial expectations turned out to be unattainable (enterprise integration). Our fallback position was too tactical (data marts). The options grew to be too diverse, too ambiguous and downright confusing.

Happily, we have entered a period of synthesis in which a new consensus on design is emerging. This is a good thing because we are about to embark on possibly the most radical path yet - real-time data warehousing is within our grasp.

The bulk of this article is a review of data warehouse history. The intent is to set the stage for a series of discussions on the future of data warehousing and enterprise architecture.

Architectural Beginnings (1978-1988)
The origin of the data warehouse can be traced to studies at MIT in the 1970s which were targeted at developing an optimal technical architecture. At the time, the craft of data processing was evolving into the profession of information management. For the first time, the MIT researchers differentiated between operational systems and analytic applications. Their intent was to develop architectural guidelines for developing new solutions from the ground up. A core principle that emerged was to segregate operational and analytic processing into layers with independent data stores and radically different design principles.

One contributing factor was the extremely limited processing and storage capacities that existed at the time. This helped justify the Information Center[1] phenomenon in the 1980s that was motivated strongly by the desire to off-load the new and wildly unpredictable analytic load from the transactional systems platform. Fortunately, though, the heart of the argument was that these two forms of information processing are, and by their very nature would always be, very different from one another - so different that they require unique architectural designs.

In the mid- to late-1980s, Digital Equipment Corporation (DEC) was one of the most technically advanced companies on the planet and was the first to build a distributed network architecture to host their business applications. DEC was also the first to migrate to a relational database environment using their own DBMS product, Rdb.

DEC planned to be the showcase for a new approach to application architecture. They pulled together a team from many different disciplines including engineering, marketing and finance as well as information technology groups. The mandate was not only to invent this new architecture but also to apply it to Digital's own global financial systems. The architecture team proceeded to combine the MIT design principles with their newfound expertise in networks and relational databases to create the Technical Architecture 2 (TA-2) specification. TA-2 defined four service categories: data capture, data access, directory and user services.

Data capture services are the operational components of the MIT model, and the data access services are the analytic component. Each of these services would be implemented as one or more distributed applications interacting with segregated data stores on a worldwide network. The TA-2 team added directory services to help people find what they needed on the network and user services to support direct interaction with the data.

User services contained the human-machine interfaces for all the other services. This was the most novel twist in the architecture. For the first time, the interface was defined as a standalone program interacting over a network with multiple independent data service providers. It is ironic that Digital Equipment Corporation, the company that said personal computers would never find a home in the enterprise environment, was the first to implement client/server computing on a large scale.

Enterprise Integration (1988)
Meanwhile, IBM was tackling a different aspect of the information management problem. One of the most vexing problems reported by their customers was the growth of information silos. Whether due to decentralized operations or mergers or changes in business model or whatever, companies faced the daunting task of integrating data from many separate systems with dissimilar coding schemes.

In 1988, Barry Devlin and Paul Murphy of IBM Ireland tackled the problem of enterprise integration head-on. They introduced the term "information warehouse" for the first time and defined it as: "A structured environment supporting end users in managing the complete business and supporting the (information technology department) in managing data quality."[2]

IBM still knew a good thing when they saw it. Information Centers sold a lot of boxes in the '80s. Maybe this information warehouse could be their ticket to renewed growth in the 1990s. But IBM failed to follow up their intellectual lead with anything but marketing rhetoric. Where was the architecture? How do you go about building one of these things?

Between 1988 and 1991, I was a member of a cross-functional team that set out to fill this gap. Our ambition was nothing less than to build a complete design guide for application development that married the new client/server capabilities to the traditional, generally mainframe, computing environment. Rather than start from scratch, our intent was to update and expand the DEC TA-2 architecture while adding the enterprise integration goals of IBM's information warehouse concept. We even toyed with the idea of calling our architecture TA-3 before we settled on VITAL (virtually integrated technical architecture life cycle).[3]

On the technology side, we had to incorporate personal computers, graphical interfaces, object oriented component environments and local area networks into the mix. Our greatest contribution, however, was to flesh out the details of hundreds of architectural components for application development in the client/server era. We specified more than 85 components in our data warehouse section alone including generic extraction, transformation, validation, loading, cube development and graphical query tools. Since you all know the sorry history of Apple Computer in enterprise computing, I am sure you can guess what happened to these leading edge designs.

By the first years of this decade, an initial rationale existed for data warehousing supported by well-developed architectural frameworks. The principle of analytic segregation was well established. The need for a clean, integrated, cross-functional view of the business was gaining ground. The base technologies including common relational access, ubiquitous networking, client/server computing and graphical interfaces with component (object oriented) software were on the way.

Bleeding-edge companies built data warehouses between 1988 and 1991. The pieces were in place for wider adoption. All that was needed was the catalyst.

DSS Part 1 - Enterprise Data Warehouse (1991)
In 1991, Bill Inmon published his first book on data warehousing.[4] Rather than focus exclusively on why you need one or what it is supposed to do for you, Inmon provided the first widely available how-to guide on building a data warehouse. This book contains the most widely published definition of a data warehouse.

Inmon's definition helped immensely by providing concrete guidelines. The core components of his definition are as relevant today as when they were published eight years ago. (Figure below)

Defining the data warehouse

A Data Warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions.

Subject-oriented: Focus on natural data groups, not applications boundaries.

Integrated: Provide consistent formats and encodings.

Time-variant: Data is organized by time and is stored in diverse time slices.

Nonvolatile: No updates are allowed. Only load and retrieval operations.

Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This was a radical departure from serving only the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side).

The integration goal was taken from the realm of enterprise rhetoric down to something attainable. Integration is not the act of wiring applications together. Nor is it simply commingling data from a variety of sources. Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.

Time variance is the most confusing Inmon concept but also a most pivotal one. At its essence, it calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. You might have detail for seven years along with weekly, monthly and quarterly aggregates of differing duration. The time-variant strategy is essential, not only for performance but also for maintaining the consistency of reported summaries across departments and over time.[5]
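
To make the time-variant idea concrete, here is a minimal Python sketch (the names sales_detail and sales_monthly are purely illustrative) of keeping atomic detail alongside a separately stored aggregate of coarser periodicity:

    from collections import defaultdict
    from datetime import date

    # Atomic detail rows: (sale_date, product, amount) - a hypothetical layout.
    sales_detail = [
        (date(1999, 1, 4), "widget", 100.0),
        (date(1999, 1, 11), "widget", 250.0),
        (date(1999, 2, 1), "gadget", 75.0),
    ]

    def monthly_aggregate(detail):
        """Roll the atomic detail up into a monthly time slice."""
        totals = defaultdict(float)
        for sale_date, product, amount in detail:
            totals[(sale_date.year, sale_date.month, product)] += amount
        return dict(totals)

    # The aggregate is materialized and stored in its own right, so the monthly
    # numbers reported to one department never drift from those reported to another.
    sales_monthly = monthly_aggregate(sales_detail)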

Non-volatile design is essential. It is also the principle most often violated or poorly implemented. Non-volatility literally means that once a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When you update a data row, you destroy information. You can never recreate a fact or total that included the unmodified data. Maintaining "institutional memory" is one of the higher goals of data warehousing.
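
As an illustration of the principle rather than a prescription, this Python sketch keeps an insert-only history of customer rows, each stamped with its load time, and recreates the row as of any point in time without ever modifying what was written (the names are hypothetical):

    from datetime import datetime

    # Append-only store: every change arrives as a new row; nothing is ever updated.
    customer_history = []

    def load(customer_id, status, loaded_at):
        """Nonvolatile load: write a new row, never overwrite an existing one."""
        customer_history.append(
            {"customer_id": customer_id, "status": status, "loaded_at": loaded_at}
        )

    def as_of(customer_id, point_in_time):
        """Recreate the row exactly as it stood at the given point in time."""
        versions = [
            row for row in customer_history
            if row["customer_id"] == customer_id and row["loaded_at"] <= point_in_time
        ]
        return max(versions, key=lambda row: row["loaded_at"]) if versions else None

    load(42, "prospect", datetime(1999, 1, 1))
    load(42, "active", datetime(1999, 6, 1))
    print(as_of(42, datetime(1999, 3, 1)))  # still a prospect as of March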

Inmon lays out several other principles that are not a component of his definition. Some of these principles were initially controversial but are commonly accepted now. Others are still in dispute or have fallen into disfavor. Modification of one principle is the basis for the next leap forward.

Enterprise Model
For Inmon, the foundation of his subject-oriented design is an enterprise data model. His roots are in data administration, and he has long favored a top-down approach. As the pace of change accelerated in this decade, information technology organizations became more tactical. The data warehouse became the last bastion for enterprise modeling advocates as companies switched from build to buy. Even here, top-down enterprise modeling has been supplanted by more focused subject area designs.

Enterprise Scope
The biggest casualty in this arena has been the goal of building an enterprise-spanning data warehouse. There are too many sources, too many uses, too much volatility and so little time. Moreover, it has proven impossible to get and hold the support of management long enough to build such a broad-based asset. At least it has with the methods used so far.

Decision Support
A basic assumption of Inmon is that a data warehouse is exclusively a store of data for decision support. In his definition, this precludes the use of a data warehouse for what he calls operational reporting. Early on, his writings were ambiguous regarding whether a data warehouse could be used to feed data back to the operational world or to downstream analytic systems.

This intentionally restricted DSS-only perspective initiated a debate that goes to the very heart of what data warehousing is all about. What defines operational reporting? Is it transaction detail? Near real-time visibility? A current-only view? Is it okay to use a data warehouse as a data delivery hub or does this conflict with the DSS mission? How do we balance the multiple purposes? More on this later.

Atomic Detail
The need to collect atomic detail to build a historic base of reusable data is a component of the early architectures and is supported by the Inmon approach. However, many early decision support advocates questioned the value of retaining the most granular detail. After all, they argued, management analysis is always down at an intermediate summary level. Database administrators and operations managers colluded with these misguided DSS folks, claiming performance impacts and capacity constraints.

Early champions stressed that data warehouses had to be built for future needs not just current requirements. For every request such as, "Show me product category totals by month and region," there lingers, just over the horizon, the follow-on requirement to see products within category or stores within regions or a daily trend. It is far cheaper, even in the short run, to collect and keep the atomic detail than it is to write an expensive program to summarize the data only to have to go back to the source again.

This battle has been won. The multiterabyte databases of today are a result of these voracious needs being satisfied.

Snapshot Capture
In support of his view of decision support, Inmon defined the need to capture a data "snapshot" representing a consistent state of data at a specific point in time. There is no arguing with the basic principle of consistency management. However, this concept has led us down several blind alleys and set us on a course we are only now about to reverse.

The most simplistic interpretation is that you copy the source ("snapshot") and load it as is into a data warehouse table. During each cycle, you essentially empty the warehouse table and repeat the process. Inmon never supported this form of full refresh processing, but those who failed to understand the fundamental essence of nonvolatility and historical retention used his definition as justification.

The correct interpretation of snapshot is the capture of net incremental changes. This is now commonly understood. However, debate rages over how to detect and store these changes.

The snapshot concept led to the ETL (extraction/transformation/loading) approach to filling up a data warehouse that is still in vogue today. It is based on the assumption that you build a data warehouse as a DSS add-on independently of the source systems that already exist. The early architectural treatments assumed that the operational and analytic solutions would be co-designed. At a minimum, the source systems could be, and should be, modified to feed data optimally to the data warehouse. The ETL method has you look back at the source database periodically to select, by some filtering mechanism, only those instances added, changed or deleted. Early design theorists considered this time-consuming and error-prone task of finding what changed after the fact a distant third choice.
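
A minimal sketch of that after-the-fact filtering mechanism, assuming both the prior and current source snapshots fit in memory and are keyed by primary key (all names here are illustrative):

    def diff_snapshots(previous, current):
        """Detect net incremental changes by comparing two keyed snapshots of the source."""
        inserts = {k: v for k, v in current.items() if k not in previous}
        deletes = {k: v for k, v in previous.items() if k not in current}
        updates = {
            k: v for k, v in current.items()
            if k in previous and previous[k] != v
        }
        return inserts, updates, deletes

    prior = {1: ("widget", 9.95), 2: ("gadget", 4.50)}
    today = {1: ("widget", 10.95), 3: ("gizmo", 2.25)}
    print(diff_snapshots(prior, today))
    # ({3: ('gizmo', 2.25)}, {1: ('widget', 10.95)}, {2: ('gadget', 4.5)})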

When data warehousing got started, many implementations acquired data only once a month. Most high-impact data warehouses today have ramped up to daily acquisition for most data sources. This is likely to be the theoretical limit of ETL processing.

What we are now seeing is the return of co-engineered solutions. This will change data warehousing fundamentally. Real-time acquisition is coming and will expand our range of opportunities enormously.

Relational Versus Star Versus Cube
One of the most contentious debates has been around the ONE CORRECT data design philosophy. This debate has many facets, but I believe its greatest effect is on how you store incremental net change detail. Inmon is the champion of third normal form storage of historical detail. He offered a start and end date scheme for recording changed data, but several other methods have also been proposed. Ralph Kimball later offered dimensional modeling and his star schema structure. He alone defined five different methods for recording changing data.
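
For illustration only, here is a sketch of a start and end date scheme in Python (the column names are hypothetical); note that closing out the prior version touches an existing row, which is one reason the competing methods for recording change differ:

    from datetime import date

    OPEN_END = date(9999, 12, 31)      # sentinel meaning "still current"
    product_history = []               # rows: {key, attributes, start_date, end_date}

    def record_change(key, attributes, effective):
        """Close the prior version (if any) and open a new one as of 'effective'."""
        for row in product_history:
            if row["key"] == key and row["end_date"] == OPEN_END:
                row["end_date"] = effective
        product_history.append(
            {"key": key, "attributes": attributes,
             "start_date": effective, "end_date": OPEN_END}
        )

    record_change("SKU-1", {"price": 9.95}, date(1999, 1, 1))
    record_change("SKU-1", {"price": 10.95}, date(1999, 7, 1))
    # The July row is current; the January row covers 1999-01-01 through 1999-07-01.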

In the end, Inmon's definition of data warehousing was the spark that ignited widespread interest while carrying the seeds of controversy that would cause increasing fragmentation in approach.

DSS Part 2 - Data Marts (1994)
The first evidence of a split in the data warehousing ranks came with the introduction of the concept of data marts. The difficulty of selling, designing and implementing an enterprise solution caused a high casualty rate among the early adopters. Many abandoned any attempt to plan top down. Those that succeeded concentrated on delivering business value and started small and grew incrementally. The best of the breed operated with a plan that would grow to support cross-functional needs. In effect, they built to enterprise scope one piece at a time.

Unfortunately, early failures caused many enterprises to retreat from the concept of data warehousing. Vendors and analysts, concerned about losing their meal ticket, took an architectural concept out of context and sold it as a justifiable standalone solution. The term "data mart" was introduced into architectural frameworks as a spin-off of the data warehouse optimized for a particular department or function. Its very name implied a retail location that required a wholesale supplier of data behind it in order to exist. The concept of a data mart without a data warehouse (i.e., without coordinated data acquisition and preparation) is an oxymoron.

Convincing clients to build smaller, more contained departmental solutions saved the industry in the short run. There is little doubt this was a necessary evil. However, it violated the most fundamental principle of data warehousing: creating a single point of distribution. Integration and consistency goals cannot be achieved when each department sources, cleans, transforms and loads its own data. Standalone data marts threatened to increase the very chaos that data warehousing was conceived to eliminate.

Even with this narrowed scope, many data mart projects also began to fail. Some reasons are classic IT failure modes: lack of a clear business driver and poor execution. However, a shocking number were based on bad design. Many used the Inmon model to justify their use of traditional OLTP design methods. The fact that this was a misreading of Inmon's intent did not deter anyone dead set on using what they already knew to build these new-fangled databases. OLTP-influenced third normal form designs failed utterly to support the DSS needs for accessibility and performance.

Luckily, Ralph Kimball's first book, The Data Warehouse Toolkit,[6] hit the market just as the data mart craze really took off. This best-selling book provided detailed design guidance on how to optimize data for analysis. Dimensional modeling spanned the gap from traditional relational design to the multidimensional databases that gave rise to the OLAP moniker. It provided a bridge between business requirement analysis and optimal physical data design that did not exist for decision support.

The year 1994 featured the launch of the OLAP offensive, followed by the ROLAP counter-offensive. The debate widened from relational versus dimensional to which form of dimensional you want: relational OLAP based on normalized tables, a star-schema form of denormalized design, or a true multidimensional database.

Data Warehouse: Divergence (1996-97)
Enterprise warehouse versus department marts. Relational versus dimensional. OLAP versus ROLAP. As if this were not enough, analysts started looking around at what people were actually building and concluded: none of the above. Many warehouse/marts violated one or more of the fundamental principles such as nonvolatility and point-in-time snapshots. What they were really doing was building a new technology form of externalized operational reporting. Some even allowed direct update that put them squarely back on the operational side of the equation.

To preserve the sanctity of the original definitions, many new labels were floated to describe these not-a-data-warehouse constructs. The most universally accepted term is the operational data store (ODS). An ODS may be built with the same underlying technologies, but differs from a data warehouse/data mart in that it may be real time (not snapshot-based), updateable (volatile rather than nonvolatile) or transactional in nature.

The introduction of the ODS concept was necessary to continue the dialog about what data warehousing is all about. Without it, more design failures would be blamed on the data warehouse concept. The short-term effect, however, was to introduce more chaos, increase the FUD factor and allow more people to justify the wrong approach. In fact, it almost caused the banishment of the data warehouse term altogether.

In 1996, I had clients asking, "Do I need an ODS and a data warehouse and a data mart?" Though the answer should clearly be a resounding NO, I was surprised by the growing dissonance in the data warehousing community. I was shocked when a client in 1997 actually asked me to review a design that called for building all three simultaneously. There was literally no rationale for this design whatsoever other than a magazine article that showed an ODS, a data warehouse and a data mart in the same diagram.

More and more, I heard people say, "I need an ODS and a data mart but not a data warehouse." To them, an ODS meant detail and a data mart meant access and analysis. To them, an ODS uses a normalized design and a data mart uses an access-optimized (generally dimensional) design. Time and again, I resorted to pulling out 7- to 12-year-old monographs to demonstrate that what most of these clients were describing was the essence of original data warehouse models.

Data Warehouse: Synthesis (1998)
Fortunately, 1998 turned out to be a year of synthesis. The one-size-fits-all model of data warehousing is long dead. Happily, frameworks that look amazingly alike - once you remove the dissimilar labels - are replacing the chaos of competing approaches. The Inmon corporate information factory can be aligned with Kimball's new extended architecture in his advanced training. Proprietary approaches from divergent companies are tending toward a common norm.

What these frameworks have in common is the inclusion of multiple layers optimized for different tasks. Each has a layer for staging, a layer for historical detail and a multifaceted layer for optimized access and delivery of data. Kimball (begrudgingly) accepts the use of normalized design for staging or historical layers. Inmon acknowledges the need for dimensional constructs in the access layer. They generally allow for pass-through of transaction level detail and a downstream delivery of aggregated results.

Even the most rabid data mart fanatics have come around to the need for an enterprise solution. The evolution of independent data marts to dependent data marts with common source acquisition and reusable data objects is underway.

What is most heartening is the reconvergence of data warehousing with overall application architecture. Data warehousing will no longer be tacked on as an afterthought but part and parcel of the whole solution. The emergence of products such as SAP's Business Warehouse Solution can be seen on one level as just a marketing move to capture more of the IT dollar. What it indicates to me is the beginning of mainstream integration between the operational and analytic worlds.

Data Warehouse 2000: Real-Time Data Warehousing
Our next step in the data warehouse saga is to eliminate the snapshot concept and the batch ETL mentality that has dominated since the very beginning. The majority of our developmental dollars and a massive amount of processing time go into retrieving data from operational databases. What if we eliminated this whole write then detect then extract process? What if the data warehouse read the same data stream that courses into and between the operational system modules? What if data that was meaningful to the data warehouse environment was written by the operational system to a queue as it was created?

This is the beginning of a real-time data warehouse model. But, there is more.

What if we had a map for every operational instance that defined its initial transformation and home location in the data warehouse detail/history layer? What if we also had publish-and-subscribe rules that defined downstream demands for this instance in either raw form or as a part of some derivation or aggregation? What if this instance was propagated from its operational origin through its initial transformation then into the detail/history layer and to each of the recipient sites in parallel and in real time?

What would you say if I said you could do this today?
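
As a rough sketch of the idea, assume the operational modules can write each meaningful instance to a queue as it is created; every name below - the queue, the map, the transformation and the subscriber - is illustrative rather than a reference to any particular product:

    import queue

    # Operational modules write each meaningful instance to this queue as it
    # is created, instead of the warehouse extracting it after the fact.
    event_queue = queue.Queue()

    detail_history = []     # the warehouse detail/history layer
    running_totals = {}     # one downstream subscriber's derivation

    def transform_order(raw):
        """Initial transformation defined by the map for the 'order' instance."""
        return {"order_id": raw["id"], "amount": round(raw["amt"], 2)}

    def subscribe_order(row):
        """A publish-and-subscribe rule: a downstream aggregate wants orders."""
        running_totals["orders"] = running_totals.get("orders", 0.0) + row["amount"]

    # The map: instance type -> (transformation, home store, subscribers).
    WAREHOUSE_MAP = {
        "order": (transform_order, detail_history, [subscribe_order]),
    }

    def propagate(event):
        """Carry one instance from its operational origin through transformation,
        into the detail/history layer, and on to every subscriber."""
        transform, home, subscribers = WAREHOUSE_MAP[event["type"]]
        row = transform(event["payload"])
        home.append(row)              # land in the detail/history layer
        for deliver in subscribers:   # fan out downstream in parallel
            deliver(row)

    # Simulate the operational system emitting an instance in real time.
    event_queue.put({"type": "order", "payload": {"id": 7, "amt": 19.994}})
    while not event_queue.empty():
        propagate(event_queue.get())

In a production setting the queue would be a durable messaging product and the subscribers remote processes, but the shape of the flow - transform once, land in detail/history, fan out in parallel - is the same.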


1. The term Information Center can be traced to a design monograph developed at IBM Canada in 1979.
2. Devlin, B.A. and Murphy, P.T. "An Architecture for a Business and Information System." IBM Systems Journal, Volume 27, No. 1, 1988.
3. The original meaning of VITAL was VAX IBM Tandem Application Life cycle, in deference to the dominant platforms in use at Apple at the time, despite the Macintosh, of course. This gives you some hint at the scope of our integration challenge.
4. W.H. Inmon, Building the Data Warehouse, QED/Wiley, 1991.
5. The need for a time-variant design is often misunderstood, overlooked or outright refuted by those who are most partial to Inmon's approach. Inmon is cited as the champion of the normalized storage model as opposed to star schema or cube methods. Those who advocate exclusively normalized design are also those who reject the need for physical storage of multiple aggregate levels. They abhor redundancy and erroneously consider a time-variant design to be a redundant design.
6. Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, 1996.

