The Real-Time Data Warehouse: The Next Stage in Data Warehouse Evolution

Defining the data warehouse

The integration goal was taken from the realm of enterprise rhetoric down to something attainable. Integration is not the act of wiring applications together. Nor is it simply commingling data from a variety of sources. Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible.

Time variance is the most confusing Inmon concept but also a most pivotal one. At its essence, it calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. You might have detail for seven years along with weekly, monthly and quarterly aggregates of differing duration. The time-variant strategy is essential, not only for performance but also for maintaining the consistency of reported summaries across departments and over time.5

Non-volatile design is essential. It is also the principle most often violated or poorly implemented. Non-volatility literally means that once a row is written, it is never modified. This is necessary to preserve incremental net change history, which, in turn, is required to represent data as of any point in time. When you update a data row, you destroy information. You can never recreate a fact or total that included the unmodified data. Maintaining "institutional memory" is one of the higher goals of data warehousing. (A brief sketch of the non-volatile, time-variant approach in practice follows the Decision Support discussion below.)

Inmon lays out several other principles that are not a component of his definition. Some of these principles were initially controversial but are commonly accepted now. Others are still in dispute or have fallen into disfavor. Modification of one principle is the basis for the next leap forward.

Enterprise Model

For Inmon, the foundation of his subject-oriented design is an enterprise data model. His roots are in data administration, and he has long favored a top-down approach. As the pace of change accelerated in this decade, information technology organizations became more tactical. The data warehouse became the last bastion for enterprise modeling advocates as companies switched from build to buy. Even here, top-down enterprise modeling has been supplanted by more focused subject area designs.

Enterprise Scope

The biggest casualty in this arena has been the goal of building an enterprise-spanning data warehouse. There are too many sources, too many uses, too much volatility and so little time. Moreover, it has proven impossible to get and hold the support of management long enough to build such a broad-based asset - at least with the methods used so far.

Decision Support

A basic assumption of Inmon's is that a data warehouse is exclusively a store of data for decision support. In his definition, this precludes the use of a data warehouse for what he calls operational reporting. Early on, his writings were ambiguous regarding whether a data warehouse could be used to feed data back to the operational world or to downstream analytic systems. This intentionally restricted, DSS-only perspective initiated a debate that goes to the very heart of what data warehousing is all about. What defines operational reporting? Is it transaction detail? Near real-time visibility? A current-only view? Is it okay to use a data warehouse as a data delivery hub, or does this conflict with the DSS mission? How do we balance the multiple purposes? More on this later.
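As promised above, here is a minimal sketch of the non-volatile, time-variant principles in practice. It is written in Python with invented table and column names (fact_sales, load_date and so on) purely for illustration; it is not a prescription for any particular product or schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical append-only store of atomic fact rows: a row, once written, is
# never modified. Corrections arrive as new rows, so every total ever reported
# can be recreated as of any earlier load date.
fact_sales = []  # each row: dict(sale_date, product, store, amount, load_date)

def append_sale(sale_date, product, store, amount, load_date):
    """Write a new row; deliberately, no update or delete path exists."""
    fact_sales.append({"sale_date": sale_date, "product": product,
                       "store": store, "amount": amount, "load_date": load_date})

def monthly_aggregate(rows):
    """Time-variant summary: a stored copy per (month, product) could sit
    alongside weekly and quarterly versions of differing duration."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["sale_date"].strftime("%Y-%m"), r["product"])] += r["amount"]
    return dict(totals)

def as_of(rows, cutoff):
    """Point-in-time view: only the rows known as of a given load date."""
    return [r for r in rows if r["load_date"] <= cutoff]

# A later adjustment is appended, never overwritten, so the January total
# reported on Feb 1 can still be reproduced after the March correction lands.
append_sale(date(1999, 1, 15), "widgets", "store-01", 100.0, date(1999, 2, 1))
append_sale(date(1999, 1, 15), "widgets", "store-01", -10.0, date(1999, 3, 1))
print(monthly_aggregate(as_of(fact_sales, date(1999, 2, 1))))  # {('1999-01', 'widgets'): 100.0}
print(monthly_aggregate(as_of(fact_sales, date(1999, 3, 1))))  # {('1999-01', 'widgets'): 90.0}
```

The design choice to illustrate is simply that history is preserved by appending rather than updating, and that summaries of differing periodicity are derived (and may be stored) from that untouched detail.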
Atomic Detail

The need to collect atomic detail to build a historic base of reusable data is a component of the early architectures and is supported by the Inmon approach. However, many early decision support advocates questioned the value of retaining the most granular detail. After all, they argued, management analysis is always done at an intermediate summary level. Database administrators and operations managers colluded with these misguided DSS folks, claiming performance impacts and capacity constraints. Early champions stressed that data warehouses had to be built for future needs, not just current requirements. For every request such as, "Show me product category totals by month and region," there lingers, just over the horizon, the follow-on requirement to see products within category, or stores within regions, or a daily trend. It is far cheaper, even in the short run, to collect and keep the atomic detail than it is to write an expensive program to summarize the data only to have to go back to the source again. This battle has been won. The multiterabyte databases of today are a result of these voracious needs being satisfied.

Snapshot Capture

In support of his view of decision support, Inmon defined the need to capture a data "snapshot" representing a consistent state of data at a specific point in time. There is no arguing with the basic principle of consistency management. However, this concept has led us down several blind alleys and set us on a course we are only now about to reverse. The most simplistic interpretation is that you copy the source (the "snapshot") and load it as is into a data warehouse table. During each cycle, you essentially empty the warehouse table and repeat the process. Inmon never supported this form of full refresh processing, but those who failed to understand the fundamental essence of non-volatility and historical retention used his definition as justification. The correct interpretation of snapshot is the capture of net incremental changes. This is now commonly understood. However, debate rages over how to detect and store these changes.

The snapshot concept led to the ETL (extraction/transformation/loading) approach to filling a data warehouse that is still in vogue today. It is based on the assumption that you build a data warehouse as a DSS add-on, independently of the source systems that already exist. The early architectural treatments assumed that the operational and analytic solutions would be co-designed. At a minimum, the source systems could be, and should be, modified to feed data optimally to the data warehouse. The ETL method has you look back at the source database periodically to select, by some filtering mechanism, only those instances added, changed or deleted. Early design theorists considered this time-consuming and error-prone task of finding what changed after the fact a distant third choice. When data warehousing got started, many implementations acquired data only once a month. Most high-impact data warehouses today have ramped up to daily acquisition for most data sources. This is likely to be the theoretical limit of ETL processing. What we are now seeing is the return of co-engineered solutions. This will change data warehousing fundamentally. Real-time acquisition is coming and will expand our range of opportunities enormously.
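To illustrate the "correct interpretation of snapshot" described above, the sketch below detects net incremental changes by comparing the previous capture of a source table with the current one, emitting only the adds, changes and deletes instead of reloading a full copy. The record layouts and key names are invented; this is one hedged way to do after-the-fact change detection, and co-engineered or real-time feeds avoid the comparison entirely.

```python
def net_changes(previous: dict, current: dict):
    """Compare two keyed snapshots of a source table and return only the
    incremental changes (the classic after-the-fact ETL filtering step).
    Both arguments map a business key to the row's attribute values."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif row != previous[key]:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, previous[key]))
    return changes

# Hypothetical customer rows: only the net change is sent onward, preserving
# non-volatile history instead of emptying and reloading warehouse tables.
yesterday = {"C001": {"name": "Acme", "region": "East"},
             "C002": {"name": "Bolt", "region": "West"}}
today     = {"C001": {"name": "Acme", "region": "Central"},  # changed
             "C003": {"name": "Crane", "region": "West"}}    # new; C002 dropped
for action, key, row in net_changes(yesterday, today):
    print(action, key, row)
```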
Relational Versus Star Versus Cube

One of the most contentious debates has been around the ONE CORRECT data design philosophy. This debate has many facets, but I believe its greatest effect is on how you store incremental net change detail. Inmon is the champion of third normal form storage of historical detail. He offered a start and end date scheme for recording changed data, but several other methods have also been proposed. Ralph Kimball later offered dimensional modeling and his star schema structure. He alone defined five different methods for recording changing data. In the end, Inmon's definition of data warehousing was the spark that ignited widespread interest while carrying the seeds of controversy that would cause increasing fragmentation in approach.

DSS Part 2 - Data Marts (1994)

The first evidence of a split in the data warehousing ranks came with the introduction of the concept of data marts. The difficulty of selling, designing and implementing an enterprise solution caused a high casualty rate among the early adopters. Many abandoned any attempt to plan top down. Those that succeeded concentrated on delivering business value and started small and grew incrementally. The best of the breed operated with a plan that would grow to support cross-functional needs. In effect, they built to enterprise scope one piece at a time. Unfortunately, early failures caused many enterprises to retreat from the concept of data warehousing. Vendors and analysts, concerned about losing their meal ticket, took an architectural concept out of context and sold it as a justifiable stand-alone solution.

The term "data mart" was introduced into architectural frameworks as a spin-off of the data warehouse optimized for a particular department or function. Its very name implied a retail location that required a wholesale supplier of data behind it in order to exist. The concept of a data mart without a data warehouse (i.e., without coordinated data acquisition and preparation) is an oxymoron. Convincing clients to build smaller, more contained departmental solutions saved the industry in the short run. There is little doubt this was a necessary evil. However, it violated the most fundamental principle of data warehousing: creating a single point of distribution. Integration and consistency goals cannot be achieved when each department sources, cleans, transforms and loads its own data. Stand-alone data marts threatened to increase the very chaos that data warehousing was conceived to eliminate.

Even with this narrowed scope, many data mart projects also began to fail. Some reasons are classic IT failure modes: lack of a clear business driver and poor execution. However, a shocking number were based on bad design. Many used the Inmon model to justify their use of traditional OLTP design methods. The fact that this was a misreading of Inmon's intent did not deter anyone dead set on using what they already knew to build these newfangled databases. OLTP-influenced third normal form designs failed utterly to support the DSS needs for accessibility and performance. Luckily, Ralph Kimball's first book, The Data Warehouse Toolkit,6 hit the market just as the data mart craze really took off. This best-selling book provided detailed design guidance on how to optimize data for analysis. Dimensional modeling spanned the gap from traditional relational design to the multidimensional databases that gave rise to the OLAP moniker. It provided a bridge between business requirement analysis and optimal physical data design that did not exist for decision support.
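To show what the dimensional approach looks like alongside the start/end-date style of change tracking mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, row_effective and so on) are invented for illustration; carrying effective and expiry dates on a dimension row so that changes are recorded as new rows rather than overwrites is one of the several techniques Kimball catalogued, shown here only as a sketch, not as the definitive design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A minimal star: one fact table surrounded by denormalized dimensions.
cur.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_code  TEXT,      -- business key from the source system
    category      TEXT,
    row_effective DATE,      -- start/end dates record changes as new rows,
    row_expiry    DATE       -- never as in-place updates
);
CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,
    store_id  TEXT,
    region    TEXT
);
CREATE TABLE fact_sales (
    sale_date   DATE,
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    amount      REAL
);
""")

# The product changes category: a second dimension row is appended and the
# old row is closed off, so history reported under the old category survives.
cur.executemany("INSERT INTO dim_product VALUES (?,?,?,?,?)", [
    (1, "P-100", "Hand Tools",  "1998-01-01", "1999-06-30"),
    (2, "P-100", "Power Tools", "1999-07-01", "9999-12-31"),
])
cur.execute("INSERT INTO dim_store VALUES (1, 'S-01', 'East')")
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)", [
    ("1999-05-10", 1, 1, 250.0),
    ("1999-08-10", 2, 1, 300.0),
])

# "Product category totals by region" resolves against the version of the
# product row that was in effect when each sale occurred.
for row in cur.execute("""
    SELECT p.category, s.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_store   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.region"""):
    print(row)
```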
The year 1994 featured the launch of the OLAP offensive, followed by the ROLAP counter-offensive. The debate widened from relational vs. dimensional to which form of dimensional you want: relational OLAP based on normalized tables, a star schema form of denormalized design, or a true multidimensional database.

Data Warehouse: Divergence (1996-97)

Enterprise warehouse versus department marts. Relational versus dimensional. OLAP versus ROLAP. As if this were not enough, analysts started looking around at what people were actually building and concluded: none of the above. Many warehouse/marts violated one or more of the fundamental principles such as non-volatility and point-in-time snapshots. What they were really doing was building a new technology form of externalized operational reporting. Some even allowed direct update, which put them squarely back on the operational side of the equation. To preserve the sanctity of the original definitions, many new labels were floated to describe these not-a-data-warehouse constructs. The most universally accepted term is the operational data store (ODS). An ODS may be built with the same underlying technologies, but differs from a data warehouse/data mart in that it may be real time (not snapshot-based), updateable (volatile rather than non-volatile) or transactional in nature.

The introduction of the ODS concept was necessary to continue the dialog about what data warehousing is all about. Without it, more design failures would be blamed on the data warehouse concept. The short-term effect, however, was to introduce more chaos, increase the FUD factor and allow more people to justify the wrong approach. In fact, it almost caused the banishment of the data warehouse term altogether. In 1996, I had clients asking, "Do I need an ODS and a data warehouse and a data mart?" Though the answer should clearly be a resounding NO, I was surprised by the growing dissonance in the consulting community. I was shocked when a client in 1997 actually asked me to review a design that called for building all three simultaneously. There was literally no rationale for this design whatsoever other than a magazine article that showed an ODS, a data warehouse and a data mart in the same diagram. More and more, I heard people say, "I need an ODS and a data mart but not a data warehouse." To them, an ODS meant detail and a data mart meant access and analysis. To them, an ODS uses a normalized design and a data mart uses an access-optimized (generally dimensional) design. Time and again, I resorted to pulling out 7- to 12-year-old monographs to demonstrate that what most of these clients were describing was the essence of the original data warehouse models.

Data Warehouse: Synthesis (1998)

Fortunately, 1998 turned out to be a year of synthesis. The one-size-fits-all model of data warehousing is long dead. Happily, frameworks that look amazingly alike - once you remove the dissimilar labels - are replacing the chaos of competing approaches. The Inmon corporate information factory can be aligned with Kimball's new extended architecture in his advanced training. Proprietary approaches from divergent companies are tending toward a common norm. What these frameworks have in common is the inclusion of multiple layers optimized for different tasks. Each has a layer for staging, a layer for historical detail and a multifaceted layer for optimized access and delivery of data. Kimball (begrudgingly) accepts the use of normalized design for staging or historical layers.
Inmon acknowledges the need for dimensional constructs in the access layer. These frameworks generally allow for pass-through of transaction-level detail and downstream delivery of aggregated results. Even the most rabid data mart fanatics have come around to the need for an enterprise solution. The evolution of independent data marts to dependent data marts, with common source acquisition and reusable data objects, is underway. What is most heartening is the reconvergence of data warehousing with overall application architecture. Data warehousing will no longer be tacked on as an afterthought but will be part and parcel of the whole solution. The emergence of products such as SAP's Business Warehouse Solution can be seen on one level as just a marketing move to capture more of the IT dollar. What it indicates to me is the beginning of mainstream integration between the operational and analytic worlds.

Data Warehouse 2000: Real-Time Data Warehousing

Our next step in the data warehouse saga is to eliminate the snapshot concept and the batch ETL mentality that has dominated since the very beginning. The majority of our development dollars and a massive amount of processing time go into retrieving data from operational databases. What if we eliminated this whole write-then-detect-then-extract process? What if the data warehouse read the same data stream that courses into and between the operational system modules? What if data that was meaningful to the data warehouse environment was written by the operational system to a queue as it was created? This is the beginning of a real-time data warehouse model. But there is more. What if we had a map for every operational instance that defined its initial transformation and home location in the data warehouse detail/history layer? What if we also had publish-and-subscribe rules that defined downstream demands for this instance, in either raw form or as part of some derivation or aggregation? What if this instance was propagated from its operational origin through its initial transformation, then into the detail/history layer and to each of the recipient sites, in parallel and in real time? What would you say if I said you could do this today?
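The sketch below is one hedged reading of the real-time model just described: the operational system writes each meaningful change to a queue as it is created, a map assigns the instance a transformation and a home in the detail/history layer, and publish-and-subscribe rules fan the same instance out to downstream recipients. All module, queue and subscriber names are invented for illustration, and a real implementation would sit on a messaging product rather than an in-process queue (and could deliver to subscribers in parallel rather than in a simple loop).

```python
import queue

# The operational module writes each meaningful change to a queue as it
# happens, instead of waiting for a periodic extract to rediscover it.
change_queue: "queue.Queue[dict]" = queue.Queue()

def operational_write(event: dict):
    """Called by a hypothetical order-entry module at transaction time."""
    change_queue.put(event)

# Map: the initial transformation and home location in the detail/history layer.
detail_history = {"order_detail": []}

def transform_order(event: dict) -> dict:
    return {"order_id": event["id"], "amount": round(event["amt"], 2),
            "captured_at": event["ts"]}

# Publish-and-subscribe rules: downstream recipients of the same instance,
# either raw or as part of some derivation or aggregation.
region_totals: dict = {}

def update_region_totals(row: dict, event: dict):
    region_totals[event["region"]] = region_totals.get(event["region"], 0.0) + row["amount"]

subscribers = [update_region_totals]

def propagate():
    """Drain the queue: land each instance in the detail layer, then fan it out."""
    while not change_queue.empty():
        event = change_queue.get()
        row = transform_order(event)
        detail_history["order_detail"].append(row)  # non-volatile append
        for deliver in subscribers:
            deliver(row, event)

# The change is available downstream as soon as it is propagated, with no
# periodic re-extraction from the operational database.
operational_write({"id": "O-1", "amt": 99.95, "ts": "1999-11-01T09:30", "region": "East"})
propagate()
print(detail_history["order_detail"], region_totals)
```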
Notes

1. The term Information Center can be traced to a design monograph developed at IBM Canada in 1979.
2. Devlin, B.A. and Murphy, P.T. "An Architecture for a Business and Information System." IBM Systems Journal, Volume 27, No. 1, 1988.
3. The original meaning of VITAL was VAX IBM Tandem Application Life cycle, in deference to the dominant platforms in use at Apple at the time - despite the Macintosh, of course. This gives you some hint at the scope of our integration challenge.
4. W.H. Inmon, Building the Data Warehouse, QED/Wiley, 1991.
5. The need for a time-variant design is often misunderstood, overlooked or outright refuted by those who are most partial to Inmon's approach. Inmon is cited as the champion of the normalized storage model as opposed to star schema or cube methods. Those who advocate exclusively normalized design are also those who reject the need for physical storage of multiple aggregate levels. They abhor redundancy and erroneously consider a time-variant design to be a redundant design.
6. Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, John Wiley & Sons, 1996.