Planning for a Data Warehouse

Michael B. Haisten

Clarify Your Intentions

So you plan to build a data warehouse. Why? The concept has become so popular that it is treated as a self-justifying proposition. As in the hit film Field of Dreams, normally sound-minded IS professionals firmly believe: If we build it, they will come!

One factor shared by all successful data warehouse efforts is a pragmatic rationale. Articulation of the potential benefits is a minimum starting point. The ability to paint a rosy picture of future possibilities is a valuable skill. But real success comes from effectively associating your warehouse proposal with the daily reality of the enterprise. Don’t confuse vision with rationale. Vision defines the path. Your rationale pays for the journey.

The most powerful justifications for a data warehouse investment are qualitative goals. Typical objectives culled from actual executive presentations are:
Statements like these help illuminate your vision. A common failing, though, is to present these statements as if they are, in fact, the rationale. What does it mean to improve information access? How do we ensure this has the same meaning for everyone? Every goal should do more than simply point out a direction. Each should tell a consistent story. Each must carry specific deliverables and must, somehow and in some form, be measurable.

A data warehouse can rarely be justified based solely, or even primarily, on operational cost reduction.

Where do you turn to add substance to these fuzzy statements of direction? The normal tendency of IS professionals is to revert to a tactical pitch involving cost reduction. “Reduce reporting costs by 30%” is typical of these cost-centered arguments. It sounds pretty specific, doesn’t it, and it comes equipped with a number. What more could you ask for? Well, for one, it helps if it is based on even a shred of truth.

Let’s follow this chain of logic: Give me some money now and I promise to eliminate a batch of operational reports at some time in the future. I will accomplish this by creating a pool of data and training the users to wade in on their own to siphon off what they need. This is only an elaborate shell game which management will see through, sooner or later. In the best possible outcome, you simply transfer costs from the very visible IS budget into a user cost center. If lucky, you produce a slim IS-only return on investment. In a more typical case, the saving from eliminating an operational expense does not even cover the development costs. And in any scenario, the business as a whole experiences a negative return.

Ironically, a data warehouse can return real benefits that go unnoticed when the promised objective is not achieved. Good work cannot rise above bad rationale. Help your project build on a solid foundation.

The best cost objective: Achieve a permanent reduction in the full lifecycle cost of delivering information.
A data warehouse solution can become a highly efficient process for responding to new information needs. Traditional means of extracting information from our data systems are highly labor intensive and, at best, linear in cost. Dealing with older legacy systems can result in exponential costs; it costs more and more every time you go back to the well. The objective of data warehouse implementation is to initiate a data acquisition and delivery process that offers lower marginal cost with each new user over time.

The best overall objective: Align your goals explicitly with a strategic business initiative.

Link your data warehouse to the strategic plan of your enterprise. Not the IS business plan - I am talking about the business business plan. I use the motto AIM Higher to remind me to search these plans for opportunities in three areas:
These strategic goals often suggest the right subject areas and business context to pursue. For instance, the analysis example implies a market subject area that commingles internal data on sales volume, pricing, and profitability with external data on market share. The integration example explicitly cites a product and customer subject area. Indirectly, it implies a financial data warehouse spanning the newly independent divisions.

Re-engineering, renewal, or re-whatever often involves significant application migration activity. Migration spawns many productive justifications for a data warehouse. One can be implemented in advance to shield information consumers from disruption due to the transition. A data warehouse can simultaneously serve a forward conversion role as well as its normal information access function. A data warehouse can also supplement information access and analysis deficiencies in new applications.

The worst (and frighteningly frequent) objective: a technical proof of concept for data warehousing. This is the goal of way too many pilot projects. Suffice it to say data warehouse projects justified exclusively on technical grounds have the highest failure rate.

Create a Program

Moral: Offer a bold, comprehensive, and rigorous program with a pragmatic bias.

With a strong, credible rationale, you are off to a solid start. The next step is to frame what you intend to do. Spread the vision: outline a multi-year, multi-phase plan, and propose a pilot supporting this direction. When you articulate what the data warehouse will become, always position it as evolving incrementally to match the changing needs of the enterprise. In early releases, you will be perfecting techniques and building technical infrastructure. Make sure management realizes that each early release will tackle a new business subject area and will offer additional information deliverables. Then prepare to fulfill these promises.
Vision background: not just another database, not just a project, more than infrastructure.

A data warehouse is not just another database. Most database designs are derivative, intended to serve the processing requirements of a specific application. A data warehouse database is both primitive and primary, intended as the definitive source for a broad spectrum of basic business facts. The prime directive of traditional database development is to optimize efficient handling of a known set of transactions. The data warehouse prime directive is to optimize availability of cross-functional data in support of future needs.

A data warehouse is not just a project. A project is the managed organization of resources to produce a defined set of deliverables in a specified timeframe. A data warehouse effort does involve several time-sequenced deliverables: raw atomic data acquired from operational systems, collections of data prepared for use, and executable procedures to perform this processing. However, these deliverables are produced using exploratory techniques that are radically different from traditional methodologies. And, though there is a definite beginning, there should be no end to the data warehouse. Sourcing, packaging, and delivering data will continue and evolve as the business changes.

But the differences go deeper than this. When you initiate a data warehouse implementation, you are investing in a whole array of new services supported by the deployment of new technical infrastructure. You select and introduce analysis tools. You plan and implement training programs. Support procedures need to be designed and support mechanisms must be put into place. This effort takes place amidst an evolving set of new information management practices.

A data warehouse is a program to manage sharable information acquisition and delivery universally. A data warehouse, like your neighborhood library, is both a resource and a service.
The value of library resources is determined by the breadth and depth of the collection. The value of library services is based on how quickly and easily they can assist you in finding and using what you need. Unlike a library, a data warehouse must take on the role of manufacturer and distributor as well. As in a factory, raw materials are collected from operational systems and packaged for use by information consumers. Similar to a public utility, a data warehouse uses a common distribution network to deliver products to the point of use. The value of a factory is determined by how well it produces the finished goods that consumers most need today and into the future. The value of a utility is measured by the efficiency and timeliness of delivery.

Avoid the false promises of the “just do it” advocates.

Whew, this sounds like you must propose an expensive mega-project. Don’t all the other data warehouse advisors say that you can - in fact, that you should - start on a small-scale, limited-exposure pilot project? If you follow the advice of the quick start or just do it advocates literally, you are signing up for a quick trip to nowhere. Many people are duped into thinking that a pilot project is the limit of their planning horizon.

There is nothing wrong with moving out fast if you know where you are going. Starting with a real plan need not take forever, only two or three months. It will be time well spent. Take the time to look for strategic opportunities. Create the outline of a long-term plan. Define potential future destinations and alternate paths to get there. Pitch your pilot as the logical first step.

Outline Information Deliverables

Moral: Learn what is different about data warehousing, since commonly held beliefs will cause you to fail.

Now let’s turn our attention to the outputs of a data warehouse. First, it helps to understand the various types of data that should be stored on your warehouse shelves.
Then we can discuss how you decide what to put into these containers. A data warehouse contains a far wider variety of containers than traditional applications. Expect to design four major types of containers: core data, collections, archive, and metadata. (see box, below)
Core data is the raw material from which all other data tables are built. This data should be brought in with minor cleanup and conversions to make it more useful. Avoid selective acquisition and take all rows from the source. Avoid summarization on the way in and keep the detail. Avoid creating new calculated values at this point, but it is okay to import derived data when it is essential to maintain auditability. New derived data is created later when the collections are produced. Core data is typically stored in a normalized form.

Collections are manufactured in the warehouse environment to satisfy known and anticipated needs. Aggregations are produced to support repetitive access of related data drawn from multiple base tables. Summaries are created based on usage experience, not by a priori analysis. Prepacks and exports satisfy specific needs of your consumers or the tools they use. By definition, all collections are denormalized.

Though the terms used here may be unfamiliar, virtually every warehouse effort builds tables containing the kind of data described above. Unfortunately this is not the case for archive and metadata. Though extensive lip service is given to the need to maintain history, few projects go beyond adding timestamps to online data warehouse tables. A virtual avalanche of words is lavished on the essential role of metadata. Yet in the daily grind, this is always something that can be addressed in the next release.

The central role of the archive and comprehensive use of metadata are two features that clearly set the data warehouse concept apart from prior information access initiatives. Build them into your plan at an early stage. As with everything else, you can start small, but do include these features.

The archive contains data packaged for easy retrieval and reintegration with the online warehouse data. Raw data at the most granular level (atomic) is stored to ensure no loss of information over time.
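As an illustration only - the article prescribes no implementation, and every table and field name below is invented - the core-data acquisition rules described above can be sketched as a load routine that takes every source row, keeps the detail, and applies nothing beyond light cleanup:

```python
# Sketch of core-data acquisition: keep every row, keep the detail,
# apply only minor cleanup and conversions. All names are illustrative.

def clean(row):
    """Light cleanup only: trim text, standardize codes, normalize types.
    No filtering, no summarization, no new calculated values."""
    return {
        "order_id":   row["order_id"].strip(),
        "cust_code":  row["cust_code"].strip().upper(),  # standardize code case
        "order_date": row["order_date"][:10],            # normalize to YYYY-MM-DD
        "qty":        int(row["qty"]),
        "unit_price": float(row["unit_price"]),
        # 'ext_amount' arrives already derived in the source; it is imported
        # as-is for auditability rather than recomputed here.
        "ext_amount": float(row["ext_amount"]),
    }

def acquire_core(source_rows):
    # Take ALL rows from the source -- no selective acquisition.
    return [clean(r) for r in source_rows]
```

New derived values would be computed later, when collections are built from these core rows, not during acquisition.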
In some cases, the archive is the only place where this level of detail is available. Context history is stored to ensure that you can recreate the data environment that existed at any period.

Metadata is the most discussed but least understood. Most people think of a data dictionary first, but this is the least valuable form of definition metadata. Within this subcategory, knowing the ultimate source for a data element, meanings rather than descriptions, and references to how the data is commonly used are more important than the basic data dictionary-style information. Navigation data is intended to help the consumer find what exists and how to use it. The card catalog analogy suggests some of the possibilities of navigation. The least understood form of metadata, administrative, is possibly the most powerful in the long run. It plays an essential role in process integration and automation.

A data warehouse must be designed to support unknown future needs of unidentified consumers.

When you turn your attention to selecting the data for your core tables, you immediately run into a wall. How can we possibly plan for the unknown? The most difficult hurdle is to realize you must try. There are three classic ways to fail when an information consumer searches for data: 1) not having the data in the form they desire, 2) not having the required detail, or 3) not having the data element they need at all.

The first problem is a matter of design that can be fixed on the fly. A collection can be built to satisfy this need from core data already in the warehouse. The latter two, however, are a factor of advance planning. It takes considerable time to acquire missing data - time the consumer does not have. Traditional requirements gathering methods must be abandoned or extensively modified. Traditional practices and commonly held beliefs lead us to:
Getting only the known requirements leads to repeated, costly trips to the well. Try, instead, to:
Start from the premise that you will drain every potentially usable nibble of data out of the source systems. Fight at all turns the tendency to accept summarized data rather than the ultimate atomic level of detail. Make your associates justify not including data rather than the other way around. You will have to back off, to some degree, but you are coming from a more defensible position for the long run.

Outline the Technology

Moral: How you build it is at least as important as what you build.

Much attention has been devoted to Information Architecture in data warehouse articles and conferences. The vast majority of data warehouse speakers and writers are (former) database analysts, data modelers, data analysts or other practitioners of data-centric fields. Almost no attention is given to technical architecture or the practice of defining the optimal way to build a warehouse.

Process components are the building blocks of a data warehouse.

A data warehouse is more than a computer, a DBMS, a network, and a client/server query tool. These things are the infrastructure services that form the backbone upon which you build a data warehouse. But what is the essence of the data warehouse itself? From the perspective of technical architecture, a data warehouse is a set of processing components that interact to make enterprise information available to information consumers:
To build a complete data warehouse solution you need to build, buy, or manually perform these services.

Create a technical plan for your data warehouse that is technology-free.

A well-defined technical plan allows you to specify how your data warehouse will function and what services it will provide. The plan can define what technical components you intend to offer initially, and with subsequent releases, thus providing a common framework defining growth over time. A technical plan that does not list any vendor-specific products by name (technology-free) provides an excellent foundation for making the decision of what to buy, build, or perform manually. It can be used to frame the requirements for product selection and, later, to provide purchase justification for essential components.

Determine Automation Targets

Moral: Failure to automate a data warehouse leads to failure.

The basic development tasks are so deceptively simple that most people resort to our normal manual practices. This includes writing all the code by hand and writing a separate custom procedure for each activity. To continue to add value, a data warehouse must adapt to changing needs. New data must be continually acquired. Available raw data must continually be packaged and repackaged to support new demands, tools, and forms of analysis. The key to survival is to remain dynamic. The key to remaining dynamic is to automate repetitive tasks and reuse prior deliverables whenever and wherever you can.

Go from design to implementation in one step.

The holy grail of Computer-Aided Systems Engineering is to eliminate all coding. For the most part, CASE has been a monumental failure. Sophisticated upper CASE modeling tools and methods failed to reduce development time, and resource requirements often increased. Now, in the specialized field of data warehouse development, CASE techniques are finally beginning to pay off.
I will give you just two examples. First, it is now possible to use graphical modeling tools to design, implement, and maintain a data warehouse database. In fact, this is the preferred method. Maintaining synchronization between the graphical image and the physical database is essential. You accomplish this by eliminating all manual manipulation of the warehouse database data definition language statements. This can and does work for a data warehouse.

The second high productivity area involves the new breed of extract and load code generators. Using very different approaches, these products allow you to define the data source(s), the warehouse target table, and the mappings and transformations to get from one to the other. They then generate code to operate on the specific platforms in your environment. Productivity is increased by a factor of two to five times for initial development, and is considerably greater for revisions to existing feeds. Given the number of data feeds, the rapid change, and the ability to reuse prior definitions, these products pay for themselves very quickly.

Buy generalized utilities whenever and wherever possible. Otherwise, consider building them.

Though the current concept of a data warehouse is new, many tasks necessary to develop and operate one are well understood. Existing tools and utilities can be harnessed for data warehouse use. New products are regularly introduced to simplify data design or management tasks. Where a repetitive activity exists, there is an opportunity to develop a custom utility. Examples abound. Here are just a few:
Some analysis tools require data to be structured in a very explicit fashion. Rather than write a unique piece of code for each occurrence, write a generic utility to prepare data for these products.

Provide the glue to integrate off-the-shelf products into a seamless process.

Many labor-intensive data warehouse tasks can be streamlined by simply stitching together existing products and procedures. For instance, the dynamic nature of a data warehouse means there are many frequent changes in the batch processing cycle. Wouldn’t it be nice if you could update a diagram of the batch dependency plan to change the schedule? How about automating the process of error escalation? Think how much easier it would be if all interested parties were automatically notified by e-mail when a data load fails. One solution incorporates a PERT charting application, the DBMS of your choice, your existing schedule management software, one or more mail packages and a mail gateway, along with a client/server development tool and a bit of code. All this functionality, and more, can be made available by creative use of building blocks.

Plan for automation and reuse in your first project and expand from there.

No data warehouse program can operate cost-effectively without introducing a high degree of automation and reusability. Many opportunities exist. Just search for repetitive tasks that can be reduced or eliminated. Don’t wait. Begin in the pilot.

Select the Tools

Moral: Abandon completely even the desire to find a silver bullet.

Many data warehouse project teams waste enormous amounts of time searching in vain for a silver bullet. They believe their mission is to find the one access tool that will handle all the needs of all their users. Don’t even try. It can’t be done. One size does not fit all. There is a wide and ever-growing diversity of tools for data access, exploration, analysis, and presentation.
The appropriate mission of a data warehouse or decision support team is to understand these diverse alternatives and properly match the tool to the intended usage.

The names typically applied to tools are ambiguous and confusing. Generally, every new name only obscures the issue more. What is the definitive definition of ad hoc query tool, decision support system, executive information system, or online analytic processing? Some terms, like EIS, carry the burden of years of overzealous selling and underwhelming results. To evaluate tools, you need to slot the alternatives into categories that allow for meaningful comparison. Since the traditional terms add little discriminatory power, where do you turn? The first part of the answer is to create purely functional categories. For example:
One category missing from this list is tools such as PC spreadsheets, charting, and presentation tools. In most enterprises today, they are the final packaging vehicles for management presentation. However, they are rarely the tool of choice for direct access to the data warehouse. Also, this taxonomy only includes interactive tools. Many mature technologies exist for batch reporting. A new class of non-interactive tools will come on the market over the next several years. They will use agent-processing techniques to scan remote databases and pass result data back to the client asynchronously.

Match the tool to the consumer, context, and usage.

The next step in selecting access tools is to survey your potential consumers. You need to collect information in three broad interrelated areas: 1) who they are, what their role is, and how they operate, 2) what the business context is, and 3) their typical usage of data. This information will help you select the appropriate tool category(s) for each consumer. Then rank order your potential consumers according to who will be the most involved, productive, and powerful participants. The priority tool categories should become apparent. Several examples follow:
Build a best-in-class tool set of supported products.

Of course you can provide quality support for only a small, finite number of tools. After identifying the tool categories most appropriate for your consumers, you are prepared to evaluate what the market has to offer. The objective is to select the tools you intend to support in each category.

You will never be able to eliminate pirate activity. Some consumers will always find a justifiable reason to pick a non-supported product. We have found that it is far more productive to explain the value of selecting a supported product than to wage a battle to enforce a standard. You will lose customers and goodwill.

Remember to plan regular surveys of the market. You need to remain aware of new features of existing products, new products in each category, and the advent of wholly new categories. Be ready to change your supported product mix when there is genuine need. Resist the tendency to move capriciously to the next best thing. You and your customers need time to become proficient and successful before moving on.

Lay Out the Project

Moral: A data warehouse project is more like scientific research than anything in traditional IS!

The normal IS approach places considerable emphasis on knowing what the expected results are before committing to action. In scientific research, the results are unknown up front, and emphasis is placed on developing a rigorous, step-by-step process to uncover the truth. The activities are hands-on, involving regular interaction between the scientist and the subject and among the participants on the project.

Adopt an exploratory, hands-on process involving cross-disciplinary participation.

The following situations indicate a project in trouble because these techniques are not being used:
Initiate multiple parallel activity tracks simultaneously.

A data warehouse project involves a wide range of deliverables that can be worked on in parallel. The track scheme we advocate is:
This scheme offers a division of labor proven to work in practice. You do not need nine or more people to make this scheme work for you. But it is hard to gain the advantages of parallel activity with only a few participants.

Forcing closure on specific results too soon leads inexorably to disaster.

A general rule of thumb is to break the initial project roughly into thirds: initiation, exploration, and implementation. We assume that the project begins after the planning activity, when you assemble the bulk of the team for action.

Initiation Phase

This phase continues through delivery of the first trial database of live core data. On the consumer track, you outline the expected information deliverables. On the source data track, basic mapping of data from the source systems is defined. The model database track delivers a data warehouse data model containing only core tables and then proceeds to create and load the initial data warehouse database. Ideally, the access tools are selected by the end of this phase. Optionally, you may identify finalists that are to be evaluated during the exploratory phase. Design deliverables are created for the other tracks (technical plan, support, operations, and integration). All the technology to support the development of the data model, database, and data acquisition procedures must be ready when needed during the initiation phase.

Exploration Phase

This phase involves iterative revision of the database and the preparation of procedures that maintain it. On the consumer track, a team of technical and consumer participants uses the access tool suite to develop representative samples of anticipated results. Based on that experience, the model database team members define aggregates and summaries that offer a good balance of user flexibility, production efficiency, and runtime performance. On the source track, the participants refine the mapping and transformation logic based on hands-on experience with the data.
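The mapping and transformation logic being refined on the source track is the same kind of definition an extract and load code generator works from. A minimal sketch, with invented table and field names (real generators use their own declarative formats and emit platform-specific code rather than Python):

```python
# Sketch of a declarative source-to-warehouse mapping. All table and
# field names are invented for illustration.

MAPPING = {
    "source": "ORDERS_MASTER",       # operational source table or file
    "target": "wh_core_orders",      # warehouse core table
    "columns": {
        # target column: (source field, transformation)
        "order_id":  ("ORD-NO",  str.strip),
        "cust_code": ("CUST-ID", lambda v: v.strip().upper()),
        "status":    ("STAT-CD", {"A": "ACTIVE", "C": "CLOSED"}.get),
    },
}

def transform(source_row, mapping=MAPPING):
    """Apply the declared mapping to one source row.

    Because the mapping is data rather than hand-written code, revising a
    feed means editing the declaration, not rewriting a custom procedure.
    """
    return {tgt: fn(source_row[src])
            for tgt, (src, fn) in mapping["columns"].items()}
```

Keeping the mapping declarative is what makes revisions to existing feeds so much cheaper than initial development: prior definitions are reused and only the changed entries are touched.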
The participants clean up the source data and define conversion logic to correct consistency or integrity problems. They modify the core tables based on tradeoffs between acquisition and preparation activity. All technical infrastructure and tools work concludes during the exploration phase. The operations track prepares for implementation and production migration. The support track prepares for initial consumer training in the next phase. Navigation services, the metadata tables, and the metadata maintenance procedures are developed at this time.

Implementation Phase

This phase starts with a freeze on source investigation and consumer experimentation. It concludes with formal release of the current version of the data warehouse. The final data definition, data manipulation, and procedural changes are made. All component services (acquisition, preparation, operations, navigation, access, and delivery) are installed and unit tested. The production schedule is implemented and a full system test is conducted. Initial consumer orientation and training are offered during the production migration activity.

Continue to look over the horizon at the future possibilities.

To ensure that the data warehouse program evolves continuously, management must maintain focus on future opportunities. By the beginning of the implementation phase of the current project, management must already be planning the subsequent project. What new subject area will be tackled? What new consumers will be recruited? What new techniques will be tried? What new technical components will be implemented? And very importantly, how has what you are offering been received and adopted?

Solicit a Team

Moral: A cross-functional program must be inter-disciplinary. Use maximum leverage.
The question I am most often asked is: “How many people are needed to build a data warehouse?” My typical answer is: “Anywhere from two and a third to a cast of hundreds.” From project to project, there are substantial differences in objective and scale. But the resource question is also biased by how you define participation.

When is 2.33 equal to 4 or more? The smallest project in my experience used two systems analysts and a part-time database programmer to build a data warehouse from scratch in eight months. Their subject area, R&D resources, was quite small. Their technical objectives were limited and they supported only one access tool. Yet they were successful enough to move into other subject areas and add new consumer groups while expanding their direct staff to five. When you look closer, the 2.33 full-time equivalents significantly undercounts real participation. At least four full-time equivalents were expended on development alone. This includes developers on the source systems who implemented new and modified data feeds, reviews by a DBA, vendor contractors, an advisory consultant, and others. An unknown number of consumers, developers, and managers were involved before, during, and after development.

Consciously manage multiple spheres of involvement. This is the only real way to “get more from less.”

Everyone wants to know the minimum number of people it will take to get the job done. When pressed for a single number, my answer is seven. This is roughly the critical mass needed to sustain a successful program for a medium to large business. But how this resource is structured varies too wildly to draw any general conclusions. The creative manager builds a small core team and borrows the rest. This is the key to delivering more with less. Our definition of levels of involvement is:
Actively solicit extended team members and participants from other groups, departments, and cost centers.

A data warehouse requires broad expertise from multiple disciplines. Sign on developers from the source systems to acquire the data. Enlist consumers and the IS support personnel. All the IS infrastructure groups need to be on your side. If not, they will be in your face. Reaching out to other groups for expertise increases the management hassles, no doubt. But it can be immeasurably valuable in solidifying cross-functional support. Your data warehouse project should be the future direction for information access. Competition is wasteful and deadly. Don’t even try to do it all yourself. Keep the core team small and recruit a cast of hundreds.

The talent you need on the core team is the hardest to find.

So, who should be on this small core team? Typically, the first talent hired is a data modeler and the second is a database analyst or DBA. While you need these data specialists, you don’t need them as full-time, permanent participants of the core team. The best individuals at the core of your team are people who intimately understand the broad issues of a complete data warehouse program. Ideally, they studied the field extensively and participated in at least one successful project. These folks are hard to come by.

Your next choice should be those with experience packaging data for consumer use: a data analyst with DSS experience, a 4GL programmer with information center or end-user support background, a data-centric developer with real experience using spiral or iterative methods. You need data engineers to get started and to add new data resources to the warehouse. They create the potential. You will need data analysts or data usage specialists to stay in business. They deliver results.

Evangelize Endlessly

Moral: Tell them what you are going to say, say it, then tell them what you said.
The story of a successful data warehouse program is a story often told. You need to keep the vision alive while delivering results. When you begin planning for your own data warehouse, plan to spend a considerable amount of your time on the campaign trail selling and reselling your concept.