Data Warehouses, Decision Support and Data Mining
This paper provides an overview of data warehousing and OLAP technologies: back-end tools for extracting, cleaning, and loading data into a data warehouse; multidimensional data models typical of OLAP; front-end client tools for querying and data analysis; and server extensions for efficient query processing. The emphasis is on data warehouse applications such as Decision Support Systems (DSS), On-Line Analytical Processing (OLAP), and Data Mining, which deliver advanced analytical capabilities.

Contents
1. Introduction
2. Data Warehousing Architecture and End-to-End Process
3. Decision Support Back End Tools and Utilities
4. Conceptual Model and Front End Tools
5. OLTP Database Design Methodology
6. Data Mining
   a. Goals of Data Mining
   b. Data Mining Applications
   c. Standard Data Mining Process
   d. CRISP Data Mining Process
7. Phases in the DM Process: CRISP-DM
8. Conclusion
9. References

Chapter 1 Introduction

Data warehousing is a collection of decision support technologies aimed at enabling knowledge workers (executives, managers, and analysts) to make better and faster decisions.
Data warehousing technologies have been successfully deployed in many industries: manufacturing (for order shipment and customer support), retail (for user profiling and inventory management), financial services (for claims analysis, risk analysis, credit card analysis, and fraud detection), transportation (for fleet management), telecommunications (for call analysis and fraud detection), utilities (for power usage analysis), and healthcare (for outcomes analysis). This paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).
A data warehouse is a “subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.” Typically, the data warehouse is maintained separately from the organization’s operational databases, and there are many reasons for doing so. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by the operational databases.
OLTP applications typically automate clerical data processing tasks, such as order entry and banking transactions, that are the essential day-to-day operations of an organization. These tasks are structured and repetitive, and consist of short, atomic, isolated transactions. The transactions require detailed, up-to-date data, and read or update a few (tens of) records, accessed typically on their primary keys. Operational databases range from hundreds of megabytes to gigabytes in size. Consistency and recoverability of the database are critical, and maximizing transaction throughput is the key performance metric.
Consequently, the database is designed to reflect the operational semantics of known applications, and, in particular, to minimize concurrency conflicts. Data warehouses, in contrast, are targeted for decision support. Historical, summarized and consolidated data is more important than detailed, individual records. Since data warehouses contain consolidated data, perhaps from several operational databases, over potentially long periods of time, they tend to be orders of magnitude larger than operational databases; enterprise data warehouses are projected to be hundreds of gigabytes to terabytes in size.
The workloads are query intensive with mostly ad hoc, complex queries that can access millions of records and perform a lot of scans, joins, and aggregates. Query throughput and response times are more important than transaction throughput. To facilitate complex analyses and visualization, the data in a warehouse is typically modeled multidimensionally. For example, in a sales data warehouse, time of sale, sales district, salesperson, and product might be some of the dimensions of interest.
Often, these dimensions are hierarchical; time of sale may be organized as a day-month-quarter-year hierarchy, and product as a product-category-industry hierarchy. Many organizations want to implement an integrated enterprise warehouse that collects information about all subjects (e.g., customers, products, sales, assets, personnel) spanning the whole organization. However, building an enterprise warehouse is a long and complex process, requiring extensive business modeling, and may take many years to succeed. Some organizations are settling for data marts instead, which are departmental subsets focused on selected subjects (e.g., a marketing data mart may include customer, product, and sales information). These data marts enable faster roll-out, since they do not require enterprise-wide consensus, but they may lead to complex integration problems in the long run if a complete business model is not developed.

Data Mining may be viewed as a set of automated search procedures for discovering credible and actionable insights from large volumes of high-dimensional data. Often, there is an emphasis on symbolic learning and modeling methods (i.e., techniques that produce interpretable results) and on data management methods (which provide scalable techniques for large data volumes). Data Mining employs techniques from statistics, pattern recognition, and machine learning. Many of these methods are also frequently used in vision, speech recognition, image processing, handwriting recognition, and natural language understanding. However, the issues of scalability and automated business intelligence solutions drive much of the work in data mining and differentiate it from these other applications of machine learning and statistical modeling.
Chapter 2 Data Warehousing Architecture and End-to-End Process

Figure 1. Data Warehousing Architecture

The architecture includes tools for extracting data from multiple operational databases and external sources; for cleaning, transforming, and integrating this data; for loading data into the data warehouse; and for periodically refreshing the warehouse to reflect updates at the sources and to purge data from the warehouse, perhaps onto slower archival storage. In addition to the main warehouse, there may be several departmental data marts.
Data in the warehouse and data marts is stored and managed by one or more warehouse servers, which present multidimensional views of data to a variety of front end tools: query tools, report writers, analysis tools, and data mining tools. Finally, there is a repository for storing and managing metadata, and tools for monitoring and administering the warehousing system. The warehouse may be distributed for load balancing, scalability, and higher availability.
In such a distributed architecture, the metadata repository is usually replicated with each fragment of the warehouse, and the entire warehouse is administered centrally. An alternative architecture, implemented for expediency when it may be too expensive to construct a single logically integrated enterprise warehouse, is a federation of warehouses or data marts, each with its own repository and decentralized administration.

Chapter 3 Decision Support Back End Tools and Utilities

Data warehousing systems use a variety of data extraction and cleaning tools, and load and refresh utilities, for populating warehouses.
Data extraction from “foreign” sources is usually implemented via gateways and standard interfaces (such as Information Builders EDA/SQL, ODBC, Oracle Open Connect, Sybase Enterprise Connect, and Informix Enterprise Gateway).

Data Cleaning

Since a data warehouse is used for decision making, it is important that the data in the warehouse be correct. However, since large volumes of data from multiple sources are involved, there is a high probability of errors and anomalies in the data.
Therefore, tools that help to detect data anomalies and correct them can have a high payoff. Some examples where data cleaning becomes necessary are: inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violations of integrity constraints. Not surprisingly, optional fields in data entry forms are significant sources of inconsistent data.

Load

After extracting, cleaning, and transforming, data must be loaded into the warehouse. Additional preprocessing may still be required: checking integrity constraints; sorting; summarization, aggregation, and other computation to build the derived tables stored in the warehouse; building indices and other access paths; and partitioning the data across multiple target storage areas. Typically, batch load utilities are used for this purpose. In addition to populating the warehouse, a load utility must allow the system administrator to monitor status; to cancel, suspend, and resume a load; and to restart after a failure with no loss of data integrity. The load utilities for data warehouses have to deal with much larger data volumes than those for operational databases.
There is only a small time window (usually at night) when the warehouse can be taken offline to refresh it. Sequential loads can take a very long time; e.g., loading a terabyte of data can take weeks or even months. Hence, pipelined and partitioned parallelism are typically exploited. Doing a full load has the advantage that it can be treated as a long batch transaction that builds up a new database. While it is in progress, the current database can still support queries; when the load transaction commits, the current database is replaced with the new one.
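A load that records periodic checkpoints can resume from the last committed batch after a failure instead of restarting from scratch. The following is a minimal sketch of that idea; the function names, batch size, and the file-based checkpoint are illustrative assumptions, not any vendor's load utility.

```python
import json
import os

CHECKPOINT = "load.ckpt"  # illustrative checkpoint location

def load_with_checkpoints(records, apply_batch, batch_size=2):
    """Load `records` in batches, recording progress after each batch
    so a failed load can resume from the last committed offset."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_offset"]  # resume point
    for offset in range(start, len(records), batch_size):
        batch = records[offset:offset + batch_size]
        apply_batch(batch)  # e.g., a bulk insert into the warehouse
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_offset": offset + len(batch)}, f)
    os.remove(CHECKPOINT)  # load completed; discard the checkpoint

# Demo: start clean, then load five records in batches of two.
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)
loaded = []
load_with_checkpoints([1, 2, 3, 4, 5], loaded.extend)
print(loaded)  # [1, 2, 3, 4, 5]
```

If the process dies between batches, the checkpoint file still names the next offset, so rerunning the loader skips the batches already applied.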
Using periodic checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint.

Refresh

Refreshing a warehouse consists of propagating updates on source data to correspondingly update the base data and derived data stored in the warehouse. There are two sets of issues to consider: when to refresh, and how to refresh. Usually, the warehouse is refreshed periodically (e.g., daily or weekly). Only if some OLAP queries need current data (e.g., up-to-the-minute stock quotes) is it necessary to propagate every update.
The refresh policy is set by the warehouse administrator, depending on user needs and traffic, and may be different for different sources. Refresh techniques may also depend on the characteristics of the source and the capabilities of the database servers. Extracting an entire source file or database is usually too expensive, but may be the only choice for legacy data sources. Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas.
Such replication servers can be used to incrementally refresh a warehouse when the sources change. There are two basic replication techniques: data shipping and transaction shipping. In data shipping (e.g., used in the Oracle Replication Server and Praxis OmniReplicator), a table in the warehouse is treated as a remote snapshot of a table in the source database. After-row triggers are used to update a snapshot log table whenever the source table changes, and an automatic refresh schedule (or a manual refresh procedure) is then set up to propagate the updated data to the remote snapshot.
In transaction shipping (e.g., used in the Sybase Replication Server and Microsoft SQL Server), the regular transaction log is used instead of triggers and a special snapshot log table. At the source site, the transaction log is sniffed to detect updates on replicated tables, and those log records are transferred to a replication server, which packages up the corresponding transactions to update the replicas. Transaction shipping has the advantage that it does not require triggers, which can increase the workload on the operational source databases.
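The transaction-shipping idea can be simulated in a few lines: the source appends change records to a log as part of each update, and the replica is refreshed by replaying only the records after the last position it has seen. The log format, table structures, and function names below are illustrative assumptions, not the mechanism of any particular replication server.

```python
# Simulated transaction shipping: the replica replays change records
# "sniffed" from the source's log, starting from its last position.

source_table = {}
txn_log = []  # stand-in for the DBMS transaction log

def apply_update(key, value):
    source_table[key] = value
    txn_log.append(("upsert", key, value))  # recorded with the change

def delete(key):
    del source_table[key]
    txn_log.append(("delete", key, None))

def refresh_replica(replica, log, from_pos):
    """Replay log records after `from_pos` onto the replica and return
    the new log position, so the next refresh is incremental."""
    for op, key, value in log[from_pos:]:
        if op == "upsert":
            replica[key] = value
        else:
            replica.pop(key, None)
    return len(log)

apply_update("c1", "Acme")
apply_update("c2", "Globex")
replica, pos = {}, 0
pos = refresh_replica(replica, txn_log, pos)  # initial catch-up
delete("c1")
pos = refresh_replica(replica, txn_log, pos)  # incremental refresh
print(replica)  # {'c2': 'Globex'}
```

Each refresh touches only the new tail of the log, which is what makes the technique incremental rather than a full re-extract of the source table.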
However, it cannot always be used easily across DBMSs from different vendors, because there are no standard APIs for accessing the transaction log. Such replication servers have been used for refreshing data warehouses. However, the refresh cycles have to be chosen carefully so that the volume of data does not overwhelm the incremental load utility. In addition to propagating changes to the base data in the warehouse, the derived data also has to be updated correspondingly. The problem of constructing logically correct updates for incrementally updating derived data (materialized views) has been the subject of much research.
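The core of incremental view maintenance for an aggregate can be sketched very simply: rather than recomputing a summary table from the full base data, each base-table change is turned into a signed delta and folded into the stored aggregate. This is only an illustration of the idea (the table and function names are assumptions), not a published maintenance algorithm.

```python
from collections import defaultdict

# A materialized summary table: total sales per city, maintained
# incrementally from deltas instead of recomputed from the base table.
sales_by_city = defaultdict(float)

def apply_deltas(deltas):
    """Each delta is (city, signed_amount): an inserted base row carries
    +amount, a deleted row carries -amount, and an update carries both."""
    for city, amount in deltas:
        sales_by_city[city] += amount

apply_deltas([("Paris", 100.0), ("Oslo", 40.0), ("Paris", 60.0)])
apply_deltas([("Paris", -100.0)])  # a base row was deleted
print(dict(sales_by_city))  # {'Paris': 60.0, 'Oslo': 40.0}
```

Sums and counts compose cleanly with signed deltas like this; other aggregates (e.g., min or max under deletions) need extra bookkeeping, which is part of why the general problem has attracted so much research.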
For data warehousing, the most significant classes of derived data are summary tables, single-table indices, and join indices.

Chapter 4 Conceptual Model and Front End Tools

A popular conceptual model that influences the front-end tools, database design, and the query engines for OLAP is the multidimensional view of data in the warehouse. In a multidimensional data model, there is a set of numeric measures that are the objects of analysis. Examples of such measures are sales, budget, revenue, inventory, and ROI (return on investment).
Each of the numeric measures depends on a set of dimensions, which provide the context for the measure. For example, the dimensions associated with a sale amount can be the city, the product name, and the date when the sale was made. The dimensions together are assumed to uniquely determine the measure. Thus, the multidimensional data model views a measure as a value in the multidimensional space of dimensions. Each dimension is described by a set of attributes. For example, the Product dimension may consist of four attributes: the category and the industry of the product, the year of its introduction, and the average profit margin.
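The point that the dimensions together uniquely determine a measure can be made concrete by addressing each measure value with a tuple of dimension values. The dimension names and sample data below are illustrative assumptions.

```python
# Multidimensional view: each sale amount (the measure) is addressed by
# one point in the space of dimensions (city, product, date).
sales = {
    ("Madras", "Jacket", "1995-06-01"): 120.0,
    ("Madras", "Shoes",  "1995-06-01"): 80.0,
    ("Delhi",  "Jacket", "1995-06-02"): 200.0,
}

# The dimension values together uniquely determine the measure:
amount = sales[("Madras", "Jacket", "1995-06-01")]
print(amount)  # 120.0
```

Any operation on this model, such as aggregation, amounts to collapsing one or more coordinates of this space, which is exactly what the OLAP operations in the next paragraphs do.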
Figure 2

Another distinctive feature of the conceptual model for OLAP is its stress on aggregation of measures by one or more dimensions as one of the key operations; e.g., computing and ranking the total sales by each county (or by each year). Other popular operations include comparing two measures (e.g., sales and budget) aggregated by the same dimensions. Time is a dimension of particular significance to decision support (e.g., trend analysis). Often, it is desirable to have built-in knowledge of calendars and other aspects of the time dimension.
Front End Tools

The multidimensional data model grew out of the view of business data popularized by PC spreadsheet programs that were extensively used by business analysts. The spreadsheet is still the most compelling front-end application for OLAP. The challenge in supporting a query environment for OLAP can be crudely summarized as that of supporting spreadsheet operations efficiently over large multi-gigabyte databases. One of the popular operations supported by the multidimensional spreadsheet application is pivoting.
Consider the multidimensional schema of Figure 2 represented in a spreadsheet where each row corresponds to a sale. Let there be one column for each dimension and an extra column that represents the amount of the sale. The simplest view of pivoting is that it selects two dimensions that are used to aggregate a measure, e.g., sales in the above example. The aggregated values are often displayed in a grid where each value in the (x, y) coordinate corresponds to the aggregated value of the measure when the first dimension has the value x and the second dimension has the value y.
Thus, in our example, if the selected dimensions are city and year, then the x-axis may represent all values of city and the y-axis may represent the years. The point (x, y) will represent the aggregated sales for city x in year y. Thus, what were values in the original spreadsheet have now become row and column headers in the pivoted spreadsheet. Other operators related to pivoting are rollup and drill-down. Rollup corresponds to taking the current data object and doing a further group-by on one of the dimensions. Thus, it is possible to roll up the sales data, perhaps already aggregated on city, additionally by product.
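Pivoting and rollup as described above can be sketched over a flat "spreadsheet" of sale rows. The row layout (city, product, year, amount) and the sample data are illustrative assumptions; a real OLAP server would of course do this over indexed, precomputed aggregates rather than a Python list.

```python
from collections import defaultdict

# Each row of the "spreadsheet": one column per dimension plus the measure.
rows = [
    ("Madras", "Jacket", 1995, 100.0),
    ("Madras", "Shoes",  1995,  50.0),
    ("Madras", "Jacket", 1996,  80.0),
    ("Delhi",  "Jacket", 1995,  70.0),
]

DIM = {"city": 0, "product": 1, "year": 2}  # dimension -> column index

def pivot(rows, dim_x, dim_y):
    """Aggregate the sale amount over two chosen dimensions; the result
    maps (x, y) -> total sales, i.e., one cell of the pivoted grid."""
    grid = defaultdict(float)
    for row in rows:
        grid[(row[DIM[dim_x]], row[DIM[dim_y]])] += row[3]
    return dict(grid)

by_city_year = pivot(rows, "city", "year")
print(by_city_year[("Madras", 1995)])  # 150.0

# Rollup: group the data, already aggregated by city, further by product.
by_city_product = pivot(rows, "city", "product")
print(by_city_product[("Madras", "Jacket")])  # 180.0
```

Drill-down is simply the inverse direction: moving from a coarse grid such as by_city_product back toward the detailed rows by adding a dimension to the group-by.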