Warehousing Data & Data Mining

Warehousing Data: The Data Warehouse, Data Mining, and OLAP

Warehousing data is based on the premise that the quality of a manager's decisions is based, at least in part,on the quality of his information. The goal of storing data in a centralized system is thus to have the means to provide them with the right building blocks for sound information and knowledge. Data warehouses contain information ranging from measurements of performance to competitive intelligence (Tanler 1997).

Data mining tools and techniques can be used to search stored data for patterns that might lead to new insights. Furthermore, the data warehouse is usually the driver of data-driven decision support systems (DSS), discussed in the following subsection.

Thierauf (1999) describes the process of warehousing data, extraction, and distribution. First data extraction of operational production data takes place, and this data is passed on to the warehouse database. A server hosts the data warehouse and the DSS. This server then passes on the extracted data to the warehouse database, which is employed by users to extract data through some form of software.

Theirauf's model for data warehousing is as follows:

Warehousing Data: Design and Implementation

Tanler (1997) identifies three stages in the design and implementation of the data warehouse. The first stage is largely concerned with identifying the critical success factors of the enterprise, so as to determine the focus of the systems applied to the warehouse. The next step is to identify the information needs of the decision makers. This involves the specification of current information lacks and the stages of the decision-making process (i.e. the time taken to analyze data and arrive at a decision). Finally, warehousing data should be implemented in a way that ensures that users understand the benefit early on. The size of the database and the complexity of the analytical requirements must be determined. Deployment issues, such as how users will receive the information, how routine decisions must be automated, and how users with varying technical skills can access the data, must be addressed.

According to Frank (2002), the success of the implementation of the data warehouse depends on:

Accurately specifying user information needs
Implementing metadata: Metadata is essentially data about data. This is regarded as a particularly crucial step. Parankusham & Madupu (2006) outline the different roles of metadata as including: data characterization and indexing, the facilitation or restriction of data access, and the determination of the source and currency of data. They further identify the lifecycle of metadata as:
- Collection: Identification and capture
- Maintenance: Updating of metadata to match changes in data architecture
- Deployment: Users access the relevant metadata, based on their needs.

To this, we can add the 5 criteria presented on the www.syntelinc.com website:

Recognize that the job is probably harder than you expect: A large portion of the data in data warehouses is incorrect, missing, or input in such a way that it is not usable (e.g. historical databases that have not been updated to modern schemas).
Understand the data in your existing systems: Analyze existing databases. Identify relationships between existing data systems so as to avoid inconsistencies when these are moved to the warehouse.
Be sure to recognize equivalent entities: Identify equivalent entities in heterogeneous systems, which may appear under a different name.
Emphasize early wins to build support throughout the organization
Consider outsourcing your data warehouse development and maintenance: Implementing a data warehouse can be a huge task that can often be better handled by experts. Many data warehousing applications are suited for outsourcing.

If properly designed and implemented, the goal of warehousing data is to drastically reduce the time required in the decision making process. To do so, it employs three tools, namely Online Analytical Processing System (OLAP), data mining, and data visualization (Parankusham & Madupu 2006).

OLAP

OLAP allows three functions to be carried out.

Query and reporting: Ability to formulate queries without having to use the database programming language.
Multidimensional analysis: The ability to carry out analyses from multiple perspectives. Tanler (1997) provides an example of a product analysis that can be then repeated for each market segment. This allows for quick comparison of data relationships from different areas (e.g. by location, time, etc.). This analysis can include customers, markets, products, and so on,
Statistical analysis: This function attempts to reduce the large quantities of data into formulas that capture the answer to the query.

OLAP is basically responsible for telling the user what happened to the organization (Theirauf 1999). It thus enhances understanding reactively, using summarization of data and information.

What is Data Mining?

This is another process used to try to create useable knowledge or information from data warehousing. Data mining, unlike statistical analysis, does not start with a preconceived hypothesis about the data, and the technique is more suited for heterogeneous databases and date sets (Bali et al 2009). Karahoca and Ponce (2009) describe data mining as "an important tool for the mission critical applications to minimize, filter, extract or transform large databases or datasets into summarized information and exploring hidden patterns in knowledge discovery (KD)." The knowledge discovery aspect is emphasized by Bali et al (2009), since the management of this new knowledge falls within the KM discipline.

It is beyond the scope of this site to offer an in-depth look at the data mining process. Instead, I will present a very brief overview, and point readers that are interested in the technical aspects towards free sources of information.

Very briefly, data mining employs a wide range of tools and systems, including symbolic methods and statistical analysis. According to Botha et al (2008), symbolic methods look for pattern primitives by using pattern description languages so as to find structure. Statistical methods on the other hand measure and plot important characteristics, which are then divided into classes and clusters.

Data mining is a very complex process with different process models. One is the CRoss-Industry Standard Process for Data Mining (or Crisp-DM). The process involves six steps (Maraban et al, in Karahoca & Ponce 2009):

Business understanding -> data understanding -> data preparation -> modeling -> evaluation -> deployment

For more on data mining see the book "Data Mining and Knowledge Discovery in Real Life Applications", edited by Ponce & Karahoca (2009), available for free from intechopen.com where numerous other potentially relevant resources can also be downloaded.

Data Visualization

This process involves representing data and information graphically so as to better communicate its content to the user. It is a way to make data patterns more visible, more accessible, easier to compare, and easier to communicate. Data visualization includes graphical interfaces, tables, graphs, images, 3D presentations, animation, and so on (Turban & Aaronson in Parankusham & Madupu 2006).

DSS are other tools used in conjunction with warehousing data. These are discussed in the following subsection.

Written by Emil Hajric

Published August 4th, 2023