• Nenhum resultado encontrado

Business Intelligence with I.B.M. Data Warehouse Enterprise Edition

N/A
N/A
Protected

Academic year: 2023

Share "Business Intelligence with I.B.M. Data Warehouse Enterprise Edition"

Copied!
98
0
0

Texto

The created database will then be used in Design Studio and loaded into a Data Model Project ( OLAP. After working on the Data Model Project, we will build a data warehousing application that will be deployed on the Websphere Application server.

Presentation of the A.T.E.I

Goals of the project

Technical Environment and used tools

DB2 Data Warehouse Edition V9.1.1 overview

Introduction

Provided products

DB2 DWE architecture

DB2 DWE components

DB2 Enterprise Server Edition is a multi-user relational database management system specifically tailored for managing data warehouses. This version is specially tuned for data warehouse management due to features like "data compression features" that reduce the need for large disk space.

The fictional company

Presentation of the JK Superstore

Understanding of the need of the company

The sample database

The current technical situation of the company

The Data Warehouse Schema

The Organizational Unit (OU) table is loaded with the various stores belonging to JK Super Store. The Product Table (PD) is designed to contain the various products offered by JK Super Store.

The Data Marts Schema

  • What is a dimensional model
  • The JK-Superstore Marts

This schema contains the aggregated data that will be used to make analyzes of the sales and prices of JK Superstore. The three dimension tables are the TIME, STORE and PRODUCT and the fact table is called PRCHS_PRFL_ANALYSIS.

Illustration 6: The Mart schema
Illustration 6: The Mart schema

Prerequisites before starting with tutorials

How to install DB2 DWE Edition 9.1.1

Creation of the database

Code: Part of .bat file that creates DWE schema and MARTS -- Creating DWE schema tables. Add stored functions, to be used by MARTS schema db2 -tvf sqw\dwesamp_marts_func.sql.

The Eclipse based Integrated Development Environment

Introduction to the Design

Projects in Design Studio

Global aim of these tutorials

The tutorials and their goals

Introduction and prerequisites

Creation of the data design project

Reverse Engineering the existing database

It is possible to have a graphical representation of both schemas from the Data Diagrams folder in the project explorer. The data model must be opened and expanded to have diagrams in the data diagram directory).

Illustration 16: MARTS  schema.
Illustration 16: MARTS schema.

Working on the physical data model

Defining the problems

Adding Primary keys

Adding Associations – Foreign keys

So this can only be used if the names of the columns in the fact table and in the dimensional table are similar.

Illustration 19: Completed MARTS diagram
Illustration 19: Completed MARTS diagram

Validation of the model

In this tutorial, the fact table and one dimensional table will be loaded.(When we can do it for one dimensional table, we can do it for the other. The goal is to understand the mechanism). SUBDIV.OU_IP_ID, -- ID of the subdivision of the store -> In the data flow replaced by the name (JOIN WITH IP). The input port of the operator will be linked to the data set of the Organizational Unit.

IN2_09.EFF_DT The date is stored in the table ITM_TXN and. different representations of this date in the MSR_PRD table. In a data warehousing application, the data flow consists of the base layer of the application.

Propagation of the modification on the database

The Comparison process

The comparison is useful because it provides an overview of the changes that have been made. This report shows that there are no foreign and primary keys in the DWESAMP:DWESAMP.MARTS schema, but they are present in the data model. We can discard the changes to the data model (it will change the schema to be the same as in the database), or propagate the changes to the database.

Our goal is to propagate changes to the database, so the run button is pressed from right to left. In the report, the edits are now propagated, but they are not yet in the database.

The DDL script

The data warehouse project

Introduction

What provides this project?

Definition of a data flow in Design Studio

Definition of some data flow operators

  • File Import operator
  • File Export operator
  • Distinct operator
  • Union operator
  • Table Join operator
  • Select List operator and Where Condition operator
  • Key Lookup operator
  • Easy to understand operators

The Table Join operator also has a Select List option that allows column selection in the design part of the query. This operation in SQL is called SELECTION and should not be confused with the projection that uses SELECT in the query. In illustration 26, the file import operator feeds the Item Transaction table with the day's sales.

The key lookup operator will check if the sold product exists in the PRODUCTS table. On the other hand, the destination table operator feeds a table of data into the input port.

Illustration 23: Example of the Distinct Operator.
Illustration 23: Example of the Distinct Operator.

Data flows in the JK Superstore

Populating the Item Transaction table

Origin of the data and start point of the flow

Cleaning the data

For each lookup port of the key lookup operator, the matching condition must be constructed so that the operator knows hex column to be compared. This tool allows you to create SQL state with the mouse by clicking on inputs, keywords, operations and functions. The data station operator is an intermediate stage, where data can be entered into a step table to check the flow content at any moment in the data flow.

By right-clicking on the data station, we can retrieve data about the data flow. The contents of the displacement table are cleared before each run of the data flow.

Illustration 29: Key lookup operators check the existence of foreign key(s) in lookup  tables to prevent from enforcing the referential integrity.
Illustration 29: Key lookup operators check the existence of foreign key(s) in lookup tables to prevent from enforcing the referential integrity.

Populating a dimensional table of the marts schema

  • Origin of the data
  • Explanation of the problem
  • Resolving it within an SQL query
  • Designing the first level
  • Designing the other levels
  • Overview of the data flow

SELECT STORE.OU_IP_ID, -- Id for the store STORE.OU_CODE, -- Code for the store. STORE.ORG_IP_ID, -- Id of the organization the store has for STORE.PRN_OU_IP_ID, -- Id of the subdivision of the store. REGION.OU_IP_ID, -- Id for the store's region --> same as above DISTRICT.OU_IP_ID -- Id for the store's district --> same as above FROM DWH.OU SUBDIV,DWH.OU REGION,DWH. OU DISTRICT,DWH.OU STORE,DWH.CL CUSTOMERS WHERE SUBDIV.PRN_OU_IP_ID IS NULL --> LEVEL 1 Subdivision is rows without parent AND SUBDIV.OU_IP_ID = REGION.PRN_OU_IP_ID --> LEVEL 2 which has LEVEL .

The first level of the hierarchy is the organization level and is characterized by a null value in the parent column. A Where condition operator holds this hierarchy level with a condition "INPUT_03.PRN_OU_IP_ID IS NULL".

Illustration 31: Part of the data flow that get the top level of the hierarchy.
Illustration 31: Part of the data flow that get the top level of the hierarchy.

Populating a fact table

Origin and details of the data

IN1_09.MKT_BSKT_TXN_ID Join the market basket transaction using the Item Transactions table. IN1_09.PD_ID = IN3_09.PD_ID Connect a transaction item to the product table to get information about the products (expected sales price,..). In addition to the previous operators for joining tables, we start by aggregating the data by using a "Group By" operator and defining the calculation of the various analyzable values ​​in the operator's pick list properties view.

The Group By column is defined in the Group By tab of the property view.

Illustration 34: Aggregation of data in the group by  operator.
Illustration 34: Aggregation of data in the group by operator.

Using the Key lookup operator

In the opposite, there will be many endpoints that represent the end of the process. On the other hand, errors can also be traced from the Websphere application server log files. At this part of the project, the data warehouse is completed and automatically updated.

Days after days the size of the warehouse increases and the data mining can be a good way for the analyst to find knowledge in the data. We are going to use it to interpret the result of the data mining on the JK-Superstore.

Overview of the flow

Definition of a control flow in Design Studio

A control flow is a graphical representation of the execution order of data flows that must follow a specific sequence. The second level is the control flow and is designed as the data flow in the data warehousing project. To create a control flow in the data warehouse project, we right click on the control flow directory and click on new Control Flow.

A control flow can be in a different project than that of the data flow, but to have a good project management and to make it easy to create the data warehousing application, it is better to use the same project.

Rules and operators from control flows

Rules

Outputs

Definition of available control flow operators

  • Start operator
  • End operator
  • Fail operator
  • Data Flow operator
  • File Wait operator
  • Stored Procedure operator
  • Email operator

And calling these data flows in a control flow is the second level in developing a data storage application. The output of the operator is a little different from the others, but it is quite clear. The number of times the loop should be performed is defined in the operator properties view and can be set in three different ways.

If a new store appears, the store is added to the file and the data warehousing application will also extract the data from the new store (No need to reinstall the application, only the configuration file changes). This operator is used to send a mail that can alert the administrator about the use of branch in a control flow.

Using the control flow to feed the MARTS schema

Process understanding

Designing the data flows

Introduction

Deployment preparation

Working with DWE Administrating console

Prerequisites before deploying the application

Deploying the application

JK Superstore builds the Data warehouse

What is the data mining?

Why using data mining?

The parameter is the name of the view to be created by the function. Table columns not specified in a group column are considered "Items" columns. The upswing is represented by the size of the arrow and the color represents the support.

The goal of the data mining will be to find clusters, to regroup victims into different segments depending on certain characteristics. The Java code is available in the appendices, but here is some explanation about the Java Bean. Data mining is a fantastic tool, but it requires a very good understanding of a company's needs before it can be integrated into their system.

Creating the model that calls the Easy Mining For BASIC data mining sqlStmt = new StringBuffer();.

Illustration 39: First screen of Visualizer
Illustration 39: First screen of Visualizer

Association rules

Easy mining procedures

  • Introduction to Easy Mining
  • FindRules
  • ClusterTable
  • The others functions

A model is stored in the database if the stored procedure created the view and added the model to the model list table. For example, it could be used to share mining results between heterogeneous RDBMS (Oracle and I.B.M. DB2). Syntax: ClusteTable procudure IDMMX.ClusterTable(, , . , -- Minimum cluster size ) -- Maximum cluster size.

Association rules on the Market Basket of JK Superstore

  • Introduction
  • Cleaning the data and Creating a model
  • The visualizer
  • Interpreting the result of the data mining

Our goal is to find relationships between products and more precisely according to the product category, which will be realized through operator associations. We will also need the products table to do the product name mapping. This model will be loaded in the visualizer to get the result, but it will also be used with the rule extractor to populate the view with the rules.

We come directly in front of the panel that makes it possible to connect to a database, get the list of models and open a model. When people buy items from the "Hair Care" category and items from the Miscellaneous (which is a subcategory of Health and beauty), people also buy items from the Personal Hygiene in 90% of the cases.

Clustering on the titanic data set

  • Introduction
  • The origin of the data
  • How to create a model without mining flow
  • Interpretation of the results
    • The results with the geometric algorithm

SqlStmt currently contains: creating a time view, calling the data mining process, and creating a new view containing the data mining results. At the end of my training, I had the opportunity to present my work. The exercises presented at the ARISTOTEL University were proof of the success of the project.

I have worked for a fictional company, but in the future I hope to continue working in this field of data mining and KDD. On the other hand, an Erasmus project is also the best way to improve my English, learn more about foreign culture and also to get out of the 'family cocoon'.

Illustration 43: Smallest Cluster found with the geometric aglorithm
Illustration 43: Smallest Cluster found with the geometric aglorithm

The Java bean code to create the data mining model of titanic data set

Imagem

Illustration 2: DWE Architecture.
Illustration 3: The DWH Schema.
Illustration 4: Example of a star schema
Illustration 5: Different types of dimensional models
+7

Referências

Documentos relacionados

Second-Order Exponential Smoothing • If the data shows a linear trend, a more appropriate model will be Use for: · Data with constant or non-constant trend, and · Data with no