Ecommerce Retail
Data Warehouse
Development

            • Applied ETL process for building Data Warehouse based on AdventureWorks 2019 Database

            • Developed SSIS packages for initial loading and incremental loading on sales and inventory fact tables

            • Generated an OLAP cube using SSAS and realized online sales analysis based on AdventureWorks 2019 Data Warehouse

Background

AdventureWorks 2019 is an OLTP (Online Transaction Processing) sample database originally published by Microsoft, which stores the data of an ecommerce retail company from multiple perspectives, such as sales, production and related people. However, it is difficult for analysts of the company to effectively get insights and create business values, because it costs much time to integrate information from multiple tables in a database. Thus, to enhance the data quality and consistency, improve the decision-making process, and generate a higher return of interest based on business intelligence, it is necessary for us to maintain a data warehouse for OLAP (Online Analytical Processing).

Keywords: Data Warehouse, ETL (Extract, Transform, Load) Process, OLAP, SSIS, SSAS

Introduction

In this project, our goals include: build a data warehouse (dimensional model) of the AdventureWorks 2019 database – consolidate two fact tables FactInternetSales and FactProductInventory to present the data of internet sales and inventory in a clear manner; design and construct data pipeline via SSIS to automatically populate or backup data into our dimension and fact tables; feed data into an OLAP model to improve efficiency of multi-dimensional analytical queries; integrate with BI tools to support underlying data reports refreshing and enable business intelligence.

[data source link] https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms

Workflow
Data Flow
  • Figure below shows the data flow of the dimensional model, which flows from source (left) to destination (right).
  • There are two levels of data layer – staging area and destination (dim and fact) layer.

Staging Area:
Staging area is the replica of the source tables in AdventureWorks 2019 database. Each source table will have a staging table in the data warehouse accordingly. For daily processing, we first load the current day’s data of each table into the corresponding staging table. The staging table will work as the source of the ETL procedures to populate dim tables and fact tables on a daily basis.

Destination Layer:
1. Fact Tables

FactInternetSales – the only source of truth of sales data, containing data of order details and related customer and salesperson.

● Granularity: order-product level (SalesOrderID + SalesOrderDetailID)

There could be multiple products included in one single order. This table serves data at an order-product level, which means each record in this table identifies a unique order-product pair, providing information about the specific product in the order.

● Daily processing: based on OrderDateKey

FactProductInventory – a daily snapshot table of products inventory data indicating the current inventory, daily in, daily out of products.

● Granularity: product + DailyDate

The primary key of this table is the combination of ProductKey and DailyDate. Each record identifies the inventory of a certain product on a certain day. For each product, its inventory data will be recorded from the day it started to sell to the last day of the dataset.

● Daily processing: based on SellStartDate

2. Dimension Tables
DimDate table is a table of generated time data, containing every single day in the range of time that we care. We generate the time data in different intervals (daily, weekly, monthly, quarterly, yearly), join with FactInternetSales and FactProductInventory to query or aggregate sales and inventory data in a certain time period.

DimSalesPerson table stores data related to salesperson, in the perspective of demographic, employment and sales achievement. These attributes will provide salesperson’s information in the FactInternetSales table. Attributes stored in the [Dim.SalesPerson] table will be acquired by extracting data from joined tables including [HumanResources.Employee], [Person.Person] and [Sales.SalesPerson] with the same BusinessEntityID.

DimCustomer table contains customer information including customer id, name, suffix, email address, customer address and phone number. These attributes will be used to keep track of customer purchase behavior. Customer’s information of title, name, email address and phone number in [Dim.Customer] table will be collected by joining tables including [Sales.Customer], [Person.Person], [Person.AddressType], [Person.EmailAddress], [Person.SateProvince] and [Person.PersonPhone] using the relationship established by the BusinessEntityID, AddressID, AddressTypeID and StateProvinceID. This table could exist duplicate against CustomerKey. The reason behind is one customer could have multiple addresses which are in different address types (Shipping, Home, etc.). Thus, the primary key of this table is the combination of CustomerKey and AddressType. There exists SCD (Slow Changing Dimensions) in this dimension table, because “phone number” will be overwritten when an update happens. Meanwhile, the PhoneUpdateDate will also be updated to the current timestamp.

DimProduct table contains product id, product name, product number, start date, end date, standard cost, list price, product description, safety stock level, order quantity, and large photo of the product. These attributes will be used to maintain the information of product inventory. [Dim.Product] table consists of product information including product name, identify number, measure code, etc will be gathered by joining tables involving [Production.Product], [Production.ProductCostHistory], [Production.ProductListPriceHistory], [Production.ProductInventory], [Production.WordOrder], [Production.ProductDescription] and [Production.ProductLargePhoto] using same ProductID. There will be duplicate product keys since we may have the same product with multiple updated records. Therefore, a combination of ProductKey and HistoryCostID will be used as the primary key. There exists SCD (Slow Changing Dimensions) in this dimension table. The DimProduct table retains the full history of product cost values. When the value of ProductUnitCost changes, the current record is closed.

Design of DimDate
  1. Find the date range in the Database AdventureWorks 2019
  2. Create a DimDate table which contains every consecutive day in the date range and let the date column be the primary key of DimDate table
  3. Generate weekly date (first day of each week), monthly date (first day of each month), quarterly date (first day of each quarter), and year columns according to the date column, making it easy to generate weekly, monthly, quarterly or yearly reports later on.
ETL Process
Initial Loading: load data before 2014
Initial Loading Workflow

With an initial load, we retrieve the sales data with OrderDate before “2014-01-01” and load into FactInternetSales. Customer table does not have an appropriate timestamp for identifying data prior to 2014. To mimic the state of DimCustomer at “2013-12-31”, we obtain all the related CustomerKey from FactInternetSales, and pull data only for those records to populate the DimCustomer table.

In the initial load for DimSalesPerson, we populate for the salespersons hired before “2014-01-01”. For the initial load of DimProduct, we populate the products that started selling before “2014-01-01”. Since DimDate is a list of dates that we want to backfill the data, we would not make changes to it after initialization. It contains dates from “2008-04-30” to “2014-08-12”.

As for the initial load of FactProductInvenotry, we consolidate the UnitIn and UnitOut data for each product between its SellStartDate and “2013-12-31”. In the case that the product ended selling before “2013-12-31”, we will only include inventory data to its SellEndDate.

Initial Load of FactInternetSales
Initial Load of FactInventory

After the initial loading, we will be able to keep track of the records of DimCustomer which happened before “2014-01-01”. To be specific, we have access to the information of customers who placed orders by querying the date when the event occurred. The data of salesperson hired before 2014 and orders placed before 2014 are also available. We can also track the information of products stated selling before “2014-01-01” in DimProduct and get their corresponding inventory.

Incremental Loading: daily processing after 2014
Incremental Loading Workflow

For efficiency consideration, we plan to load data based on month. We performed 8 batches to load all data from “2014-01-01” to “2014-08-12”.

1) Load current batch’s Source Data into Staging Area

To achieve the incremental loads, we need to find one timestamp for each source table to distinguish data from different dates, in other words, to separate data into different batches. For some transactional source tables, such as SalesOrderHeader, we can easily find a column (OrderDate in this case) using for separating batches. For every day’s processing, we will retrieve the current day’s orders from the source table and load them into the staging area. However, there still exists some tables (e.g. Customer) that we cannot find any useful timestamp, since they have maintained the latest data while not persisting historical data. In this case, we will load the whole table on that day into the staging area and use it to update the downstream tables.

2) Populate Destination tables

3) Clear Staging tables

After finishing the processing of one batch, we need to clear all of the staging tables before the next round of processing.

SCD (Slow Changing Dimensions) Design

1. Customer phone number – carry an UpdateDate
In the customer dimension table [DimCustomer], the new phone number overwrites the existing data. This eases dimension updates and limits the change of the dimension table to only new records. The date of updating phone numbers will be also stored in the table to keep track of the changes. The historical value of the phone number is lost as it is not storing anywhere. Therefore, the dimension table will always contain only the current value since it is pointless to contact a customer with an obsolete phone number. An example is shown in the figure below.

2. Product unit cost – add new rows in DimProduct table to store all history changes

In the product dimension table [DimProuduct], the new unit cost will be added to the table within a new row. The table retains the full history of product unit cost values: the current record is closed then the value of ProductUnitCost changes. An example is shown in the figure below: the CostStartDate of the second row is equal to the CostEndDate of the previous row. The null CostEndDate in row three indicated the current record version. The HistoryCostID is a surrogate key to show different version numbers. Since we regard the fluctuations in the unit cost as a significant indicator, it is necessary to keep all the records in a historical table in case of tracing back.