4/6/19

Building a Modern Data Warehouse From Different Sources SQL Server and Spark

Oscar Garcia @ozkary 4/06/2019 data platform , spark , SQL No comments

A data warehouse (DW) is a core component of business intelligence and a central data repository for different sources. This poses many challenges as the schema definition for those sources may be completely different from one another. In this presentation, we discuss how to create a data warehouse model that can support data for disparate data sources. We look at how to build a dimensional model which can enable us to import the data with different shapes into a data warehouse. We then create processes to transform and load the data using Apache Spark. We finally use PowerBI to visualize the data from our data warehouse.

Modern Data Warehouse

Learning Objectives:

Challenges when using disparate data sources (different models)

The number of fields on the payload can be different. Some system can send more data readings/facts than another.
The fields can have a different name but the data represent the same reading.
The model can change independently of the other system.

Define a common model to support the different model shapes

We need to identify the common fields across all the data models that can identify a unique entry. For example, the date, a serial number or device id, a regional location.
Identify the dimension from the actual entry from all these models. The dimensions are used to build other tables that can provide structured information about the readings.
We need to identify the measurement or readings and support their different names.
For performance, we would like to be able to read the entries without the overhead of processing all the measurements.
Star Schema design can be used to associate the dimension tables to the fact table.

Strategy for processing the different models into the common model with Apache Spark

Load the data source into data frames and map their fields into a data set with well defined data types for the fields.
Parse the dimension fields into a different data set. Create a unique hash numeric key that can be used as a foreign key on the the relational data model.
Map all the measurements from their data source name to the corresponding field on the common model.
Create a dimension for the field name to catalog the measurement fields thus avoiding having to load all the data and get unique names.
Transpose all the measurements from columns to rows. This enables us to support different number of columns as well as support field changes on the data source.

Data Pipeline for Loading the Data into the Data Warehouse with Apache Spark

Load the new unique dimension records from the Spark database into the dimension tables in the data warehouse first.
Load the fact entry with all the dimension relationships. Create a unique hash id for the entry.
Load the fact measurements with the entry hash id relationship. This maps the reading to a unique entry.

Sample Project:

The sample project with the Spark and SQL code as well as demo data can be found at this location:

https://github.com/ozkary/modern-data-warehouse-spark/

By following this high level plan, we should be able to create a data warehouse that can support disparate data sources with different models using modern data processing tooling.

Originally published by ozkary.com

Oscar Garcia @ozkary

Oscar Garcia is a Principal Software Architect and VP of product development. He is a Microsoft MVP and certified solutions' developer with many years of experience building software solutions. He specializes in building cloud solutions using technologies like GCP, AWS, Azure, ASP.NET, NodeJS, Angular, React, SharePoint, Microsoft 365, Firebase as well as BI projects for data visualization using tools like Tableau and PowerBI, Spark. You can follow Oscar on Twitter @ozkary or his blog at ozkary.com VP of Product Development 5 consecutive Microsoft Most Valuable Professional (MVP).

Ozkary - Emerging Technologies

I am Oscar Garcia, Ozkary^TM. I author this site, speak at conferences and events, contribute to OSS, mentor people. I use this blog to post ideas and experiences about software development, with the goal to both learn from and help the technology communities around the world.

4/6/19

Building a Modern Data Warehouse From Different Sources SQL Server and Spark

Originally published by ozkary.com

0 comments :

Post a Comment

Books on Amazon

Data Engineering Process Fundamentals: Master the Fundamentals of Data Engineering with a Hands-on Approach

Developer Technologies Award 2021

Ozkary

This Blog is protected by DMCA.com

Popular Posts

Recent Posts

Categories

Featured Post

Launch a Software Product - Product Release Lifecycle

Blog Archive

Followers

I am Oscar Garcia, OzkaryTM. I author this site, speak at conferences and events, contribute to OSS, mentor people. I use this blog to post ideas and experiences about software development, with the goal to both learn from and help the technology communities around the world.

4/6/19

Originally published by ozkary.com

0 comments :

Post a Comment

Books on Amazon

Data Engineering Process Fundamentals: Master the Fundamentals of Data Engineering with a Hands-on Approach

Developer Technologies Award 2021

Ozkary

This Blog is protected by DMCA.com

Popular Posts

Recent Posts

Categories

Featured Post

Launch a Software Product - Product Release Lifecycle

Blog Archive

Followers

I am Oscar Garcia, Ozkary^TM. I author this site, speak at conferences and events, contribute to OSS, mentor people. I use this blog to post ideas and experiences about software development, with the goal to both learn from and help the technology communities around the world.