
11/26/24

Introduction to Data Lakes and Data Warehouses - Data Engineering Process Fundamentals

Overview

In this technical presentation, we will delve into the fundamental concepts of Data Engineering, focusing on two pivotal components of modern data architecture - Data Lakes and Data Warehouses. We will explore their roles, differences, and how they collectively empower organizations to harness the true potential of their data.

Introduction to Data Lake and Data Warehouse - Data Engineering Process Fundamentals

  • Follow this GitHub repo during the presentation: (Star the project to follow and get updates)

👉 GitHub Repo

  • Data engineering Series:

👉 Blog Series

YouTube Video

Video Agenda

  1. Introduction to Data Engineering: Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.

  2. Operational Data: A look at transactional data and the application systems that generate it.

  3. Understanding Data Lakes: Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.

  4. Exploring Data Warehouses: Definition of data warehouses and their role in storing structured, processed, and business-ready data.

  5. Comparing Data Lakes and Data Warehouses: Comparative analysis highlighting their strengths and weaknesses, and a discussion of when to use each based on specific use cases and business needs.

  6. Integration and Data Pipelines: Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline, with a code walkthrough showcasing data movement and transformation between these two crucial components.

  7. Real-world Use Cases: Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success, including a hands-on demonstration using Python, Jupyter Notebook, and SQL to solidify the concepts discussed.

  8. Q&A and Hands-on Session: An interactive Q&A session to address any queries.

Conclusion:

This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Data Lake and Data Warehouse
  • Discovery and Data Analysis
  • Design and Infrastructure Planning
  • Data Lake - Pipeline and Orchestration
  • Data Warehouse - Design and Implementation
  • Analysis and Visualization

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Operational Data

Operational data is often generated by applications and stored in transactional relational databases like SQL Server and Oracle, or NoSQL (JSON) databases like MongoDB and Firebase. This is the data created when an application saves a user transaction, such as contact information, a purchase, or other activities available from the application. A minimal sketch of this transactional pattern follows the lists below.

Features:

  • Application support and transactions
  • Relational (SQL) or document-based (NoSQL) data structures
  • Small queries for case analysis

Not Best For:

  • Reporting systems
  • Large queries
  • Centralized Big Data system

Data Engineering Process Fundamentals - Operational Data
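As a quick illustration, here is a minimal sketch of that transactional pattern in Python, using SQLite to stand in for the application database; the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical OLTP-style table for saving a user purchase transaction.
conn = sqlite3.connect("app.db")
conn.execute("""CREATE TABLE IF NOT EXISTS purchases (
    purchase_id INTEGER PRIMARY KEY,
    contact_email TEXT NOT NULL,
    amount REAL NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)""")

with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO purchases (contact_email, amount) VALUES (?, ?)",
        ("user@example.com", 19.99),
    )

# A small, targeted query for a single case - what OLTP systems do well.
row = conn.execute(
    "SELECT amount, created_at FROM purchases WHERE contact_email = ?",
    ("user@example.com",),
).fetchone()
print(row)
```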

Data Lake - Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the transaction data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

  • Store the data in its raw format without any transformation
  • This can include structured data like CSV files, semi-structured data like JSON and XML documents, or columnar data like Parquet files
  • Low cost for massive storage capacity
  • Not designed for querying or data analysis
  • Its files are typically exposed as external tables by downstream systems (see the ingestion sketch below)

Data Engineering Process Fundamentals - Analytical Data staging
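As one hedged example, this sketch stages a raw CSV file into a cloud storage bucket acting as the data lake, using the Google Cloud Storage client; the bucket name and paths are assumptions for illustration.

```python
from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "ozkary-data-lake"            # assumption: your data lake bucket
SOURCE = "250101.csv"                  # raw CSV extracted from the source system
TARGET = "raw/turnstiles/250101.csv"   # keep the raw format under a clear prefix

# Upload the file as-is: no cleaning, no transformation.
client = storage.Client()
blob = client.bucket(BUCKET).blob(TARGET)
blob.upload_from_filename(SOURCE)
print(f"staged gs://{BUCKET}/{TARGET}")
```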

Data Warehouse - Analytical Data

A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transactional databases, but higher costs than a Data Lake. This system hosts the analytical data that has been processed and is ready for analytical purposes.

Data Warehouse Features:

  • Stores historical data in relational tables with an optimized schema, which enables the data analysis process
  • Provides SQL support to query the data
  • It can integrate external resources like CSV and Parquet files that are stored on Data Lakes as external tables (see the sketch below)
  • The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system
  • Storage is more expensive
  • Offloads archived data to Data Lakes

Data Engineering Process Fundamentals - Analytical Data Store
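To make the external-table idea concrete, here is a sketch that registers Parquet files staged in the data lake as an external table in BigQuery; the dataset, table, and lake path are hypothetical.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Expose Parquet files staged in the lake as an external table,
# without copying the data into warehouse storage.
sql = """
CREATE EXTERNAL TABLE IF NOT EXISTS mta_data.ext_turnstile
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://ozkary-data-lake/raw/turnstiles/*.parquet']
);
"""
client.query(sql).result()  # waits for the DDL job to finish
```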

Discovery - Data Analysis

During the discovery phase of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us understand what we are trying to solve. We also define our analytical approach and make observations about the data, its structure, and its source. This leads us into defining the requirements for the project, so we can set the scope, design, and architecture of the solution. A first-pass analysis sketch follows the list below.

  • Download sample data files
  • Run experiments to make observations
  • Write Python scripts using VS Code or Jupyter Notebooks
  • Transform the data with Pandas
  • Make charts with Plotly
  • Document the requirements

Data Engineering Process Fundamentals - Data Analysis and Discovery
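A minimal first pass might look like the following sketch, assuming a sample CSV from the MTA turnstile use case; the file and column names are assumptions.

```python
import pandas as pd
import plotly.express as px

# Hypothetical sample file downloaded during discovery.
df = pd.read_csv("250101.csv")

# First-pass observations: structure, types, missing values, distributions.
df.info()
print(df.describe())
print(df.isna().sum())

# A quick chart often reveals outliers and trends faster than raw tables.
px.histogram(df, x="entries", title="Entries distribution").show()
```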

Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.

  • Use GitHub for code repo and for CI/CD actions
  • Use Terraform, an Infrastructure as Code (IaC) tool, to manage cloud resources across multiple cloud providers
  • Use Docker containers to run the code and manage its dependencies

Data Engineering Process Fundamentals - Design and Planning

Data Lake - Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources.

  • This can be code-centric, leveraging languages like Python
  • Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
  • Monitoring services enable us to track telemetry data
  • Docker Hub, GitHub can be used for the CI/CD process

Data Engineering Process Fundamentals - Data Lake - Data Pipeline and Orchestration
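As a code-centric illustration, this sketch models the pipeline as plain Python task functions that an orchestrator such as Prefect or Airflow could schedule, retry, and monitor; the URL and file names are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract(url: str) -> str:
    """Task 1: download the source file (stubbed for illustration)."""
    log.info("downloading %s", url)
    return "250101.csv"  # hypothetical local file produced by the download

def load_to_lake(path: str) -> None:
    """Task 2: stage the raw file in the data lake (stubbed)."""
    log.info("uploading %s to the data lake", path)
    # e.g., a storage client upload call goes here

def run(url: str) -> None:
    """The workflow: tasks chained in order, ready for scheduling."""
    path = extract(url)
    load_to_lake(path)
    log.info("pipeline complete")

if __name__ == "__main__":
    run("https://example.com/turnstile/250101.csv")  # hypothetical source URL
```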

Data Warehouse - Design and Implementation

In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse’s implementation and operations. In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. The goal is a repeatable and extendable process.

Data Engineering Process Fundamentals - Data Warehouse Design and Implementation

Data Warehouse - Data Analysis

Data analysis is the practice of exploring data and understanding its meaning. It involves activities that help us achieve a specific goal, such as identifying data dimensions and measures, as well as spotting outliers, trends, and distributions.

  • We can accomplish these activities by writing code with Python, Pandas, and SQL in Visual Studio Code or Jupyter Notebooks.
  • What's more, we can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.

Data Engineering Process Fundamentals - Data Analysis

Data Analysis and Visualization

Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance.

  • Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface that can help us tell a story
  • Use tools like Power BI, Looker, and Tableau to model the data and create enterprise-level visualizations

Data Engineering Process Fundamentals - Data Visualization

Conclusion

Both data lakes and data warehouses are essential components of a data engineering project. The primary function of a data lake is to store large amounts of operational data in its raw format, serving as a staging area for analytical processes. In contrast, a data warehouse acts as a centralized repository for information, enabling engineers to transform, process, and store extensive data. This allows the analytical team to utilize coding languages like Python and tools such as Jupyter Notebooks, as well as low-code platforms like Looker Studio and Power BI, to create enterprise-quality dashboards for the organization.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to support your learning.

Data Engineering Process Fundamentals Book Amazon

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

8/21/24

Medallion Architecture: A Blueprint for Data Insights and Governance - Data Engineering Process Fundamentals

Overview

Gain an understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Data Engineering Process Fundamentals - Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

Data Engineering Process Fundamentals - Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

Data Engineering Process Fundamentals - Medallion Architecture Raw Zone Diagram
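One possible convention for landing files in the Raw Zone is a date-partitioned path, sketched below with the local filesystem standing in for the data lake; the source name, file, and layout are hypothetical.

```python
from datetime import date
from pathlib import Path
import shutil

# Land files as-is under a date-partitioned prefix, so retention policies
# and reprocessing can target specific date ranges.
def raw_zone_path(source: str, file_name: str, day: date) -> Path:
    return Path("raw") / source / f"{day:%Y/%m/%d}" / file_name

target = raw_zone_path("turnstiles", "250101.csv", date(2025, 1, 1))
target.parent.mkdir(parents=True, exist_ok=True)
shutil.copy("250101.csv", target)  # ingest as-is: no cleaning, no transforms
print(target)  # raw/turnstiles/2025/01/01/250101.csv
```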

Use case Background

The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates that track each person as they enter (departure) or exit (arrival) the station.

  • The MTA subway system has stations around the city.
  • All the stations are equipped with turnstiles or gates that track each person as they enter or leave the station.
  • CSV files provide information about the number of commuters per station at different time slots.

Data Engineering Process Fundamentals - Data streaming MTA Gates

Problem Statement

In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. There are millions of people that use this system every day; therefore, businesses around the subway stations would like to be able to use Geofencing advertisement to target those commuters or possible consumers and attract them to their business locations at peak hours of the day.

  • Geofencing is a location-based technology service in which a mobile device’s electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) around a geographical location.
  • Businesses around those locations would like to use this technology to increase their sales by pushing ads to potential customers at specific times.

ozkary-data-engineering-mta-geo-fence

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery

Data Engineering Process Fundamentals - Medallion Architecture Bronze Zone Diagram
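A minimal Bronze Zone pass might look like this pandas sketch, assuming the MTA turnstile CSV layout; the paths and column names are assumptions.

```python
import pandas as pd

# Bronze Zone sketch: basic cleansing and standardization of a raw CSV.
df = pd.read_csv("raw/turnstiles/2025/01/01/250101.csv")

df.columns = [c.strip().lower() for c in df.columns]      # naming conventions
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # type casting
df = df.dropna(subset=["date"])                           # quality check: drop unparseable dates
df["entries"] = df["entries"].fillna(0).astype("int64")   # null handling
df = df.drop_duplicates()

# Structure the cleansed data as a table for the next zone.
df.to_parquet("bronze/turnstiles/250101.parquet", index=False)
```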

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs

Data Engineering Process Fundamentals - Medallion Architecture Silver Zone Diagram
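Building on the bronze output, a Silver Zone step could enrich and physically optimize the data, as in this hedged sketch; paths and columns are assumptions.

```python
import pandas as pd

# Silver Zone sketch: enrich the bronze data and optimize its layout.
df = pd.read_parquet("bronze/turnstiles/250101.parquet")

# Add analysis-friendly attributes for common query patterns.
df["day"] = df["date"].dt.date.astype(str)
df["hour"] = df["date"].dt.hour

# Partitioned Parquet lets query engines prune files by day at query time.
df.to_parquet("silver/turnstiles", partition_cols=["day"], index=False)
```

Because the files are physically grouped by day, queries that filter on the partition column scan only the matching folders, which is where the performance and cost gains come from.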

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency

Data Engineering Process Fundamentals - Medallion Architecture Gold Zone Diagram
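As a sketch of that final step, the curated dataset could be a materialized business summary like the one below; the source path and column names are hypothetical.

```python
import pandas as pd

# Gold Zone sketch: a curated, business-ready summary for consumption.
df = pd.read_parquet("silver/turnstiles")

gold = (
    df.groupby(["station", "day"], as_index=False)
      .agg(total_entries=("entries", "sum"), total_exits=("exits", "sum"))
)

# Materialize the curated dataset for BI tools and self-service analytics.
gold.to_parquet("gold/station_daily_traffic.parquet", index=False)
```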

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

Data Engineering Process Fundamentals - Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

Data Engineering Process Fundamentals - Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

Data Engineering Process Fundamentals - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.
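A minimal incremental-load sketch, assuming a BigQuery warehouse with hypothetical table and column names, finds the high-water mark and appends only the newer rows:

```python
import pandas as pd
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Find the high-water mark in the warehouse, then append only the rows
# created since the last batch run.
last = client.query(
    "SELECT MAX(created_dt) AS last_dt FROM mta_data.fact_turnstile"
).to_dataframe()["last_dt"].iloc[0]

df = pd.read_parquet("silver/turnstiles")
new_rows = df if pd.isna(last) else df[df["date"] > last]

# The default load disposition appends, which fits incremental batches.
client.load_table_from_dataframe(new_rows, "mta_data.fact_turnstile").result()
print(f"loaded {len(new_rows)} new rows")
```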

Data Governance: Metadata

Metadata assigns the owner, steward, and responsibilities of the data.

Data Engineering Process Fundamentals - Medallion Architecture Governance Metadata
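As a simple illustration, that governance metadata could be captured as a record like the sketch below; in practice it lives in a data catalog, and all field values here are hypothetical.

```python
# Dataset-level governance metadata for a curated table (hypothetical values).
dataset_metadata = {
    "name": "gold.station_daily_traffic",
    "owner": "data-engineering-team",            # accountable for accuracy and usage
    "steward": "analytics-steward@example.com",  # day-to-day quality and compliance
    "classification": "internal",                # security / access level
    "lineage": [
        "raw/turnstiles",     # source files landed as-is
        "bronze/turnstiles",  # cleansed and standardized
        "silver/turnstiles",  # optimized and enriched
    ],
    "retention_days": 730,
    "compliance": ["GDPR"],
}
```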

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost efficiency

Data Engineering Process Fundamentals - Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

5/4/24

Streamlining Data Flow: Building Cloud-Based Data Pipelines - Data Engineering Process Fundamentals

Overview

Delve into the world of cloud-based data pipelines, the backbone of efficient data movement within your organization. As a continuation of our Data Engineering Process Fundamentals series, this session equips you with the knowledge to build robust and scalable data pipelines leveraging the power of the cloud. Throughout this presentation, we'll explore the benefits of cloud-based solutions, delve into key design considerations, and unpack the process of building and optimizing your very own data pipeline in the cloud.

Data Engineering Process Fundamentals - Data Warehouse Design

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

About this event

This session guides you through the essential stages of building a cloud-based data pipeline:

Agenda:

Discovery: We'll embark on a journey of discovery, identifying data sources, understanding business needs, and defining the scope of your data pipeline.

Design and Planning: Here, we'll transform insights into a well-defined blueprint. We'll discuss architecture considerations, data flow optimization, and technology selection for your cloud pipeline.

Data Pipeline and Orchestration: Get ready to orchestrate the magic! This stage delves into building the pipeline itself, selecting the right tools, and ensuring seamless data movement between stages.

Data Modeling and Data Warehouse: Data needs a proper home! We'll explore data modeling techniques and the construction of a robust data warehouse in the cloud, optimized for efficient analysis.

Data Analysis and Visualization: Finally, we'll unlock the power of your data. Learn how to connect your cloud pipeline to tools for insightful analysis and compelling data visualizations.

Why Watch:

Process Power: Learn a structured, process-oriented approach to building and managing efficient cloud data pipelines.

Data to Insights: Discover how to unlock valuable information from your data using Python for data analysis.

The Art of Visualization: Master the art of presenting your data insights through compelling data visualizations.

Future-Proof Your Skills: Gain in-demand cloud data engineering expertise, including data analysis and visualization techniques.

This session equips you with the knowledge and practical skills to build data pipelines, a crucial skill for data-driven organizations. You'll not only learn the "how" but also the "why" behind each step, empowering you to confidently design, implement, and analyze data pipelines that drive results.

Video Chapters:

0:00:00 Welcome to Data Engineering Process Fundamentals
0:02:19 Phase 1: Discovery
0:19:30 Phase 2: Design and Planning
0:33:30 Phase 3: Data Pipeline and Orchestration
0:49:00 Phase 4: Data Modeling and Data Warehouse
0:59:00 Phase 5: Data Analysis and Visualization
1:01:00 Final Thoughts

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Data Engineering Process Fundamentals - Operational Data

Process Phases:

  • Discovery
  • Design and Planning
  • Data Pipeline and Orchestration
  • Data Modeling and Data Warehouse
  • Data Analysis and Visualization

Follow this project: Star/Follow the project

👉 Data Engineering Process Fundamentals

Phase 1: Discovery Process

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

Activities include:

  • Background & problem statement: Clearly document and understand the challenges the project aims to address.
  • Exploratory Data Analysis (EDA): Make observations about the data, its structure, and sources.
  • Define Project Requirements based on the observations, enabling the team to understand the scope and goals.
  • Scope of Work: Clearly outline the scope, ensuring a focused and well-defined set of objectives.
  • Set the Stage by selecting tools and technologies that are needed.
  • Design and Architecture: Develop a robust design and project architecture that aligns with the defined requirements and scope.

Data Engineering Process Fundamentals - Phase 1: Discovery

Phase 2: Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.

Foundational Areas

  • Designing the data pipeline and technology specifications like flows, coding language, data governance and tools
  • Define the system architecture with cloud services for scalability, such as data lakes, data warehouses, and orchestration.
  • Source control and deployment automation with CI/CD
  • Using Docker containers for environment isolation to avoid deployment issues
  • Infrastructure automation with Terraform or cloud CLI tools
  • System monitoring, notification, and recovery to support operations

Data Engineering Process Fundamentals - Phase 2: Design and Planning

Phase 3: Data Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake.

Data Engineering Process Fundamentals - Phase 3: Data Pipeline and Orchestration

Process:

  • Get Data In: Ingest data from various sources (databases, APIs, files). Decide to get it all at once (batch) or continuously (streaming).
  • Clean & Format Data: Ensure data quality and consistency. Get it ready for analysis in the right format.
  • Code or No-Code: Use code (Python, SQL) or pre-built solutions.
  • Run The Pipeline: Schedule tasks and run the pipeline. Track its performance to find issues.
  • Store Data in the Cloud: Use data lakes (staging) for raw data and data warehouses for structured, easy-to-analyze data.
  • Deploy Easily: Use containers (Docker) to deploy the pipeline anywhere.
  • Monitor & Maintain: Track how the pipeline runs, fix problems, and keep it working smoothly.
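To ground the first steps of that process, here is a small batch-ingestion sketch that downloads a source file and stages it in the lake's raw area; the URL and paths are hypothetical.

```python
from pathlib import Path
import requests  # pip install requests

# Hypothetical source URL and raw-zone path for a batch ingestion step.
URL = "https://example.com/turnstile/250111.csv"
RAW = Path("raw/turnstiles/250111.csv")

# Get data in (batch): download the file exactly as published.
resp = requests.get(URL, timeout=30)
resp.raise_for_status()

# Stage it in the raw area of the lake without any transformation.
RAW.parent.mkdir(parents=True, exist_ok=True)
RAW.write_bytes(resp.content)
print(f"staged {RAW} ({len(resp.content)} bytes)")
```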

Phase 4: Data Modeling and Data Warehouse

Data Engineering Process Fundamentals - Phase 4: Data Modeling and Data Warehouse

Data Lake - Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

  • Store the data in its raw format without any transformation
  • This can include structured data like CSV files, semi-structured data like JSON and XML documents, or columnar data like Parquet files
  • Low cost for massive storage capacity
  • Not designed for querying or data analysis
  • Its files are typically used as external tables by a data warehouse system

Data Engineering Process Fundamentals - Phase 4: Data Lake - Analytical Data Staging

Data Warehouse - Staging to Analytical Data

A Data Warehouse, an Online Analytical Processing (OLAP) system, is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transactional databases, but higher costs than a Data Lake.

Features:

  • Stores historical data in relational tables with an optimized schema, which enables the data analysis & visualization process
  • Provides SQL support to query and transform the data
  • Integrates external resources on Data Lakes as external tables
  • The system is designed to host and serve Big Data scenarios.
  • Storage is more expensive
  • Offloads archived data to Data Lakes

Data Engineering Process Fundamentals - Phase 4: Data Warehouse - Staging to Analytical Data

Phase 5: Data Analysis and Visualization

Data Engineering Process Fundamentals - Phase 5: Data Analysis and Visualization

How Do We Gather Insights From Data?

We leverage the principles of data analysis and visualization. Data analysis reveals patterns and trends, while visualization translates these insights into clear charts and graphs. It's the approach to turning raw data into actionable insights for smarter decision-making.

Let’s Explore More About:

  • Data Analysis
    • Python and Jupyter Notebook
  • Data Visualization
    • Chart Types and Design Principles
    • Code-centric with Python Graphs
    • Low-code with tools like Looker, PowerBI, Tableau

Data Analysis - Exploring Data

Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as the process to identify outliers, trends, and distributions.

Methods:

  • We can accomplish these activities by writing code with Python, Pandas, and SQL in Jupyter Notebooks.
  • We can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.
  • The use of low-code tools also aids in the Exploratory Data Analysis (EDA) process by modeling data and using code snippets

Data Engineering Process Fundamentals - Phase 5: Data Analysis and Visualization Code

Data Visualization - Unlock Insights

Data visualization is a practice that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance with the use of charts, controls and colors.

Data Engineering Process Fundamentals - Phase 5: Data Analysis and Visualization Dashboard

Visualization Solutions:

  • A code-centric solution involves writing programs with a language like Python or JavaScript to manage the data analysis and create the visuals

  • A low-code solution uses cloud-hosted tools like Looker, PowerBI and Tableau to accelerate the data analysis and visualization by using a design approach

Summary

Throughout this session, we've explored the key stages of building a powerful cloud-based data pipeline. From identifying data sources and understanding business needs (Discovery) to designing an optimized architecture (Design & Planning), building the pipeline itself (Data Pipeline & Orchestration), and finally constructing a robust data warehouse for analysis (Data Modeling & Data Warehouse), we've equipped you with the knowledge to streamline your data flow.

By connecting your cloud pipeline to data analysis and visualization tools, you'll unlock the true power of your data, enabling you to translate insights into clear, actionable information.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to support your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

4/3/24

Architecting Insights: Data Modeling and Analytical Foundations - Data Engineering Process Fundamentals

Overview

A Data Warehouse is an OLAP system, which serves as the central data repository for historical and aggregated data. A data warehouse is designed to support complex analytical queries, reporting, and data analysis for Big Data use cases. It typically adopts a denormalized entity structure, such as a star schema or snowflake schema, to facilitate efficient querying and aggregations. Data from various OLTP sources is extracted, loaded and transformed (ELT) into the data warehouse to enable analytics and business intelligence. The data warehouse acts as a single source of truth for business users to obtain insights from historical data.

In this technical presentation, we embark on the next chapter of our data journey, delving into data modeling and building our data warehouse.

Data Engineering Process Fundamentals - Data Warehouse Design

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

Building on our previous exploration of data pipelines and orchestration, we now delve into the pivotal phase of data modeling and analytics. In this continuation of our data engineering process series, we focus on architecting insights by designing and implementing data warehouses, constructing logical and physical models, and optimizing tables for efficient analysis. Let's uncover the foundational principles driving effective data modeling and analytics.

Agenda:

  • Operational Data Concepts:

    • Explanation of operational data and its characteristics.
    • Discussion on data storage options, including relational databases and NoSQL databases.
  • Data Lake for Data Staging:

    • Introduction to the concept of a data lake as a central repository for raw, unstructured, and semi-structured data.
    • Explanation of data staging within a data lake for ingesting, storing, and preparing data for downstream processing.
    • Discussion on the advantages of using a data lake for data staging, such as scalability and flexibility.
  • Data Warehouse for Analytical Data:

    • Overview of the role of a data warehouse in storing and organizing structured data for analytics and reporting purposes.
    • Discussion on the benefits of using a data warehouse for analytical queries and business intelligence.
  • Data Warehouse Design and Implementation:

    • Introduction to data warehouse design principles and methodologies.
    • Explanation of logical models for designing a data warehouse schema, including conceptual and dimensional modeling.
  • Star Schema:

    • Explanation of the star schema design pattern for organizing data in a data warehouse.
    • Discussion on fact tables, dimension tables, and their relationships within a star schema.
    • Explanation of the advantages of using a star schema for analytical querying and reporting.
  • Logical Models:

    • Discussion on logical models in data warehouse design.
    • Explanation of conceptual modeling and entity-relationship diagrams (ERDs).
  • Physical Models - Table Construction:

    • Discussion on constructing tables from the logical model, including entity mapping and data normalization.
    • Explanation of primary and foreign key relationships and their implementation in physical tables.
  • Table Optimization Index and Partitions:

    • Introduction to table optimization techniques for improving query performance.
    • Explanation of index creation and usage for speeding up data retrieval.
    • Discussion on partitioning strategies for managing large datasets and enhancing query efficiency.
  • Incremental Strategy:

    • Introduction to incremental loading techniques for efficiently updating data warehouses.
    • Explanation of delta processing.
    • Discussion on the benefits of incremental loading in reducing processing time and resource usage.
  • Orchestration and Operations:

    • Tools and frameworks for orchestrating data pipelines, such as dbt.
    • Discussion on the importance of orchestration and monitoring the data processing tasks.
    • Policies to archive data in blob storage.

Why join this session?

  • Learn analytical data modeling essentials.
  • Explore schema design patterns like star and snowflake.
  • Optimize large dataset management and query efficiency.
  • Understand logical and physical modeling strategies.
  • Gain practical insights and best practices.
  • Engage in discussions with experts.
  • Advance your data engineering skills.
  • Architect insights for data-driven decisions.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Data Engineering Process Fundamentals - Operational Data

Topics

  • Operational Data
  • Data Lake
  • Data Warehouse
  • Schema and Data Modeling
  • Data Strategy and Optimization
  • Orchestration and Operations

Follow this project: Star/Follow the project

👉 Data Engineering Process Fundamentals

Operational Data

Operational data (OLTP) is often generated by applications and stored in transactional relational databases like SQL Server and Oracle, or NoSQL (JSON) databases like CosmosDB and Firebase. This is the data created when an application saves a user transaction, such as contact information, a purchase, or other activities available from the application.

Features

  • Application support and transactions
  • Relational data structure and SQL or document structure NoSQL
  • Small queries for case analysis

Not Best For:

  • Reporting and analytical systems (OLAP)
  • Large queries
  • Centralized Big Data system

Data Engineering Process Fundamentals - Operational Data

Data Lake - From Ops to Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the transaction data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

  • Store the data in its raw format without any transformation
  • This can include structured data like CSV files, semi-structured data like JSON and XML documents, or columnar data like Parquet files
  • Low cost for massive storage capacity
  • Not designed for querying or data analysis
  • Its files are typically used as external tables by most systems

Data Engineering Process Fundamentals - Data Lake for Staging the data

Data Warehouse - Staging to Analytical Data

A Data Warehouse, an OLAP system, is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transactional databases, but higher costs than a Data Lake.

Features:

  • Stores historical data in relational tables with an optimized schema, which enables the data analysis process
  • Provides SQL support to query and transform the data
  • Integrates external resources on Data Lakes as external tables
  • The system is designed to host and serve Big Data scenarios.
  • Storage is more expensive
  • Offloads archived data to Data Lakes

Data Engineering Process Fundamentals - Data Warehouse Analytical Data

Data Warehouse - Design and Implementation

In the design phase, we lay the groundwork by defining the database system, schema model, logical data models, and technology stack (SQL, Python, frameworks and tools) required to support the data warehouse’s implementation and operations.

In the implementation phase, we focus on converting logical data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis.

Data Engineering Process Fundamentals - Data Warehouse Design

Design - Schema Modeling

The Star and Snowflake Schemas are two common data warehouse modeling techniques. The Star Schema consists of a central fact table connected to multiple dimension tables via foreign key relationships. The Snowflake Schema is a variation of the Star Schema in which dimension tables are further divided into multiple related tables. A DDL sketch of a star schema follows the list below.

What to use:

  • Use the Star Schema when query performance is a primary concern, and data model simplicity is essential

  • Use the Snowflake Schema when storage optimization is crucial, and the data model involves high-cardinality dimension attributes with potential data redundancy

Data Engineering Process Fundamentals - Data Warehouse Schema Model
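As a sketch of the star schema pattern, the following DDL creates one dimension and one fact table for a hypothetical MTA turnstile model, executed through the BigQuery client; all names and types are assumptions.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# A central fact table joined to a dimension table through a surrogate key.
sql = """
CREATE TABLE IF NOT EXISTS mta_data.dim_station (
  station_id INT64 NOT NULL,
  station_name STRING NOT NULL
);

CREATE TABLE IF NOT EXISTS mta_data.fact_turnstile (
  log_id INT64 NOT NULL,
  station_id INT64 NOT NULL,   -- foreign key to dim_station
  created_dt TIMESTAMP NOT NULL,
  entries INT64,
  exits INT64
);
"""
client.query(sql).result()
```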

Data Modeling

Data modeling lays the foundation for a data warehouse. It starts with modeling raw data into a logical model outlining the data and its relationships, with a focus based on data requirements. This model is then translated, using DDL, into the specific views, tables, columns (data types), and keys that make up the physical model of the data warehouse, with a focus on technical requirements.

Data Engineering Process Fundamentals - Data Warehouse Data Model

Data Optimization to Deliver Performance

To achieve faster queries, improve performance, and reduce resource cost, we need to efficiently organize our data. Two key techniques for accomplishing this are data partitioning and data clustering, illustrated in the sketch after the list below.

  • Data Partitioning: Imagine dividing your data table into smaller, self-contained segments based on a specific column (e.g., date). This allows the DW to quickly locate and retrieve only the relevant data for your queries, significantly reducing scan times.

  • Data Clustering: Allows us to organize the data within each partition based on another column (e.g., Station). This groups frequently accessed data together physically, leading to faster query execution, especially for aggregations or filtering based on the clustered column.

Data Engineering Process Fundamentals - Data Warehouse DDL Script
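Here is a hedged BigQuery example of both techniques applied to the hypothetical fact table above: partition by the date column and cluster by station.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition the fact table by day and group rows physically by station,
# so date filters prune partitions and station lookups read fewer blocks.
sql = """
CREATE TABLE IF NOT EXISTS mta_data.fact_turnstile_optimized
PARTITION BY DATE(created_dt)
CLUSTER BY station_id AS
SELECT * FROM mta_data.fact_turnstile;
"""
client.query(sql).result()
```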

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

Data Engineering Process Fundamentals - Data Warehouse Data Lineage

  • Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Orchestration and Operations

Effective orchestration and operations are key to a reliable and efficient data project. They streamline data pipelines, ensure data quality, and minimize human intervention. This translates to faster development cycles, reduced errors, and improved overall data management.

  • Version Control and CI/CD with GitHub: Enables development, automated testing, and seamless deployment of data pipelines.

  • Documentation: Maintain clear and comprehensive documentation covering data pipelines, data quality checks, scheduling, and data archiving policies.

  • Scheduling and Automation: Automates repetitive tasks, such as data ingestion, transformation, and archiving processes.

  • Monitoring and Notification: Provides real-time insights into pipeline health, data quality, and archiving success

Data Engineering Process Fundamentals - Data Warehouse Data Lineage

Summary

Before we can move data into a data warehouse system, we explore two pivotal phases for our data warehouse solution: design and implementation. In the design phase, we lay the groundwork by defining the database system, schema and data model, and technology stack required to support the data warehouse’s implementation and operations. This stage ensures a solid infrastructure for data storage and management.

In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis.

Thanks for reading.

Send questions or comments on Twitter @ozkary

👍 Originally published by ozkary.com