<h1 id="coupling-data-flows">Coupling Data Flows: Data Pipelines and Orchestration - Data Engineering Process Fundamentals</h1>
<h1 id="overview">Overview</h1>
<p>A data pipeline refers to a series of connected tasks that handle the extract, transform, and load (ETL) as well as the extract, load, and transform (ELT) operations, moving data from a source to a target storage system such as a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>In this technical presentation, we embark on the next chapter of our data journey, delving into building a pipeline with orchestration for ongoing development and operational support.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-data-pipeline-orchestration.png" alt="Data Engineering Process Fundamentals - Data Pipelines" title="Data Engineering Process Fundamentals - Data Pipelines"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/5ZoK9oKXWMI?si=JlS41yDrx7mddRZt" title="Data Engineering Process Fundamentals - Data Pipeline" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ul>
<li><p><strong>Understanding Data Pipelines:</strong></p>
<ul>
<li>Delve into the concept of data pipelines and their significance in modern data engineering.</li>
</ul>
</li>
<li><p><strong>Implementation Options:</strong></p>
<ul>
<li>Explore different approaches to implementing data pipelines, including code-centric and low-code tools.</li>
</ul>
</li>
<li><p><strong>Pipeline Orchestration:</strong></p>
<ul>
<li>Learn about the role of orchestration in managing complex data workflows and the tools available, such as Apache Airflow, Apache Spark, Prefect, and Azure Data Factory.</li>
</ul>
</li>
<li><p><strong>Cloud Resources:</strong></p>
<ul>
<li>Identify the necessary cloud resources for staging environments and data lakes to support efficient data pipeline deployment.</li>
</ul>
</li>
<li><p><strong>Implementing Flows:</strong></p>
<ul>
<li>Examine the process of building data pipelines, including defining tasks, components, and logging mechanisms.</li>
</ul>
</li>
<li><p><strong>Deployment with Docker:</strong></p>
<ul>
<li>Discover how Docker containers can be used to isolate data pipeline environments and streamline deployment processes.</li>
</ul>
</li>
<li><p><strong>Monitor and Operations:</strong></p>
<ul>
<li>Manage operational concerns related to data pipeline performance, reliability, and scalability.</li>
</ul>
</li>
</ul>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Gain practical insights into building and managing data pipelines.</p>
</li>
<li><p>Learn coding techniques with Python for efficient data pipeline development.</p>
</li>
<li><p>Discover the benefits of Docker deployments for data pipeline management.</p>
</li>
<li><p>Understand the significance of data orchestration in the data engineering process.</p>
</li>
<li><p>Connect with industry professionals and expand your network.</p>
</li>
<li><p>Stay updated on the latest trends and advancements in data pipeline architecture and orchestration.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Cloud Infrastructure</li>
<li>Data Pipelines</li>
<li>GitHub</li>
<li>VSCode</li>
<li>Docker and Docker Hub</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Understanding Data pipelines</li>
<li>Implementation Options </li>
<li>Pipeline Orchestration</li>
<li>Cloud Resources</li>
<li>Implementing Code-Centric Flows</li>
<li>Deployment with Docker</li>
<li>Monitor and Operations</li>
</ul>
<p><strong>Follow this project: Star/Follow the project</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="understanding-data-pipelines">Understanding Data Pipelines</h3>
<p>A data pipeline refers to a series of connected tasks that handle the extract, transform, and load (ETL) as well as the extract, load, and transform (ELT) operations, moving data from a source to a target storage system such as a data lake or data warehouse.</p>
<h4 id="foundational-areas">Foundational Areas</h4>
<ul>
<li>Data Ingestion and Transformation</li>
<li>Code-Centric vs. Low-Code Options</li>
<li>Orchestration</li>
<li>Cloud Resources</li>
<li>Implementing flows, tasks, components and logging</li>
<li>Deployment</li>
<li>Monitoring and Operations</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration.png" alt="Data Engineering Process Fundamentals - Data Pipeline and Orchestration" title="Data Engineering Process Fundamentals - Data Pipeline and Orchestration"></p>
<h3 id="data-ingestion-and-transformation">Data Ingestion and Transformation</h3>
<p>Data ingestion is the process of bringing data in from various sources, such as databases, APIs, data streams and files, into a staging area. Once the data is ingested, we can transform it to match our requirements.</p>
<p><strong>Key Areas:</strong></p>
<ul>
<li>Identify methods for extracting data from various sources (databases, APIs, Data Streams, files, etc.).</li>
<li>Choose between batch or streaming ingestion based on data needs and use cases</li>
<li>Data cleansing and standardization ensure quality and consistency.</li>
<li>Data enrichment adds context and value.</li>
<li>Formatting into the required data models for analysis.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-data-sources.png" alt="Data Engineering Process Fundamentals - Data Pipeline Sources" title="Data Engineering Process Fundamentals - Data Pipeline Sources"></p>
<h3 id="implementation-options">Implementation Options</h3>
<p>The implementation of a pipeline refers to the design and/or coding of each task in the pipeline. A task can be implemented using a programming language like Python or SQL. It can also be implemented using a low-code tool with little or no code.</p>
<p><strong>Options:</strong></p>
<ul>
<li><p>Code-centric: Provides flexibility, customization, and full control (Python, SQL, etc.). Ideal for complex pipelines with specific requirements. Requires programming expertise.</p>
</li>
<li><p>Low-code: Offers visual drag-and-drop interfaces that allow the engineer to connect to APIs, databases, data lakes, and other sources, enabling faster development. (Azure Data Factory, GCP Cloud Dataflow)</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-integrations.png" alt="Data Engineering Process Fundamentals - Data Pipeline Integration" title="Data Engineering Process Fundamentals - Data Pipeline Integration"></p>
<h3 id="pipeline-orchestration">Pipeline Orchestration</h3>
<p>Orchestration is the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration handles the execution, error handling, retry and the alerting of problems in the pipeline.</p>
<p><strong>Orchestration Tools:</strong></p>
<ul>
<li>Apache Airflow: Offers flexible and customizable workflow creation for engineers using Python code, ideal for complex pipelines.</li>
<li>Apache Spark: Excels at large-scale batch processing tasks involving API calls and file downloads with Python. Its distributed framework efficiently handles data processing and analysis.</li>
<li>Prefect: This open-source workflow management system allows defining and managing data pipelines as code, providing a familiar Python API.</li>
<li>Cloud-based Services: Tools like Azure Data Factory and GCP Cloud Dataflow provide a visual interface for building and orchestrating data pipelines, simplifying development. They also handle logging and alerting.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="Data Engineering Process Fundamentals - Data Pipeline Architecture" title="Data Engineering Process Fundamentals - Data Pipeline Architecture"></p>
<h3 id="cloud-resources">Cloud Resources</h3>
<p>Cloud resources are critical for data pipelines. Virtual machines (VMs) offer processing power for code-centric pipelines, while data lakes serve as central repositories for raw data. Data warehouses, optimized for structured data analysis, often integrate with data lakes to enable deeper insights.</p>
<p><strong>Resources:</strong></p>
<ul>
<li><p><strong>Code-centric pipelines:</strong> VMs are used for executing workflows, managing orchestration, and providing resources for data processing and transformation. Often, code runs within Docker containers.</p>
</li>
<li><p><strong>Data Storage:</strong> Data lakes act as central repositories for storing vast amounts of raw and unprocessed data. They offer scalable and cost-effective solutions for capturing and storing data from diverse sources.</p>
</li>
<li><p><strong>Low-code tools:</strong> typically have their own infrastructure needs specified by the platform provider. Provisioning might not be necessary, and the tool might be serverless or run on pre-defined infrastructure.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-flow.png" alt="Data Engineering Process Fundamentals - Data Pipeline Resources" title="Data Engineering Process Fundamentals - Data Pipeline Resources"></p>
<h3 id="implementing-code-centric-flows">Implementing Code-Centric Flows</h3>
<p>In a data pipeline, orchestrated <strong>flows</strong> define the overall sequence of steps. These flows consist of <strong>tasks</strong>, which represent specific actions within the pipeline. For modularity and reusability, a task should use <strong>components</strong> to encapsulate common concerns like security and data lake access.</p>
<p><strong>Pipeline Structure:</strong></p>
<ul>
<li><p>Flows: Are coordinators that define the overall structure and sequence of the data pipeline. They are responsible for orchestrating the execution of other flows or tasks in a specific order.</p>
</li>
<li><p>Tasks: Are operators for the individual units of work within the pipeline. Each task represents a specific action or function performed on the data, such as data extraction, transformation, or loading. They manipulate the data according to the flow's instructions.</p>
</li>
<li><p>Components: These are reusable code blocks that encapsulate functionalities common across different tasks. They act as utilities, providing shared functionality like security checks, data lake access, logging, or error handling.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-monitor-dashboard.png" alt="Data Engineering Process Fundamentals - Data Pipeline Monitor" title="Data Engineering Process Fundamentals - Data Pipeline Monitor"></p>
<h3 id="deployment-with-docker-and-docker-hub">Deployment with Docker and Docker Hub</h3>
<p>Docker proves invaluable for our data pipelines by providing self-contained environments with all necessary dependencies. With Docker Hub, we can effortlessly distribute pipeline images, facilitating swift and reliable provisioning of new environments.</p>
<ul>
<li><p>Docker containers streamline the deployment process by encapsulating application and dependency configurations, reducing runtime errors.</p>
</li>
<li><p>Containerizing data pipelines ensures reliability and portability by packaging all necessary components within a single container image.</p>
</li>
<li><p>Docker Hub serves as a centralized container registry, enabling seamless image storage and distribution for streamlined environment provisioning and scalability.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="Data Engineering Process Fundamentals - Data Pipeline Containers" title="Data Engineering Process Fundamentals - Data Pipeline Containers"></p>
<h3 id="monitor-and-operations">Monitor and Operations</h3>
<p>Monitoring your data pipeline's performance with telemetry data is key to smooth operations. This enables the operations team to proactively identify and address issues, ensuring efficient data delivery.</p>
<p><strong>Key Components:</strong></p>
<ul>
<li><p><strong>Telemetry Tracing:</strong> Tracks the execution of flows and tasks, providing detailed information about their performance, such as execution time, resource utilization, and error messages.</p>
</li>
<li><p><strong>Monitor and Dashboards:</strong> Visualize key performance indicators (KPIs) through user-friendly dashboards, offering real-time insights into overall pipeline health and facilitating anomaly detection.</p>
</li>
<li><p><strong>Notifications to Support:</strong> Timely alerts are essential for the operations team to be notified of any critical issues or performance deviations, enabling them to take necessary actions.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-monitor-events.png" alt="Data Engineering Process Fundamentals - Data Pipeline Dashboard" title="Data Engineering Process Fundamentals - Data Pipeline Dashboard"></p>
<h2 id="summary">Summary</h2>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, management, and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake cloud resources, which we can also automate with Terraform. By selecting the appropriate programming language and orchestration tools, we can construct resilient pipelines capable of scaling and meeting evolving data demands effectively.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-83709458399936345632024-02-14T15:00:00.008-05:002024-02-15T10:09:21.808-05:00Unlock the Blueprint: Design and Planning Phase - Data Engineering Process Fundamentals<h1 id="overview">Overview</h1>
<p>The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.</p>
<p>In this technical presentation, we embark on the next chapter of our data journey, delving into the critical Design and Planning
Phase. </p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-design-planning.png" alt="Data Engineering Process Fundamentals" title="Data Engineering Process Fundamentals"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/O6gqhreqGDo?si=8vnZxaRj1K7oJCJp" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ul>
<li><p><strong>System Design and Architecture:</strong> Understanding the foundational principles that shape a robust and scalable data system.</p>
</li>
<li><p><strong>Data Pipeline and Orchestration:</strong> Uncovering the essentials of designing an efficient data pipeline and orchestrating seamless data flows.</p>
</li>
<li><p><strong>Source Control and Deployment:</strong> Navigating the best practices for source control, versioning, and deployment strategies.</p>
</li>
<li><p><strong>CI/CD in Data Engineering:</strong> Implementing Continuous Integration and Continuous Deployment (CI/CD) practices for agility and reliability.</p>
</li>
<li><p><strong>Docker Container and Docker Hub:</strong> Harnessing the power of Docker containers and Docker Hub for containerized deployments.</p>
</li>
<li><p><strong>Cloud Infrastructure with IaC:</strong> Exploring technologies for building out cloud infrastructure using Infrastructure as Code (IaC), ensuring efficiency and consistency.</p>
</li>
</ul>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Gain insights into designing scalable and efficient data systems.</p>
</li>
<li><p>Learn best practices for cloud infrastructure and IaC.</p>
</li>
<li><p>Discover the importance of data pipeline orchestration and source control.</p>
</li>
<li><p>Explore the world of CI/CD in the context of data engineering.</p>
</li>
<li><p>Unlock the potential of Docker containers for your data workflows.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Cloud Infrastructure</li>
<li>Data Pipelines</li>
<li>GitHub and Actions</li>
<li>VS Code</li>
<li>Docker and Docker Hub</li>
<li>Terraform</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Importance of Design and Planning</li>
<li>System Design and Architecture</li>
<li>Data Pipeline and Orchestration</li>
<li>Source Control and CI/CD</li>
<li>Docker Containers</li>
<li>Cloud Infrastructure with IaC</li>
</ul>
<p><strong>Follow this project: Give a star</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="importance-of-design-and-planning">Importance of Design and Planning</h3>
<p>The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.</p>
<h4 id="foundational-areas">Foundational Areas</h4>
<ul>
<li>Designing the data pipeline and technology specifications, such as flows, coding language, data governance, and tools</li>
<li>Defining the system architecture, including cloud services for scalability and the data platform</li>
<li>Source control and deployment automation with CI/CD</li>
<li>Using Docker containers for environment isolation to avoid deployment issues</li>
<li>Infrastructure automation with Terraform or cloud CLI tools</li>
<li>System monitoring, notifications, and recovery</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-design-planning.png" alt="Data Engineering Process Fundamentals - Design and Planning" title="Data Engineering Process Fundamentals - Design and Planning"></p>
<h3 id="system-design-and-architecture">System Design and Architecture</h3>
<p>In a system design, we need to clearly define the different technologies that should be used for each area of the solution. It includes the high-level system architecture, which defines the different components and their integration.</p>
<ul>
<li><p>The <strong>design</strong> outlines the technical solution, including system architecture, data integration, flow orchestration, storage platforms, and data processing tools. It focuses on defining technologies for each component to ensure a cohesive and efficient solution.</p>
</li>
<li><p>A <strong>system architecture</strong> is a critical high-level design encompassing various components such as data sources, ingestion resources, workflow orchestration, storage, transformation services, continuous ingestion, validation mechanisms, and analytics tools.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-architecture-stream.png" alt="Data Engineering Process Fundamentals - System Architecture" title="Data Engineering Process Fundamentals - System Architecture"></p>
<h3 id="data-pipeline-and-orchestration">Data Pipeline and Orchestration</h3>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources. </p>
<ul>
<li>This can be code-centric, leveraging languages like Python, SQL</li>
<li>Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution</li>
<li>Monitor services enable us to track telemetry data to support operational requirements</li>
<li>Docker Hub and GitHub can be used for the CI/CD process to deploy our code-centric solutions</li>
<li>Scheduling, recovery from failures, and dashboards are essential for orchestration</li>
<li>Low-code solutions, like Azure Data Factory, can also be used</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="Data Engineering Process Fundamentals - Data Pipeline" title="Data Engineering Process Fundamentals - Data Pipeline"></p>
<h3 id="source-control-ci-cd">Source Control - CI/CD</h3>
<p>Implementing source control practices alongside Continuous Integration and Continuous Delivery (CI/CD) pipelines is vital for facilitating agile development. This ensures efficient collaboration, change tracking, and seamless code deployment, crucial for addressing ongoing feature changes, bug fixes, and new environment deployments.</p>
<ul>
<li>Systems like Git facilitate effective code and configuration file management, enabling collaboration and change tracking.</li>
<li>Platforms such as GitHub enhance collaboration by providing a remote repository for sharing code.</li>
<li>CI involves integrating code changes into a central repository, followed by automated build and test processes to validate changes and provide feedback.</li>
<li>CD automates the deployment of code builds to various environments, such as staging and production, streamlining the release process and ensuring consistency across environments.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-ci-cd.png" alt="Data Engineering Process Fundamentals - GitHub CI/CD" title="Data Engineering Process Fundamentals - GitHub CI/CD"></p>
<h3 id="docker-container-and-docker-hub">Docker Container and Docker Hub</h3>
<p>Docker proves invaluable for our data pipelines by providing self-contained environments with all necessary dependencies. With Docker Hub, we can effortlessly distribute pipeline images, facilitating swift and reliable provisioning of new environments.</p>
<ul>
<li>Docker containers streamline the deployment process by encapsulating application and dependency configurations, reducing runtime errors.</li>
<li>Containerizing data pipelines ensures reliability and portability by packaging all necessary components within a single container image.</li>
<li>Docker Hub serves as a centralized container registry, enabling seamless image storage and distribution for streamlined environment provisioning and scalability.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="Data Engineering Process Fundamentals - Docker" title="Data Engineering Process Fundamentals - Docker"></p>
<h3 id="cloud-infrastructure-with-iac">Cloud Infrastructure with IaC</h3>
<p>Infrastructure automation is crucial for maintaining consistency, scalability, and reliability across environments. By defining infrastructure as code (IaC), organizations can efficiently provision and modify cloud resources, mitigating manual errors.</p>
<ul>
<li>Define infrastructure configurations as code, ensuring consistency across environments.</li>
<li>Easily scale resources up or down to meet changing demands with code-defined infrastructure.</li>
<li>Reduce manual errors and ensure reproducibility by automating resource provisioning and management.</li>
<li>Track infrastructure changes under version control, enabling collaboration and ensuring auditability.</li>
<li>Track infrastructure state, allowing for precise updates and minimizing drift between desired and actual configurations. </li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-terraform.png" alt="Data Engineering Process Fundamentals - Terraform" title="Data Engineering Process Fundamentals - Terraform"></p>
<h2 id="summary">Summary</h2>
<p>The design and planning phase of a data engineering project sets the stage for success. From designing the system architecture and data pipelines to implementing source control, CI/CD, Docker, and infrastructure automation with Terraform, every aspect contributes to efficient and reliable deployment. Infrastructure automation, in particular, plays a critical role by simplifying provisioning of cloud resources, ensuring consistency, and enabling scalability, ultimately leading to a robust and manageable data engineering system. </p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<h4>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></h4>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-30031161643762415292024-01-31T15:00:00.013-05:002024-02-15T10:02:48.527-05:00Decoding Data: A Journey into the Discovery Phase - Data Engineering Process Fundamentals<h1 id="overview">Overview</h1>
<p>The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.</p>
<p>In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA) using Python, VSCode, Jupyter Notebooks, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-decoding-data-with-discovery.png" alt="Data Engineering Process Fundamentals - Discovery Phase" title="Data Engineering Process Fundamentals - Discovery Phase"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/F2WHH5MrmE4?si=QbU8uhwwcBKtwLeI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ol>
<li><p>Introduction:</p>
<ul>
<li><p>Unveiling the importance of the discovery process in data engineering.</p>
</li>
<li><p>Setting the stage with a real-world problem statement that will guide our exploration.</p>
</li>
</ul>
</li>
<li><p>Setting the Stage:</p>
<ul>
<li><p>Downloading and comprehending sample data to kickstart our discovery journey.</p>
</li>
<li><p>Configuring the development environment with VSCode and Jupyter Notebooks.</p>
</li>
</ul>
</li>
<li><p>Exploratory Data Analysis (EDA):</p>
<ul>
<li><p>Delving deep into EDA techniques with a focus on the discovery phase.</p>
</li>
<li><p>Demonstrating practical approaches using Python to uncover insights within the data.</p>
</li>
</ul>
</li>
<li><p>Code-Centric Approach:</p>
<ul>
<li><p>Advocating the significance of a code-centric approach during the discovery process.</p>
</li>
<li><p>Showcasing how a code-centric mindset enhances collaboration, repeatability, and efficiency.</p>
</li>
</ul>
</li>
<li><p>Version Control with GitHub:</p>
<ul>
<li><p>Integrating GitHub seamlessly into our workflow for version control and collaboration.</p>
</li>
<li><p>Managing changes effectively to ensure a streamlined data engineering discovery process.</p>
</li>
</ul>
</li>
<li><p>Real-World Application:</p>
<ul>
<li><p>Applying insights gained from EDA to address the initial problem statement.</p>
</li>
<li><p>Discussing practical solutions and strategies derived from the discovery process.</p>
</li>
</ul>
</li>
</ol>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Mastery of the foundational aspects of data engineering.</p>
</li>
<li><p>Hands-on experience with EDA techniques, emphasizing the discovery phase.</p>
</li>
<li><p>Appreciation for the value of a code-centric approach in the data engineering discovery process.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Python</li>
<li>Data Analysis and Visualization</li>
<li>Jupyter Notebook</li>
<li>Visual Studio Code</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Importance of the Discovery Process</li>
<li>Setting the Stage - Technologies</li>
<li>Exploratory Data Analysis (EDA)</li>
<li>Code-Centric Approach</li>
<li>Version Control</li>
<li>Real-World Use Case</li>
</ul>
<p><strong>Follow this project: Give a star</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="importance-of-the-discovery-process">Importance of the Discovery Process</h3>
<p>The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.</p>
<ul>
<li>Clearly document the problem statement to understand the challenges the project aims to address.</li>
<li>Make observations about the data, its structure, and sources during the discovery process.</li>
<li>Define project requirements based on the observations, enabling the team to understand the scope and goals.</li>
<li>Clearly outline the scope of the project, ensuring a focused and well-defined set of objectives.</li>
<li>Use insights from the discovery phase to inform the design of the solution, including data architecture.</li>
<li>Develop a robust project architecture that aligns with the defined requirements and scope.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-discovery.png" alt="Data Engineering Process Fundamentals - Discovery Process" title="Data Engineering Process Fundamentals - Discovery Process"></p>
<h3 id="setting-the-stage-technologies">Setting the Stage - Technologies</h3>
<p>To set the stage, we need to identify and select the tools that can facilitate the analysis and documentation of the data. Here are key technologies that play a crucial role in this stage:</p>
<ul>
<li><strong>Python:</strong> A versatile programming language with rich libraries for data manipulation, analysis, and scripting.</li>
</ul>
<p><strong>Use Cases:</strong> Data download, cleaning, exploration, and scripting for automation.</p>
<ul>
<li><strong>Jupyter Notebooks:</strong> An interactive tool for creating and sharing documents containing live code, visualizations, and narrative text.</li>
</ul>
<p><strong>Use Cases:</strong> Exploratory data analysis, documentation, and code collaboration.</p>
<ul>
<li><strong>Visual Studio Code:</strong> A lightweight, extensible code editor with powerful features for source code editing and debugging.</li>
</ul>
<p><strong>Use Cases:</strong> Writing and debugging code, integrating with version control systems like GitHub.</p>
<ul>
<li><strong>SQL (Structured Query Language):</strong> A domain-specific language for managing and manipulating relational databases.</li>
</ul>
<p><strong>Use Cases:</strong> Querying databases, data extraction, and transformation.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-discovery-tools.png" alt="Data Engineering Process Fundamentals - Discovery Tools" title="Data Engineering Process Fundamentals - Discovery Tools"></p>
<h3 id="exploratory-data-analysis-eda-">Exploratory Data Analysis (EDA)</h3>
<p>EDA is our go-to method for downloading, analyzing, understanding and documenting the intricacies of the datasets. It's like peeling back the layers of information to reveal the stories hidden within the data. Here's what EDA is all about:</p>
<ul>
<li><p>EDA is the process of analyzing data to identify patterns, relationships, and anomalies, guiding the project's direction.</p>
</li>
<li><p>Python and Jupyter Notebook collaboratively empower us to download, describe, and transform data through live queries.</p>
</li>
<li><p>Insights gained from EDA set the foundation for informed decision-making in subsequent data engineering steps.</p>
</li>
<li><p>Code written on Jupyter Notebook can be exported and used as the starting point for components for the data pipeline and transformation services.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-pie-chart.png" alt="Data Engineering Process Fundamentals - Discovery Pie Chart" title="Data Engineering Process Fundamentals - Discovery Pie Chart"></p>
<h3 id="code-centric-approach">Code-Centric Approach</h3>
<p>A code-centric approach, using programming languages and tools in EDA, helps us understand the coding methodology for building data structures, defining schemas, and establishing relationships. This robust understanding seamlessly guides project implementation.</p>
<ul>
<li><p>Code delves deep into data intricacies, revealing integration and transformation challenges often unclear with visual tools.</p>
</li>
<li><p>Using code taps into Pandas and Numpy libraries, empowering robust manipulation of data frames, establishment of loading schemas, and addressing transformation needs.</p>
</li>
<li><p>Code-centricity enables sophisticated analyses, covering aggregation, distribution, and in-depth examinations of the data.</p>
</li>
<li><p>While visual tools have their merits, a code-centric approach excels in hands-on, detailed data exploration, uncovering subtle nuances and potential challenges. </p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-jupyter-observations.png" alt="Data Engineering Process Fundamentals - Discovery Pie Chart" title="Data Engineering Process Fundamentals - Discovery Pie Chart"></p>
<h3 id="version-control">Version Control</h3>
<p>Using a tool like GitHub is essential for effective version control and collaboration in our discovery process. GitHub enables us to track our exploratory code and Jupyter Notebooks, fostering collaboration, documentation, and comprehensive project management. Here's how GitHub enhances our process:</p>
<ul>
<li><p><strong>Centralized Tracking:</strong> GitHub centralizes tracking and managing our exploratory code and Jupyter Notebooks, ensuring a transparent and organized record of our data exploration.</p>
</li>
<li><p><strong>Sharing:</strong> Easily share code and Notebooks with team members on GitHub, fostering seamless collaboration and knowledge sharing.</p>
</li>
<li><p><strong>Documentation:</strong> GitHub supports Markdown, enabling comprehensive documentation of processes, findings, and insights within the same repository.</p>
</li>
<li><p><strong>Project Management:</strong> GitHub acts as a project management hub, facilitating CI/CD pipeline integration for smooth and automated delivery of data engineering projects.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-problem-statement.png" alt="Data Engineering Process Fundamentals - Discovery Problem Statement" title="Data Engineering Process Fundamentals - Discovery Problem Statement"></p>
<h2 id="summary">Summary</h2>
<p>The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks or other tools to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-76798558017433544712023-12-02T14:32:00.013-05:002023-12-11T11:13:50.288-05:00AI - A Learning Based Approach For Predicting Heart Disease<h1 id="abstract">Abstract</h1>
<p>Heart disease is a leading cause of mortality worldwide, and its early identification and risk assessment are critical for effective prevention and intervention. With the help of electronic health records (EHR) and a wealth of health-related data, there is a significant opportunity to leverage machine learning techniques for predicting and assessing the risk of heart disease in individuals.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-app.png" alt="ozkary-ai-engineering-heart-disease" title="Predicting Heart Disease"></p>
<p>The United States Centers for Disease Control and Prevention (CDC) has been collecting a vast array of data on demographics, lifestyle, medical history, and clinical parameters. This data repository offers a valuable resource to develop predictive models that can help identify those at risk of heart disease before symptoms manifest.</p>
<p>This study aims to use machine learning models to predict an individual's likelihood of developing heart disease based on CDC data. By employing advanced algorithms and data analysis, we seek to create a predictive model that factors in various attributes such as age, gender, cholesterol levels, blood pressure, smoking habits, and other relevant health indicators. The solution could assist healthcare professionals in evaluating an individual's risk profile for heart disease.</p>
<h2 id="key-objectives">Key Objectives</h2>
<p>Key objectives of this study include:</p>
<ol>
<li>Developing a robust machine learning model capable of accurately predicting the risk of heart disease using CDC data.</li>
<li>Identifying the most influential risk factors and parameters contributing to heart disease prediction.</li>
<li>Compare model performance:<ul>
<li>Logistic Regression</li>
<li>Decision Tree</li>
<li>Random Forest</li>
<li>XGBoost Classification</li>
</ul>
</li>
<li>Evaluating the following metrics:<ul>
<li>Accuracy</li>
<li>Precision</li>
<li>F1</li>
<li>Recall</li>
</ul>
</li>
<li>Providing an API, so tools can integrate and make a risk analysis.<ul>
<li>Build a local app </li>
<li>Build an Azure function for cloud deployment</li>
</ul>
</li>
</ol>
<p>The successful implementation of this study will lead to a transformative impact on public health by enabling timely preventive measures and tailored interventions for individuals at risk of heart disease.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This study was conducted using four different Machine Learning algorithms. After comparing the performance of all these models, we concluded that the <strong>XGBoost Model</strong> has relatively balanced precision and recall metrics, indicating that it's better at identifying true positives while keeping false positives in check. Based on this analysis, we chose XGBoost as the best performing model for this type of analysis.</p>
<h2 id="machine-learning-engineering-process">Machine Learning Engineering Process</h2>
<p>In order to execute this project, we follow a series of steps for discovery and data analysis, data processing, and model selection. This process is done using Jupyter Notebooks for the experimental phase, and Python files for the implementation and delivery phase.</p>
<h3 id="experimental-phase-notebooks">Experimental Phase Notebooks</h3>
<ul>
<li>Data analysis and cleanup <ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_analysis.ipynb">Step 1 - Data Analysis</a> </li>
</ul>
</li>
<li>Process and convert the data for modeling, feature analysis<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_processing.ipynb">Step 2 - Data Processing</a></li>
</ul>
</li>
<li>Train the model using different algorithm to evaluate the best option<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_train.ipynb">Step 3 - Model Training</a></li>
</ul>
</li>
<li>Run test cases and predict results<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_predict.ipynb">Step 4 - Model Prediction</a></li>
</ul>
</li>
</ul>
<blockquote>
<p>The data files for this study can be found in the same GitHub project as the Jupyter Notebook files.</p>
</blockquote>
<h2 id="data-analysis-exploratory-data-analysis-eda-">Data Analysis - Exploratory Data Analysis (EDA)</h2>
<p>These are the steps to analyze the data (a brief sketch follows the list):</p>
<ul>
<li>Load the data/2020/heart_2020_cleaned.csv</li>
<li>Fill in the missing values with zero</li>
<li>Review the data <ul>
<li>Rename the columns to lowercase</li>
<li>Check the data types</li>
<li>Preview the data</li>
</ul>
</li>
<li>Identify the features<ul>
<li>Identify the categorical and numeric features</li>
<li>Identify the target variables </li>
</ul>
</li>
<li>Remove duplicates</li>
<li>Identify categorical features that can be converted into binary</li>
<li>Check the class balance in the data<ul>
<li>Check for Y/N labels for heart disease identification</li>
</ul>
</li>
</ul>
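<p>Here is a brief sketch of those steps; it assumes the CDC file layout described in this section and keeps only the essential calls.</p>
<pre><code class="lang-python">import pandas as pd

# load the CDC dataset
df = pd.read_csv("./data/2020/heart_2020_cleaned.csv")

# fill in the missing values with zero
df = df.fillna(0)

# rename the columns to lowercase
df.columns = df.columns.str.lower()

# check the data types and preview the data
print(df.dtypes)
print(df.head())

# remove duplicate rows
df = df.drop_duplicates()

# check the class balance for the Y/N heart disease labels
print(df["heartdisease"].value_counts(normalize=True))
</code></pre>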
<h3 id="features">Features</h3>
<p>Based on the dataset, we have a mix of categorical and numerical features. We consider the following for encoding:</p>
<ol>
<li><p><strong>Categorical Features:</strong></p>
<ul>
<li>'heartdisease': This is the target variable. We remove this feature for the model training.</li>
<li>'smoking', 'alcoholdrinking', 'stroke', 'sex', 'agecategory', 'race', 'diabetic', 'physicalactivity', 'genhealth', 'sleeptime', 'asthma', 'kidneydisease', 'skincancer': These are categorical features. We can consider one-hot encoding these features.</li>
</ul>
</li>
<li><p><strong>Numerical Features:</strong></p>
<ul>
<li>'bmi', 'physicalhealth', 'mentalhealth', 'diffwalking': These are already numerical features, so there's no need to encode them.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># get a list of numeric features</span>
features_numeric = <span class="hljs-keyword">list</span>(df.select_dtypes(<span class="hljs-keyword">include</span>=[np.number]).columns)
<span class="hljs-comment"># get a list of object features and exclude the target feature 'heartdisease'</span>
features_category = <span class="hljs-keyword">list</span>(df.select_dtypes(<span class="hljs-keyword">include</span>=[<span class="hljs-string">'object'</span>]).columns)
<span class="hljs-comment"># remove the target feature from the list of categorical features</span>
target = <span class="hljs-string">'heartdisease'</span>
features_category.remove(target)
<span class="hljs-keyword">print</span>(<span class="hljs-string">'Categorical features'</span>,features_category)
<span class="hljs-keyword">print</span>(<span class="hljs-string">'Numerical features'</span>,features_numeric)
</code></pre>
<h3 id="data-validation-and-class-balance">Data Validation and Class Balance</h3>
<p>The data shows an imbalance between the Y/N classes. There are fewer cases of heart disease, as expected, than the rest of the population. This can result in low performing models, as there are many more negative cases (N). To account for that, we can use techniques like down-sampling the negative cases.</p>
<h4 id="heart-disease-distribution">Heart Disease Distribution</h4>
<pre><code class="lang-python"><span class="hljs-comment"># plot a distribution of the target variable set labels for each bar chart and show the count</span>
print(df[target].value_counts(normalize=<span class="hljs-keyword">True</span>).round(<span class="hljs-number">2</span>))
<span class="hljs-comment"># plot the distribution of the target variable</span>
df[target].value_counts().plot(kind=<span class="hljs-string">'bar'</span>, rot=<span class="hljs-number">0</span>)
plt.xlabel(<span class="hljs-string">'Heart disease'</span>)
plt.ylabel(<span class="hljs-string">'Count'</span>)
<span class="hljs-comment"># add a count label to each bar</span>
<span class="hljs-keyword">for</span> i, count <span class="hljs-keyword">in</span> enumerate(df[target].value_counts()):
plt.text(i, count<span class="hljs-number">-50</span>, count, ha=<span class="hljs-string">'center'</span>, va=<span class="hljs-string">'top'</span>, fontweight=<span class="hljs-string">'bold'</span>)
plt.show()
<span class="hljs-comment"># # get the percentage of people with heart disease on a pie chart</span>
df[target].value_counts(normalize=<span class="hljs-keyword">True</span>).plot(kind=<span class="hljs-string">'pie'</span>, labels=[<span class="hljs-string">'No heart disease'</span>, <span class="hljs-string">'Heart disease'</span>], autopct=<span class="hljs-string">'%1.1f%%'</span>, startangle=<span class="hljs-number">90</span>)
plt.ylabel(<span class="hljs-string">''</span>)
plt.show()
</code></pre>
<blockquote>
<p>No: 91%, Yes: 9%</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-class-balance.png" alt="Heart Disease Class Balance"></p>
<h2 id="data-processing">Data Processing</h2>
<p>For data processing, we should follow these steps (a brief sketch of the value conversion follows the list):</p>
<ul>
<li>Load the data/2020/heart_2020_eda.csv</li>
<li>Process the values<ul>
<li>Convert Yes/No features to binary (1/0)</li>
<li>Cast all the numeric values to int to avoid float problems</li>
</ul>
</li>
<li>Process the features<ul>
<li>Set the categorical features names</li>
<li>Set the numeric features names </li>
<li>Set the target variable</li>
</ul>
</li>
<li>Feature importance analysis<ul>
<li>Use statistical analysis to get the metrics like risk and ratio</li>
<li>Mutual Information score</li>
</ul>
</li>
</ul>
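<p>Below is a brief sketch of the value-conversion step from the list above; the exact lists of Yes/No and numeric columns are illustrative and would come from the feature analysis.</p>
<pre><code class="lang-python">import pandas as pd

# load the output of the EDA step
df = pd.read_csv("./data/2020/heart_2020_eda.csv")

# illustrative subset of Yes/No features to convert to binary (1/0)
binary_features = ["heartdisease", "smoking", "alcoholdrinking", "stroke",
                   "asthma", "kidneydisease", "skincancer"]
for feature in binary_features:
    df[feature] = df[feature].map({"Yes": 1, "No": 0})

# cast the numeric values to int to avoid float problems
numeric_features = ["bmi", "physicalhealth", "mentalhealth"]
df[numeric_features] = df[numeric_features].astype(int)

print(df.head())
</code></pre>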
<h4 id="feature-analysis">Feature Analysis</h4>
<p>The purpose of feature analysis in a heart disease study is to uncover the relationships and associations between various patient characteristics (features) and the occurrence of heart disease. By examining factors such as lifestyle, medical history, demographics, and more, we aim to identify which specific attributes or combinations of attributes are most strongly correlated with heart disease. Feature analysis allows for the discovery of risk factors and insights that can inform prevention and early detection strategies. </p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate the mean and count of heart disease occurrences per feature value</span>
feature_importance = []
<span class="hljs-comment"># Create a dataframe for the analysis</span>
results = pd.DataFrame(columns=[<span class="hljs-string">'Feature'</span>, <span class="hljs-string">'Value'</span>, <span class="hljs-string">'Percentage'</span>])
<span class="hljs-keyword">for</span> feature <span class="hljs-keyword">in</span> all_features:
grouped = df.groupby(feature)[target].mean().reset_index()
grouped.columns = [<span class="hljs-string">'Value'</span>, <span class="hljs-string">'Percentage'</span>]
grouped[<span class="hljs-string">'Feature'</span>] = feature
results = pd.concat([results, grouped], axis=<span class="hljs-number">0</span>)
<span class="hljs-comment"># Sort the results by percentage in descending order and get the top 10</span>
results = results.sort_values(by=<span class="hljs-string">'Percentage'</span>, ascending=<span class="hljs-keyword">False</span>).head(<span class="hljs-number">15</span>)
<span class="hljs-comment"># get the overall heart diease occurrence rate</span>
overall_rate = df[target].mean()
print(<span class="hljs-string">'Overall Rate'</span>,overall_rate)
<span class="hljs-comment"># calculate the difference between the feature value percentage and the overall rate</span>
results[<span class="hljs-string">'Difference'</span>] = results[<span class="hljs-string">'Percentage'</span>] - overall_rate
<span class="hljs-comment"># calculate the ratio of the difference to the overall rate</span>
results[<span class="hljs-string">'Ratio'</span>] = results[<span class="hljs-string">'Difference'</span>] / overall_rate
<span class="hljs-comment"># calculate the risk of heart disease occurrence for each feature value</span>
results[<span class="hljs-string">'Risk'</span>] = results[<span class="hljs-string">'Percentage'</span>] / overall_rate
<span class="hljs-comment"># sort the results by ratio in descending order</span>
results = results.sort_values(by=<span class="hljs-string">'Risk'</span>, ascending=<span class="hljs-keyword">False</span>)
print(results)
<span class="hljs-comment"># Visualize the rankings (e.g., create a bar plot)</span>
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
sns.barplot(data=results, x=<span class="hljs-string">'Percentage'</span>, y=<span class="hljs-string">'Value'</span>, hue=<span class="hljs-string">'Feature'</span>)
plt.xlabel(<span class="hljs-string">'Percentage of Heart Disease Occurrences'</span>)
plt.ylabel(<span class="hljs-string">'Feature Value'</span>)
plt.title(<span class="hljs-string">'Top 15 Ranking of Feature Values by Heart Disease Occurrence'</span>)
plt.show()
</code></pre>
<pre><code class="lang-bash">
Overall Rate <span class="hljs-number">0.09035</span>
Feature Value Percentage Difference Ratio Risk
<span class="hljs-number">65</span> bmi <span class="hljs-number">77</span> <span class="hljs-number">0.400000</span> <span class="hljs-number">0.309647</span> <span class="hljs-number">3.427086</span> <span class="hljs-number">4.427086</span>
<span class="hljs-number">1</span> stroke <span class="hljs-number">1</span> <span class="hljs-number">0.363810</span> <span class="hljs-number">0.273457</span> <span class="hljs-number">3.026542</span> <span class="hljs-number">4.026542</span>
<span class="hljs-number">3</span> genhealth Poor <span class="hljs-number">0.341131</span> <span class="hljs-number">0.250778</span> <span class="hljs-number">2.775537</span> <span class="hljs-number">3.775537</span>
<span class="hljs-number">68</span> bmi <span class="hljs-number">80</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">18</span> sleeptime <span class="hljs-number">19</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">71</span> bmi <span class="hljs-number">83</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">21</span> sleeptime <span class="hljs-number">22</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">1</span> kidneydisease <span class="hljs-number">1</span> <span class="hljs-number">0.293308</span> <span class="hljs-number">0.202956</span> <span class="hljs-number">2.246254</span> <span class="hljs-number">3.246254</span>
<span class="hljs-number">29</span> physicalhealth <span class="hljs-number">29</span> <span class="hljs-number">0.289216</span> <span class="hljs-number">0.198863</span> <span class="hljs-number">2.200957</span> <span class="hljs-number">3.200957</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-feature-analysis.png" alt="Heart Disease Feature Importance"></p>
<ol>
<li><p><code>Overall Rate</code>: This is the overall rate of heart disease occurrence in the dataset. It represents the proportion of individuals with heart disease (target='Yes') in the dataset. For example, if the overall rate is 0.2, it means that 20% of the individuals in the dataset have heart disease.</p>
</li>
<li><p><code>Difference</code>: This value represents the difference between the percentage of heart disease occurrence for a specific feature value and the overall rate. It tells us how much more or less likely individuals with a particular feature value are to have heart disease compared to the overall population. A positive difference indicates a higher likelihood, while a negative difference indicates a lower likelihood.</p>
</li>
<li><p><code>Ratio</code>: The ratio represents the difference relative to the overall rate. It quantifies how much the heart disease occurrence for a specific feature value deviates from the overall rate, considering the overall rate as the baseline. A ratio greater than 1 indicates a higher risk compared to the overall population, while a ratio less than 1 indicates a lower risk.</p>
</li>
<li><p><code>Risk</code>: This metric directly quantifies the likelihood of an event happening for a specific feature value, expressed as a percentage. It's easier to interpret as it directly answers the question: "What is the likelihood of heart disease for individuals with this feature value?"</p>
</li>
</ol>
<p>These values help us understand the relationship between different features and heart disease. Positive differences, ratios greater than 1, and risk values greater than 100% suggest a higher risk associated with a particular feature value, while negative differences, ratios less than 1, and risk values less than 100% suggest a lower risk. This information can be used to identify factors that may increase or decrease the risk of heart disease within the dataset.</p>
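<p>In case it helps to see how such a table can be derived, here is a minimal pandas sketch, not necessarily the exact code used in this post. It assumes the processed dataframe <code>df</code> and the 'Yes'/'No' target encoding mentioned above; the column and function names are illustrative.</p>
<pre><code class="lang-python"># Illustrative sketch: compute rate, difference, ratio, and risk per feature value
import pandas as pd

def feature_value_risk(df: pd.DataFrame, feature: str, target: str = 'heartdisease') -> pd.DataFrame:
    overall_rate = (df[target] == 'Yes').mean()                 # overall rate of heart disease
    rate = df.groupby(feature)[target].apply(lambda s: (s == 'Yes').mean())
    out = pd.DataFrame({'feature': feature, 'value': rate.index, 'rate': rate.values})
    out['difference'] = out['rate'] - overall_rate              # above or below the overall rate
    out['ratio'] = out['difference'] / overall_rate             # deviation relative to the baseline
    out['risk'] = out['rate'] / overall_rate                    # likelihood relative to the population
    return out.sort_values('risk', ascending=False)

# Example: risk profile for the stroke feature (column name assumed)
# print(feature_value_risk(df, 'stroke'))
</code></pre>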
<h4 id="mutual-information-score">Mutual Information Score</h4>
<p>The mutual information score measures the dependency between a feature and the target variable. Higher scores indicate stronger dependency, while lower scores indicate weaker dependency. A higher score suggests that the feature is more informative when predicting the target variable.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Compute mutual information scores for each feature</span>
X = df[cat_features]
y = df[target]
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mutual_info_heart_disease_score</span><span class="hljs-params">(series)</span>:</span>
<span class="hljs-keyword">return</span> mutual_info_score(series, y)
mi_scores = X.apply(mutual_info_heart_disease_score)
mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=<span class="hljs-keyword">False</span>)
print(mi_ranking)
<span class="hljs-comment"># Visualize the rankings</span>
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
sns.barplot(x=mi_ranking.values, y=mi_ranking.index)
plt.xlabel(<span class="hljs-string">'Mutual Information Scores'</span>)
plt.ylabel(<span class="hljs-string">'Feature'</span>)
plt.title(<span class="hljs-string">'Feature Importance Ranking via Mutual Information Scores'</span>)
</code></pre>
<pre><code class="lang-bash"><span class="hljs-selector-tag">agecategory</span> 0<span class="hljs-selector-class">.033523</span>
<span class="hljs-selector-tag">genhealth</span> 0<span class="hljs-selector-class">.027151</span>
<span class="hljs-selector-tag">diabetic</span> 0<span class="hljs-selector-class">.012960</span>
<span class="hljs-selector-tag">sex</span> 0<span class="hljs-selector-class">.002771</span>
<span class="hljs-selector-tag">race</span> 0<span class="hljs-selector-class">.001976</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-feature-importance.png" alt="Heart Disease Feature Importance"></p>
<h2 id="machine-learning-training-and-model-selection">Machine Learning Training and Model Selection</h2>
<ul>
<li>Load the data/2020/heart_2020_processed.csv</li>
<li>Process the features<ul>
<li>Set the categorical feature names</li>
<li>Set the numeric feature names</li>
<li>Set the target variable</li>
</ul>
</li>
<li>Split the data<ul>
<li>train/validation/test split with 60%/20%/20% distribution.</li>
<li>random_state=42</li>
<li>Use stratify=y to deal with the class imbalance problem</li>
</ul>
</li>
<li>Train the model<ul>
<li>LogisticRegression</li>
<li>RandomForestClassifier</li>
<li>XGBClassifier</li>
<li>DecisionTreeClassifier</li>
</ul>
</li>
<li>Evaluate the models and compare them<ul>
<li>accuracy_score</li>
<li>precision_score</li>
<li>recall_score</li>
<li>f1_score</li>
</ul>
</li>
<li>Confusion Matrix</li>
</ul>
<h3 id="data-split">Data Split</h3>
<ul>
<li>Use a 60/20/20 distribution for train/val/test</li>
<li>random_state=42 to shuffle the data</li>
<li>Use stratify=y when there is a class imbalance in the dataset. It helps ensure that the class distribution in both the training and validation (or test) sets closely resembles the original dataset's class distribution</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">split_data</span><span class="hljs-params">(self, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)</span>:</span>
<span class="hljs-string">"""
Split the data into training and validation sets
"""</span>
<span class="hljs-comment"># split the data in train/val/test sets, with 60%/20%/20% distribution with seed 1</span>
X = self.df[self.all_features]
y = self.df[self.target_variable]
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
<span class="hljs-comment"># .25 splits the 80% train into 60% train and 20% val</span>
X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=<span class="hljs-number">0.25</span>, random_state=random_state)
X_train = X_train.reset_index(drop=<span class="hljs-keyword">True</span>)
X_val = X_val.reset_index(drop=<span class="hljs-keyword">True</span>)
y_train = y_train.reset_index(drop=<span class="hljs-keyword">True</span>)
y_val = y_val.reset_index(drop=<span class="hljs-keyword">True</span>)
X_test = X_test.reset_index(drop=<span class="hljs-keyword">True</span>)
y_test = y_test.reset_index(drop=<span class="hljs-keyword">True</span>)
<span class="hljs-comment"># print the shape of all the data splits</span>
print(<span class="hljs-string">'X_train shape'</span>, X_train.shape)
print(<span class="hljs-string">'X_val shape'</span>, X_val.shape)
print(<span class="hljs-string">'X_test shape'</span>, X_test.shape)
print(<span class="hljs-string">'y_train shape'</span>, y_train.shape)
print(<span class="hljs-string">'y_val shape'</span>, y_val.shape)
print(<span class="hljs-string">'y_test shape'</span>, y_test.shape)
<span class="hljs-keyword">return</span> X_train, X_val, y_train, y_val, X_test, y_test
X_train, X_val, y_train, y_val, X_test, y_test = train_data.split_data(test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre>
<p>The <code>split_data</code> call is a method that splits a dataset into training, validation, and test sets. Here's a breakdown of the returned values:</p>
<ul>
<li><p><strong>X_train:</strong> This represents the features (input variables) of the training set. The model will be trained on this data.</p>
</li>
<li><p><strong>y_train:</strong> This corresponds to the labels (output variable) for the training set. It contains the correct outcomes corresponding to the features in <code>X_train</code>.</p>
</li>
<li><p><strong>X_val:</strong> These are the features of the validation set. The model's performance is often assessed on this set during training to ensure it generalizes well to new, unseen data.</p>
</li>
<li><p><strong>y_val:</strong> These are the labels for the validation set. They serve as the correct outcomes for the features in <code>X_val</code> during the evaluation of the model's performance.</p>
</li>
<li><p><strong>X_test:</strong> These are the features of the test set. The model's final evaluation is typically done on this set to assess its performance on completely unseen data.</p>
</li>
<li><p><strong>y_test:</strong> Similar to <code>y_val</code>, this contains the labels for the test set. It represents the correct outcomes for the features in <code>X_test</code> during the final evaluation of the model.</p>
</li>
</ul>
<h4 id="model-training">Model Training</h4>
<p>For model training, we first pre-process the data by taking these steps:</p>
<ul>
<li><code>preprocess_data</code> <ul>
<li>The input features X are converted to a dictionary format using the to_dict method with the orientation set to <code>records</code>. This is a common step when working with scikit-learn transformers, as they often expect input data in this format. </li>
<li>If is_training is True, it fits a transformer (self.encoder) on the data using the fit_transform method. If False, it transforms the data using the previously fitted transformer (self.encoder.transform). The standardized features are then returned.</li>
</ul>
</li>
</ul>
<p>We then train the different models:</p>
<ul>
<li><code>train</code><ul>
<li>This method takes X_train (training features) and y_train (training labels) as parameters.</li>
<li>If the models attribute of the class is None, it initializes a dictionary of machine learning models, including logistic regression, random forest, XGBoost, and decision tree classifiers.</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_data</span><span class="hljs-params">(self, X, is_training=True)</span>:</span>
<span class="hljs-string">"""
Preprocess the data for training or validation
"""</span>
X_dict = X.to_dict(orient=<span class="hljs-string">'records'</span>)
<span class="hljs-keyword">if</span> is_training:
X_std = self.encoder.fit_transform(X_dict)
<span class="hljs-keyword">else</span>:
X_std = self.encoder.transform(X_dict)
<span class="hljs-comment"># Return the standardized features and target variable</span>
<span class="hljs-keyword">return</span> X_std
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span><span class="hljs-params">(self, X_train, y_train)</span>:</span>
<span class="hljs-keyword">if</span> self.models <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>:
self.models = {
<span class="hljs-string">'logistic_regression'</span>: LogisticRegression(C=<span class="hljs-number">10</span>, max_iter=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">42</span>),
<span class="hljs-string">'random_forest'</span>: RandomForestClassifier(n_estimators=<span class="hljs-number">100</span>, max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>, n_jobs=<span class="hljs-number">-1</span>),
<span class="hljs-string">'xgboost'</span>: XGBClassifier(n_estimators=<span class="hljs-number">100</span>, max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>, n_jobs=<span class="hljs-number">-1</span>),
<span class="hljs-string">'decision_tree'</span>: DecisionTreeClassifier(max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>)
}
<span class="hljs-keyword">for</span> model <span class="hljs-keyword">in</span> self.models.keys():
print(<span class="hljs-string">'Training model'</span>, model)
self.models[model].fit(X_train, y_train)
<span class="hljs-comment"># hot encode the categorical features for the train data</span>
model_factory = HeartDiseaseModelFactory(cat_features, num_features)
X_train_std = model_factory.preprocess_data(X_train[cat_features + num_features], <span class="hljs-keyword">True</span>)
<span class="hljs-comment"># hot encode the categorical features for the validation data</span>
X_val_std = model_factory.preprocess_data(X_val[cat_features + num_features], <span class="hljs-keyword">False</span>)
<span class="hljs-comment"># Train the model</span>
model_factory.train(X_train_std, y_train)
</code></pre>
<h4 id="model-evaluation">Model Evaluation</h4>
<p>For the model evaluation, we calculate the following metrics. Because the dataset is heavily imbalanced (far more "No" than "Yes" cases), accuracy alone can be misleading, so precision, recall, and F1 matter just as much:</p>
<ol>
<li><p><strong>Accuracy</strong> tells us how often your model is correct. It's the percentage of all predictions that are accurate. For example, an accuracy of 92% is great, while 70% is not good.</p>
</li>
<li><p><strong>Precision</strong> is about being precise, not making many mistakes. It's the percentage of positive predictions that were actually correct. For instance, a precision of 90% is great, while 50% is not good.</p>
</li>
<li><p><strong>Recall</strong> is about not missing any positive instances. It's the percentage of actual positives that were correctly predicted. A recall of 85% is great, while 30% is not good.</p>
</li>
<li><p><strong>F1 Score</strong> is a balance between precision and recall. It's like having the best of both worlds. For example, an F1 score of 80% is great, while 45% is not good.</p>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate</span><span class="hljs-params">(self, X_val, y_val, threshold=<span class="hljs-number">0.5</span>)</span>:</span>
<span class="hljs-string">"""
Evaluate the model on the validation data set and return the predictions
"""</span>
<span class="hljs-comment"># create a dataframe to store the metrics</span>
df_metrics = pd.DataFrame(columns=[<span class="hljs-string">'model'</span>, <span class="hljs-string">'accuracy'</span>, <span class="hljs-string">'precision'</span>, <span class="hljs-string">'recall'</span>, <span class="hljs-string">'f1'</span>, <span class="hljs-string">'y_pred'</span>])
<span class="hljs-comment"># define the metrics to be calculated</span>
fn_metrics = { <span class="hljs-string">'accuracy'</span>: accuracy_score,<span class="hljs-string">'precision'</span>: precision_score,<span class="hljs-string">'recall'</span>: recall_score,<span class="hljs-string">'f1'</span>: f1_score}
<span class="hljs-comment"># loop through the models and get its metrics</span>
<span class="hljs-keyword">for</span> model_name <span class="hljs-keyword">in</span> self.models.keys():
model = self.models[model_name]
<span class="hljs-comment"># The first column (y_pred_proba[:, 0]) is for class 0 ("N")</span>
<span class="hljs-comment"># The second column (y_pred_proba[:, 1]) is for class 1 ("Y") </span>
y_pred = model.predict_proba(X_val)[:,<span class="hljs-number">1</span>]
<span class="hljs-comment"># get the binary predictions</span>
y_pred_binary = np.where(y_pred > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
<span class="hljs-comment"># add a new row to the dataframe for each model </span>
df_metrics.loc[len(df_metrics)] = [model_name, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, y_pred_binary]
<span class="hljs-comment"># get the row index</span>
row_index = len(df_metrics)<span class="hljs-number">-1</span>
<span class="hljs-comment"># Evaluate the model metrics</span>
<span class="hljs-keyword">for</span> metric <span class="hljs-keyword">in</span> fn_metrics.keys():
score = fn_metrics[metric](y_val, y_pred_binary)
df_metrics.at[row_index,metric] = score
<span class="hljs-keyword">return</span> df_metrics
</code></pre>
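<p>A hypothetical call, consistent with the snippets above, shows how this evaluation could be run against the preprocessed validation set to produce the metrics summarized next:</p>
<pre><code class="lang-python"># Hypothetical usage, continuing from the training step above
df_metrics = model_factory.evaluate(X_val_std, y_val, threshold=0.5)
print(df_metrics[['model', 'accuracy', 'precision', 'recall', 'f1']])
</code></pre>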
<p><strong>Model Performance Metrics:</strong></p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logistic Regression</td>
<td>0.9097</td>
<td>0.509</td>
<td>0.0987</td>
<td>0.1654</td>
</tr>
<tr>
<td>Random Forest</td>
<td>0.9095</td>
<td>0.6957</td>
<td>0.0029</td>
<td>0.0058</td>
</tr>
<tr>
<td>XGBoost</td>
<td>0.9099</td>
<td>0.5154</td>
<td>0.098</td>
<td>0.1647</td>
</tr>
<tr>
<td>Decision Tree</td>
<td>0.9097</td>
<td>0.5197</td>
<td>0.0556</td>
<td>0.1004</td>
</tr>
</tbody>
</table>
<p>These metrics provide insights into the performance of each model, helping us understand their strengths and areas for improvement. </p>
<p><strong>Analysis:</strong></p>
<ul>
<li><p>XGBoost Model:</p>
<ul>
<li>Accuracy: 90.99%</li>
<li>Precision: 51.54%</li>
<li>Recall: 9.80%</li>
<li>F1 Score: 16.47%</li>
</ul>
</li>
<li><p>Decision Tree Model:</p>
<ul>
<li>Accuracy: 90.97%</li>
<li>Precision: 51.97%</li>
<li>Recall: 5.56%</li>
<li>F1 Score: 10.04%</li>
</ul>
</li>
<li><p>Logistic Regression Model:</p>
<ul>
<li>Accuracy: 90.97%</li>
<li>Precision: 50.90%</li>
<li>Recall: 9.87%</li>
<li>F1 Score: 16.54%</li>
</ul>
</li>
<li><p>Random Forest Model:</p>
<ul>
<li>Accuracy: 90.95%</li>
<li>Precision: 69.57%</li>
<li>Recall: 0.29%</li>
<li>F1 Score: 0.58%</li>
</ul>
</li>
</ul>
<ul>
<li><p>XGBoost Model has a relatively balanced precision and recall, indicating it's better at identifying true positives while keeping false positives in check.</p>
</li>
<li><p>Decision Tree Model has a lower recall than XGBoost and Logistic Regression, suggesting that it misses more positive cases.</p>
</li>
<li><p>Logistic Regression Model has a good balance of precision and recall similar to the XGBoost Model.</p>
</li>
<li><p>Random Forest Model has high precision but an extremely low recall, meaning it's cautious in predicting positive cases but may miss many of them.</p>
</li>
</ul>
<p>Based on this analysis, we will choose XGBoost as our API model.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-model-evaluation.png" alt="Heart Disease Model Evaluation"></p>
<p><strong>Confusion Matrix:</strong></p>
<p>The confusion matrix is a valuable tool for evaluating the performance of classification models, especially for a binary classification problem like predicting heart disease (where the target variable has two classes: 0 for "No" and 1 for "Yes"). Let's analyze what the confusion matrix represents for heart disease prediction using the four models.</p>
<p>For this analysis, we'll consider the following terms:</p>
<ul>
<li><p>True Positives (TP): The model correctly predicted "Yes" (heart disease) when the actual label was also "Yes."</p>
</li>
<li><p>True Negatives (TN): The model correctly predicted "No" (no heart disease) when the actual label was also "No."</p>
</li>
<li><p>False Positives (FP): The model incorrectly predicted "Yes" when the actual label was "No." (Type I error)</p>
</li>
<li><p>False Negatives (FN): The model incorrectly predicted "No" when the actual label was "Yes." (Type II error)</p>
</li>
</ul>
<pre><code class="lang-python">from sklearn.metrics import confusion_matrix
import seaborn <span class="hljs-keyword">as</span> sns
import matplotlib.pyplot <span class="hljs-keyword">as</span> plt
cms = []
model_names = []
total_samples = []
<span class="hljs-keyword">for</span> model_name in df_metrics[<span class="hljs-string">'model'</span>]:
model_y_pred = df_metrics[df_metrics[<span class="hljs-string">'model'</span>] == model_name][<span class="hljs-string">'y_pred'</span>].iloc[<span class="hljs-number">0</span>]
# Compute the confusion matrix
<span class="hljs-keyword">cm</span> = confusion_matrix(y_val, model_y_pred)
cms.<span class="hljs-keyword">append</span>(<span class="hljs-keyword">cm</span>)
model_names.<span class="hljs-keyword">append</span>(model_name)
total_samples.<span class="hljs-keyword">append</span>(np.sum(<span class="hljs-keyword">cm</span>))
# Create <span class="hljs-keyword">a</span> <span class="hljs-number">2</span>x2 grid of subplots
fig, axes = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">10</span>))
# Loop through the subplots <span class="hljs-built_in">and</span> plot the confusion matrices
<span class="hljs-keyword">for</span> i, ax in enumerate(axes.flat):
<span class="hljs-keyword">cm</span> = cms[i]
<span class="hljs-keyword">im</span> = ax.imshow(<span class="hljs-keyword">cm</span>, interpolation=<span class="hljs-string">'nearest'</span>, <span class="hljs-keyword">cmap</span>=plt.<span class="hljs-keyword">cm</span>.Blues)
ax.figure.colorbar(<span class="hljs-keyword">im</span>, ax=ax, shrink=<span class="hljs-number">0.6</span>)
# Set labels, title, <span class="hljs-built_in">and</span> value in the <span class="hljs-keyword">center</span> of the heatmap
ax.<span class="hljs-keyword">set</span>(xticks=np.arange(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">1</span>]), yticks=np.arange(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">0</span>]),
xticklabels=[<span class="hljs-string">"No Heart Disease"</span>, <span class="hljs-string">"Heart Disease"</span>], yticklabels=[<span class="hljs-string">"No Heart Disease"</span>, <span class="hljs-string">"Heart Disease"</span>],
title=<span class="hljs-keyword">f</span><span class="hljs-string">'{model_names[i]} (n={total_samples[i]})\n'</span>)
# Loop <span class="hljs-keyword">to</span> annotate each quadrant with its <span class="hljs-built_in">count</span>
<span class="hljs-keyword">for</span> i in <span class="hljs-built_in">range</span>(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">0</span>]):
<span class="hljs-keyword">for</span> <span class="hljs-keyword">j</span> in <span class="hljs-built_in">range</span>(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">1</span>]):
ax.text(<span class="hljs-keyword">j</span>, i, str(<span class="hljs-keyword">cm</span>[i, <span class="hljs-keyword">j</span>]), <span class="hljs-keyword">ha</span>=<span class="hljs-string">"center"</span>, va=<span class="hljs-string">"center"</span>, color=<span class="hljs-string">"gray"</span>)
ax.title.set_fontsize(<span class="hljs-number">12</span>)
ax.set_xlabel(<span class="hljs-string">'Predicted'</span>, fontsize=<span class="hljs-number">10</span>)
ax.set_ylabel(<span class="hljs-string">'Actual'</span>, fontsize=<span class="hljs-number">10</span>)
ax.xaxis.set_label_position(<span class="hljs-string">'top'</span>)
# Adjust the layout
plt.tight_layout()
</code></pre>
<p>Let's examine the confusion matrices for each model:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-model-confusion-matrix.png" alt="Heart Disease Model Confusion Matrix"></p>
<ul>
<li><p><strong>XGBoost</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 536</li>
<li>True Negatives (TN): 54,370</li>
<li>False Positives (FP): 504</li>
<li>False Negatives (FN): 4,934</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Decision Tree</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 304</li>
<li>True Negatives (TN): 54,593</li>
<li>False Positives (FP): 281</li>
<li>False Negatives (FN): 5,166</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Logistic Regression</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 540</li>
<li>True Negatives (TN): 54,353</li>
<li>False Positives (FP): 521</li>
<li>False Negatives (FN): 4,930</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Random Forest</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 16</li>
<li>True Negatives (TN): 54,867</li>
<li>False Positives (FP): 7</li>
<li>False Negatives (FN): 5,454</li>
</ul>
</li>
</ul>
</li>
</ul>
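<p>As a quick cross-check, the precision and recall reported in the metrics table can be reproduced directly from these counts. A small sketch using the XGBoost totals:</p>
<pre><code class="lang-python"># Sanity check: recompute precision, recall, and F1 from the XGBoost confusion-matrix counts above
tp, fp, fn = 536, 504, 4934
precision = tp / (tp + fp)                           # 536 / 1040  ~ 0.515
recall = tp / (tp + fn)                              # 536 / 5470  ~ 0.098
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.165
print(round(precision, 4), round(recall, 4), round(f1, 4))
</code></pre>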
<p><strong>XGBoost</strong>:</p>
<ul>
<li>This model achieved a relatively high number of True Positives (TP) with 536 cases correctly predicted as having heart disease.</li>
<li>It also had a significant number of True Negatives (TN), indicating correct predictions of no heart disease (54,370).</li>
<li>However, there were 504 False Positives (FP), where it incorrectly predicted heart disease.</li>
<li>It had 4,934 False Negatives (FN), suggesting instances where actual heart disease cases were incorrectly predicted as non-disease.</li>
</ul>
<p><strong>Decision Tree</strong>:</p>
<ul>
<li>The Decision Tree model achieved 304 True Positives (TP), correctly identifying heart disease cases.</li>
<li>It also had 54,593 True Negatives (TN), showing accurate predictions of no heart disease.</li>
<li>There were 281 False Positives (FP), indicating instances where the model incorrectly predicted heart disease.</li>
<li>It had 5,166 False Negatives (FN), meaning it missed identifying heart disease in these cases.</li>
</ul>
<p><strong>Logistic Regression</strong>:</p>
<ul>
<li>The Logistic Regression model achieved 540 True Positives (TP), correctly identifying cases with heart disease.</li>
<li>It had a high number of True Negatives (TN) with 54,353 correctly predicted non-disease cases.</li>
<li>However, there were 521 False Positives (FP), where the model incorrectly predicted heart disease.</li>
<li>It also had 4,930 False Negatives (FN), indicating missed predictions of heart disease.</li>
</ul>
<p><strong>Random Forest</strong>:</p>
<ul>
<li>The Random Forest model achieved a relatively low number of True Positives (TP) with 16 cases correctly predicted as having heart disease.</li>
<li>It had a high number of True Negatives (TN) with 54,867 correctly predicted non-disease cases.</li>
<li>There were only 7 False Positives (FP), suggesting rare incorrect predictions of heart disease.</li>
<li>However, it also had 5,454 False Negatives (FN), indicating a substantial number of missed predictions of heart disease.</li>
</ul>
<p>All models achieved a good number of True Negatives, suggesting their ability to correctly predict non-disease cases. However, there were variations in True Positives, False Positives, and False Negatives. The XGBoost model achieved the highest True Positives but also had a significant number of False Positives. The Decision Tree and Logistic Regression models showed similar TP and FP counts, while the Random Forest model had the lowest TP count. The trade-off between these metrics is essential for assessing the model's performance in detecting heart disease accurately.</p>
<h3 id="summary">Summary</h3>
<p>In the quest to find the best solution for predicting heart disease, it's crucial to evaluate various models. However, it's not just about picking a model and hoping for the best. We need to be mindful of class imbalances: situations where one group has more examples than the other. This imbalance can throw our predictions off balance.</p>
<p>To fine-tune our models, we also need to adjust the hyperparameters. Think of it as finding the right settings to make our models perform better. By addressing class imbalances and tweaking those hyperparameters, we help our models perform more accurately.</p>
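<p>As an illustration only (this tuning is not part of the notebook shown here), class weighting and a small grid search are two common ways to act on both points; the parameter values below are assumptions, and the positive/negative ratio of roughly 1:10 is an approximation for this dataset:</p>
<pre><code class="lang-python"># Illustrative only: address class imbalance and tune hyperparameters for the XGBoost model.
# scale_pos_weight ~ negative/positive ratio (~91%/9% in this dataset) compensates for the imbalance.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

xgb = XGBClassifier(scale_pos_weight=10, random_state=42, n_jobs=-1)
param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1]
}
search = GridSearchCV(xgb, param_grid, scoring='f1', cv=3)
search.fit(X_train_std, y_train)
print(search.best_params_, search.best_score_)
</code></pre>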
<p>By using the correct data features and evaluating the performance of our models, we can build solutions that could assist healthcare professionals in evaluating an individual's risk profile for heart disease.</p>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary
Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-1594694289349610682023-11-29T14:20:00.031-05:002023-12-02T15:59:34.662-05:00Data Engineering Process Fundamentals - An introduction to Data Analysis and Visualization<p>In this technical presentation, we will delve into the fundamental concepts of Data Engineering in the areas of data analysis and visualization. We focus on these areas by using both a code-centric and low-code approach. </p>
<img style="display:none;" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization.png"/>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<p><strong><a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></strong></p>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<p><strong><a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></strong></p>
<h3 id="presentation">Presentation</h3>
<p>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQAtz3RxLOg4Y6Ya4_Wj5E3MnsRDZL_nXC9rsbWFlNRO7-REKLd0UImTG8Y-7a_KZuruKOR8pv_Rnmy/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="299" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
</p>
<h3 id="youtube-video">YouTube Video</h3>
<p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CH4ilG0ztQI?si=cuDAzD2eZDlWLfCs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>
<h2 id="section-1-data-analysis-essentials">Section 1: Data Analysis Essentials</h2>
<p>Data Analysis: Explore the fundamentals of data analysis using Python, unraveling the capabilities of libraries such as Pandas and NumPy. Learn how Jupyter Notebooks provide an interactive environment for data exploration, analysis, and visualization.</p>
<p>Data Profiling: With Python at our fingertips, discover how Jupyter Notebooks aid in data profiling: understanding data structures, quality, and characteristics. Witness the seamless integration with tools like pandas-profiling for comprehensive data insights.</p>
<p>Cleaning and Preprocessing: Dive into the world of data cleaning and preprocessing with Python's Pandas library, facilitated by the user-friendly environment of Jupyter Notebooks. See how Visual Studio Code enhances the coding experience for efficient data preparation.</p>
<h2 id="section-2-statistical-analysis-vs-business-intelligence">Section 2: Statistical Analysis vs. Business Intelligence</h2>
<p>Statistical Analysis: Embrace Python's statistical libraries, such as SciPy and StatsModels, within the Jupyter environment. Witness the power of statistical analysis for extracting patterns and correlations from data, all seamlessly integrated into your workflow with Visual Studio Code.</p>
<p>Business Intelligence: Contrast statistical analysis with the broader field of business intelligence, emphasizing the role of Python in data transformation. Utilize Jupyter Notebooks to showcase how Python's versatility extends to business intelligence applications.</p>
<h2 id="section-3-the-power-of-data-visualization">Section 3: The Power of Data Visualization</h2>
<p>Importance of Data Visualization: Unlock the potential of Python's visualization libraries, such as Matplotlib and Seaborn, within the interactive canvas of Jupyter Notebooks. Visual Studio Code complements this process, providing a robust coding environment for creating captivating visualizations.</p>
<p>Introduction to Tools: While exploring the importance of data visualization, we also introduce powerful visualization tools like Power BI, Looker, and Tableau, and show how integrating them elevates your data storytelling capabilities.</p>
<p><strong>Conclusion:</strong></p>
<p>This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data analysis and visualization. By the end of this presentation, participants will grasp how to effectively apply these practices and be ready to start their own journey into data analysis and visualization.</p>
<p>This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.</p>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Data Analysis</li>
<li>Data Visualization</li>
<li>Python</li>
<li>Jupyter Notebook</li>
<li>Looker</li>
</ul>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary
Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-62297736000891663452023-10-25T16:00:00.035-04:002023-12-02T15:59:10.639-05:00Data Engineering Process Fundamentals - Unveiling the Power of Data Lakes and Data Warehouses<div><br /></div><div><div>In this technical presentation, we will delve into the fundamental concepts of Data Engineering, focusing on two pivotal components of modern data architecture - Data Lakes and Data Warehouses. We will explore their roles, differences, and how they collectively empower organizations to harness the true potential of their data.</div>
<div class="separator" style="clear: both; display: none;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAtABAISerDHv8-vFolaFt-EvSEtRZZxnQCWC98S99Ks2wYnGfiQ5rgtc6wQkyUqtbofR3JyhWNoOcFtOmeYfqoI6q34mueRJdejBTrP1C6BjoM9kc1joRn-vMbE5UQ6IXbA_9BvLEH31tp2inChmMCFwxrm9wBCUuA7rW477lBHgq7i_PiMMJPfgFHHRw/s1171/ozkary-gdg-intro-data-lake-warehouse.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="538" data-original-width="1171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAtABAISerDHv8-vFolaFt-EvSEtRZZxnQCWC98S99Ks2wYnGfiQ5rgtc6wQkyUqtbofR3JyhWNoOcFtOmeYfqoI6q34mueRJdejBTrP1C6BjoM9kc1joRn-vMbE5UQ6IXbA_9BvLEH31tp2inChmMCFwxrm9wBCUuA7rW477lBHgq7i_PiMMJPfgFHHRw/s400/ozkary-gdg-intro-data-lake-warehouse.png" width="400" /></a></div>
<div><br /></div><div>- Follow this GitHub repo during the presentation: (Give it a star)</div><div><br /></div><div><b>
<a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></b></div><div><br /></div><div>- Read more information on my blog at:</div><div><br /></div><div><b><a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></b></div><div><br /></div><div><br /></div>
<h3>Presentation</h3>
<iframe allowfullscreen="true" frameborder="0" height="315" mozallowfullscreen="true" src="https://docs.google.com/presentation/d/e/2PACX-1vTnP_fjlAzbcMkTBto-wviVtWSi-9xPXQ_b9KXkKsN_Ut82Xi17TizYB6UPfU7mubjDOPr0vDex9Fe6/embed?start=false&loop=false&delayms=5000" webkitallowfullscreen="true" width="560"></iframe>
<br />
<h3>YouTube Video</h3>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/yYOSMmwGEtE?si=bKtz0uIilMXka1Mt" title="Data Engineering Process Fundamentals - Unveiling the Power of Data Lakes and Data Warehouses" width="560"></iframe>
<div><br /></div><div>1. Introduction to Data Engineering:</div><div>- Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.</div><div><br /></div><div>2. Understanding Data Lakes:</div><div>- Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.</div><div><br /></div><div>3. Exploring Data Warehouses:</div><div>- Definition of data warehouses and their role in storing structured, processed, and business-ready data.</div><div><br /></div><div>4. Comparing Data Lakes and Data Warehouses:</div><div>- Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses.</div><div>- Discussing when to use each based on specific use cases and business needs.</div><div><br /></div><div>5. Integration and Data Pipelines:</div><div>- Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline.</div><div>- Code walkthrough showcasing data movement and transformation between these two crucial components.</div><div><br /></div><div>6. Real-world Use Cases:</div><div>- Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success.</div><div>- Hands-on demonstration using Python, Jupyter Notebook and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.</div><div><br /></div><div><br /></div><div>Conclusion:</div><div><br /></div><div>This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.</div><div><br /></div><div>This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.</div><div><br /></div><div>Some of the technologies that we will be covering:</div><div><br /></div><div>- Data Lakes</div><div>- Data Warehouse</div><div>- Data Analysis and Visualization</div><div>- Python</div><div>- Jupyter Notebook</div><div>- SQL</div></div><div><br /></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-1464907359619805192023-10-15T10:54:00.025-04:002023-10-26T12:07:42.284-04:00AI with Python and Tensorflow - Convolutional Neural Networks Analysis<h2 id="convolutional-neural-network-cnn-">Convolutional neural network (CNN)</h2>
<p>Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision and image processing. These specialized deep learning models are inspired by the human visual system and excel at tasks like image classification, object detection, and facial recognition. </p>
<p>CNNs employ convolution operations, primarily used for processing images. The network initiates the analysis by applying filters that aim to extract valuable image features using various convolutional kernels. Similar to other weights in the neural network, these filters can be enhanced by adjusting their kernels based on the output error. After this, the resultant images undergo pooling, followed by pixel-wise input to a standard neural network in a process known as flattening.</p>
<p style="display:none"><img src="https://www.ozkary.dev/assets/2023/ozkary-ai-engineering-neural-network-analysis.png" alt="AI convolutional neural network - ozkary" title="AI Traffic Signs Classifier neural networks"></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OJjZGb7fmOo?si=vfbgKkpQtUTFN13x" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="model-1">Model 1</h3>
<pre><code>Input (IMG_WIDTH, IMG_HEIGHT, <span class="hljs-number">3</span>)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">32</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Flatten
<span class="hljs-string">|</span>
Dense (<span class="hljs-number">128</span> nodes, ReLU)
<span class="hljs-string">|</span>
Dropout (<span class="hljs-number">50</span>%)
<span class="hljs-string">|</span>
Dense (NUM_CATEGORIES, softmax)
<span class="hljs-string">|</span>
Output (NUM_CATEGORIES)
</code></pre><ol>
<li><p>Input Layer (Conv2D):</p>
<ul>
<li>Type: Convolutional Layer (2D)</li>
<li>Number of Filters: 32</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: Rectified Linear Unit (ReLU)</li>
<li>Input Shape: (IMG_WIDTH, IMG_HEIGHT, 3) where 3 represents the color channels (RGB).</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Type: Max Pooling Layer (2D)</li>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels.</li>
</ul>
</li>
<li><p>Flatten Layer (Flatten):</p>
<ul>
<li>Type: Flattening Layer</li>
<li>Purpose: Converts the multidimensional input into a 1D array to feed into the Dense layer.</li>
</ul>
</li>
<li><p>Dense Hidden Layer (Dense):</p>
<ul>
<li>Number of Neurons: 128</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Learns and represents complex patterns in the data.</li>
</ul>
</li>
<li><p>Dropout Layer (Dropout):</p>
<ul>
<li>Rate: 0.5</li>
<li>Purpose: Helps prevent overfitting by randomly setting 50% of the inputs to zero during training.</li>
</ul>
</li>
<li><p>Output Layer (Dense):</p>
<ul>
<li>Number of Neurons: NUM_CATEGORIES (Number of categories for traffic signs)</li>
<li>Activation Function: Softmax</li>
<li>Purpose: Produces probabilities for each category, summing to 1, indicating the likelihood of the input image belonging to each category.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-attr">layers</span> = tf.keras.layers
<span class="hljs-comment"># Create a convolutional neural network</span>
<span class="hljs-attr">model</span> = tf.keras.models.Sequential([
<span class="hljs-comment"># Convolutional layer. Learn 32 filters using a 3x3 kernel</span>
layers.Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu',</span> <span class="hljs-attr">input_shape=(30,</span> <span class="hljs-number">30</span>, <span class="hljs-number">3</span>)),
<span class="hljs-comment"># Max-pooling layer, reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Converts the multidimensional input into a 1D array to feed into the Dense layer</span>
layers.Flatten(),
<span class="hljs-comment"># Dense Hidden Layer with 128 nodes and relu activation function to learns and represent complex patterns in the data</span>
layers.Dense(<span class="hljs-number">128</span>, <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Dropout layer to prevent overfitting by randomly setting 50% of the inputs to 0 at each update during training</span>
layers.Dropout(<span class="hljs-number">0.5</span>),
<span class="hljs-comment"># Output layer with NUM_CATEGORIES outputs and softmax activation function to return probability-like results</span>
layers.Dense(NUM_CATEGORIES, <span class="hljs-attr">activation='softmax')</span>
])
</code></pre>
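<p>The training script itself is not shown in this post. As a minimal sketch of how a model like this could be compiled and trained to produce the results below, assuming <code>x_train</code>, <code>y_train</code>, <code>x_test</code>, and <code>y_test</code> come from the project's data loading step:</p>
<pre><code class="lang-python"># Minimal sketch (assumed, not the project's exact training script)
EPOCHS = 10

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train for 10 epochs and then evaluate on the held-out test set
model.fit(x_train, y_train, epochs=EPOCHS)
model.evaluate(x_test, y_test, verbose=2)
</code></pre>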
<h3 id="model-2">Model 2</h3>
<pre><code>Input (IMG_WIDTH, IMG_HEIGHT, <span class="hljs-number">3</span>)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">32</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">64</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Flatten
<span class="hljs-string">|</span>
Dense (<span class="hljs-number">128</span> nodes, ReLU)
<span class="hljs-string">|</span>
Dropout (<span class="hljs-number">50</span>%)
<span class="hljs-string">|</span>
Dense (NUM_CATEGORIES, softmax)
<span class="hljs-string">|</span>
Output (NUM_CATEGORIES)
</code></pre><ol>
<li><p>Input Layer (Conv2D):</p>
<ul>
<li>Type: Convolutional Layer (2D)</li>
<li>Number of Filters: 32</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: Rectified Linear Unit (ReLU)</li>
<li>Input Shape: (IMG_WIDTH, IMG_HEIGHT, 3) where 3 represents the color channels (RGB).</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Type: Max Pooling Layer (2D)</li>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels.</li>
</ul>
</li>
<li><p>Convolutional Layer (Conv2D):</p>
<ul>
<li>Number of Filters: 64</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Extracts higher-level features from the input.</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Further reduces spatial dimensions.</li>
</ul>
</li>
<li><p>Flatten Layer (Flatten):</p>
<ul>
<li>Type: Flattening Layer</li>
<li>Purpose: Converts the multidimensional input into a 1D array to feed into the Dense layer.</li>
</ul>
</li>
<li><p>Dense Hidden Layer (Dense):</p>
<ul>
<li>Number of Neurons: 128</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Learns and represents complex patterns in the data.</li>
</ul>
</li>
<li><p>Dropout Layer (Dropout):</p>
<ul>
<li>Rate: 0.5</li>
<li>Purpose: Helps prevent overfitting by randomly setting 50% of the inputs to zero during training.</li>
</ul>
</li>
<li><p>Output Layer (Dense):</p>
<ul>
<li>Number of Neurons: NUM_CATEGORIES (Number of categories for traffic signs)</li>
<li>Activation Function: Softmax</li>
<li>Purpose: Produces probabilities for each category, summing to 1, indicating the likelihood of the input image belonging to each category.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-attr">layers</span> = tf.keras.layers
<span class="hljs-comment"># Create a convolutional neural network</span>
<span class="hljs-attr">model</span> = tf.keras.models.Sequential([
<span class="hljs-comment"># Convolutional layer. Learn 32 filters using a 3x3 kernel</span>
layers.Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu',</span> <span class="hljs-attr">input_shape=(30,</span> <span class="hljs-number">30</span>, <span class="hljs-number">3</span>)),
<span class="hljs-comment"># Max-pooling layer, reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Convolutional layer. Learn 64 filters using a 3x3 kernel to extracts higher-level features from the input</span>
layers.Conv2D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Max-pooling layer, using 2x2 pool size reduces spatial dimensions</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Converts the multidimensional input into a 1D array to feed into the Dense layer</span>
layers.Flatten(),
<span class="hljs-comment"># Dense Hidden Layer with 128 nodes and relu activation function to learns and represent complex patterns in the data</span>
layers.Dense(<span class="hljs-number">128</span>, <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Dropout layer to prevent overfitting by randomly setting 50% of the inputs to 0 at each update during training</span>
layers.Dropout(<span class="hljs-number">0.5</span>),
<span class="hljs-comment"># Output layer with NUM_CATEGORIES outputs and softmax activation function to return probability-like results</span>
layers.Dense(NUM_CATEGORIES, <span class="hljs-attr">activation='softmax')</span>
])
</code></pre>
<p>The architecture follows a typical CNN pattern: alternating Convolutional and MaxPooling layers to extract features and reduce spatial dimensions, followed by Flattening and Dense layers for classification.</p>
<p>Feel free to adjust the number of filters, filter sizes, layer types, or other hyperparameters based on your specific problem and dataset. Experimentation is key to finding the best architecture for your task.</p>
<h3 id="model-1-results">Model 1 Results</h3>
<pre><code class="lang-bash">Images and Labels loaded <span class="hljs-number">26640</span>, <span class="hljs-number">26640</span>
Epoch <span class="hljs-number">1</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">4.9111</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0545</span>
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5918</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0555</span>
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5411</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0565</span>
Epoch <span class="hljs-number">4</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5190</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">5</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5088</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0565</span>
Epoch <span class="hljs-number">6</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5041</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">7</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5019</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">8</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5008</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">9</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5002</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">10</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.4999</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
<span class="hljs-number">333</span><span class="hljs-regexp">/333 - 1s - loss: 3.4964 - accuracy: 0.0541 - 1s/</span>epoch - <span class="hljs-number">4</span>ms/step
</code></pre>
<h3 id="model-2-results">Model 2 Results</h3>
<pre><code class="lang-bash">
Images and Labels loaded <span class="hljs-number">26640</span>, <span class="hljs-number">26640</span>
Epoch <span class="hljs-number">1</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 9s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">4.0071</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.1315</span>
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">2.0718</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.3963</span>
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">1.4216</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.5529</span>
Epoch <span class="hljs-number">4</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">1.0891</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.6546</span>
Epoch <span class="hljs-number">5</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.8440</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.7320</span>
Epoch <span class="hljs-number">6</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.6838</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.7862</span>
Epoch <span class="hljs-number">7</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.5754</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8184</span>
Epoch <span class="hljs-number">8</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.5033</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8420</span>
Epoch <span class="hljs-number">9</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.4171</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8729</span>
Epoch <span class="hljs-number">10</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.3787</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8851</span>
<span class="hljs-number">333</span><span class="hljs-regexp">/333 - 2s - loss: 0.1354 - accuracy: 0.9655 - 2s/</span>epoch - <span class="hljs-number">6</span>ms/step
Model saved to cnn_model2.keras.
</code></pre>
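<p>The last log line notes that the model was saved to <code>cnn_model2.keras</code>. A hedged sketch of how that save and a later reload could look in recent TensorFlow/Keras versions (the exact call used in the project is not shown here):</p>
<pre><code class="lang-python"># Hypothetical save/load step matching the "Model saved to cnn_model2.keras" log line above
model.save("cnn_model2.keras")
reloaded_model = tf.keras.models.load_model("cnn_model2.keras")
</code></pre>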
<h2 id="summary">Summary</h2>
<p>This is a summary of the CNN model experiments:</p>
<p>Model 1 had a loss of <code>3.4964</code> and an accuracy of <code>0.0541</code>. This model had a simple architecture with few layers and filters, which may have limited its ability to learn complex features in the input images.</p>
<p>Model 2 had a loss of <code>0.1354</code> and an accuracy of <code>0.9655</code>. This model had a more complex architecture with additional hidden layers, including a convolutional layer with 64 filters and an additional max-pooling (2x2) layer. The addition of these layers likely helped the model learn more complex features in the input images, leading to a significant improvement in accuracy.</p>
<p>In particular, the addition of more convolutional layers with more filters can help the model learn more complex features in the input images, as each filter learns to detect a different pattern or feature in the input. However, it is important to balance the number of filters with the size of the input images and the complexity of the problem, as using too many filters can lead to overfitting and poor generalization to new data.</p>
<p>Overall, the results suggest that increasing the complexity of the model by adding more hidden layers can help improve its accuracy, but it is important to balance the complexity of the model with the size of the input images and the complexity of the problem to avoid overfitting.</p>
<h3 id="learning-rate">Learning rate</h3>
<ul>
<li>A learning rate of 0.001 (the default) provided optimal results - <code>loss: 0.1354 - accuracy: 0.9655</code></li>
<li>A learning rate of 0.01 lowered the accuracy and increased the loss - <code>loss: 3.4858 - accuracy: 0.0594</code></li>
</ul>
<p>A learning rate of 0.01 proved too high for this specific problem and dataset: it caused the optimizer to overshoot the optimal solution and fail to converge.</p>
<pre><code class="lang-python">
model.compile(optimizer=tf<span class="hljs-selector-class">.keras</span><span class="hljs-selector-class">.optimizers</span><span class="hljs-selector-class">.Adam</span>(learning_rate=<span class="hljs-number">0.01</span>),
loss=<span class="hljs-string">'categorical_crossentropy'</span>,
metrics=[<span class="hljs-string">'accuracy'</span>])
</code></pre>
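<p>For comparison, the better-performing runs kept the Adam optimizer at its default learning rate of 0.001, which can also be set explicitly:</p>
<pre><code class="lang-python">model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
</code></pre>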
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-75203928763754223572023-09-09T15:30:00.006-04:002024-02-04T10:32:10.182-05:00Data Engineering Process Fundamentals - Data Streaming Exersise<h1 id="data-streaming-exercise">Data Streaming - Exercise</h1>
<p>Now that we've covered the concepts of data streaming, let's move forward with an actual coding exercise. During this exercise, we'll delve into building a Kafka message broker and a Spark consumer with the objective of having the Kafka message broker work as a data producer for our subway system information. The Spark consumer acts as a message aggregator and writes the results to our data lake. This allows the data modeling process to pick up the information and insert it into the data warehouse, providing seamless integration and reusing our already operational data pipeline.</p>
<h2 id="batch-process-vs-data-stream">Batch Process vs Data Stream</h2>
<p>In a batch process data pipeline, we define a schedule to process the data from its source. With a data stream pipeline, there is no schedule as the data flows as soon as it is available from its source.</p>
<p>In the batch data download, the data is aggregated for periods of four hours. Since the data stream comes in more frequently, there is no four-hour aggregation. The data comes in as single transactions.</p>
<h3 id="data-stream-strategy">Data Stream Strategy</h3>
<p>From our system requirements, we already have a data pipeline process that runs an incremental update process to import the data from the data lake into our data warehouse. This process already handles data transformation, mapping, and populates all the dimension tables and fact tables with the correct information.</p>
<p>Therefore, we want to follow the same pipeline process and utilize what already exists. To account for the fact that the data comes in as a single transaction, and we do not want to create single files, we want to implement an aggregation strategy on our data streaming pipeline. This enables us to define time windows for when to publish the data, whether it's 1 minute or 4 hours. It really depends on what fits the business requirements. The important thing here is to understand the technical capabilities and the strategy for the solution.</p>
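<p>As a minimal illustration of this idea (a sketch, not part of the project code), a configurable window duration can be used to assign each incoming transaction to an aggregation bucket, whether that window is 5 minutes or the 4 hours used by the batch feed:</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

def window_start(ts: datetime, minutes: int) -> datetime:
    """Floor a timestamp to the start of its aggregation window (within the day)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    window_seconds = minutes * 60
    return midnight + timedelta(seconds=elapsed - (elapsed % window_seconds))

# A 5-minute stream window vs. a 4-hour window that mimics the batch aggregation
ts = datetime(2023, 9, 23, 16, 54)
print(window_start(ts, 5))     # 2023-09-23 16:50:00
print(window_start(ts, 240))   # 2023-09-23 16:00:00
</code></pre>
<p>Every transaction that falls into the same window can then be summed into a single record before it is written to the data lake.</p>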
<h2 id="data-streaming-data-flow-process">Data Streaming Data Flow Process</h2>
<p>To deliver a data streaming solution, we typically employ a technical design illustrated as follows:</p>
<p><img src="////www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-messages.png" alt="Data Engineering Process Fundamentals - Data Streaming Kafka Topics" title="Data Engineering Process Fundamentals - Data Streaming Kafka Topics"></p>
<ul>
<li><p><strong>Kafka</strong></p>
<ul>
<li>Producer</li>
<li>Topics</li>
<li>Consumer</li>
</ul>
</li>
<li><p><strong>Spark</strong></p>
<ul>
<li>Kafka Consumer</li>
<li>Message Parsing and Aggregations</li>
<li>Write to Data Lake or Other Storage</li>
</ul>
</li>
</ul>
<h3 id="kafka-">Kafka:</h3>
<ul>
<li><strong>Producer:</strong> The producer is responsible for publishing data to Kafka topics. It produces and sends messages to specific topics, allowing data to be ingested into the Kafka cluster. </li>
<li><strong>Topics:</strong> Topics are logical channels or categories to which messages are published by producers and from which messages are consumed by consumers. They act as data channels, providing a way to organize and categorize messages.</li>
<li><strong>Consumer:</strong> Consumers subscribe to Kafka topics and process the messages produced by the producers. They play a vital role in real-time data consumption and are responsible for extracting valuable insights from the streaming data (see the minimal sketch after this list).</li>
</ul>
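<p>To make the consumer role concrete, a minimal Kafka consumer written with the confluent-kafka Python library might look like the sketch below. The broker address, group id, and polling loop are placeholders for illustration; the project's actual consumer is the Spark application covered later.</p>
<pre><code class="lang-python">from confluent_kafka import Consumer

# Placeholder connection settings for illustration
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'demo-group',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['mta-turnstile'])

try:
    while True:
        # wait up to one second for the next message
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f'Consumer error: {msg.error()}')
            continue
        print(f'Received {msg.key()}: {msg.value().decode("utf-8")}')
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
</code></pre>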
<h3 id="spark-">Spark:</h3>
<ul>
<li><p><strong>Kafka Consumer:</strong> This component serves as a bridge between Kafka and Spark, allowing Spark to consume data directly from Kafka topics. It establishes a connection to Kafka, subscribes to specified topics, and pulls streaming data into Spark for further processing.</p>
</li>
<li><p><strong>Message Parsing and Aggregations:</strong> Once data is consumed from Kafka, Spark performs message parsing to extract relevant information. Aggregations involve summarizing or transforming the data, making it suitable for analytics or downstream processing. This step is crucial for deriving meaningful insights from the streaming data.</p>
</li>
<li><p><strong>Write to Data Lake or Other Storage:</strong> After processing and aggregating the data, Spark writes the results to a data lake or other storage systems. A data lake is a centralized repository that allows for the storage of raw and processed data in its native format. This step ensures that the valuable insights derived from the streaming data are persisted for further integration to a data warehouse.</p>
</li>
</ul>
<h2 id="implementation-requirements">Implementation Requirements</h2>
<blockquote>
<p>π Clone this repo or copy the files from this folder <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step6-Data-Streaming/">Streaming</a></p>
</blockquote>
<p>For our example, we will adopt a code-centric approach and utilize Python to implement both our producer and consumer components. Additionally, we'll require instances of Apache Kafka and Apache Spark to be running. To ensure scalability, we'll deploy these components on separate virtual machines (VMs). Our Terraform scripts will be instrumental in creating new VM instances for this purpose. It's important to note that all these components will be encapsulated within Docker containers.</p>
<blockquote>
<p>π For ease of development in this lab, we can run everything on a single VM or a local workstation. This allows us to bypass the complexities associated with network configuration. For real deployments, we should use separate VMs.</p>
</blockquote>
<h3 id="requirements">Requirements</h3>
<ul>
<li>Docker and Docker Hub<ul>
<li><a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Docker">Install Docker</a></li>
<li><a href="https://hub.docker.com/">Create a Docker Hub Account</a></li>
</ul>
</li>
<li>Prefect dependencies and cloud account<ul>
<li>Install the Prefect Python library dependencies</li>
<li><a href="https://www.prefect.io/">Create a Prefect Cloud account</a></li>
</ul>
</li>
<li>Data Lake for storage<ul>
<li>Make sure to have the storage account and access ready</li>
</ul>
</li>
</ul>
<blockquote>
<p>π Before proceeding with the setup, ensure that the storage and Prefect credentials have been configured as shown on the <a href="https://www.ozkary.com/2023/05/data-engineering-process-fundamentals-pipeline-orchestration-exercise.html">Orchestration exercise</a> step.</p>
</blockquote>
<h3 id="docker">Docker</h3>
<p>For running this locally or on virtual machines (VMs), the optimal approach is to leverage Docker Containers. In this exercise, we'll utilize a lightweight configuration of Kafka and Spark using Bitnami images. This configuration assumes a minimal setup with a Spark Master, a Spark Worker, and a Kafka broker.</p>
<p>Docker provides a platform for packaging, distributing, and running applications within containers. This ensures consistency across different environments and simplifies the deployment process. To get started, you can download and install Docker from the official website (<a href="https://www.docker.com/get-started">https://www.docker.com/get-started</a>). Once Docker is installed, the Docker command-line interface (docker) becomes available, enabling us to efficiently manage and interact with containers.</p>
<h4 id="docker-compose-file">Docker Compose file</h4>
<p>Utilize the <strong>docker-compose-bitnami.yml</strong> file to configure a unified environment where these services run together. In the event that we need to run the services on distinct virtual machines (VMs), we would deploy each Docker image on a separate VM.</p>
<pre><code class="lang-yaml"><span class="hljs-attribute">version</span>: <span class="hljs-string">"3.6"</span>
<span class="hljs-attribute">services</span>:
<span class="hljs-attribute">spark-master</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">spark</span>:latest
<span class="hljs-attribute">container_name</span>: spark-master
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">SPARK_MODE</span>: <span class="hljs-string">"master"</span>
<span class="hljs-attribute">ports</span>:
- <span class="hljs-number">8080</span>:<span class="hljs-number">8080</span>
<span class="hljs-attribute">spark-worker</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">spark</span>:latest
<span class="hljs-attribute">container_name</span>: spark-worker
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">SPARK_MODE</span>: <span class="hljs-string">"worker"</span>
<span class="hljs-attribute">SPARK_MASTER_URL</span>: <span class="hljs-string">"spark://spark-master:7077"</span>
<span class="hljs-attribute">ports</span>:
- <span class="hljs-number">8081</span>:<span class="hljs-number">8081</span>
<span class="hljs-attribute">depends_on</span>:
- spark-master
<span class="hljs-attribute">kafka</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">kafka</span>:latest
<span class="hljs-attribute">container_name</span>: kafka
<span class="hljs-attribute">ports</span>:
- <span class="hljs-string">"9092:9092"</span>
- <span class="hljs-string">"29092:29092"</span> # Used for internal communication
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hljs-attribute">PLAINTEXT</span>:PLAINTEXT,<span class="hljs-attribute">PLAINTEXT_HOST</span>:PLAINTEXT
<span class="hljs-attribute">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hljs-attribute">PLAINTEXT</span>:<span class="hljs-comment">//kafka:9092,PLAINTEXT_HOST://localhost:9092</span>
<span class="hljs-attribute">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hljs-attribute">LISTENER_BOB</span>:PLAINTEXT,<span class="hljs-attribute">LISTENER_FRED</span>:PLAINTEXT
<span class="hljs-attribute">KAFKA_LISTENERS</span>: <span class="hljs-attribute">LISTENER_BOB</span>:<span class="hljs-comment">//kafka:29092,LISTENER_FRED://kafka:9092</span>
<span class="hljs-attribute">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hljs-attribute">LISTENER_BOB</span>:<span class="hljs-comment">//kafka:29092,LISTENER_FRED://localhost:9092</span>
<span class="hljs-attribute">KAFKA_INTER_BROKER_LISTENER_NAME</span>: LISTENER_BOB
<span class="hljs-attribute">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS</span>: <span class="hljs-number">0</span>
<span class="hljs-attribute">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">depends_on</span>:
- spark-master
</code></pre>
<h4 id="download-the-docker-images">Download the Docker Images</h4>
<p>Before we proceed to run the Docker images, it's essential to download them in the target environment. To download the Bitnami images, you can execute the following script from a Bash command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>download_images.sh
</code></pre>
<ul>
<li><strong>download_images.sh script</strong></li>
</ul>
<pre><code class="lang-bash">
echo <span class="hljs-string">"Downloading Spark and Kafka Docker images..."</span>
<span class="hljs-comment"># Spark images from Bitnami</span>
docker pull <span class="hljs-keyword">bitnami/spark:latest
</span>
<span class="hljs-comment"># Kafka image from Bitnami</span>
docker pull <span class="hljs-keyword">bitnami/kafka:latest
</span>
echo <span class="hljs-string">"Images downloaded successfully!"</span>
<span class="hljs-comment"># Display image sizes</span>
echo <span class="hljs-string">"Image sizes:"</span>
docker images <span class="hljs-keyword">bitnami/spark:latest </span><span class="hljs-keyword">bitnami/kafka:latest </span>--format <span class="hljs-string">"{{.Repository}}:{{.Tag}} - {{.Size}}"</span>
</code></pre>
<p>The <strong>download_images.sh</strong> script essentially retrieves two images from DockerHub. This script provides an automated way to download these images when creating new environments.</p>
<h4 id="start-the-services">Start the Services</h4>
<p>Once the Docker images are downloaded, initiate the services by executing the following script:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">bash </span>start_services.sh
</code></pre>
<ul>
<li><strong>start_services.sh script</strong></li>
</ul>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash
</span>
<span class="hljs-comment"># Navigate to the Docker folder</span>
<span class="hljs-built_in">cd</span> docker
<span class="hljs-comment"># Start Spark Master and Spark Worker</span>
docker-compose -f docker-compose-bitnami.yml up -d spark-master spark-worker
<span class="hljs-comment"># Wait for Spark Master and Worker to be ready (adjust sleep time as needed)</span>
sleep 15
<span class="hljs-comment"># Start Kafka and create Kafka topic</span>
docker-compose -f docker-compose-bitnami.yml up -d kafka
<span class="hljs-comment"># Wait for Kafka to be ready (adjust sleep time as needed)</span>
sleep 15
<span class="hljs-comment"># Check if the Kafka topic exists before creating it</span>
topic_exists=$(docker-compose -f docker-compose-bitnami.yml exec kafka /opt/bitnami/kafka/bin/kafka-topics.sh --list --topic mta-turnstile --bootstrap-server localhost:9092 | grep "mta-turnstile")
<span class="hljs-keyword">if</span> [ -z <span class="hljs-string">"<span class="hljs-variable">$topic_exists</span>"</span> ]; <span class="hljs-keyword">then</span>
<span class="hljs-comment"># Create Kafka topic</span>
docker-compose -f docker-compose-bitnami.yml exec kafka /opt/bitnami/kafka/bin/kafka-topics.sh --create --topic mta-turnstile --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Kafka topic created!"</span>
<span class="hljs-keyword">else</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Kafka topic already exists, no need to recreate."</span>
<span class="hljs-keyword">fi</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Services started successfully!"</span>
</code></pre>
<p>The <strong>start_services.sh</strong> script performs the following tasks:</p>
<ul>
<li>Initiates Spark Master and Spark Worker services in detached mode (-d).</li>
<li>Launches Kafka service in detached mode.</li>
<li>Checks whether the Kafka topic already exists and, only when needed, utilizes <code>docker-compose exec</code> to execute the topic creation command inside the Kafka container.</li>
</ul>
<p>At this juncture, both services should be operational and ready to respond to client requests. Now, let's delve into implementing our applications.</p>
<h3 id="data-specifications">Data Specifications</h3>
<p>In this data streaming scenario, we are working with messages using a CSV data format with the following fields:</p>
<pre><code class="lang-python"># Define the schema for the incoming data
turnstiles_schema = StructType([
StructField(<span class="hljs-string">"AC"</span>, StringType()),
StructField(<span class="hljs-string">"UNIT"</span>, StringType()),
StructField(<span class="hljs-string">"SCP"</span>, StringType()),
StructField(<span class="hljs-string">"STATION"</span>, StringType()),
StructField(<span class="hljs-string">"LINENAME"</span>, StringType()),
StructField(<span class="hljs-string">"DIVISION"</span>, StringType()),
StructField(<span class="hljs-string">"DATE"</span>, StringType()),
StructField(<span class="hljs-string">"TIME"</span>, StringType()),
StructField(<span class="hljs-string">"DESC"</span>, StringType()),
StructField(<span class="hljs-string">"ENTRIES"</span>, IntegerType()),
StructField(<span class="hljs-string">"EXITS"</span>, IntegerType()),
StructField(<span class="hljs-string">"ID"</span>, StringType()),
StructField(<span class="hljs-string">"TIMESTAMP"</span>, StringType())
])
</code></pre>
<p>The data format closely resembles what the source system provides for batch integration. However, in this scenario, we also have a unique ID and a TIMESTAMP.</p>
<p>As we process these messages, our objective is to generate files with data aggregation based on these fields:</p>
<pre><code class="lang-python"><span class="hljs-string">"AC"</span>, <span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"DESC"</span>
</code></pre>
<p>And these measures:</p>
<pre><code class="lang-python"><span class="hljs-string">"ENTRIES"</span>, <span class="hljs-string">"EXITS"</span>
</code></pre>
<p>An example of a message would look like this:</p>
<pre><code class="lang-python">"A001,R001,02<span class="hljs-string">-00</span><span class="hljs-string">-00</span>,Test-Station,456NQR,BMT,09<span class="hljs-string">-23</span><span class="hljs-string">-23</span>,REGULAR,16:54:00,140,153"
</code></pre>
<p>It's important to note that the commuter counts are substantial, indicating a certain level of aggregation in these messages. However, they aren't aggregated into four-hour periods the way the batch process is.</p>
<p>Once these message files are aggregated, they are then pushed to the data lake. Subsequently, our data warehouse process can pick them up and proceed with the necessary information processing.</p>
<h2 id="review-the-code">Review the Code</h2>
<p>To enable this functionality, we need to develop a Kafka producer and a Spark Kafka consumer, both implemented in Python. Let's begin by examining the fundamental features of the producer:</p>
<blockquote>
<p>π Clone this repository or copy the files from this folder <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step6-Data-Streaming/">Streaming</a></p>
</blockquote>
<h3 id="kafka-producer">Kafka Producer</h3>
<p>The Kafka producer is a Python application designed to generate messages every 10 seconds. The <code>produce_messages</code> function utilizes messages from the provider and sends the serialized data to a Kafka topic.</p>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KafkaProducer</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, config_path, topic)</span>:</span>
settings = read_config(config_path)
self.producer = Producer(settings)
self.topic = topic
self.provider = Provider(topic)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">delivery_report</span><span class="hljs-params">(self, err, msg)</span>:</span>
<span class="hljs-string">"""
Reports the success or failure of a message delivery.
Args:
err (KafkaError): The error that occurred, or None on success.
msg (Message): The message that was produced or failed.
"""</span>
<span class="hljs-keyword">if</span> err <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
print(f<span class="hljs-string">'Message delivery failed: {err}'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Record {} produced to {} [{}] at offset {}'</span>.format(msg.key(), msg.topic(), msg.partition(), msg.offset()))
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">produce_messages</span><span class="hljs-params">(self)</span>:</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">True</span>:
<span class="hljs-comment"># Get the message key and value from the provider</span>
key, message = self.provider.message()
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># Produce the message to the Kafka topic</span>
self.producer.produce(topic = self.topic, key=key_serializer(key),
value=value_serializer(message),
on_delivery = self.delivery_report)
<span class="hljs-comment"># Flush to ensure delivery</span>
self.producer.flush()
<span class="hljs-comment"># Print the message</span>
print(f<span class="hljs-string">'Sent message: {message}'</span>)
<span class="hljs-comment"># Wait for 10 seconds before sending the next message</span>
time.sleep(<span class="hljs-number">10</span>)
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-keyword">pass</span>
exit(<span class="hljs-number">0</span>)
<span class="hljs-keyword">except</span> KafkaTimeoutError <span class="hljs-keyword">as</span> e:
print(f<span class="hljs-string">"Kafka Timeout {e.__str__()}"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
print(f<span class="hljs-string">"Exception while producing record - {key} {message}: {e}"</span>)
<span class="hljs-keyword">continue</span>
</code></pre>
<p>This class utilizes the Confluent Kafka library for seamless interaction with Kafka. It encapsulates the logic for producing messages to a Kafka topic, relying on external configurations, message providers, and serialization functions. The <code>produce_messages</code> method is crafted to run continuously until interrupted, while the <code>delivery_report</code> method serves as a callback function, reporting the success or failure of message delivery.</p>
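<p>The helper functions referenced by the producer (<code>read_config</code>, <code>key_serializer</code>, <code>value_serializer</code>) and the <code>Provider</code> class live elsewhere in the repository and are not shown above. As an illustration only, a minimal version of these helpers could look like the sketch below; the field values are dummy data and the message layout follows the <code>turnstiles_schema</code> order.</p>
<pre><code class="lang-python">import uuid
from datetime import datetime

def read_config(config_path: str) -> dict:
    """Parse a simple key=value properties file (e.g. bootstrap.servers=localhost:9092)."""
    settings = {}
    with open(config_path) as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith('#'):
                key, _, value = line.partition('=')
                settings[key.strip()] = value.strip()
    return settings

def key_serializer(key: str) -> bytes:
    """Encode the message key as UTF-8 bytes."""
    return key.encode('utf-8')

def value_serializer(value: str) -> bytes:
    """Encode the CSV message as UTF-8 bytes."""
    return value.encode('utf-8')

class Provider:
    """Generates dummy turnstile messages matching the CSV layout described earlier."""
    def __init__(self, topic: str):
        self.topic = topic

    def message(self):
        now = datetime.now()
        record_id = str(uuid.uuid4())
        value = (f"A001,R001,02-00-00,Test-Station,456NQR,BMT,"
                 f"{now.strftime('%m-%d-%y')},{now.strftime('%H:%M:%S')},REGULAR,"
                 f"140,153,{record_id},{now.strftime('%Y-%m-%d %H:%M:%S')}")
        return record_id, value
</code></pre>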
<pre><code class="lang-python"><span class="hljs-meta">@flow (name="MTA Data Stream flow", description="Data Streaming Flow")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Main flow to read and send the messages
"""</span>
topic = params.topic
config_path = params.config
producer = KafkaProducer(config_path, topic)
producer.produce_messages()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
<span class="hljs-string">"""main entry point with argument parser"""</span>
os.system(<span class="hljs-string">'clear'</span>)
print(<span class="hljs-string">'publisher running...'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Producer : --topic mta-turnstile --config path-to-config'</span>)
parser.add_argument(<span class="hljs-string">'--topic'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'stream topic'</span>)
parser.add_argument(<span class="hljs-string">'--config'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'kafka setting'</span>)
args = parser.parse_args()
<span class="hljs-comment"># Register the signal handler to handle Ctrl-C </span>
signal.signal(signal.SIGINT, <span class="hljs-keyword">lambda</span> signal, frame: handle_sigint(signal, frame, producer.producer))
main_flow(args)
print(<span class="hljs-string">'publisher end'</span>)
</code></pre>
<p>The <code>main</code> block acts as the entry point, featuring an argument parser that captures the topic and Kafka configuration path from the command line. The script then invokes the <code>main_flow</code> function with the provided arguments.</p>
<p>The <code>main_flow</code> function is annotated with <code>@flow</code> and functions as the primary entry point for the flow. This flow configuration enables us to monitor the flow using our Prefect Cloud monitoring system. It takes parameters (<code>topic</code> and <code>config_path</code>) and initializes a Kafka producer using the provided configuration path and topic.</p>
<blockquote>
<p>π This producer generates dummy data. It's important to note that the MTA system lacks a real-time feed for the turnstile data.</p>
</blockquote>
<h3 id="spark-kafka-consumer">Spark - Kafka Consumer</h3>
<p>The Spark PySpark application listens to a Kafka topic to retrieve messages. It parses these messages using a predefined schema to define the fields and their types. Since these messages arrive every ten seconds, our goal is to aggregate them within a time-span duration of five minutes. The specific duration can be defined based on solution requirements, and for our purposes, it aligns seamlessly with our current data pipeline flow. The aggregated messages are then serialized into compressed CSV files and loaded into the data lake. Subsequently, the data warehouse incremental process can merge this information into our data warehouse.</p>
<p>Our Spark application comprises the following components:</p>
<ul>
<li>Spark Setting class</li>
<li>Spark Consumer class</li>
<li>Main Application Entry<ul>
<li>Prefect libraries for flow monitoring</li>
<li>Prefect component for accessing the data lake</li>
<li>Access to the data lake</li>
</ul>
</li>
</ul>
<h4 id="spark-setting-class">Spark Setting Class</h4>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SparkSettings</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, config_path: str, topic: str, group_id: str, client_id: str)</span> -> <span class="hljs-keyword">None</span>:</span>
self.settings = read_config(config_path)
use_sasl = <span class="hljs-string">"sasl.mechanism"</span> <span class="hljs-keyword">in</span> self.settings <span class="hljs-keyword">and</span> self.settings[<span class="hljs-string">"sasl.mechanism"</span>] <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>
self.kafka_options = {
<span class="hljs-string">"kafka.bootstrap.servers"</span>: self.settings[<span class="hljs-string">"bootstrap.servers"</span>],
<span class="hljs-string">"subscribe"</span>: topic,
<span class="hljs-string">"startingOffsets"</span>: <span class="hljs-string">"earliest"</span>,
<span class="hljs-string">"failOnDataLoss"</span>: <span class="hljs-string">"false"</span>,
<span class="hljs-string">"client.id"</span>: client_id,
<span class="hljs-string">"group.id"</span>: group_id,
<span class="hljs-string">"auto.offset.reset"</span>: <span class="hljs-string">"earliest"</span>,
<span class="hljs-string">"checkpointLocation"</span>: <span class="hljs-string">"checkpoint"</span>,
<span class="hljs-string">"minPartitions"</span>: <span class="hljs-string">"2"</span>,
<span class="hljs-string">"enable.auto.commit"</span>: <span class="hljs-string">"false"</span>,
<span class="hljs-string">"enable.partition.eof"</span>: <span class="hljs-string">"true"</span>
}
<span class="hljs-keyword">if</span> use_sasl:
<span class="hljs-comment"># set the JAAS configuration only when use_sasl is True</span>
sasl_config = f<span class="hljs-string">'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="kafka" username="{self.settings["sasl.username"]}" password="{self.settings["sasl.password"]}";'</span>
login_options = {
<span class="hljs-string">"kafka.sasl.mechanisms"</span>: self.settings[<span class="hljs-string">"sasl.mechanism"</span>],
<span class="hljs-string">"kafka.security.protocol"</span>: self.settings[<span class="hljs-string">"security.protocol"</span>],
<span class="hljs-string">"kafka.sasl.username"</span>: self.settings[<span class="hljs-string">"sasl.username"</span>],
<span class="hljs-string">"kafka.sasl.password"</span>: self.settings[<span class="hljs-string">"sasl.password"</span>],
<span class="hljs-string">"kafka.sasl.jaas.config"</span>: sasl_config
}
<span class="hljs-comment"># merge the login options with the kafka options</span>
self.kafka_options = {**self.kafka_options, **login_options}
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span><span class="hljs-params">(self, key)</span>:</span>
<span class="hljs-string">"""
Get the value of a key from the settings dictionary.
"""</span>
<span class="hljs-keyword">return</span> self.settings[key]
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_jass_config</span><span class="hljs-params">(self)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Set the JAAS configuration with variables
"""</span>
jaas_config = (
<span class="hljs-string">"KafkaClient {\n"</span>
<span class="hljs-string">" org.apache.kafka.common.security.plain.PlainLoginModule required\n"</span>
f<span class="hljs-string">" username=\"{self['sasl.username']}\"\n"</span>
f<span class="hljs-string">" password=\"{self['sasl.password']}\";\n"</span>
<span class="hljs-string">"};"</span>
)
print(<span class="hljs-string">'========ENV===========>'</span>,jaas_config)
<span class="hljs-comment"># Set the JAAS configuration in the environment</span>
os.environ[<span class="hljs-string">'KAFKA_OPTS'</span>] = f<span class="hljs-string">"java.security.auth.login.config={jaas_config}"</span>
os.environ[<span class="hljs-string">'java.security.auth.login.config'</span>] = jaas_config
</code></pre>
<p>The Spark Setting class manages the configuration for connecting to a Kafka topic and receiving messages within Spark.</p>
<h4 id="spark-consumer-class">Spark Consumer Class</h4>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SparkConsumer</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, settings: SparkSettings, topic: str, group_id: str, client_id: str)</span>:</span>
self.settings = settings
self.topic = topic
self.group_id = group_id
self.client_id = client_id
self.stream = <span class="hljs-keyword">None</span>
self.data_frame = <span class="hljs-keyword">None</span>
self.kafka_options = self.settings.kafka_options
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_kafka_stream</span><span class="hljs-params">(self, spark: SparkSession)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Reads the Kafka Topic.
Args:
spark (SparkSession): The spark session object.
"""</span>
self.stream = spark.readStream.format(<span class="hljs-string">"kafka"</span>).options(**self.kafka_options).load()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_messages</span><span class="hljs-params">(self, schema)</span> -> DataFrame:</span>
<span class="hljs-string">"""
Parse the messages and use the provided schema to type cast the fields
"""</span>
stream = self.stream
<span class="hljs-keyword">assert</span> stream.isStreaming <span class="hljs-keyword">is</span> <span class="hljs-keyword">True</span>, <span class="hljs-string">"DataFrame doesn't receive streaming data"</span>
options = {<span class="hljs-string">'header'</span>: <span class="hljs-string">'true'</span>, <span class="hljs-string">'sep'</span>: <span class="hljs-string">','</span>}
df = stream.selectExpr(<span class="hljs-string">"CAST(key AS STRING)"</span>, <span class="hljs-string">"CAST(value AS STRING)"</span>, <span class="hljs-string">"timestamp"</span>)
<span class="hljs-comment"># split attributes to nested array in one Column</span>
col = F.split(df[<span class="hljs-string">'value'</span>], <span class="hljs-string">','</span>)
<span class="hljs-comment"># expand col to multiple top-level columns</span>
<span class="hljs-keyword">for</span> idx, field <span class="hljs-keyword">in</span> enumerate(schema):
df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
<span class="hljs-comment"># remove quotes from TIMESTAMP column</span>
df = df.withColumn(<span class="hljs-string">"TIMESTAMP"</span>, F.regexp_replace(F.col(<span class="hljs-string">"TIMESTAMP"</span>), <span class="hljs-string">'"'</span>, <span class="hljs-string">''</span>))
df = df.withColumn(<span class="hljs-string">"CA"</span>, F.regexp_replace(F.col(<span class="hljs-string">"CA"</span>), <span class="hljs-string">'"'</span>, <span class="hljs-string">''</span>))
result = df.select([field.name <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> schema])
result = result.dropDuplicates(["ID","STATION","TIMESTAMP"])
result.printSchema()
<span class="hljs-keyword">return</span> result
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">agg_messages</span><span class="hljs-params">(self, df: DataFrame, window_duration: str, window_slide: str)</span> -> DataFrame:</span>
<span class="hljs-string">"""
Window for n-minute aggregations grouped by CA, UNIT, SCP, STATION, LINENAME, DIVISION, DATE, DESC
"""</span>
<span class="hljs-comment"># Ensure TIMESTAMP is in the correct format (timestamp type) </span>
date_format = <span class="hljs-string">"yyyy-MM-dd HH:mm:ss"</span>
df = df.withColumn(<span class="hljs-string">"TS"</span>, F.to_timestamp(<span class="hljs-string">"TIMESTAMP"</span>, date_format))
df_windowed = df \
.withWatermark(<span class="hljs-string">"TS"</span>, window_duration) \
.groupBy(F.window(<span class="hljs-string">"TS"</span>, window_duration, window_slide),<span class="hljs-string">"CA"</span>, <span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"DESC"</span>) \
.agg(
F.sum(<span class="hljs-string">"ENTRIES"</span>).alias(<span class="hljs-string">"ENTRIES"</span>),
F.sum(<span class="hljs-string">"EXITS"</span>).alias(<span class="hljs-string">"EXITS"</span>)
).withColumn(<span class="hljs-string">"START"</span>,F.col(<span class="hljs-string">"window.start"</span>)) \
.withColumn(<span class="hljs-string">"END"</span>, F.col(<span class="hljs-string">"window.end"</span>)) \
.withColumn(<span class="hljs-string">"TIME"</span>, F.date_format(<span class="hljs-string">"window.end"</span>, <span class="hljs-string">"HH:mm:ss"</span>)) \
.drop(<span class="hljs-string">"window"</span>) \
.select(<span class="hljs-string">"CA"</span>,<span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>,<span class="hljs-string">"DATE"</span>,<span class="hljs-string">"DESC"</span>,<span class="hljs-string">"TIME"</span>,<span class="hljs-string">"START"</span>,<span class="hljs-string">"END"</span>,<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>)
df_windowed.printSchema()
<span class="hljs-keyword">return</span> df_windowed
</code></pre>
<p>The Spark consumer class initiates the consumer by loading the Kafka settings, reading from the data stream, parsing the messages, and ultimately aggregating the information using various categorical fields from the data.</p>
<p>The <code>agg_messages</code> function performs windowed aggregations on a DataFrame containing message data. It takes three parameters: <code>df</code> (the input DataFrame with message information), <code>window_duration</code> (the duration of each aggregation window in minutes), and <code>window_slide</code> (the sliding interval for the window). The function ensures the 'TIMESTAMP' column is in the correct timestamp format and applies windowed aggregations grouped by the 'CA', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', and 'DESC' columns. The resulting DataFrame includes the aggregated entries and exits for each window and group, providing insights into activity patterns over the specified time intervals. The function also prints the schema of the resulting DataFrame, making it convenient to understand the structure of the aggregated data.</p>
<blockquote>
<p>π The <code>agg_messages</code> function verifies that the timestamp from the data is in the correct Spark timestamp format. An incorrect format will prevent Spark from aggregating the messages, resulting in empty data files.</p>
</blockquote>
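<p>One way to sanity-check the expected format before running the consumer is a small standalone snippet (a sketch, not part of the consumer code). With Spark's default settings, <code>to_timestamp</code> yields <code>null</code> for values that do not match the supplied pattern, which leaves the windowed aggregation with no valid event time to group on:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("timestamp-format-check").getOrCreate()

# One value in the expected format and one that cannot be parsed
df = spark.createDataFrame(
    [("2023-09-23 16:54:00",), ("not-a-timestamp",)], ["TIMESTAMP"]
)
df.withColumn("TS", F.to_timestamp("TIMESTAMP", "yyyy-MM-dd HH:mm:ss")).show(truncate=False)
# The first row parses into a timestamp; the second becomes null and would not contribute to the aggregation.
</code></pre>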
<h4 id="main-application-entry-point">Main application entry point</h4>
<pre><code class="lang-python"><span class="hljs-comment"># @task(name="Stream write GCS", description='Write stream file to GCS', log_prints=False)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">stream_write_gcs</span><span class="hljs-params">(local_path: Path, file_name: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Upload local parquet file to GCS
Args:
path: File location
prefix: the folder location on storage
"""</span>
block_name = get_block_name()
prefix = get_prefix()
gcs_path = f<span class="hljs-string">'{prefix}/{file_name}'</span>
print(f<span class="hljs-string">'{block_name} {local_path} {gcs_path}'</span>)
gcs_block = GcsBucket.load(block_name)
gcs_block.upload_from_path(from_path=local_path, to_path=gcs_path)
<span class="hljs-keyword">return</span>
<span class="hljs-comment"># @task (name="MTA Spark Data Stream - Process Mini Batch", description="Write batch to the data lake")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_mini_batch</span><span class="hljs-params">(df, batch_id, path)</span>:</span>
<span class="hljs-string">"""Processes a mini-batch, converts to Pandas, and writes to GCP Storage as CSV.gz."""</span>
<span class="hljs-comment"># Check if DataFrame is empty</span>
<span class="hljs-keyword">if</span> df.count() == <span class="hljs-number">0</span>:
print(f<span class="hljs-string">"DataFrame for batch {batch_id} is empty. Skipping processing."</span>)
<span class="hljs-keyword">return</span>
<span class="hljs-comment"># Convert to Pandas DataFrame</span>
df_pandas = df.toPandas()
<span class="hljs-comment"># Convert 'DATE' column to keep the same date format</span>
df_pandas[<span class="hljs-string">'DATE'</span>] = pd.to_datetime(df_pandas[<span class="hljs-string">'DATE'</span>], format=<span class="hljs-string">'%m-%d-%y'</span>).dt.strftime(<span class="hljs-string">'%m/%d/%Y'</span>)
print(df_pandas.head())
<span class="hljs-comment"># Get the current timestamp</span>
timestamp = datetime.now()
<span class="hljs-comment"># Format the timestamp as needed</span>
time = timestamp.strftime(<span class="hljs-string">"%Y%m%d_%H%M%S"</span>)
<span class="hljs-comment"># Write to Storage as CSV.gz </span>
file_name = f<span class="hljs-string">"batch_{batch_id}_{time}.csv.gz"</span>
file_path = f<span class="hljs-string">"{path}/{file_name}"</span>
df_pandas.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
<span class="hljs-comment"># send to the data lake</span>
stream_write_gcs(file_path, file_name)
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Aggregate messages", description="Aggregate the data in time windows")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">aggregate_messages</span><span class="hljs-params">(consumer, df_messages, window_duration, window_slide)</span> -> DataFrame:</span>
df_windowed = consumer.agg_messages(df_messages, window_duration, window_slide)
<span class="hljs-keyword">return</span> df_windowed
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Read data stream", description="Read the data stream")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_data_stream</span><span class="hljs-params">(consumer, spark_session)</span> -> <span class="hljs-keyword">None</span>:</span>
consumer.read_kafka_stream(spark_session)
<span class="hljs-comment"># write a streaming data frame to storage ./storage</span>
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Write to Storage", description="Write batch to the data lake")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_to_storage</span><span class="hljs-params">(df: DataFrame, output_mode: str = <span class="hljs-string">'append'</span>, processing_time: str = <span class="hljs-string">'60 seconds'</span>)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Output stream values to the console
"""</span>
df_csv = df.select(
<span class="hljs-string">"CA"</span>, <span class="hljs-string">"UNIT"</span>, <span class="hljs-string">"SCP"</span>, <span class="hljs-string">"STATION"</span>, <span class="hljs-string">"LINENAME"</span>, <span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"TIME"</span>, <span class="hljs-string">"DESC"</span>,<span class="hljs-string">"ENTRIES"</span>, <span class="hljs-string">"EXITS"</span>
)
path = <span class="hljs-string">"./storage/"</span>
folder_path = Path(path)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(folder_path):
folder_path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
storage_query = df_csv.writeStream \
.outputMode(output_mode) \
.trigger(processingTime=processing_time) \
.format(<span class="hljs-string">"csv"</span>) \
.option(<span class="hljs-string">"header"</span>, <span class="hljs-keyword">True</span>) \
.option(<span class="hljs-string">"path"</span>, path) \
.option(<span class="hljs-string">"checkpointLocation"</span>, <span class="hljs-string">"./checkpoint"</span>) \
.foreachBatch(<span class="hljs-keyword">lambda</span> batch, id: process_mini_batch(batch, id, path)) \
.option(<span class="hljs-string">"truncate"</span>, <span class="hljs-keyword">False</span>) \
.start()
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># Wait for the streaming query to terminate</span>
storage_query.awaitTermination()
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-comment"># Handle keyboard interrupt (e.g., Ctrl+C)</span>
storage_query.stop()
<span class="hljs-meta">@flow (name="MTA Spark Data Stream flow", description="Data Streaming Flow")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
main flow to process stream messages with spark
"""</span>
topic = params.topic
group_id = params.group
client_id = params.client
config_path = params.config
<span class="hljs-comment"># define a window for n minutes aggregations group by station</span>
default_span = <span class="hljs-string">'5 minutes'</span>
window_duration = default_span <span class="hljs-keyword">if</span> params.duration <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span> <span class="hljs-keyword">else</span> f<span class="hljs-string">'{params.duration} minutes'</span>
window_slide = default_span <span class="hljs-keyword">if</span> params.slide <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span> <span class="hljs-keyword">else</span> f<span class="hljs-string">'{params.slide} minutes'</span>
<span class="hljs-comment"># create the consumer settings</span>
spark_settings = SparkSettings(config_path, topic, group_id, client_id)
<span class="hljs-comment"># create the spark consumer</span>
spark_session = SparkSession.builder \
.appName(<span class="hljs-string">"turnstiles-consumer"</span>) \
.config(<span class="hljs-string">"spark.sql.adaptive.enabled"</span>, <span class="hljs-string">"false"</span>) \
.getOrCreate()
spark_session.sparkContext.setLogLevel(<span class="hljs-string">"WARN"</span>)
<span class="hljs-comment"># create an instance of the consumer class</span>
consumer = SparkConsumer(spark_settings, topic, group_id, client_id)
<span class="hljs-comment"># set the data frame stream</span>
read_data_stream(consumer, spark_session)
<span class="hljs-comment"># consumer.read_kafka_stream(spark_session) </span>
<span class="hljs-comment"># parse the messages</span>
df_messages = consumer.parse_messages(schema=turnstiles_schema)
df_windowed = aggregate_messages(consumer, df_messages, window_duration, window_slide)
<span class="hljs-comment"># df_windowed = consumer.agg_messages(df_messages, window_duration, window_slide)</span>
write_to_storage(df=df_windowed, output_mode=<span class="hljs-string">'append'</span>,processing_time=window_duration)
spark_session.streams.awaitAnyTermination()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
<span class="hljs-string">"""
Main entry point for streaming data between kafka and spark
"""</span>
os.system(<span class="hljs-string">'clear'</span>)
print(<span class="hljs-string">'Spark streaming running...'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Producer : --topic mta-turnstile --group spark_group --client app1 --config path-to-config'</span>)
parser.add_argument(<span class="hljs-string">'--topic'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'kafka topics'</span>)
parser.add_argument(<span class="hljs-string">'--group'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'consumer group'</span>)
parser.add_argument(<span class="hljs-string">'--client'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'client id group'</span>)
parser.add_argument(<span class="hljs-string">'--config'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'cloud settings'</span>)
parser.add_argument(<span class="hljs-string">'--duration'</span>, required=<span class="hljs-keyword">False</span>, help=<span class="hljs-string">'window duration for aggregation 5 mins'</span>)
parser.add_argument(<span class="hljs-string">'--slide'</span>, required=<span class="hljs-keyword">False</span>, help=<span class="hljs-string">'window slide 5 mins'</span>)
args = parser.parse_args()
main_flow(args)
print(<span class="hljs-string">'end'</span>)
</code></pre>
<p>This is a summary of the main functions that start and run the consumer application:</p>
<ul>
<li><p><strong><code>stream_write_gcs</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Uploads a local Parquet file to Google Cloud Storage (GCS).</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>process_mini_batch</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Processes a mini-batch from a Spark DataFrame, converts it to a Pandas DataFrame, and writes it to GCP Storage as a compressed CSV file.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>aggregate_messages</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Aggregates data in time windows based on specific columns.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>read_data_stream</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Reads the data stream from Kafka.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>write_to_storage</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Writes a streaming DataFrame to storage (./storage) and triggers the processing of mini-batches.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>main_flow</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Defines the main flow to process stream messages with Spark.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect flow (<code>@flow</code>) for orchestration and monitoring.</li>
</ul>
</li>
<li><p><strong>Main Entry Point:</strong></p>
<ul>
<li><strong>Purpose:</strong> Parses command-line arguments and invokes the <code>main_flow</code> function to execute the streaming data processing.</li>
</ul>
</li>
</ul>
<blockquote>
<p>π These decorators (<code>@flow</code> and <code>@task</code>) are employed for Prefect Cloud Monitoring, orchestration, and task management.</p>
</blockquote>
<h2 id="how-to-runt-it-">How to runt it!</h2>
<p>With all the requirements completed and the code review done, we are ready to run our solution. Let's follow these steps to ensure our apps run properly.</p>
<h3 id="start-the-container-services">Start the Container Services</h3>
<p>Initiate the container services from the command line by executing the following script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_services.sh
</code></pre>
<h3 id="install-dependencies-and-run-the-apps">Install dependencies and run the apps</h3>
<blockquote>
<p>π These applications depend on the Kafka and Spark services to be running. Be sure to start those services first.</p>
</blockquote>
<h4 id="kafka-producer">Kafka Producer</h4>
<p>Execute the producer with the following command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_producer.sh
</code></pre>
<h4 id="spark-kafka-consumer">Spark - Kafka Consumer</h4>
<p>Execute the Spark consumer from the command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_consumer.sh
</code></pre>
<h3 id="execution-results">Execution Results</h3>
<p>After the producer and consumer are running, the following results should be observed:</p>
<h3 id="kafka-producer-log">Kafka Producer Log</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-kafka-log.png" alt="Data Engineering Process Fundamentals - Data Streaming Kafka Producer Log" title="Data Engineering Process Fundamentals - Data Streaming Kafka Producer Log"></p>
<p>As messages are sent by the producer, we should observe the activity in the console or log file.</p>
<h3 id="spark-consumer-log">Spark Consumer Log</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-spark-log.png" alt="Data Engineering Process Fundamentals - Data Streaming Spark Consumer Log" title="Data Engineering Process Fundamentals - Data Streaming Spark Consumer Log"></p>
<p>Spark parses the messages in real-time, displaying the message schemas for both the individual message from Kafka and the aggregated message. Once the time window is complete, it serializes the message from memory into a compressed CSV file.</p>
<h3 id="cloud-monitor">Cloud Monitor</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-prefect-monitor.png" alt="Data Engineering Process Fundamentals - Data Streaming Cloud Monitor" title="Data Engineering Process Fundamentals - Cloud Monitor"></p>
<p>As the application runs, the flows and tasks are tracked on our cloud console. This tracking is utilized to monitor for any failures.</p>
<h3 id="data-lake-integration">Data Lake Integration</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-data-lake.png" alt="Data Engineering Process Fundamentals - Data Streaming Data Lake" title="Data Engineering Process Fundamentals - Data Lake"></p>
<p>Upon serializing the data aggregation, a compressed CSV file is uploaded to the data lake with the purpose of making it visible to our data warehouse integration process.</p>
<h3 id="data-warehouse-integration">Data Warehouse Integration</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-data-warehouse.png" alt="Data Engineering Process Fundamentals - Data Streaming Data Warehouse" title="Data Engineering Process Fundamentals - Data Warehouse"></p>
<p>Once the data has been transferred to the data lake, we can initiate the integration from the data warehouse. A quick way to check is to query the external table using the test station name.</p>
<blockquote>
<p>π Our weekly batch process is scheduled once per week. However, in a data stream use case, where the data arrives more frequently, we need to update the schedule to an hourly or minute window.</p>
</blockquote>
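<p>As an illustration of that change, a Prefect 2.x deployment for the existing data warehouse flow could be given an hourly cron schedule. This is a hedged sketch only: the flow module path, deployment name, and work queue are placeholders, and the <code>CronSchedule</code> import path shown here is the one used by the 2.7 line of Prefect referenced in the Dockerfiles below.</p>
<pre><code class="lang-python">from prefect.deployments import Deployment
from prefect.orion.schemas.schedules import CronSchedule

# Placeholder import path for the existing data warehouse incremental flow
from flows.etl_data_warehouse import main_flow

deployment = Deployment.build_from_flow(
    flow=main_flow,
    name="mta-data-warehouse-hourly",
    schedule=CronSchedule(cron="0 * * * *", timezone="America/New_York"),  # top of every hour
    work_queue_name="default",
)

if __name__ == "__main__":
    deployment.apply()
</code></pre>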
<h2 id="deployment">Deployment</h2>
<p>For our deployment process, we can follow these steps:</p>
<ul>
<li>Define the Docker files for each component</li>
<li>Build and push the apps to DockerHub</li>
<li>Deploy the Kafka and Spark services</li>
<li>Deploy the Kafka producer and Spark consumer apps</li>
</ul>
<h3 id="define-the-docker-files-for-each-component">Define the Docker files for each component</h3>
<p>To facilitate each deployment, we aim to run our applications within a Docker container. In each application folder, you'll find a Dockerfile. This file installs the application dependencies, copies the necessary files, and runs the specific command to load the application.</p>
<blockquote>
<p>π Noteworthy is the use of the <code>VOLUME</code> command in these files, enabling us to map a VM hosting folder to an image folder. The objective is to share a common configuration file for the containers.</p>
</blockquote>
<ul>
<li>Kafka Producer Docker file</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Use a base image with Prefect and Python</span>
<span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-comment"># Set the working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app
</span>
<span class="hljs-comment"># Copy the requirements file to the working directory</span>
<span class="hljs-keyword">COPY</span><span class="bash"> requirements.txt .
</span>
<span class="hljs-comment"># Install dependencies</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-comment"># Copy the entire current directory into the container</span>
<span class="hljs-keyword">COPY</span><span class="bash"> *.py .
</span>
<span class="hljs-comment"># Specify the default command to run when the container starts</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"python3"</span>, <span class="hljs-string">"program.py"</span>, <span class="hljs-string">"--topic"</span>,<span class="hljs-string">"mta-turnstile"</span>,<span class="hljs-string">"--config"</span>,<span class="hljs-string">"/config/docker-kafka.properties"</span>]
</span>
<span class="hljs-comment"># Create a directory for Kafka configuration</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /config
</span>
<span class="hljs-comment"># Create a volume mount for Kafka configuration</span>
<span class="hljs-keyword">VOLUME</span><span class="bash"> [<span class="hljs-string">"/config"</span>]
</span>
<span class="hljs-comment"># push the ~/.kafka/docker-kafka.properties to the target machine</span>
<span class="hljs-comment"># run as to map the volume to the target machine:</span>
<span class="hljs-comment"># docker run -v ~/.kafka:/config your-image-name</span>
</code></pre>
<ul>
<li>Spark Consumer Docker file</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Use a base image with Prefect and Python</span>
<span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-comment"># Set the working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app
</span>
<span class="hljs-comment"># Copy the requirements file to the working directory</span>
<span class="hljs-keyword">COPY</span><span class="bash"> requirements.txt .
</span>
<span class="hljs-comment"># Install dependencies</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-comment"># Copy the entire current directory into the container</span>
<span class="hljs-keyword">COPY</span><span class="bash"> *.py .
</span><span class="hljs-keyword">COPY</span><span class="bash"> *.sh .
</span>
<span class="hljs-comment"># Create a directory for Kafka configuration</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /config
</span>
<span class="hljs-comment"># Create a volume mount for Kafka configuration</span>
<span class="hljs-keyword">VOLUME</span><span class="bash"> [<span class="hljs-string">"/config"</span>]
</span>
<span class="hljs-comment"># Set the entry point script as executable</span>
<span class="hljs-keyword">RUN</span><span class="bash"> chmod +x submit-program.sh
</span>
<span class="hljs-comment"># Specify the default command to run when the container starts</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"submit-program.sh"</span>, <span class="hljs-string">"program.py"</span>, <span class="hljs-string">"/config/docker-kafka.properties"</span>]
</span>
<span class="hljs-comment"># push the ~/.kafka/docker-kafka.properties to the target machine</span>
<span class="hljs-comment"># run as to map the volume to the target machine:</span>
<span class="hljs-comment"># docker run -v ~/.kafka:/config your-image-name</span>
</code></pre>
<h3 id="build-and-push-the-apps-to-dockerhub">Build and push the apps to DockerHub</h3>
<p>To build the apps in Docker containers, execute the following script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span><span class="hljs-keyword">build_push_apps.sh</span>
</code></pre>
<h3 id="deploy-the-kafka-and-spark-services">Deploy the Kafka and Spark services</h3>
<p>For Kafka and Spark services, we are utilizing predefined Bitnami images from DockerHub. Deploy these images by running the following script on the target environment:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>deploy_kafka_spark.sh
</code></pre>
<p>This script utilizes a Docker Compose file to pull the Bitnami images and subsequently run them.</p>
<blockquote>
<p>π Docker Compose is a tool for defining and running multi-container Docker applications. It can define the services, networks, and volumes needed for the application in a single docker-compose.yml file.</p>
</blockquote>
<h3 id="deploy-the-kafka-producer-and-spark-consumer-apps">Deploy the Kafka producer and Spark consumer apps</h3>
<p>Once the app images are available from DockerHub, initiate the deployment against a new environment by executing this script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>deploy_publisher_consumer_apps.sh
</code></pre>
<p>This script pulls the app images from DockerHub and runs them independently.</p>
<blockquote>
<p>π It's important to note that while we've covered a local and a GitHub Actions deployment, deploying to a cloud provider environment involves additional considerations.</p>
</blockquote>
<h2 id="deployment-strategy">Deployment Strategy</h2>
<p>In this guide, we've explored a two-fold approach to deploying our Kafka and Spark-based data streaming solution. Initially, we used the manual deployment process, demonstrating how to execute bash scripts for building and deploying our application. This hands-on method provides a detailed understanding of the steps involved, giving users complete control over the deployment process.</p>
<p>Moving forward, we showcased a more streamlined and automated approach by integrating GitHub Actions into our workflow. By leveraging GitHub Actions, we can trigger builds and deployments with a simple push to dedicated branches (<code>deploy-bitnami</code> and <code>deploy-apps</code>). This automation not only simplifies the deployment process but also enhances efficiency, ensuring consistency across environments.</p>
<h2 id="summary">Summary</h2>
<p>The integration of Kafka and Spark in a data streaming architecture involves producers publishing data to Kafka topics, consumers subscribing to these topics, Spark consuming data from Kafka, parsing and aggregating messages, and finally, writing the processed data to a data lake or other storage for further processing.</p>
<p>Once the data is available in the data lake, the data warehouse process can pick up the new files and continue its incremental update process, ultimately reflecting on the analysis and visualization layer. This architecture enables real-time data processing and analytics in a scalable and fault-tolerant manner.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-11861948700895265582023-08-05T11:44:00.008-04:002023-09-15T11:34:32.688-04:00Data Engineering Process Fundamentals - Data Streaming<p>In modern data engineering solutions, handling streaming data is very important. Businesses often need real-time insights to promptly monitor and respond to operational changes and performance trends. A data streaming pipeline facilitates the integration of real-time data into data warehouses and visualization dashboards. To achieve this integration in a data engineering solution, understanding the principles of data streaming is essential, and how technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://spark.apache.org/">Apache Spark</a> play a key role in building efficient streaming data pipelines.</p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/07/data-engineering-process-fundamentals-data-analysis-visualization.html">Data Engineering Process Fundamentals - Data Analysis and Visualization</a></p>
</blockquote>
<h2 id="what-is-data-streaming">What is Data Streaming</h2>
<p>Data streaming enables us to build data integration in real-time. Unlike traditional batch processing, where data is collected and processed periodically, streaming data arrives continuously and is processed on-the-fly. This kind of integration empowers organizations to:</p>
<ul>
<li>React Instantly: Timely responses to events and anomalies become possible</li>
<li>Predict Trends: Identify patterns and trends as they emerge</li>
<li>Enhance User Experience: Provide real-time updates and personalization</li>
<li>Optimize Operations: Streamline processes and resource allocation</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-messages.png" alt="ozkary-data-engineering-design-data-streaming-messages" title="Data Engineering Process Fundamentals - Data Streaming Kafka Topics"></p>
<h3 id="data-streaming-channels">Data Streaming Channels</h3>
<p>Data streaming is a continuous data flow that usually arrives from a channel hosted on an HTTP endpoint. The channel technology depends on the provider's technology stack and can be any of the following:</p>
<ul>
<li><p>Web Hooks: Web hooks are like virtual messengers that notify us when something interesting happens on the web. They are HTTP callbacks triggered by specific events, such as a change in a system. To harness data from web hooks, we set up endpoints that listen for these notifications, allowing us to react instantly to changes.</p>
</li>
<li><p>Events: Events are a fundamental concept in data streaming. They represent occurrences in a system or application, such as a user click, a sensor detecting a temperature change, or a train arriving at a station. Events can be collected and processed in real-time by using a middleware platform like Apache Kafka or RabbitMQ, providing insights into user behavior, system health, and more.</p>
</li>
<li><p>API Integration: APIs (Application Programming Interfaces) are bridges between different software systems. Through API integration, we can fetch data from external services, social media platforms, IoT devices, or any source that exposes an API. This seamless connectivity enables us to incorporate external data into our applications and processes by scheduling calls to the API at a certain frequency.</p>
</li>
</ul>
<blockquote>
<p>π Events are used for a wide range of real-time applications, including IoT data collection, application monitoring, and user behavior tracking. Web hooks are typically employed for integrating external services, automating workflows, and receiving notifications from third-party platforms.</p>
</blockquote>
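<p>To make the web hook channel concrete, the sketch below exposes a small HTTP endpoint that receives event notifications and would hand them off to the streaming pipeline. It assumes Flask is installed; the route name and payload fields are hypothetical.</p>
<pre><code class="lang-python"># Minimal web hook receiver sketch. The route and payload fields are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/turnstile", methods=["POST"])
def turnstile_webhook():
    event = request.get_json(silent=True) or {}
    # In a real pipeline, the event would be validated and forwarded
    # to a message broker such as a Kafka topic
    print("Received event:", event.get("station"), event.get("entries"), event.get("exits"))
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
</code></pre>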
<h3 id="scaling-to-handle-a-data-stream">Scaling to Handle a Data Stream</h3>
<p>Data streaming sources often produce small payloads at a very high message volume. This introduces scalability concerns that should be addressed with essential components like the following:</p>
<ul>
<li><p>Streaming Infrastructure: Robust streaming infrastructure is the backbone of data streaming. This includes systems like Apache Kafka, AWS Kinesis, or Azure Stream Analytics, which facilitate the ingestion, processing, and routing of data streams</p>
</li>
<li><p>Real-Time Processing: Traditional batch processing won't cut it for data streaming. We need real-time processing frameworks like <a href="https://storm.apache.org/">Apache Storm</a>, or Apache Spark Streaming to handle data as it flows</p>
</li>
<li><p>Data Storage: Storing and managing streaming data is crucial. We might use data lakes for long-term storage and databases optimized for real-time access. Cloud storage solutions offer scalability and reliability</p>
</li>
<li><p>Analytics and Visualization: To derive meaningful insights, we need analytics tools capable of processing streaming data. Visualization platforms like PowerBI, Looker, or custom dashboards can help you make sense of the information in real time</p>
</li>
<li><p>Monitoring and Alerts: Proactive monitoring ensures that your data streaming pipeline is healthy. Implement alerts and triggers to respond swiftly to anomalies or critical events</p>
</li>
<li><p>Scalable Compute Resources: As data volumes grow, compute resources should be able to scale horizontally to handle increased data loads. Cloud-based solutions are often used for this purpose</p>
</li>
</ul>
<h2 id="data-streaming-components">Data Streaming Components</h2>
<p>At the heart of data streaming solutions lie technologies like Apache Kafka, a distributed event streaming platform, and Apache Spark, a versatile data processing engine. Together, they form a powerful solution that ingests, processes, and analyzes streaming data at scale.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-kafka-spark.png" alt="ozkary-data-engineering-design-data-streaming" title="Data Engineering Process Fundamentals - Data Streaming Design Kafka and Spark"></p>
<h3 id="apache-kafka">Apache Kafka</h3>
<p>Kafka acts as the ingestion layer or message broker in the streaming pipeline. It serves as a highly durable, fault-tolerant, and scalable event streaming platform. Data producers, which can be various sources like applications, sensors, or webhooks, publish events (messages) to Kafka topics. These events are typically small pieces of data containing information such as transactions, logs, or sensor readings. Let's look at a simplified overview of how Kafka works:</p>
<ul>
<li><p>Kafka organizes events into topics. A topic is a logical channel or category to which records (messages) are sent by producers and from which records are consumed by consumers. Topics serve as the central mechanism for organizing and categorizing data within Kafka. Each topic can have multiple partitions to support parallelism and fail-over scenarios</p>
<ul>
<li>Kafka is distributed and provides fault tolerance. If a broker (Kafka server) fails, partitions can be replicated across multiple brokers</li>
</ul>
</li>
<li><p>Kafka follows a publish-subscribe model. Producers send records to topics, and consumers subscribe to one or more topics to receive and process those records</p>
<ul>
<li><p>A producer is a program or process responsible for publishing records to Kafka topics. Producers generate data, which is then sent to one or more topics. Each message in a topic is identified by an offset, which represents its position within its partition.</p>
</li>
<li><p>A consumer is a program or process that subscribes to one or more topics and processes the records within them. Consumers can read data from topics in real-time and perform various operations on it, such as analytics, storage, or forwarding to other systems</p>
</li>
</ul>
</li>
</ul>
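<p>To make the publish-subscribe model tangible, here is a minimal consumer sketch using the kafka-python library. The topic name matches the examples used later in this post, while the consumer group id is a hypothetical value.</p>
<pre><code class="lang-python"># Minimal Kafka consumer sketch using kafka-python. The group id is hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "turnstile-stream",                   # topic to subscribe to
    bootstrap_servers="localhost:9092",   # replace with your broker address
    group_id="turnstile-consumers",       # consumers in the same group share partitions
    auto_offset_reset="latest",           # start from the newest messages
    value_deserializer=lambda v: v.decode("utf-8")
)

# Each iteration blocks until a new record arrives on the subscribed topic
for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
</code></pre>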
<h3 id="apache-spark-structured-streaming">Apache Spark Structured Streaming</h3>
<p>Apache Spark Structured Streaming is a micro-batch processing framework built on top of Apache Spark. It enables real-time data processing by ingesting data from Kafka topics in mini-batches. Here's how the process works:</p>
<blockquote>
<p>π Apache Spark offers a unified platform for both batch and stream processing. If your application requires seamless transitions between batch and stream processing modes, Spark can be a good fit.</p>
</blockquote>
<ul>
<li><p>Kafka Integration: Spark Streaming integrates with Kafka using the Kafka Direct API. It can consume data directly from Kafka topics, leveraging Kafka's parallelism and fault tolerance features</p>
</li>
<li><p>Mini-Batch Processing: Spark Streaming reads data from Kafka topics in mini-batches, typically ranging from milliseconds to seconds. Each mini-batch of data is treated as a Resilient Distributed Dataset (RDD) within the Spark ecosystem</p>
</li>
<li><p>Data Transformation: Once the data is ingested into Spark Streaming, we can apply various transformations, computations, and analytics on the mini-batches of data. Spark provides a rich set of APIs for tasks like filtering, aggregating, joining, and machine learning</p>
</li>
<li><p>Windowed Operations: Spark Streaming allows us to perform windowed operations, such as sliding windows or tumbling windows, to analyze data within specific time intervals. This is useful for aggregating data over fixed time periods (e.g., hourly, daily) or for tracking patterns over sliding windows</p>
</li>
<li><p>Output: After processing, the results can be stored in various destinations, such as a data lake (e.g., Hadoop HDFS), a data warehouse (e.g., BigQuery, Redshift), or other external systems. Spark provides connectors to these storage solutions for seamless data persistence</p>
</li>
</ul>
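<p>The following sketch puts these steps together with the Structured Streaming API: it reads the Kafka topic as a streaming DataFrame, parses the comma-delimited message, and aggregates entries and exits over a tumbling one-hour window. The topic, column positions, window size, and checkpoint path are assumptions for illustration only.</p>
<pre><code class="lang-python"># Sketch: Kafka source, parsing, and a tumbling window with Structured Streaming.
# Topic, column positions, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, window, sum as sum_

spark = SparkSession.builder.appName("StructuredStreamSketch").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "turnstile-stream")
       .load())

# Parse the comma-delimited message value into typed columns
parts = split(col("value").cast("string"), ",")
events = raw.select(
    parts.getItem(1).alias("station"),
    parts.getItem(4).cast("long").alias("entries"),
    parts.getItem(5).cast("long").alias("exits"),
    col("timestamp")  # Kafka record timestamp used for windowing
)

# Tumbling 1-hour window per station, tolerating 15 minutes of late data
hourly = (events
          .withWatermark("timestamp", "15 minutes")
          .groupBy("station", window("timestamp", "1 hour"))
          .agg(sum_("entries").alias("entries"), sum_("exits").alias("exits")))

# Write to the console for illustration; a real pipeline would use a data lake sink
query = (hourly.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/turnstile-checkpoint")
         .start())
query.awaitTermination()
</code></pre>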
<h3 id="benefits-of-a-kafka-and-spark-integration">Benefits of a Kafka and Spark Integration</h3>
<p>A Kafka and Spark integration enables us to build solutions with High Availability requirements due to the following features:</p>
<ul>
<li><p>Fault Tolerance: Kafka ensures that events are not lost even in the face of hardware failures, making it a reliable source of data</p>
</li>
<li><p>Scalability: Kafka scales horizontally, allowing you to handle increasing data volumes by adding more Kafka brokers</p>
</li>
<li><p>Flexibility: Spark Streaming's flexibility in data processing and windowing operations enables a wide range of real-time analytics</p>
</li>
<li><p>End-to-End Pipeline: By combining Kafka's ingestion capabilities with Spark's processing power, you can create end-to-end data streaming pipelines that handle real-time data ingestion, processing, and storage</p>
</li>
</ul>
<h3 id="supported-programming-languages">Supported Programming Languages</h3>
<p>When it comes to programming language support, both Kafka and Spark allow developers to choose the language that aligns best with their skills and project requirements.</p>
<ul>
<li><p>Kafka supports multiple programming languages, including Python, C#, and Java</p>
</li>
<li><p>Spark also supports multiple programming languages like PySpark (Python), Scala, and even R for data processing tasks. Additionally, Spark allows users to work with SQL-like expressions, making it easier to perform complex data transformations and analysis</p>
</li>
</ul>
<h4 id="sample-python-code-for-a-kafka-producer">Sample Python Code for a Kafka Producer</h4>
<p>This is a very simple implementation of a Kafka producer using Python as the programming language. This code does not consume a data feed from a provider. It only shows how a producer sends messages to a Kafka topic.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">from</span> kafka <span class="hljs-keyword">import</span> KafkaProducer
<span class="hljs-keyword">import</span> time
kafka_broker = <span class="hljs-string">"localhost:9092"</span>
<span class="hljs-comment"># Create a Kafka producer instance</span>
producer = KafkaProducer(
bootstrap_servers=kafka_broker, <span class="hljs-comment"># Replace with your Kafka broker's address</span>
value_serializer=<span class="hljs-keyword">lambda</span> v: str(v).encode(<span class="hljs-string">'utf-8'</span>)
)
<span class="hljs-comment"># Sample data message (comma-delimited)</span>
sample_message = <span class="hljs-string">"timestamp,station,turnstile_id,device_id,entry,exit,entry_datetime"</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># continue to run until the instance is shutdown</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">True</span>:
<span class="hljs-comment"># Simulate generating a new data message. This data should come from the data provider</span>
data_message = sample_message + f<span class="hljs-string">"\n{int(time.time())},StationA,123,456,10,15,'2023-07-12 08:30:00'"</span>
<span class="hljs-comment"># Send the message to the Kafka topic</span>
producer.send(<span class="hljs-string">'turnstile-stream'</span>, value=data_message)
<span class="hljs-comment"># add logging information for tracking</span>
print(<span class="hljs-string">"Message sent:"</span>, data_message)
time.sleep(<span class="hljs-number">1</span>) <span class="hljs-comment"># Sending messages every second</span>
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-keyword">pass</span>
<span class="hljs-keyword">finally</span>:
producer.close()
</code></pre>
<p>This Kafka producer code initializes a producer and sends a sample message to the specified Kafka topic. Let's review each code segment:</p>
<ul>
<li><p>Create Kafka Producer Configuration:</p>
<ul>
<li>Set up the Kafka producer configuration</li>
<li>Specify the Kafka broker(s) to connect to <code>(bootstrap.servers)</code></li>
</ul>
</li>
<li><p>Define Kafka Topic: Define the Kafka topic to send messages (turnstile-stream in this example)</p>
</li>
<li><p>Create a Kafka Producer:</p>
<ul>
<li>Create an instance of the Kafka producer with the broker end-point</li>
<li>Use a <code>value_serializer</code> to encode the string message with unicode utf-8</li>
</ul>
</li>
<li><p>Define Message Contents:</p>
<ul>
<li>Prepare the message to send as a CSV string with the header and value information </li>
</ul>
</li>
<li><p>Produce Messages:</p>
<ul>
<li>Use the send method of the Kafka producer to send messages to the Kafka topic, turnstile-stream</li>
</ul>
</li>
<li><p>Close the Kafka Producer:</p>
<ul>
<li>Always remember to close the Kafka producer when the application terminates to avoid leaving open connections on the broker</li>
</ul>
</li>
</ul>
<h4 id="sample-python-code-for-a-kafka-consumer-and-spark-client">Sample Python Code for a Kafka Consumer and Spark Client</h4>
<p>After looking at the Kafka producer code, let's take a look at how a Kafka consumer on Spark would consume and process the messages.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.streaming <span class="hljs-keyword">import</span> StreamingContext
<span class="hljs-keyword">from</span> pyspark.streaming.kafka <span class="hljs-keyword">import</span> KafkaUtils
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> <span class="hljs-built_in">window</span>, sum
<span class="hljs-comment"># Create a Spark session</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"TurnstileStreamApp"</span>).getOrCreate()
<span class="hljs-comment"># Create a StreamingContext with a batch interval of 5 seconds</span>
ssc = StreamingContext(spark.sparkContext, <span class="hljs-number">5</span>)
kafka_broker = <span class="hljs-string">"localhost:9092"</span>
<span class="hljs-comment"># Define the Kafka broker and topic to consume from</span>
kafkaParams = {
<span class="hljs-string">"bootstrap.servers"</span>: kafka_broker, <span class="hljs-comment"># Replace with your Kafka broker's address</span>
<span class="hljs-string">"auto.offset.reset"</span>: <span class="hljs-string">"latest"</span>,
}
topics = [<span class="hljs-string">"turnstile-stream"</span>]
<span class="hljs-comment"># Create a Kafka stream</span>
kafkaStream = KafkaUtils.createDirectStream(ssc, topics, kafkaParams)
<span class="hljs-comment"># Parse the Kafka stream as a DataFrame</span>
lines = kafkaStream.map(lambda x: x[<span class="hljs-number">1</span>])
df = spark.read.csv(lines)
<span class="hljs-comment"># Define a window for aggregation (4-hour window)</span>
windowed_df = df
.withWatermark(<span class="hljs-string">"entry_datetime"</span>, <span class="hljs-string">"4 hours"</span>) <span class="hljs-string">\</span>
<span class="hljs-comment"># 4-hour window with a 4-hour sliding interval</span>
.groupBy(<span class="hljs-string">"station"</span>, <span class="hljs-built_in">window</span>(<span class="hljs-string">"entry_datetime"</span>, <span class="hljs-string">"4 hours"</span>))
.agg(
sum(<span class="hljs-string">"entries"</span>).alias(<span class="hljs-string">"entries"</span>),
sum(<span class="hljs-string">"exits"</span>).alias(<span class="hljs-string">"exits"</span>)
)
<span class="hljs-comment"># Write the aggregated data to a blob storage as compressed CSV files</span>
query = windowed_df.writeStream<span class="hljs-string">\</span>
.outputMode(<span class="hljs-string">"update"</span>)<span class="hljs-string">\</span>
.foreachBatch(lambda batch_df, batch_id: batch_df.write<span class="hljs-string">\</span>
.mode(<span class="hljs-string">"overwrite"</span>)<span class="hljs-string">\</span>
.csv(<span class="hljs-string">"gs://your-bucket-name/"</span>) <span class="hljs-comment"># Replace with your blob storage path</span>
)<span class="hljs-string">\</span>
.start()
query.awaitTermination()
</code></pre>
<p>This simple example shows how to write a Kafka consumer, use Spark to process and aggregate the data, and finally write a CSV file to the data lake. Let's look at each code segment for more details:</p>
<ul>
<li><p>Create a Spark Session: </p>
<ul>
<li>Initialize a Spark session with the name "TurnstileStreamApp"</li>
</ul>
</li>
<li><p>Create a StreamingContext:</p>
<ul>
<li>Set up a StreamingContext with a batch interval of 5 seconds. This determines how often Spark will process incoming data</li>
</ul>
</li>
<li><p>Define Kafka Broker and Topic:</p>
<ul>
<li>Specify the Kafka broker's address (localhost:9092 in this example) and the topic to consume data from ("turnstile-stream")</li>
</ul>
</li>
<li><p>Create a Kafka Stream:</p>
<ul>
<li>Use KafkaUtils to create a direct stream from Kafka</li>
<li>The stream will consume data from the specified Kafka topic</li>
</ul>
</li>
<li><p>Parse the Kafka Stream:</p>
<ul>
<li>Extract the message values from the Kafka stream</li>
<li>Read these messages into a DataFrame (<code>df</code>) using Spark's CSV reader</li>
</ul>
</li>
<li><p>Define a Window for Aggregation:</p>
<ul>
<li>We specify the watermark for late data using <code>withWatermark</code>. This ensures that any data arriving later than the specified window is still considered for aggregation</li>
<li>Create a windowed DataFrame (<code>windowed_df</code>) by grouping data based on "station" and a 4-hour window</li>
<li>The "entry_datetime" column is used as the timestamp for windowing</li>
<li>Aggregations are performed to calculate the sum of "entries" and "exits" within each window</li>
</ul>
</li>
<li><p>Write the Aggregated Data to a Data Lake:</p>
<ul>
<li>Define a streaming query (<code>query</code>) to write the aggregated data to a blob storage path</li>
<li>The "update" output mode indicates that only updated results will be written</li>
<li>A function is applied to each batch of data, which specifies how to write the data</li>
<li>In this case, data is written as compressed CSV files to a data lake location</li>
<li>The <code>awaitTermination</code> method ensures the query continues to run and process data until manually terminated.</li>
</ul>
</li>
</ul>
<p>This Spark example processes data from Kafka, aggregates it in 4-hour windows, and writes the results to blob storage. The code is structured to efficiently handle real-time streaming data and organize the output into folders in the data lake based on station names and time windows. In each folder, Spark generates filenames automatically based on the default naming convention. Typically, it uses a combination of a unique identifier and partition number to create filenames. The exact format of the file name might vary depending on the Spark version and configuration. This approach is used to send the information to a data lake, so the data transformation process can pick it up and send it to a data warehouse.</p>
<p>Alternatively, the Spark client can send the aggregated results directly to the data warehouse. The Spark client can connect to the data warehouse and insert the information directly, without using a data lake as a staging step.</p>
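<p>As a rough sketch of that alternative, the streaming query from the previous example could write each micro-batch directly to the warehouse, assuming the spark-bigquery connector is available on the cluster. The dataset, table, and staging bucket names are placeholders.</p>
<pre><code class="lang-python"># Sketch: write each processed micro-batch directly to the data warehouse.
# Assumes the spark-bigquery connector is on the classpath; names are placeholders.
def write_to_warehouse(batch_df, batch_id):
    (batch_df.write
     .format("bigquery")
     .option("table", "mta_data.turnstile_agg")           # hypothetical dataset.table
     .option("temporaryGcsBucket", "mta-staging-bucket")  # staging bucket used by the connector
     .mode("append")
     .save())

query = (windowed_df.writeStream
         .outputMode("update")
         .foreachBatch(write_to_warehouse)
         .start())
</code></pre>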
<h2 id="solution-design-and-architecture">Solution Design and Architecture</h2>
<p>For our solution strategy, we followed the design shown below. This design helps us ensure a smooth flow, efficient processing, and storage of data so that it is immediately available in our data warehouse and, consequently, in the visualization tools. Let's break down each component and explain its purpose.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-design.png" alt="ozkary-data-engineering-design-data-streaming" title="Data Engineering Process Fundamentals - Data Streaming Design"></p>
<h3 id="components">Components</h3>
<ul>
<li><p>Real-Time Data Source: This is an external data source, which continuously emits data as events or messages</p>
</li>
<li><p>Message Broker Layer: Our message broker layer is the central hub for data ingestion and distribution. It consists of two vital components:</p>
<ul>
<li>Kafka Broker Instance: Kafka acts as a scalable message broker and entry point for data ingestion. It efficiently collects and organizes data in topics from the source</li>
<li>Kafka Producer (Python): To bridge the gap between the data source and Kafka, we write a Python-based Kafka producer. This component is responsible for capturing data from the real-time source and forwarding it to the Kafka instance and corresponding topic</li>
</ul>
</li>
<li><p>Stream Processing Layer: The stream processing layer is where the messages from Kafka are processed, aggregated and sent to the corresponding data storage. This layer also consists of two key components: </p>
<ul>
<li>Spark Instance: Apache Spark, a high-performance stream processing framework, is responsible for processing and transforming data in real-time</li>
<li>Stream Consumer (Python): In order to consume the messages from a Kafka topic, we write a Python component that acts as both a Kafka Consumer and Spark application. <ul>
<li>The Kafka consumer retrieves data from the Kafka topic, ensuring that the data is processed as soon as it arrives</li>
<li>The Spark application processes the messages, aggregates the data, and saves the results in the data warehouse. This dual role ensures efficient data processing and storage.</li>
</ul>
</li>
</ul>
</li>
<li><p>Data Warehouse: As the final destination for our processed data, the data warehouse provides a reliable and structured repository for storing the results of our real-time data processing, so visualization tools like Looker and PowerBI can display the data as soon as the dashboards are refreshed</p>
</li>
</ul>
<blockquote>
<p>π We should note that dashboards query the data from the database. For near real-time data to be available, the dashboard data needs to be refreshed at certain intervals (e.g., minutes or hourly). To make real-time data available to the dashboard, there needs to be a live connection (socket) between the dashboard and the streaming platform, handled by another system component, like <a href="https://redis.com/">Redis Cache</a> or a custom service, that can push those events over a socket channel.</p>
</blockquote>
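<p>One way such a component could push events toward connected dashboards is through a publish/subscribe channel. The sketch below uses the redis-py client to publish a processed event; the channel name and payload shape are hypothetical.</p>
<pre><code class="lang-python"># Sketch: publish a processed event to a Redis pub/sub channel so a socket layer
# can forward it to dashboards. The channel name is hypothetical.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def push_live_update(station: str, entries: int, exits: int) -> None:
    payload = json.dumps({"station": station, "entries": entries, "exits": exits})
    # Subscribers (e.g., a WebSocket service) receive this message in real time
    r.publish("turnstile-live", payload)

# Example usage
# push_live_update("StationA", 10, 15)
</code></pre>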
<h3 id="devops-support">DevOps Support</h3>
<ul>
<li><p>Containerization: In order to continue to meet our DevOps requirements, enhance scalability and manageability, and follow enterprise-level best practices, we use Docker containers for all of our components. Each component, our Kafka and Spark instances as well as our two Python-based components, runs in a separate Docker container. This ensures modularity, easy deployment, and resource isolation</p>
<ul>
<li>This approach also enables us to use a Kubernetes cluster, a container orchestration platform that can help us manage and deploy Docker containers at scale, to run our components. We could use Minikube for local development or use a cloud-managed Kubernetes service like Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), or Azure Kubernetes Service (AKS)</li>
</ul>
</li>
<li><p>Virtual Machine (VM): Our components need to run on a VM, so we follow the same approach and create a VM instance using a Terraform script, similar to how it was done for our batch data pipeline during our planning and design phase</p>
</li>
</ul>
<h3 id="advantages">Advantages</h3>
<p>Our data streaming design offers several advantages:</p>
<ul>
<li>Real-time Processing: Data is processed as it arrives, enabling timely insights and rapid response to changing conditions</li>
<li>Scalability: The use of Kafka and Spark allows us to scale our architecture effortlessly to handle growing data volumes</li>
<li>Containerization: Docker containers simplify deployment and management, making our system highly portable and maintainable</li>
<li>Integration: The seamless integration of Kafka, Spark, and the Kafka consumer as a Spark client ensures data continuity and efficient processing</li>
</ul>
<p>This data streaming strategy, powered by Kafka and Spark, empowers us to unlock real-time insights from our data streams, providing valuable information for rapid decision-making, analytics, and storage. </p>
<h2 id="summary">Summary</h2>
<p>In today's data-driven landscape, data streaming solutions are an absolute necessity, enabling the rapid processing and analysis of vast amounts of real-time data. Technologies like Kafka and Spark play a pivotal role in empowering organizations to harness real-time insights from their data streams.</p>
<p>Kafka and Spark work together seamlessly to enable real-time data processing and analytics. Kafka handles the reliable ingestion of events, while Spark Streaming provides the tools for processing, transforming, analyzing, and storing the data in a data lake or data warehouse in near real-time, allowing businesses to make decisions at a much faster pace.</p>
<h2 id="exercise-data-streaming-with-apache-kafka">Exercise - Data Streaming with Apache Kafka</h2>
<p>Now that we have defined the data streaming strategy, we can continue our journey and build a containerized Kafka producer that can consume real-time data sources. Let's take a look at that next.</p>
<p>Coming soon!</p>
<blockquote>
<p>π Data Engineering Process Fundamentals - Data Streaming With Apache Kafka Exercise</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-68532005543889197762023-07-08T11:51:00.010-04:002023-08-25T10:40:45.107-04:00Data Engineering Process Fundamentals - Data Analysis and Visualization Exercise<p>Data analysis and visualization are fundamental to a data-driven decision-making process. To grasp the best strategy for our scenario, we delve into the data analysis and visualization phase of the process, making data models, analyzes and diagrams that allow us to tell stories from the data.</p>
<p>With the understanding of best practices for data analysis and visualization, we start by creating a code-based dashboard using Python, Pandas and Plotly. We then follow up by using a high-quality enterprise tool, such as Looker, to construct a low-code cloud-hosted dashboard, providing us with insights into the type of effort each method takes.</p>
<blockquote>
<p>π This is a dashboard created with Looker. Similar dashboards can be created with PowerBI and Tableau</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization-dashboard.png" alt="ozkary-data-engineering-analysis-visualization-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Dashboard"></p>
<p>Once we have designed our dashboard, we can align it with our initial requirements and proceed to formulate the data analysis conclusions, thereby facilitating informed business decisions for stakeholders. However, before delving into coding, let's commence by reviewing the data analysis specifications, which provide the blueprint for our implementation effort.</p>
<h2 id="specifications">Specifications</h2>
<p>At this stage of the process, we have a clear grasp of the requirements and a deep familiarity with the data. With these insights, we can now define our specifications as outlined below:</p>
<ul>
<li>Identify pertinent measures such as exits and entries</li>
<li>Conduct distribution analysis based on station<ul>
<li>This analysis delineates geographical boundary patterns</li>
</ul>
</li>
<li>Conduct distribution analysis based on days of the week and time slots</li>
</ul>
<p>By calculating the total count of passengers for arrivals and departures, we gain a holistic comprehension of passenger flow dynamics. Furthermore, we can employ distribution analysis to investigate variations across stations, days of the week, and time slots. These analyses provide essential insights for business strategy and decision-making, allowing us to identify peak travel periods, station preferences, and time-specific trends that can help us make informed decisions.</p>
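<p>These aggregations can be prototyped quickly with Pandas before building any dashboard. The sketch below assumes a DataFrame with the created_dt, station_name, entries, and exits columns used throughout this exercise; the file name is a placeholder.</p>
<pre><code class="lang-python"># Quick Pandas sketch of the totals and distribution analysis described above.
# Assumes created_dt, station_name, entries, and exits columns; file name is a placeholder.
import pandas as pd

df = pd.read_csv("turnstile_sample.csv", parse_dates=["created_dt"])

# Total passenger flow for the selected period
totals = df[["entries", "exits"]].sum()

# Distribution by station (top 10 busiest by entries)
by_station = df.groupby("station_name")[["entries", "exits"]].sum().nlargest(10, "entries")

# Distribution by day of the week
by_weekday = df.groupby(df["created_dt"].dt.day_name())[["entries", "exits"]].sum()

print(totals, by_station, by_weekday, sep="\n\n")
</code></pre>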
<h3 id="data-analysis-requirements">Data Analysis Requirements</h3>
<p>In our analysis process, we can adhere to these specified requirements:</p>
<ul>
<li>Determine distinct time slots for morning and afternoon analysis:<pre><code>12:00 AM - 3:59 AM
04:00 AM - 7:59 AM
08:00 AM - 11:59 AM
12:00 PM - 3:59 PM
04:00 PM - 7:59 PM
08:00 PM - 11:59 PM
</code></pre>
</li>
<li>Examine data regarding commuter exits (arrivals) and entries (departures)</li>
<li>Implement a master filter for date ranges, which exerts control over all charts</li>
<li>Incorporate a secondary filter component to facilitate station selection</li>
<li>Display the aggregate counts of entries and exits for the designated date range<ul>
<li>Employ score card components for this purpose</li>
</ul>
</li>
<li>Investigate station distributions to identify the most frequented stations<ul>
<li>Utilize donut charts, with the subway station name as the primary dimension</li>
</ul>
</li>
<li>Analyze distributions using the day of the week to unveil peak traffic days<ul>
<li>Employ bar charts to visualize entries and exits per day</li>
</ul>
</li>
<li>Explore distributions based on time slots to uncover daily peak hours<ul>
<li>Integrate bar charts to illustrate entries and exits within each time slot</li>
</ul>
</li>
</ul>
<h2 id="dashboard-design">Dashboard Design</h2>
<p>In the dashboard design, we can utilize a two-column layout, positioning the exits charts in the left column and the entries charts in the right column of the dashboard. Additionally, we can incorporate a header container to encompass the filters, date range, and station name. To support multiple devices, we need a responsive layout. We should note that a platform like Looker does not offer a truly responsive layout; instead, we define separate layouts for mobile and desktop.</p>
<p>Layout Configuration:</p>
<ul>
<li>Desktop 1200px by 900px</li>
<li>Mobile 360px by 1980px</li>
</ul>
<h3 id="ui-components">UI Components</h3>
<p>For our dashboard components, we should incorporate the following:</p>
<ul>
<li>Date range picker</li>
<li>Station name list box</li>
<li>For each selected measure (exits, entries), we should employ a set of the following components:<ul>
<li>Score cards for the total numbers</li>
<li>Donut charts for station distribution</li>
<li>Bar charts for day of the week distribution</li>
<li>Bar charts for time slot distribution</li>
</ul>
</li>
</ul>
<h2 id="review-the-code-code-centric">Review the Code - Code Centric</h2>
<p>The dashboard layout is done using HTML for the presentation and Python to build the different HTML elements using the <a href="https://dash.plotly.com/">Dash</a> library. All the charts are generated with Plotly.</p>
<pre><code class="lang-javascript"># <span class="hljs-type">Define</span> the layout <span class="hljs-keyword">of</span> the app
app.layout = html.<span class="hljs-type">Div</span>([
html.<span class="hljs-type">H4</span>(<span class="hljs-string">"MTA Turnstile Data Dashboard"</span>),
dcc.<span class="hljs-type">DatePickerRange</span>(
id=<span class="hljs-symbol">'date</span>-range',
start_date=data[<span class="hljs-symbol">'created_dt'</span>].min<span class="hljs-literal">()</span>,
end_date=data[<span class="hljs-symbol">'created_dt'</span>].max<span class="hljs-literal">()</span>,
display_format=<span class="hljs-symbol">'YYYY</span>-<span class="hljs-type">MM</span>-<span class="hljs-type">DD'</span>
),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dbc.<span class="hljs-type">Card</span>(
dbc.<span class="hljs-type">CardBody</span>([
html.<span class="hljs-type">P</span>(<span class="hljs-string">"Total Entries"</span>),
html.<span class="hljs-type">H5</span>(id=<span class="hljs-symbol">'total</span>-entries')
]),
className=<span class="hljs-symbol">'score</span>-card'
),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dbc.<span class="hljs-type">Card</span>(
dbc.<span class="hljs-type">CardBody</span>([
html.<span class="hljs-type">P</span>(<span class="hljs-string">"Total Exits"</span>),
html.<span class="hljs-type">H5</span>(id=<span class="hljs-symbol">'total</span>-exits')
]),
className=<span class="hljs-symbol">'score</span>-card'
),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'score</span>-cards'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'top</span>-entries-stations', className=<span class="hljs-symbol">'donut</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'top</span>-exits-stations', className=<span class="hljs-symbol">'donut</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'donut</span>-charts'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'exits</span>-by-day', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'entries</span>-by-day', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'bar</span>-charts'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'exits</span>-by-time', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'entries</span>-by-time', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'bar</span>-charts')
])
</code></pre>
<p>The provided Python code is building a web application dashboard layout using Dash, a Python framework for creating interactive web applications. This dashboard is designed to showcase insights and visualizations derived from MTA Turnstile Data. Here's a breakdown of the main components:</p>
<ul>
<li><p>App Layout: The <code>app.layout</code> defines the overall structure of the dashboard using the <code>html.Div</code> component. It acts as a container for all the displayed components</p>
</li>
<li><p>Title: <code>html.H4("MTA Turnstile Data Dashboard")</code> creates a header displaying the title of the dashboard</p>
</li>
<li><p>Date Picker Range: The <code>dcc.DatePickerRange</code> component allows users to select a date range for analysis. It's a part of Dash Core Components (<code>dcc</code>)</p>
</li>
<li><p>Score Cards: The <code>dbc.Row</code> and <code>dbc.Col</code> components create rows and columns for displaying score cards using <code>dbc.Card</code> and <code>dbc.CardBody</code>. These cards show metrics like "Total Entries" and "Total Exits"</p>
</li>
<li><p>Donut Charts: Another set of <code>dbc.Row</code> and <code>dbc.Col</code> components creates columns for displaying donut charts using the <code>dcc.Graph</code> component. These charts visualize the distribution of top entries and exits by station</p>
</li>
<li><p>Bar Charts: Similar to the previous sections, <code>dbc.Row</code> and <code>dbc.Col</code> components are used to create columns for displaying bar charts using the <code>dcc.Graph</code> component. These charts showcase the distribution of exits and entries by day of the week and time slot</p>
</li>
<li><p>CSS Classnames: The <code>className</code> attribute is used to apply CSS class names to the components, allowing for custom styling using CSS</p>
</li>
</ul>
<p>In summary, the code establishes the layout of the dashboard with distinct sections for date selection, score cards, donut charts, and bar charts. The various visualizations and metrics offer valuable insights into MTA Turnstile Data, enabling users to comprehend passenger flow patterns and trends effectively.</p>
<pre><code class="lang-python">
def update_dashboard(start_date, end_date):
filtered_data = data[(data[<span class="hljs-string">'created_dt'</span>] >= start_date) & (data[<span class="hljs-string">'created_dt'</span>] <= end_date)]
total_entries = filtered_data[<span class="hljs-string">'entries'</span>].sum() / <span class="hljs-number">1e12</span> # <span class="hljs-symbol">Convert</span> to trillions
total_exits = filtered_data[<span class="hljs-string">'exits'</span>].sum() / <span class="hljs-number">1e12</span> # <span class="hljs-symbol">Convert</span> to trillions
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
filtered_data[<span class="hljs-string">"created_dt"</span>] = pd.to_datetime(filtered_data[<span class="hljs-string">'created_dt'</span>])
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
exits_chart , entries_chart = create_station_donut_chart(filtered_data)
exits_chart_by_day ,entries_chart_by_day = create_day_bar_chart(filtered_data, measures)
exits_chart_by_time, entries_chart_by_time = create_time_bar_chart(filtered_data, measures)
return (
f<span class="hljs-string">"{total_entries:.2f}T"</span>,
f<span class="hljs-string">"{total_exits:.2f}T"</span>,
entries_chart,
exits_chart,
exits_chart_by_day,
entries_chart_by_day,
exits_chart_by_time,
entries_chart_by_time
)
</code></pre>
<p>The <code>update_dashboard</code> function is responsible for updating and refreshing the dashboard. It handles the date range change event. As the user changes the date range, this function takes in the start and end dates as inputs. The function then filters the dataset, retaining only the records falling within the specified date range. Subsequently, the function calculates key metrics for the dashboard's score cards. It computes the total number of entries and exits during the filtered time period, and these values are converted to trillions for better readability.</p>
<p>The code proceeds to generate various visual components for the dashboard. These components include donut charts illustrating station-wise entries and exits, bar charts showcasing entries and exits by day of the week, and another set of bar charts displaying entries and exits by time slot. Each of these visualizations is created using specialized functions like create_station_donut_chart, create_day_bar_chart, and create_time_bar_chart.</p>
<p>Finally, the function compiles all the generated components and calculated metrics into a tuple. This tuple is then returned by the update_dashboard function, containing values like total entries, total exits, and the various charts. </p>
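<p>For the function to receive the date range and feed the layout components, it has to be registered as a Dash callback. The sketch below shows one way the wiring could look, using the component ids defined in the layout; the exact callback definition in the project may differ.</p>
<pre><code class="lang-python"># Sketch: wire update_dashboard to the layout components with a Dash callback.
# Component ids come from the layout above; the exact wiring may differ.
from dash import Input, Output

app.callback(
    Output("total-entries", "children"),
    Output("total-exits", "children"),
    Output("top-entries-stations", "figure"),
    Output("top-exits-stations", "figure"),
    Output("exits-by-day", "figure"),
    Output("entries-by-day", "figure"),
    Output("exits-by-time", "figure"),
    Output("entries-by-time", "figure"),
    Input("date-range", "start_date"),
    Input("date-range", "end_date"),
)(update_dashboard)
</code></pre>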
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_station_donut_chart</span><span class="hljs-params">(df: pd.DataFrame )</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
creates the station distribution donut chart
"""</span>
top_entries_stations = df.groupby(<span class="hljs-string">'station_name'</span>).agg({<span class="hljs-string">'entries'</span>: <span class="hljs-string">'sum'</span>}).nlargest(<span class="hljs-number">10</span>, <span class="hljs-string">'entries'</span>)
top_exits_stations = df.groupby(<span class="hljs-string">'station_name'</span>).agg({<span class="hljs-string">'exits'</span>: <span class="hljs-string">'sum'</span>}).nlargest(<span class="hljs-number">10</span>, <span class="hljs-string">'exits'</span>)
entries_chart = px.pie(top_entries_stations, names=top_entries_stations.index, values=<span class="hljs-string">'entries'</span>,
title=<span class="hljs-string">'Top 10 Stations by Entries'</span>, hole=<span class="hljs-number">0.3</span>)
exits_chart = px.pie(top_exits_stations, names=top_exits_stations.index, values=<span class="hljs-string">'exits'</span>,
title=<span class="hljs-string">'Top 10 Stations by Exits'</span>, hole=<span class="hljs-number">0.3</span>)
entries_chart.update_traces(marker=dict(colors=px.colors.qualitative.Plotly))
exits_chart.update_traces(marker=dict(colors=px.colors.qualitative.Plotly))
<span class="hljs-keyword">return</span> entries_chart, exits_chart
</code></pre>
<p>The <code>create_station_donut_chart</code> function is responsible for generating donut charts to visualize the distribution of entries and exits across the top stations. It starts by selecting the top stations based on the highest entries and exits from the provided DataFrame. Using Plotly Express, the function then constructs two separate donut charts: one for the top stations by entries and another for the top stations by exits.</p>
<p>Each donut chart provides a graphical representation of the distribution, where each station is represented by a segment based on the number of entries or exits it recorded. The charts are presented in a visually appealing manner with a central hole for a more focused view.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_day_bar_chart</span><span class="hljs-params">(df: pd.DataFrame, measures: List[str])</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
Creates a bar chart using the week days from the given dataframe.
"""</span>
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
day_categories = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
group_by_date = df.groupby([<span class="hljs-string">"created_dt"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
df[<span class="hljs-string">'weekday'</span>] = pd.Categorical(df[<span class="hljs-string">'created_dt'</span>].dt.strftime(<span class="hljs-string">'%a'</span>),
categories=day_categories,
ordered=<span class="hljs-keyword">True</span>)
group_by_weekday = df.groupby(<span class="hljs-string">'weekday'</span>, as_index=<span class="hljs-keyword">False</span>)[measures].sum()
exits_chart_by_day = px.bar(group_by_weekday, x=<span class="hljs-string">'weekday'</span>, y=<span class="hljs-string">'exits'</span>, color=<span class="hljs-string">'weekday'</span>,
title=<span class="hljs-string">'Exits by Day of the Week'</span>, labels={<span class="hljs-string">'weekday'</span>: <span class="hljs-string">'Day of the Week'</span>, <span class="hljs-string">'exits'</span>: <span class="hljs-string">'Exits'</span>},
color_discrete_sequence=[<span class="hljs-string">'green'</span>])
entries_chart_by_day = px.bar(group_by_weekday, x=<span class="hljs-string">'weekday'</span>, y=<span class="hljs-string">'entries'</span>, color=<span class="hljs-string">'weekday'</span>,
title=<span class="hljs-string">'Entries by Day of the Week'</span>, labels={<span class="hljs-string">'weekday'</span>: <span class="hljs-string">'Day of the Week'</span>, <span class="hljs-string">'entries'</span>: <span class="hljs-string">'Entries'</span>},
color_discrete_sequence=[<span class="hljs-string">'orange'</span>])
<span class="hljs-comment"># Hide the legend on the side</span>
exits_chart_by_day.update_layout(showlegend=<span class="hljs-keyword">False</span>)
entries_chart_by_day.update_layout(showlegend=<span class="hljs-keyword">False</span>)
<span class="hljs-comment"># Return the chart</span>
<span class="hljs-keyword">return</span> exits_chart_by_day, entries_chart_by_day
</code></pre>
<p>The <code>create_day_bar_chart</code> function is responsible for generating bar charts that illustrate the distribution of data based on the day of the week. Due to the limitations of the date-time data type not inherently containing day information, the function maps the data to the corresponding day category. </p>
<p>To begin, the function calculates the sum of the specified measures (entries and exits) for each date in the DataFrame using group_by_date. Next, it creates a new column named 'weekday' that holds the abbreviated day names (Sun, Mon, Tue, etc.) by applying the strftime method to the 'created_dt' column. This column is then transformed into a categorical variable using predefined day categories, ensuring that the order of days is preserved.</p>
<p>Using the grouped data by 'weekday', the function constructs two separate bar charts using Plotly Express. One chart visualizes the distribution of exits by day of the week, while the other visualizes the distribution of entries by day of the week.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_time_bar_chart</span><span class="hljs-params">(df: pd.DataFrame, measures : List[str] )</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
Creates a bar chart using the time slot category
"""</span>
<span class="hljs-comment"># Define time (hr) slots</span>
time_slots = {
<span class="hljs-string">'12:00-3:59am'</span>: (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">0</span>),
<span class="hljs-string">'04:00-7:59am'</span>: (<span class="hljs-number">4</span>, <span class="hljs-number">7</span>, <span class="hljs-number">1</span>),
<span class="hljs-string">'08:00-11:59am'</span>: (<span class="hljs-number">8</span>, <span class="hljs-number">11</span>, <span class="hljs-number">2</span>),
<span class="hljs-string">'12:00-3:59pm'</span>: (<span class="hljs-number">12</span>, <span class="hljs-number">15</span>, <span class="hljs-number">3</span>),
<span class="hljs-string">'04:00-7:59pm'</span>: (<span class="hljs-number">16</span>, <span class="hljs-number">19</span>, <span class="hljs-number">4</span>),
<span class="hljs-string">'08:00-11:59pm'</span>: (<span class="hljs-number">20</span>, <span class="hljs-number">23</span>, <span class="hljs-number">5</span>)
}
<span class="hljs-comment"># Add a new column 'time_slot' based on time ranges</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">categorize_time</span><span class="hljs-params">(row)</span>:</span>
<span class="hljs-keyword">for</span> slot, (start, end, order) <span class="hljs-keyword">in</span> time_slots.items():
<span class="hljs-keyword">if</span> start <= row.hour <= end:
<span class="hljs-keyword">return</span> slot
df[<span class="hljs-string">'time_slot'</span>] = df[<span class="hljs-string">'created_dt'</span>].apply(categorize_time)
group_by_time = df.groupby(<span class="hljs-string">'time_slot'</span>, as_index=<span class="hljs-keyword">False</span>)[measures].sum()
<span class="hljs-comment"># Sort the grouped_data DataFrame based on the sorting value</span>
group_by_time_sorted = group_by_time.sort_values(by=[<span class="hljs-string">'time_slot'</span>], key=<span class="hljs-keyword">lambda</span> x: x.map({slot: sort_order <span class="hljs-keyword">for</span> slot, (_, _, sort_order) <span class="hljs-keyword">in</span> time_slots.items()}))
exits_chart_by_time = px.bar(group_by_time_sorted, x=<span class="hljs-string">'time_slot'</span>, y=<span class="hljs-string">'exits'</span>, color=<span class="hljs-string">'time_slot'</span>,
title=<span class="hljs-string">'Exits by Day of the Week'</span>, labels={<span class="hljs-string">'time_slot'</span>: <span class="hljs-string">'Time of Day'</span>, <span class="hljs-string">'exits'</span>: <span class="hljs-string">'Exits'</span>},
color_discrete_sequence=[<span class="hljs-string">'green'</span>])
entries_chart_by_time = px.bar(group_by_time_sorted, x=<span class="hljs-string">'time_slot'</span>, y=<span class="hljs-string">'entries'</span>, color=<span class="hljs-string">'time_slot'</span>,
title=<span class="hljs-string">'Entries by Day of the Week'</span>, labels={<span class="hljs-string">'time_slot'</span>: <span class="hljs-string">'Time of Day'</span>, <span class="hljs-string">'entries'</span>: <span class="hljs-string">'Entries'</span>},
color_discrete_sequence=[<span class="hljs-string">'orange'</span>])
<span class="hljs-comment"># Hide the legend on the side</span>
exits_chart_by_time.update_layout(showlegend=<span class="hljs-keyword">False</span>)
entries_chart_by_time.update_layout(showlegend=<span class="hljs-keyword">False</span>)
<span class="hljs-keyword">return</span> exits_chart_by_time, entries_chart_by_time
</code></pre>
<p>The <code>create_time_bar_chart</code> function is responsible for generating bar charts that depict the data distribution at specific times of the day. Just as with days of the week, the function maps and labels time ranges to create a new series, enabling the creation of these charts.</p>
<p>First, the function defines time slots using a dictionary, where each slot corresponds to a specific time range. For each data row, a new column named 'time_slot' is added based on the time ranges defined. This is achieved by using the categorize_time function, which checks the hour of the row's timestamp and assigns it to the appropriate time slot.</p>
<p>The data is then grouped by 'time_slot', and the sum of the specified measures (exits and entries) is calculated for each slot. To ensure that the time slots are displayed in the correct order, the grouped data is sorted based on a sorting value derived from the time slots' dictionary.</p>
<p>Using the grouped and sorted data, the function constructs two bar charts using Plotly Express. One chart visualizes the distribution of exits by time of day, while the other visualizes the distribution of entries by time of day. Each bar in the chart represents the sum of exits or entries for a specific time slot.</p>
<p>Once the implementation of this Python dashboard is complete, we can run it and see the following dashboard load in our browser:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-python-dash.png" alt="ozkary-data-engineering-analysis-visualization-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Python Dashboard"></p>
<h3 id="requirements">Requirements</h3>
<p>These are the requirements to be able to run the Python dashboard.</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step5-Analysis" target="_repo">Clone this repo</a> or copy the files from this folder. We could also create a GitHub CodeSpace and run this online.</p>
</blockquote>
<ul>
<li>Use the analysis_data.csv file for test data<ul>
<li>Use the local file for this implementation</li>
</ul>
</li>
<li>Install the Python dependencies <ul>
<li>Type the following from the terminal</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ pip <span class="hljs-keyword">install</span> pandas
$ pip <span class="hljs-keyword">install</span> plotly
$ pip <span class="hljs-keyword">install</span> dash
$ pip <span class="hljs-keyword">install</span> dash_bootstrap_components
</code></pre>
<h3 id="how-to-run-it">How to Run It</h3>
<p>After installing the dependencies and downloading the code, we should be able to run the code from a terminal by typing:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">python3</span> dashboard.<span class="hljs-keyword">py</span>
</code></pre>
<p>We should note that this is a simple implementation to illustrate the amount of effort it takes to build a dashboard using code. The code uses a local CSV file. If we need to connect to the data warehouse, we need to expand this code to make an authorized API call to the data warehouse (see the sketch below). Writing Python dashboards or creating Jupyter charts works well for small teams that work closely together and run experiments on the data. However, for an enterprise solution, we should look at using a tool like Looker or PowerBI. Let's take a look at that next.</p>
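<p>For reference, this is a minimal sketch of that authorized warehouse call, assuming the data warehouse is BigQuery and the service account credentials are already configured; the project and view names below are placeholders:</p>
<pre><code class="lang-python">from google.cloud import bigquery
import pandas as pd

def load_data_from_warehouse(project_id: str, query: str) -> pd.DataFrame:
    """Run an authorized query against the data warehouse and return a DataFrame."""
    client = bigquery.Client(project=project_id)
    return client.query(query).to_dataframe()

# Hypothetical usage: replace the local CSV read in the dashboard code
# df = load_data_from_warehouse('ozkary-de-101', 'SELECT * FROM mta_data.rpt_turnstile')
</code></pre>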
<h2 id="review-the-code-low-code">Review the Code - Low-Code</h2>
<p>Tools like Looker and PowerBI excel in data visualization, requiring little to no coding. These tools offer a plethora of visual elements for configuring dashboards, minimizing the need for extensive coding. For instance, these platforms effortlessly handle tasks like automatically displaying the day of the week from a date-time field.</p>
<p>In cases where an out-of-the-box solution is lacking, we might need to supplement it with a code snippet. For instance, consider our time range requirement. Since this is quite specific to our project, we must generate a new series with our desired labels. To achieve this, we introduce a new field that corresponds to the date-time hour value. When the field is created, we are essentially implementing a function.</p>
<p>The provided code reads the hour value from the date-time field and maps it to its corresponding label.</p>
<pre><code class="lang-python">CASE
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">3</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">4</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">8</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">11</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">12</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">15</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">16</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">20</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">20</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">23</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59pm"</span>
<span class="hljs-keyword">END</span>
</code></pre>
<h3 id="requirements">Requirements</h3>
<p>The only requirement here is to sign up with Looker Studio and have access to a data warehouse or database that can serve data and is accessible from external sources.</p>
<blockquote>
<p>π <a href="https://lookerstudio.google.com/">Sign-up for Looker Studio</a></p>
</blockquote>
<p>Other visualization tools: </p>
<ul>
<li><a href="https://powerbi.microsoft.com/">PowerBI</a></li>
<li><a href="https://www.tableau.com/">Tableau</a></li>
</ul>
<h3 id="looker-ui">Looker UI</h3>
<p>Take a look at the image below. This is the Looker UI. We should familiarize ourselves with the following areas:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-looker-design.png" alt="ozkary-data-engineering-analysis-visualization-looker" title="Data Engineering Process Fundamentals - Analysis and Visualization Looker design"></p>
<ul>
<li>Theme and Layout: Use it to configure the theme and change the layout for mobile or desktop</li>
<li>Add data: Use this to add a new data source</li>
<li>Add a chart: This allows us to add new charts</li>
<li>Add a control: Here, we can add the date range and station name list</li>
<li>Canvas: This is where we place all the components</li>
<li>Setup Pane: This allows us to configure the date range, dimension, measures, and sorting settings</li>
<li>Style Pane: Here, we can configure the colors and font</li>
<li>Data Pane: This displays the data sources with their fields. New fields are created as functions. When we hover over a field, we can see a function (fx) icon, which indicates that we can edit the function and configure our snippet</li>
</ul>
<h3 id="how-to-build-it">How to Build it</h3>
<p>Sign up for a Looker account or use another BI tool and follow these steps:</p>
<ul>
<li>Create a new dashboard</li>
<li>Click on the "Add Data" button</li>
<li>Use the connector for our data source:<ul>
<li>This should allow us to configure the credentials for access</li>
<li>Select the "rpt_turnstile" view, which already includes joins with the fact_table and dimension tables</li>
</ul>
</li>
<li>Once the data is loaded, we can see the dimensions and measures</li>
<li>Add the dashboard filters:<ul>
<li>Include a date range control for the filter, using the "created_dt" field</li>
<li>Add a list control and associate it with the station name</li>
</ul>
</li>
<li>Proceed to add the remaining charts:<ul>
<li>Ensure that all charts are associated with the date range dimension</li>
<li>This enables filtering to cascade across all the charts</li>
</ul>
</li>
<li>Utilize the "entries" and "exits" measures for all dashboards:<ul>
<li>Integrate two scorecards for the sum of entries and exits</li>
<li>Incorporate a donut chart to display exits and entries distribution by stations</li>
<li>Incorporate two bar charts (entries and exits) and use the weekday value from the "created_dt" dimension<ul>
<li>Sort them by the weekday. Use the day number (0-6), not the name (Sun-Sat). This is achieved by adding a new field with the following code and using it for sorting:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-title">WEEKDAY</span><span class="hljs-params">(created_dt)</span></span>
</code></pre>
<ul>
<li>Create the time slot dimension field (click "Add Field" and enter this definition):</li>
</ul>
<pre><code class="lang-python">CASE
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">3</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">4</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">8</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">11</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">12</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">15</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">16</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">19</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">20</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">23</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59pm"</span>
<span class="hljs-keyword">END</span>
</code></pre>
<ul>
<li>Add two bar charts (entries and exits) and use the time slot dimension:<ul>
<li>Use the hour value from the "created_dt" dimension for sorting by adding a new field and using it as your sorting criteria:</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-title">HOUR</span><span class="hljs-params">(created_dt)</span></span>
</code></pre>
<h3 id="view-the-dashboard">View the Dashboard</h3>
<p>After following all the specifications, we should be able to preview the dashboard in the browser. We can load an example of a dashboard by clicking on the link below:</p>
<blockquote>
<p>π <a href="https://lookerstudio.google.com/reporting/94749e6b-2a1f-4b41-aff6-35c6c33f401e/">View the dashboard online</a></p>
<p>π <a href="https://lookerstudio.google.com/s/qv_IQAC-gKU">View the mobile dashboard online</a></p>
</blockquote>
<p>This is an image of the mobile dashboard.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-looker-mobile.png" alt="ozkary-data-engineering-analysis-visualization-mobile-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Mobile Dashboard"></p>
<h2 id="data-analysis-conclusions">Data Analysis Conclusions</h2>
<p>By examining the dashboard, we can draw the following conclusions:</p>
<ul>
<li>Stations with the highest distribution represent the busiest locations</li>
<li>The busiest time slot for both exits and entries is between 4pm and 9pm</li>
<li>Every day of the week exhibits a high volume of commuters</li>
<li>Businesses can choose the stations near their locations for further analysis</li>
</ul>
<p>With these insights, strategies can be devised to optimize marketing campaigns, targeting users within geo-fenced areas close to the corresponding business locations and during specific hours of the day.</p>
<h2 id="summary">Summary</h2>
<p>We utilize our expertise in data analysis and visualization to construct charts and build them into dashboards. We adopt two distinct approaches for dashboard creation: a code-centric method and a low-code enterprise solution like Looker. After a comprehensive comparison, we deduce that the code-centric approach is optimal for small teams, whereas it might not suffice for enterprise users, especially when targeting executive stakeholders.</p>
<p>Lastly, as the dashboard becomes operational, we transition into the role of business analysts, deciphering insights from the data. This enables us to offer answers aligned with our original requirements.</p>
<h2 id="next">Next</h2>
<p>We have successfully completed our data pipeline from CSV files to our data warehouse and dashboard. Now, let's explore an advanced concept in data engineering: data streaming, which facilitates real-time data integration. This involves the continuous and timely processing of incoming data. Technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://spark.apache.org/">Apache Spark</a> play a crucial role in enabling efficient data streaming processes. Let's take a closer look at these components next.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Real-Time Data]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-69055272830426215752023-07-01T11:49:00.008-04:002023-08-25T10:41:40.118-04:00Data Engineering Process Fundamentals - Data Analysis and Visualization<p>After completing our data warehouse design and implementation, our data pipeline should be fully operational. We can move forward with the analysis and visualization step of our process. Data analysis entails exploring, comprehending, and reshaping data to yield insights, thereby enabling stakeholders to make informed business decisions. Conversely, data visualization employs these insights to adeptly convey information via visual elements, encompassing charts and dashboards.</p>
<blockquote>
<p>π <a href="https://www.ozkary.dev/data-engineering-process-fundamentals-data-warehouse-transformation/">Data Engineering Process Fundamentals - Data Warehouse and Transformation</a></p>
</blockquote>
<p>Data analysis entails utilizing guidelines and patterns to guide the selection of appropriate analyses tailored to the specific use case. For instance, a Business Analyst (BA) focuses on examining data summations and aggregations across categorical dimensions such as date or station name. Conversely, a Manufacturing Quality Engineer (MQE) prioritizes the examination of statistical data, encompassing metrics like the mean and standard deviation.</p>
<p>In data visualization, we follow guidelines and design patterns to determine the appropriate chart for our data. For instance, a Business Intelligence (BI) dashboard may employ bar and pie charts to monitor sales performance in specific regions, while a Quality Control (QA) dashboard might utilize box plots, bell curves, and control charts to assess manufacturing process quality.</p>
<p>Data analysis and visualization are fundamental to a data-driven decision-making process. To grasp the best strategy for our scenario, we now dive deeper into this process by using a sample dataset from our data warehouse to illustrate the approach with examples.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization-flow.png" alt="ozkary-data-engineering-analysis-visualization" title="Data Engineering Process Fundamentals - Analysis and Visualization"></p>
<h2 id="data-analysis">Data Analysis</h2>
<p>Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as analysis to identify outliers, trends, and distributions, and to perform hypothesis testing. We can accomplish these activities by writing code snippets with Python and Pandas, using Visual Studio Code or Jupyter Notebooks. What's more, we can use libraries such as Plotly to generate visuals that help us further analyze the data and create prototypes.</p>
<p>For low-code tools, the analysis can be done using a smart and rich user interface that automatically discovers the metadata to identify dataset properties like dimensions and measures. With little to no code, these tools can help us model the data and create charts and dashboards.</p>
<h3 id="data-profiling">Data Profiling</h3>
<p>Data profiling is the process of identifying the data types, dimensions, measures, and quantitative values, which allows the analyst to understand the characteristics of the data and decide how to group the information. </p>
<ul>
<li><p>Data Types: This is the type classification of the data fields. It enables us to identify categorical (text), numeric and date-time values. The date-time data type is especially important as it provides us with the ability to slice the numeric values with a date range, specific dates and times (e.g., hourly)</p>
</li>
<li><p>Dimensions: Dimensions are textual, and categorical attributes that describe business entities. They are often discrete and used for grouping, filtering, and organizing data</p>
</li>
<li><p>Measures: Measures are the quantitative values that are subject to calculations such as sum, average, minimum, maximum, etc. They represent the KPIs that the organization wants to track and analyze</p>
</li>
</ul>
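<p>Before running any aggregations, a quick way to surface these properties is to inspect the DataFrame schema and summary statistics. This is a minimal sketch using the sample dataset:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.read_csv('./analysis_data.csv')

# Data types: categorical (object), numeric and date-time fields
print(df.dtypes)

# Candidate dimensions (text) and measures (numeric)
dimensions = df.select_dtypes(include='object').columns.tolist()
measures = df.select_dtypes(include='number').columns.tolist()
print('Dimensions:', dimensions)
print('Measures:', measures)

# Basic summary statistics for the measures
print(df[measures].describe())
</code></pre>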
<p>As an example of data profiling, we can inspect the average of arrivals and departures at certain time slots. This can help us identify patterns at different times.</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step5-Analysis" target="_repo">Clone this repo</a> or copy the files from this folder. Use Jupyter Notebook file.</p>
</blockquote>
<pre><code class="lang-python">
import pandas <span class="hljs-keyword">as</span> pd
# use the sample dataset in this path Step5-Analysis/analysis_data.csv
df = pd.read_csv(<span class="hljs-string">'./analysis_data.csv'</span>, iterator=False)
df.head(<span class="hljs-number">10</span>)
# Define time (hr) slots
time_slots = {
<span class="hljs-string">'morning'</span>: (<span class="hljs-number">8</span>, <span class="hljs-number">11</span>),
<span class="hljs-string">'afternoon'</span>: (<span class="hljs-number">12</span>, <span class="hljs-number">15</span>),
<span class="hljs-string">'night'</span>: (<span class="hljs-number">16</span>, <span class="hljs-number">20</span>)
}
# cast the date column <span class="hljs-keyword">to</span> datetime
df[<span class="hljs-string">"created_dt"</span>] = pd.to_datetime(df[<span class="hljs-string">'created_dt'</span>])
df[<span class="hljs-string">"exits"</span>] = df[<span class="hljs-string">"exits"</span>].astype(<span class="hljs-keyword">int</span>)
df[<span class="hljs-string">"entries"</span>] = df[<span class="hljs-string">"entries"</span>].astype(<span class="hljs-keyword">int</span>)
# Calculate average arrivals (exits) <span class="hljs-built_in">and</span> departures (entries) <span class="hljs-keyword">for</span> each time slot
<span class="hljs-keyword">for</span> slot, (start_hour, end_hour) in time_slots.<span class="hljs-built_in">items</span>():
slot_data = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= start_hour) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < end_hour)]
avg_arrivals = slot_data[<span class="hljs-string">'exits'</span>].mean()
avg_departures = slot_data[<span class="hljs-string">'entries'</span>].mean()
<span class="hljs-keyword">print</span>(<span class="hljs-keyword">f</span><span class="hljs-string">"{slot.capitalize()} - Avg Arrivals: {avg_arrivals:.2f}, Avg Departures: {avg_departures:.2f}"</span>)
# output
Morning - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">30132528.64</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">37834954.08</span>
Afternoon - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">30094161.08</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">37482421.78</span>
Night - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">29513309.25</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">36829260.66</span>
</code></pre>
<p>The code calculates the average arrivals and departures for each time slot. It prints out the results for each time slot, helping us identify the patterns of commuter activity during different times of the day.</p>
<h3 id="data-cleaning-and-preprocessing">Data Cleaning and Preprocessing</h3>
<p>Data cleaning and preprocessing is the process of finding bad data and outliers that can affect the results. Bad data could be null values or values that are not within the range of the average trend for that day. These kinds of data problems should have been identified during the data load process, but it is always a best practice to repeat this check, even when the data comes from a trusted source.</p>
<blockquote>
<p>πOutliers are values that are notably different from the other data points in terms of magnitude or distribution. They can be either unusually high (positive outliers) or unusually low (negative outliers) in comparison to the majority of data points.</p>
</blockquote>
<p>For example, we might want to look at stations where the average number of arrivals in the morning differs unusually from the average number of departures in the evening. In a normal pattern, both values should fall within a similar threshold.</p>
<pre><code class="lang-python"><span class="hljs-comment"># get the departures and arrivals for each station at the morning and night time slots</span>
df_morning_arrivals = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= time_slots[<span class="hljs-string">'morning'</span>][<span class="hljs-number">0</span>]) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < time_slots[<span class="hljs-string">'morning'</span>][<span class="hljs-number">1</span>])]
df_night_departures = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= time_slots[<span class="hljs-string">'night'</span>][<span class="hljs-number">0</span>] ) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < time_slots[<span class="hljs-string">'night'</span>][<span class="hljs-number">1</span>])]
<span class="hljs-comment"># Calculate the mean arrivals and departures for each station</span>
mean_arrivals_by_station = df_morning_arrivals.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'exits'</span>].mean()
mean_departures_by_station = df_night_departures.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'entries'</span>].mean()
<span class="hljs-comment"># Calculate the z-scores for the differences between mean arrivals and departures</span>
z_scores = (mean_arrivals_by_station - mean_departures_by_station) / np.sqrt(mean_arrivals_by_station.<span class="hljs-keyword">var</span>() + mean_departures_by_station.<span class="hljs-keyword">var</span>())
<span class="hljs-comment"># Set a z-score threshold to identify outliers</span>
z_score_threshold = <span class="hljs-number">1.95</span> <span class="hljs-comment"># You can adjust this value based on your needs</span>
<span class="hljs-comment"># Identify stations with outliers</span>
outlier_stations = z_scores[abs(z_scores) > z_score_threshold]
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Stations with outliers:"</span>)
<span class="hljs-built_in">print</span>(outlier_stations)
<span class="hljs-comment"># output</span>
Stations <span class="hljs-keyword">with</span> outliers:
station_name
<span class="hljs-number">183</span> ST -<span class="hljs-number">3.170777</span>
BAYCHESTER AV -<span class="hljs-number">4.340479</span>
JACKSON AV -<span class="hljs-number">4.215668</span>
NEW LOTS <span class="hljs-number">3.124990</span>
</code></pre>
<p>The output shows that there is a significant difference in the number of arrivals (morning) at these stations compared to departures later in the evening. This issue could be a result of some missing data or perhaps an event that caused the difference in commuters.</p>
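<p>As a follow-up, this is a minimal cleaning sketch, assuming we decide to drop rows with missing values and exclude the flagged stations from further analysis; the right remediation ultimately depends on the root cause:</p>
<pre><code class="lang-python"># Drop rows with missing values in the key columns
df_clean = df.dropna(subset=['created_dt', 'station_name', 'entries', 'exits'])

# Exclude the stations flagged as outliers from further analysis
df_clean = df_clean[~df_clean['station_name'].isin(outlier_stations.index)]
print(f"Rows before: {len(df)}, after cleaning: {len(df_clean)}")
</code></pre>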
<h3 id="statistical-analysis">Statistical Analysis</h3>
<p>Statistical analysis focuses on applying statistical techniques in order to draw meaningful conclusions about a set of data. It involves mathematical computations, probability theory, correlation analysis, and hypothesis testing to make inferences and predictions based on the data. </p>
<p>An example of statistical analysis is to describe the statistics for the numeric data and plot the relationships between two measures.</p>
<pre><code class="lang-python"># <span class="hljs-symbol">Summary</span> statistics
measures = [<span class="hljs-string">'entries'</span>,<span class="hljs-string">'exits'</span>]
dims = [<span class="hljs-string">'station_name'</span>]
# <span class="hljs-symbol">Filter</span> rows for the month of <span class="hljs-symbol">July</span> for morning and night time slots
df_morning_july = df_morning_arrivals[df_morning_arrivals[<span class="hljs-string">'created_dt'</span>].dt.month == <span class="hljs-number">7</span>][measures + dims]
df_night_july = df_night_departures[df_night_departures[<span class="hljs-string">'created_dt'</span>].dt.month == <span class="hljs-number">7</span>][measures + dims]
correlation_data = []
for station in df_morning_july[<span class="hljs-string">'station_name'</span>].unique():
morning_arrival = df_morning_july[df_morning_july[<span class="hljs-string">'station_name'</span>] == station][<span class="hljs-string">'exits'</span>].values[<span class="hljs-number">0</span>]
evening_departure = df_night_july[df_night_july[<span class="hljs-string">'station_name'</span>] == station][<span class="hljs-string">'entries'</span>].values[<span class="hljs-number">0</span>]
correlation_data.append({<span class="hljs-string">'station_name'</span>: station, <span class="hljs-string">'arrivals'</span>: morning_arrival, <span class="hljs-string">'departures'</span>: evening_departure})
df_correlation = pd.<span class="hljs-symbol">DataFrame</span>(correlation_data)
# <span class="hljs-symbol">Select</span> top <span class="hljs-number">10</span> stations with most morning arrivals
top_stations = df_correlation.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'arrivals'</span>].sum().nlargest(<span class="hljs-number">10</span>).index
df_top_stations = df_correlation[df_correlation[<span class="hljs-string">'station_name'</span>].isin(top_stations)]
print(<span class="hljs-string">"Summary Statistics:"</span>)
print(df_top_stations[measures].describe() / <span class="hljs-number">10000</span>)
#output
<span class="hljs-symbol">Summary</span> <span class="hljs-symbol">Statistics</span>:
entries exits
count <span class="hljs-number">10.000000</span> <span class="hljs-number">10.000000</span>
mean <span class="hljs-number">3691.269728</span> <span class="hljs-number">2954.513148</span>
std <span class="hljs-number">20853.999335</span> <span class="hljs-number">18283.964419</span>
min <span class="hljs-number">0.000000</span> <span class="hljs-number">0.000000</span>
<span class="hljs-number">25</span><span class="hljs-comment">% 27.126200 19.537525</span>
<span class="hljs-number">50</span><span class="hljs-comment">% 135.898650 100.470600</span>
<span class="hljs-number">75</span><span class="hljs-comment">% 615.586650 445.015200</span>
max <span class="hljs-number">214717.057100</span> <span class="hljs-number">212147.622600</span>
# <span class="hljs-symbol">Create</span> a scatter matrix to visualize relationships between numeric columns
fig_scatter = plotly_x.scatter(df_top_stations, x=<span class="hljs-string">'arrivals'</span>, y=<span class="hljs-string">'departures'</span>, color=<span class="hljs-string">'station_name'</span>,
title=<span class="hljs-string">'Morning Arrivals vs Evening Departures'</span>)
fig_scatter.show()
</code></pre>
<ul>
<li><code>df_top_stations.describe()</code> provides summary statistics for the numerical columns</li>
<li><code>plotly_x.scatter()</code> creates scatter plots to visualize relationships between numerical columns</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-jupyter-scatter-chart.png" alt="ozkary-data-engineering-analysis-visualization-jupyter" title="Data Engineering Process Fundamentals - Analysis and Visualization Jupyter Scatter Chart"></p>
<p>These statistics can help us identify trends, correlations, and relationships in our data, allowing us to gain insights and make informed decisions about further analysis or modeling.</p>
<blockquote>
<p>π Data correlation refers to the degree to which two or more variables change together. It indicates the strength and direction of the linear relationship between variables. The correlation coefficient is a value between -1 and 1. A 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation. A correlation coefficient near 0 suggests a weak or no linear relationship between the variables.</p>
</blockquote>
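<p>As a quick check of this definition, pandas can compute the pairwise Pearson correlation for our measures directly:</p>
<pre><code class="lang-python"># Pairwise Pearson correlation between morning arrivals and evening departures
print(df_correlation[['arrivals', 'departures']].corr())
</code></pre>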
<h4 id="hypothesis-testing">Hypothesis Testing</h4>
<p>In hypothesis testing, we use statistical methods to validate assumptions and draw conclusions. On the previous scatter chart, we can see that there appears to <strong>not be</strong> a strong correlation between arrivals and departures for the top 10 stations with most arrivals. This fact could be an area of interest for the analysis, and we may want to take a deeper look by running a test.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Perform Pearson correlation test</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_arrival_departure_correlation</span><span class="hljs-params">(df: pd.DataFrame, label: str)</span> -> <span class="hljs-keyword">None</span>:</span>
corr_coefficient, p_value = pearsonr(df[<span class="hljs-string">'arrivals'</span>], df[<span class="hljs-string">'departures'</span>])
p_value = round(p_value, <span class="hljs-number">5</span>)
<span class="hljs-keyword">if</span> p_value < <span class="hljs-number">0.05</span>:
conclusion = f<span class="hljs-string">"The correlation {label} is statistically significant."</span>
<span class="hljs-keyword">else</span>:
conclusion = f<span class="hljs-string">"The correlation {label} is not statistically significant."</span>
print(f<span class="hljs-string">"Pearson Correlation {label} - Coefficient : {corr_coefficient} P-Value : {p_value}"</span>)
print(f<span class="hljs-string">"Conclusion: {conclusion}"</span>)
test_arrival_departure_correlation(df_top_stations, <span class="hljs-string">'top-10 stations'</span>)
test_arrival_departure_correlation(df_correlation, <span class="hljs-string">'all stations'</span>)
<span class="hljs-comment"># output</span>
Pearson Correlation top<span class="hljs-number">-10</span> stations - Coefficient : <span class="hljs-number">-0.14112</span> P-Value : <span class="hljs-number">0.69738</span>
Conclusion: The correlation top<span class="hljs-number">-10</span> stations <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> statistically significant.
Pearson Correlation all stations - Coefficient : <span class="hljs-number">0.73803</span> P-Value : <span class="hljs-number">0.0</span>
Conclusion: The correlation all stations <span class="hljs-keyword">is</span> statistically significant.
</code></pre>
<p>Let's take a look at the output and explain what is going on. A correlation coefficient of -0.14 suggests a weak negative correlation between the variables being analyzed. The p-value of 0.69 is relatively high, which indicates that the correlation is not statistically significant. This means that a high number of arrivals in the morning does not correspond to a similar number of departures in the evening for the top-10 stations.</p>
<p>If we compare the entire data frame with all the stations, we see a correlation of .73 (close to 1) and a p-value of 0, which indicates a statistically significant correlation for the entire dataset. By looking at the full data sample, we can see that there is in fact a correlation: an increase in morning arrivals is associated with an increase in departures later in the day.</p>
<h3 id="business-intelligence-and-reporting">Business Intelligence and Reporting</h3>
<p>Business intelligence (BI) is a strategic approach that involves the collection, analysis, and presentation of data to facilitate informed decision-making within an organization. In the context of business analytics, BI is a powerful tool for extracting meaningful insights from data and turning them into actionable strategies. </p>
<p>A Business Analyst (BA) uses a systematic approach to uncover valuable insights from data. For example, by calculating the total number of passengers for arrivals and departures, we gain a comprehensive understanding of passenger flow dynamics. Furthermore, we can employ distribution analysis to investigate variations across stations, days of the week, and time slots. These analyses provide essential insights for business strategy and decision-making, allowing us to identify peak travel periods, station preferences, and time-specific trends that directly influence business operations.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate total passengers for arrivals and departures</span>
total_arrivals = df[<span class="hljs-string">'exits'</span>].sum()
total_departures = df[<span class="hljs-string">'entries'</span>].sum()
print(f<span class="hljs-string">"Total Arrivals: {total_arrivals} Total Departures: {total_departures}"</span>)
<span class="hljs-comment"># output</span>
Total Arrivals: <span class="hljs-number">2954513147693</span> Total Departures: <span class="hljs-number">3691269727684</span>
<span class="hljs-comment"># Create distribution analysis by station</span>
df_by_station = df.groupby([<span class="hljs-string">"station_name"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
print(df_by_station.head(<span class="hljs-number">5</span>))
<span class="hljs-comment">#output</span>
station_name entries exits
<span class="hljs-number">0</span> <span class="hljs-number">1</span> AV <span class="hljs-number">41921835330</span> <span class="hljs-number">4723874242</span>
<span class="hljs-number">1</span> <span class="hljs-number">103</span> ST <span class="hljs-number">1701063755</span> <span class="hljs-number">1505114656</span>
<span class="hljs-number">2</span> <span class="hljs-number">104</span> ST <span class="hljs-number">60735889120</span> <span class="hljs-number">35317207533</span>
<span class="hljs-number">3</span> <span class="hljs-number">111</span> ST <span class="hljs-number">1856383672</span> <span class="hljs-number">840818137</span>
<span class="hljs-number">4</span> <span class="hljs-number">116</span> ST <span class="hljs-number">7419106031</span> <span class="hljs-number">8292936323</span>
<span class="hljs-comment"># Create distribution analysis by day of the week</span>
df_by_date = df.groupby([<span class="hljs-string">"created_dt"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
day_order = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
df_by_date[<span class="hljs-string">"weekday"</span>] = pd.Categorical(df_by_date[<span class="hljs-string">"created_dt"</span>].dt.strftime(<span class="hljs-string">'%a'</span>), categories=day_order, ordered=<span class="hljs-keyword">True</span>)
df_entries_by_date = df_by_date.groupby([<span class="hljs-string">"weekday"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
print(df_entries_by_date.head(<span class="hljs-number">5</span>))
<span class="hljs-comment"># output</span>
weekday entries exits
<span class="hljs-number">0</span> Sun <span class="hljs-number">83869272617</span> <span class="hljs-number">53997290047</span>
<span class="hljs-number">1</span> Mon <span class="hljs-number">839105447014</span> <span class="hljs-number">667971771875</span>
<span class="hljs-number">2</span> Tue <span class="hljs-number">723988041023</span> <span class="hljs-number">592238758942</span>
<span class="hljs-number">3</span> Wed <span class="hljs-number">728728461351</span> <span class="hljs-number">594670413050</span>
<span class="hljs-number">4</span> Thu <span class="hljs-number">80966812864</span> <span class="hljs-number">51232966458</span>
<span class="hljs-comment"># Create distribution analysis time slots</span>
<span class="hljs-keyword">for</span> slot, (start_hour, end_hour) <span class="hljs-keyword">in</span> time_slots.items():
slot_data = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= start_hour) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour <= end_hour)]
arrivals = slot_data[<span class="hljs-string">'exits'</span>].sum()
departures = slot_data[<span class="hljs-string">'entries'</span>].sum()
print(f<span class="hljs-string">"{slot.capitalize()} - Arrivals: {arrivals:.2f}, Departures: {departures:.2f}"</span>)
<span class="hljs-comment"># output</span>
Morning - Arrivals: <span class="hljs-number">494601773970.00</span>, Departures: <span class="hljs-number">619832037915.00</span>
Afternoon - Arrivals: <span class="hljs-number">493029769709.00</span>, Departures: <span class="hljs-number">615375337214.00</span>
Night - Arrivals: <span class="hljs-number">814729184132.00</span>, Departures: <span class="hljs-number">1008230417627.00</span>
</code></pre>
<p>BI analysis is important in helping us understand the data, which can then be communicated to stakeholders so that we can make decisions based on which information is more relevant to the organization.</p>
<h2 id="data-visualization">Data Visualization</h2>
<p>Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance. Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface.</p>
<p>Imagine a scenario where we have analyzed passenger data using Python and determined that certain stations experience higher passenger volumes during specific times of the day. Translating this into a dashboard, we can use donut graphs to show the distribution of passenger counts on stations, bar graphs to visualize passenger trends over different times of the day, and scorecards to highlight key metrics like total passengers.</p>
<p>Such a dashboard offers a comprehensive view of the data, enabling quick comparisons, trend identification, and actionable insights. Instead of sifting through numbers, stakeholders can directly observe the patterns, correlations, and anomalies, leading to informed decision-making. This visualization approach enhances communication, collaboration, and comprehension among teams, making it an essential tool for data-driven organizations.</p>
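<p>As an illustration, this is a minimal Plotly sketch of two of those components, reusing the station-level aggregation (<code>df_by_station</code>) from the analysis above:</p>
<pre><code class="lang-python">import plotly.express as px

# Donut chart: distribution of arrivals (exits) by station
donut = px.pie(df_by_station, names='station_name', values='exits',
               hole=0.4, title='Arrivals Distribution by Station')
donut.show()

# Scorecard-style metric: total passengers (arrivals)
total_arrivals = df_by_station['exits'].sum()
print(f"Total Arrivals: {total_arrivals}")
</code></pre>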
<h3 id="types-of-data-visualizations">Types of Data Visualizations</h3>
<p>There are a few terms that are used interchangeably when it comes to data visualization, but in reality there are subtle differences and specific uses between them. Let's review them in more detail.</p>
<ul>
<li><p>Chart: A chart is a visual representation that displays data points, trends, and patterns. It uses graphical elements such as bars, lines, or pie charts to depict data relationships. Charts are focused on illustrating specific data comparisons or distributions, making it easier for viewers to understand data at a glance. </p>
</li>
<li><p>Graph: A graph is a broader term that encompasses both charts and diagrams. It's used to represent data visually and can include a variety of visual elements, including nodes, edges, bars, lines, and more. Graphs are often used to showcase relationships and connections among various data points, allowing viewers to understand complex structures or networks</p>
</li>
<li><p>Report: A report is a structured document that provides a comprehensive overview of data analysis, findings, and insights. It typically includes a mix of textual explanations, tables, charts, and graphs. Reports are designed to convey detailed information and can be several pages long. They often include an executive summary, methodology, results, and recommendations</p>
</li>
<li><p>Dashboard: A dashboard is a visual display of key performance indicators (KPIs) and metrics that offers a real-time snapshot of business data. Dashboards consolidate multiple visual elements like charts, graphs, gauges, and tables onto a single screen. They are interactive and customizable, allowing users to monitor trends, and identify anomalies. Dashboards provide a quick and holistic view of business performance</p>
</li>
</ul>
<p>In summary, a chart is a specific type of visual representation focusing on data points, a graph represents broader data relationships, a report is a structured document presenting detailed analysis, and a dashboard is an interactive screen displaying real-time KPIs and metrics. Each serves a unique purpose in effectively communicating information to different types of audiences.</p>
<h3 id="dashboard-design-principles">Dashboard Design Principles</h3>
<p>Designing effective dashboards requires attention to several key principles to ensure clarity, usability, and the ability to convey insights. Here's a short list of essential Dashboard Design Principles:</p>
<ul>
<li><p>User-Centered Design: Understand our audience and their needs. Design the dashboard to provide relevant and actionable information to specific user roles; executives, for example, only want to see the big picture, not the details</p>
</li>
<li><p>Clarity and Simplicity: Keep the design clean and uncluttered. Use a simple layout, meaningful titles, and avoid unnecessary decorations </p>
</li>
<li><p>Consistency: Maintain a consistent design across all dashboard components. Use the same color schemes, fonts, and visual styles to create a cohesive experience</p>
</li>
<li><p>Master Filter: Include a master filter that allows users to select a date range, segment, or other criteria. This synchronizes data across all components, ensuring a unified view</p>
</li>
<li><p>Data Context and Relationships: Clearly label components and provide context to explain data relationships. Help users understand the significance of each element</p>
</li>
<li><p>Whitespace: Use whitespace effectively to separate components and enhance readability. Proper spacing reduces visual clutter</p>
</li>
<li><p>Real-Time Updates: If applicable, ensure that the dashboard provides real-time or near-real-time data updates for accurate decision-making</p>
</li>
<li><p>Mobile Responsiveness: Design the dashboard to be responsive across various devices and screen sizes, ensuring usability on both desktop and mobile</p>
</li>
<li><p>Testing and Iteration: Test the dashboard with actual users and gather feedback. Iterate on the design based on user insights and preferences</p>
</li>
</ul>
<p>Effective dashboard design not only delivers data but also tells a story. It guides users through insights, highlights trends, and supports data-driven decision-making. Applying these principles will help create dashboards that are intuitive, informative, and impactful.</p>
<h3 id="data-visualization-tools">Data Visualization Tools</h3>
<p>The data visualization tools can be divided into code-centric and low-code solutions. A code-centric solution involves writing programs to manage the data analysis and visuals. A low-code solution uses cloud-hosted tools that accelerate the data analysis and visualization. Instead of focusing on code, a low-code tool enables data professionals to focus on the data. Let's review some of these tools in more detail:</p>
<ul>
<li><p>Python, coupled with libraries like Plotly, offers a versatile platform for data visualization that comes with its own set of advantages and limitations. This code-centric approach enables data professionals to integrate data analysis and visualization seamlessly, and it is particularly suited for individual research, in-depth analysis, and presentations in a controlled setting</p>
</li>
<li><p><a href="https://lookerstudio.google.com/">Looker Studio</a> is a powerful low-code, cloud-hosted business intelligence and data visualization platform that empowers organizations to explore, analyze, and share insights from their data. It offers a user-friendly interface that allows users to create interactive reports, dashboards, and visualizations </p>
</li>
<li><a href="https://powerbi.microsoft.com/">Microsoft Power BI</a> is a widely-used low-code, cloud-hosted data visualization and business intelligence tool. It seamlessly integrates with other Microsoft tools and services, making it a popular choice for organizations already in the Microsoft ecosystem. Power BI offers an intuitive drag-and-drop interface for building interactive reports and dashboards. Its extensive library of visuals and custom visuals allows users to create compelling data representations</li>
<li><a href="https://www.tableau.com/">Tableau</a>, acquired by Salesforce, is renowned for its cloud-hosted and low-code data visualization capabilities. It provides users with an array of options for creating dynamic and interactive visuals. Tableau's "drag-and-drop" approach simplifies the process of connecting to various data sources and creating insightful dashboards.</li>
</ul>
<p>Each of these tools offers unique features and benefits, catering to different user preferences and organizational needs. Whether it's Looker's focus on data modeling, Power BI's integration with Microsoft products, or Tableau's flexibility and advanced analytics capabilities, these tools play a significant role in empowering users to unlock insights from their data.</p>
<h2 id="summary">Summary</h2>
<p>Data analysis involves meticulous exploration, transformation, and comprehension of raw data to identify meaningful insights. There are guidelines and design patterns to follow for each specific use case. A BA might focus on KPIs, while a QAE might focus on statistical analysis of process quality. These insights, however, find their true value through data visualization. A code-centric approach with Python, aided by Plotly, offers potent tools for crafting analyses and visuals, but a low-code, cloud-hosted platform is often the better choice for broader sharing and enterprise scenarios. </p>
<p>In conclusion, the synergy between data analysis and visualization is pivotal for data-driven projects. Navigating data analysis with established principles and communicating insights through visually engaging dashboards empowers us to extract value from data. Whether opting for code-centric or low-code solutions, the choice of tooling and platform hinges on the balance between team expertise and target audience.</p>
<h2 id="exercise-data-analysis-and-visualization">Exercise - Data Analysis and Visualization</h2>
<p>With a better understanding of the data analysis and visualization process, the next step is to put these concepts into practice through a hands-on exercise. In this lab, we can continue our data engineering process and create a dashboard that will meet the requirements established in the discovery phase. </p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/07/data-engineering-process-fundamentals-data-analysis-visualization-exercise.html">Data Engineering Process Fundamentals - Data Analysis and Visualization Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-64713372728681874152023-06-17T11:47:00.014-04:002023-08-04T11:34:05.895-04:00Data Engineering Process Fundamentals - Data Warehouse and Transformation Exercise<p>In this hands-on lab, we build upon our data engineering process where we previously focused on defining a data pipeline orchestration process. Now, we should focus on storing and making the data accessible for visualization and analysis. So far, our data is stored in a Data Lake, while Data Lakes excel at handling vast volumes of data, they are not optimized for query performance, so our step is to enable the bulk data processing and analytics by working on our Data Warehouse (DW).</p>
<p>During this exercise, we delve into the data warehouse design and implementation step, crafting robust data models, and designing transformation tasks. We explore how to efficiently load, cleanse, and merge data, ultimately creating dimension and fact tables. Additionally, we discuss areas like query performance, testability, and source control of our code, ensuring a reliable and scalable data solution. By leveraging incremental models, we continuously update our data warehouse with only the deltas (new updates), optimizing query performance and enhancing the overall data pipeline. By the end, we have a complete data pipeline, taking data from CSV to our data warehouse, equipped for seamless visualization and analysis.</p>
<h2 id="data-warehouse-design">Data Warehouse Design</h2>
<p>A data warehouse is an OLAP system, which serves as the central data repository for historical and aggregated data. In contrast to the ETL process employed by data lakes with Python code, a data warehouse relies on the ELT process. This fundamental distinction emphasizes the need for well-defined and optimized models within the database, enabling efficient data access and exceptional performance. </p>
<blockquote>
<p>π For the ETL process, the data is transformed before adding it to storage. For the ELT process, the data is first loaded in storage in its raw format, the transformation is then done before inserting into the dimension and fact tables.</p>
</blockquote>
<p>Before building the concrete tables, our initial focus is on creating precise data models based on thorough analysis and specific requirements. To achieve this, we leverage SQL (Structured Query Language) and tools that facilitate model development in an automated, testable, and repeatable manner. By incorporating such tools into our project, we build the data services area in which we manage the data modeling and transformation to expand our architecture into the following:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-architecture.png" alt="ozkary-data-engineering-data-warehouse-architecture" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Architecture"></p>
<blockquote>
<p>π For our use case, we are using <a href="https://cloud.google.com/bigquery/">Google BigQuery</a> as our data warehouse system. Make sure to review the Data Engineering Process - Design and Planning section and run the Terraform script to provision this resource.</p>
</blockquote>
<h3 id="external-tables">External Tables</h3>
<p>An external table is not physically hosted within the data warehouse database. Since our raw data is stored in a data lake, we can reference that location and load those files as an external table, using a file pattern to select all the compressed files as the source. </p>
<p>The following SQL can be executed as a query on the data warehouse. Access to the data lake should already be configured when the service accounts were assigned to the resources during the design and planning phase.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">EXTERNAL</span> <span class="hljs-keyword">TABLE</span> mta_data.ext_turnstile
OPTIONS (
<span class="hljs-keyword">format</span> = <span class="hljs-string">'CSV'</span>,
uris = [<span class="hljs-string">'gs://ozkary_data_lake_ozkary-de-101/turnstile/*.csv.gz'</span>]
);
</code></pre>
<p>When this SQL script is executed, and the external table is created, the data warehouse retrieves the metadata about the external data, such as the schema, column names, and data types, without actually moving the data into the data warehouse storage. Once the external table is created, we can query the data using SQL as if it were a regular table. </p>
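<p>For example, a quick validation query against the external table can be run from Python with the BigQuery client; this is a minimal sketch that only assumes the <code>mta_data.ext_turnstile</code> table created above:</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# Validate the external table by counting the rows read from the data lake files
sql = "SELECT COUNT(*) AS total_rows FROM mta_data.ext_turnstile"
for row in client.query(sql).result():
    print(f"ext_turnstile rows: {row.total_rows}")
</code></pre>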
<h2 id="design-and-architecture">Design and Architecture</h2>
<p>During the design and architecture stage of our data warehouse project, our primary objective is to transition from conceptual ideas to concrete designs. Here, we make pivotal technical choices that pave the way for building the essential resources and defining our data warehouse approach. </p>
<h3 id="star-schema">Star Schema</h3>
<p>We start by selecting the Star Schema model. This model consists of a central fact table that is connected to multiple dimension tables via foreign key relationships. The fact table contains the measures or metrics, while the dimension tables hold descriptive attributes. </p>
<h3 id="infrastructure">Infrastructure</h3>
<p>For the infrastructure, we are using a cloud hosted OLAP system, Google BigQuery. This is a system that can handle petabytes of data. It also provides MPP (Massively Parallel Processing), built-in indexing and caching, which improve query performance and reduce compute by caching query results. The serverless architecture of these systems helps us reduce cost. Because the system is managed by the cloud provider, we can focus on the data analysis instead of infrastructure management.</p>
<h3 id="technology-stack">Technology Stack</h3>
<p>For the technology stack, we are using a SQL-centric approach. We want to be able to manage our models and transformation tasks within the memory context and processing power of the database, which tends to work best for large datasets and faster processing. In addition, this approach fits well with batch processing.</p>
<p><a href="https://www.getdbt.com/">dbt</a> (data build tool) is a SQL-centric framework which at its core is primarily focused on transforming data using SQL-based queries. It allows us to define data models and transformation logic using SQL and Jinja, a templating language with data transformation capabilities, such as loops, conditionals, and macros, within our SQL code. This framework enables us to build the actual data models as views, tables and SQL based transformation that are hosted on the data warehouse. </p>
<p>As we build code for our data model and transformation tasks, we need to track it, manage the different versions and automate the deployments to our database. To manage this, we use <a href="https://github.com/">GitHub</a>, which is a web-based platform that provides version control and collaborative features for software development and management. It also provides CI/CD capabilities to help us execute test plans, build releases and deploy them. dbt connects with GitHub to manage deployments. This enables the dbt orchestration features to run the latest code as part of the pipeline. </p>
<blockquote>
<p>π A deployment consists of getting the latest model metadata, building it on the database, and running the incremental data tasks when new data is available in the data lake.</p>
</blockquote>
<h2 id="data-warehouse-implementation">Data Warehouse Implementation</h2>
<p>The data warehouse implementation is the stage where the conceptual data model and design plans are transformed into a functional system by implementing the data models and writing the code for our transformation tasks.</p>
<h3 id="data-modeling">Data Modeling</h3>
<p>Data modeling is the implementation of the structure of the data warehouse, creating models (views) and entities (tables), defining attributes (columns), and establishing data relationships to ensure efficient querying and reporting. It is also important to identify the primary keys, foreign keys, and indexes to improve data retrieval performance. </p>
<p>To build our models, we should follow these specifications:</p>
<ul>
<li>Create an external table using the Data Lake folder and *.csv.gz file pattern as a source<ul>
<li>ext_turnstile</li>
</ul>
</li>
<li>Create the staging models<ul>
<li>Create the station view (stg_station) from the external table as source<ul>
<li>Get the unique stations </li>
<li>Create a surrogate key using the station name </li>
</ul>
</li>
<li>Create the booth view (stg_booth) from the external table as source <ul>
<li>Create a surrogate key using the booth UNIT and CA fields </li>
</ul>
</li>
<li>Create the fact view (stg_turnstile) from the external table as source<ul>
<li>Create a surrogate key using CA, UNIT, SCP, DATE, time</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="data-transformation">Data Transformation</h3>
<p>The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, applying naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.</p>
<p>For our transformation services, we follow these specifications:</p>
<ul>
<li>Use the staging models to build the physical models<ul>
<li>Map all the columns to our naming conventions, lowercase with underscores between words</li>
<li>Create the station dimension table (dim_station) from the stg_station model <ul>
<li>Add incremental strategy for ongoing new data </li>
</ul>
</li>
<li>Create the booth dimension table (dim_booth) from the stg_booth model <ul>
<li>Add incremental strategy for ongoing new data </li>
<li>Use the station_name to get the foreign key, station_id</li>
<li>Cluster the table by station_id </li>
</ul>
</li>
<li>Create the fact table (fact_turnstile) from the stg_turnstile model<ul>
<li>Add incremental strategy for ongoing new data </li>
<li>Partition the table by created_dt and use day granularity</li>
<li>Cluster the table by station_id</li>
<li>Join on dimension tables to use id references instead of text</li>
</ul>
</li>
</ul>
</li>
<li>Remove rows with null values for the required fields<ul>
<li>Station, CA, UNIT, SCP, DATE, TIME</li>
</ul>
</li>
<li>Cast columns to the correct data types<ul>
<li>created</li>
</ul>
</li>
<li>Continuously run all the models with an incremental strategy to append new records</li>
</ul>
<p>Our physical data model should look like this:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-star-schema.png" alt="ozkary-data-engineering-data-warehouse-star-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Star Schema"></p>
<h4 id="why-do-we-use-partitions-and-cluster">Why do we use partitions and cluster</h4>
<blockquote>
<p>π We should always review the technical specifications of the database system to find out what other best practices are recommended to improve performance.</p>
</blockquote>
<ul>
<li><p>Partitioning is the process of dividing a large table into smaller, more manageable parts based on a specified column. Each partition contains rows that share a common value, like a specific date. Partitioning improves performance and reduces query cost</p>
</li>
<li><p>When we run a query in BigQuery, it gets executed by a distributed computing infrastructure that spans multiple machines. Clustering is an optional feature in BigQuery that allows us to organize the data within each partition. The purpose of clustering is to physically arrange data within a partition in a way that is conducive to efficient query processing (see the SQL sketch after this list)</p>
</li>
</ul>
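<p>The following BigQuery DDL sketch shows what a table partitioned by day and clustered by station looks like. In our project, dbt generates the equivalent statement from the model configuration shown later, so this is for illustration only; the column types are assumptions based on the casts used in the staging models.</p>
<pre><code class="lang-sql">-- Illustrative DDL for a partitioned and clustered fact table
CREATE TABLE IF NOT EXISTS mta_data.fact_turnstile
(
  log_id      STRING,
  station_id  STRING,
  booth_id    STRING,
  scp         STRING,
  line_name   STRING,
  division    STRING,
  created_dt  TIMESTAMP,
  entries     INT64,
  exits       INT64
)
PARTITION BY DATE(created_dt)
CLUSTER BY station_id;
</code></pre>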
<h4 id="sql-server-and-big-query-concept-comparison">SQL Server and Big Query Concept Comparison</h4>
<ul>
<li><p>In SQL Server, a clustered index defines the physical order of data in a table. In BigQuery, clustering refers to the organization of data within partitions based on one or more columns. Clustering in BigQuery does not impact the physical storage order like a clustered index in SQL Server</p>
</li>
<li><p>Both SQL Server and BigQuery support table partitioning. The purpose is similar, allowing for better data management and performance optimization </p>
</li>
</ul>
<h2 id="install-system-requirements-and-frameworks">Install System Requirements and Frameworks</h2>
<p>Before looking at the code, we need to set up our environment with all the necessary dependencies, so we can build our models.</p>
<h3 id="requirements">Requirements</h3>
<blockquote>
<p>π Verify that there are files on the data lake. If not, run the data pipeline process to download the files into the data lake.</p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step4-Data-Warehouse" target="_repo">Clone this repo</a> or copy the files from this folder, dbt and sql.</p>
</blockquote>
<ul>
<li>Must have CSV files in the data lake</li>
<li>Create a <a href="https://www.getdbt.com/">dbt</a> cloud account <ul>
<li>Link dbt with your GitHub project (Not needed when running locally)</li>
<li>Create a scheduled job on dbt Cloud for every Saturday at 9am</li>
<li>Or install locally (VM) and run from CLI</li>
</ul>
</li>
<li>GitHub account</li>
<li>Google BigQuery resource </li>
</ul>
<h4 id="configure-the-cli">Configure the CLI</h4>
<h5 id="install-dbt-core-and-bigquery-dependencies">Install dbt core and BigQuery dependencies</h5>
<p>Run these commands from the Step4-Data-Warehouse/dbt folder to install the dependencies and initialize the project.</p>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd Step4-Data-Warehouse/dbt
<span class="hljs-variable">$ </span>pip install dbt-core dbt-bigquery
<span class="hljs-variable">$ </span>dbt init
<span class="hljs-variable">$ </span>dbt deps
</code></pre>
<h5 id="create-a-profile-file">Create a profile file</h5>
<p>Create the profiles file in the ~/.dbt folder of your home directory by running the following commands.</p>
<pre><code class="lang-bash">$ cd ~
$ mkdir <span class="hljs-selector-class">.dbt</span>
$ cd <span class="hljs-selector-class">.dbt</span>
$ touch profiles<span class="hljs-selector-class">.yml</span>
$ nano profiles.yml
</code></pre>
<ul>
<li>Paste the profiles file content</li>
</ul>
<blockquote>
<p>π Use your dbt cloud project information and cloud key file</p>
</blockquote>
<ul>
<li>Run this command to see the project configuration folder location</li>
</ul>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">debug</span> <span class="hljs-comment">--config-dir</span>
</code></pre>
<ul>
<li>Update the content of the file to match your project information</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-symbol">Analytics:</span>
<span class="hljs-symbol"> outputs:</span>
<span class="hljs-symbol"> dev:</span>
<span class="hljs-symbol"> dataset:</span> mta_data
<span class="hljs-symbol"> job_execution_timeout_seconds:</span> <span class="hljs-number">300</span>
<span class="hljs-symbol"> job_retries:</span> <span class="hljs-number">1</span>
<span class="hljs-symbol"> keyfile:</span> <span class="hljs-meta-keyword">/home/</span>.gcp/your-file.json
<span class="hljs-symbol"> location:</span> us-east1
<span class="hljs-symbol"> method:</span> service-account
<span class="hljs-symbol"> priority:</span> interactive
<span class="hljs-symbol"> project:</span> your-gcp-project
<span class="hljs-symbol"> threads:</span> <span class="hljs-number">2</span>
<span class="hljs-symbol"> type:</span> bigquery
<span class="hljs-symbol"> target:</span> dev
</code></pre>
<h5 id="validate-the-project-configuration">Validate the project configuration</h5>
<p>This should list all the assets that will be generated in the project, including the constraints.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">list</span> <span class="hljs-comment">--profile Analytics</span>
</code></pre>
<h2 id="review-the-code">Review the Code</h2>
<p>With a dev environment ready and clear specifications about how to build the models and our transformations, we can now look at the code and review the approach. We can use <a href="https://code.visualstudio.com/">Visual Studio Code</a> or a similar tool to edit the source code and open a terminal to run the CLI commands.</p>
<p>Start by navigating to the dbt project folder.</p>
<pre><code class="lang-bash">$ cd Step4-<span class="hljs-meta">Data</span>-Warehouse/dbt
</code></pre>
<p>Project tree:</p>
<pre><code>- dbt
 │
 ├── models
 │   │
 │   ├── core
 │   │   ├── schema.yml
 │   │   ├── dim_booth.sql
 │   │   ├── dim_station.sql
 │   │   ├── fact_turnstile.sql
 │   │   ├── ...
 │   ├── staging
 │   │   ├── schema_*.yml
 │   │   ├── stg_booth.sql
 │   │   ├── stg_station.sql
 │   │   ├── stg_turnstile.sql
 │   │   ├── ...
 │   ├── target
 │   │   ├── compile
 │   │   ├── run
 │   │   ├── ...
 ├── dbt_project.yml
</code></pre><p>The dbt folder contains the SQL-based source code. The staging folder contains the view definitions. The core folder contains the table definitions. The schema files in those folders have test rules and data constraints that are used to validate the models. This is how we are able to test our models. </p>
<p>The schema.yml files are used as configurations to define the schema of the final output of the models. It provides the ability to explicitly specify the column names, data types, and other properties of the resulting table created by each dbt model. This file allows dbt to generate the appropriate SQL statements for creating or altering tables in the target data warehouse.</p>
<blockquote>
<p>π All these files are executed using the dbt CLI. The files are compiled into SQL statements that are deployed to the database or just executed in memory to run the test, validation and insert scripts. The compiled SQL is stored in the target folder and these are assets deployed to the database. The transformation tasks are compiled into the run folder and are only executed on the database.</p>
</blockquote>
<h3 id="lineage">Lineage</h3>
<p>Data lineage is the documentation and tracking of the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes. In this case, we show how the external table is the source for the fact table and the dimension table dependencies.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-lineage.png" alt="ozkary-data-engineering-data-warehouse-lineage" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Lineage"></p>
<h3 id="staging-data-models-views">Staging Data Models - Views</h3>
<p>We use the view strategy to build our staging models. When these files are executed (via CLI commands), the SQL DDL (Data Definition Language) is generated and deployed to the database, essentially building the views. We also add a test parameter to limit the number of rows to 100 during the development process only. This is removed when it is deployed. Notice how the Jinja directives are enclosed in double curly braces {{ }} and handle conditional logic, configure the build process, or call user-defined functions.</p>
<blockquote>
<p>π DDL (Data Definition Language) is used to create objects. DML (Data Manipulation Language) is used to query the data.</p>
</blockquote>
<ul>
<li>stg_station.sql</li>
</ul>
<pre><code class="lang-sql">{{ config(materialized=<span class="hljs-string">'view'</span>) }}
<span class="hljs-function">with stations <span class="hljs-title">as</span>
(<span class="hljs-params">
<span class="hljs-keyword">select</span>
Station,
row_number(</span>) <span class="hljs-title">over</span>(<span class="hljs-params">partition <span class="hljs-keyword">by</span> Station</span>) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> </span>{{ source(<span class="hljs-string">'staging'</span>,<span class="hljs-string">'ext_turnstile'</span>) }}
<span class="hljs-keyword">where</span> Station <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
-- create a unique key based <span class="hljs-keyword">on</span> the station name
{{ dbt_utils.generate_surrogate_key([<span class="hljs-string">'Station'</span>]) }} <span class="hljs-keyword">as</span> station_id,
Station <span class="hljs-keyword">as</span> station_name
<span class="hljs-keyword">from</span> stations
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
-- use is_test_run <span class="hljs-literal">false</span> to disable the test limit
-- dbt build --m <model.sql> --<span class="hljs-keyword">var</span> <span class="hljs-string">'is_test_run: false'</span>
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">var</span>(<span class="hljs-params"><span class="hljs-string">'is_test_run'</span>, <span class="hljs-keyword">default</span>=<span class="hljs-literal">true</span></span>) %}
limit 100
</span>{% endif %}
</code></pre>
<ul>
<li>stg_booth.sql</li>
</ul>
<pre><code class="lang-sql">{{ config(materialized=<span class="hljs-string">'view'</span>) }}
<span class="hljs-function">with booths <span class="hljs-title">as</span>
(<span class="hljs-params">
<span class="hljs-keyword">select</span>
UNIT,
CA,
Station,
row_number(</span>) <span class="hljs-title">over</span>(<span class="hljs-params">partition <span class="hljs-keyword">by</span> UNIT, CA</span>) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> </span>{{ source(<span class="hljs-string">'staging'</span>,<span class="hljs-string">'ext_turnstile'</span>) }}
<span class="hljs-keyword">where</span> Unit <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span> and CA <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span> and Station <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
-- create a unique key
{{ dbt_utils.generate_surrogate_key([<span class="hljs-string">'UNIT'</span>, <span class="hljs-string">'CA'</span>]) }} <span class="hljs-keyword">as</span> booth_id,
UNIT <span class="hljs-keyword">as</span> remote,
CA <span class="hljs-keyword">as</span> booth_name,
Station <span class="hljs-keyword">as</span> station_name
<span class="hljs-keyword">from</span> booths
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
-- dbt build --m <model.sql> --<span class="hljs-keyword">var</span> <span class="hljs-string">'is_test_run: false'</span>
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">var</span>(<span class="hljs-params"><span class="hljs-string">'is_test_run'</span>, <span class="hljs-keyword">default</span>=<span class="hljs-literal">true</span></span>) %}
limit 100
</span>{% endif %}
</code></pre>
<ul>
<li>stg_turnstile.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized='view') }}
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span>
(
select
CA,
UNIT,
STATION,
concat(CA,UNIT,SCP) <span class="hljs-keyword">as</span> REF,
SCP,
LINENAME,
DIVISION,
concat(<span class="hljs-built_in">log</span>.DATE,<span class="hljs-string">" "</span>, <span class="hljs-built_in">log</span>.TIME) <span class="hljs-keyword">as</span> CREATED,
ENTRIES,
EXITS,
row_number() <span class="hljs-keyword">over</span>(partition <span class="hljs-keyword">by</span> CA, UNIT, SCP, DATE, TIME) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> {{ source('staging','ext_turnstile') }} <span class="hljs-keyword">as</span> <span class="hljs-built_in">log</span>
<span class="hljs-keyword">where</span> Station <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null <span class="hljs-keyword">and</span> DATE <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null <span class="hljs-keyword">and</span> TIME <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null
)
select
<span class="hljs-comment">-- create a unique key </span>
{{ dbt_utils.generate_surrogate_key(['REF', 'CREATED']) }} <span class="hljs-keyword">as</span> log_id,
CA <span class="hljs-keyword">as</span> booth,
UNIT <span class="hljs-keyword">as</span> remote,
STATION <span class="hljs-keyword">as</span> station,
<span class="hljs-comment">-- unit and line information</span>
SCP <span class="hljs-keyword">as</span> scp,
LINENAME AS line_name,
DIVISION AS division,
<span class="hljs-comment">-- timestamp</span>
cast(CREATED <span class="hljs-keyword">as</span> timestamp) <span class="hljs-keyword">as</span> created_dt,
<span class="hljs-comment">-- measures</span>
cast(entries <span class="hljs-keyword">as</span> <span class="hljs-built_in">integer</span>) <span class="hljs-keyword">as</span> entries,
cast(exits <span class="hljs-keyword">as</span> <span class="hljs-built_in">integer</span>) <span class="hljs-keyword">as</span> exits
<span class="hljs-keyword">from</span> turnstile
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
<span class="hljs-comment">-- dbt build --m <model.sql> --var 'is_test_run: false'</span>
{% <span class="hljs-keyword">if</span> var('is_test_run', default=<span class="hljs-literal">true</span>) %}
limit <span class="hljs-number">100</span>
{% endif %}
</code></pre>
<h3 id="physical-data-models-tables">Physical Data Models - Tables</h3>
<p>We use the incremental strategy to build our tables. This enables us to continuously append data to our tables when there is new information. This strategy creates both DDL and DML scripts, which build the tables and also merge the new data into them. </p>
<p>We use the models (views) to build the actual tables. When these scripts are executed (via CLI commands), the process checks if the object exists; if it does not exist, it creates it. It then reads the data from the views using CTEs (common table expressions) and appends all the records that are not already in the table.</p>
<ul>
<li>dim_station.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized=<span class="hljs-string">'incremental'</span>) }}
<span class="hljs-function">with stations <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span>
station_id,
station_name
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'stg_station'</span></span>) }} <span class="hljs-keyword">as</span> d
<span class="hljs-keyword">where</span> station_id <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
ns.station_id,
ns.station_name
<span class="hljs-keyword">from</span> stations ns
</span>{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">is_incremental</span>(<span class="hljs-params"></span>) %}
-- logic <span class="hljs-keyword">for</span> incremental models <span class="hljs-keyword">this</span> </span>= dim_station table
left outer <span class="hljs-keyword">join</span> {{ <span class="hljs-keyword">this</span> }} dim
<span class="hljs-keyword">on</span> ns.station_id = dim.station_id
<span class="hljs-keyword">where</span> dim.station_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
{% endif %}
</code></pre>
<ul>
<li>dim_booth.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized=<span class="hljs-string">'incremental'</span>,
cluster_by = <span class="hljs-string">"station_id"</span>
)}}
<span class="hljs-function">with booth <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span>
booth_id,
remote,
booth_name,
station_name
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'stg_booth'</span></span>) }}
<span class="hljs-keyword">where</span> booth_id <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
),
dim_station <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span> station_id, station_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'dim_station'</span></span>) }}
)
<span class="hljs-keyword">select</span>
b.booth_id,
b.remote,
b.booth_name,
st.station_id
<span class="hljs-keyword">from</span> booth b
inner <span class="hljs-keyword">join</span> dim_station st
<span class="hljs-keyword">on</span> b.station_name </span>= st.station_name
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">is_incremental</span>(<span class="hljs-params"></span>) %}
-- logic <span class="hljs-keyword">for</span> incremental models <span class="hljs-keyword">this</span> </span>= dim_booth table
left outer <span class="hljs-keyword">join</span> {{ <span class="hljs-keyword">this</span> }} s
<span class="hljs-keyword">on</span> b.booth_id = s.booth_id
<span class="hljs-keyword">where</span> s.booth_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
{% endif %}
</code></pre>
<ul>
<li>fact_turnstile.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized='incremental',
partition_by={
<span class="hljs-string">"field"</span>: <span class="hljs-string">"created_dt"</span>,
<span class="hljs-string">"data_type"</span>: <span class="hljs-string">"timestamp"</span>,
<span class="hljs-string">"granularity"</span>: <span class="hljs-string">"day"</span>
},
cluster_by = <span class="hljs-string">"station_id"</span>)
}}
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span> (
select
log_id,
remote,
booth,
station,
scp,
line_name,
division,
created_dt,
entries,
exits
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('stg_turnstile') }}
<span class="hljs-keyword">where</span> log_id <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null
),
dim_station <span class="hljs-keyword">as</span> (
select station_id, station_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('dim_station') }}
),
dim_booth <span class="hljs-keyword">as</span> (
select booth_id, remote, booth_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('dim_booth') }}
)
select
<span class="hljs-built_in">log</span>.log_id,
st.station_id,
booth.booth_id,
<span class="hljs-built_in">log</span>.scp,
<span class="hljs-built_in">log</span>.line_name,
<span class="hljs-built_in">log</span>.division,
<span class="hljs-built_in">log</span>.created_dt,
<span class="hljs-built_in">log</span>.entries,
<span class="hljs-built_in">log</span>.exits
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> <span class="hljs-built_in">log</span>
left join dim_station <span class="hljs-keyword">as</span> st
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.station = st.station_name
left join dim_booth <span class="hljs-keyword">as</span> booth
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.remote = booth.remote <span class="hljs-keyword">and</span> <span class="hljs-built_in">log</span>.booth = booth.booth_name
{% <span class="hljs-keyword">if</span> is_incremental() %}
<span class="hljs-comment">-- logic for incremental models this = fact_turnstile table</span>
left outer join {{ this }} fact
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id <span class="hljs-keyword">is</span> null
{% endif %}
</code></pre>
<ul>
<li>schema.yml</li>
</ul>
<pre><code class="lang-yml"><span class="hljs-attribute">version</span>: 2
<span class="less"><span class="hljs-attribute">models</span>:
- <span class="hljs-attribute">name</span>: dim_station
<span class="hljs-attribute">description</span>: >
    List of unique stations identified by station_id.
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: The station identifier
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_name
<span class="hljs-attribute">description</span>: the station name
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: dim_booth
<span class="hljs-attribute">description</span>: >
    List of unique booths identified by booth_id.
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: booth_id
<span class="hljs-attribute">description</span>: The booth identifier
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: remote
<span class="hljs-attribute">description</span>: the remote gate name
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: booth_name
<span class="hljs-attribute">description</span>: the station booth
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: the station id
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_station'</span>)
<span class="hljs-attribute">field</span>: station_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: fact_turnstile
<span class="hljs-attribute">description</span>: >
    Represents the daily entries and exits associated with booths in subway stations
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: log_id
<span class="hljs-attribute">description</span>: Primary key for this table, generated with a concatenation CA, SCP,UNIT, STATION CREATED
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: booth_id
<span class="hljs-attribute">description</span>: foreign key to the booth dimension
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_booth'</span>)
<span class="hljs-attribute">field</span>: booth_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: The foreign key to the station dimension
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_station'</span>)
<span class="hljs-attribute">field</span>: station_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: scp
<span class="hljs-attribute">description</span>: The device address
- <span class="hljs-attribute">name</span>: line_name
<span class="hljs-attribute">description</span>: The subway line
- <span class="hljs-attribute">name</span>: division
<span class="hljs-attribute">description</span>: The subway division
- <span class="hljs-attribute">name</span>: created_dt
<span class="hljs-attribute">description</span>: The date time for the activity
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: entries
<span class="hljs-attribute">description</span>: The number of entries
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: exits
<span class="hljs-attribute">description</span>: the number of exits
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn</span>
</code></pre>
<h4 id="incremental-models">Incremental Models</h4>
<p>In dbt, an incremental model uses a merge operation to update a data warehouse's tables incrementally rather than performing a full reload of the data each time. This approach is particularly useful when dealing with large datasets and when the source data has frequent updates or inserts. Incremental models help optimize data processing and reduce the amount of data that needs to be processed during each run, resulting in faster data updates. </p>
<ul>
<li>SQL merge query for the station dimension table (generated code)</li>
</ul>
<pre><code class="lang-sql">
<span class="hljs-keyword">merge</span> <span class="hljs-keyword">into</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`dim_station`</span> <span class="hljs-keyword">as</span> DBT_INTERNAL_DEST
<span class="hljs-keyword">using</span> (
<span class="hljs-keyword">with</span> stations <span class="hljs-keyword">as</span> (
<span class="hljs-keyword">select</span>
station_id,
station_name
<span class="hljs-keyword">from</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`stg_station`</span> <span class="hljs-keyword">as</span> d
)
<span class="hljs-keyword">select</span>
ns.station_id,
ns.station_name
<span class="hljs-keyword">from</span> stations ns
<span class="hljs-comment">-- logic for incremental models</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`dim_station`</span> s
<span class="hljs-keyword">on</span> ns.station_id = s.station_id
<span class="hljs-keyword">where</span> s.station_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
<span class="hljs-comment">-- </span>
) <span class="hljs-keyword">as</span> DBT_INTERNAL_SOURCE
<span class="hljs-keyword">on</span> (<span class="hljs-literal">FALSE</span>)
<span class="hljs-keyword">when</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">matched</span> <span class="hljs-keyword">then</span> <span class="hljs-keyword">insert</span>
(<span class="hljs-string">`station_id`</span>, <span class="hljs-string">`station_name`</span>)
<span class="hljs-keyword">values</span>
(<span class="hljs-string">`station_id`</span>, <span class="hljs-string">`station_name`</span>)
</code></pre>
<ul>
<li>SQL merge query for the fact table (generated code)</li>
</ul>
<pre><code class="lang-sql">merge into `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`fact_turnstile` <span class="hljs-keyword">as</span> DBT_INTERNAL_DEST
using (
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span> (
select
log_id,
remote,
booth,
station,
scp,
line_name,
division,
created_dt,
entries,
exits
<span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`stg_turnstile`
<span class="hljs-keyword">where</span> log_id is not null
),
dim_station <span class="hljs-keyword">as</span> (
select station_id, station_name <span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`dim_station`
),
dim_booth <span class="hljs-keyword">as</span> (
select booth_id, remote, booth_name <span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`dim_booth`
)
select
log.log_id,
st.station_id,
booth.booth_id,
log.scp,
log.line_name,
log.division,
log.created_dt,
log.entries,
log.exits
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> log
left join dim_station <span class="hljs-keyword">as</span> st
on log.station = st.station_name
left join dim_booth <span class="hljs-keyword">as</span> booth
on log.remote = booth.remote and log.booth = booth.booth_name
-- logic for incremental models this = fact_turnstile table
left outer join `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`fact_turnstile` fact
on log.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id is null
) <span class="hljs-keyword">as</span> DBT_INTERNAL_SOURCE
on (FALSE)
when not matched then insert
(`log_id`, `station_id`, `booth_id`, `scp`, `line_name`, `division`, `created_dt`, `entries`, `exits`)
values
(`log_id`, `station_id`, `booth_id`, `scp`, `line_name`, `division`, `created_dt`, `entries`, `exits`)
</code></pre>
<h2 id="how-to-run-it">How to Run It</h2>
<p>We are ready to see this in action. We first need to build the data models on our database by running the following steps:</p>
<h3 id="validate-the-project">Validate the project</h3>
<p>Debug the project to make sure there are no compilation errors.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-keyword">debug</span>
</code></pre>
<h3 id="run-the-test-cases">Run the test cases</h3>
<p>All tests should pass.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">test</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-tests.png" alt="ozkary-data-engineering-data-warehouse-tests" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Tests"></p>
<h3 id="build-the-models">Build the models</h3>
<p>Set the test run variable to false. This allows for the full dataset to be created without limiting the rows.</p>
<pre><code class="lang-bash">$ cd Step4-Data-Warehouse/dbt
$ dbt build --select stg_booth<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select stg_station<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select stg_turnstile<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select dim_booth<span class="hljs-selector-class">.sql</span>
$ dbt build --select dim_station<span class="hljs-selector-class">.sql</span>
$ dbt build --select fact_turnstile.sql
</code></pre>
<p>After running these commands, the following resources should be in the data warehouse:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-schema.png" alt="ozkary-data-engineering-data-warehouse-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Schema"></p>
<blockquote>
<p>π The build command compiles the SQL code for our dbt project and executes it against the data warehouse, running the models together with their tests. The run command only executes the models to update the data. Typically, we use dbt build to create and validate the models, and then dbt run for the ongoing incremental updates.</p>
</blockquote>
<h3 id="generate-documentation">Generate documentation</h3>
<p>Run generate to create the documentation. We can then run serve to view the documentation in the browser.</p>
<pre><code class="lang-bash">$ dbt docs <span class="hljs-keyword">generate</span>
$ dbt docs serve
</code></pre>
<p>The entire project is documented. The image below shows the documentation for the fact table with the lineage graph showing how it was built.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-docs.png" alt="ozkary-data-engineering-data-warehouse-docs" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Documents"></p>
<h3 id="manually-test-the-incremental-updates">Manually test the incremental updates</h3>
<p>We can run our updates on demand by using the CLI. To be able to run the updates, we should first run the data pipeline and import a new CSV file into the data lake. We can then run our updates as follows:</p>
<pre><code class="lang-bash">$ cd Step4-Data-Warehouse/dbt
$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model dim_booth.sql
</span>$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model dim_station.sql
</span>$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model fact_turnstile.sql</span>
</code></pre>
<p>We should notice that we are "running" the model, which only runs the incremental (merge) updates.</p>
<h3 id="schedule-the-job">Schedule the job</h3>
<p>Log in to dbt Cloud and set up this scheduled job:</p>
<ul>
<li>On dbt Cloud, set up the scheduled job to run every Saturday at 9am</li>
<li>Use the production environment</li>
<li>Use the following command</li>
</ul>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">run</span> <span class="hljs-comment">--model fact_turnstile.sql</span>
</code></pre>
<p>After running the cloud job, the log should show the following information with the number of rows affected. </p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-jobs.png" alt="ozkary-data-engineering-data-warehouse-job" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Job"></p>
<blockquote>
<p>π There should be files on the data lake for the job to insert any new records. </p>
</blockquote>
<h3 id="manually-query-the-data-lake-for-new-data">Manually Query the data lake for new data</h3>
<p>To check for new records, we can manually run this query on the database. </p>
<pre><code class="lang-sql">with turnstile as (
<span class="hljs-keyword">select</span>
log_id
<span class="hljs-keyword">from</span> mta_data.stg_turnstile
)
<span class="hljs-keyword">select</span>
log.log_id
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> <span class="hljs-keyword">log</span>
<span class="hljs-comment">-- logic for incremental models find new rows that are not in the fact table</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> mta_data.fact_turnstile fact
<span class="hljs-keyword">on</span> log.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
</code></pre>
<h3 id="validate-the-data">Validate the data</h3>
<p>To validate the number of records in our database, we can run these queries:</p>
<pre><code class="lang-sql">--<span class="hljs-built_in"> check </span>station dimension table
select count(*) from mta_data.dim_station;
--<span class="hljs-built_in"> check </span>booth dimension table
select count(*) from mta_data.dim_booth;
--<span class="hljs-built_in"> check </span>the fact table
select count(*) from mta_data.fact_turnstile;
--<span class="hljs-built_in"> check </span>the staging fact data
select count(*) from mta_data.stg_turnstile;
</code></pre>
<p>After following all these instructions, we should see data in our data warehouse, which closes the loop on the entire data pipeline for data ingestion from a CSV file to our data warehouse. We should also note that we could have done this process using a Python-Centric approach with Apache Spark, and we will discuss that in a later section.</p>
<h2 id="summary">Summary</h2>
<p>During this data warehouse exercise, we delve into the design and implementation step, crafting robust data models, and designing transformation tasks. Carefully selecting a star schema design and utilizing BigQuery as our OLAP system, we optimize performance and handle large datasets efficiently. Leveraging SQL for coding and a SQL-Centric framework, we ensure seamless data modeling and transformation. We use GitHub for our source code management and CI/CD tool integration, so the latest changes can be built and deployed. Thorough documentation and automated data transformations underscore our commitment to data governance and streamlined operations. The result is a resilient and future-ready data warehouse capable of meeting diverse analytical needs.</p>
<h2 id="next-step">Next Step</h2>
<p>With our data warehouse design and implementation complete, we have laid a solid foundation to unleash the full potential of our data. Now, we venture into the realm of data analysis and visualization, where we can leverage powerful tools like Power BI and Looker to transform raw data into actionable insights.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Data Analysis and Visualization]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-64739404928915423042023-06-10T11:45:00.018-04:002023-08-04T11:34:41.189-04:00Data Engineering Process Fundamentals - Data Warehouse and Transformation<p>After completing the pipeline and orchestration phase in the data engineering process, our pipeline should be fully operational and loading data into our data lake. The compressed CSV files in our data lake, even though is optimized for storage, are not designed for easy access for analysis and visualization tools. Therefore, we should transition into moving the data from the files into a data warehouse, so we can facilitate the access for the analysis process.</p>
<p>The process of sending the data into a data warehouse requires a few essential design activities before we can migrate the data into tables. Like any process, before any implementation is done, we first need to define the database system and schema, identify the programming language, frameworks, and tools to use, and cover the CI/CD and supporting requirements that keep our data warehouse operational.</p>
<p>Once the data warehouse design is in place, we can move into the implementation stage of the process, where we turn concepts into concrete structures, including dimension and fact tables, while also defining the data transformation tasks that load the data into the data warehouse. </p>
<p>To get a better understanding of the data warehouse process, let's first do a refresh on some important concepts related to data warehouse systems. As we cover these concepts, we can relate them to the activities that we need to take on to deliver a solution that can scale according to our data demands.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-steps.png" alt="ozkary-data-engineering-data-warehouse-transformation-steps" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation"></p>
<h2 id="olap-vs-oltp-database-systems">OLAP vs OLTP Database Systems</h2>
<p>An Online Analytical Processing (OLAP) and an Online Transaction Processing (OLTP) are two different types of database systems with distinct purposes and characteristics:</p>
<h3 id="olap">OLAP</h3>
<ul>
<li>It is designed for complex analytical queries and data analysis</li>
<li>It is optimized for read-heavy workloads and aggregates large volumes of data to support business intelligence (BI), reporting, and data analysis.</li>
<li>These databases store historical data and facilitate data exploration, trend analysis, and decision-making</li>
<li>Data is typically denormalized and organized in a multidimensional structure like a star schema or snowflake schema to enable efficient querying and aggregation.</li>
<li>Some examples include data warehouses and analytical databases like Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics.</li>
</ul>
<h3 id="oltp">OLTP</h3>
<ul>
<li>It is designed for transactional processing and handling frequent, real-time, and high-throughput transactions</li>
<li>It focuses on transactional operations like inserting, updating, and deleting individual records</li>
<li>Databases are typically normalized to minimize redundancy and ensure data integrity during frequent transactions</li>
<li>The data is organized in a relational structure and optimized for read and write operations</li>
<li>Some examples include traditional relational databases like MySQL, PostgreSQL, Microsoft SQL Server, and Oracle</li>
</ul>
<blockquote>
<p>π OLAP databases (e.g., BigQuery) are used for analytical processing. OLTP databases (e.g., SQL Server) are used for transaction processing</p>
</blockquote>
<p>In summary, OLAP and OLTP serve different purposes in the database world. OLAP databases are used for analytical processing, supporting complex queries and data analysis, while OLTP databases are used for transaction processing, managing high-frequency and real-time transactional operations. Depending on the needs of the solution, we would choose the appropriate type of database system to achieve the desired performance and functionality. In our case, an OLAP system aligns with the requirements for our solution.</p>
<h2 id="what-is-a-data-warehouse">What is a Data Warehouse</h2>
<p>A Data Warehouse is an OLAP system, which serves as the central data repository for historical and aggregated data. A data warehouse is designed to support complex analytical queries, reporting, and data analysis for Big Data use cases. It typically adopts a denormalized entity structure, such as a star schema or snowflake schema, to facilitate efficient querying and aggregations. Data from various OLTP sources is extracted, loaded and transformed (ELT) into the data warehouse to enable analytics and business intelligence. The data warehouse acts as a single source of truth for business users to obtain insights from historical data.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-design.png" alt="ozkary-data-engineering-data-warehouse-transformation-design" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Design"></p>
<h3 id="elt-vs-etl">ELT vs ETL</h3>
<p>An extract, load and transform (ELT) process differs from the extract, transform and load (ETL) process in the data transformation approach. For some solutions, a flow task may transform (ETL) the data prior to loading it into storage, so it can then be inserted into the data warehouse directly. This approach increases the amount of Python code and hardware resources used by the VM environments. </p>
<p>For the ELT process, the transformation may be done using SQL (Structured Query Language) code and the data warehouse resources, which tends to perform well for Big Data scenarios. This is usually done by defining the data model with views over some external tables and running the transformation using SQL for bulk data processing. In our case, we can expose the data lake files as external tables and use the power of the data warehouse to read and transform the data, which aligns with the ELT approach since the data is first loaded in the data lake.</p>
<blockquote>
<p>π For the ETL process, the data is transformed before adding it to storage. For the ELT process, the data is first loaded into storage in its raw format, and the transformation is then done before inserting into the dimension and fact tables.</p>
</blockquote>
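<p>As a minimal ELT sketch, the sequence below loads nothing into the warehouse storage up front: the raw files stay in the data lake, and the transformation runs inside the warehouse with SQL. The dataset name, bucket path, and column names are illustrative assumptions.</p>
<pre><code class="lang-sql">-- Reference the raw files in the data lake as an external table
CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-data-lake/turnstile/*.csv.gz']
);

-- Transform with SQL inside the warehouse, here as a simple staging view
CREATE OR REPLACE VIEW mta_data.stg_station AS
SELECT DISTINCT STATION AS station_name
FROM mta_data.ext_turnstile
WHERE STATION IS NOT NULL;
</code></pre>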
<h3 id="external-tables">External Tables</h3>
<p>An external table in the context of a data warehouse refers to a table that is not physically stored within the data warehouse's database but instead references data residing in an external storage location. The data in an external table can be located in cloud storage (e.g., Azure Blob Storage, AWS S3) or on-premises storage. When querying an external table, the data warehouse's query engine accesses the data in the external location on-the-fly without physically moving or copying it into the data warehouse's database. </p>
<p>Advantages of using external tables in a data warehouse include:</p>
<ul>
<li>Cost Savings: External tables allow us to store data in cost-effective storage solutions like cloud object storage</li>
<li>Data Separation: By keeping the data external to the data warehouse, we can maintain a clear separation between compute and storage. We can scale them independently, optimizing costs and performance</li>
<li>Data Freshness: External tables provide real-time access to data, as changes made to the external data source are immediately reflected when queried. There's no need for <strong>raw data ingestion</strong> processes to load the data into the data warehouse.</li>
<li>Data Variety and Integration: You can have external tables referencing data in various formats (e.g., CSV, Parquet, JSON), enabling seamless integration of diverse data sources without the need for complex data transformations</li>
<li>Data Archiving and Historical Analysis: External tables allow you to store historical data in an external location, reducing the data warehouse's storage requirements. You can keep archived data accessible without impacting the performance of the main data warehouse.</li>
<li>Rapid Onboarding: Setting up external tables is often quicker and more straightforward than traditional data ingestion processes. This allows for faster onboarding of new data sources into the data warehouse.</li>
<li>Reduced ETL Complexity: External tables can reduce the need for complex ETL (Extract, Transform, Load) processes as the data doesn't need to be physically moved or transformed before querying.</li>
</ul>
<h3 id="data-mart">Data Mart</h3>
<p>Depending on the use case, the analytical tools can connect directly to the data warehouse for data analysis and reporting. In other scenarios, it may be better to create a data mart, which is a smaller, focused subset of a data warehouse that is designed to serve the needs of a specific business unit within an organization. The data mart stores its data in separate storage.</p>
<p>There are two main types of data marts:</p>
<ul>
<li>Dependent Data Mart: This type of data mart is derived directly from the data warehouse. It extracts and transforms data from the centralized data warehouse and optimizes it for a specific business unit. </li>
<li>Independent Data Mart: An independent data mart is created separately from the data warehouse, often using its own ELT processes to extract and transform data from the source systems. It is not directly connected to the data warehouse</li>
</ul>
<p>By providing a more focused view of the data, data marts enable faster and more efficient decision-making within targeted business areas. </p>
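<p>As an example of a dependent data mart, a separate dataset could hold pre-aggregated tables derived from the warehouse. The sketch below assumes a hypothetical mta_mart dataset and reuses the fact and dimension tables from our star schema.</p>
<pre><code class="lang-sql">-- Monthly entries and exits per station, materialized for a specific business unit
CREATE OR REPLACE TABLE mta_mart.station_monthly_activity AS
SELECT
  st.station_name,
  DATE_TRUNC(DATE(f.created_dt), MONTH) AS activity_month,
  SUM(f.entries) AS total_entries,
  SUM(f.exits) AS total_exits
FROM mta_data.fact_turnstile AS f
JOIN mta_data.dim_station AS st
  ON f.station_id = st.station_id
GROUP BY st.station_name, activity_month;
</code></pre>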
<h2 id="data-warehouse-design-and-architecture">Data Warehouse Design and Architecture</h2>
<p>During the design and architecture stage of our data warehouse project, our primary objective is to transition from conceptual ideas to concrete designs. With a clear understanding of the business requirements, data sources and their update frequencies, we can move forward with the design of the data warehouse architecture. To start, we need to define the data warehouse models such as star schema, snowflake schema, or hybrid models based on data relationships and query patterns. We should also determine the infrastructure and technology stack for the data warehouse, considering factors like data volume, frequency of updates, query performance requirements, source control, and CI/CD activities.</p>
<h3 id="schema-design">Schema Design</h3>
<p>The Star and Snowflake Schemas are two common data warehouse modeling techniques. The Star Schema consists of a central fact table that is connected to multiple dimension tables via foreign key relationships. The fact table contains the measures or metrics, while the dimension tables hold descriptive attributes. The Snowflake Schema is a variation of the Star Schema, but with normalized dimension tables. This means that dimension tables are further divided into multiple related tables, reducing data redundancy, but increasing SQL joins.</p>
<h4 id="star-schema-pros-and-cons">Star Schema Pros and Cons</h4>
<ul>
<li>Simplicity: The Star Schema is straightforward and easy to understand, making it user-friendly for both data engineers and business analysts</li>
<li>Performance: Star Schema typically delivers faster query performance because it denormalizes data, reducing the number of joins required to retrieve data</li>
<li>Data Redundancy: Due to denormalization, there might be some data redundancy in dimension tables, which can lead to increased storage requirements</li>
<li>Maintenance: The Star Schema is relatively easier to maintain and modify since changes in dimension tables don't affect the fact table</li>
</ul>
<h4 id="snowflake-schema-pros-and-cons">Snowflake Schema Pros and Cons</h4>
<ul>
<li>Normalization: The Snowflake Schema reduces data redundancy and optimizes storage by normalizing dimension data</li>
<li>Complexity: Compared to the Star Schema, the Snowflake Schema is more complex due to the presence of multiple normalized dimension tables</li>
<li>Performance: The Snowflake Schema requires more joins, which can impact query performance compared to the Star Schema. However, modern data warehouses are optimized to handle Snowflake Schemas efficiently</li>
<li>Maintenance: The Snowflake Schema might be slightly more challenging to maintain and modify due to the normalized structure and the need for more joins</li>
</ul>
<p>In summary, we can use the Star Schema when query performance is a primary concern and data model simplicity is essential. We can use the Snowflake Schema when storage optimization is crucial and the data model involves high-cardinality dimension attributes with potential data redundancy.</p>
<h3 id="infrastructure">Infrastructure</h3>
<p>Cloud-based OLAP systems like Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics are built to scale with growing data volumes. They can handle petabytes of data, making them a great fit for Big Data scenarios. These systems also support MPP (Massively Parallel Processing) and provide built-in indexing and caching, which improve query performance and reduce compute by reusing cached query results. The serverless architecture of these systems helps reduce cost. Because the system is managed by the cloud provider, we can focus on the data analysis instead of infrastructure management.</p>
<p>OLAP systems also support data governance by providing a structured and controlled environment for managing data, ensuring data quality, enforcing security and access controls, and promoting consistency and trust in the data across the organization. These systems also implement robust security measures and auditing capabilities for tracking data lineage and changes, which are crucial for compliance requirements.</p>
<p>All in all, OLAP systems are well-equipped to handle big data scenarios, offering scalability, high-performance querying, cost-effectiveness, and data governance, which is a critical business requirement.</p>
<h3 id="technology-stack">Technology Stack</h3>
<p>When it comes to the technology stack, we have to decide which programming language, frameworks, and platforms to use for our solution. For example, Python is a versatile programming language with an extensive ecosystem of libraries for data modeling and transformation. But when using Python, we need to parse the CSV files, and model and transform the data in memory before it can be sent to the database. This tends to increase the amount of Python code, Docker containers, VM resources, and overall DevOps activities. </p>
<p>Alternatively, we can leverage the memory and processing power of the data warehouse itself and use SQL to create the models and run the transformations, which tends to work best for large datasets and faster processing. Because the files already reside in the data lake, the CSV files can be modeled as external tables within the data warehouse. SQL can then be used to create models as views that enforce the data types. In addition, the transformation can be done right in the database using SQL statements with batch queries, which tends to perform much better than using Python.</p>
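<p>As a rough sketch of this SQL-centric approach, the snippet below uses the BigQuery Python client to define an external table over the CSV files in the data lake and a typed view on top of it. The dataset, bucket, and column names are illustrative assumptions, not the project's actual resources.</p>
<pre><code class="lang-python">from google.cloud import bigquery

# assumption: BigQuery is the data warehouse and the CSV files live in a GCS data lake
client = bigquery.Client()

ddl = """
CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://mta_data_lake/turnstile/*.csv.gz']
);

-- view that enforces the data types on top of the external table
CREATE OR REPLACE VIEW mta_data.vw_turnstile AS
SELECT
  CAST(station AS STRING)    AS station_name,
  CAST(created AS TIMESTAMP) AS created_datetime,
  CAST(entries AS INT64)     AS entries,
  CAST(exits   AS INT64)     AS exits
FROM mta_data.ext_turnstile;
"""

# run the DDL as a single batch script on the warehouse
client.query(ddl).result()
</code></pre>
<p>The same pattern applies to other warehouses that support external tables; only the DDL syntax changes.</p>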
<h4 id="frameworks">Frameworks</h4>
<p>Frameworks provide libraries to handle specific technical concerns. In the case of a Python-centric solution, we can use the <a href="https://pandas.pydata.org/">Pandas</a> library, an open-source data manipulation, cleaning, transformation, and analysis library widely used by data engineers and scientists. Pandas supports DataFrame-based modeling and transformation. A DataFrame is a two-dimensional, table-like data structure. It can hold data with different data types and allows us to perform operations like filtering, grouping, joining, and aggregating. Pandas also offers functions for handling missing data, removing duplicates, and converting data types, making data cleaning tasks easier.</p>
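<p>As a short illustration of these operations, the hedged example below loads a CSV file, applies basic cleaning rules, and aggregates the measures; the file and column names are placeholders for the turnstile dataset.</p>
<pre><code class="lang-python">import pandas as pd

# load a raw CSV file (placeholder name) and apply basic cleaning rules
df = pd.read_csv("turnstile_230318.csv")
df = df.drop_duplicates()                        # remove duplicate rows
df = df.dropna(subset=["ENTRIES", "EXITS"])      # drop rows with missing measures
df["CREATED"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])  # convert to a timestamp

# aggregate the measures by station for a simple summary
summary = (
    df.groupby("STATION")[["ENTRIES", "EXITS"]]
      .sum()
      .reset_index()
)
print(summary.head())
</code></pre>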
<p>There are also frameworks that generate SQL code to build the models and process the transformation. <a href="https://www.getdbt.com/">dbt</a> (data build tool) is a SQL-centric framework that is primarily focused on transforming data using SQL-based queries. It allows us to define data transformation logic using SQL and Jinja, a templating language that adds capabilities such as loops, conditionals, and macros to our SQL code. dbt enables us to build the actual data models as views, entities (tables), and SQL-based transformations that are hosted on the data warehouse. </p>
<h4 id="apache-spark-platform">Apache Spark Platform</h4>
<p><a href="https://spark.apache.org/">Apache Spark</a> is a widely used open-source distributed computing system designed for big data processing and analytics. It provides a fast, scalable, and versatile platform for handling large-scale data workloads. While it can be used for data modeling and transformation, it serves a broader range of use cases, including batch processing, real-time processing and machine learning. There are many popular cloud platforms that use Spark as their core engine. Some of them include: Databricks, Azure Synapse Analytics, Google Dataproc, Amazon EMR. </p>
<p>Spark supports multiple programming languages like Scala, Python, and SQL. Since Spark requires a runtime environment to manage the execution of a task, the programming model is very similar to running applications on a VM. The Spark application connects to a Spark cluster to create a session, and it can then perform data processing and run Spark SQL queries. Let's look at what a Python and a SQL application look like with Spark.</p>
<p>Data Modeling and Transformation with PySpark and SQL:</p>
<p>The next examples (one for Python and one for SQL) show us how to create a Spark session, join two DataFrames using the station_id as the related column, and then select and display the result of the query.</p>
<ul>
<li>PySpark: PySpark provides a high-level API for Spark, allowing us to write Spark applications using Python. It exposes the core Spark functionalities and supports DataFrame and Dataset APIs for working with structured data. PySpark is popular among data engineers and data scientists.</li>
</ul>
<p><strong>PySpark Code Sample:</strong></p>
<pre><code class="lang-python">from pyspark.sql <span class="hljs-built_in">import</span> SparkSession
<span class="hljs-comment"># Assuming you already have the two DataFrames `dim_station` and `fact_turnstile`</span>
<span class="hljs-comment"># Create a SparkSession (if not already created)</span>
<span class="hljs-attr">spark</span> = SparkSession.builder.appName(<span class="hljs-string">"JoinEntities"</span>).getOrCreate()
<span class="hljs-comment"># Join the two DataFrames on the 'station_id' column</span>
<span class="hljs-attr">joined_df</span> = fact_turnstile.join(dim_station, <span class="hljs-attr">on="station_id")</span>
<span class="hljs-comment"># Select the desired columns</span>
<span class="hljs-attr">result_df</span> = joined_df.select(<span class="hljs-string">"station_name"</span>, <span class="hljs-string">"created_datetime"</span>, <span class="hljs-string">"entries"</span>, <span class="hljs-string">"exits"</span>)
<span class="hljs-comment"># Show the result</span>
result_df.show()
</code></pre>
<ul>
<li>SQL: Spark includes a SQL module that allows us to run SQL queries directly on data. This makes it convenient for those familiar with SQL to leverage their SQL skills to perform data modeling and transformation tasks using Spark.</li>
</ul>
<p><strong>PySpark and SQL Code Sample:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-comment"># Assuming you already have the two DataFrames `dim_station` and `fact_turnstile`</span>
<span class="hljs-comment"># Create a SparkSession (if not already created)</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"JoinEntities"</span>).getOrCreate()
<span class="hljs-comment"># Register the DataFrames as temporary views</span>
dim_station.createOrReplaceTempView(<span class="hljs-string">"dim_station_view"</span>)
fact_turnstile.createOrReplaceTempView(<span class="hljs-string">"fact_turnstile_view"</span>)
<span class="hljs-comment"># Write the SQL query for joining and selecting the desired columns</span>
sql_query = <span class="hljs-string">"""
SELECT s.station_name, t.created, t.entries, t.exits
FROM fact_turnstile_view t
JOIN dim_station_view s ON t.station_id = s.station_id
"""</span>
<span class="hljs-comment"># Execute the SQL query</span>
result_df = spark.sql(sql_query)
<span class="hljs-comment"># Show the result</span>
result_df.show()
</code></pre>
<h5 id="sample-output">Sample Output</h5>
<table>
<thead>
<tr>
<th>station_name</th>
<th>created</th>
<th>entries</th>
<th>exits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Central Station</td>
<td>2023-02-13 12:00:00</td>
<td>10000</td>
<td>5000</td>
</tr>
<tr>
<td>Times Square</td>
<td>2023-02-13 12:10:00</td>
<td>8000</td>
<td>3000</td>
</tr>
<tr>
<td>Union Square</td>
<td>2023-02-13 12:20:00</td>
<td>12000</td>
<td>7000</td>
</tr>
<tr>
<td>Grand Central</td>
<td>2023-02-13 12:30:00</td>
<td>9000</td>
<td>4000</td>
</tr>
<tr>
<td>Penn Station</td>
<td>2023-02-13 12:40:00</td>
<td>11000</td>
<td>6000</td>
</tr>
</tbody>
</table>
<p>By supporting multiple languages like PySpark and SQL, Apache Spark caters to a broader audience, making it easier for developers, data engineers, and data scientists to leverage its capabilities effectively. Apache Spark provides a unified and flexible platform for data modeling and transformation at scale.</p>
<h4 id="source-control-and-ci-cd">Source Control and CI/CD</h4>
<p>As we build code for our data model and transformation tasks, we need to track it, manage the different versions and automate the deployments to our data warehouse. Storing the source code on systems like GitHub offers several benefits that enhance governance, version control, collaboration, and continuous integration/continuous deployment (CI/CD) on a data engineering project. Some of these benefits include:</p>
<ul>
<li><p>Governance and Version Control for Data Models: <a href="https://github.com/">GitHub</a> provides version control, ensuring that all changes to data models are tracked, audited, and properly managed, ensuring compliance with regulatory requirements and business standards</p>
</li>
<li><p>CI/CD for Data Transformation: CI/CD pipelines ensure that changes to data transformation code are thoroughly tested and safely deployed, reducing errors and improving data accuracy</p>
</li>
<li><p>Collaboration and Teamwork on Data Assets: GitHub's collaborative features enable data engineers and analysts to work together on data models and transformations code</p>
</li>
<li><p>Reusability and Flexibility in Data Transformation: Storing data transformation code on GitHub promotes the reuse of code snippets and best practices across the data warehouse solution</p>
</li>
<li><p>Disaster Recovery and Redundancy: GitHub acts as a secure backup for data transformation logic, ensuring redundancy and disaster recovery capabilities. In case of any issues, the data transformation code can be restored, minimizing downtime and data inconsistencies</p>
</li>
</ul>
<p>In the context of a data warehouse solution, using GitHub, or similar systems, as a version control system for managing data models and transformation assets brings numerous advantages that improve governance, collaboration, and code quality. It ensures that the data warehouse solution remains agile, reliable, and capable of adapting to changes in business requirements and data sources.</p>
<h2 id="data-warehouse-implementation">Data Warehouse Implementation</h2>
<p>The data warehouse implementation is the stage where the conceptual data model and design plans are transformed into a functional system. During this critical phase, data engineers and architects convert the abstract data model into concrete structures, including dimension and fact tables, while also defining the data transformation tasks to cleanse, integrate, and load data into the data warehouse. This implementation process lays the foundation for data accessibility, efficiency, and accuracy, ensuring that the data warehouse becomes a reliable and valuable source of insights for analytical purposes. </p>
<h3 id="data-modeling">Data Modeling</h3>
<p>Data modeling is the implementation of the structure of the data warehouse: creating models (views) and entities (tables), defining attributes (columns), and establishing data relationships to ensure efficient querying and reporting. It is also important to identify the primary keys, foreign keys, and indexes to improve data retrieval performance. This is also the area where data needs to be normalized or denormalized based on query patterns and analytical needs.</p>
<p>When using the Star Schema model, we need to carefully understand the data, so we can identify the dimensions and fact tables that need to be created. Dimension tables represent descriptive attributes or context data (e.g., train stations, commuters), while fact tables contain quantitative data or measures (e.g., number of stations or passengers). Dimensions are used for slicing data, providing business context to the measures, whereas fact tables store numeric data that can be aggregated to derive KPIs (Key Performance Indicators).</p>
<p>To help us define the data models, we can follow these simple rules:</p>
<ul>
<li><p>Dimensions: Dimensions are textual, and categorical attributes that describe business entities. They are often discrete and used for grouping, filtering, and organizing data.</p>
</li>
<li><p>Fact Tables: Fact tables contain numeric data that can be aggregated. They hold the measurable data and are related to dimensions through foreign keys.</p>
</li>
<li><p>Measures: Measures are the quantitative values that are subject to calculations such as sum, average, minimum, maximum, etc. They represent the KPIs that organizations want to track and analyze.</p>
</li>
<li><p>ERD: Create an Entity Relationship Diagram to visualize the models and their relationships</p>
</li>
</ul>
<blockquote>
<p>π Simple Star Schema ERD with dimension and fact tables</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-star-schema.png" alt="ozkary-data-engineering-data-warehouse-star-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Star Schema"></p>
<p>For reporting and dashboards, additional models can be created to accelerate the data analysis process. This is usually done to capture common queries and abstract the join complexity with SQL views. Alternatively, data scientists can choose to connect directly to the entities and create their data models using their analytical tools, which handle the building of SQL queries. The approach really depends on the expertise of the team and the data modeling standards of the organization.</p>
<p>By defining clear dimension and fact tables with appropriate measures, a well-structured data model can enable effective analysis and visualization, supporting the generation of insightful KPIs for data-driven decision-making.</p>
<h3 id="data-transformation">Data Transformation</h3>
<p>The data transformation phase is a critical stage in a data warehouse project, where raw data is processed, cleansed, mapped to proper naming conventions, and loaded into the data warehouse to create a reliable dataset for analysis. Additionally, implementing incremental loads that continuously insert the new information since the last update via batch processes ensures that the data warehouse stays up-to-date with the latest data.</p>
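<p>As a hedged sketch of an incremental load, the statement below only inserts records that are newer than the latest row already in the fact table. It assumes BigQuery as the warehouse and reuses the illustrative table and column names from the earlier external-table sketch.</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# insert only the rows that arrived since the last load (illustrative names)
merge_sql = """
MERGE mta_data.fact_turnstile AS target
USING (
  SELECT station_name, created_datetime, entries, exits
  FROM mta_data.vw_turnstile
  WHERE created_datetime > (
    SELECT IFNULL(MAX(created_datetime), TIMESTAMP '1900-01-01')
    FROM mta_data.fact_turnstile)
) AS source
ON target.station_name = source.station_name
   AND target.created_datetime = source.created_datetime
WHEN NOT MATCHED THEN
  INSERT (station_name, created_datetime, entries, exits)
  VALUES (source.station_name, source.created_datetime, source.entries, source.exits)
"""

client.query(merge_sql).result()
</code></pre>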
<p>To help us define the data transformation tasks, we should do the following activities:</p>
<ul>
<li><p>Data Dictionary, Mapping and Transformation Rules: Develop a clear and comprehensive data dictionary and mapping document that outlines how source data fields correspond to target data warehouse tables and columns</p>
</li>
<li><p>Data Profiling: Identify data patterns, anomalies, and potential issues that need to be addressed during the transformation process, like removing null values, duplicates, and invalid data</p>
</li>
<li><p>Transformation Logic: Apply data transformation logic to standardize formats, resolve data inconsistencies, calculate derived measures, and define the incremental data rules</p>
</li>
<li><p>Data Validation and Testing: Validate the transformed data against predefined business rules and requirements to ensure its accuracy and alignment with expectations</p>
</li>
<li><p>Complete the Orchestration: Schedule the transformation tasks to automate the data loading process</p>
</li>
<li><p>Monitor and Operations: Monitor the transformation tasks to check for failures. Track incomplete data and notify the team of errors</p>
</li>
<li><p>Database Tuning: Involves making adjustments to the database system itself to optimize query execution and overall system performance.</p>
</li>
</ul>
<p>A well-executed implementation phase ensures that the data warehouse aligns with the business requirements and enables stakeholders to make informed decisions based on comprehensive and organized data, thus playing a fundamental role in the success of the overall data warehouse project.</p>
<h2 id="summary">Summary</h2>
<p>Before we can move data into a data warehouse system, we explore two pivotal phases for our data warehouse solution: design and implementation. In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse's implementation and operations. This stage ensures a solid infrastructure for data storage and management.</p>
<p>Moving on to the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. With this approach, we successfully complete the entire data pipeline and orchestration, seamlessly moving data from CSV files to the data warehouse. </p>
<h2 id="exercise-data-warehouse-model-and-transformation">Exercise - Data Warehouse Model and Transformation</h2>
<p>With a solid understanding of the data warehouse design and implementation, the next step is to put these concepts into practice through a hands-on exercise. In this lab, we build a cloud data warehouse system, applying the knowledge gained to create a powerful and efficient analytical platform.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/06/data-engineering-process-fundamentals-data-warehouse-transformation-exercise.html">Data Engineering Process Fundamentals - Data Warehouse Model and Transformation Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-59937635014228582562023-06-03T15:50:00.011-04:002023-06-29T16:00:34.435-04:00Azure OpenAI API Service with CSharp<p>The OpenAI Service is a cloud-based API that provides access to Large Language Models (LLM) and Artificial Intelligence (AI) Capabilities. This API allows developers to leverage the LLM models to create AI application that can perform Natural Language Processing (NLP) tasks such as text generation, code generation, language translation and others.</p>
<p>Azure provides the Azure OpenAI services which integrates the OpenAI API in Azure infrastructure. This enables us to create custom hosting resources and access the OpenAI API with a custom domain and deployment configuration. There are API client libraries to support different programming languages. To access the Azure OpenAI API using .NET, we could use the OpenAI .NET client library and access an OpenAI resource in Azure. As an alternative, we could use the HttpClient class from the System.Net.Http namespace and code the HTTP requests.</p>
<blockquote>
<p>π The OpenAI client libraries are available for Python, JavaScript, .NET, and Java</p>
</blockquote>
<p>In this article, we take a look at using the OpenAI API to generate code from a GitHub user story using an Azure OpenAI resource with the .NET client library. </p>
<blockquote>
<p>π An Azure OpenAI resource can be created by visiting <a href="https://oai.azure.com/portal">Azure OpenAI Portal</a></p>
</blockquote>
<p> <img src="//www.ozkary.dev/assets/2023/ozkary-openai-csharp-flow.png" alt="ozkary generate code from github user story"></p>
<h2 id="install-the-openai-api-client-dependencies">Install the OpenAI APi Client Dependencies</h2>
<p>To use the client library, we first need to install the dependencies and configure some environment variables. </p>
<pre><code>$ dotnet add package Azure<span class="hljs-selector-class">.AI</span><span class="hljs-selector-class">.OpenAI</span> --prerelease
</code></pre><h3 id="install-the-openai-dependencies-restoring-the-project-file-from-this-project">Install the OpenAI dependencies restoring the project file from this project</h3>
<ul>
<li>Clone this GitHub code repo: - <a href="https://github.com/ozkary/ai-engineering/tree/main/csharp/CodeGeneration">LLM Code Generation</a></li>
<li>Open a terminal and navigate to the CSharp folder<ul>
<li>Use the dotnet restore command when cloning the repository.</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-built_in">cd</span> csharp/CodeGeneration
$ dotnet <span class="hljs-built_in">restore</span>
</code></pre>
<p>This should download the code to your workstation.</p>
<h3 id="add-the-azure-openai-environment-configurations">Add the Azure OpenAI environment configurations</h3>
<p>Gather the following configuration information from GitHub and your Azure OpenAI resource.</p>
<blockquote>
<p>π This example uses a custom Azure OpenAI resource hosted at <a href="https://oai.azure.com/portal">Azure OpenAI Portal</a></p>
</blockquote>
<ul>
<li>GitHub Repo API Token with write permissions to push comments to an issue</li>
<li>Get an OpenAI API key</li>
<li>If you are using an Azure OpenAI resource, get your custom end-point and deployment<ul>
<li>The deployment should have the code-davinci-002 model</li>
</ul>
</li>
</ul>
<h3 id="set-the-linux-environment-variables-with-these-commands-">Set the linux environment variables with these commands:</h3>
<pre><code class="lang-bash">$ echo export AZURE_OpenAI_KEY=<span class="hljs-string">"OpenAI-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export GITHUB_TOKEN=<span class="hljs-string">"github-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_DEPLOYMENT=<span class="hljs-string">"deployment-name"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_ENDPOINT=<span class="hljs-string">"https://YOUR-END-POINT.OpenAI.azure.com/"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
</code></pre>
<h2 id="build-and-run-the-code">Build and Run the Code</h2>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>dotnet build
</code></pre>
<h3 id="describe-the-code">Describe the code</h3>
<p>The code should run this workflow:</p>
<ul>
<li>Get a list of open GitHub issues with the label user-story</li>
<li>Each issue content is sent to the OpenAI API to generate the code</li>
<li>The generated code is posted as a comment on the user-story for the developers to review</li>
</ul>
<blockquote>
<p>π The following code uses a simple API call implementation for the GitHub and OpenAI APIs. Use the code from this repo: - <a href="https://github.com/ozkary/ai-engineering/tree/main/csharp/CodeGeneration">LLM Code Generation</a></p>
</blockquote>
<pre><code class="lang-csharp"> <span class="hljs-comment">// Get environment variables</span>
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiApiKey = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_KEY"</span>) ?? String.Empty;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiBase = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_ENDPOINT"</span>) ?? String.Empty;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiEngine = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_DEPLOYMENT"</span>) ?? String.Empty;
<span class="hljs-comment">// GitHub API endpoint and authentication headers </span>
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> githubToken = Environment.GetEnvironmentVariable(<span class="hljs-string">"GITHUB_TOKEN"</span>) ?? String.Empty;
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> Process a GitHub issue by label.</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">ProcessIssueByLabel</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> repo, <span class="hljs-keyword">string</span> label</span>)
</span>{
<span class="hljs-keyword">try</span>
{
<span class="hljs-comment">// Get the issues from the repo</span>
<span class="hljs-keyword">var</span> @<span class="hljs-keyword">params</span> = <span class="hljs-keyword">new</span> Parameter { Label = label, State = <span class="hljs-string">"open"</span> };
List<Issue> issues = <span class="hljs-keyword">await</span> GitHubService.GetIssues(repo, @<span class="hljs-keyword">params</span>, githubToken);
<span class="hljs-keyword">if</span> (issues != <span class="hljs-literal">null</span>)
{
<span class="hljs-keyword">foreach</span> (<span class="hljs-keyword">var</span> issue <span class="hljs-keyword">in</span> issues)
{
<span class="hljs-comment">// Generate code using OpenAI</span>
Console.WriteLine(<span class="hljs-string">$"Generating code from GitHub issue: <span class="hljs-subst">{issue.title}</span> to <span class="hljs-subst">{openaiBase}</span>"</span>);
OpenAIService openaiService = <span class="hljs-keyword">new</span> OpenAIService(openaiApiKey, openaiBase, openaiEngine);
<span class="hljs-keyword">string</span> generatedCode = <span class="hljs-keyword">await</span> openaiService.Create(issue.body ?? String.Empty);
<span class="hljs-keyword">if</span> (!<span class="hljs-keyword">string</span>.IsNullOrEmpty(generatedCode))
{
<span class="hljs-comment">// Post a comment with the generated code to the GitHub issue</span>
<span class="hljs-keyword">string</span> comment = <span class="hljs-string">$"Generated code:\n\n```<span class="hljs-subst">{generatedCode}</span>\n```"</span>;
<span class="hljs-keyword">bool</span> commentPosted = <span class="hljs-keyword">await</span> GitHubService.PostIssueComment(repo, issue.number, comment, githubToken);
<span class="hljs-keyword">if</span> (commentPosted)
{
Console.WriteLine(<span class="hljs-string">"Code generated and posted as a comment on the GitHub issue."</span>);
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to post the comment on the GitHub issue."</span>);
}
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to generate code from the GitHub issue."</span>);
}
}
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to retrieve the GitHub issue."</span>);
}
}
<span class="hljs-keyword">catch</span> (Exception ex)
{
Console.WriteLine(<span class="hljs-string">$"Error: <span class="hljs-subst">{ex.Message}</span>"</span>);
}
}
</code></pre>
<p>The OpenAI service class handles the OpenAI API details. It takes default parameters for the model deployment (engine), temperature, and token limits, which control the cost and the amount of text (roughly four characters per token) that should be allowed. For this service, we use the "Completion" model, which allows developers to interact with OpenAI's language models and generate text-based completions.</p>
<pre><code class="lang-csharp">
<span class="hljs-keyword">internal</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OpenAIService</span>
{
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> apiKey;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> engine;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> endPoint;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">float</span> temperature;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> maxTokens;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> n;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> stop;
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> OpenAI client</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-keyword">private</span> OpenAIClient? client;
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">OpenAIService</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> apiKey, <span class="hljs-keyword">string</span> endPoint, <span class="hljs-keyword">string</span> engine = <span class="hljs-string">"text-davinci-003"</span>, <span class="hljs-keyword">float</span> temperature = <span class="hljs-number">0.5</span>f, <span class="hljs-keyword">int</span> maxTokens = <span class="hljs-number">350</span>, <span class="hljs-keyword">int</span> n = <span class="hljs-number">1</span>, <span class="hljs-keyword">string</span> stop = <span class="hljs-string">""</span></span>)
</span>{
<span class="hljs-comment">// Configure the OpenAI client with your API key and endpoint </span>
client = <span class="hljs-keyword">new</span> OpenAIClient(<span class="hljs-keyword">new</span> Uri(endPoint), <span class="hljs-keyword">new</span> AzureKeyCredential(apiKey));
<span class="hljs-keyword">this</span>.apiKey = apiKey;
<span class="hljs-keyword">this</span>.endPoint = endPoint;
<span class="hljs-keyword">this</span>.engine = engine;
<span class="hljs-keyword">this</span>.temperature = temperature;
<span class="hljs-keyword">this</span>.maxTokens = maxTokens;
<span class="hljs-keyword">this</span>.n = n;
<span class="hljs-keyword">this</span>.stop = stop;
}
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> Create a completion from a prompt</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task<<span class="hljs-keyword">string</span>> <span class="hljs-title">Create</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> prompt</span>)
</span>{
<span class="hljs-keyword">var</span> result = String.Empty;
<span class="hljs-keyword">if</span> (!String.IsNullOrEmpty(prompt) && client != <span class="hljs-literal">null</span>)
{
Response<Completions> completionsResponse = <span class="hljs-keyword">await</span> client.GetCompletionsAsync(engine, prompt);
Console.WriteLine(completionsResponse);
result = completionsResponse.Value.Choices[<span class="hljs-number">0</span>].Text.Trim();
Console.WriteLine(result);
}
<span class="hljs-keyword">return</span> result;
}
}
</code></pre>
<h3 id="run-the-code">Run the code</h3>
<p>After configuring your environment and downloading the code, we can run the code from a terminal by typing the following command from the project folder:</p>
<blockquote>
<p>π Make sure to enter your repo name and label your issues with either user-story or any other label you would rather use.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-comment">#</span> <span class="hljs-comment">dotnet</span> <span class="hljs-comment">run</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">repo</span> <span class="hljs-comment">ozkary/ai</span><span class="hljs-literal">-</span><span class="hljs-comment">engineering</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">label</span> <span class="hljs-comment">user</span><span class="hljs-literal">-</span><span class="hljs-comment">story</span>
</code></pre>
<p>After running the code successfully, we should be able to see the generated code as a comment on the GitHub issue.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ai-engineering-code-generated.png" alt="ozkary-ai-engineering-generate-code-from-user-stories" title="Generate Code from User Stories Back to GitHub"></p>
<h2 id="summary">Summary</h2>
<p>The Azure OpenAI Service provides a seamless integration of OpenAI models into the Azure platform, offering the benefits of Azure's security, compliance, management, and billing capabilities. On the other hand, using the OpenAI API directly allows for a more direct and independent integration with OpenAI services. It may be a preferable option if you have specific requirements, and you do not want to use Azure resources.</p>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-17482205044427293372023-05-27T11:22:00.009-04:002023-08-05T12:26:47.334-04:00Data Engineering Process Fundamentals - Pipeline and Orchestration Exercise<p>Once we have gained an understanding of data pipelines and their orchestration, along with the various programming options and technical tools at our disposal, we can proceed with the implementation and configuration of our own data pipeline. We have the flexibility to adopt either a code-centric approach, leveraging languages like Python, or a low-code approach, utilizing tools such as Azure Data Factory. This allows us to evaluate and compare the effectiveness of each approach based on our team's expertise and the operational responsibilities involved. Before diving into the implementation, let's first review our pipeline process to ensure a clear road map for our journey ahead.</p>
<h2 id="data-flow-process">Data Flow Process</h2>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="ozkary-data-engineering-pipeline-orchestration-flow" title="Data Engineering Process Fundamentals - Pipeline and Orchestration Flow"></p>
<p>Our basic data flow can be defined as the following:</p>
<ul>
<li>Define the date when a new CSV becomes available</li>
<li>Perform an HTTP Get request to download the CSV file for the selected date<ul>
<li>Example: <a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a></li>
</ul>
</li>
<li>Compress the text file and upload it in chunks to the data lake container</li>
</ul>
<p>After the file is copied to our data lake, the data transformation service picks up the file, identifies new data, and inserts it into the data warehouse. We will take a look at the Data Warehouse and Transformation services in the next step of the process. </p>
<blockquote>
<p>π Since a new file is available weekly, this data integration project fits into the batch processing model. For real-time scenarios, we should use data streaming technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> with <a href="https://spark.apache.org/">Apache Spark</a> </p>
</blockquote>
<h3 id="initial-data-load">Initial Data Load</h3>
<p>When there are requirements to load previous data, we first need to run a batch process to load all the previous months of data. Since the files are available weekly, we need to write code that can accept a date range, identify all the past Saturdays, and copy each file into our data lake. The process can be executed in parallel by running different years or months (if only one year is selected) in each process. This way, multiple threads can be used to copy the data, which should reduce the processing time.</p>
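<p>A minimal sketch of how the backfill could resolve every Saturday within a date range is shown below; the function name is illustrative and not part of the repo.</p>
<pre><code class="lang-python">from datetime import date, timedelta

def past_saturdays(start: date, end: date) -> list:
    """Return every Saturday between start and end (inclusive)."""
    offset = (5 - start.weekday()) % 7       # Monday=0 ... Saturday=5
    first = start + timedelta(days=offset)   # first Saturday on or after the start date
    return [first + timedelta(weeks=w)
            for w in range(((end - first).days // 7) + 1)
            if first <= end]

# example: all the Saturdays for the first quarter of 2023
for saturday in past_saturdays(date(2023, 1, 1), date(2023, 3, 31)):
    print(saturday.strftime("%y%m%d"))       # e.g. 230318, matching the file name suffix
</code></pre>
<p>Each resolved date can then be handed to a separate process or thread, which downloads and uploads its file in parallel.</p>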
<p>Moving forward, the process will target the specific date on which the file becomes available. The process does not allow the download of future data files, so attempts to pass future dates are rejected.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipepine-data-lake.png" alt="ozkary-data-engineering-data-lake-files" title="Data Engineering Process Fundamentals- Data Lake Files"></p>
<h3 id="weekly-automation">Weekly Automation</h3>
<p>Since the files are available on a weekly basis, we use a batch processing approach to process those files. For that, we create a scheduled job on our automation tool. This trigger should run on the day that the file is available, so a dynamic parameter can be created based on the current date value. The code can then parse this date and resolve the file name format to download the corresponding file.</p>
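<p>A small sketch of how the trigger date could be parsed into the weekly file URL is shown below; the helper function is hypothetical, but the URL format matches the example file above.</p>
<pre><code class="lang-python">from datetime import date

BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/"

def resolve_file_url(run_date: date) -> str:
    """Build the weekly turnstile file URL from the trigger date."""
    return f"{BASE_URL}turnstile_{run_date.strftime('%y%m%d')}.txt"

# a scheduled trigger can pass the current date as a dynamic parameter
print(resolve_file_url(date(2023, 3, 18)))
# -> http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt
</code></pre>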
<h3 id="monitor-the-jobs">Monitor the jobs</h3>
<p>It is very important to be able to monitor the jobs and create alerts in case there are failures. This should allow the teams to identify and address problems quickly. Therefore, it is important that we select a code-centric framework or a platform that provides an integrated monitoring and alerting system.</p>
<h2 id="programming-language-and-tooling">Programming Language and Tooling</h2>
<p>A code-centric data pipeline involves a higher coding effort, using a programming language, supporting libraries, and a cloud platform that enable us to quickly implement our pipelines and collect telemetry to monitor our jobs. In our case, Python provides a versatile and powerful programming language for building data pipelines, with various frameworks available to streamline the process. Three popular options for Python-based data pipelines are Prefect, Apache Airflow, and Apache Spark. </p>
<ul>
<li><p><a href="https://airflow.apache.org/">Apache Airflow</a> is a robust platform for creating, scheduling, and monitoring complex workflows. It uses Directed Acyclic Graphs (DAGs) to define pipelines and supports a rich set of operators for different data processing tasks.</p>
</li>
<li><p><a href="https://spark.apache.org/">Apache Spark</a> is a distributed data processing engine that provides high-speed data processing capabilities. It supports complex transformations, real-time streaming, and advanced analytics, making it suitable for large-scale data processing.</p>
</li>
<li><p><a href="https://www.prefect.io/">Prefect</a> is a modern workflow management system that enables easy task scheduling, dependency management, and error handling. It emphasizes code-driven workflows and offers a user-friendly interface.</p>
</li>
</ul>
<p>For low-code efforts, <a href="https://azure.microsoft.com/en-us/products/data-factory/">Azure Data Factory</a> is a cloud-based data integration service provided by Microsoft. It offers a visual interface for building and orchestrating data pipelines, making it suitable for users with less coding experience.</p>
<blockquote>
<p>π There are several platforms for low-code solutions. Some of them provide a total enterprise turn-key solution to build the entire pipeline and orchestration. These platforms, however, come at a higher financial cost.</p>
</blockquote>
<p>When choosing between these options, we should consider factors such as the complexity of the pipeline, scalability requirements, ease of use, and integration with other tools and systems. Each framework has its strengths and use cases, so selecting the most suitable one depends on your specific project needs.</p>
<h2 id="pipeline-implementation-requirements">Pipeline Implementation Requirements</h2>
<p>For our example, we will take on a code-centric approach and use Python as our programming language. In addition, we use the Prefect libraries and cloud services to manage the orchestration. After we are done with the code-centric approach, we take a look at using a low-code approach with Azure Data Factory, so we can compare between the two different approaches. </p>
<p>Before we get started, we need to setup our environment with all the necessary dependencies.</p>
<h3 id="requirements">Requirements</h3>
<ul>
<li>Docker and Docker Hub<ul>
<li><a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Docker">Install Docker</a></li>
<li><a href="https://hub.docker.com/">Create a Docker Hub Account</a></li>
</ul>
</li>
<li>Prefect dependencies and cloud account<ul>
<li>Install the Prefect Python library dependencies</li>
<li><a href="https://www.prefect.io/">Create a Prefect Cloud account</a></li>
</ul>
</li>
<li>Data Lake for storage<ul>
<li>Make sure to have the storage account and access ready</li>
</ul>
</li>
</ul>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step3-Orchestration/" target="_pipeline">Clone this repo or copy the files from this folder
</a></p>
<h3 id="prefect-configuration">Prefect Configuration</h3>
<ul>
<li>Install the Python libraries using the requirements file from the repo</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd Step3-Orchestration
<span class="hljs-variable">$ </span>pip install -r prefect-requirements.txt
</code></pre>
<ul>
<li>Make sure to run the terraform script to build the VM, Data lake and BigQuery resources as shown on the Design and Planning exercise</li>
<li>Copy the GCP credentials file to follow this format</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-keyword">cd</span> ~ && <span class="hljs-built_in">mkdir</span> -<span class="hljs-keyword">p</span> ~/.gcp/
$ <span class="hljs-keyword">cp</span> <path <span class="hljs-keyword">to</span> JSON <span class="hljs-keyword">file</span>> ~/.gcp/credentials.json
</code></pre>
<h4 id="create-the-prefect-cloud-account">Create the PREFECT Cloud Account</h4>
<blockquote>
<p>π Log in to Prefect Cloud. API keys can be created from the user profile configuration (click your profile picture)</p>
</blockquote>
<ul>
<li><p>From a terminal, login with Prefect cloud to host the blocks, deployments, and dashboards on the Cloud</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">cloud </span>login
<span class="hljs-comment"># or use an API key to login instead</span>
<span class="hljs-comment"># prefect cloud login -k API_KEY_FROM_PREFECT</span>
</code></pre>
<p>The login creates a key file ~/.prefect/profiles.toml which the framework looks for to authenticate the pipeline.</p>
</li>
<li><p>Install the Prefect code blocks dependencies and run the "block ls" command to check that there are none installed</p>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">block </span>register -m <span class="hljs-keyword">prefect_gcp
</span>$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">block </span>ls
</code></pre>
<h3 id="list-of-resources-that-are-needed">List of resources that are needed</h3>
<p> These are the resource names that are used by the code. </p>
<ul>
<li>Data lake name<ul>
<li>mta_data_lake</li>
</ul>
</li>
<li>Prefect Account block name<ul>
<li>blk-gcp-svc-acc</li>
</ul>
</li>
<li>Prefect GCS (storage) block name<ul>
<li>blk-gcs_name</li>
</ul>
</li>
<li>Prefect Deployments<ul>
<li>dep-docker-mta </li>
</ul>
</li>
<li>Docker container name after pushing to Docker Hub<ul>
<li>ozkary/prefect:mta-de-101</li>
</ul>
</li>
</ul>
<h2 id="review-the-code">Review the Code</h2>
<p>After setting up all the dependencies, we can move forward to look at the actual code. We can start by reviewing the code blocks or components. We can then view the actual pipeline code, and how it is wired, so we can enable the flow telemetry in our pipeline.</p>
<h3 id="code-blocks-or-components">Code Blocks or Components</h3>
<blockquote>
<p>π Blocks are secured and reusable components that manage a single technical concern and can be used by our applications</p>
</blockquote>
<h4 id="credentials-component">Credentials Component</h4>
<p>Since we need secured access to cloud resources, we first need to create a credentials component to store the cloud key file. We can then use this component in other areas of the code whenever we need to do a cloud operation. The save operation done by the code pushes the component to the cloud, so it is centralized.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> prefect_gcp <span class="hljs-keyword">import</span> GcpCredentials
<span class="hljs-comment"># insert your own service_account_file path or service_account_info dictionary from the json file</span>
<span class="hljs-comment"># IMPORTANT - do not store credentials in a publicly available repository!</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""entry point to create the prefect block for GCP service account"""</span>
gcp_file_path = params.file_path
account_block_name = params.gcp_acc_block_name
file_handle = Path(gcp_file_path) <span class="hljs-comment">#.read_text()</span>
print(file_handle.read_text())
<span class="hljs-keyword">if</span> file_handle.exists() :
content = file_handle.read_text()
<span class="hljs-keyword">if</span> content :
credentials_block = GcpCredentials(
service_account_info=content <span class="hljs-comment"># set the file credential</span>
)
credentials_block.save(account_block_name, overwrite=<span class="hljs-keyword">True</span>)
print(<span class="hljs-string">'block was saved'</span>)
<span class="hljs-keyword">else</span>:
print(F<span class="hljs-string">'{gcp_file_path} not found'</span>)
os.system(<span class="hljs-string">'prefect block ls'</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable Credential block'</span>)
parser.add_argument(<span class="hljs-string">'--file_path'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'key file path for the service account'</span>)
parser.add_argument(<span class="hljs-string">'--gcp_acc_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'prefect block name to hold the service account setting'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="cloud-storage-component">Cloud Storage Component</h4>
<p>The cloud storage component enables us to reuse the credentials component, so applications can be authenticated and authorized to access it. This component also has support for uploading files to the storage container, thus simplifying our code. Similar to the credentials component, this component is saved on the cloud.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect_gcp <span class="hljs-keyword">import</span> GcpCredentials
<span class="hljs-keyword">from</span> prefect_gcp.cloud_storage <span class="hljs-keyword">import</span> GcsBucket
<span class="hljs-comment"># insert your own service_account_file path or service_account_info dictionary from the json file</span>
<span class="hljs-comment"># IMPORTANT - do not store credentials in a publicly available repository!</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""entry point to create the prefect block for GCS"""</span>
account_block_name = params.gcp_acc_block_name
gcs_bucket_name = params.gcs_bucket_name
gcs_block_name = params.gcs_block_name
credentials = GcpCredentials.load(account_block_name)
<span class="hljs-keyword">if</span> credentials :
bucket_block = GcsBucket(
gcp_credentials=credentials,
bucket=gcs_bucket_name <span class="hljs-comment"># insert your GCS bucket name</span>
)
<span class="hljs-comment"># save the bucket</span>
bucket_block.save(gcs_block_name, overwrite=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Ingest CSV data to storage'</span>)
parser.add_argument(<span class="hljs-string">'--gcp_acc_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'prefect block name which holds the service account'</span>)
parser.add_argument(<span class="hljs-string">'--gcs_bucket_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'GCS bucket name'</span>)
parser.add_argument(<span class="hljs-string">'--gcs_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'GCS block name'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="docker-container-component">Docker Container Component</h4>
<p>Since we are running our pipeline in a Docker container, we also want to write a component that manages that technical concern. This allows us to pull the Docker image from Docker Hub when we are ready to deploy and run the pipeline. We will learn more about deployments as we create our Docker deployment definition.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect.infrastructure.docker <span class="hljs-keyword">import</span> DockerContainer
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a Docker prefect block"""</span>
block_name = params.block_name
image_name = params.image_name
<span class="hljs-comment"># alternative to creating DockerContainer block in the UI</span>
docker_block = DockerContainer(
image=image_name, <span class="hljs-comment"># insert your image here</span>
image_pull_policy=<span class="hljs-string">"ALWAYS"</span>,
auto_remove=<span class="hljs-keyword">True</span>,
)
docker_block.save(block_name, overwrite=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable Docker image block from Docker Hub'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect block name'</span>)
parser.add_argument(<span class="hljs-string">'--image_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Docker image name used when the image was build'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h3 id="deployments">Deployments</h3>
<p>Cloud deployments are used to deploy and manage pipelines in a production environment. Deployments provide a centralized way to run and monitor pipelines across multiple execution environments, such as local machines, cloud-based infrastructure, and on-premises clusters. </p>
<h4 id="docker-deployment">Docker Deployment</h4>
<p>With a deployment definition, we can associate a Docker image that is hosted on Docker Hub with the pipeline. This enables us to automate the deployment of that image to other environments when we are ready to run the pipeline. The code below associates a Docker component with a deployment definition on the cloud. It also defines the main flow entry point (main_flow) from the etl_web_to_gcs.py file, so it can be easily executed as a scheduled task from the terminal.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> prefect.deployments <span class="hljs-keyword">import</span> Deployment
<span class="hljs-keyword">from</span> prefect.infrastructure.docker <span class="hljs-keyword">import</span> DockerContainer
sys.path.append(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>, <span class="hljs-string">'flows'</span>))
<span class="hljs-keyword">from</span> etl_web_to_gcs <span class="hljs-keyword">import</span> main_flow
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a prefect deployment"""</span>
block_name = params.block_name
deploy_name = params.deploy_name
<span class="hljs-comment"># use the prefect block name for the container</span>
docker_block = DockerContainer.load(block_name)
docker_dep = Deployment.build_from_flow(
flow=main_flow,
name=deploy_name,
infrastructure=docker_block
)
docker_dep.apply()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable prefect deployment script'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect Docker block name'</span>)
parser.add_argument(<span class="hljs-string">'--deploy_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect deployment name'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="github-deployment">GitHub Deployment</h4>
<p>In cases where a Docker image is not used, we can instead create a deployment definition that uses GitHub as the code storage. This allows us to download the code to other environments, where the dependencies need to be installed before running the code. The build_from_flow operation defines which file and which entry point (function) of that file to use. In this example, we are using the etl_web_to_gcs.py file and the main_flow function.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect.deployments <span class="hljs-keyword">import</span> Deployment
<span class="hljs-keyword">from</span> etl_web_to_gcs <span class="hljs-keyword">import</span> main_flow
<span class="hljs-keyword">from</span> prefect.filesystems <span class="hljs-keyword">import</span> GitHub
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a prefect deployment with github"""</span>
block_name = params.block_name
deploy_name = params.deploy_name
github_path = params.github_path
github_block = GitHub.load(block_name)
deployment = Deployment.build_from_flow(
flow=main_flow,
name=deploy_name,
storage=github_block,
entrypoint=f<span class="hljs-string">"{github_path}/etl_web_to_gcs.py:main_flow"</span>)
deployment.apply()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable prefect deployment script'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Github block name'</span>)
parser.add_argument(<span class="hljs-string">'--deploy_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect deployment name'</span>)
parser.add_argument(<span class="hljs-string">'--github_path'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Github folder path where the pipeline file is located'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h3 id="pipeline-flows-and-tasks">Pipeline Flows and Tasks</h3>
<p>A pipeline is implemented by defining flows and tasks, which can be written in Python, C#, or other languages. Flows are composed of multiple tasks and define the sequence and dependencies between them. A flow is marked with the @flow function decorator (or attribute), which is specific to the Python library we are using here (Prefect). The decorator also allows us to define the flow's name, description, and other attributes, such as the number of retries in case of failures.</p>
<p>Tasks are defined with the @task function decorator (or attribute). Tasks are individual units of work that can be combined to form a data pipeline. They represent the different steps or operations that need to be performed within a workflow, and each task is responsible for executing a specific action or computation.</p>
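<p>As a quick illustration of those decorator attributes, here is a minimal, hypothetical sketch (not part of the project code) that shows how a task and a flow can declare a name, description, and retry settings with Prefect:</p>
<pre><code class="lang-python">from prefect import flow, task

@task(name="download_file", description="Download a source file", retries=3, retry_delay_seconds=60)
def download_file(url: str) -> str:
    # the download logic would go here; the return value is illustrative
    return "local_path.csv.gz"

@flow(name="sample_flow", description="Minimal flow showing retry settings", retries=1)
def sample_flow(url: str) -> None:
    # the flow coordinates the task calls and their dependencies
    local_path = download_file(url)
    print(local_path)

if __name__ == '__main__':
    sample_flow("https://example.com/source-file")
</code></pre>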
<p>In our example, we have the main_flow function, which uses another flow (etl_web_to_local) to handle the file download from the Web to local storage. The main flow also uses tasks to handle the input validation and file name formatting to make sure the values match the specific dates for which a new CSV file is available for download. Finally, there is a task to write a compressed CSV file to the data lake using our components.</p>
<p>By putting together flows and tasks that handle a specific workflow, we build a pipeline that enables us to download files into our data lake. At the same time, by using those function decorators, we are enabling the Prefect framework to use its internal classes to track telemetry information for each flow and task in our pipeline, which enables us to monitor and track failures at a specific point in the pipeline. Let's see what our pipeline implementation looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect_gcp.cloud_storage <span class="hljs-keyword">import</span> GcsBucket
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List
<span class="hljs-comment"># from prefect.tasks import task_input_hash</span>
<span class="hljs-keyword">from</span> settings <span class="hljs-keyword">import</span> get_block_name, get_min_date, get_max_date, get_prefix, get_url
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, date
<span class="hljs-meta">@task(name="write_gcs", description='Write file gcs', log_prints=False)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_gcs</span><span class="hljs-params">(local_path: Path, file_name: str, prefix: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Upload the local compressed CSV file to GCS
Args:
path: File location
prefix: the folder location on storage
"""</span>
block_name = get_block_name()
gcs_path = f<span class="hljs-string">'{prefix}/{file_name}.csv.gz'</span>
print(f<span class="hljs-string">'{block_name} {local_path} {gcs_path}'</span>)
gcs_block = GcsBucket.load(block_name)
gcs_block.upload_from_path(from_path=local_path, to_path=gcs_path)
<span class="hljs-keyword">return</span>
<span class="hljs-meta">@task(name='write_local', description='Writes the file into a local folder')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_local</span><span class="hljs-params">(df: pd.DataFrame, folder: str, file_path: Path)</span> -> Path:</span>
<span class="hljs-string">"""
Write DataFrame out locally as csv file
Args:
df: dataframe chunk
folder: the download data folder
file_path: the local file path
"""</span>
path = Path(folder)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(path):
path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
df = df.rename(columns={<span class="hljs-string">'C/A'</span>: <span class="hljs-string">'CA'</span>})
df = df.rename(columns=<span class="hljs-keyword">lambda</span> x: x.strip().replace(<span class="hljs-string">' '</span>, <span class="hljs-string">''</span>))
<span class="hljs-comment"># df = df.rename_axis('row_no').reset_index()</span>
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.isfile(file_path):
df.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
<span class="hljs-comment"># df.to_parquet(file_path, compression="gzip", engine='fastparquet')</span>
print(<span class="hljs-string">'new file'</span>, flush=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">else</span>:
df.to_csv(file_path, header=<span class="hljs-keyword">None</span>, compression=<span class="hljs-string">"gzip"</span>, mode=<span class="hljs-string">"a"</span>)
<span class="hljs-comment"># df.to_parquet(file_path, compression="gzip", engine='fastparquet', append=True) </span>
print(<span class="hljs-string">'chunk appended'</span>, flush=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-meta">@flow(name='etl_web_to_local', description='Download MTA File in chunks')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_web_to_local</span><span class="hljs-params">(name: str, prefix: str)</span> -> Path:</span>
<span class="hljs-string">"""
Download a file
Args:
name : the file name
prefix: the file prefix
"""</span>
<span class="hljs-comment"># skip an existent file</span>
path = f<span class="hljs-string">"//www.ozkary.dev/data/"</span>
file_path = Path(f<span class="hljs-string">"{path}/{name}.csv.gz"</span>)
<span class="hljs-keyword">if</span> os.path.exists(file_path):
print(f<span class="hljs-string">'{name} already processed'</span>)
<span class="hljs-keyword">return</span> file_path
url = get_url()
file_url = f<span class="hljs-string">'{url}/{prefix}_{name}.txt'</span>
print(file_url)
<span class="hljs-comment"># os.system(f'wget {url} -O {name}.csv')</span>
<span class="hljs-comment"># return</span>
df_iter = pd.read_csv(file_url, iterator=<span class="hljs-keyword">True</span>, chunksize=<span class="hljs-number">5000</span>)
<span class="hljs-keyword">if</span> df_iter:
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
write_local(df, path, file_path)
<span class="hljs-keyword">except</span> StopIteration <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Finished reading file {ex}"</span>)
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was downloaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">"dataframe failed"</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-meta">@task(name='get_file_date', description='Resolves the last file drop date') </span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_file_date</span><span class="hljs-params">(curr_date: date = date.today<span class="hljs-params">()</span>)</span> -> str:</span>
<span class="hljs-keyword">if</span> curr_date.weekday() != <span class="hljs-number">5</span>:
days_to_sat = (curr_date.weekday() - <span class="hljs-number">5</span>) % <span class="hljs-number">7</span>
curr_date = curr_date - timedelta(days=days_to_sat)
year_tag = str(curr_date.year)[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>]
file_name = f<span class="hljs-string">'{year_tag}{curr_date.month:02}{curr_date.day:02}'</span>
<span class="hljs-keyword">return</span> file_name
<span class="hljs-meta">@task(name='get_the_file_dates', description='Downloads the file in chunks')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_the_file_dates</span><span class="hljs-params">(year: int, month: int, day: int = <span class="hljs-number">1</span>, limit: bool = True )</span> -> List[str]:</span>
<span class="hljs-string">"""
Process all the Saturdays of the month
Args:
year : the selected year
month : the selected month
day: the file day
"""</span>
date_list = []
curr_date = date(year, month, day)
<span class="hljs-keyword">while</span> curr_date.month == month <span class="hljs-keyword">and</span> curr_date <= date.today():
<span class="hljs-comment"># print(f'Current date {curr_date}') </span>
<span class="hljs-keyword">if</span> curr_date.weekday() == <span class="hljs-number">5</span>:
<span class="hljs-comment"># add the date filename format yyMMdd</span>
year_tag = str(curr_date.year)[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>]
file_name = f<span class="hljs-string">'{year_tag}{curr_date.month:02}{curr_date.day:02}'</span>
date_list.append(file_name)
curr_date = curr_date + timedelta(days=<span class="hljs-number">7</span>)
<span class="hljs-keyword">if</span> limit:
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">else</span>:
<span class="hljs-comment"># find next week</span>
days_to_sat = (<span class="hljs-number">5</span> - curr_date.weekday()) % <span class="hljs-number">7</span>
curr_date = curr_date + timedelta(days=days_to_sat)
<span class="hljs-keyword">return</span> date_list
<span class="hljs-meta">@task(name='valid_task', description='Validate the tasks input parameter')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">valid_task</span><span class="hljs-params">(year: int, month: int, day: int = <span class="hljs-number">1</span>)</span> -> bool:</span>
<span class="hljs-string">"""
Validates the input parameters for the request
Args:
year : the selected year
month : the selected month
day: file day
"""</span>
isValid = <span class="hljs-keyword">False</span>
<span class="hljs-keyword">if</span> month > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> month < <span class="hljs-number">13</span>:
curr_date = date(year, month, day)
min_date = get_min_date()
max_date = get_max_date()
isValid = curr_date >= min_date <span class="hljs-keyword">and</span> curr_date < max_date <span class="hljs-keyword">and</span> curr_date <= date.today()
print(f<span class="hljs-string">'task request status {isValid} input {year}-{month}'</span>)
<span class="hljs-keyword">return</span> isValid
<span class="hljs-meta">@flow (name="MTA Batch flow", description="MTA Multiple File Batch Data Flow. Defaults to the last Saturday date")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(year: int = <span class="hljs-number">0</span> , month: int = <span class="hljs-number">0</span>, day: int = <span class="hljs-number">0</span>, limit_one: bool = True)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Entry point to download the data
"""</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># if no params provided, resolve to the last saturday </span>
file_list: List[str] = []
<span class="hljs-keyword">if</span> (year == <span class="hljs-number">0</span>):
file_dt = get_file_date()
file_list.append(file_dt)
<span class="hljs-keyword">elif</span> valid_task(year, month, day):
file_list = get_the_file_dates(year, month, day, limit_one)
prefix = get_prefix()
<span class="hljs-keyword">for</span> file_name <span class="hljs-keyword">in</span> file_list:
print(file_name)
local_file_path = etl_web_to_local(file_name, prefix)
write_gcs(local_file_path, file_name, prefix)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"error found {ex}"</span>)
</code></pre>
<h4 id="function-decorators">Function Decorators</h4>
<p>In some programming languages, we can create function decorators or attributes that enables to enhance a specific function without altering its purpose. In Python, this can be done by defining a class with a <code>__call__</code> method, which allows instances of the class to be callable like functions. Within the <code>__call__</code> method, logic can be implemented to track telemetry data and then return the original function unchanged. Here's an example of a simple telemetry function decorator class:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TelemetryDecorator</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, tracking_type)</span></span>:
<span class="hljs-keyword">self</span>.tracking_type = tracking_type
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__call__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, func)</span></span>:
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">wrapped_func</span><span class="hljs-params">(*args, **kwargs)</span></span>:
<span class="hljs-comment"># Track telemetry data here</span>
print(f<span class="hljs-string">"Tracking {self.tracking_type} for function {func.__name__}"</span>)
<span class="hljs-comment"># Call the original function with its parameters</span>
<span class="hljs-keyword">return</span> func(*args, **kwargs)
<span class="hljs-keyword">return</span> wrapped_func
<span class="hljs-comment"># Usage example:</span>
@TelemetryDecorator(tracking_type=<span class="hljs-string">"performance"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">my_task</span><span class="hljs-params">(x, y)</span></span>:
<span class="hljs-keyword">return</span> x + y
result = my_task(<span class="hljs-number">3</span>, <span class="hljs-number">5</span>)
</code></pre>
<h2 id="how-to-run-it">How to Run It</h2>
<p>After installing the pre-requisites and reviewing the code, we are ready to run our pipeline and set up our orchestration by configuring our components, deployment image and scheduling the runs. </p>
<h3 id="install-the-code-blocks-or-components-for-our-credentials-and-data-lake-access">Install the code blocks or components for our credentials and data lake access</h3>
<p>We should first authenticate our terminal with the cloud instance, which enables us to call the APIs needed to register our components. Next, we register the block dependencies. From the blocks folder, we register our components by running the Python scripts. We can then run the "block ls" command to see the components that have been registered.</p>
<blockquote>
<p>π Components are a secure way to store and retrieve the credentials and secrets that are used by your applications.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>prefect cloud login
<span class="hljs-variable">$ </span>prefect block register -m prefect_gcp
<span class="hljs-variable">$ </span>cd ./blocks
<span class="hljs-variable">$ </span>python3 gcp_acc_block.py --file_path=~<span class="hljs-regexp">/.gcp/credentials</span>.json --gcp_acc_block_name=blk-gcp-svc-acc
<span class="hljs-variable">$ </span>python3 gcs_block.py --gcp_acc_block_name=blk-gcp-svc-acc --gcs_bucket_name=mta_data_lake --gcs_block_name=blk-gcs-name
<span class="hljs-variable">$ </span>prefect block ls
</code></pre>
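<p>The gcp_acc_block.py and gcs_block.py scripts referenced above live in the project repo and are not listed here. As a rough sketch of what such a registration script typically looks like with the prefect_gcp library (the block and bucket names match the commands above, but treat the code as illustrative rather than the repo's exact implementation):</p>
<pre><code class="lang-python">from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

def register_blocks(credentials_file: str, acc_block_name: str, bucket_name: str, gcs_block_name: str) -> None:
    # register the service account credentials as a reusable block
    credentials = GcpCredentials(service_account_file=credentials_file)
    credentials.save(acc_block_name, overwrite=True)

    # register the GCS bucket block using the saved credentials block
    gcs_bucket = GcsBucket(bucket=bucket_name, gcp_credentials=GcpCredentials.load(acc_block_name))
    gcs_bucket.save(gcs_block_name, overwrite=True)

if __name__ == '__main__':
    register_blocks('~/.gcp/credentials.json', 'blk-gcp-svc-acc', 'mta_data_lake', 'blk-gcs-name')
</code></pre>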
<h3 id="create-a-docker-image-and-push-to-docker-hub">Create a docker image and push to Docker Hub</h3>
<p>We are adding our Python scripts to a Docker container, so we can create and push the image (ozkary/prefect:mta-de-101) to Docker Hub. This enables us to later create a deployment definition that references that image, so it can be downloaded from a centralized hub location to one or more environments.</p>
<blockquote>
<p>π Make sure to run the Docker build command from the directory where the Dockerfile is located, or use -f with the file path. Ensure Docker is also running.</p>
</blockquote>
<pre><code>$ docker login --username <span class="hljs-keyword">USER</span> <span class="hljs-title">--password</span> PW
$ docker image build -t ozkary/prefect:mta-de-<span class="hljs-number">101</span> .
$ docker image push ozkary/prefect:mta-de-<span class="hljs-number">101</span>
</code></pre><p>The Dockerfile defines the base image with Python already installed. We also copy a requirements file, which contains additional dependencies that need to be installed in the container image. Finally, we copy our code into the container, so when it runs, it is able to find the pipeline entry point, main_flow.</p>
<pre><code class="lang-yml"><span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-keyword">COPY</span><span class="bash"> docker-requirements.txt .
</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r docker-requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /opt/prefect/data/
</span><span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /opt/prefect/flows/
</span>
<span class="hljs-keyword">COPY</span><span class="bash"> flows opt/prefect/flows
</span><span class="hljs-keyword">COPY</span><span class="bash"> data opt/prefect/data</span>
</code></pre>
<h3 id="create-the-prefect-block-with-the-docker-image">Create the prefect block with the docker image</h3>
<p>After creating the Docker image, we can register the Docker component (blk-docker-mta-de-101) with the image name reference, which is what allows us to pull that image from Docker Hub during a new deployment.</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">cd</span> ./blocks
$ <span class="hljs-keyword">python3</span> docker_block.<span class="hljs-keyword">py</span> --block_name=blk-docker-mta-de-<span class="hljs-number">101</span> --image_name=ozkary/prefec<span class="hljs-variable">t:mta</span>-de-<span class="hljs-number">101</span>
</code></pre>
<h3 id="create-the-deployment-with-the-docker-image">Create the deployment with the docker image</h3>
<p>We can now configure a cloud deployment by running our deployment definition file (docker_deploy_etl_web_to_gcs.py). For this configuration, we associate the Docker component (blk-docker-mta-de-101) with our definition. The configuration uses the component, which in turn defines where to get the Docker image from. We also set up a cron job to schedule the deployment to run on Saturdays at 9am. This scheduling of the deployments is an orchestration task. To verify everything is configured properly, we list the deployment configurations by running the "deployment ls" command. The listing of the deployments also enables us to confirm the deployment name and id, which can be used when we test run the deployment.</p>
<pre><code class="lang-bash">$ <span class="hljs-string">cd </span>./<span class="hljs-string">deployments
</span>$ <span class="hljs-string">python3 </span><span class="hljs-string">docker_deploy_etl_web_to_gcs.</span><span class="hljs-string">py </span><span class="hljs-built_in">--block_name=blk-docker-mta-de-101</span> <span class="hljs-built_in">--deploy_name=dep-docker-mta-de-101</span>
$ <span class="hljs-string">prefect </span><span class="hljs-string">deployments </span><span class="hljs-string">build </span><span class="hljs-string">etl_web_to_gcs.</span><span class="hljs-string">py:main_flow </span><span class="hljs-built_in">--name</span> <span class="hljs-string">dep-docker-</span><span class="hljs-string">mta-de-</span><span class="hljs-string">101 </span><span class="hljs-built_in">--tag</span> <span class="hljs-string">mta </span><span class="hljs-built_in">--work-queue</span> <span class="hljs-string">default </span><span class="hljs-built_in">--cron</span> <span class="hljs-string">'0 9 * * 6'</span>
$ <span class="hljs-string">prefect </span><span class="hljs-string">deployments </span><span class="hljs-string">ls</span>
</code></pre>
<blockquote>
<p>π Scheduled jobs can also be managed from the cloud dashboards</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-job.png" alt="ozkary-data-engineering-pipeline-jobs" title="Data Engineering Process Fundamentals- Pipeline Jobs"></p>
<h3 id="start-the-prefect-agent">Start the Prefect agent</h3>
<p>The agent should be running so that the scheduled deployments can be executed. If the Docker image has not been downloaded yet, the agent pulls it before executing the code.</p>
<pre><code class="lang-bash">$ prefect agent start -q <span class="hljs-keyword">default</span>
</code></pre>
<h3 id="test-run-the-prefect-deployments-with-the-docker-image">Test run the prefect deployments with the docker image</h3>
<p>This next command will download the Docker image and run the entry point, main_flow. The additional parameters are also provided, so the pipeline can download the file for the specified year, month and day.</p>
<pre><code class="lang-bash">$ prefect deployment <span class="hljs-keyword">run</span><span class="bash"> <span class="hljs-string">"MTA Batch flow/dep-docker-mta-de-101"</span> -p <span class="hljs-string">"year=2023 month=3 day=25"</span></span>
</code></pre>
<h3 id="manual-test-run-can-be-done-from-a-terminal">Manual test run can be done from a terminal</h3>
<p>A manual test run can also be executed from the command line to help us identify any possible bugs without having to run the app from the container. Run the code directly from the terminal by typing this command:</p>
<pre><code class="lang-bash"><span class="hljs-comment">$</span> <span class="hljs-comment">python3</span> <span class="hljs-comment">etl_web_to_gcs</span><span class="hljs-string">.</span><span class="hljs-comment">py</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">year</span> <span class="hljs-comment">2023</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">month</span> <span class="hljs-comment">5</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">day</span> <span class="hljs-comment">6</span>
</code></pre>
<h3 id="see-the-flow-runs-from-the-cli">See the flow runs from the CLI</h3>
<p>To check the actual flow runs, we can use the "flow-run ls" command. This should show the date and time when the flow has been executed.</p>
<pre><code class="lang-bash">$ prefect flow-<span class="hljs-keyword">run</span><span class="bash"> ls</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-console-flows.png" alt="ozkary-data-engineering-prefect-flow-run" title="Data Engineering Process Fundamentals- Pipeline Runs CLI"></p>
<blockquote>
<p>π Flow runs can also be visualized from the cloud dashboards. To get more telemetry details about the pipeline, we can look at the flow dashboards on the cloud.</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-dashboard-runs.png" alt="ozkary-data-engineering-prefect-flow-run" title="Data Engineering Process Fundamentals- Pipeline Runs Dashboard"></p>
<h3 id="github-action-to-build-and-deploy-the-docker-image-to-docker-hub">GitHub Action to build and deploy the Docker image to Docker Hub</h3>
<p>So far, we have shown how to build and push our Docker images via the CLI. A more mature approach is to enable that process in a deployment pipeline. With GitHub, we have CI/CD pipelines that can automate this process and that are triggered when a change is made to the code and a pull request (PR) is merged into the branch. This is called a GitHub Action. A simple script to handle that automation is shown below:</p>
<pre><code class="lang-yml">
<span class="hljs-attribute">name</span>: Build and Push Docker Image
<span class="yaml"><span class="hljs-attr">on:</span>
<span class="hljs-attr"> push:</span>
<span class="hljs-attr"> branches:</span>
<span class="hljs-bullet"> -</span> main
<span class="hljs-attr">jobs:</span>
<span class="hljs-attr"> build-and-push:</span>
<span class="hljs-attr"> runs-on:</span> ubuntu-latest
<span class="hljs-attr"> steps:</span>
<span class="hljs-attr"> - name:</span> Checkout repository
<span class="hljs-attr"> uses:</span> actions/checkout@v2
<span class="hljs-attr"> - name:</span> Set up Docker Buildx
<span class="hljs-attr"> uses:</span> docker/setup-buildx-action@v1
<span class="hljs-attr"> - name:</span> Login to Docker Hub
<span class="hljs-attr"> uses:</span> docker/login-action@v1
<span class="hljs-attr"> with:</span>
<span class="hljs-attr"> username:</span> ${{ secrets.DOCKERHUB_USERNAME }}
<span class="hljs-attr"> password:</span> ${{ secrets.DOCKERHUB_PASSWORD }}
<span class="hljs-attr"> - name:</span> Build and push Docker image
<span class="hljs-attr"> env:</span>
<span class="hljs-attr"> DOCKER_REPOSITORY:</span> ${{ secrets.DOCKERHUB_USERNAME }}/prefect:mta-de<span class="hljs-bullet">-101</span>
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
docker buildx create --use
docker buildx build --push --platform linux/amd64,linux/arm64 -t $DOCKER_REPOSITORY .</span></span>
</code></pre>
<h2 id="low-code-data-pipeline">Low-Code Data Pipeline</h2>
<p>After learning about a code-centric pipeline, we can transition into a low-code approach, which marks a significant evolution in the way data engineering projects are implemented. In the code-centric approach, engineers create and manage every aspect of the pipeline through code, providing maximum flexibility and control. On the other hand, the low-code approach, exemplified by platforms like Azure Data Factory, empowers data engineers to design and orchestrate pipelines with visual interfaces and pre-built components. This results in faster development and a more streamlined pipeline creation process. The low-code approach is especially beneficial for less experienced developers or projects where speed and simplicity are essential.</p>
<h3 id="pipeline-with-azure-data-factory">Pipeline with Azure Data Factory</h3>
<blockquote>
<p>π <a href="https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-azure-cli">Setup an Azure Data Factory Resource</a></p>
</blockquote>
<p>To show a low-code approach, we will write our data pipeline using Azure Data Factory. Following a similar approach, we can design an efficient data ingestion process that involves compressing and copying CSV files to Blob storage. The pipeline consists of two essential steps to streamline the process.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-azure-data-factory.png" alt="ozkary-data-engineering-azure-data-factory" title="Data Engineering Process Fundamentals- Azure Data Factory"></p>
<ul>
<li><p>Set Pipeline Variable - To ensure proper file naming, we use a code snippet to dynamically set a pipeline variable with today's date in the format "yymmdd.txt". This allows us to create a file name for a specific drop date. This variable is then used by the Copy Data activity.</p>
</li>
<li><p>Copy with Compression - We initiate a data copy action from the website "<a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a>". This action has a source configuration where we can define the file to download dynamically. There is also a destination configuration, which links to our Blob storage and has a setting to compress and parse the CSV file. As the data is copied, the CSV file is compressed into the GZ format, optimizing storage and reducing costs. The compressed file is then stored in the designated Blob container in our Data Lake.</p>
</li>
</ul>
<p>By implementing this data pipeline, we achieve a seamless and automated data ingestion process, ensuring that data is efficiently transferred and stored in a cost-effective manner. The platform also manages all the orchestration concerns like monitoring, scheduling, logging, and integration. We should also note that this is a third-party managed service, and there is a cost based on resource usage. Depending on the project, this cost could be lower than the coding effort or higher compared to the code-centric approach.</p>
<h2 id="summary">Summary</h2>
<p>For our code-centric approach, we used Python to code each step of the pipeline to meet our specific requirements. Python allows us to create custom tasks and workflows, providing flexibility and control over the pipeline process. We deploy our pipeline within Docker containers, ensuring consistency across different environments. This facilitates seamless deployment and scalability, making it easier to manage the pipeline as it grows in complexity and volume.</p>
<p>For the pipeline orchestration, we are using the power of cloud technologies to host our code for deployments and execution, log the telemetry data to track the performance and health of the process, schedule and monitor our deployments to manage our operational concerns. </p>
<p>While the code-centric approach offers more granular control, it also demands more development and DevOps activities. On the other hand, a low-code approach, like Azure Data Factory, abstracts some complexity, making it faster and simpler to set up data pipelines.</p>
<p>The choice between a code-centric and low-code approach depends on the team's expertise, project requirements, and long-term goals. Python, combined with Docker and CI/CD, empowers data engineers to create sophisticated pipelines, while platforms like Azure Data Factory offer a faster and more accessible solution for specific use cases.</p>
<h2 id="next-step">Next Step</h2>
<p>Having successfully established a robust data pipeline and data orchestration, it is now time to embark on the next phase of our data engineering process: the design and implementation of a data warehouse.
</p>
<blockquote>
<p>π <a href="https://www.ozkary.dev/data-engineering-process-fundamentals-data-warehouse-transformation/" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation by ozkary">Data Engineering Process Fundamentals - Data Warehouse and Transformation</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-85830960650247131552023-05-20T14:08:00.034-04:002023-07-27T11:35:07.035-04:00Data Engineering Process Fundamentals - Pipeline and Orchestration<p>After completing the Design and Planning phase in the data engineering process, we can transition into the implementation and orchestration of our data pipeline. For this step, it is important to have a clear understanding on what is the implementation and orchestration effort, as well as what are the programming languages and tooling that are available to enable us to complete those efforts. </p>
<p>It is also important to understand some of the operational requirements, so we can choose the correct platform to help us deliver on those requirements. Additionally, this is the time to leverage the cloud resources we have provisioned to support an operational pipeline. But before we get deep into those concepts, let's review some background information: what exactly is a pipeline, and how can it be implemented and executed with orchestration? </p>
<p><img alt="ozkary-data-engineering-design-planning" height="578" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration.png" title="Data Engineering Process Fundamentals - Pipeline and Orchestration" width="640" /></p>
<h2 id="data-pipelines">Data Pipelines</h2>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of Python code and hardware resources used by the hosting environment. For the ELT process, the transformation may be done using SQL code and the data warehouse resources, which often perform well for big data scenarios.</p>
<h3 id="pipeline-implementation">Pipeline Implementation</h3>
<p>The implementation of a pipeline refers to the building and/or coding of each task in the pipeline. A task can be implemented using a programming language like Python or SQL. It can also be implemented using a no-code or low-code tool, which provides a visual interface that allows the engineer to connect to Web services, databases, data lakes and other sources that provide access via API. The choice of technology depends on the skill set of the engineering team and a cost analysis of the tools under consideration. Let's compare some of these options in more detail:</p>
<ul>
<li><p>Python is a versatile programming language widely used in data engineering. It offers robust libraries and frameworks like Apache Airflow, Apache Beam, and Pandas that provide powerful capabilities for building and managing data pipelines. With Python, we have granular control over pipeline logic, allowing for complex transformations and custom data processing. It is ideal for handling diverse data sources and implementing advanced data integration scenarios. Even in some low-code scenarios, Python is used to build components that handle special transformations or logic that may not be available out of the box in the low-code tool.</p>
</li>
<li><p>SQL (Structured Query Language) is a standard language for interacting with relational databases. Many data pipeline frameworks, such as Apache NiFi and Azure Data Factory, offer SQL-based transformation capabilities. SQL allows for declarative and set-based operations, making it efficient for querying and transforming structured data. It is well-suited for scenarios where the data transformations align closely with SQL operations and can be expressed concisely.</p>
</li>
<li><p>Low-code tools, such as Azure Logic Apps, Power Platform Automate, provide a visual interface for designing and orchestrating data pipelines. They offer a no-code or low-code approach, making it easier for non-technical users to build pipelines with drag-and-drop functionality. These tools abstract the underlying complexity, enabling faster development and easier maintenance. Low-code tools are beneficial when simplicity, speed, and ease of use are prioritized over fine-grained control and advanced data processing capabilities.</p>
</li>
</ul>
<p>The choice between Python, SQL, or low-code tools depends on specific project requirements, team skills, and the complexity of the data processing tasks. Python offers flexibility and control, SQL excels in structured data scenarios, while low-code tools provide rapid development and simplicity.</p>
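<p>To make the code-centric option a bit more concrete, here is a small, hypothetical sketch (not tied to any specific project) of a Python task that reads a CSV file with pandas, applies a simple transformation, and writes the result as a compressed file, the kind of unit of work a pipeline framework would schedule and monitor:</p>
<pre><code class="lang-python">import pandas as pd

def transform_csv(source_url: str, target_path: str) -> bool:
    """Minimal example of a code-centric pipeline task."""
    try:
        # extract: read the source file into a data frame
        df = pd.read_csv(source_url)
        # transform: normalize the column names
        df = df.rename(columns=lambda col: col.strip().replace(' ', '_').lower())
        # load: write the result as a compressed CSV file
        df.to_csv(target_path, compression='gzip', index=False)
        return True
    except Exception as ex:
        print(f'transform_csv failed: {ex}')
        return False
</code></pre>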
<h3 id="pipeline-orchestration">Pipeline Orchestration</h3>
<p>Pipeline orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
<p>Similar to the implementation effort, there are several options for the orchestration approach. There are code-centric, low-code and no-code platforms. Let's take a look at some of those options.</p>
<h4 id="orchestration-tooling">Orchestration Tooling</h4>
<p>When it comes to orchestrating data pipelines, there are several options available. </p>
<ul>
<li><p>One popular choice is <a href="https://airflow.apache.org/">Apache Airflow</a>, an open-source platform that provides workflow automation, task scheduling, and monitoring capabilities. With Airflow, engineers can define complex workflows using Python code, allowing for flexibility and customization. Apache Airflow requires an active service or process to be running. It operates as a centralized service that manages and schedules workflows. A minimal DAG sketch is shown after this list.</p>
</li>
<li><p><a href="https://spark.apache.org/">Apache Spark</a> can be a good choice for batch processing tasks that involve calling APIs and downloading files using Python. Spark provides a distributed processing framework that can handle large-scale data processing and analysis efficiently. Spark provides a Python API (PySpark) that allows you to write Spark applications using Python. Spark runs as a distributed processing engine that provides high-performance data processing capabilities. To use Spark for data pipeline processing, we need to set up and run a Spark cluster or Spark standalone server.</p>
</li>
<li><p>For those who prefer a code-centric approach, frameworks like <a href="https://www.prefect.io/">Prefect</a> can be a good choice. Prefect is an open-source workflow management system that allows us to define and manage data pipelines as code. It provides a Python-native API for building workflows, allowing for version control, testing, and collaboration in addition to the monitoring and reporting capabilities. Prefect requires an agent to be running in order to execute scheduled jobs. The agent acts as the workflow engine that coordinates the execution of tasks and manages the scheduling and orchestration of workflows.</p>
</li>
<li><p>For low-code and no-code efforts, <a href="https://azure.microsoft.com/en-us/products/data-factory/">Azure Data Factory</a> is a cloud-based data integration service provided by Microsoft. It offers a visual interface for building and orchestrating data pipelines, making it suitable for users with less coding experience. Data Factory supports a wide range of data sources and provides features like data movement, transformation, and scheduling. It also integrates well with other Azure services, enabling seamless data integration within the Microsoft ecosystem.</p>
</li>
</ul>
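<p>To illustrate the code-centric orchestration style, the sketch below shows a minimal Apache Airflow DAG with two Python tasks chained together. The DAG id, schedule, and task logic are hypothetical and only meant to show the shape of a workflow defined as code:</p>
<pre><code class="lang-python">from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data():
    # call an API or download a file here
    print("extracting data")

def load_data():
    # write the results to the data lake here
    print("loading data")

with DAG(
    dag_id="web_to_data_lake",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 9 * * 6",  # run on Saturdays at 9am
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_data", python_callable=extract_data)
    load_task = PythonOperator(task_id="load_data", python_callable=load_data)
    extract_task >> load_task
</code></pre>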
<p>When comparing these options, it's essential to consider factors like ease of use, scalability, extensibility, and integration with other tools and systems.</p>
<h4 id="orchestration-operations">Orchestration Operations</h4>
<p>In addition to the technical skill set requirements, there is an operational requirement that should be highly considered. Important aspects include automation and monitoring:</p>
<ul>
<li><p>Automation allows us to streamline and automate repetitive tasks, ensures consistent execution of tasks and workflows, thus eliminating the human factor, and enables us to scale up or down the data workflows based on demand. </p>
</li>
<li><p>Monitoring plays a critical role in identifying issues, errors, or bottlenecks in data pipelines. We can also gather insights into the performance of the data pipelines. This information helps identify areas for improvement, optimize resource utilization, and enhance overall pipeline efficiency. </p>
</li>
</ul>
<p>Automation and monitoring contribute to compliance and governance requirements. By tracking and documenting data lineage, monitoring data quality, and implementing data governance policies, engineers can ensure regulatory compliance and maintain data integrity and security.</p>
<h2 id="cloud-resources">Cloud Resources</h2>
<p>When it comes to cloud resources, there are often two components that play a significant role in this process: a Virtual Machine (VM) and the Data Lake.</p>
<p><img alt="ozkary-data-engineering-design-planning" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-orchestration-flow.png" title="Data Engineering Process Fundamentals - Orchestration Flow" /></p>
<ul>
<li><p>A Virtual Machine (VM) serves as the compute power for the pipelines. It is responsible for executing the pipeline workflows and managing the overall orchestration. It provides the computational resources needed to process and transform data, ensuring the smooth execution of data pipeline tasks. The code executed on this resource is often running on Docker containers, which enables the use of automated deployments when code changes become available. In addition, containers can be deployed on Kubernetes clusters to support high availability and automated management use cases.</p>
</li>
<li><p>A Data Lake acts as a central repository for storing vast amounts of raw and unprocessed data. It offers a scalable and cost-effective solution for capturing and storing data from various sources. The Data Lake allows for easy integration and flexibility to support evolving data requirements. There are also data retention policies that can be implemented to manage old files.</p>
</li>
</ul>
<p>Together, a VM and Data Lake are the backbone of the data pipeline infrastructure. They enable efficient data processing, facilitate data integration, and lay the foundation for seamless data analysis and visualization. By leveraging these components, we can stage the data flow into other resources like a data warehouse, which in turn enables the analytical process.</p>
<h2 id="summary">Summary</h2>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake cloud resources, which we can also automate with Terraform. By selecting the appropriate programming language and orchestration tools, we can construct resilient pipelines capable of scaling and meeting evolving data demands effectively.</p>
<h2 id="exercise-data-pipeline-and-orchestration">Exercise - Data Pipeline and Orchestration</h2>
<p>Now that we understand the concepts of a pipeline and its orchestration, we should dive into a hands-on exercise in which we can implement a pipeline to extract CSV data from a source and send it to our data lake.</p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/05/data-engineering-process-fundamentals-pipeline-orchestration-exercise.html">Data Engineering Process Fundamentals - Pipeline and Orchestration Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-6044111302674273452023-05-06T10:15:00.028-04:002023-06-29T15:49:54.455-04:00AI Engineering Generate Code from User Stories<h1 id="ai-engineering-introduction">Introduction</h1>
<p>A Large Language Model (LLM) is a class of AI models designed to understand and generate human-like text based on large amounts of training data. LLMs can play a significant role in generating code by leveraging their language understanding and generative capabilities. Users can simply create text prompts in a user story format and provide enough context, such as technologies, requirements, and technical specifications, and the model can generate code snippets that match what is requested by the prompt.</p>
<blockquote>
<p>π Note that LLM-generated code may not always be perfect, and developers should manually review and validate the generated code to ensure its correctness, efficiency, and adherence to coding standards.</p>
</blockquote>
<h2>Benefits of LLM for code generation</h2>
<p>When it comes to generating code, LLMs can be used in various ways:</p>
<ul>
<li>Code completion: LLMs can assist developers by suggesting code completions as they write code. By providing a partial code snippet or a description of the desired functionality, developers can prompt the LLM to generate the remaining code, saving time and reducing manual effort.</li>
<li>Code synthesis: LLMs can synthesize code based on high-level descriptions or requirements. Developers can provide natural language specifications or user stories, and the LLM can generate code that implements the desired functionality. This can be particularly useful in the early stages of development or when exploring different approaches to solve a problem.</li>
<li>Code refactoring: LLMs can help with code refactoring by suggesting improvements or alternative implementations. By analyzing existing code snippets or code bases, LLMs can generate refactored versions that follow best practices, optimize performance, or enhance readability.</li>
<li>Documentation generation: LLMs can assist in generating code documentation or comments. Developers can provide descriptions of functions, classes, or code blocks, and the LLM can generate the corresponding documentation or comments that describe the code's purpose and usage.</li>
<li>Code translation: LLMs can be utilized for translating code snippets between different programming languages. By providing a code snippet in one programming language, developers can prompt the LLM to generate the equivalent code in another programming language.</li>
</ul>
<h2 id="prompt-engineering">Prompt Engineering</h2>
<p>Prompt engineering is the process of designing and optimizing prompts to better utilize LLMs. Well-described prompts can help the AI models better understand the context and generate more accurate responses. It is also helpful to provide some labels or expected results as examples, as this helps the models evaluate their responses and provide more accurate results.</p>
<h3 id="improve-code-completeness">Improve Code Completeness</h3>
<p>Due to API configuration limits, such as the maximum token setting, the generated code may be incomplete. To improve the completeness of the generated code when using the API, you can experiment with the following (a minimal API call sketch follows this list):</p>
<ul>
<li>Simplify or shorten your prompt to ensure it fits within the token limit</li>
<li>Split your prompt into multiple API calls if it exceeds the response length limit</li>
<li>Try refining and iterating on your prompt to provide clearer instructions to the model</li>
<li>Experiment with different temperature and max tokens settings to influence the output</li>
</ul>
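<p>As a minimal sketch of how those settings are applied, the code below calls the OpenAI Completion API with explicit temperature and max_tokens values. The environment variable, engine/deployment name, prompt, and parameter values are only examples and should be adjusted to your own account; when using an Azure OpenAI resource, the API base, type, and version also need to be configured:</p>
<pre><code class="lang-python">import os
import openai

# assumes the API key is available as an environment variable
openai.api_key = os.environ.get("OPENAI_API_KEY")

def generate_code(prompt: str) -> str:
    # request a completion with explicit output controls
    response = openai.Completion.create(
        engine="code-davinci-002",   # model or Azure deployment name
        prompt=prompt,
        temperature=0.2,             # lower values produce more deterministic code
        max_tokens=1024,             # caps the length of the generated output
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].text

print(generate_code("Write a Python function that adds two numbers."))
</code></pre>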
<h2 id="generate-code-from-user-stories">Generate code from user stories</h2>
<p><img alt="ozkary-openai-generate-code-from-user-stories" src="//www.ozkary.dev/assets/2023/ozkary-openai-user-story-flow.png" title="Generate Code from User Stories" /></p>
<p>In the Agile development methodology, user stories are used to capture a requirement or feature from the perspective of the end user or customer. For code generation, developers can write user stories to capture the context, requirements, and technical specifications necessary to generate code. This user story can then be processed by the LLM to generate the code. As an example, a user story could be written in this way:</p>
<pre><code>
As <span class="hljs-keyword">a</span> data scientist, I want <span class="hljs-built_in">to</span> generate code <span class="hljs-keyword">using</span> <span class="hljs-keyword">the</span> following technologies, requirements, <span class="hljs-keyword">and</span> specifications:
Technologies:
- Python
Requirements:
- Transform <span class="hljs-keyword">a</span> data frame <span class="hljs-keyword">by</span> consolidating <span class="hljs-keyword">the</span> <span class="hljs-string">'date'</span> <span class="hljs-keyword">and</span> <span class="hljs-string">'time'</span> columns <span class="hljs-keyword">into</span> <span class="hljs-keyword">a</span> <span class="hljs-built_in">date</span> <span class="hljs-built_in">time</span> field column named <span class="hljs-string">'created'</span>.
Specifications:
- Create <span class="hljs-keyword">a</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">with</span> <span class="hljs-title">the</span> <span class="hljs-title">name</span> <span class="hljs-string">'transform_data'</span></span>
- Use pandas <span class="hljs-built_in">to</span> perform <span class="hljs-keyword">the</span> data transformation
- Load <span class="hljs-keyword">the</span> data <span class="hljs-built_in">from</span> <span class="hljs-keyword">a</span> parameter <span class="hljs-keyword">with</span> <span class="hljs-keyword">the</span> CSV <span class="hljs-built_in">file</span> path
- Save <span class="hljs-keyword">the</span> resulting data frame <span class="hljs-built_in">to</span> disk <span class="hljs-keyword">in</span> Parquet <span class="hljs-built_in">format</span>
- Return <span class="hljs-literal">true</span> <span class="hljs-keyword">if</span> successful <span class="hljs-keyword">or</span> <span class="hljs-literal">false</span> <span class="hljs-keyword">if</span> is <span class="hljs-keyword">not</span>
- For coding standards, use <span class="hljs-keyword">the</span> guidelines outlined <span class="hljs-keyword">in</span> <span class="hljs-keyword">the</span> [PEP <span class="hljs-number">8</span> style guide <span class="hljs-keyword">for</span> Python](<span class="hljs-keyword">https</span>://www.python.org/dev/peps/pep<span class="hljs-number">-0008</span>/).
- Create <span class="hljs-keyword">a</span> unit test <span class="hljs-keyword">for</span> <span class="hljs-keyword">the</span> `transform_data` <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">to</span> <span class="hljs-title">verify</span> <span class="hljs-title">its</span> <span class="hljs-title">correctness</span>. <span class="hljs-title">The</span> <span class="hljs-title">unit</span> <span class="hljs-title">test</span> <span class="hljs-title">should</span> <span class="hljs-title">cover</span> <span class="hljs-title">different</span> <span class="hljs-title">scenarios</span> <span class="hljs-title">and</span> <span class="hljs-title">assert</span> <span class="hljs-title">the</span> <span class="hljs-title">expected</span> <span class="hljs-title">behavior</span> <span class="hljs-title">of</span> <span class="hljs-title">the</span> <span class="hljs-title">function</span>.</span>
</code></pre><p>From this user story, we can see that the end user of the story is a data scientist who would like to generate Python code. The requirement is basically to transform two columns from a data frame into a single date-time column. There are also additional technical specifications that provide more context to the AI model, so it is able to generate the code with the specifics, including which coding standards to use.</p>
<p>Let's take a look at a Python example to see how we can generate code from a user story:</p>
<h3 id="install-the-openai-dependencies">Install the OpenAI dependencies</h3>
<pre><code>$ pip <span class="hljs-keyword">install</span> openai
</code></pre><h3 id="add-the-openai-environment-configurations">Add the OpenAI environment configurations</h3>
<p>Get the following configuration information:</p>
<blockquote>
<p>π This example can run directly on the OpenAI APIs or it can use a custom Azure OpenAI resource.</p>
</blockquote>
<ul>
<li>GitHub Repo API Token with write permissions to push comments to an issue</li>
<li>Get an OpenAI API key</li>
<li>If you are using an Azure OpenAI resource, get your custom end-point and deployment<ul>
<li>The deployment should have the code-davinci-002 model</li>
</ul>
</li>
</ul>
<p>Set the linux environment variables with these commands:</p>
<pre><code class="lang-bash">$ echo export AZURE_OpenAI_KEY=<span class="hljs-string">"OpenAI-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export GITHUB_TOKEN=<span class="hljs-string">"github-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
<span class="hljs-comment"># only set these variables when using your Azure OpenAI resource</span>
$ echo export AZURE_OpenAI_DEPLOYMENT=<span class="hljs-string">"deployment-name"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_ENDPOINT=<span class="hljs-string">"https://YOUR-END-POINT.OpenAI.azure.com/"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
</code></pre><h3 id="describe-the-code">Describe the code</h3>
<p> The code should run this workflow: (see the diagram for a visual reference)</p>
<ul>
<li>1 Get a list of open GitHub issues with the label user-story</li>
<li>2 Each issue content is sent to the OpenAI API to generate the code</li>
<li>3 The generated code is posted as a comment on the user-story for the developers to review</li>
</ul>
<blockquote>
<p>π The following code is a simple implementation for the GitHub and OpenAI APIs. Use the code from this GitHub repo: <a title="ozkary LLM code generation" href="https://github.com/ozkary/ai-engineering/tree/main/python/code_generation">LLM Code Generation</a></p>
</blockquote>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_issue_by_label</span><span class="hljs-params">(repo: str, label: str)</span>:</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># get the issues from the repo</span>
issues = GitHubService.get_issues(repo=repo, label=label, access_token=github_token)
<span class="hljs-keyword">if</span> issues <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
<span class="hljs-keyword">for</span> issue <span class="hljs-keyword">in</span> issues:
<span class="hljs-comment"># Generate code using OpenAI </span>
print(f<span class="hljs-string">'Generating code from GitHub issue: {issue.title}'</span>)
openai_service = OpenAIService(api_key=openai_api_key, engine=openai_api_deployment, end_point=openai_api_base)
generated_code = openai_service.create(issue.body)
<span class="hljs-keyword">if</span> generated_code <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
<span class="hljs-comment"># Post a comment with the generated code to the GitHub issue</span>
comment = f<span class="hljs-string">'Generated code:\n\n```{generated_code}\n```'</span>
comment_posted = GitHubService.post_issue_comment(repo, issue.id, comment, access_token=github_token)
<span class="hljs-keyword">if</span> comment_posted:
print(<span class="hljs-string">'Code generated and posted as a comment on the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to post the comment on the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to generate code from the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to retrieve the GitHub issue.'</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error: {ex}"</span>)
</code></pre><p>The OpenAI service handles the API details. It takes default parameters for the model (engine), temperature, and token limits, which control the cost and the amount of text allowed (a token is roughly four characters). For this service, we use the "Completion" endpoint, which allows developers to interact with OpenAI's language models and generate text-based completions.</p>
<p>Other parameters include (a short example call follows the list):</p>
<ul>
<li><p>Temperature: Controls the randomness of the model's output. Higher values like 0.8 make the output more diverse and creative, while lower values like 0.2 make it more focused and deterministic.</p>
</li>
<li><p>Max Tokens: Specifies the maximum length of the response generated by the model, measured in tokens. It helps to limit the length of the output and prevent excessively long responses.</p>
</li>
<li><p>Top-p (Nucleus Sampling): Limits the next-token choices to the smallest set of tokens whose cumulative probability reaches the threshold p. It helps control the diversity of the generated output.</p>
</li>
<li><p>Frequency Penalty: A parameter used to discourage repetitive or redundant responses by penalizing the model for repeating the same tokens too often.</p>
</li>
<li><p>Presence Penalty: Penalizes tokens that have already appeared in the output, encouraging the model to introduce new words and topics. A higher value, such as 0.6, reduces repetition in the response.</p>
</li>
<li><p>Stop Sequences: Optional tokens or phrases that can be specified to guide the model to stop generating output. It can be used to control the length of the response or prevent the model from continuing beyond a certain point.</p>
</li>
</ul>
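<p>As a point of reference, this is how those parameters might be passed to the legacy Completion endpoint. The values are illustrative only; the service class shown next wraps this same call with defaults.</p>
<pre><code class="lang-python">import openai

# illustrative values; tune per use case
response = openai.Completion.create(
    engine='code-davinci-002',    # model or Azure deployment name
    prompt='Write a Python function that adds two numbers.',
    temperature=0.2,              # low randomness works well for code generation
    max_tokens=350,               # cap the length of the response
    top_p=0.95,                   # nucleus sampling threshold
    frequency_penalty=0.0,        # discourage repeated tokens
    presence_penalty=0.0,         # encourage new tokens and topics
    stop=None                     # optional stop sequences
)
print(response.choices[0].text.strip())
</code></pre>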
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OpenAIService</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, <span class="hljs-symbol">api_key:</span> str, <span class="hljs-symbol">engine:</span> str = <span class="hljs-string">'code-davinci-002'</span>, <span class="hljs-symbol">end_point:</span> str = None, <span class="hljs-symbol">temperature:</span> float = <span class="hljs-number">0</span>.<span class="hljs-number">5</span>, <span class="hljs-symbol">max_tokens:</span> int = <span class="hljs-number">350</span>, <span class="hljs-symbol">n:</span> int = <span class="hljs-number">1</span>, <span class="hljs-symbol">stop:</span> str = None)</span></span>:
openai.api_key = api_key
<span class="hljs-comment"># Azure OpenAI API custom resource</span>
<span class="hljs-comment"># Use these settings only when using a custom endpoint like https://ozkary.openai.azure.com </span>
<span class="hljs-keyword">if</span> end_point is <span class="hljs-keyword">not</span> <span class="hljs-symbol">None:</span>
openai.api_base = end_point
openai.api_type = <span class="hljs-string">'azure'</span>
openai.api_version = <span class="hljs-string">'2023-05-15'</span> <span class="hljs-comment"># this will change as the API evolves</span>
<span class="hljs-keyword">self</span>.engine = engine
<span class="hljs-keyword">self</span>.temperature = temperature
<span class="hljs-keyword">self</span>.max_tokens = max_tokens
<span class="hljs-keyword">self</span>.n = n
<span class="hljs-keyword">self</span>.stop = stop
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, <span class="hljs-symbol">prompt:</span> str)</span></span> -> <span class="hljs-symbol">str:</span>
<span class="hljs-string">""</span><span class="hljs-string">"Create a completion using the OpenAI API."</span><span class="hljs-string">""</span>
response = openai.Completion.create(
engine=<span class="hljs-keyword">self</span>.engine,
prompt=prompt,
max_tokens=<span class="hljs-keyword">self</span>.max_tokens,
n=<span class="hljs-keyword">self</span>.n,
stop=<span class="hljs-keyword">self</span>.stop,
temperature=<span class="hljs-keyword">self</span>.temperature
)
print(response)
<span class="hljs-keyword">return</span> response.choices[<span class="hljs-number">0</span>].text.strip()
</code></pre><h3 id="run-the-code">Run the code</h3>
<p>After configuring your environment and downloading the code, we can run the code from a terminal by typing the following command:</p>
<blockquote>
<p>π Make sure to enter your repo name and label your issues with either user-story or any other label you would rather use.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-comment">#</span> <span class="hljs-comment">python3</span> <span class="hljs-comment">gen_code_from_issue</span><span class="hljs-string">.</span><span class="hljs-comment">py</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">repo</span> <span class="hljs-comment">ozkary/ai</span><span class="hljs-literal">-</span><span class="hljs-comment">engineering</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">label</span> <span class="hljs-comment">user</span><span class="hljs-literal">-</span><span class="hljs-comment">story</span>
</code></pre><p>After running the code successfully, we should be able to see the generated code as a comment on the GitHub issue.</p>
<p><img alt="ozkary-ai-engineering-generate-code-from-user-stories" src="//www.ozkary.dev/assets/2023/ozkary-ai-engineering-code-generated.png" title="Generate Code from User Stories Back to GitHub" /></p>
<h3 id="summary">Summary</h3>
<p>LLMs play a crucial role in code generation by harnessing their language understanding and generative capabilities. Developers, data engineers, and scientists can use AI models to quickly generate scripts in various programming languages, streamlining their programming tasks. Moreover, documenting user stories with detailed specifications is an integral part of the brainstorming process before writing a single line of code.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-26183705534802228482023-04-22T12:40:00.007-04:002023-07-21T11:43:43.405-04:00Data Engineering Process Fundamentals - Design and Planning Exercise<p>Having laid a strong design foundation, it's time to embark on a hands-on exercise that's crucial to our data engineering project's success. Our immediate focus is on building the essential cloud resources that will serve as the backbone for our data pipelines, data lake, and data warehouse. Taking a cloud-agnostic approach ensures our implementation remains flexible and adaptable across different cloud providers, enabling us to leverage the advantages of multiple platforms or switch providers seamlessly if required. By completing this step, we set the stage for efficient and effective coding of our solutions. Let's get started on this vital infrastructure-building journey.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="ozkary-data-engineering-design-planning-docker-terraform" title="Data Engineering Process Fundamentals- Design and Planning Docker Terraform"></p>
<h2 id="cloud-infrastructure-planning">Cloud Infrastructure Planning</h2>
<p>Infrastructure planning is a critical aspect of every technical project, laying the foundation for successful project delivery. In the case of a Data Engineering project, it becomes even more crucial. To support our project's objectives, we need to carefully consider and provision specific resources:</p>
<ul>
<li>VM instance: This serves as the backbone for hosting our data pipelines and orchestration, ensuring efficient execution of our data workflows.</li>
<li>Data Lake: A vital component for storing various data formats, such as CSV or Parquet files, in a scalable and flexible manner.</li>
<li>Data Warehouse: An essential resource that hosts transformed and curated data, providing a high-performance environment for easy access by visualization tools.</li>
</ul>
<p>Infrastructure automation, facilitated by tools like Terraform, is important in modern data engineering projects. It enables the provisioning and management of cloud resources, such as virtual machines and storage, in a consistent and reproducible manner. Infrastructure as Code (IaC) allows teams to define their infrastructure declaratively, track it in source control, version it, and apply changes as needed. Automation reduces manual efforts, ensures consistency, and enables infrastructure to be treated as a code artifact, improving reliability and scalability.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-terraform.png" alt="ozkary-data-engineering-terraform" title="Data Engineering Process - Terraform"></p>
<h1 id="infrastructure-implementation-requirements">Infrastructure Implementation Requirements</h1>
<p>When using Terraform with any cloud provider, there are several key artifacts and considerations to keep in mind for successful configuration and security. Terraform needs access to the cloud account where it can provision the resources. The account token or credentials can vary based on your cloud configuration. For our purpose, we will use a GCP (Google) project to build our resources, but first we need to install the Terraform dependencies for the development environment.</p>
<h2 id="install-terraform">Install Terraform</h2>
<p>To install Terraform, open a bash terminal and run the commands below:</p>
<ul>
<li>Download the package file</li>
<li>Unzip the file</li>
<li>Copy the Terraform binary from the extracted file to the /usr/bin/ directory</li>
<li>Verify the version</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>wget <span class="hljs-symbol">https:</span>/<span class="hljs-regexp">/releases.hashicorp.com/terraform</span><span class="hljs-regexp">/1.3.7/terraform</span>_1.<span class="hljs-number">3.7_</span>linux_amd64.zip
<span class="hljs-variable">$ </span>unzip terraform_1.<span class="hljs-number">1.2_</span>linux_amd64.zip
<span class="hljs-variable">$ </span>cp terraform /usr/bin/
<span class="hljs-variable">$ </span>terraform -v
</code></pre>
<p>We should get an output similar to this:</p>
<pre><code class="lang-bash">Terraform v1<span class="hljs-number">.3</span><span class="hljs-number">.7</span>
<span class="hljs-keyword">on</span> linux_amd64
</code></pre>
<h2 id="configure-a-cloud-account">Configure a Cloud Account</h2>
<h3 id="create-a-google-account-here-https-cloud-google-com-">Create a Google account. <a href="https://cloud.google.com/">Here</a></h3>
<ul>
<li>Create a new project</li>
<li>Make sure to keep track of the Project ID and the location for your project</li>
</ul>
<h3 id="create-a-service-account">Create a service account</h3>
<ul>
<li>In the left-hand menu, click on "IAM & Admin" and then select "Service accounts"</li>
<li>Click on the "Create Service Account" button at the top of the page</li>
<li>Enter a name for the service account and an optional description</li>
<li>Add the BigQuery Admin, Storage Admin, and Storage Object Admin roles to the service account, then click the save button.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-gcp-roles.png" alt="ozkary gcp roles"></p>
<ul>
<li>Enable IAM APIs by clicking the following links:<ul>
<li><a href="https://console.cloud.google.com/apis/library/iam.googleapis.com">IAM-API</a></li>
<li><a href="https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com">IAM-credential-API</a></li>
</ul>
</li>
</ul>
<h3 id="authenticate-the-vm-or-local-environment-with-gcp">Authenticate the VM or Local environment with GCP</h3>
<ul>
<li>In the left navigation menu (GCP), click on "IAM & Admin" and then "Service accounts"</li>
<li>Click on the three vertical dots under the action section for the service account you just created</li>
<li>Then click Manage keys, Add key, Create new key. Select JSON option and click Create</li>
<li>Move the key file to a system folder</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>mkdir -p <span class="hljs-variable">$HOME</span>/.gcp/
<span class="hljs-variable">$ </span>mv ~<span class="hljs-regexp">/Downloads/</span>{xxxxxx}.json ~<span class="hljs-regexp">/.gcp/</span>{acc_credentials}.json
</code></pre>
<ul>
<li>install the SDK and CLI Tools<ul>
<li><a href="https://cloud.google.com/sdk/docs/install-sdk" target="_new">Follow the instruction here</a></li>
</ul>
</li>
<li>Validate the installation and login to GCP with the following commands<pre><code>$ echo 'export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/{acc_credentials}.json"' >> ~/.bashrc
$ export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/{acc_credentials}.json"
$ gcloud auth application-<span class="hljs-keyword">default</span> login
</code></pre></li>
<li>Follow the login process on the browser; a quick Python check of the credentials is shown below</li>
</ul>
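<p>With the credentials in place, a quick way to confirm that the service account and roles work is to list the project's storage buckets from Python. This assumes the google-cloud-storage package is installed; replace the project id with yours.</p>
<pre><code class="lang-python">from google.cloud import storage

# GOOGLE_APPLICATION_CREDENTIALS must point to the service account key file
client = storage.Client(project='your-gcp-project-id')

# an empty list is fine on a new project; an authorization error means the key or roles need review
for bucket in client.list_buckets():
    print(bucket.name)
</code></pre>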
<h1 id="review-the-code">Review the Code</h1>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step2-Cloud-Infrastructure/terraform" target="_terraform">Clone this repo or copy the files from this folder
</a></p>
<p>Terraform uses declarative configuration files written in a domain-specific language (DSL) called HCL (HashiCorp Configuration Language). It provides a concise and human-readable syntax for defining resources, dependencies, and configurations, enabling us to provision, modify, and destroy infrastructure in a predictable and reproducible manner.</p>
<p>At a minimum, we should define a variables file, which contains the cloud provider information, and a resource file, which defines what resources should be provisioned in the cloud. There could be a file for each resource, or a single file can define multiple resources.</p>
<h2 id="variables-file">Variables File</h2>
<p>The variables file is used to define a set of variables that can be used in the resource file. This allows for the use of those variables in one or more resource configuration files. The format looks as follows:</p>
<pre><code class="lang-terraform">locals {
data_lake_bucket = <span class="hljs-string">"mta_data_lake"</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-string">"project"</span> {
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Enter Your GCP Project ID"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"region"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"us-east1"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"storage_class"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Storage class type for your bucket. Check official docs for more info."</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"STANDARD"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"stg_dataset"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"BigQuery Dataset that raw data (from GCS) will be written to"</span>
type <span class="hljs-comment">= string</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"mta_data"</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"vm_image"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Base image for your Virtual Machine."</span>
type <span class="hljs-comment">= string</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"ubuntu-os-cloud/ubuntu-2004-lts"</span>
}
</code></pre>
<h2 id="resource-file">Resource File</h2>
<p>The resource file is where we define what should be provisioned on the cloud. This is also the file where we need to define the provider specific resources that need to be created.</p>
<pre><code class="lang-terraform">terraform {
<span class="hljs-attr">required_version</span> = <span class="hljs-string">">= 1.0"</span>
backend <span class="hljs-string">"local"</span> {} <span class="hljs-comment"># Can change from "local" to "gcs" (for google) or "s3" (for aws), if you would like to preserve your tf-state online</span>
required_providers {
<span class="hljs-attr">google</span> = {
<span class="hljs-attr">source</span> = <span class="hljs-string">"hashicorp/google"</span>
}
}
}
provider <span class="hljs-string">"google"</span> {
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">region</span> = var.region
// <span class="hljs-attr">credentials</span> = file(var.credentials) <span class="hljs-comment"># Use this if you do not want to set env-var GOOGLE_APPLICATION_CREDENTIALS</span>
}
<span class="hljs-comment"># Data Lake Bucket</span>
<span class="hljs-comment"># Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket</span>
resource <span class="hljs-string">"google_storage_bucket"</span> <span class="hljs-string">"data-lake-bucket"</span> {
<span class="hljs-attr">name</span> = <span class="hljs-string">"<span class="hljs-subst">${local.data_lake_bucket}</span>_<span class="hljs-subst">${var.project}</span>"</span> <span class="hljs-comment"># Concatenating DL bucket & Project name for unique naming</span>
<span class="hljs-attr">location</span> = var.region
<span class="hljs-comment"># Optional, but recommended settings:</span>
<span class="hljs-attr">storage_class</span> = var.storage_class
<span class="hljs-attr">uniform_bucket_level_access</span> = <span class="hljs-literal">true</span>
versioning {
<span class="hljs-attr">enabled</span> = <span class="hljs-literal">true</span>
}
lifecycle_rule {
action {
<span class="hljs-attr">type</span> = <span class="hljs-string">"Delete"</span>
}
condition {
<span class="hljs-attr">age</span> = <span class="hljs-number">15</span> // days
}
}
<span class="hljs-attr">force_destroy</span> = <span class="hljs-literal">true</span>
}
<span class="hljs-comment"># BigQuery data warehouse</span>
<span class="hljs-comment"># Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset</span>
resource <span class="hljs-string">"google_bigquery_dataset"</span> <span class="hljs-string">"stg_dataset"</span> {
<span class="hljs-attr">dataset_id</span> = var.stg_dataset
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">location</span> = var.region
}
<span class="hljs-comment"># VM instance</span>
resource <span class="hljs-string">"google_compute_instance"</span> <span class="hljs-string">"vm_instance"</span> {
<span class="hljs-attr">name</span> = <span class="hljs-string">"mta-instance"</span>
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">machine_type</span> = <span class="hljs-string">"e2-standard-4"</span>
<span class="hljs-attr">zone</span> = var.region
boot_disk {
initialize_params {
<span class="hljs-attr">image</span> = var.vm_image
}
}
network_interface {
<span class="hljs-attr">network</span> = <span class="hljs-string">"default"</span>
access_config {
// Ephemeral public IP
}
}
}
</code></pre>
<p>This Terraform file defines the infrastructure components to be provisioned on the Google Cloud Platform (GCP). It includes the configuration for the Terraform backend, required providers, and the resources to be created.</p>
<ul>
<li>The backend section specifies the backend type as "local" for storing the Terraform state locally. It can be changed to "gcs" or "s3" for cloud storage if desired.</li>
<li>The required_providers block defines the required provider and its source, in this case, the Google Cloud provider.</li>
<li>The provider block configures the Google provider with the project and region specified as variables. It can use either environment variable GOOGLE_APPLICATION_CREDENTIALS or the credentials file defined in the variables.</li>
<li>The resource blocks define the infrastructure resources to be created, such as a Google Storage Bucket for the data lake, Google BigQuery datasets for staging and production, and a Google Compute Engine instance named "mta-instance" with specific configuration settings.</li>
</ul>
<p>Overall, this Terraform file automates the provisioning of a data lake bucket, BigQuery datasets, and a virtual machine instance on the Google Cloud Platform.</p>
<h1 id="how-to-run-it-">How to run it!</h1>
<ul>
<li>Refresh service-account's auth-token for this session</li>
</ul>
<pre><code class="lang-bash">$ gcloud auth application-<span class="hljs-keyword">default</span> login
</code></pre>
<ul>
<li>Set the credentials file on the bash configuration file<ul>
<li>Add the export line and replace filename-here with your file</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-built_in">echo</span> <span class="hljs-built_in">export</span> GOOGLE_APPLICATION_CREDENTIALS=<span class="hljs-string">"<span class="hljs-variable">${HOME}</span>/.gcp/filename-here.json"</span> >> ~/.bashrc && <span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<ul>
<li><p>Open the terraform folder in your project</p>
</li>
<li><p>Initialize state file (.tfstate) by running terraform init</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd ./terraform
<span class="hljs-variable">$ </span>terraform init
</code></pre>
<ul>
<li>Check changes to new infrastructure plan before applying them</li>
</ul>
<p>It is important to always review the plan to make sure no unwanted changes are showing up.</p>
<blockquote>
<p>π Get the project id from your GCP cloud console and replace it on the next command</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-string">$ </span>terraform plan -var=<span class="hljs-comment">"project=<your-gcp-project-id>"</span>
</code></pre>
<ul>
<li>Apply the changes </li>
</ul>
<p>This provisions the resources on the cloud project.</p>
<pre><code class="lang-bash">$ terraform <span class="hljs-built_in">apply</span> -<span class="hljs-built_in">var</span>=<span class="hljs-string">"project=<your-gcp-project-id>"</span>
</code></pre>
<ul>
<li>(Optional) Delete infrastructure after your work, to avoid costs on any running services</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>terraform destroy
</code></pre>
<h4 id="terraform-lifecycle">Terraform Lifecycle</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-Engineering-terraform-lifecycle.png" alt="ozkary-data-engineering-terraform-lifecycle" title="Data Engineering Process - Terraform Lifecycle"></p>
<h3 id="github-action">GitHub Action</h3>
<p>To automate the provisioning of infrastructure with GitHub, we need to store the cloud provider credentials as a GitHub secret. This can be done by following the steps from this link:</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/GitHub-Configure-Secrets-for-Build-Actions">Configure GitHub Secrets</a></p>
</blockquote>
<p>Once the secret has been configured, we can create a build action script with the cloud provider secret information as shown with this GitHub Action workflow YAML file:</p>
<pre><code class="lang-yml">
<span class="hljs-attribute">name</span>: Terraform Deployment
<span class="yaml"><span class="hljs-attr">on:</span>
<span class="hljs-attr"> push:</span>
<span class="hljs-attr"> branches:</span>
<span class="hljs-bullet"> -</span> main
<span class="hljs-attr">jobs:</span>
<span class="hljs-attr"> deploy:</span>
<span class="hljs-attr"> runs-on:</span> ubuntu-latest
<span class="hljs-attr"> steps:</span>
<span class="hljs-attr"> - name:</span> Checkout repository
<span class="hljs-attr"> uses:</span> actions/checkout@v2
<span class="hljs-attr"> - name:</span> Set up Terraform
<span class="hljs-attr"> uses:</span> hashicorp/setup-terraform@v1
<span class="hljs-attr"> - name:</span> Terraform Init
<span class="hljs-attr"> env:</span>
<span class="hljs-attr"> GOOGLE_APPLICATION_CREDENTIALS:</span> ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
cd Step2-Cloud-Infrastructure/terraform
terraform init
</span><span class="hljs-attr"> - name:</span> Terraform Apply
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
cd Step2-Cloud-Infrastructure/terraform
terraform apply -auto-approve</span></span>
</code></pre>
<h1 id="conclusion">Conclusion</h1>
<p>With this exercise, we gain practical experience in using tools like Terraform to automate the provisioning of resources, such as a VM, a data lake, and other components essential to our data engineering system. By following cloud-agnostic practices, we can achieve interoperability and avoid vendor lock-in, ensuring our project remains scalable, cost-effective, and adaptable to future requirements.</p>
<h1 id="next-step">Next Step</h1>
<p>After building our cloud infrastructure, we are now ready to talk about the implementation and orchestration of a data pipeline and review some of the operational requirements that can enable us to make decisions.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Data Pipeline and Orchestration]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-10610907557039081432023-04-15T11:17:00.026-04:002023-07-21T11:36:03.815-04:00Data Engineering Process Fundamentals - Design and Planning<p>Now that we have completed the discovery step and the scope of work on the project is clearly defined, we move on to the design and planning step. The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.</p>
<p><img alt="ozkary-data-engineering-design-planning" height="640" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-design-planning.png" title="Data Engineering Process Fundamentals- Design and Planning" width="576" /></p>
<h3 id="data-engineering-design">Data Engineering Design</h3>
<p>A data engineering design is the actual plan to build the technical solution. It includes the system architecture, data integration, flow and pipeline orchestration, the data storage platform, transformation and management, and data processing and analytics tooling. This is where we need to clearly define the technologies to be used for each of these areas.</p>
<h4 id="system-architecture">System Architecture</h4>
<p>The system architecture is a critical high-level design that encompasses various components and their integration within the solution. This includes data sources, data ingestion resources, workflow and data orchestration frameworks, storage resources, data services for transformation, continuous data ingestion, and validation, as well as data analysis and visualization tools. Properly designing the system architecture ensures a robust and efficient data engineering solution.</p>
<h4 id="data-pipelines">Data Pipelines</h4>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of Python code and the hardware resources used by the hosting environment. In the ELT process, the transformation can be done with SQL code using the data warehouse resources, which often perform well for big data scenarios.</p>
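<p>To make the trade-off concrete, here is a minimal sketch of both patterns. The file, table, and column names are illustrative, and writing Parquet assumes the pyarrow package is available.</p>
<pre><code class="lang-python">import pandas as pd

# ETL: transform in the pipeline code before loading the data into storage
df = pd.read_csv('turnstile_230318.csv')
df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
df.to_parquet('turnstile_230318.parquet')  # load the transformed data

# ELT: load the raw file first, then transform with SQL inside the data warehouse
elt_sql = """
CREATE TABLE rpt_turnstile AS
SELECT STATION,
       CAST(CONCAT(DATE, ' ', TIME) AS TIMESTAMP) AS CREATED,
       ENTRIES,
       EXITS
FROM ext_turnstile;  -- exact date functions depend on the warehouse
"""
</code></pre>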
<h4 id="data-orchestration">Data Orchestration</h4>
<p>Data orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
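<p>As an illustration, an orchestration framework such as Prefect expresses these concerns declaratively. This minimal sketch uses Prefect 2.x syntax to show a named flow with task retries; it is not the project's actual pipeline.</p>
<pre><code class="lang-python">from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract(url: str) -> str:
    # download the weekly CSV file; the task is retried automatically on failure
    ...

@task
def load(file_path: str) -> None:
    # upload the file to the data lake
    ...

@flow(name="mta-weekly-ingest")
def weekly_ingest(url: str) -> None:
    # the orchestrator tracks the state, logs, and failures of each task run
    file_path = extract(url)
    load(file_path)
</code></pre>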
<h3 id="source-control-and-deployment">Source Control and Deployment</h3>
<p>Source control is an essential practice for managing code and configuration files. Utilizing version control systems allows teams to collaborate effectively, track changes, and revert to previous states if necessary. It is important to properly define the tooling that should be used for source control and deployment automation. Source code should include the Python code, Terraform scripts, and Docker files, as well as any deployment automation scripts.</p>
<h4 id="source-control">Source Control</h4>
<p>Version control systems such as Git help track and maintain the source code for our projects. Hosted platforms such as GitHub enable the team to push their code and configuration changes to a remote repository, so the work can be shared with other team members.</p>
<h4 id="continuous-integration-and-continuous-delivery-ci-cd-">Continuous Integration and Continuous Delivery (CI/CD)</h4>
<p> A remote code repository like GitHub also provides deployment automation pipelines that enable us to push changes to other environments for testing and finally production releases. </p>
<p>Continuous Integration (CI) is the practice to continuously integrate the code changes into the central repository, so it can be built and tested to validate the latest change and provide feedback in case of problems. Continuous Deployment (CD) is the practice to automate the deployment of the latest code builds to other environments like staging and production.</p>
<h4 id="docker-containers-and-docker-hub">Docker Containers and Docker Hub</h4>
<p>When deploying a new build, we need to also deploy the environment dependencies to avoid any run-time errors. Docker containers enable us to automate the management of the application by creating a self-contained environment with the build and its dependencies. A data pipeline can be built and imported into a container image, which should contain everything needed for the pipeline to reliably execute.</p>
<p>Docker Hub is a container registry which allows us to push our pipeline images into a cloud repository. The goal is to provide the ability to download those images from the repository as part of the new environment provisioning process.</p>
<h4 id="terraform-for-cloud-infrastructure">Terraform for Cloud Infrastructure</h4>
<p>Terraform is an Infrastructure as Code (IaC) tool that enables us to manage cloud resources across multiple cloud providers. By creating resource definition scripts and tracking them under version control, we can automate the creation, modification and deletion of resources. Terraform tracks the state of the infrastructure, so when changes are made, they can be applied to the environments as part of a CI/CD process. </p>
<p><img alt="ozkary-data-engineering-design-planning-docker-terraform" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" title="Data Engineering Process Fundamentals- Design and Planning Docker Terraform" /></p>
<h2 id="data-analysis-and-visualization-tools">Data Analysis and Visualization Tools</h2>
<p>The selection of an analytical and visualization tool is very important in any data engineering project. Tools like Looker and Power BI, among others, enable businesses to gain insights from their data by visualizing the information on easy-to-read dashboards. By selecting the right tool for the job, organizations can transform complex data into actionable insights, empowering users across the business to uncover valuable information and drive strategic outcomes.</p>
<h2 id="summary">Summary</h2>
<p>The design and planning phase of a data engineering project sets the stage for success. From designing the system architecture and data pipelines to implementing source control, CI/CD, Docker, and infrastructure automation with Terraform, every aspect contributes to efficient and reliable deployment. Infrastructure automation, in particular, plays a critical role by simplifying provisioning of cloud resources, ensuring consistency, and enabling scalability, ultimately leading to a robust and manageable data engineering system. </p>
<h2 id="exercise-infrastructure-planning-and-automation">Exercise - Infrastructure Planning and Automation</h2>
<p>Having established a solid design foundation, it's time to put that into practice with a hands-on exercise. Our objective is to construct the necessary infrastructure that will serve as the hosting environment for our solution. Let's delve into the practical implementation to bring our data engineering project to life.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-design-planning-exercise.html">Data Engineering Process Fundamentals - Design and Planning Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
Send questions or comments on Twitter @ozkary
<p>π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-67262915856193500442023-04-08T15:31:00.060-04:002023-07-14T14:57:01.838-04:00Data Engineering Process Fundamentals - Discovery Exercise<p>In this discovery exercise lab, we review a problem statement and do the analysis to define the scope of work and requirements.</p>
<div class="separator" style="clear: both; display: none; text-align: center;"><a href="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-vscode.png" style="margin-left: 1em; margin-right: 1em;"><img alt="ozkary Data engineering MTA jupyter notebook loaded" border="0" data-original-height="551" data-original-width="800" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-notepbook.png" title="ozkary Data engineering MTA jupyter notebook loaded with vscode" /></a></div>
<h2 id="problem-statement">Problem Statement</h2>
<p>In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. There are millions of people that use this system every day; therefore, businesses around the subway stations would like to be able to use Geofencing advertisement to target those commuters or possible consumers and attract them to their business locations at peak hours of the day.</p>
<p>Geofencing is a location-based technology service in which a mobile device's electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) around a geographical location. Businesses around those locations would like to use this technology to increase their sales.</p>
<p><img alt="ozkary-data-engineering-mta-geo-fence" src="https://github.com/ozkary/data-engineering-mta-turnstile/raw/main/images/mta-geo-fencing.png" title="Data Engineering Process - Problem Statement" /></p>
<p>The MTA subway system has stations around the city. All the stations are equipped with turnstiles or gates, which track each person entering or leaving the station. MTA provides this information in CSV files, which can be imported into a data warehouse, enabling an analytical process that identifies patterns and helps these businesses understand how to best target consumers.</p>
<h2 id="analytical-approach">Analytical Approach</h2>
<h3 id="dataset-criteria">Dataset Criteria</h3>
<p>We are using the MTA Turnstile data for 2023. Using this data, we can investigate the following criteria:</p>
<ul>
<li>Stations with a high number of exits by day and hour</li>
<li>Stations with a high number of entries by day and hour</li>
</ul>
<p>Exits indicate that commuters are arriving at those locations. Entries indicate that commuters are departing from those locations.</p>
<h3 id="data-analysis-criteria">Data Analysis Criteria</h3>
<p>The data can be grouped by station, date, and time of day. The data is audited in blocks of four hours; for example, one interval covers 8am to 12pm. We analyze the data within those time-block intervals to help identify the best times, both in the morning and in the afternoon, for each station location. This should allow businesses to target a particular geo-fence that is close to their business.</p>
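<p>A small pandas sketch shows how the audit timestamps can be grouped into those four-hour slots. It assumes the DATE and TIME columns have already been merged into a CREATED datetime column, as described in the conclusions below.</p>
<pre><code class="lang-python">import pandas as pd

# assumes df has a CREATED datetime column and the STATION, ENTRIES, EXITS columns
df['TIMESLOT'] = df['CREATED'].dt.floor('4H')     # 00:00, 04:00, 08:00, 12:00, ...
df['PERIOD'] = df['CREATED'].dt.strftime('%p')    # AM or PM

# totals by station and four-hour slot
slot_totals = df.groupby(['STATION', 'TIMESLOT'], as_index=False)[['ENTRIES', 'EXITS']].sum()
</code></pre>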
<p> In the discovery process, we take a look at the data that is available for our analysis. We are using the MTA turnstiles information which is available at this location:</p>
<p>π <a href="http://web.mta.info/developers/turnstile.html">New York Metropolitan Transportantion Authority Turnstile Data</a></p>
<p>We can download a single file to take a look at the data structure and make the following observations about the data:</p>
<h3 id="observations">Observations</h3>
<ul>
<li>It is available in weekly batches every Sunday</li>
<li>The information is audited in blocks of four hours</li>
<li>The date and time fields are in separate columns</li>
<li>The cumulative entries are in the ENTRIES field</li>
<li>The cumulative exits are in the EXITS field</li>
</ul>
<p><img alt="ozkary-data-engineering-mta-discovery" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-mta-discovery.png" title="Data Engineering Process - Discovery" /></p>
<h3 id="field-description">Field Description</h3>
<table>
<thead>
<tr>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>C/A</td>
<td>Control Area (A002) (Booth)</td>
</tr>
<tr>
<td>UNIT</td>
<td>Remote Unit for a station (R051)</td>
</tr>
<tr>
<td>SCP</td>
<td>Subunit Channel Position represents a specific address for a device (02-00-00)</td>
</tr>
<tr>
<td>STATION</td>
<td>Represents the station name the device is located at</td>
</tr>
<tr>
<td>LINENAME</td>
<td>Represents all train lines that can be boarded at this station. Normally lines are represented by one character. LINENAME 456NQR represents train service for the 4, 5, 6, N, Q, and R trains.</td>
</tr>
<tr>
<td>DIVISION</td>
<td>Represents the division the station originally belonged to: BMT, IRT, or IND</td>
</tr>
<tr>
<td>DATE</td>
<td>Represents the date (MM-DD-YY)</td>
</tr>
<tr>
<td>TIME</td>
<td>Represents the time (hh:mm:ss) for a scheduled audit event</td>
</tr>
<tr>
<td>DESC</td>
<td>Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours). Audits may occur more than 4 hours apart due to planning or troubleshooting activities. Additionally, there may be a "RECOVR AUD" entry, which refers to a missed audit that was recovered.</td>
</tr>
<tr>
<td>ENTRIES</td>
<td>The cumulative entry register value for a device</td>
</tr>
<tr>
<td>EXITS</td>
<td>The cumulative exit register value for a device</td>
</tr>
</tbody>
</table>
<h3 id="data-example">Data Example</h3>
<p>The data below shows the entry/exit register values for one turnstile at control area A002, from 09/27/14 at 00:00 hours to 09/27/14 at 08:00 hours. Since these registers are cumulative, a short sketch for deriving per-interval counts follows the table.</p>
<table>
<thead>
<tr>
<th>C/A</th>
<th>UNIT</th>
<th>SCP</th>
<th>STATION</th>
<th>LINENAME</th>
<th>DIVISION</th>
<th>DATE</th>
<th>TIME</th>
<th>DESC</th>
<th>ENTRIES</th>
<th>EXITS</th>
</tr>
</thead>
<tbody>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>00:00:00</td>
<td>REGULAR</td>
<td>0004800073</td>
<td>0001629137</td>
</tr>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>04:00:00</td>
<td>REGULAR</td>
<td>0004800125</td>
<td>0001629149</td>
</tr>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>08:00:00</td>
<td>REGULAR</td>
<td>0004800146</td>
<td>0001629162</td>
</tr>
</tbody>
</table>
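<p>Because ENTRIES and EXITS are cumulative register values, the counts for each four-hour interval come from differencing consecutive audits for the same device. A small pandas sketch, assuming the file has been loaded into df and a CREATED datetime column exists, illustrates the idea:</p>
<pre><code class="lang-python"># a device (turnstile) is identified by the C/A, UNIT, and SCP columns
device_keys = ['C/A', 'UNIT', 'SCP']
df = df.sort_values(device_keys + ['CREATED'])

# difference the cumulative registers per device to get the counts for each audit interval
df['ENTRIES_DELTA'] = df.groupby(device_keys)['ENTRIES'].diff()
df['EXITS_DELTA'] = df.groupby(device_keys)['EXITS'].diff()
</code></pre>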
<h3 id="conclusions">Conclusions</h3>
<p>Based on these observations, the following conclusions can be made (a short pandas sketch follows the list):</p>
<ul>
<li>Merge the DATE and TIME columns and create a date time column, CREATED</li>
<li>The STATION column is a location dimension</li>
<li>The CREATED column is the datetime dimension to enable the morning and afternoon timeframes</li>
<li>The ENTRIES column is the measure for entries</li>
<li>The EXITS column is the measure for exits</li>
<li>A gate can be identified by using the C/A, SCP and UNIT columns</li>
</ul>
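<p>The first and last conclusions translate directly into pandas. The GATE format below is only an assumption used for illustration:</p>
<pre><code class="lang-python">import pandas as pd

# merge DATE and TIME into the CREATED datetime dimension
df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')

# a gate is identified by the combination of control area, unit, and device address
df['GATE'] = df['C/A'] + '-' + df['UNIT'] + '-' + df['SCP']
</code></pre>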
<h3 id="requirements">Requirements</h3>
<p>These observations can be used to define technical requirements that can enable us to deliver a successful project.</p>
<ul>
<li>Define the infrastructure requirements to host the technology<ul>
<li>Automate the provisioning of the resources using Terraform</li>
<li>Deploy the technology on a cloud platform</li>
</ul>
</li>
<li>Define the data orchestration process<ul>
<li>On the original pipeline, load the initial data for 2023</li>
<li>Create a data pipeline that runs every week after a new file has been published</li>
<li>Copy the unstructured CSV files into a Data Lake</li>
</ul>
</li>
<li>Define a well-structured and optimized model on a Data Warehouse<ul>
<li>Keep the source code for the models under source control</li>
<li>Copy the data into the Data Warehouse</li>
<li>Allow access to the Data Warehouse, so visualization tools can consume the data.</li>
</ul>
</li>
<li>Create Data Analysis dashboard with the following information <ul>
<li>Data Analysis dashboard</li>
<li>Identify the time slots for morning and afternoon analysis</li>
<li>Look at the distribution by stations</li>
<li>Look at the daily models</li>
<li>Look at the time slot models</li>
</ul>
</li>
</ul>
<h2 id="review-the-code">Review the Code</h2>
<p>In order to do our data analysis, we first need to download some sample data by writing a Python script. We can then analyze this data by writing some code snippets that use the power of the Python Pandas library. We can also use Jupyter Notebooks to quickly manipulate the data and create some charts that can serve as baseline requirements for the final visualization dashboard.</p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step1-Discovery/" target="_pipeline">Clone this repo or copy the files from this folder
</a></p>
<h3 id="download-a-csv-file-from-the-mta-site">Download a CSV File from the MTA Site</h3>
<p>With this Python script (mta_discovery.py), we download a CSV file from the URL <a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a>. The code creates a data stream to download the file in chunks and avoid timeouts, appending each chunk to a local compressed file to reduce its size. To make the code reusable, we use a command-line parser, so the URL can be passed as a parameter.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_local</span><span class="hljs-params">(file_path: str)</span> -> Path:</span>
<span class="hljs-string">"""
Reads a local file
Args:
file_path: local file
"""</span>
print(F<span class="hljs-string">'Reading local file {file_path}'</span>)
df_iter = pd.read_csv(file_path, iterator=<span class="hljs-keyword">True</span>,compression=<span class="hljs-string">"gzip"</span>, chunksize=<span class="hljs-number">10000</span>)
<span class="hljs-keyword">if</span> df_iter:
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
print(<span class="hljs-string">'File headers'</span>,df.columns)
print(<span class="hljs-string">'Top 10 rows'</span>,df.head(<span class="hljs-number">10</span>))
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was loaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(F<span class="hljs-string">"failed to read file {file_path}"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_local</span><span class="hljs-params">(df: pd.DataFrame, folder: str, file_name: str)</span> -> Path:</span>
<span class="hljs-string">"""
Write DataFrame out locally as csv file
Args:
df: dataframe chunk
folder: the download data folder
file_name: the local file name
"""</span>
path = Path(f<span class="hljs-string">"{folder}"</span>)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(path):
path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
file_path = Path(f<span class="hljs-string">"{folder}/{file_name}"</span>)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.isfile(file_path):
df.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
print(<span class="hljs-string">'new file'</span>)
<span class="hljs-keyword">else</span>:
df.to_csv(file_path, header=<span class="hljs-keyword">None</span>, compression=<span class="hljs-string">"gzip"</span>, mode=<span class="hljs-string">"a"</span>)
print(<span class="hljs-string">'chunk appended'</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_web_to_local</span><span class="hljs-params">(url: str, name: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Download a file
Args:
url : The file url
name : the file name
"""</span>
print(url, name)
<span class="hljs-comment"># skip an existent file</span>
path = f<span class="hljs-string">"../data/"</span>
file_path = Path(f<span class="hljs-string">"{path}/{name}.csv.gz"</span>)
<span class="hljs-keyword">if</span> os.path.exists(file_path):
read_local(file_path)
<span class="hljs-keyword">return</span>
df_iter = pd.read_csv(url, iterator=<span class="hljs-keyword">True</span>, chunksize=<span class="hljs-number">10000</span>)
<span class="hljs-keyword">if</span> df_iter:
file_name = f<span class="hljs-string">"{name}.csv.gz"</span>
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
write_local(df, path, file_name)
<span class="hljs-keyword">except</span> StopIteration <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Finished reading file {ex}"</span>)
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was loaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">"dataframe failed"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Process a CSV file from a url location with the goal to understand the data structure
"""</span>
url = params.url
prefix = params.prefix
<span class="hljs-keyword">try</span>:
start_index = url.index(<span class="hljs-string">'_'</span>)
end_index = url.index(<span class="hljs-string">'.txt'</span>)
file_name = F<span class="hljs-string">"{prefix}{url[start_index:end_index]}"</span>
<span class="hljs-comment"># print(file_name)</span>
etl_web_to_local(url, file_name)
<span class="hljs-keyword">except</span> ValueError:
print(<span class="hljs-string">"Substring not found"</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
os.system(<span class="hljs-string">'clear'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Process CSV data to understand the data'</span>)
parser.add_argument(<span class="hljs-string">'--url'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'url of the csv file'</span>)
parser.add_argument(<span class="hljs-string">'--prefix'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'the file prefix or group name'</span>)
args = parser.parse_args()
print(<span class="hljs-string">'running...'</span>)
main_flow(args)
print(<span class="hljs-string">'end'</span>)
</code></pre>
<h3 id="analyze-the-data">Analyze the Data</h3>
<p>With some sample data, we can now take a look at the data and make some observations. There are a few ways to approach the analysis. We could create another Python script and play with the data, but this would require running the script from the console after every code change. A more productive way is to use Jupyter Notebooks. This tool enables us to edit and run code snippets in cells without having to run the entire script. It is a friendlier analysis tool that helps us focus on the data analysis instead of coding and running the script. In addition, once we are happy with our changes, the notebook can be exported into a Python file. Let's look at that file, discovery.ipynb:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment"># read the file and display the top 10 rows</span>
df = pd.read_csv(<span class="hljs-string">'../data/230318.csv.gz'</span>, iterator=<span class="hljs-keyword">False</span>,compression=<span class="hljs-string">"gzip"</span>)
df.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Create a new DateTime column and merge the DATE and TIME columns</span>
df[<span class="hljs-string">'CREATED'</span>] = pd.to_datetime(df[<span class="hljs-string">'DATE'</span>] + <span class="hljs-string">' '</span> + df[<span class="hljs-string">'TIME'</span>], format=<span class="hljs-string">'%m/%d/%Y %H:%M:%S'</span>)
df = df.drop(<span class="hljs-string">'DATE'</span>, axis=<span class="hljs-number">1</span>).drop(<span class="hljs-string">'TIME'</span>,axis=<span class="hljs-number">1</span>)
df.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Aggregate the information by station and datetime</span>
df[<span class="hljs-string">"ENTRIES"</span>] = df[<span class="hljs-string">"ENTRIES"</span>].astype(int)
df[<span class="hljs-string">"EXITS"</span>] = df[<span class="hljs-string">"EXITS"</span>].astype(int)
df_totals = df.groupby([<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"CREATED"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>]].sum()
df_totals.head(<span class="hljs-number">10</span>)
df_station_totals = df.groupby([<span class="hljs-string">"STATION"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>]].sum()
df_station_totals.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Show the total entries by station, use a subset of data</span>
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> plotly.graph_objects <span class="hljs-keyword">as</span> go
df_stations = df_station_totals.head(<span class="hljs-number">25</span>)
donut_chart = go.Figure(data=[go.Pie(labels=df_stations[<span class="hljs-string">"STATION"</span>], values=df_stations[<span class="hljs-string">"ENTRIES"</span>], hole=<span class="hljs-number">.2</span>)])
donut_chart.update_layout(title_text=<span class="hljs-string">'Entries Distribution by Station'</span>, margin=dict(t=<span class="hljs-number">40</span>, b=<span class="hljs-number">0</span>, l=<span class="hljs-number">10</span>, r=<span class="hljs-number">10</span>))
donut_chart.show()
<span class="hljs-comment"># Show the data by the day of the week</span>
df_by_date = df_totals.groupby([<span class="hljs-string">"CREATED"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>]].sum()
day_order = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
df_by_date[<span class="hljs-string">"WEEKDAY"</span>] = pd.Categorical(df_by_date[<span class="hljs-string">"CREATED"</span>].dt.strftime(<span class="hljs-string">'%a'</span>), categories=day_order, ordered=<span class="hljs-keyword">True</span>)
df_entries_by_date = df_by_date.groupby([<span class="hljs-string">"WEEKDAY"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>]].sum()
df_entries_by_date.head(<span class="hljs-number">10</span>)
bar_chart = go.Figure(data=[go.Bar(x=df_entries_by_date[<span class="hljs-string">"WEEKDAY"</span>], y=df_entries_by_date[<span class="hljs-string">"ENTRIES"</span>])])
bar_chart.update_layout(title_text=<span class="hljs-string">'Total Entries by Week Day'</span>)
bar_chart.show()
</code></pre>
<h2 id="how-to-run-it-">How to Run it!</h2>
<p>With an understanding of the code and tools, let's run the process.</p>
<h3 id="requirements">Requirements</h3>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Python-Dependencies" target="_python">Install Python, Pandas and Jupyter notebook
</a></p>
<p>π <a target="_vscode" href="https://code.visualstudio.com/download">
Install Visual Studio Code
</a></p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step1-Discovery" target="_python">Clone this repo or copy the files from this folder
</a></p>
<h3 id="follow-these-steps-to-run-the-analysis">Follow these steps to run the analysis</h3>
<ul>
<li>Download a file to look at the data<ul>
<li>This should create a gz file under the ../data folder</li>
</ul>
</li>
</ul>
<pre><code><span class="hljs-variable">$ </span>python3 mta_discovery.py --url <span class="hljs-symbol">http:</span>/<span class="hljs-regexp">/web.mta.info/developers</span><span class="hljs-regexp">/data/nyct</span><span class="hljs-regexp">/turnstile/turnstile</span>_230318.txt
</code></pre><p>Run the Jupyter notebook (dicovery.ipynb) to do some analysis on the data. </p>
<ul>
<li>Load the Jupyter notebook to do analysis<ul>
<li>First start the Jupyter server from the terminal</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>jupyter notebook
</code></pre>
<ul>
<li>See the URL on the terminal and click it to load it on the browser<ul>
<li>Click the discovery.ipynb file link</li>
</ul>
</li>
<li>Or open the file with VSCode and enter the URL when prompted to select a kernel</li>
<li>Run every cell from the top down, as this is required to load the dependencies</li>
</ul>
<p>The following images show the Jupyter notebook loaded in the browser or directly in VSCode.</p>
<h4 id="jupyter-notebook-loaded-on-the-browser">Jupyter Notebook loaded on the browser</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-mta.png" alt="ozkary-data-engineering-jupyter-notebook" title="Data Engineering Process - Discovery"></p>
<p></p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-notepbook.png" alt="ozkary-data-engineering-discovery-query" title="ozkary MTA jupyter notebook loaded"></p>
<h4 id="using-vscode-to-load-the-data-and-create-charts">Using VSCode to load the data and create charts</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-vscode.png" alt="ozkary-data-engineering-discovery-jupyter-vscode" title="ozkary MTA jupyter vscode"></p>
<h4 id="show-the-total-entries-by-station-using-a-subset-of-data-using-vscode">Show the total entries by station using a subset of data using VSCode</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-pie-chart.png" alt="ozkary-data-engineering-discovery-donut-chart" title="ozkary MTA jupyter donut chart"></p>
<h1 id="next-step">Next Step</h1>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-design-planning.html" title="Data Engineering Process Fundamentals - Design and Planning">Data Engineering Process Fundamentals - Design and Planning</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="Software and data engineering professional information">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-61761556361879937112023-04-01T12:22:00.044-04:002023-07-14T14:59:17.480-04:00Data Engineering Process Fundamentals - Discovery<h2 id="introduction">Introduction</h2>
<p>In this series of Data Engineering Process Fundamentals, we explore the Data Engineering Process (DEP) with key concepts, principles and relevant technologies, and explain how they are being used to help us deliver solutions. The first step in this process, and one we should never skip, is the Discovery step.</p>
<p>During the discovery step of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us understand what we are trying to solve. We also define our analytical approach to make observations about the data, its structure and source. This leads us into defining the requirements for the project, so we can determine the scope, design and architecture of the solution.</p>
<p><img alt="ozkary-data-engineering-process-discovery" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-discovery.png" title="Data Engineering Process - Discovery" /></p>
<h3 id="problem-statement">Problem Statement</h3>
<p>A Problem Statement is a description of what it is that we are trying to solve. As part of the problem statement, we should provide some background or context on how the data is processed or collected. We should also provide a specific description of what the data engineering process is looking to solve by taking a specific approach to integrate the data. Finally, the objective and goals should also be described with information about how this data will be made available for analytical purposes.</p>
<h3 id="analytical-approach">Analytical Approach</h3>
<p>The Analytical Approach is a systematic method to observe the data and arrive at insights from it. It involves the use of different techniques, tools and frameworks to make sense of the data in order to arrive at conclusions and actionable requirements. </p>
<h4 id="dataset-criteria">Dataset Criteria</h4>
<p>A Dataset Criteria technique refers to the set of characteristics used to evaluate the data, so we can determine its quality and accuracy. </p>
<p>In the data collection process, we should identify the various sources that can provide us with accurate and complete information. Data cleaning and preprocessing needs to be done to identify and eliminate missing values, invalid data and outliers that can skew the information. In addition, we should understand how this data is available for the ongoing integration. Some integrations may require a batch process integration at a scheduled interval. Others may require a real-time integration and/or a combination of batch and real-time processing.</p>
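<p>To make these checks concrete, the following is a minimal pandas sketch of a data quality pass. It assumes the same MTA turnstile sample file used later in this exercise; the column list and checks are illustrative only.</p>
<pre><code class="lang-python">import pandas as pd

# load a sample weekly file (path follows the exercise layout)
df = pd.read_csv('../data/230318.csv.gz', compression='gzip')

# count missing values per column
print(df.isnull().sum())

# drop rows that are missing the key fields and remove duplicates
df = df.dropna(subset=['STATION', 'DATE', 'TIME', 'ENTRIES', 'EXITS'])
df = df.drop_duplicates()

# coerce the counters to numbers and flag invalid (negative) readings
df['ENTRIES'] = pd.to_numeric(df['ENTRIES'], errors='coerce')
df['EXITS'] = pd.to_numeric(df['EXITS'], errors='coerce')
invalid = df[df['ENTRIES'].lt(0) | df['EXITS'].lt(0)]
print(f'Invalid rows: {len(invalid)}')
</code></pre>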
<h4 id="exploratory-data-analysis">Exploratory Data Analysis</h4>
<p>We should conduct exploratory data analysis to understand the structure, patterns and characteristics of the data. We need to make observations about the data, identify the valuable fields, create statistical summaries, and run some data profiling to identify trends, metrics and correlations that are relevant to the main objective of the project.</p>
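<p>A quick profiling pass with pandas supports these observations. This is only a sketch against the sample file; a full analysis would cover every relevant field for the project.</p>
<pre><code class="lang-python">import pandas as pd

df = pd.read_csv('../data/230318.csv.gz', compression='gzip')

# structure: columns, data types and non-null counts
df.info()

# statistical summary of the numeric fields
print(df.describe())

# cardinality of a categorical field, e.g. the number of stations
print(df['STATION'].nunique())

# correlation between the numeric counters
print(df[['ENTRIES', 'EXITS']].corr())
</code></pre>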
<h4 id="tools-and-framework">Tools and Framework</h4>
<p>Depending on the size and budget of the organization, the solution can be built with lots of coding and integration, or a low-code, turn-key solution that provides enterprise-quality resources can be used instead. Regardless of the approach, Python is a popular language for data science and engineering, and it is always applicable. The Pandas library is great for data manipulation and analysis, and Jupyter Notebooks with Python scripts are great for experiments and discovery.</p>
<p>
To run our Python scripts and Jupyter notebooks, we can use Visual Studio Code (VSCode), which is a cross-platform Integrated Development Environment (IDE). This tool also integrates with source control and deployment platforms like GitHub, so we can maintain version control and automate the deployment and testing of our code changes.
</p>
<p>To orchestrate the pipelines, we often use a workflow framework like Apache Airflow or Prefect. To host the data, we use data lakes (blob storage) and a relational data warehouse. For data modeling, incremental data loads, and continuous testing and data ingestion, Apache Spark or dbt Cloud are used.</p>
<p>For the final data analysis and visualization, we can use tools like Looker, Power BI and Tableau. These tools can connect to a data warehouse and consume the data models, so they can visualize the data in ways that enable stakeholders to make decisions based on the story provided by the data.</p>
<h3 id="requirements">Requirements</h3>
<p>Requirements refer to the needs, capabilities and constraints that are needed to deliver a data engineering solution. They should outline the project deliverables that are required to meet the main objectives. The requirements should include data related areas like: </p>
<ul>
<li>Sources and integration</li>
<li>Modeling and transformation</li>
<li>Quality and validation</li>
<li>Storage and infrastructure </li>
<li>Processing and Analytics</li>
<li>Governance and Security</li>
<li>Scalability and performance</li>
<li>Monitoring</li>
</ul>
<h3 id="design">Design</h3>
<p>A data engineering design is the actual plan to build the technical solution. It includes the system architecture, data integration, flow and pipeline orchestration, the data storage platform, transformation and management, data processing and analytics tooling. This is the area where we need to clearly define the different technologies that should be used for each area. </p>
<h4 id="system-architecture">System Architecture</h4>
<p>The system architecture is a high-level design of the solution, its components and how they integrate with each other. This often includes the data sources, data ingestion resources, workflow and data orchestration resources and frameworks, storage resources, data services for data transformation and continuous data ingestion and validation, and data analysis and visualization tooling.</p>
<h4 id="data-pipelines">Data Pipelines</h4>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. </p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of python code and hardware resources used by the hosting environment. For the ELT process, the transformation may be done using SQL code and the data warehouse resources, which often tend to perform great for big data scenarios.</p>
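<p>As a rough sketch of the difference, the example below uses pandas for the ETL-style transformation and plain SQL for the ELT-style transformation. SQLite stands in for the data warehouse here, and the table names are placeholders for whatever the design defines.</p>
<pre><code class="lang-python">import pandas as pd
from sqlalchemy import create_engine, text

# SQLite is only a stand-in for a real data warehouse
engine = create_engine('sqlite:///warehouse.db')

def run_etl(source_file: str) -> None:
    """ETL: transform with pandas first, then load the curated result."""
    df = pd.read_csv(source_file, compression='gzip')
    df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')
    totals = df.groupby(['STATION', 'CREATED'], as_index=False)[['ENTRIES', 'EXITS']].sum()
    totals.to_sql('station_totals', engine, if_exists='replace', index=False)

def run_elt(source_file: str) -> None:
    """ELT: load the raw data first, then transform with SQL in the warehouse."""
    df = pd.read_csv(source_file, compression='gzip')
    df.to_sql('staging_turnstile', engine, if_exists='replace', index=False)
    sql = """
        CREATE TABLE IF NOT EXISTS station_entries AS
        SELECT STATION, SUM(ENTRIES) AS ENTRIES, SUM(EXITS) AS EXITS
        FROM staging_turnstile
        GROUP BY STATION
    """
    with engine.begin() as conn:
        conn.execute(text(sql))
</code></pre>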
<h4 id="data-orchestration">Data Orchestration</h4>
<p>Data orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
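<p>As an illustration, a minimal Prefect 2.x-style flow could wire these concerns together. The retry settings and local file path are assumptions for this sketch; a production flow would write to the data lake and alert on failures.</p>
<pre><code class="lang-python">import requests
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def download_file(url: str) -> bytes:
    """Download the source file; the task retries on transient failures."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content

@task
def save_file(content: bytes, path: str) -> None:
    """Persist the raw file; a real flow would upload to blob storage."""
    with open(path, 'wb') as file:
        file.write(content)

@flow(name='mta-turnstile-ingestion')
def ingest(url: str, path: str) -> None:
    content = download_file(url)
    save_file(content, path)

if __name__ == '__main__':
    ingest(
        'http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt',
        '../data/turnstile_230318.txt',
    )
</code></pre>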
<h2 id="summary">Summary</h2>
<p>The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks, or other tools, to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.</p>
<h2 id="exercise-hands-on-use-case">Exercise - Hands-on Use Case</h2>
<p>Since we now understand the discovery step, we should be able to put that into practice. Let's move on to a hands-on use case and see how we apply those concepts.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamental-discovery-exercise.html" title="Data Engineering Process Fundamentals - Discovery Exercise">Data Engineering Process Fundamentals - Discovery Exercise</a></p>
</blockquote>
<p>
Thanks for reading.
</p><p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="ozkary.com AI, Data Engineering">ozkary.com</a></p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-85146545861634795852023-03-25T11:41:00.040-04:002023-06-30T12:49:46.181-04:00Data Engineering Process Fundamentals<h2 id="introduction">Introduction</h2>
<p>Data Engineering is changing constantly. From cloud data platforms and pipeline automation to data streaming and visualization tools, new innovations are impacting the way we build today's data and analytical solutions. </p>
<p>In this series of Data Engineering Process Fundamentals, we explore the Data Engineering Process (DEP) with key concepts, principles and relevant technologies, and explain how they are being used to help us deliver the solution. We discuss concepts and take on a real use case where we execute an end-to-end process from downloading data to visualizing the results. </p>
<p>
The end goal of this series is to take us through a process in which we deliver an architecture that facilitates the ongoing analysis of big data via analytical and visualization tools. In the following images, we can get a preview of what we will be delivering as we execute each step of the process. </p>
<h3 id="data-engineering-process-architecture">Data Engineering - Architecture</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process-architecture" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process-architecture.png" title="Data Engineering Process Architecture" />
</p>
<h3 id="data-engineering-process-Analysis-Results">Data Engineering - Analysis Results</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process-analysis-results" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process-dashboard.png" title="Data Engineering Process Analysis Results" />
</p>
<h3 id="data-engineering-process">Data Engineering Process</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process.png" title="Data Engineering Process" width="640" />
</p>
<p>A Data Engineering Process follows a series of steps that should be executed to properly understand the problem statement, scope of work, design and architecture that should be used to create the solution. Some of these steps include the following:</p>
<blockquote>
<p>π Join this newsletter to receive updates <a href="https://maven.com/forms/56ae79">Sign up here</a></p>
</blockquote>
<ul>
<li><a href="//www.ozkary.dev/data-engineering-process-foundamentals-discovery">Discovery</a><ul>
<li>Problem Statement</li>
<li>Data Analysis</li>
<li>Define the Requirements and Scope of Work</li>
<li>Discovery Exercise</li>
</ul>
</li>
<li><a href="//www.ozkary.dev/data-engineering-process-foundamentals-design-planning">Design and Planning</a><ul>
<li>Design Approach</li>
<li>System Architecture</li>
<li>Cloud Engineering and Automation</li>
<li>Design Exercise</li>
</ul>
</li>
</ul>
<ul>
<li>Data Orchestration and Operations<ul>
<li>Pipeline Orchestration<ul>
<li>Batch Processing</li>
</ul>
</li>
<li>Workflow Automation</li>
<li>Deployment, Schedules and Monitoring</li>
</ul>
</li>
<li>Data Warehouse and Modeling<ul>
<li>Data modeling</li>
<li>Data Warehouse Design</li>
<li>Continuous Integration</li>
</ul>
</li>
<li>Data Analysis and Visualization<ul>
<li>Analyze the data</li>
<li>Visualization Concepts</li>
<li>Create a Dashboard<ul>
<li>Provide answers to the problem statement</li>
</ul>
</li>
</ul>
</li>
<li>Streaming Data<ul>
<li>Data Warehouse Integration</li>
<li>Real-time dashboard</li>
</ul>
</li>
</ul>
<h2 id="concepts">Concepts</h2>
<h3 id="what-is-data-engineering-">What is Data Engineering?</h3>
<p>Data Engineering is the practice of designing and building solutions by integrating, transforming and consolidating various data sources into a centralized and structured system, such as a Data Warehouse, at scale, so the data becomes available for building analytics solutions.</p>
<h3 id="what-is-a-data-engineering-process-">What is a Data Engineering Process?</h3>
<p>A Data Engineering Process (DEP) is the sequence of steps that engineers should follow in order to build a testable, robust and scalable solution. This process starts early with a problem statement to understand what the team is trying to solve. It is then followed by data analysis and requirements discovery, which leads to a design and architecture approach in which the different applicable technologies are identified.</p>
<h3 id="operational-and-analytical-data">Operational and Analytical data</h3>
<p>Operational data is often generated by applications, and it is stored in transactional databases like SQL Server, CosmosDB, Firebase and others. This is the data that is created after an application saves a user transaction like contact information, a purchase or other activities that are available from the application. This system is not designed to support Big Data query scenarios, so the reporting system should not overload its resources with large queries.</p>
<p>Analytical data is the transactional data that has been processed and optimized for analytical and visualization purposes. This data is often processed via Data Lakes and stored in a Data Warehouse.</p>
<h3 id="data-pipelines-and-orchestration">Data Pipelines and Orchestration</h3>
<p>Data Pipelines are used to orchestrate and automate workflows that move and process the transactional data into Data Lakes and Data Warehouses. The pipelines execute repeatable Extract Transform and Load (ETL) or Extract Load and Transform (ELT) processes that can be triggered by a schedule or a data event. </p>
<h3 id="data-lakes">Data Lakes</h3>
<p>A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. This can include structured data like CSV files, semi-structured data like JSON and XML documents, or column-based data like parquet files.</p>
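<p>For example, a short sketch of landing a raw file into a cloud data lake might look like the following. The bucket and blob names are placeholders, and it assumes the google-cloud-storage client library and credentials are already configured.</p>
<pre><code class="lang-python">from google.cloud import storage

def upload_to_data_lake(local_file: str, bucket_name: str, blob_name: str) -> None:
    """Store the file in its raw format, without any transformation."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_file)

# example call with placeholder bucket and path names
upload_to_data_lake('../data/230318.csv.gz', 'example-data-lake', 'turnstile/230318.csv.gz')
</code></pre>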
<h3 id="data-warehouse">Data Warehouse</h3>
<p>A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. This system stores historical data in relational tables with an optimized schema, which enables the data analysis process. This system can also integrate external resources like CSV and parquet files that are stored in Data Lakes as external tables. The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system. </p>
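<p>As a sketch of the external table idea, the snippet below runs BigQuery-style DDL over parquet files in a data lake. The dataset, table and bucket names are placeholders, and it assumes the google-cloud-bigquery client library is installed and authenticated.</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# expose the parquet files in the data lake as an external table
ddl = """
    CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
    OPTIONS (
        format = 'PARQUET',
        uris = ['gs://example-data-lake/turnstile/*.parquet']
    )
"""
client.query(ddl).result()
</code></pre>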
<h3 id="data-batch-processing">Data Batch Processing</h3>
<p>Batch Processing is a method often used to run high-volume, repetitive data jobs. It is usually scheduled during certain time windows that do not impact the application operations, as these processes are often used to export the data from transactional systems. A batch job is an automated software task that may include one or more workflows. These workflows can often run without supervision, and they are monitored by other tools to ensure that the process is not failing. </p>
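<p>A simple illustration of a batch job is a script that processes every pending weekly file in one run; a scheduler such as cron or an orchestration tool would trigger it during an off-peak window. The folder layout mirrors the earlier exercise and is only an assumption.</p>
<pre><code class="lang-python">from pathlib import Path
import pandas as pd

def process_batch(data_folder: str = '../data') -> None:
    """Process all pending weekly files in a single scheduled run."""
    for file in sorted(Path(data_folder).glob('*.csv.gz')):
        df = pd.read_csv(file, compression='gzip')
        # a real job would transform and load the data here
        print(f'{file.name}: {len(df)} rows processed')

if __name__ == '__main__':
    process_batch()
</code></pre>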
<h3 id="streaming-data">Streaming Data</h3>
<p>Streaming Data is a data source that sends small messages in real-time but at a high volume. This data often comes from Internet-of-Things (IoT) devices, manufacturing equipment or social media sources, often producing a high volume of information per second. This information is often captured in aggregated time windows and then stored in a Data Warehouse, so it can be combined with other analytical data. It can also be sent to monitoring and/or real-time systems to show the current system KPIs or any type of variance in the system.</p>
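<p>As a sketch of the aggregated time window idea, the example below uses Spark Structured Streaming to count messages from a Kafka topic in one-minute windows. The broker address and topic name are placeholders, and it assumes the Spark Kafka connector package is available on the cluster.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName('streaming-windows').getOrCreate()

# read the message stream from a Kafka topic (placeholder broker and topic)
events = (spark.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'localhost:9092')
    .option('subscribe', 'device-events')
    .load())

# aggregate the message count into one-minute tumbling windows
counts = (events
    .groupBy(window(col('timestamp'), '1 minute'))
    .agg(count('*').alias('messages')))

# write the aggregates to the console; a real pipeline would target the warehouse
query = (counts.writeStream
    .outputMode('complete')
    .format('console')
    .start())

query.awaitTermination()
</code></pre>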
<h2 id="next-step">Next Step</h2>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-discovery.html" title="">Data Engineering Process Fundamentals - Discovery</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a></p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-4992832309936880342023-03-18T12:37:00.027-04:002023-05-18T12:54:53.133-04:00GitHub Codespaces Quick ReviewThis is a quick review about GitHub Codespaces, which you can load right on the browser.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZhUFYrHSkwyafbtfVFdmWXg55DOfl5EjaLwnFq-klw4dsVXPTfj4NV3VvK9v874keEZZmOyss_6BayXCTfWMR9ziCcIJRZG-ztaBobhI7PIuIpp40WfOVGxIQv2mZk3uuHbXwaf-VztNyG5BUQ4W6mxEhSYbDgkWQ7mIlWDTZYfbNbBPnKl_Izlv1ag/s1294/ozkary-github-codespaces-review.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="ozkary github codespaces review" border="0" data-original-height="676" data-original-width="1294" height="334" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZhUFYrHSkwyafbtfVFdmWXg55DOfl5EjaLwnFq-klw4dsVXPTfj4NV3VvK9v874keEZZmOyss_6BayXCTfWMR9ziCcIJRZG-ztaBobhI7PIuIpp40WfOVGxIQv2mZk3uuHbXwaf-VztNyG5BUQ4W6mxEhSYbDgkWQ7mIlWDTZYfbNbBPnKl_Izlv1ag/w640-h334/ozkary-github-codespaces-review.png" title="ozkary github codespaces review" width="640" /></a></div>
<p>
In this video, we talk about creating a Codespaces instance from a GitHub Repository. We load a GitHub project using the VM instance that is provisioned for us when a GitHub Codespace is added to the repo. To edit the project files, the browser loads a VS Code online version of the IDE, which then uses the SSH protocol to virtualize the code from the VM onto our browser. Since we are basically using a VM, we can open a terminal from the browser session and run NPM and Git commands. All these commands are executed on the VM.
GitHub Codespaces are a quick way to provision a VM instance without the complexity of manually building one on a cloud account.
</p>
<div>
<p style="text-align: center;">
<iframe frameborder="0" height="270" src="https://youtube.com/embed/woDCq_gHh94" width="480"></iframe>
</p>
</div>
<p>
Thanks for reading.
</p>
<p>Send questions or comments on Twitter @ozkary</p>
Originally published by <a href="https://www.ozkary.com"> ozkary.com </a>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-59910205595504359412023-02-11T15:55:00.076-05:002023-03-16T16:34:04.830-04:00React Suspense Fallback Keeps Rendering When Using Lazy-Loaded Routes<div><span id="docs-internal-guid-559a116c-7fff-3e4c-7f6f-2402dce5e123"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When a React application is loading a new lazy-loaded route or a component, it is a common practice to show a loading component with an animation to provide feedback to the user. In React, we use the Suspense component to handle this scenario, but there are times when the fallback component never unloads, and it keeps rendering on the browser.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When a Suspense fallback component never unloads, it is because there must be a child component that keeps rendering. This causes the Suspense component to continue to run, thinking that a child component in still in suspense or loading state. This gives us the wrong sense that the Suspense component is misbehaving, when it is in fact the child components that is at fault. 
For a deeper understanding, letβs take a look at a real example.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqPXM_8xNoiFLgBIvmKByqJdIy9U3WzoNMx6H7IQUo8ZF3lRc5bR8511EG5WBjMJk0PoI8BKoeCtlgM9TXDIEtqVyqYA3SizWAchZLuk66BYE8aH8AEx7tfWTkY-7-Pv8CqSFxiUh4FEHZndtV0VUJTTP3uOsBYSCO8GtY5Ob9tZXcQqZDeDPXoCO35A/s531/Ozkary-react-suspend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="ozkary react suspend component" border="0" data-original-height="411" data-original-width="531" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqPXM_8xNoiFLgBIvmKByqJdIy9U3WzoNMx6H7IQUo8ZF3lRc5bR8511EG5WBjMJk0PoI8BKoeCtlgM9TXDIEtqVyqYA3SizWAchZLuk66BYE8aH8AEx7tfWTkY-7-Pv8CqSFxiUh4FEHZndtV0VUJTTP3uOsBYSCO8GtY5Ob9tZXcQqZDeDPXoCO35A/w400-h310/Ozkary-react-suspend.png" title="ozkary react suspend component" width="400" /></a></div><br /><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Suspense with Declarative Routes</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">To demonstrate this problem, letβs first take a look at an implementation of a React application that loads some route information in a declarative approach, which is simple enough and does not introduce any rendering problems. We should also notice how the components are lazy-loaded to help us do code splitting for performance improvements on the load time and enable us to trigger the Suspense component automatically.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=App-Declarative-Routes.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π This is a typical approach when creating apps with a simple routing structure.</p></div><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Suspense with a Router Component</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Letβs now take a look at a more complex scenario where we need to load the route information from a JSON configuration file. This introduces a variation on the process by adding the routes using a function. </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=App-Imperative-Routes.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=router-with-rendering-problem.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This code change introduces new behavior, and it causes the component to re-render multiple times. We can trace that by adding a console.log operation before returning the content. If we look at the browser console, we should see the output of our console.log call. From the application standpoint, this behavior is noticeable because the fallback component continues to show on the browser as the Suspense component does not detect that the child component is done rendering. Now that we understand the problem, how should we correct it?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π A child component continues to be in suspense until it stops rendering.</p></div><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Adding State Management</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">To correct this behavior, we should clearly understand the root cause. Since we introduce a dynamic way to load the routes, the component has no way to understand its current state. It only knows that some data is being loaded every time it calls the function, and the data seems new or different. To avoid this, we need to add state management to the component, which is a React best practice when writing data driven components. Letβs refactor our code to see how state management can make a difference.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p>
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=router-with-state.tsx"></script>
<p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">By looking at our new implementation, we can see that the route collection is now managed in a state variable. We also use an </span><span style="font-family: Arial; font-size: 11pt; font-style: italic; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">useEffect </span><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">hook to load the data. This enables us to initiate the state of the component with some valid data. We also track the route collection as a dependency, and since there are no changes to the data, the state does not change, and the component completes its suspense state thus allowing the app to complete its rendering process.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The React Suspense component is a great feature to provide feedback to users while a component is rendering. When we add dynamic data to the application, we should understand that this impacts the state management process, which is used to signal when a component is in a suspense state. Depending on how nested your components are, a child component can continue to render, causing the Suspense fallback UI to continue to load non-stop. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π To diagnose a component, you can add a suspense component around it, and we should notice only that component content to continue to use the suspense fallback animation.</p></div><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Thanks for reading.</span></p><br /><br /></span></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-29518630821332416642023-01-14T14:00:00.082-05:002023-02-24T12:03:13.182-05:00Use Remote Dev Container with GitHub Codespaces<div><span id="docs-internal-guid-566b8831-7fff-7fa2-8b46-7b0101fc685c"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As Software Engineers, we usually work with multiple projects in parallel. This forces us to configure our work stations with multiple software development tools, which eventually leaves our workstation performing poorly. To overcome this problem, we often use virtual machine (VM) instances that run in our workstations or a cloud provider like Azure. Setting up those VMs also introduces some overhead into our software development process. As engineers, we want to be able to accelerate this process by using a remote development environment provider like GitHub Codespaces.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 503px; overflow: hidden; width: 353px;"><img alt="ozkary-github-codespaces-layers" height="400" src="https://lh4.googleusercontent.com/9TxffnPtSbWGLsG5kZ5BLTUxLffULu19FHsyTrwrXdt4ahhj1PrAI-dzFNL4maf8kLL6nOcMjQ9ysOdA0bw2MshnraaDvhcv5xopbm3dKEcgDDtCvSAzwNS8KAOlALmdP8PjAbEU6LKsJ4ILMjaSYOI=w281-h400" style="margin-left: 0px; margin-top: 0px;" title="GitHub Codespaces layers" width="281" /></span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">What is GitHub Codespaces?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">GitHub Codespaces is a cloud hosted development environment that is associated to a GitHub repository. Each environment or Dev Container is cloud hosted in a Docker container with the core dependencies that are required for that project. The container is hosted on a VM running Ubuntu Linux. The hardware is also configurable. It starts with a minimum of 2 cores, 8 GB of RAMs and 32 GB of storage, which should be a good foundation to run small projects. 
In addition, the hardware resources can be increased up to 32 cores, 64 GB RAM and 128 GB of storage which matches a very good workstation configuration.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π There are monthly quotas for using the remote environments of 120 hrs for personal accounts, and 180 hrs for the PRO account.</span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">How to use Codespaces?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Codespaces leverages the Secure Shell (SSH) protocol, which provides a secure channel between client and server. This protocol is used to provide remote access to resources like VMs that are hosted on cloud platforms. This protocol makes it possible for browsers and IDE tools like VS Code to connect remotely and manage the projects.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When using the browser, the VS Code Browser IDE is loaded. We can also use a local installation of VS Code or any IDE that support SSH. The development process works the same way as if running locally with the exception that the files are hosted remotely, and we can also use a terminal window to execute build commands within the VM space.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">How to start a project with GitHub Codespaces</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">We can start a Codespaces environment right from GitHub. Open a repo on GitHub and click on the Code tab and then click the Code button from the toolbar. This opens up the options to create a new Codespaces, connect to an existent one, or even configure your Codespaces resources, more on this later. 
</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π You can use this repo if you do not have one </span><a href="https://github.com/ozkary/Data-Engineering-Bootcamp" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://github.com/ozkary/Data-Engineering-Bootcamp</span></a></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 535px; overflow: hidden; width: 624px;"><img alt="ozkary-create-github-code-space" height="549" src="https://lh6.googleusercontent.com/S7r8X3fX9XKmQrIQXBLZCrH3tZc1tCLEA2ICPqzL3xUkAtixdz-al1rz6ZGo_Bh3SAahmYztlOvySVpqHgvrjRpKN6mVQlVTMbRvNfUBYCWr2D9WebYUmceQ44-NkHRFqSVgW0_OkD9IDVJNPephbNY=w640-h549" style="margin-left: 0px; margin-top: 0px;" title="ozkary create github code space" width="640" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When you add a new environment, GitHub essentially provisions a VM on Azure. It loads a Docker image with some of the dependencies of your project. For example, if your code is .Net Core or a TypeScript with React project, a Docker image with those dependencies is built and provisioned into the VM.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><b>π The Docker images are preconfigured. We can also build a custom image to meet specific requirements.</b></span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Once the environment is provisioned, we can open the project using any of the options listed on the image below. I prefer to use my local VS Code instance, as I often have all the tools needed to work on my projects. Once the project is open on VS Code, the project connection is cached, and we only need to open VS Code again to load the remote project. 
The browser feature is also very useful, so do take it for a spin and see how you like it.</span></p><div><span><br /></span></div><span id="docs-internal-guid-b522682d-7fff-3b57-f984-b0c834ef8d25"><div style="text-align: center;"><img alt="ozkary how to open github codespaces" height="510" src="https://lh3.googleusercontent.com/T0uB4aFukF5smoA7KsCJRLHRa7cg3x-6Ann8SFvWq7WEw4Fp9uz2yYHKANqNT3GsMbAcq2nkbpm-CF4PmcLGoNYIodi5MtO6TL-17D-bg4bJS1uJivzAXZRklg8PycIZZUCf1oi6WkjWpHCYlviDedM=w640-h510" style="font-family: Arial; font-size: 11pt; margin-left: 0px; margin-top: 0px; white-space: pre-wrap;" title="ozkary how to open github codespaces" width="640" /></div></span><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><br /></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Use a Terminal to Manage the Project</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When the project is open remotely, we can run common activities like adding additional dependencies, building and debugging the project. Since the environment is running on Ubuntu, we can open a terminal window on VS Code. This enables us to run the CLI commands that we need in order to manage our project. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In the case of Web projects, we can run the project remotely using our browser. Even though the project runs remotely on the VM, port forwarding is used for secured remote access, so we can open our local browser and load the app. 
We can see the forwarded ports for our application on the ports tab of VS Code.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 109px; overflow: hidden; width: 624px;"><img alt="ozkary vscode port forwarding" height="113" src="https://lh3.googleusercontent.com/bVLQzf8bxQoo_xTBgLnjGUc5kh2hDWbGBXqYGu2brBw7uW2me5cLdo7ApqTOQEgAqa8I8iIRyZvUx6oJkYKkMWeH0DR_cpE8gHue0KVYYNyPKspBwk0lbBnhyt4MDWXybDCeRSkNCQ1_Lo26RQ2-bR4=w640-h113" style="margin-left: 0px; margin-top: 0px;" title="ozkary vscode port forwarding" width="640" /></span></span></p><br /><br /><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Managing your Codespaces Instance</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In some cases, we may see some performance issue on our remote environment. If this is the case, we need to inspect the current instance configuration and if possible upgrade the resources. Since this is an Ubuntu instance, we can use the terminal to run commands like lscpu to check the current configuration like cpuβs and memory. We can also use the Codespaces command toolbar and change the machine type, which provides a quick shortcut to change the machine type or configure the container. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The Dev Container can also be customized by making changes to the devcontainer.json file. 
Additional customization can be done by building a custom Docker image to meet specific development environment requirements.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π When the Dev Container is changed, the VM requires a re-start, which is done automatically</span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 359px; overflow: hidden; width: 503px;"><img alt="ozkary github codespaces vscode commands" height="286" src="https://lh5.googleusercontent.com/TLvL6p8svKt70N2hgGsnc2fWWaRpUWk9QA1G2KT07BvwZsklH0SVD4_K20FCuD5UYFWE7EhZNSdeKCeLhrjlGXKdccFrt1KGu-8r13S0YIHggc2__b4v8sJhCkzaisFb-hmRL2JgUHvnw27mOZr4H6Q=w400-h286" style="margin-left: 0px; margin-top: 0px;" title="ozkary github codespaces vscode commands" width="400" /></span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Summary</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">By leveraging the use of remote managed development environments, software engineers can save time by not having to work on a development environment configuration, we can instead use GitHub Codespaces to quickly provision Dev Containers that get us up and running in a short time, thus allowing us to focus on our development tasks instead of environment managing tasks.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Thanks for reading.</span></p><br /></span></div><div><br /></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0