<h1 id="coupling-data-flows">Coupling Data Flows: Data Pipelines and Orchestration - Data Engineering Process Fundamentals</h1>
<h1 id="overview">Overview</h1>
<p>A data pipeline refers to a series of connected tasks that handle the extract, transform, and load (ETL) as well as the extract, load, and transform (ELT) operations, moving data from a source to a target storage system such as a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>In this technical presentation, we embark on the next chapter of our data journey, delving into building a pipeline with orchestration for ongoing development and operational support.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-data-pipeline-orchestration.png" alt="Data Engineering Process Fundamentals - Data Pipelines" title="Data Engineering Process Fundamentals - Data Pipelines"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/5ZoK9oKXWMI?si=JlS41yDrx7mddRZt" title="Data Engineering Process Fundamentals - Data Pipeline" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ul>
<li><p><strong>Understanding Data Pipelines:</strong></p>
<ul>
<li>Delve into the concept of data pipelines and their significance in modern data engineering.</li>
</ul>
</li>
<li><p><strong>Implementation Options:</strong></p>
<ul>
<li>Explore different approaches to implementing data pipelines, including code-centric and low-code tools.</li>
</ul>
</li>
<li><p><strong>Pipeline Orchestration:</strong></p>
<ul>
<li>Learn about the role of orchestration in managing complex data workflows and the tools available, such as Apache Airflow, Apache Spark, Prefect, and Azure Data Factory.</li>
</ul>
</li>
<li><p><strong>Cloud Resources:</strong></p>
<ul>
<li>Identify the necessary cloud resources for staging environments and data lakes to support efficient data pipeline deployment.</li>
</ul>
</li>
<li><p><strong>Implementing Flows:</strong></p>
<ul>
<li>Examine the process of building data pipelines, including defining tasks, components, and logging mechanisms.</li>
</ul>
</li>
<li><p><strong>Deployment with Docker:</strong></p>
<ul>
<li>Discover how Docker containers can be used to isolate data pipeline environments and streamline deployment processes.</li>
</ul>
</li>
<li><p><strong>Monitor and Operations:</strong></p>
<ul>
<li>Manage operational concerns related to data pipeline performance, reliability, and scalability.</li>
</ul>
</li>
</ul>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Gain practical insights into building and managing data pipelines.</p>
</li>
<li><p>Learn coding techniques with Python for efficient data pipeline development.</p>
</li>
<li><p>Discover the benefits of Docker deployments for data pipeline management.</p>
</li>
<li><p>Understand the significance of data orchestration in the data engineering process.</p>
</li>
<li><p>Connect with industry professionals and expand your network.</p>
</li>
<li><p>Stay updated on the latest trends and advancements in data pipeline architecture and orchestration.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Cloud Infrastructure</li>
<li>Data Pipelines</li>
<li>GitHub</li>
<li>VSCode</li>
<li>Docker and Docker Hub</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Understanding Data pipelines</li>
<li>Implementation Options </li>
<li>Pipeline Orchestration</li>
<li>Cloud Resources</li>
<li>Implementing Code-Centric Flows</li>
<li>Deployment with Docker</li>
<li>Monitor and Operations</li>
</ul>
<p><strong>Follow this project: Star/Follow the project</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="understanding-data-pipelines">Understanding Data Pipelines</h3>
<p>A data pipeline refers to a series of connected tasks that handle the extract, transform, and load (ETL) as well as the extract, load, and transform (ELT) operations, moving data from a source to a target storage system such as a data lake or data warehouse.</p>
<h4 id="foundational-areas">Foundational Areas</h4>
<ul>
<li>Data Ingestion and Transformation</li>
<li>Code-Centric vs. Low-Code Options</li>
<li>Orchestration</li>
<li>Cloud Resources</li>
<li>Implementing flows, tasks, components and logging</li>
<li>Deployment</li>
<li>Monitoring and Operations</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration.png" alt="Data Engineering Process Fundamentals - Data Pipeline and Orchestration" title="Data Engineering Process Fundamentals - Data Pipeline and Orchestration"></p>
<h3 id="data-ingestion-and-transformation">Data Ingestion and Transformation</h3>
<p>Data ingestion is the process of bringing data in from various sources, such as databases, APIs, data streams and files, into a staging area. Once the data is ingested, we can transform it to match our requirements.</p>
<p><strong>Key Areas:</strong></p>
<ul>
<li>Identify methods for extracting data from various sources (databases, APIs, Data Streams, files, etc.).</li>
<li>Choose between batch or streaming ingestion based on data needs and use cases</li>
<li>Data cleansing and standardization ensure quality and consistency.</li>
<li>Data enrichment adds context and value.</li>
<li>Formatting into the required data models for analysis.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-data-sources.png" alt="Data Engineering Process Fundamentals - Data Pipeline Sources" title="Data Engineering Process Fundamentals - Data Pipeline Sources"></p>
<h3 id="implementation-options">Implementation Options</h3>
<p>The implementation of a pipeline refers to the design and/or coding of each task in the pipeline. A task can be implemented using a programming language like Python or SQL. It can also be implemented using a low-code tool with little or no code.</p>
<p><strong>Options:</strong></p>
<ul>
<li><p>Code-centric: Provides flexibility, customization, and full control (Python, SQL, etc.). Ideal for complex pipelines with specific requirements. Requires programming expertise.</p>
</li>
<li><p>Low-code: Offers visual drag-and-drop interfaces that allow the engineer to connect to APIs, databases, data lakes, and other sources, enabling faster development. (Azure Data Factory, GCP Cloud Dataflow)</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-integrations.png" alt="Data Engineering Process Fundamentals - Data Pipeline Integration" title="Data Engineering Process Fundamentals - Data Pipeline Integration"></p>
<h3 id="pipeline-orchestration">Pipeline Orchestration</h3>
<p>Orchestration is the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration handles the execution, error handling, retry and the alerting of problems in the pipeline.</p>
<p><strong>Orchestration Tools:</strong></p>
<ul>
<li>Apache Airflow: Offers flexible and customizable workflow creation for engineers using Python code, ideal for complex pipelines.</li>
<li>Apache Spark: Excels at large-scale batch processing tasks involving API calls and file downloads with Python. Its distributed framework efficiently handles data processing and analysis.</li>
<li>Prefect: This open-source workflow management system allows defining and managing data pipelines as code, providing a familiar Python API.</li>
<li>Cloud-based Services: Tools like Azure Data Factory and GCP Cloud Dataflow provide a visual interface for building and orchestrating data pipelines, simplifying development. They also handle logging and alerting.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="Data Engineering Process Fundamentals - Data Pipeline Architecture" title="Data Engineering Process Fundamentals - Data Pipeline Architecture"></p>
<h3 id="cloud-resources">Cloud Resources</h3>
<p>Cloud resources are critical for data pipelines. Virtual machines (VMs) offer processing power for code-centric pipelines, while data lakes serve as central repositories for raw data. Data warehouses, optimized for structured data analysis, often integrate with data lakes to enable deeper insights.</p>
<p><strong>Resources:</strong></p>
<ul>
<li><p><strong>Code-centric pipelines:</strong> VMs are used for executing workflows, managing orchestration, and providing resources for data processing and transformation. Often, code runs within Docker containers.</p>
</li>
<li><p><strong>Data Storage:</strong> Data lakes act as central repositories for storing vast amounts of raw and unprocessed data. They offer scalable and cost-effective solutions for capturing and storing data from diverse sources.</p>
</li>
<li><p><strong>Low-code tools:</strong> typically have their own infrastructure needs specified by the platform provider. Provisioning might not be necessary, and the tool might be serverless or run on pre-defined infrastructure.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-flow.png" alt="Data Engineering Process Fundamentals - Data Pipeline Resources" title="Data Engineering Process Fundamentals - Data Pipeline Resources"></p>
<h3 id="implementing-code-centric-flows">Implementing Code-Centric Flows</h3>
<p>In a data pipeline, orchestrated <strong>flows</strong> define the overall sequence of steps. These flows consist of <strong>tasks</strong>, which represent specific actions within the pipeline. For modularity and reusability, a task should use <strong>components</strong> to encapsulate common concerns like security and data lake access.</p>
<p><strong>Pipeline Structure:</strong></p>
<ul>
<li><p>Flows: Are coordinators that define the overall structure and sequence of the data pipeline. They are responsible for orchestrating the execution of other flows or tasks in a specific order.</p>
</li>
<li><p>Tasks: Are operators for the individual units of work within the pipeline. Each task represents a specific action or function performed on the data, such as data extraction, transformation, or loading. They manipulate the data according to the flow's instructions.</p>
</li>
<li><p>Components: These are reusable code blocks that encapsulate functionalities common across different tasks. They act as utilities, providing shared functionality like security checks, data lake access, logging, or error handling.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-monitor-dashboard.png" alt="Data Engineering Process Fundamentals - Data Pipeline Monitor" title="Data Engineering Process Fundamentals - Data Pipeline Monitor"></p>
<h3 id="deployment-with-docker-and-docker-hub">Deployment with Docker and Docker Hub</h3>
<p>Docker proves invaluable for our data pipelines by providing self-contained environments with all necessary dependencies. With Docker Hub, we can effortlessly distribute pipeline images, facilitating swift and reliable provisioning of new environments.</p>
<ul>
<li><p>Docker containers streamline the deployment process by encapsulating application and dependency configurations, reducing runtime errors.</p>
</li>
<li><p>Containerizing data pipelines ensures reliability and portability by packaging all necessary components within a single container image.</p>
</li>
<li><p>Docker Hub serves as a centralized container registry, enabling seamless image storage and distribution for streamlined environment provisioning and scalability.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="Data Engineering Process Fundamentals - Data Pipeline Containers" title="Data Engineering Process Fundamentals - Data Pipeline Containers"></p>
<h3 id="monitor-and-operations">Monitor and Operations</h3>
<p>Monitoring your data pipeline's performance with telemetry data is key to smooth operations. This enables the operations team to proactively identify and address issues, ensuring efficient data delivery.</p>
<p><strong>Key Components:</strong></p>
<ul>
<li><p><strong>Telemetry Tracing:</strong> Tracks the execution of flows and tasks, providing detailed information about their performance, such as execution time, resource utilization, and error messages.</p>
</li>
<li><p><strong>Monitor and Dashboards:</strong> Visualize key performance indicators (KPIs) through user-friendly dashboards, offering real-time insights into overall pipeline health and facilitating anomaly detection.</p>
</li>
<li><p><strong>Notifications to Support:</strong> Timely alerts are essential for the operations team to be notified of any critical issues or performance deviations, enabling them to take necessary actions.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-fundamentals-monitor-events.png" alt="Data Engineering Process Fundamentals - Data Pipeline Dashboard" title="Data Engineering Process Fundamentals - Data Pipeline Dashboard"></p>
<h2 id="summary">Summary</h2>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, management, and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake cloud resources, which we can also automate with Terraform. By selecting the appropriate programming language and orchestration tools, we can construct resilient pipelines capable of scaling and meeting evolving data demands effectively.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-83709458399936345632024-02-14T15:00:00.008-05:002024-02-15T10:09:21.808-05:00Unlock the Blueprint: Design and Planning Phase - Data Engineering Process Fundamentals<h1 id="overview">Overview</h1>
<p>The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.</p>
<p>In this technical presentation, we embark on the next chapter of our data journey, delving into the critical Design and Planning
Phase. </p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-design-planning.png" alt="Data Engineering Process Fundamentals" title="Data Engineering Process Fundamentals"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/O6gqhreqGDo?si=8vnZxaRj1K7oJCJp" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ul>
<li><p><strong>System Design and Architecture:</strong> Understanding the foundational principles that shape a robust and scalable data system.</p>
</li>
<li><p><strong>Data Pipeline and Orchestration:</strong> Uncovering the essentials of designing an efficient data pipeline and orchestrating seamless data flows.</p>
</li>
<li><p><strong>Source Control and Deployment:</strong> Navigating the best practices for source control, versioning, and deployment strategies.</p>
</li>
<li><p><strong>CI/CD in Data Engineering:</strong> Implementing Continuous Integration and Continuous Deployment (CI/CD) practices for agility and reliability.</p>
</li>
<li><p><strong>Docker Container and Docker Hub:</strong> Harnessing the power of Docker containers and Docker Hub for containerized deployments.</p>
</li>
<li><p><strong>Cloud Infrastructure with IaC:</strong> Exploring technologies for building out cloud infrastructure using Infrastructure as Code (IaC), ensuring efficiency and consistency.</p>
</li>
</ul>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Gain insights into designing scalable and efficient data systems.</p>
</li>
<li><p>Learn best practices for cloud infrastructure and IaC.</p>
</li>
<li><p>Discover the importance of data pipeline orchestration and source control.</p>
</li>
<li><p>Explore the world of CI/CD in the context of data engineering.</p>
</li>
<li><p>Unlock the potential of Docker containers for your data workflows.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Cloud Infrastructure</li>
<li>Data Pipelines</li>
<li>GitHub and Actions</li>
<li>VS Code</li>
<li>Docker and Docker Hub</li>
<li>Terraform</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Importance of Design and Planning</li>
<li>System Design and Architecture</li>
<li>Data Pipeline and Orchestration</li>
<li>Source Control and CI/CD</li>
<li>Docker Containers</li>
<li>Cloud Infrastructure with IaC</li>
</ul>
<p><strong>Follow this project: Give a star</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="importance-of-design-and-planning">Importance of Design and Planning</h3>
<p>The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.</p>
<h4 id="foundational-areas">Foundational Areas</h4>
<ul>
<li>Designing the data pipeline and technology specifications, such as flows, coding language, data governance, and tools</li>
<li>Defining the system architecture, including cloud services for scalability and the data platform</li>
<li>Source control and deployment automation with CI/CD</li>
<li>Using Docker containers for environment isolation to avoid deployment issues</li>
<li>Infrastructure automation with Terraform or cloud CLI tools</li>
<li>System monitoring, notifications, and recovery</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-design-planning.png" alt="Data Engineering Process Fundamentals - Design and Planning" title="Data Engineering Process Fundamentals - Design and Planning"></p>
<h3 id="system-design-and-architecture">System Design and Architecture</h3>
<p>In a system design, we need to clearly define the different technologies that should be used for each area of the solution. It includes the high-level system architecture, which defines the different components and their integration.</p>
<ul>
<li><p>The <strong>design</strong> outlines the technical solution, including system architecture, data integration, flow orchestration, storage platforms, and data processing tools. It focuses on defining technologies for each component to ensure a cohesive and efficient solution.</p>
</li>
<li><p>A <strong>system architecture</strong> is a critical high-level design encompassing various components such as data sources, ingestion resources, workflow orchestration, storage, transformation services, continuous ingestion, validation mechanisms, and analytics tools.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-architecture-stream.png" alt="Data Engineering Process Fundamentals - System Architecture" title="Data Engineering Process Fundamentals - System Architecture"></p>
<h3 id="data-pipeline-and-orchestration">Data Pipeline and Orchestration</h3>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources. </p>
<ul>
<li>This can be code-centric, leveraging languages like Python, SQL</li>
<li>Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution</li>
<li>Monitor services enable us to track telemetry data to support operational requirements</li>
<li>Docker Hub and GitHub can be used for the CI/CD process to deploy our code-centric solutions</li>
<li>Scheduling, recovery from failures, and dashboards are essential for orchestration</li>
<li>Low-code solutions, like Azure Data Factory, can also be used</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="Data Engineering Process Fundamentals - Data Pipeline" title="Data Engineering Process Fundamentals - Data Pipeline"></p>
<h3 id="source-control-ci-cd">Source Control - CI/CD</h3>
<p>Implementing source control practices alongside Continuous Integration and Continuous Delivery (CI/CD) pipelines is vital for facilitating agile development. This ensures efficient collaboration, change tracking, and seamless code deployment, crucial for addressing ongoing feature changes, bug fixes, and new environment deployments.</p>
<ul>
<li>Systems like Git facilitate effective code and configuration file management, enabling collaboration and change tracking.</li>
<li>Platforms such as GitHub enhance collaboration by providing a remote repository for sharing code.</li>
<li>CI involves integrating code changes into a central repository, followed by automated build and test processes to validate changes and provide feedback.</li>
<li>CD automates the deployment of code builds to various environments, such as staging and production, streamlining the release process and ensuring consistency across environments.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-ci-cd.png" alt="Data Engineering Process Fundamentals - GitHub CI/CD" title="Data Engineering Process Fundamentals - GitHub CI/CD"></p>
<h3 id="docker-container-and-docker-hub">Docker Container and Docker Hub</h3>
<p>Docker proves invaluable for our data pipelines by providing self-contained environments with all necessary dependencies. With Docker Hub, we can effortlessly distribute pipeline images, facilitating swift and reliable provisioning of new environments.</p>
<ul>
<li>Docker containers streamline the deployment process by encapsulating application and dependency configurations, reducing runtime errors.</li>
<li>Containerizing data pipelines ensures reliability and portability by packaging all necessary components within a single container image.</li>
<li>Docker Hub serves as a centralized container registry, enabling seamless image storage and distribution for streamlined environment provisioning and scalability.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="Data Engineering Process Fundamentals - Docker" title="Data Engineering Process Fundamentals - Docker"></p>
<h3 id="cloud-infrastructure-with-iac">Cloud Infrastructure with IaC</h3>
<p>Infrastructure automation is crucial for maintaining consistency, scalability, and reliability across environments. By defining infrastructure as code (IaC), organizations can efficiently provision and modify cloud resources, mitigating manual errors.</p>
<ul>
<li>Define infrastructure configurations as code, ensuring consistency across environments.</li>
<li>Easily scale resources up or down to meet changing demands with code-defined infrastructure.</li>
<li>Reduce manual errors and ensure reproducibility by automating resource provisioning and management.</li>
<li>Track infrastructure changes under version control, enabling collaboration and ensuring auditability.</li>
<li>Track infrastructure state, allowing for precise updates and minimizing drift between desired and actual configurations. </li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-terraform.png" alt="Data Engineering Process Fundamentals - Terraform" title="Data Engineering Process Fundamentals - Terraform"></p>
<h2 id="summary">Summary</h2>
<p>The design and planning phase of a data engineering project sets the stage for success. From designing the system architecture and data pipelines to implementing source control, CI/CD, Docker, and infrastructure automation with Terraform, every aspect contributes to efficient and reliable deployment. Infrastructure automation, in particular, plays a critical role by simplifying provisioning of cloud resources, ensuring consistency, and enabling scalability, ultimately leading to a robust and manageable data engineering system. </p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<h4>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></h4>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-30031161643762415292024-01-31T15:00:00.013-05:002024-02-15T10:02:48.527-05:00Decoding Data: A Journey into the Discovery Phase - Data Engineering Process Fundamentals<h1 id="overview">Overview</h1>
<p>The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.</p>
<p>In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA) using Python, VSCode, Jupyter Notebooks, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-decoding-data-with-discovery.png" alt="Data Engineering Process Fundamentals - Discovery Phase" title="Data Engineering Process Fundamentals - Discovery Phase"></p>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></p>
</blockquote>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></p>
</blockquote>
<h2 id="youtube-video">YouTube Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/F2WHH5MrmE4?si=QbU8uhwwcBKtwLeI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="video-agenda">Video Agenda</h3>
<ol>
<li><p>Introduction:</p>
<ul>
<li><p>Unveiling the importance of the discovery process in data engineering.</p>
</li>
<li><p>Setting the stage with a real-world problem statement that will guide our exploration.</p>
</li>
</ul>
</li>
<li><p>Setting the Stage:</p>
<ul>
<li><p>Downloading and comprehending sample data to kickstart our discovery journey.</p>
</li>
<li><p>Configuring the development environment with VSCode and Jupyter Notebooks.</p>
</li>
</ul>
</li>
<li><p>Exploratory Data Analysis (EDA):</p>
<ul>
<li><p>Delving deep into EDA techniques with a focus on the discovery phase.</p>
</li>
<li><p>Demonstrating practical approaches using Python to uncover insights within the data.</p>
</li>
</ul>
</li>
<li><p>Code-Centric Approach:</p>
<ul>
<li><p>Advocating the significance of a code-centric approach during the discovery process.</p>
</li>
<li><p>Showcasing how a code-centric mindset enhances collaboration, repeatability, and efficiency.</p>
</li>
</ul>
</li>
<li><p>Version Control with GitHub:</p>
<ul>
<li><p>Integrating GitHub seamlessly into our workflow for version control and collaboration.</p>
</li>
<li><p>Managing changes effectively to ensure a streamlined data engineering discovery process.</p>
</li>
</ul>
</li>
<li><p>Real-World Application:</p>
<ul>
<li><p>Applying insights gained from EDA to address the initial problem statement.</p>
</li>
<li><p>Discussing practical solutions and strategies derived from the discovery process.</p>
</li>
</ul>
</li>
</ol>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>Mastery of the foundational aspects of data engineering.</p>
</li>
<li><p>Hands-on experience with EDA techniques, emphasizing the discovery phase.</p>
</li>
<li><p>Appreciation for the value of a code-centric approach in the data engineering discovery process.</p>
</li>
</ul>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Python</li>
<li>Data Analysis and Visualization</li>
<li>Jupyter Notebook</li>
<li>Visual Studio Code</li>
</ul>
<h2 id="presentation">Presentation</h2>
<h3 id="data-engineering-overview">Data Engineering Overview</h3>
<p>A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.</p>
<h4 id="topics">Topics</h4>
<ul>
<li>Importance of the Discovery Process</li>
<li>Setting the Stage - Technologies</li>
<li>Exploratory Data Analysis (EDA)</li>
<li>Code-Centric Approach</li>
<li>Version Control</li>
<li>Real-World Use Case</li>
</ul>
<p><strong>Follow this project: Give a star</strong></p>
<blockquote>
<p>π <a href="//github.com/ozkary/data-engineering-mta-turnstile">Data Engineering Process Fundamentals</a></p>
</blockquote>
<h3 id="importance-of-the-discovery-process">Importance of the Discovery Process</h3>
<p>The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.</p>
<ul>
<li>Clearly document the problem statement to understand the challenges the project aims to address.</li>
<li>Make observations about the data, its structure, and sources during the discovery process.</li>
<li>Define project requirements based on the observations, enabling the team to understand the scope and goals.</li>
<li>Clearly outline the scope of the project, ensuring a focused and well-defined set of objectives.</li>
<li>Use insights from the discovery phase to inform the design of the solution, including data architecture.</li>
<li>Develop a robust project architecture that aligns with the defined requirements and scope.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-discovery.png" alt="Data Engineering Process Fundamentals - Discovery Process" title="Data Engineering Process Fundamentals - Discovery Process"></p>
<h3 id="setting-the-stage-technologies">Setting the Stage - Technologies</h3>
<p>To set the stage, we need to identify and select the tools that can facilitate the analysis and documentation of the data. Here are key technologies that play a crucial role in this stage:</p>
<ul>
<li><strong>Python:</strong> A versatile programming language with rich libraries for data manipulation, analysis, and scripting.</li>
</ul>
<p><strong>Use Cases:</strong> Data download, cleaning, exploration, and scripting for automation.</p>
<ul>
<li><strong>Jupyter Notebooks:</strong> An interactive tool for creating and sharing documents containing live code, visualizations, and narrative text.</li>
</ul>
<p><strong>Use Cases:</strong> Exploratory data analysis, documentation, and code collaboration.</p>
<ul>
<li><strong>Visual Studio Code:</strong> A lightweight, extensible code editor with powerful features for source code editing and debugging.</li>
</ul>
<p><strong>Use Cases:</strong> Writing and debugging code, integrating with version control systems like GitHub.</p>
<ul>
<li><strong>SQL (Structured Query Language):</strong> A domain-specific language for managing and manipulating relational databases.</li>
</ul>
<p><strong>Use Cases:</strong> Querying databases, data extraction, and transformation.</p>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-discovery-tools.png" alt="Data Engineering Process Fundamentals - Discovery Tools" title="Data Engineering Process Fundamentals - Discovery Tools"></p>
<h3 id="exploratory-data-analysis-eda-">Exploratory Data Analysis (EDA)</h3>
<p>EDA is our go-to method for downloading, analyzing, understanding and documenting the intricacies of the datasets. It's like peeling back the layers of information to reveal the stories hidden within the data. Here's what EDA is all about:</p>
<ul>
<li><p>EDA is the process of analyzing data to identify patterns, relationships, and anomalies, guiding the project's direction.</p>
</li>
<li><p>Python and Jupyter Notebook collaboratively empower us to download, describe, and transform data through live queries.</p>
</li>
<li><p>Insights gained from EDA set the foundation for informed decision-making in subsequent data engineering steps.</p>
</li>
<li><p>Code written on Jupyter Notebook can be exported and used as the starting point for components for the data pipeline and transformation services.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-pie-chart.png" alt="Data Engineering Process Fundamentals - Discovery Pie Chart" title="Data Engineering Process Fundamentals - Discovery Pie Chart"></p>
<h3 id="code-centric-approach">Code-Centric Approach</h3>
<p>A code-centric approach, using programming languages and tools in EDA, helps us understand the coding methodology for building data structures, defining schemas, and establishing relationships. This robust understanding seamlessly guides project implementation.</p>
<ul>
<li><p>Code delves deep into data intricacies, revealing integration and transformation challenges often unclear with visual tools.</p>
</li>
<li><p>Using code taps into Pandas and Numpy libraries, empowering robust manipulation of data frames, establishment of loading schemas, and addressing transformation needs.</p>
</li>
<li><p>Code-centricity enables sophisticated analyses, covering aggregation, distribution, and in-depth examinations of the data.</p>
</li>
<li><p>While visual tools have their merits, a code-centric approach excels in hands-on, detailed data exploration, uncovering subtle nuances and potential challenges. </p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-jupyter-observations.png" alt="Data Engineering Process Fundamentals - Discovery Pie Chart" title="Data Engineering Process Fundamentals - Discovery Pie Chart"></p>
<h3 id="version-control">Version Control</h3>
<p>Using a tool like GitHub is essential for effective version control and collaboration in our discovery process. GitHub enables us to track our exploratory code and Jupyter Notebooks, fostering collaboration, documentation, and comprehensive project management. Here's how GitHub enhances our process:</p>
<ul>
<li><p><strong>Centralized Tracking:</strong> GitHub centralizes tracking and managing our exploratory code and Jupyter Notebooks, ensuring a transparent and organized record of our data exploration.</p>
</li>
<li><p><strong>Sharing:</strong> Easily share code and Notebooks with team members on GitHub, fostering seamless collaboration and knowledge sharing.</p>
</li>
<li><p><strong>Documentation:</strong> GitHub supports Markdown, enabling comprehensive documentation of processes, findings, and insights within the same repository.</p>
</li>
<li><p><strong>Project Management:</strong> GitHub acts as a project management hub, facilitating CI/CD pipeline integration for smooth and automated delivery of data engineering projects.</p>
</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2024/ozkary-data-engineering-process-problem-statement.png" alt="Data Engineering Process Fundamentals - Discovery Problem Statement" title="Data Engineering Process Fundamentals - Discovery Problem Statement"></p>
<h2 id="summary">Summary</h2>
<p>The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks or other tools to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-76798558017433544712023-12-02T14:32:00.013-05:002023-12-11T11:13:50.288-05:00AI - A Learning Based Approach For Predicting Heart Disease<h1 id="abstract">Abstract</h1>
<p>Heart disease is a leading cause of mortality worldwide, and its early identification and risk assessment are critical for effective prevention and intervention. With the help of electronic health records (EHR) and a wealth of health-related data, there is a significant opportunity to leverage machine learning techniques for predicting and assessing the risk of heart disease in individuals.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-app.png" alt="ozkary-ai-engineering-heart-disease" title="Predicting Heart Disease"></p>
<p>The United States Centers for Disease Control and Prevention (CDC) has been collecting a vast array of data on demographics, lifestyle, medical history, and clinical parameters. This data repository offers a valuable resource to develop predictive models that can help identify those at risk of heart disease before symptoms manifest.</p>
<p>This study aims to use machine learning models to predict an individual's likelihood of developing heart disease based on CDC data. By employing advanced algorithms and data analysis, we seek to create a predictive model that factors in various attributes such as age, gender, cholesterol levels, blood pressure, smoking habits, and other relevant health indicators. The solution could assist healthcare professionals in evaluating an individual's risk profile for heart disease.</p>
<h2 id="key-objectives">Key Objectives</h2>
<p>Key objectives of this study include:</p>
<ol>
<li>Developing a robust machine learning model capable of accurately predicting the risk of heart disease using CDC data.</li>
<li>Identifying the most influential risk factors and parameters contributing to heart disease prediction.</li>
<li>Compare model performance:<ul>
<li>Logistic Regression</li>
<li>Decision Tree</li>
<li>Random Forest</li>
<li>XGBoost Classification</li>
</ul>
</li>
<li>Evaluating the following metrics:<ul>
<li>Accuracy</li>
<li>Precision</li>
<li>F1</li>
<li>Recall</li>
</ul>
</li>
<li>Providing an API, so tools can integrate and make a risk analysis.<ul>
<li>Build a local app </li>
<li>Build an Azure function for cloud deployment</li>
</ul>
</li>
</ol>
<p>The successful implementation of this study will lead to a transformative impact on public health by enabling timely preventive measures and tailored interventions for individuals at risk of heart disease.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This study was conducted using four different Machine Learning algorithms. After comparing the performance of all these models, we concluded that the <strong>XGBoost Model</strong> has relatively balanced precision and recall metrics, indicating that it's better at identifying true positives while keeping false positives in check. Based on this analysis, we chose XGBoost as the best performing model for this type of analysis.</p>
<h2 id="machine-learning-engineering-process">Machine Learning Engineering Process</h2>
<p>In order to execute this project, we follow a series of steps for discovery and data analysis, data processing, and model selection. This process is done using Jupyter Notebooks for the experimental phase, and Python files for the implementation and delivery phase.</p>
<h3 id="experimental-phase-notebooks">Experimental Phase Notebooks</h3>
<ul>
<li>Data analysis and cleanup <ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_analysis.ipynb">Step 1 - Data Analysis</a> </li>
</ul>
</li>
<li>Process and convert the data for modeling, feature analysis<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_processing.ipynb">Step 2 - Data Processing</a></li>
</ul>
</li>
<li>Train the model using different algorithm to evaluate the best option<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_train.ipynb">Step 3 - Model Training</a></li>
</ul>
</li>
<li>Run test cases and predict results<ul>
<li><a href="https://github.com/ozkary/machine-learning-engineering/blob/main/projects/heart-disease-risk/data_predict.ipynb">Step 4 - Model Prediction</a></li>
</ul>
</li>
</ul>
<blockquote>
<p>The data files for this study can be found in the same GitHub project as the Jupyter Notebook files.</p>
</blockquote>
<h2 id="data-analysis-exploratory-data-analysis-eda-">Data Analysis - Exploratory Data Analysis (EDA)</h2>
<p>These are the steps to analyze the data (a brief sketch follows the list):</p>
<ul>
<li>Load the data/2020/heart_2020_cleaned.csv</li>
<li>Fill in the missing values with zero</li>
<li>Review the data <ul>
<li>Rename the columns to lowercase</li>
<li>Check the data types</li>
<li>Preview the data</li>
</ul>
</li>
<li>Identify the features<ul>
<li>Identify the categorical and numeric features</li>
<li>Identify the target variables </li>
</ul>
</li>
<li>Remove duplicates</li>
<li>Identify categorical features that can be converted into binary</li>
<li>Check the class balance in the data<ul>
<li>Check for Y/N labels for heart disease identification</li>
</ul>
</li>
</ul>
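<p>Here is a brief sketch of those steps; it assumes the CDC file layout described in this section and keeps only the essential calls.</p>
<pre><code class="lang-python">import pandas as pd

# load the CDC dataset
df = pd.read_csv("./data/2020/heart_2020_cleaned.csv")

# fill in the missing values with zero
df = df.fillna(0)

# rename the columns to lowercase
df.columns = df.columns.str.lower()

# check the data types and preview the data
print(df.dtypes)
print(df.head())

# remove duplicate rows
df = df.drop_duplicates()

# check the class balance for the Y/N heart disease labels
print(df["heartdisease"].value_counts(normalize=True))
</code></pre>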
<h3 id="features">Features</h3>
<p>Based on the dataset, we have a mix of categorical and numerical features. We consider the following for encoding:</p>
<ol>
<li><p><strong>Categorical Features:</strong></p>
<ul>
<li>'heartdisease': This is the target variable. We remove this feature for the model training.</li>
<li>'smoking', 'alcoholdrinking', 'stroke', 'sex', 'agecategory', 'race', 'diabetic', 'physicalactivity', 'genhealth', 'sleeptime', 'asthma', 'kidneydisease', 'skincancer': These are categorical features. We can consider one-hot encoding these features.</li>
</ul>
</li>
<li><p><strong>Numerical Features:</strong></p>
<ul>
<li>'bmi', 'physicalhealth', 'mentalhealth', 'diffwalking': These are already numerical features, so there's no need to encode them.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python"><span class="hljs-comment"># get a list of numeric features</span>
features_numeric = <span class="hljs-keyword">list</span>(df.select_dtypes(<span class="hljs-keyword">include</span>=[np.number]).columns)
<span class="hljs-comment"># get a list of object features and exclude the target feature 'heartdisease'</span>
features_category = <span class="hljs-keyword">list</span>(df.select_dtypes(<span class="hljs-keyword">include</span>=[<span class="hljs-string">'object'</span>]).columns)
<span class="hljs-comment"># remove the target feature from the list of categorical features</span>
target = <span class="hljs-string">'heartdisease'</span>
features_category.remove(target)
<span class="hljs-keyword">print</span>(<span class="hljs-string">'Categorical features'</span>,features_category)
<span class="hljs-keyword">print</span>(<span class="hljs-string">'Numerical features'</span>,features_numeric)
</code></pre>
<h3 id="data-validation-and-class-balance">Data Validation and Class Balance</h3>
<p>The data shows an imbalance between the Y/N classes. There are fewer cases of heart disease, as expected, than the rest of the population. This can result in low performing models, as there are many more negative cases (N). To account for that, we can use techniques like down-sampling the negative cases.</p>
<h4 id="heart-disease-distribution">Heart Disease Distribution</h4>
<pre><code class="lang-python"><span class="hljs-comment"># plot a distribution of the target variable set labels for each bar chart and show the count</span>
print(df[target].value_counts(normalize=<span class="hljs-keyword">True</span>).round(<span class="hljs-number">2</span>))
<span class="hljs-comment"># plot the distribution of the target variable</span>
df[target].value_counts().plot(kind=<span class="hljs-string">'bar'</span>, rot=<span class="hljs-number">0</span>)
plt.xlabel(<span class="hljs-string">'Heart disease'</span>)
plt.ylabel(<span class="hljs-string">'Count'</span>)
<span class="hljs-comment"># add a count label to each bar</span>
<span class="hljs-keyword">for</span> i, count <span class="hljs-keyword">in</span> enumerate(df[target].value_counts()):
plt.text(i, count<span class="hljs-number">-50</span>, count, ha=<span class="hljs-string">'center'</span>, va=<span class="hljs-string">'top'</span>, fontweight=<span class="hljs-string">'bold'</span>)
plt.show()
<span class="hljs-comment"># # get the percentage of people with heart disease on a pie chart</span>
df[target].value_counts(normalize=<span class="hljs-keyword">True</span>).plot(kind=<span class="hljs-string">'pie'</span>, labels=[<span class="hljs-string">'No heart disease'</span>, <span class="hljs-string">'Heart disease'</span>], autopct=<span class="hljs-string">'%1.1f%%'</span>, startangle=<span class="hljs-number">90</span>)
plt.ylabel(<span class="hljs-string">''</span>)
plt.show()
</code></pre>
<blockquote>
<p>No: 91%, Yes: 9%</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-class-balance.png" alt="Heart Disease Class Balance"></p>
<h2 id="data-processing">Data Processing</h2>
<p>For data processing, we should follow these steps (a brief sketch of the value conversion follows the list):</p>
<ul>
<li>Load the data/2020/heart_2020_eda.csv</li>
<li>Process the values<ul>
<li>Convert Yes/No features to binary (1/0)</li>
<li>Cast all the numeric values to int to avoid float problems</li>
</ul>
</li>
<li>Process the features<ul>
<li>Set the categorical features names</li>
<li>Set the numeric features names </li>
<li>Set the target variable</li>
</ul>
</li>
<li>Feature importance analysis<ul>
<li>Use statistical analysis to get the metrics like risk and ratio</li>
<li>Mutual Information score</li>
</ul>
</li>
</ul>
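<p>Below is a brief sketch of the value-conversion step from the list above; the exact lists of Yes/No and numeric columns are illustrative and would come from the feature analysis.</p>
<pre><code class="lang-python">import pandas as pd

# load the output of the EDA step
df = pd.read_csv("./data/2020/heart_2020_eda.csv")

# illustrative subset of Yes/No features to convert to binary (1/0)
binary_features = ["heartdisease", "smoking", "alcoholdrinking", "stroke",
                   "asthma", "kidneydisease", "skincancer"]
for feature in binary_features:
    df[feature] = df[feature].map({"Yes": 1, "No": 0})

# cast the numeric values to int to avoid float problems
numeric_features = ["bmi", "physicalhealth", "mentalhealth"]
df[numeric_features] = df[numeric_features].astype(int)

print(df.head())
</code></pre>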
<h4 id="feature-analysis">Feature Analysis</h4>
<p>The purpose of feature analysis in a heart disease study is to uncover the relationships and associations between various patient characteristics (features) and the occurrence of heart disease. By examining factors such as lifestyle, medical history, demographics, and more, we aim to identify which specific attributes or combinations of attributes are most strongly correlated with heart disease. Feature analysis allows for the discovery of risk factors and insights that can inform prevention and early detection strategies. </p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate the mean and count of heart disease occurrences per feature value</span>
feature_importance = []
<span class="hljs-comment"># Create a dataframe for the analysis</span>
results = pd.DataFrame(columns=[<span class="hljs-string">'Feature'</span>, <span class="hljs-string">'Value'</span>, <span class="hljs-string">'Percentage'</span>])
<span class="hljs-keyword">for</span> feature <span class="hljs-keyword">in</span> all_features:
grouped = df.groupby(feature)[target].mean().reset_index()
grouped.columns = [<span class="hljs-string">'Value'</span>, <span class="hljs-string">'Percentage'</span>]
grouped[<span class="hljs-string">'Feature'</span>] = feature
results = pd.concat([results, grouped], axis=<span class="hljs-number">0</span>)
<span class="hljs-comment"># Sort the results by percentage in descending order and get the top 10</span>
results = results.sort_values(by=<span class="hljs-string">'Percentage'</span>, ascending=<span class="hljs-keyword">False</span>).head(<span class="hljs-number">15</span>)
<span class="hljs-comment"># get the overall heart diease occurrence rate</span>
overall_rate = df[target].mean()
print(<span class="hljs-string">'Overall Rate'</span>,overall_rate)
<span class="hljs-comment"># calculate the difference between the feature value percentage and the overall rate</span>
results[<span class="hljs-string">'Difference'</span>] = results[<span class="hljs-string">'Percentage'</span>] - overall_rate
<span class="hljs-comment"># calculate the ratio of the difference to the overall rate</span>
results[<span class="hljs-string">'Ratio'</span>] = results[<span class="hljs-string">'Difference'</span>] / overall_rate
<span class="hljs-comment"># calculate the risk of heart disease occurrence for each feature value</span>
results[<span class="hljs-string">'Risk'</span>] = results[<span class="hljs-string">'Percentage'</span>] / overall_rate
<span class="hljs-comment"># sort the results by ratio in descending order</span>
results = results.sort_values(by=<span class="hljs-string">'Risk'</span>, ascending=<span class="hljs-keyword">False</span>)
print(results)
<span class="hljs-comment"># Visualize the rankings (e.g., create a bar plot)</span>
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
sns.barplot(data=results, x=<span class="hljs-string">'Percentage'</span>, y=<span class="hljs-string">'Value'</span>, hue=<span class="hljs-string">'Feature'</span>)
plt.xlabel(<span class="hljs-string">'Percentage of Heart Disease Occurrences'</span>)
plt.ylabel(<span class="hljs-string">'Feature Value'</span>)
plt.title(<span class="hljs-string">'Top 15 Ranking of Feature Values by Heart Disease Occurrence'</span>)
plt.show()
</code></pre>
<pre><code class="lang-bash">
Overall Rate <span class="hljs-number">0.09035</span>
Feature Value Percentage Difference Ratio Risk
<span class="hljs-number">65</span> bmi <span class="hljs-number">77</span> <span class="hljs-number">0.400000</span> <span class="hljs-number">0.309647</span> <span class="hljs-number">3.427086</span> <span class="hljs-number">4.427086</span>
<span class="hljs-number">1</span> stroke <span class="hljs-number">1</span> <span class="hljs-number">0.363810</span> <span class="hljs-number">0.273457</span> <span class="hljs-number">3.026542</span> <span class="hljs-number">4.026542</span>
<span class="hljs-number">3</span> genhealth Poor <span class="hljs-number">0.341131</span> <span class="hljs-number">0.250778</span> <span class="hljs-number">2.775537</span> <span class="hljs-number">3.775537</span>
<span class="hljs-number">68</span> bmi <span class="hljs-number">80</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">18</span> sleeptime <span class="hljs-number">19</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">71</span> bmi <span class="hljs-number">83</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">21</span> sleeptime <span class="hljs-number">22</span> <span class="hljs-number">0.333333</span> <span class="hljs-number">0.242980</span> <span class="hljs-number">2.689239</span> <span class="hljs-number">3.689239</span>
<span class="hljs-number">1</span> kidneydisease <span class="hljs-number">1</span> <span class="hljs-number">0.293308</span> <span class="hljs-number">0.202956</span> <span class="hljs-number">2.246254</span> <span class="hljs-number">3.246254</span>
<span class="hljs-number">29</span> physicalhealth <span class="hljs-number">29</span> <span class="hljs-number">0.289216</span> <span class="hljs-number">0.198863</span> <span class="hljs-number">2.200957</span> <span class="hljs-number">3.200957</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-feature-analysis.png" alt="Heart Disease Feature Importance"></p>
<ol>
<li><p><code>Overall Rate</code>: This is the overall rate of heart disease occurrence in the dataset. It represents the proportion of individuals with heart disease (target='Yes') in the dataset. For example, if the overall rate is 0.2, it means that 20% of the individuals in the dataset have heart disease.</p>
</li>
<li><p><code>Difference</code>: This value represents the difference between the percentage of heart disease occurrence for a specific feature value and the overall rate. It tells us how much more or less likely individuals with a particular feature value are to have heart disease compared to the overall population. A positive difference indicates a higher likelihood, while a negative difference indicates a lower likelihood.</p>
</li>
<li><p><code>Ratio</code>: The ratio represents the difference relative to the overall rate. It quantifies how much the heart disease occurrence for a specific feature value deviates from the overall rate, considering the overall rate as the baseline. A ratio greater than 1 indicates a higher risk compared to the overall population, while a ratio less than 1 indicates a lower risk.</p>
</li>
<li><p><code>Risk</code>: This metric directly quantifies the likelihood of an event happening for a specific feature value, expressed as a percentage. It's easier to interpret as it directly answers the question: "What is the likelihood of heart disease for individuals with this feature value?"</p>
</li>
</ol>
<p>These values help us understand the relationship between different features and heart disease. Positive differences, ratios greater than 1, and risk values greater than 100% suggest a higher risk associated with a particular feature value, while negative differences, ratios less than 1, and risk values less than 100% suggest a lower risk. This information can be used to identify factors that may increase or decrease the risk of heart disease within the dataset.</p>
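<p>In case it helps to see how such a table can be derived, here is a minimal pandas sketch, not necessarily the exact code used in this post. It assumes the processed dataframe <code>df</code> and the 'Yes'/'No' target encoding mentioned above; the column and function names are illustrative.</p>
<pre><code class="lang-python"># Illustrative sketch: compute rate, difference, ratio, and risk per feature value
import pandas as pd

def feature_value_risk(df: pd.DataFrame, feature: str, target: str = 'heartdisease') -> pd.DataFrame:
    overall_rate = (df[target] == 'Yes').mean()                 # overall rate of heart disease
    rate = df.groupby(feature)[target].apply(lambda s: (s == 'Yes').mean())
    out = pd.DataFrame({'feature': feature, 'value': rate.index, 'rate': rate.values})
    out['difference'] = out['rate'] - overall_rate              # above or below the overall rate
    out['ratio'] = out['difference'] / overall_rate             # deviation relative to the baseline
    out['risk'] = out['rate'] / overall_rate                    # likelihood relative to the population
    return out.sort_values('risk', ascending=False)

# Example: risk profile for the stroke feature (column name assumed)
# print(feature_value_risk(df, 'stroke'))
</code></pre>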
<h4 id="mutual-information-score">Mutual Information Score</h4>
<p>The mutual information score measures the dependency between a feature and the target variable. Higher scores indicate stronger dependency, while lower scores indicate weaker dependency. A higher score suggests that the feature is more informative when predicting the target variable.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Compute mutual information scores for each feature</span>
X = df[cat_features]
y = df[target]
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">mutual_info_heart_disease_score</span><span class="hljs-params">(series)</span>:</span>
<span class="hljs-keyword">return</span> mutual_info_score(series, y)
mi_scores = X.apply(mutual_info_heart_disease_score)
mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=<span class="hljs-keyword">False</span>)
print(mi_ranking)
<span class="hljs-comment"># Visualize the rankings</span>
plt.figure(figsize=(<span class="hljs-number">12</span>, <span class="hljs-number">6</span>))
sns.barplot(x=mi_ranking.values, y=mi_ranking.index)
plt.xlabel(<span class="hljs-string">'Mutual Information Scores'</span>)
plt.ylabel(<span class="hljs-string">'Feature'</span>)
plt.title(<span class="hljs-string">'Feature Importance Ranking via Mutual Information Scores'</span>)
</code></pre>
<pre><code class="lang-bash"><span class="hljs-selector-tag">agecategory</span> 0<span class="hljs-selector-class">.033523</span>
<span class="hljs-selector-tag">genhealth</span> 0<span class="hljs-selector-class">.027151</span>
<span class="hljs-selector-tag">diabetic</span> 0<span class="hljs-selector-class">.012960</span>
<span class="hljs-selector-tag">sex</span> 0<span class="hljs-selector-class">.002771</span>
<span class="hljs-selector-tag">race</span> 0<span class="hljs-selector-class">.001976</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-feature-importance.png" alt="Heart Disease Feature Importance"></p>
<h2 id="machine-learning-training-and-model-selection">Machine Learning Training and Model Selection</h2>
<ul>
<li>Load the data/2020/heart_2020_processed.csv</li>
<li>Process the features<ul>
<li>Set the categorical feature names</li>
<li>Set the numeric feature names</li>
<li>Set the target variable</li>
</ul>
</li>
<li>Split the data<ul>
<li>train/validation/test split with 60%/20%/20% distribution.</li>
<li>random_state=42</li>
<li>Use stratify=y to deal with the class imbalance problem</li>
</ul>
</li>
<li>Train the model<ul>
<li>LogisticRegression</li>
<li>RandomForestClassifier</li>
<li>XGBClassifier</li>
<li>DecisionTreeClassifier</li>
</ul>
</li>
<li>Evaluate the models and compare them<ul>
<li>accuracy_score</li>
<li>precision_score</li>
<li>recall_score</li>
<li>f1_score</li>
</ul>
</li>
<li>Confusion Matrix</li>
</ul>
<h3 id="data-split">Data Split</h3>
<ul>
<li>Use a 60/20/20 distribution for train/val/test</li>
<li>random_state=42 to shuffle the data</li>
<li>Use stratify=y when there is a class imbalance in the dataset. It helps ensure that the class distribution in both the training and validation (or test) sets closely resembles the original dataset's class distribution</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">split_data</span><span class="hljs-params">(self, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)</span>:</span>
<span class="hljs-string">"""
Split the data into training and validation sets
"""</span>
<span class="hljs-comment"># split the data in train/val/test sets, with 60%/20%/20% distribution with seed 1</span>
X = self.df[self.all_features]
y = self.df[self.target_variable]
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
<span class="hljs-comment"># .25 splits the 80% train into 60% train and 20% val</span>
X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=<span class="hljs-number">0.25</span>, random_state=random_state)
X_train = X_train.reset_index(drop=<span class="hljs-keyword">True</span>)
X_val = X_val.reset_index(drop=<span class="hljs-keyword">True</span>)
y_train = y_train.reset_index(drop=<span class="hljs-keyword">True</span>)
y_val = y_val.reset_index(drop=<span class="hljs-keyword">True</span>)
X_test = X_test.reset_index(drop=<span class="hljs-keyword">True</span>)
y_test = y_test.reset_index(drop=<span class="hljs-keyword">True</span>)
<span class="hljs-comment"># print the shape of all the data splits</span>
print(<span class="hljs-string">'X_train shape'</span>, X_train.shape)
print(<span class="hljs-string">'X_val shape'</span>, X_val.shape)
print(<span class="hljs-string">'X_test shape'</span>, X_test.shape)
print(<span class="hljs-string">'y_train shape'</span>, y_train.shape)
print(<span class="hljs-string">'y_val shape'</span>, y_val.shape)
print(<span class="hljs-string">'y_test shape'</span>, y_test.shape)
<span class="hljs-keyword">return</span> X_train, X_val, y_train, y_val, X_test, y_test
X_train, X_val, y_train, y_val, X_test, y_test = train_data.split_data(test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
</code></pre>
<p>The <code>split_data</code> call is a method that splits a dataset into training, validation, and test sets. Here's a breakdown of the returned values:</p>
<ul>
<li><p><strong>X_train:</strong> This represents the features (input variables) of the training set. The model will be trained on this data.</p>
</li>
<li><p><strong>y_train:</strong> This corresponds to the labels (output variable) for the training set. It contains the correct outcomes corresponding to the features in <code>X_train</code>.</p>
</li>
<li><p><strong>X_val:</strong> These are the features of the validation set. The model's performance is often assessed on this set during training to ensure it generalizes well to new, unseen data.</p>
</li>
<li><p><strong>y_val:</strong> These are the labels for the validation set. They serve as the correct outcomes for the features in <code>X_val</code> during the evaluation of the model's performance.</p>
</li>
<li><p><strong>X_test:</strong> These are the features of the test set. The model's final evaluation is typically done on this set to assess its performance on completely unseen data.</p>
</li>
<li><p><strong>y_test:</strong> Similar to <code>y_val</code>, this contains the labels for the test set. It represents the correct outcomes for the features in <code>X_test</code> during the final evaluation of the model.</p>
</li>
</ul>
<h4 id="model-training">Model Training</h4>
<p>For model training, we first pre-process the data by taking these steps:</p>
<ul>
<li><code>preprocess_data</code> <ul>
<li>The input features X are converted to a dictionary format using the to_dict method with the orientation set to <code>records</code>. This is a common step when working with scikit-learn transformers, as they often expect input data in this format. </li>
<li>If is_training is True, it fits a transformer (self.encoder) on the data using the fit_transform method. If False, it transforms the data using the previously fitted transformer (self.encoder.transform). The standardized features are then returned.</li>
</ul>
</li>
</ul>
<p>We then train the different models:</p>
<ul>
<li><code>train</code><ul>
<li>This method takes X_train (training features) and y_train (training labels) as parameters.</li>
<li>If the models attribute of the class is None, it initializes a dictionary of machine learning models, including logistic regression, random forest, XGBoost, and decision tree classifiers.</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">preprocess_data</span><span class="hljs-params">(self, X, is_training=True)</span>:</span>
<span class="hljs-string">"""
Preprocess the data for training or validation
"""</span>
X_dict = X.to_dict(orient=<span class="hljs-string">'records'</span>)
<span class="hljs-keyword">if</span> is_training:
X_std = self.encoder.fit_transform(X_dict)
<span class="hljs-keyword">else</span>:
X_std = self.encoder.transform(X_dict)
<span class="hljs-comment"># Return the standardized features and target variable</span>
<span class="hljs-keyword">return</span> X_std
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span><span class="hljs-params">(self, X_train, y_train)</span>:</span>
<span class="hljs-keyword">if</span> self.models <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span>:
self.models = {
<span class="hljs-string">'logistic_regression'</span>: LogisticRegression(C=<span class="hljs-number">10</span>, max_iter=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">42</span>),
<span class="hljs-string">'random_forest'</span>: RandomForestClassifier(n_estimators=<span class="hljs-number">100</span>, max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>, n_jobs=<span class="hljs-number">-1</span>),
<span class="hljs-string">'xgboost'</span>: XGBClassifier(n_estimators=<span class="hljs-number">100</span>, max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>, n_jobs=<span class="hljs-number">-1</span>),
<span class="hljs-string">'decision_tree'</span>: DecisionTreeClassifier(max_depth=<span class="hljs-number">5</span>, random_state=<span class="hljs-number">42</span>)
}
<span class="hljs-keyword">for</span> model <span class="hljs-keyword">in</span> self.models.keys():
print(<span class="hljs-string">'Training model'</span>, model)
self.models[model].fit(X_train, y_train)
<span class="hljs-comment"># hot encode the categorical features for the train data</span>
model_factory = HeartDiseaseModelFactory(cat_features, num_features)
X_train_std = model_factory.preprocess_data(X_train[cat_features + num_features], <span class="hljs-keyword">True</span>)
<span class="hljs-comment"># hot encode the categorical features for the validation data</span>
X_val_std = model_factory.preprocess_data(X_val[cat_features + num_features], <span class="hljs-keyword">False</span>)
<span class="hljs-comment"># Train the model</span>
model_factory.train(X_train_std, y_train)
</code></pre>
<h4 id="model-evaluation">Model Evaluation</h4>
<p>For the model evaluation, we calculate the following metrics. Because the dataset is heavily imbalanced (far more "No" than "Yes" cases), accuracy alone can be misleading, so precision, recall, and F1 matter just as much:</p>
<ol>
<li><p><strong>Accuracy</strong> tells us how often your model is correct. It's the percentage of all predictions that are accurate. For example, an accuracy of 92% is great, while 70% is not good.</p>
</li>
<li><p><strong>Precision</strong> is about being precise, not making many mistakes. It's the percentage of positive predictions that were actually correct. For instance, a precision of 90% is great, while 50% is not good.</p>
</li>
<li><p><strong>Recall</strong> is about not missing any positive instances. It's the percentage of actual positives that were correctly predicted. A recall of 85% is great, while 30% is not good.</p>
</li>
<li><p><strong>F1 Score</strong> is a balance between precision and recall. It's like having the best of both worlds. For example, an F1 score of 80% is great, while 45% is not good.</p>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate</span><span class="hljs-params">(self, X_val, y_val, threshold=<span class="hljs-number">0.5</span>)</span>:</span>
<span class="hljs-string">"""
Evaluate the model on the validation data set and return the predictions
"""</span>
<span class="hljs-comment"># create a dataframe to store the metrics</span>
df_metrics = pd.DataFrame(columns=[<span class="hljs-string">'model'</span>, <span class="hljs-string">'accuracy'</span>, <span class="hljs-string">'precision'</span>, <span class="hljs-string">'recall'</span>, <span class="hljs-string">'f1'</span>, <span class="hljs-string">'y_pred'</span>])
<span class="hljs-comment"># define the metrics to be calculated</span>
fn_metrics = { <span class="hljs-string">'accuracy'</span>: accuracy_score,<span class="hljs-string">'precision'</span>: precision_score,<span class="hljs-string">'recall'</span>: recall_score,<span class="hljs-string">'f1'</span>: f1_score}
<span class="hljs-comment"># loop through the models and get its metrics</span>
<span class="hljs-keyword">for</span> model_name <span class="hljs-keyword">in</span> self.models.keys():
model = self.models[model_name]
<span class="hljs-comment"># The first column (y_pred_proba[:, 0]) is for class 0 ("N")</span>
<span class="hljs-comment"># The second column (y_pred_proba[:, 1]) is for class 1 ("Y") </span>
y_pred = model.predict_proba(X_val)[:,<span class="hljs-number">1</span>]
<span class="hljs-comment"># get the binary predictions</span>
y_pred_binary = np.where(y_pred > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
<span class="hljs-comment"># add a new row to the dataframe for each model </span>
df_metrics.loc[len(df_metrics)] = [model_name, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, y_pred_binary]
<span class="hljs-comment"># get the row index</span>
row_index = len(df_metrics)<span class="hljs-number">-1</span>
<span class="hljs-comment"># Evaluate the model metrics</span>
<span class="hljs-keyword">for</span> metric <span class="hljs-keyword">in</span> fn_metrics.keys():
score = fn_metrics[metric](y_val, y_pred_binary)
df_metrics.at[row_index,metric] = score
<span class="hljs-keyword">return</span> df_metrics
</code></pre>
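<p>A hypothetical call, consistent with the snippets above, shows how this evaluation could be run against the preprocessed validation set to produce the metrics summarized next:</p>
<pre><code class="lang-python"># Hypothetical usage, continuing from the training step above
df_metrics = model_factory.evaluate(X_val_std, y_val, threshold=0.5)
print(df_metrics[['model', 'accuracy', 'precision', 'recall', 'f1']])
</code></pre>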
<p><strong>Model Performance Metrics:</strong></p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Accuracy</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logistic Regression</td>
<td>0.9097</td>
<td>0.509</td>
<td>0.0987</td>
<td>0.1654</td>
</tr>
<tr>
<td>Random Forest</td>
<td>0.9095</td>
<td>0.6957</td>
<td>0.0029</td>
<td>0.0058</td>
</tr>
<tr>
<td>XGBoost</td>
<td>0.9099</td>
<td>0.5154</td>
<td>0.098</td>
<td>0.1647</td>
</tr>
<tr>
<td>Decision Tree</td>
<td>0.9097</td>
<td>0.5197</td>
<td>0.0556</td>
<td>0.1004</td>
</tr>
</tbody>
</table>
<p>These metrics provide insights into the performance of each model, helping us understand their strengths and areas for improvement. </p>
<p><strong>Analysis:</strong></p>
<ul>
<li><p>XGBoost Model:</p>
<ul>
<li>Accuracy: 90.99%</li>
<li>Precision: 51.54%</li>
<li>Recall: 9.80%</li>
<li>F1 Score: 16.47%</li>
</ul>
</li>
<li><p>Decision Tree Model:</p>
<ul>
<li>Accuracy: 90.97%</li>
<li>Precision: 51.97%</li>
<li>Recall: 5.56%</li>
<li>F1 Score: 10.04%</li>
</ul>
</li>
<li><p>Logistic Regression Model:</p>
<ul>
<li>Accuracy: 90.97%</li>
<li>Precision: 50.90%</li>
<li>Recall: 9.87%</li>
<li>F1 Score: 16.54%</li>
</ul>
</li>
<li><p>Random Forest Model:</p>
<ul>
<li>Accuracy: 90.95%</li>
<li>Precision: 69.57%</li>
<li>Recall: 0.29%</li>
<li>F1 Score: 0.58%</li>
</ul>
</li>
</ul>
<ul>
<li><p>XGBoost Model has a relatively balanced precision and recall, indicating it's better at identifying true positives while keeping false positives in check.</p>
</li>
<li><p>Decision Tree Model has a lower recall than XGBoost and Logistic Regression, suggesting that it misses more positive cases.</p>
</li>
<li><p>Logistic Regression Model has a good balance of precision and recall similar to the XGBoost Model.</p>
</li>
<li><p>Random Forest Model has high precision but an extremely low recall, meaning it's cautious in predicting positive cases but may miss many of them.</p>
</li>
</ul>
<p>Based on this analysis, we will choose XGBoost as our API model.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-model-evaluation.png" alt="Heart Disease Model Evaluation"></p>
<p><strong>Confusion Matrix:</strong></p>
<p>The confusion matrix is a valuable tool for evaluating the performance of classification models, especially for a binary classification problem like predicting heart disease (where the target variable has two classes: 0 for "No" and 1 for "Yes"). Let's analyze what the confusion matrix represents for heart disease prediction using the four models.</p>
<p>For this analysis, we'll consider the following terms:</p>
<ul>
<li><p>True Positives (TP): The model correctly predicted "Yes" (heart disease) when the actual label was also "Yes."</p>
</li>
<li><p>True Negatives (TN): The model correctly predicted "No" (no heart disease) when the actual label was also "No."</p>
</li>
<li><p>False Positives (FP): The model incorrectly predicted "Yes" when the actual label was "No." (Type I error)</p>
</li>
<li><p>False Negatives (FN): The model incorrectly predicted "No" when the actual label was "Yes." (Type II error)</p>
</li>
</ul>
<pre><code class="lang-python">from sklearn.metrics import confusion_matrix
import seaborn <span class="hljs-keyword">as</span> sns
import matplotlib.pyplot <span class="hljs-keyword">as</span> plt
cms = []
model_names = []
total_samples = []
<span class="hljs-keyword">for</span> model_name in df_metrics[<span class="hljs-string">'model'</span>]:
model_y_pred = df_metrics[df_metrics[<span class="hljs-string">'model'</span>] == model_name][<span class="hljs-string">'y_pred'</span>].iloc[<span class="hljs-number">0</span>]
# Compute the confusion matrix
<span class="hljs-keyword">cm</span> = confusion_matrix(y_val, model_y_pred)
cms.<span class="hljs-keyword">append</span>(<span class="hljs-keyword">cm</span>)
model_names.<span class="hljs-keyword">append</span>(model_name)
total_samples.<span class="hljs-keyword">append</span>(np.sum(<span class="hljs-keyword">cm</span>))
# Create <span class="hljs-keyword">a</span> <span class="hljs-number">2</span>x2 grid of subplots
fig, axes = plt.subplots(<span class="hljs-number">2</span>, <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">10</span>, <span class="hljs-number">10</span>))
# Loop through the subplots <span class="hljs-built_in">and</span> plot the confusion matrices
<span class="hljs-keyword">for</span> i, ax in enumerate(axes.flat):
<span class="hljs-keyword">cm</span> = cms[i]
<span class="hljs-keyword">im</span> = ax.imshow(<span class="hljs-keyword">cm</span>, interpolation=<span class="hljs-string">'nearest'</span>, <span class="hljs-keyword">cmap</span>=plt.<span class="hljs-keyword">cm</span>.Blues)
ax.figure.colorbar(<span class="hljs-keyword">im</span>, ax=ax, shrink=<span class="hljs-number">0.6</span>)
# Set labels, title, <span class="hljs-built_in">and</span> value in the <span class="hljs-keyword">center</span> of the heatmap
ax.<span class="hljs-keyword">set</span>(xticks=np.arange(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">1</span>]), yticks=np.arange(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">0</span>]),
xticklabels=[<span class="hljs-string">"No Heart Disease"</span>, <span class="hljs-string">"Heart Disease"</span>], yticklabels=[<span class="hljs-string">"No Heart Disease"</span>, <span class="hljs-string">"Heart Disease"</span>],
title=<span class="hljs-keyword">f</span><span class="hljs-string">'{model_names[i]} (n={total_samples[i]})\n'</span>)
# Loop <span class="hljs-keyword">to</span> annotate each quadrant with its <span class="hljs-built_in">count</span>
<span class="hljs-keyword">for</span> i in <span class="hljs-built_in">range</span>(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">0</span>]):
<span class="hljs-keyword">for</span> <span class="hljs-keyword">j</span> in <span class="hljs-built_in">range</span>(<span class="hljs-keyword">cm</span>.shape[<span class="hljs-number">1</span>]):
ax.text(<span class="hljs-keyword">j</span>, i, str(<span class="hljs-keyword">cm</span>[i, <span class="hljs-keyword">j</span>]), <span class="hljs-keyword">ha</span>=<span class="hljs-string">"center"</span>, va=<span class="hljs-string">"center"</span>, color=<span class="hljs-string">"gray"</span>)
ax.title.set_fontsize(<span class="hljs-number">12</span>)
ax.set_xlabel(<span class="hljs-string">'Predicted'</span>, fontsize=<span class="hljs-number">10</span>)
ax.set_ylabel(<span class="hljs-string">'Actual'</span>, fontsize=<span class="hljs-number">10</span>)
ax.xaxis.set_label_position(<span class="hljs-string">'top'</span>)
# Adjust the layout
plt.tight_layout()
</code></pre>
<p>Let's examine the confusion matrices for each model:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ml-heart-disease-model-confusion-matrix.png" alt="Heart Disease Model Confusion Matrix"></p>
<ul>
<li><p><strong>XGBoost</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 536</li>
<li>True Negatives (TN): 54,370</li>
<li>False Positives (FP): 504</li>
<li>False Negatives (FN): 4,934</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Decision Tree</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 304</li>
<li>True Negatives (TN): 54,593</li>
<li>False Positives (FP): 281</li>
<li>False Negatives (FN): 5,166</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Logistic Regression</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 540</li>
<li>True Negatives (TN): 54,353</li>
<li>False Positives (FP): 521</li>
<li>False Negatives (FN): 4,930</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Random Forest</strong>:</p>
<ul>
<li>Total Samples: 60,344</li>
<li>Confusion Matrix Total:<ul>
<li>True Positives (TP): 16</li>
<li>True Negatives (TN): 54,867</li>
<li>False Positives (FP): 7</li>
<li>False Negatives (FN): 5,454</li>
</ul>
</li>
</ul>
</li>
</ul>
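<p>As a quick cross-check, the precision and recall reported in the metrics table can be reproduced directly from these counts. A small sketch using the XGBoost totals:</p>
<pre><code class="lang-python"># Sanity check: recompute precision, recall, and F1 from the XGBoost confusion-matrix counts above
tp, fp, fn = 536, 504, 4934
precision = tp / (tp + fp)                           # 536 / 1040  ~ 0.515
recall = tp / (tp + fn)                              # 536 / 5470  ~ 0.098
f1 = 2 * precision * recall / (precision + recall)   # ~ 0.165
print(round(precision, 4), round(recall, 4), round(f1, 4))
</code></pre>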
<p><strong>XGBoost</strong>:</p>
<ul>
<li>This model achieved a relatively high number of True Positives (TP) with 536 cases correctly predicted as having heart disease.</li>
<li>It also had a significant number of True Negatives (TN), indicating correct predictions of no heart disease (54,370).</li>
<li>However, there were 504 False Positives (FP), where it incorrectly predicted heart disease.</li>
<li>It had 4,934 False Negatives (FN), suggesting instances where actual heart disease cases were incorrectly predicted as non-disease.</li>
</ul>
<p><strong>Decision Tree</strong>:</p>
<ul>
<li>The Decision Tree model achieved 304 True Positives (TP), correctly identifying heart disease cases.</li>
<li>It also had 54,593 True Negatives (TN), showing accurate predictions of no heart disease.</li>
<li>There were 281 False Positives (FP), indicating instances where the model incorrectly predicted heart disease.</li>
<li>It had 5,166 False Negatives (FN), meaning it missed identifying heart disease in these cases.</li>
</ul>
<p><strong>Logistic Regression</strong>:</p>
<ul>
<li>The Logistic Regression model achieved 540 True Positives (TP), correctly identifying cases with heart disease.</li>
<li>It had a high number of True Negatives (TN) with 54,353 correctly predicted non-disease cases.</li>
<li>However, there were 521 False Positives (FP), where the model incorrectly predicted heart disease.</li>
<li>It also had 4,930 False Negatives (FN), indicating missed predictions of heart disease.</li>
</ul>
<p><strong>Random Forest</strong>:</p>
<ul>
<li>The Random Forest model achieved a relatively low number of True Positives (TP) with 16 cases correctly predicted as having heart disease.</li>
<li>It had a high number of True Negatives (TN) with 54,867 correctly predicted non-disease cases.</li>
<li>There were only 7 False Positives (FP), suggesting rare incorrect predictions of heart disease.</li>
<li>However, it also had 5,454 False Negatives (FN), indicating a substantial number of missed predictions of heart disease.</li>
</ul>
<p>All models achieved a good number of True Negatives, suggesting their ability to correctly predict non-disease cases. However, there were variations in True Positives, False Positives, and False Negatives. The XGBoost model achieved the highest True Positives but also had a significant number of False Positives. The Decision Tree and Logistic Regression models showed similar TP and FP counts, while the Random Forest model had the lowest TP count. The trade-off between these metrics is essential for assessing the model's performance in detecting heart disease accurately.</p>
<h3 id="summary">Summary</h3>
<p>In the quest to find the best solution for predicting heart disease, it's crucial to evaluate various models. However, it's not just about picking a model and hoping for the best. We need to be mindful of class imbalances: situations where one group has more examples than the other. This imbalance can throw our predictions off balance.</p>
<p>To fine-tune our models, we also need to adjust the hyperparameters. Think of it as finding the right settings to make our models perform better. By addressing class imbalances and tweaking those hyperparameters, we help our models perform more accurately.</p>
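<p>As an illustration only (this tuning is not part of the notebook shown here), class weighting and a small grid search are two common ways to act on both points; the parameter values below are assumptions, and the positive/negative ratio of roughly 1:10 is an approximation for this dataset:</p>
<pre><code class="lang-python"># Illustrative only: address class imbalance and tune hyperparameters for the XGBoost model.
# scale_pos_weight ~ negative/positive ratio (~91%/9% in this dataset) compensates for the imbalance.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

xgb = XGBClassifier(scale_pos_weight=10, random_state=42, n_jobs=-1)
param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1]
}
search = GridSearchCV(xgb, param_grid, scoring='f1', cv=3)
search.fit(X_train_std, y_train)
print(search.best_params_, search.best_score_)
</code></pre>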
<p>By using the correct data features and evaluating the performance of our models, we can build solutions that could assist healthcare professionals in evaluating an individual's risk profile for heart disease.</p>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary
Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-1594694289349610682023-11-29T14:20:00.031-05:002023-12-02T15:59:34.662-05:00Data Engineering Process Fundamentals - An introduction to Data Analysis and Visualization<p>In this technical presentation, we will delve into the fundamental concepts of Data Engineering in the areas of data analysis and visualization. We focus on these areas by using both a code-centric and low-code approach. </p>
<img style="display:none;" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization.png"/>
<ul>
<li>Follow this GitHub repo during the presentation: (Give it a star)</li>
</ul>
<p><strong><a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></strong></p>
<ul>
<li>Read more information on my blog at: </li>
</ul>
<p><strong><a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></strong></p>
<h3 id="presentation">Presentation</h3>
<p>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQAtz3RxLOg4Y6Ya4_Wj5E3MnsRDZL_nXC9rsbWFlNRO7-REKLd0UImTG8Y-7a_KZuruKOR8pv_Rnmy/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="299" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
</p>
<h3 id="youtube-video">YouTube Video</h3>
<p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CH4ilG0ztQI?si=cuDAzD2eZDlWLfCs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</p>
<h2 id="section-1-data-analysis-essentials">Section 1: Data Analysis Essentials</h2>
<p>Data Analysis: Explore the fundamentals of data analysis using Python, unraveling the capabilities of libraries such as Pandas and NumPy. Learn how Jupyter Notebooks provide an interactive environment for data exploration, analysis, and visualization.</p>
<p>Data Profiling: With Python at our fingertips, discover how Jupyter Notebooks aid in data profiling: understanding data structures, quality, and characteristics. Witness the seamless integration with tools like pandas-profiling for comprehensive data insights.</p>
<p>Cleaning and Preprocessing: Dive into the world of data cleaning and preprocessing with Python's Pandas library, facilitated by the user-friendly environment of Jupyter Notebooks. See how Visual Studio Code enhances the coding experience for efficient data preparation.</p>
<h2 id="section-2-statistical-analysis-vs-business-intelligence">Section 2: Statistical Analysis vs. Business Intelligence</h2>
<p>Statistical Analysis: Embrace Python's statistical libraries, such as SciPy and StatsModels, within the Jupyter environment. Witness the power of statistical analysis for extracting patterns and correlations from data, all seamlessly integrated into your workflow with Visual Studio Code.</p>
<p>Business Intelligence: Contrast statistical analysis with the broader field of business intelligence, emphasizing the role of Python in data transformation. Utilize Jupyter Notebooks to showcase how Python's versatility extends to business intelligence applications.</p>
<h2 id="section-3-the-power-of-data-visualization">Section 3: The Power of Data Visualization</h2>
<p>Importance of Data Visualization: Unlock the potential of Python's visualization libraries, such as Matplotlib and Seaborn, within the interactive canvas of Jupyter Notebooks. Visual Studio Code complements this process, providing a robust coding environment for creating captivating visualizations.</p>
<p>Introduction to Tools: While exploring the importance of data visualization, we also introduce powerful visualization tools like Power BI, Looker, and Tableau, and show how integrating them elevates your data storytelling capabilities.</p>
<p><strong>Conclusion:</strong></p>
<p>This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data analysis and visualization. By the end of this presentation, participants will grasp how to effectively apply these practices and be ready to start their own journey into data analysis and visualization.</p>
<p>This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.</p>
<p><strong>Some of the technologies that we will be covering:</strong></p>
<ul>
<li>Data Analysis</li>
<li>Data Visualization</li>
<li>Python</li>
<li>Jupyter Notebook</li>
<li>Looker</li>
</ul>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary
Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-62297736000891663452023-10-25T16:00:00.035-04:002023-12-02T15:59:10.639-05:00Data Engineering Process Fundamentals - Unveiling the Power of Data Lakes and Data Warehouses<div><br /></div><div><div>In this technical presentation, we will delve into the fundamental concepts of Data Engineering, focusing on two pivotal components of modern data architecture - Data Lakes and Data Warehouses. We will explore their roles, differences, and how they collectively empower organizations to harness the true potential of their data.</div>
<div class="separator" style="clear: both; display: none;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAtABAISerDHv8-vFolaFt-EvSEtRZZxnQCWC98S99Ks2wYnGfiQ5rgtc6wQkyUqtbofR3JyhWNoOcFtOmeYfqoI6q34mueRJdejBTrP1C6BjoM9kc1joRn-vMbE5UQ6IXbA_9BvLEH31tp2inChmMCFwxrm9wBCUuA7rW477lBHgq7i_PiMMJPfgFHHRw/s1171/ozkary-gdg-intro-data-lake-warehouse.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="" border="0" data-original-height="538" data-original-width="1171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAtABAISerDHv8-vFolaFt-EvSEtRZZxnQCWC98S99Ks2wYnGfiQ5rgtc6wQkyUqtbofR3JyhWNoOcFtOmeYfqoI6q34mueRJdejBTrP1C6BjoM9kc1joRn-vMbE5UQ6IXbA_9BvLEH31tp2inChmMCFwxrm9wBCUuA7rW477lBHgq7i_PiMMJPfgFHHRw/s400/ozkary-gdg-intro-data-lake-warehouse.png" width="400" /></a></div>
<div><br /></div><div>- Follow this GitHub repo during the presentation: (Give it a star)</div><div><br /></div><div><b>
<a href="https://github.com/ozkary/data-engineering-mta-turnstile">https://github.com/ozkary/data-engineering-mta-turnstile</a></b></div><div><br /></div><div>- Read more information on my blog at:</div><div><br /></div><div><b><a href="https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html">https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html</a></b></div><div><br /></div><div><br /></div>
<h3>Presentation</h3>
<iframe allowfullscreen="true" frameborder="0" height="315" mozallowfullscreen="true" src="https://docs.google.com/presentation/d/e/2PACX-1vTnP_fjlAzbcMkTBto-wviVtWSi-9xPXQ_b9KXkKsN_Ut82Xi17TizYB6UPfU7mubjDOPr0vDex9Fe6/embed?start=false&loop=false&delayms=5000" webkitallowfullscreen="true" width="560"></iframe>
<br />
<h3>YouTube Video</h3>
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/yYOSMmwGEtE?si=bKtz0uIilMXka1Mt" title="Data Engineering Process Fundamentals - Unveiling the Power of Data Lakes and Data Warehouses" width="560"></iframe>
<div><br /></div><div>1. Introduction to Data Engineering:</div><div>- Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.</div><div><br /></div><div>2. Understanding Data Lakes:</div><div>- Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.</div><div><br /></div><div>3. Exploring Data Warehouses:</div><div>- Definition of data warehouses and their role in storing structured, processed, and business-ready data.</div><div><br /></div><div>4. Comparing Data Lakes and Data Warehouses:</div><div>- Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses.</div><div>- Discussing when to use each based on specific use cases and business needs.</div><div><br /></div><div>5. Integration and Data Pipelines:</div><div>- Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline.</div><div>- Code walkthrough showcasing data movement and transformation between these two crucial components.</div><div><br /></div><div>6. Real-world Use Cases:</div><div>- Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success.</div><div>- Hands-on demonstration using Python, Jupyter Notebook and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.</div><div><br /></div><div><br /></div><div>Conclusion:</div><div><br /></div><div>This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.</div><div><br /></div><div>This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.</div><div><br /></div><div>Some of the technologies that we will be covering:</div><div><br /></div><div>- Data Lakes</div><div>- Data Warehouse</div><div>- Data Analysis and Visualization</div><div>- Python</div><div>- Jupyter Notebook</div><div>- SQL</div></div><div><br /></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-1464907359619805192023-10-15T10:54:00.025-04:002023-10-26T12:07:42.284-04:00AI with Python and Tensorflow - Convolutional Neural Networks Analysis<h2 id="convolutional-neural-network-cnn-">Convolutional neural network (CNN)</h2>
<p>Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision and image processing. These specialized deep learning models are inspired by the human visual system and excel at tasks like image classification, object detection, and facial recognition. </p>
<p>CNNs employ convolution operations, primarily used for processing images. The network initiates the analysis by applying filters that aim to extract valuable image features using various convolutional kernels. Similar to other weights in the neural network, these filters can be enhanced by adjusting their kernels based on the output error. After this, the resultant images undergo pooling, followed by pixel-wise input to a standard neural network in a process known as flattening.</p>
<p style="display:none"><img src="https://www.ozkary.dev/assets/2023/ozkary-ai-engineering-neural-network-analysis.png" alt="AI convolutional neural network - ozkary" title="AI Traffic Signs Classifier neural networks"></p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/OJjZGb7fmOo?si=vfbgKkpQtUTFN13x" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<h3 id="model-1">Model 1</h3>
<pre><code>Input (IMG_WIDTH, IMG_HEIGHT, <span class="hljs-number">3</span>)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">32</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Flatten
<span class="hljs-string">|</span>
Dense (<span class="hljs-number">128</span> nodes, ReLU)
<span class="hljs-string">|</span>
Dropout (<span class="hljs-number">50</span>%)
<span class="hljs-string">|</span>
Dense (NUM_CATEGORIES, softmax)
<span class="hljs-string">|</span>
Output (NUM_CATEGORIES)
</code></pre><ol>
<li><p>Input Layer (Conv2D):</p>
<ul>
<li>Type: Convolutional Layer (2D)</li>
<li>Number of Filters: 32</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: Rectified Linear Unit (ReLU)</li>
<li>Input Shape: (IMG_WIDTH, IMG_HEIGHT, 3) where 3 represents the color channels (RGB).</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Type: Max Pooling Layer (2D)</li>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels.</li>
</ul>
</li>
<li><p>Flatten Layer (Flatten):</p>
<ul>
<li>Type: Flattening Layer</li>
<li>Purpose: Converts the multidimensional input into a 1D array to feed into the Dense layer.</li>
</ul>
</li>
<li><p>Dense Hidden Layer (Dense):</p>
<ul>
<li>Number of Neurons: 128</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Learns and represents complex patterns in the data.</li>
</ul>
</li>
<li><p>Dropout Layer (Dropout):</p>
<ul>
<li>Rate: 0.5</li>
<li>Purpose: Helps prevent overfitting by randomly setting 50% of the inputs to zero during training.</li>
</ul>
</li>
<li><p>Output Layer (Dense):</p>
<ul>
<li>Number of Neurons: NUM_CATEGORIES (Number of categories for traffic signs)</li>
<li>Activation Function: Softmax</li>
<li>Purpose: Produces probabilities for each category, summing to 1, indicating the likelihood of the input image belonging to each category.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-attr">layers</span> = tf.keras.layers
<span class="hljs-comment"># Create a convolutional neural network</span>
<span class="hljs-attr">model</span> = tf.keras.models.Sequential([
<span class="hljs-comment"># Convolutional layer. Learn 32 filters using a 3x3 kernel</span>
layers.Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu',</span> <span class="hljs-attr">input_shape=(30,</span> <span class="hljs-number">30</span>, <span class="hljs-number">3</span>)),
<span class="hljs-comment"># Max-pooling layer, reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Converts the multidimensional input into a 1D array to feed into the Dense layer</span>
layers.Flatten(),
<span class="hljs-comment"># Dense Hidden Layer with 128 nodes and relu activation function to learns and represent complex patterns in the data</span>
layers.Dense(<span class="hljs-number">128</span>, <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Dropout layer to prevent overfitting by randomly setting 50% of the inputs to 0 at each update during training</span>
layers.Dropout(<span class="hljs-number">0.5</span>),
<span class="hljs-comment"># Output layer with NUM_CATEGORIES outputs and softmax activation function to return probability-like results</span>
layers.Dense(NUM_CATEGORIES, <span class="hljs-attr">activation='softmax')</span>
])
</code></pre>
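<p>The training script itself is not shown in this post. As a minimal sketch of how a model like this could be compiled and trained to produce the results below, assuming <code>x_train</code>, <code>y_train</code>, <code>x_test</code>, and <code>y_test</code> come from the project's data loading step:</p>
<pre><code class="lang-python"># Minimal sketch (assumed, not the project's exact training script)
EPOCHS = 10

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train for 10 epochs and then evaluate on the held-out test set
model.fit(x_train, y_train, epochs=EPOCHS)
model.evaluate(x_test, y_test, verbose=2)
</code></pre>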
<h3 id="model-2">Model 2</h3>
<pre><code>Input (IMG_WIDTH, IMG_HEIGHT, <span class="hljs-number">3</span>)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">32</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Conv2D (<span class="hljs-number">64</span> filters, <span class="hljs-number">3</span>x3 kernel, ReLU)
<span class="hljs-string">|</span>
MaxPooling2D (<span class="hljs-number">2</span>x2 pool size)
<span class="hljs-string">|</span>
Flatten
<span class="hljs-string">|</span>
Dense (<span class="hljs-number">128</span> nodes, ReLU)
<span class="hljs-string">|</span>
Dropout (<span class="hljs-number">50</span>%)
<span class="hljs-string">|</span>
Dense (NUM_CATEGORIES, softmax)
<span class="hljs-string">|</span>
Output (NUM_CATEGORIES)
</code></pre><ol>
<li><p>Input Layer (Conv2D):</p>
<ul>
<li>Type: Convolutional Layer (2D)</li>
<li>Number of Filters: 32</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: Rectified Linear Unit (ReLU)</li>
<li>Input Shape: (IMG_WIDTH, IMG_HEIGHT, 3) where 3 represents the color channels (RGB).</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Type: Max Pooling Layer (2D)</li>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels.</li>
</ul>
</li>
<li><p>Convolutional Layer (Conv2D):</p>
<ul>
<li>Number of Filters: 64</li>
<li>Filter Size: (3, 3)</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Extracts higher-level features from the input.</li>
</ul>
</li>
<li><p>Pooling Layer (MaxPooling2D):</p>
<ul>
<li>Pool Size: (2, 2)</li>
<li>Purpose: Further reduces spatial dimensions.</li>
</ul>
</li>
<li><p>Flatten Layer (Flatten):</p>
<ul>
<li>Type: Flattening Layer</li>
<li>Purpose: Converts the multidimensional input into a 1D array to feed into the Dense layer.</li>
</ul>
</li>
<li><p>Dense Hidden Layer (Dense):</p>
<ul>
<li>Number of Neurons: 128</li>
<li>Activation Function: ReLU</li>
<li>Purpose: Learns and represents complex patterns in the data.</li>
</ul>
</li>
<li><p>Dropout Layer (Dropout):</p>
<ul>
<li>Rate: 0.5</li>
<li>Purpose: Helps prevent overfitting by randomly setting 50% of the inputs to zero during training.</li>
</ul>
</li>
<li><p>Output Layer (Dense):</p>
<ul>
<li>Number of Neurons: NUM_CATEGORIES (Number of categories for traffic signs)</li>
<li>Activation Function: Softmax</li>
<li>Purpose: Produces probabilities for each category, summing to 1, indicating the likelihood of the input image belonging to each category.</li>
</ul>
</li>
</ol>
<pre><code class="lang-python">
<span class="hljs-attr">layers</span> = tf.keras.layers
<span class="hljs-comment"># Create a convolutional neural network</span>
<span class="hljs-attr">model</span> = tf.keras.models.Sequential([
<span class="hljs-comment"># Convolutional layer. Learn 32 filters using a 3x3 kernel</span>
layers.Conv2D(<span class="hljs-number">32</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu',</span> <span class="hljs-attr">input_shape=(30,</span> <span class="hljs-number">30</span>, <span class="hljs-number">3</span>)),
<span class="hljs-comment"># Max-pooling layer, reduces the spatial dimensions by taking the maximum value from each group of 2x2 pixels</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Convolutional layer. Learn 64 filters using a 3x3 kernel to extracts higher-level features from the input</span>
layers.Conv2D(<span class="hljs-number">64</span>, (<span class="hljs-number">3</span>, <span class="hljs-number">3</span>), <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Max-pooling layer, using 2x2 pool size reduces spatial dimensions</span>
layers.MaxPooling2D((<span class="hljs-number">2</span>, <span class="hljs-number">2</span>)),
<span class="hljs-comment"># Converts the multidimensional input into a 1D array to feed into the Dense layer</span>
layers.Flatten(),
<span class="hljs-comment"># Dense Hidden Layer with 128 nodes and relu activation function to learns and represent complex patterns in the data</span>
layers.Dense(<span class="hljs-number">128</span>, <span class="hljs-attr">activation='relu'),</span>
<span class="hljs-comment"># Dropout layer to prevent overfitting by randomly setting 50% of the inputs to 0 at each update during training</span>
layers.Dropout(<span class="hljs-number">0.5</span>),
<span class="hljs-comment"># Output layer with NUM_CATEGORIES outputs and softmax activation function to return probability-like results</span>
layers.Dense(NUM_CATEGORIES, <span class="hljs-attr">activation='softmax')</span>
])
</code></pre>
<p>The architecture follows a typical CNN pattern: alternating Convolutional and MaxPooling layers to extract features and reduce spatial dimensions, followed by Flattening and Dense layers for classification.</p>
<p>Feel free to adjust the number of filters, filter sizes, layer types, or other hyperparameters based on your specific problem and dataset. Experimentation is key to finding the best architecture for your task.</p>
<h3 id="model-1-results">Model 1 Results</h3>
<pre><code class="lang-bash">Images and Labels loaded <span class="hljs-number">26640</span>, <span class="hljs-number">26640</span>
Epoch <span class="hljs-number">1</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">4.9111</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0545</span>
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5918</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0555</span>
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5411</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0565</span>
Epoch <span class="hljs-number">4</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5190</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">5</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5088</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0565</span>
Epoch <span class="hljs-number">6</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5041</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">7</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5019</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">8</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5008</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">9</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.5002</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
Epoch <span class="hljs-number">10</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 6s 12ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">3.4999</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.0577</span>
<span class="hljs-number">333</span><span class="hljs-regexp">/333 - 1s - loss: 3.4964 - accuracy: 0.0541 - 1s/</span>epoch - <span class="hljs-number">4</span>ms/step
</code></pre>
<h3 id="model-2-results">Model 2 Results</h3>
<pre><code class="lang-bash">
Images and Labels loaded <span class="hljs-number">26640</span>, <span class="hljs-number">26640</span>
Epoch <span class="hljs-number">1</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 9s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">4.0071</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.1315</span>
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">2.0718</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.3963</span>
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">1.4216</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.5529</span>
Epoch <span class="hljs-number">4</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">1.0891</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.6546</span>
Epoch <span class="hljs-number">5</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.8440</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.7320</span>
Epoch <span class="hljs-number">6</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.6838</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.7862</span>
Epoch <span class="hljs-number">7</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.5754</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8184</span>
Epoch <span class="hljs-number">8</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.5033</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8420</span>
Epoch <span class="hljs-number">9</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 14ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.4171</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8729</span>
Epoch <span class="hljs-number">10</span>/<span class="hljs-number">10</span>
<span class="hljs-number">500</span><span class="hljs-regexp">/500 [==============================] - 7s 15ms/</span>step - <span class="hljs-string">loss:</span> <span class="hljs-number">0.3787</span> - <span class="hljs-string">accuracy:</span> <span class="hljs-number">0.8851</span>
<span class="hljs-number">333</span><span class="hljs-regexp">/333 - 2s - loss: 0.1354 - accuracy: 0.9655 - 2s/</span>epoch - <span class="hljs-number">6</span>ms/step
Model saved to cnn_model2.keras.
</code></pre>
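<p>The last log line notes that the model was saved to <code>cnn_model2.keras</code>. A hedged sketch of how that save and a later reload could look in recent TensorFlow/Keras versions (the exact call used in the project is not shown here):</p>
<pre><code class="lang-python"># Hypothetical save/load step matching the "Model saved to cnn_model2.keras" log line above
model.save("cnn_model2.keras")
reloaded_model = tf.keras.models.load_model("cnn_model2.keras")
</code></pre>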
<h2 id="summary">Summary</h2>
<p>This is a summary of the CNN model experiments:</p>
<p>Model 1 had a loss of <code>3.4964</code> and an accuracy of <code>0.0541</code>. This model had a simple architecture with few layers and filters, which may have limited its ability to learn complex features in the input images.</p>
<p>Model 2 had a loss of <code>0.1354</code> and an accuracy of <code>0.9655</code>. This model had a more complex architecture with additional hidden layers, including a convolutional layer with 64 filters and an additional max-pooling (2x2) layer. The addition of these layers likely helped the model learn more complex features in the input images, leading to a significant improvement in accuracy.</p>
<p>In particular, the addition of more convolutional layers with more filters can help the model learn more complex features in the input images, as each filter learns to detect a different pattern or feature in the input. However, it is important to balance the number of filters with the size of the input images and the complexity of the problem, as using too many filters can lead to overfitting and poor generalization to new data.</p>
<p>Overall, the results suggest that increasing the complexity of the model by adding more hidden layers can help improve its accuracy, but it is important to balance the complexity of the model with the size of the input images and the complexity of the problem to avoid overfitting.</p>
<h3 id="learning-rate">Learning rate</h3>
<ul>
<li>A learning rate of 0.001 (the default) provided optimal results - <code>loss: 0.1354 - accuracy: 0.9655</code></li>
<li>A learning rate of 0.01 lowered the accuracy and increased the loss - <code>loss: 3.4858 - accuracy: 0.0594</code></li>
</ul>
<p>A learning rate of 0.01 proved too high for this specific problem and dataset: it caused the optimizer to overshoot the optimal solution and fail to converge.</p>
<pre><code class="lang-python">
model.compile(optimizer=tf<span class="hljs-selector-class">.keras</span><span class="hljs-selector-class">.optimizers</span><span class="hljs-selector-class">.Adam</span>(learning_rate=<span class="hljs-number">0.01</span>),
loss=<span class="hljs-string">'categorical_crossentropy'</span>,
metrics=[<span class="hljs-string">'accuracy'</span>])
</code></pre>
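<p>For comparison, the better-performing runs kept the Adam optimizer at its default learning rate of 0.001, which can also be set explicitly:</p>
<pre><code class="lang-python">model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
</code></pre>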
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-75203928763754223572023-09-09T15:30:00.006-04:002024-02-04T10:32:10.182-05:00Data Engineering Process Fundamentals - Data Streaming Exersise<h1 id="data-streaming-exercise">Data Streaming - Exercise</h1>
<p>Now that we've covered the concepts of data streaming, let's move forward with an actual coding exercise. During this exercise, we'll delve into building a Kafka message broker and a Spark consumer with the objective of having the Kafka message broker work as a data producer for our subway system information. The Spark consumer acts as a message aggregator and writes the results to our data lake. This allows the data modeling process to pick up the information and insert it into the data warehouse, providing seamless integration and reusing our already operational data pipeline.</p>
<h2 id="batch-process-vs-data-stream">Batch Process vs Data Stream</h2>
<p>In a batch process data pipeline, we define a schedule to process the data from its source. With a data stream pipeline, there is no schedule as the data flows as soon as it is available from its source.</p>
<p>In the batch data download, the data is aggregated for periods of four hours. Since the data stream comes in more frequently, there is no four-hour aggregation. The data comes in as single transactions.</p>
<h3 id="data-stream-strategy">Data Stream Strategy</h3>
<p>From our system requirements, we already have a data pipeline process that runs an incremental update process to import the data from the data lake into our data warehouse. This process already handles data transformation, mapping, and populates all the dimension tables and fact tables with the correct information.</p>
<p>Therefore, we want to follow the same pipeline process and utilize what already exists. To account for the fact that the data comes in as a single transaction, and we do not want to create single files, we want to implement an aggregation strategy on our data streaming pipeline. This enables us to define time windows for when to publish the data, whether it's 1 minute or 4 hours. It really depends on what fits the business requirements. The important thing here is to understand the technical capabilities and the strategy for the solution.</p>
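<p>As a minimal illustration of this idea (a sketch, not part of the project code), a configurable window duration can be used to assign each incoming transaction to an aggregation bucket, whether that window is 5 minutes or the 4 hours used by the batch feed:</p>
<pre><code class="lang-python">from datetime import datetime, timedelta

def window_start(ts: datetime, minutes: int) -> datetime:
    """Floor a timestamp to the start of its aggregation window (within the day)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    window_seconds = minutes * 60
    return midnight + timedelta(seconds=elapsed - (elapsed % window_seconds))

# A 5-minute stream window vs. a 4-hour window that mimics the batch aggregation
ts = datetime(2023, 9, 23, 16, 54)
print(window_start(ts, 5))     # 2023-09-23 16:50:00
print(window_start(ts, 240))   # 2023-09-23 16:00:00
</code></pre>
<p>Every transaction that falls into the same window can then be summed into a single record before it is written to the data lake.</p>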
<h2 id="data-streaming-data-flow-process">Data Streaming Data Flow Process</h2>
<p>To deliver a data streaming solution, we typically employ a technical design illustrated as follows:</p>
<p><img src="////www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-messages.png" alt="Data Engineering Process Fundamentals - Data Streaming Kafka Topics" title="Data Engineering Process Fundamentals - Data Streaming Kafka Topics"></p>
<ul>
<li><p><strong>Kafka</strong></p>
<ul>
<li>Producer</li>
<li>Topics</li>
<li>Consumer</li>
</ul>
</li>
<li><p><strong>Spark</strong></p>
<ul>
<li>Kafka Consumer</li>
<li>Message Parsing and Aggregations</li>
<li>Write to Data Lake or Other Storage</li>
</ul>
</li>
</ul>
<h3 id="kafka-">Kafka:</h3>
<ul>
<li><strong>Producer:</strong> The producer is responsible for publishing data to Kafka topics. It produces and sends messages to specific topics, allowing data to be ingested into the Kafka cluster. </li>
<li><strong>Topics:</strong> Topics are logical channels or categories to which messages are published by producers and from which messages are consumed by consumers. They act as data channels, providing a way to organize and categorize messages.</li>
<li><strong>Consumer:</strong> Consumers subscribe to Kafka topics and process the messages produced by the producers. They play a vital role in real-time data consumption and are responsible for extracting valuable insights from the streaming data (see the minimal sketch after this list).</li>
</ul>
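<p>To make the consumer role concrete, a minimal Kafka consumer written with the confluent-kafka Python library might look like the sketch below. The broker address, group id, and polling loop are placeholders for illustration; the project's actual consumer is the Spark application covered later.</p>
<pre><code class="lang-python">from confluent_kafka import Consumer

# Placeholder connection settings for illustration
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'demo-group',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['mta-turnstile'])

try:
    while True:
        # wait up to one second for the next message
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f'Consumer error: {msg.error()}')
            continue
        print(f'Received {msg.key()}: {msg.value().decode("utf-8")}')
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
</code></pre>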
<h3 id="spark-">Spark:</h3>
<ul>
<li><p><strong>Kafka Consumer:</strong> This component serves as a bridge between Kafka and Spark, allowing Spark to consume data directly from Kafka topics. It establishes a connection to Kafka, subscribes to specified topics, and pulls streaming data into Spark for further processing.</p>
</li>
<li><p><strong>Message Parsing and Aggregations:</strong> Once data is consumed from Kafka, Spark performs message parsing to extract relevant information. Aggregations involve summarizing or transforming the data, making it suitable for analytics or downstream processing. This step is crucial for deriving meaningful insights from the streaming data.</p>
</li>
<li><p><strong>Write to Data Lake or Other Storage:</strong> After processing and aggregating the data, Spark writes the results to a data lake or other storage systems. A data lake is a centralized repository that allows for the storage of raw and processed data in its native format. This step ensures that the valuable insights derived from the streaming data are persisted for further integration to a data warehouse.</p>
</li>
</ul>
<h2 id="implementation-requirements">Implementation Requirements</h2>
<blockquote>
<p>π Clone this repo or copy the files from this folder <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step6-Data-Streaming/">Streaming</a></p>
</blockquote>
<p>For our example, we will adopt a code-centric approach and utilize Python to implement both our producer and consumer components. Additionally, we'll require instances of Apache Kafka and Apache Spark to be running. To ensure scalability, we'll deploy these components on separate virtual machines (VMs). Our Terraform scripts will be instrumental in creating new VM instances for this purpose. It's important to note that all these components will be encapsulated within Docker containers.</p>
<blockquote>
<p>π For ease of development in this lab, we can run everything on a single VM or a local workstation. This allows us to bypass the complexities associated with network configuration. For real deployments, we should use separate VMs.</p>
</blockquote>
<h3 id="requirements">Requirements</h3>
<ul>
<li>Docker and Docker Hub<ul>
<li><a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Docker">Install Docker</a></li>
<li><a href="https://hub.docker.com/">Create a Docker Hub Account</a></li>
</ul>
</li>
<li>Prefect dependencies and cloud account<ul>
<li>Install the Prefect Python library dependencies</li>
<li><a href="https://www.prefect.io/">Create a Prefect Cloud account</a></li>
</ul>
</li>
<li>Data Lake for storage<ul>
<li>Make sure to have the storage account and access ready</li>
</ul>
</li>
</ul>
<blockquote>
<p>π Before proceeding with the setup, ensure that the storage and Prefect credentials have been configured as shown on the <a href="https://www.ozkary.com/2023/05/data-engineering-process-fundamentals-pipeline-orchestration-exercise.html">Orchestration exercise</a> step.</p>
</blockquote>
<h3 id="docker">Docker</h3>
<p>For running this locally or on virtual machines (VMs), the optimal approach is to leverage Docker Containers. In this exercise, we'll utilize a lightweight configuration of Kafka and Spark using Bitnami images. This configuration assumes a minimal setup with a Spark Master, a Spark Worker, and a Kafka broker.</p>
<p>Docker provides a platform for packaging, distributing, and running applications within containers. This ensures consistency across different environments and simplifies the deployment process. To get started, you can download and install Docker from the official website (<a href="https://www.docker.com/get-started">https://www.docker.com/get-started</a>). Once Docker is installed, the Docker command-line interface (docker) becomes available, enabling us to efficiently manage and interact with containers.</p>
<h4 id="docker-compose-file">Docker Compose file</h4>
<p>Utilize the <strong>docker-compose-bitnami.yml</strong> file to configure a unified environment where these services run together. In the event that we need to run the services on distinct virtual machines (VMs), we would deploy each Docker image on a separate VM.</p>
<pre><code class="lang-yaml"><span class="hljs-attribute">version</span>: <span class="hljs-string">"3.6"</span>
<span class="hljs-attribute">services</span>:
<span class="hljs-attribute">spark-master</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">spark</span>:latest
<span class="hljs-attribute">container_name</span>: spark-master
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">SPARK_MODE</span>: <span class="hljs-string">"master"</span>
<span class="hljs-attribute">ports</span>:
- <span class="hljs-number">8080</span>:<span class="hljs-number">8080</span>
<span class="hljs-attribute">spark-worker</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">spark</span>:latest
<span class="hljs-attribute">container_name</span>: spark-worker
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">SPARK_MODE</span>: <span class="hljs-string">"worker"</span>
<span class="hljs-attribute">SPARK_MASTER_URL</span>: <span class="hljs-string">"spark://spark-master:7077"</span>
<span class="hljs-attribute">ports</span>:
- <span class="hljs-number">8081</span>:<span class="hljs-number">8081</span>
<span class="hljs-attribute">depends_on</span>:
- spark-master
<span class="hljs-attribute">kafka</span>:
<span class="hljs-attribute">image</span>: bitnami/<span class="hljs-attribute">kafka</span>:latest
<span class="hljs-attribute">container_name</span>: kafka
<span class="hljs-attribute">ports</span>:
- <span class="hljs-string">"9092:9092"</span>
- <span class="hljs-string">"29092:29092"</span> # Used for internal communication
<span class="hljs-attribute">environment</span>:
<span class="hljs-attribute">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hljs-attribute">PLAINTEXT</span>:PLAINTEXT,<span class="hljs-attribute">PLAINTEXT_HOST</span>:PLAINTEXT
<span class="hljs-attribute">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hljs-attribute">PLAINTEXT</span>:<span class="hljs-comment">//kafka:9092,PLAINTEXT_HOST://localhost:9092</span>
<span class="hljs-attribute">KAFKA_LISTENER_SECURITY_PROTOCOL_MAP</span>: <span class="hljs-attribute">LISTENER_BOB</span>:PLAINTEXT,<span class="hljs-attribute">LISTENER_FRED</span>:PLAINTEXT
<span class="hljs-attribute">KAFKA_LISTENERS</span>: <span class="hljs-attribute">LISTENER_BOB</span>:<span class="hljs-comment">//kafka:29092,LISTENER_FRED://kafka:9092</span>
<span class="hljs-attribute">KAFKA_ADVERTISED_LISTENERS</span>: <span class="hljs-attribute">LISTENER_BOB</span>:<span class="hljs-comment">//kafka:29092,LISTENER_FRED://localhost:9092</span>
<span class="hljs-attribute">KAFKA_INTER_BROKER_LISTENER_NAME</span>: LISTENER_BOB
<span class="hljs-attribute">KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS</span>: <span class="hljs-number">0</span>
<span class="hljs-attribute">KAFKA_TRANSACTION_STATE_LOG_MIN_ISR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR</span>: <span class="hljs-number">1</span>
<span class="hljs-attribute">depends_on</span>:
- spark-master
</code></pre>
<h4 id="download-the-docker-images">Download the Docker Images</h4>
<p>Before we proceed to run the Docker images, it's essential to download them in the target environment. To download the Bitnami images, you can execute the following script from a Bash command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>download_images.sh
</code></pre>
<ul>
<li><strong>download_images.sh script</strong></li>
</ul>
<pre><code class="lang-bash">
echo <span class="hljs-string">"Downloading Spark and Kafka Docker images..."</span>
<span class="hljs-comment"># Spark images from Bitnami</span>
docker pull <span class="hljs-keyword">bitnami/spark:latest
</span>
<span class="hljs-comment"># Kafka image from Bitnami</span>
docker pull <span class="hljs-keyword">bitnami/kafka:latest
</span>
echo <span class="hljs-string">"Images downloaded successfully!"</span>
<span class="hljs-comment"># Display image sizes</span>
echo <span class="hljs-string">"Image sizes:"</span>
docker images <span class="hljs-keyword">bitnami/spark:latest </span><span class="hljs-keyword">bitnami/kafka:latest </span>--format <span class="hljs-string">"{{.Repository}}:{{.Tag}} - {{.Size}}"</span>
</code></pre>
<p>The <strong>download_images.sh</strong> script essentially retrieves two images from DockerHub. This script provides an automated way to download these images when creating new environments.</p>
<h4 id="start-the-services">Start the Services</h4>
<p>Once the Docker images are downloaded, initiate the services by executing the following script:</p>
<pre><code class="lang-bash"><span class="hljs-keyword">bash </span>start_services.sh
</code></pre>
<ul>
<li><strong>start_services.sh script</strong></li>
</ul>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash
</span>
<span class="hljs-comment"># Navigate to the Docker folder</span>
<span class="hljs-built_in">cd</span> docker
<span class="hljs-comment"># Start Spark Master and Spark Worker</span>
docker-compose -f docker-compose-bitnami.yml up -d spark-master spark-worker
<span class="hljs-comment"># Wait for Spark Master and Worker to be ready (adjust sleep time as needed)</span>
sleep 15
<span class="hljs-comment"># Start Kafka and create Kafka topic</span>
docker-compose -f docker-compose-bitnami.yml up -d kafka
<span class="hljs-comment"># Wait for Kafka to be ready (adjust sleep time as needed)</span>
sleep 15
<span class="hljs-comment"># Check if the Kafka topic exists before creating it</span>
topic_exists=$(docker-compose -f docker-compose-bitnami.yml exec kafka /opt/bitnami/kafka/bin/kafka-topics.sh --list --topic mta-turnstile --bootstrap-server localhost:9092 | grep "mta-turnstile")
<span class="hljs-keyword">if</span> [ -z <span class="hljs-string">"<span class="hljs-variable">$topic_exists</span>"</span> ]; <span class="hljs-keyword">then</span>
<span class="hljs-comment"># Create Kafka topic</span>
docker-compose -f docker-compose-bitnami.yml exec kafka /opt/bitnami/kafka/bin/kafka-topics.sh --create --topic mta-turnstile --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Kafka topic created!"</span>
<span class="hljs-keyword">else</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Kafka topic already exists, no need to recreate."</span>
<span class="hljs-keyword">fi</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Services started successfully!"</span>
</code></pre>
<p>The <strong>start_services.sh</strong> script performs the following tasks:</p>
<ul>
<li>Initiates Spark Master and Spark Worker services in detached mode (-d).</li>
<li>Launches Kafka service in detached mode.</li>
<li>Checks whether the Kafka topic already exists and, only when needed, utilizes <code>docker-compose exec</code> to execute the topic creation command inside the Kafka container.</li>
</ul>
<p>At this juncture, both services should be operational and ready to respond to client requests. Now, let's delve into implementing our applications.</p>
<h3 id="data-specifications">Data Specifications</h3>
<p>In this data streaming scenario, we are working with messages using a CSV data format with the following fields:</p>
<pre><code class="lang-python"># Define the schema for the incoming data
turnstiles_schema = StructType([
StructField(<span class="hljs-string">"AC"</span>, StringType()),
StructField(<span class="hljs-string">"UNIT"</span>, StringType()),
StructField(<span class="hljs-string">"SCP"</span>, StringType()),
StructField(<span class="hljs-string">"STATION"</span>, StringType()),
StructField(<span class="hljs-string">"LINENAME"</span>, StringType()),
StructField(<span class="hljs-string">"DIVISION"</span>, StringType()),
StructField(<span class="hljs-string">"DATE"</span>, StringType()),
StructField(<span class="hljs-string">"TIME"</span>, StringType()),
StructField(<span class="hljs-string">"DESC"</span>, StringType()),
StructField(<span class="hljs-string">"ENTRIES"</span>, IntegerType()),
StructField(<span class="hljs-string">"EXITS"</span>, IntegerType()),
StructField(<span class="hljs-string">"ID"</span>, StringType()),
StructField(<span class="hljs-string">"TIMESTAMP"</span>, StringType())
])
</code></pre>
<p>The data format closely resembles what the source system provides for batch integration. However, in this scenario, we also have a unique ID and a TIMESTAMP.</p>
<p>As we process these messages, our objective is to generate files with data aggregation based on these fields:</p>
<pre><code class="lang-python"><span class="hljs-string">"AC"</span>, <span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"DESC"</span>
</code></pre>
<p>And these measures:</p>
<pre><code class="lang-python"><span class="hljs-string">"ENTRIES"</span>, <span class="hljs-string">"EXITS"</span>
</code></pre>
<p>An example of a message would look like this:</p>
<pre><code class="lang-python">"A001,R001,02<span class="hljs-string">-00</span><span class="hljs-string">-00</span>,Test-Station,456NQR,BMT,09<span class="hljs-string">-23</span><span class="hljs-string">-23</span>,REGULAR,16:54:00,140,153"
</code></pre>
<p>It's important to note that the commuter counts are substantial, indicating a certain level of aggregation in these messages. However, they aren't aggregated into four-hour periods the way the batch process is.</p>
<p>Once these message files are aggregated, they are then pushed to the data lake. Subsequently, our data warehouse process can pick them up and proceed with the necessary information processing.</p>
<h2 id="review-the-code">Review the Code</h2>
<p>To enable this functionality, we need to develop a Kafka producer and a Spark Kafka consumer, both implemented in Python. Let's begin by examining the fundamental features of the producer:</p>
<blockquote>
<p>π Clone this repository or copy the files from this folder <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step6-Data-Streaming/">Streaming</a></p>
</blockquote>
<h3 id="kafka-producer">Kafka Producer</h3>
<p>The Kafka producer is a Python application designed to generate messages every 10 seconds. The <code>produce_messages</code> function utilizes messages from the provider and sends the serialized data to a Kafka topic.</p>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">KafkaProducer</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, config_path, topic)</span>:</span>
settings = read_config(config_path)
self.producer = Producer(settings)
self.topic = topic
self.provider = Provider(topic)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">delivery_report</span><span class="hljs-params">(self, err, msg)</span>:</span>
<span class="hljs-string">"""
Reports the success or failure of a message delivery.
Args:
err (KafkaError): The error that occurred, or None on success.
msg (Message): The message that was produced or failed.
"""</span>
<span class="hljs-keyword">if</span> err <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
print(f<span class="hljs-string">'Message delivery failed: {err}'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Record {} produced to {} [{}] at offset {}'</span>.format(msg.key(), msg.topic(), msg.partition(), msg.offset()))
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">produce_messages</span><span class="hljs-params">(self)</span>:</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">True</span>:
<span class="hljs-comment"># Get the message key and value from the provider</span>
key, message = self.provider.message()
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># Produce the message to the Kafka topic</span>
self.producer.produce(topic = self.topic, key=key_serializer(key),
value=value_serializer(message),
on_delivery = self.delivery_report)
<span class="hljs-comment"># Flush to ensure delivery</span>
self.producer.flush()
<span class="hljs-comment"># Print the message</span>
print(f<span class="hljs-string">'Sent message: {message}'</span>)
<span class="hljs-comment"># Wait for 10 seconds before sending the next message</span>
time.sleep(<span class="hljs-number">10</span>)
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-keyword">pass</span>
exit(<span class="hljs-number">0</span>)
<span class="hljs-keyword">except</span> KafkaTimeoutError <span class="hljs-keyword">as</span> e:
print(f<span class="hljs-string">"Kafka Timeout {e.__str__()}"</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
print(f<span class="hljs-string">"Exception while producing record - {key} {message}: {e}"</span>)
<span class="hljs-keyword">continue</span>
</code></pre>
<p>This class utilizes the Confluent Kafka library for seamless interaction with Kafka. It encapsulates the logic for producing messages to a Kafka topic, relying on external configurations, message providers, and serialization functions. The <code>produce_messages</code> method is crafted to run continuously until interrupted, while the <code>delivery_report</code> method serves as a callback function, reporting the success or failure of message delivery.</p>
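<p>The helper functions referenced by the producer (<code>read_config</code>, <code>key_serializer</code>, <code>value_serializer</code>) and the <code>Provider</code> class live elsewhere in the repository and are not shown above. As an illustration only, a minimal version of these helpers could look like the sketch below; the field values are dummy data and the message layout follows the <code>turnstiles_schema</code> order.</p>
<pre><code class="lang-python">import uuid
from datetime import datetime

def read_config(config_path: str) -> dict:
    """Parse a simple key=value properties file (e.g. bootstrap.servers=localhost:9092)."""
    settings = {}
    with open(config_path) as file:
        for line in file:
            line = line.strip()
            if line and not line.startswith('#'):
                key, _, value = line.partition('=')
                settings[key.strip()] = value.strip()
    return settings

def key_serializer(key: str) -> bytes:
    """Encode the message key as UTF-8 bytes."""
    return key.encode('utf-8')

def value_serializer(value: str) -> bytes:
    """Encode the CSV message as UTF-8 bytes."""
    return value.encode('utf-8')

class Provider:
    """Generates dummy turnstile messages matching the CSV layout described earlier."""
    def __init__(self, topic: str):
        self.topic = topic

    def message(self):
        now = datetime.now()
        record_id = str(uuid.uuid4())
        value = (f"A001,R001,02-00-00,Test-Station,456NQR,BMT,"
                 f"{now.strftime('%m-%d-%y')},{now.strftime('%H:%M:%S')},REGULAR,"
                 f"140,153,{record_id},{now.strftime('%Y-%m-%d %H:%M:%S')}")
        return record_id, value
</code></pre>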
<pre><code class="lang-python"><span class="hljs-meta">@flow (name="MTA Data Stream flow", description="Data Streaming Flow")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Main flow to read and send the messages
"""</span>
topic = params.topic
config_path = params.config
producer = KafkaProducer(config_path, topic)
producer.produce_messages()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
<span class="hljs-string">"""main entry point with argument parser"""</span>
os.system(<span class="hljs-string">'clear'</span>)
print(<span class="hljs-string">'publisher running...'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Producer : --topic mta-turnstile --config path-to-config'</span>)
parser.add_argument(<span class="hljs-string">'--topic'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'stream topic'</span>)
parser.add_argument(<span class="hljs-string">'--config'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'kafka setting'</span>)
args = parser.parse_args()
<span class="hljs-comment"># Register the signal handler to handle Ctrl-C </span>
signal.signal(signal.SIGINT, <span class="hljs-keyword">lambda</span> signal, frame: handle_sigint(signal, frame, producer.producer))
main_flow(args)
print(<span class="hljs-string">'publisher end'</span>)
</code></pre>
<p>The <code>main</code> block acts as the entry point, featuring an argument parser that captures the topic and Kafka configuration path from the command line. The script then invokes the <code>main_flow</code> function with the provided arguments.</p>
<p>The <code>main_flow</code> function is annotated with <code>@flow</code> and functions as the primary entry point for the flow. This flow configuration enables us to monitor the flow using our Prefect Cloud monitoring system. It takes parameters (<code>topic</code> and <code>config_path</code>) and initializes a Kafka producer using the provided configuration path and topic.</p>
<blockquote>
<p>π This producer generates dummy data. It's important to note that the MTA system lacks a real-time feed for the turnstile data.</p>
</blockquote>
<h3 id="spark-kafka-consumer">Spark - Kafka Consumer</h3>
<p>The Spark PySpark application listens to a Kafka topic to retrieve messages. It parses these messages using a predefined schema to define the fields and their types. Since these messages arrive every ten seconds, our goal is to aggregate them within a time-span duration of five minutes. The specific duration can be defined based on solution requirements, and for our purposes, it aligns seamlessly with our current data pipeline flow. The aggregated messages are then serialized into compressed CSV files and loaded into the data lake. Subsequently, the data warehouse incremental process can merge this information into our data warehouse.</p>
<p>Our Spark application comprises the following components:</p>
<ul>
<li>Spark Setting class</li>
<li>Spark Consumer class</li>
<li>Main Application Entry<ul>
<li>Prefect libraries for flow monitoring</li>
<li>Prefect component for accessing the data lake</li>
<li>Access to the data lake</li>
</ul>
</li>
</ul>
<h4 id="spark-setting-class">Spark Setting Class</h4>
<pre><code class="lang-python">
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SparkSettings</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, config_path: str, topic: str, group_id: str, client_id: str)</span> -> <span class="hljs-keyword">None</span>:</span>
self.settings = read_config(config_path)
use_sasl = <span class="hljs-string">"sasl.mechanism"</span> <span class="hljs-keyword">in</span> self.settings <span class="hljs-keyword">and</span> self.settings[<span class="hljs-string">"sasl.mechanism"</span>] <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>
self.kafka_options = {
<span class="hljs-string">"kafka.bootstrap.servers"</span>: self.settings[<span class="hljs-string">"bootstrap.servers"</span>],
<span class="hljs-string">"subscribe"</span>: topic,
<span class="hljs-string">"startingOffsets"</span>: <span class="hljs-string">"earliest"</span>,
<span class="hljs-string">"failOnDataLoss"</span>: <span class="hljs-string">"false"</span>,
<span class="hljs-string">"client.id"</span>: client_id,
<span class="hljs-string">"group.id"</span>: group_id,
<span class="hljs-string">"auto.offset.reset"</span>: <span class="hljs-string">"earliest"</span>,
<span class="hljs-string">"checkpointLocation"</span>: <span class="hljs-string">"checkpoint"</span>,
<span class="hljs-string">"minPartitions"</span>: <span class="hljs-string">"2"</span>,
<span class="hljs-string">"enable.auto.commit"</span>: <span class="hljs-string">"false"</span>,
<span class="hljs-string">"enable.partition.eof"</span>: <span class="hljs-string">"true"</span>
}
<span class="hljs-keyword">if</span> use_sasl:
<span class="hljs-comment"># set the JAAS configuration only when use_sasl is True</span>
sasl_config = f<span class="hljs-string">'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="kafka" username="{self.settings["sasl.username"]}" password="{self.settings["sasl.password"]}";'</span>
login_options = {
<span class="hljs-string">"kafka.sasl.mechanisms"</span>: self.settings[<span class="hljs-string">"sasl.mechanism"</span>],
<span class="hljs-string">"kafka.security.protocol"</span>: self.settings[<span class="hljs-string">"security.protocol"</span>],
<span class="hljs-string">"kafka.sasl.username"</span>: self.settings[<span class="hljs-string">"sasl.username"</span>],
<span class="hljs-string">"kafka.sasl.password"</span>: self.settings[<span class="hljs-string">"sasl.password"</span>],
<span class="hljs-string">"kafka.sasl.jaas.config"</span>: sasl_config
}
<span class="hljs-comment"># merge the login options with the kafka options</span>
self.kafka_options = {**self.kafka_options, **login_options}
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span><span class="hljs-params">(self, key)</span>:</span>
<span class="hljs-string">"""
Get the value of a key from the settings dictionary.
"""</span>
<span class="hljs-keyword">return</span> self.settings[key]
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_jass_config</span><span class="hljs-params">(self)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Set the JAAS configuration with variables
"""</span>
jaas_config = (
<span class="hljs-string">"KafkaClient {\n"</span>
<span class="hljs-string">" org.apache.kafka.common.security.plain.PlainLoginModule required\n"</span>
f<span class="hljs-string">" username=\"{self['sasl.username']}\"\n"</span>
f<span class="hljs-string">" password=\"{self['sasl.password']}\";\n"</span>
<span class="hljs-string">"};"</span>
)
print(<span class="hljs-string">'========ENV===========>'</span>,jaas_config)
<span class="hljs-comment"># Set the JAAS configuration in the environment</span>
os.environ[<span class="hljs-string">'KAFKA_OPTS'</span>] = f<span class="hljs-string">"java.security.auth.login.config={jaas_config}"</span>
os.environ[<span class="hljs-string">'java.security.auth.login.config'</span>] = jaas_config
</code></pre>
<p>The Spark Setting class manages the configuration for connecting to a Kafka topic and receiving messages within Spark.</p>
<h4 id="spark-consumer-class">Spark Consumer Class</h4>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SparkConsumer</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(self, settings: SparkSettings, topic: str, group_id: str, client_id: str)</span>:</span>
self.settings = settings
self.topic = topic
self.group_id = group_id
self.client_id = client_id
self.stream = <span class="hljs-keyword">None</span>
self.data_frame = <span class="hljs-keyword">None</span>
self.kafka_options = self.settings.kafka_options
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_kafka_stream</span><span class="hljs-params">(self, spark: SparkSession)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Reads the Kafka Topic.
Args:
spark (SparkSession): The spark session object.
"""</span>
self.stream = spark.readStream.format(<span class="hljs-string">"kafka"</span>).options(**self.kafka_options).load()
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_messages</span><span class="hljs-params">(self, schema)</span> -> DataFrame:</span>
<span class="hljs-string">"""
Parse the messages and use the provided schema to type cast the fields
"""</span>
stream = self.stream
<span class="hljs-keyword">assert</span> stream.isStreaming <span class="hljs-keyword">is</span> <span class="hljs-keyword">True</span>, <span class="hljs-string">"DataFrame doesn't receive streaming data"</span>
options = {<span class="hljs-string">'header'</span>: <span class="hljs-string">'true'</span>, <span class="hljs-string">'sep'</span>: <span class="hljs-string">','</span>}
df = stream.selectExpr(<span class="hljs-string">"CAST(key AS STRING)"</span>, <span class="hljs-string">"CAST(value AS STRING)"</span>, <span class="hljs-string">"timestamp"</span>)
<span class="hljs-comment"># split attributes to nested array in one Column</span>
col = F.split(df[<span class="hljs-string">'value'</span>], <span class="hljs-string">','</span>)
<span class="hljs-comment"># expand col to multiple top-level columns</span>
<span class="hljs-keyword">for</span> idx, field <span class="hljs-keyword">in</span> enumerate(schema):
df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
<span class="hljs-comment"># remove quotes from TIMESTAMP column</span>
df = df.withColumn(<span class="hljs-string">"TIMESTAMP"</span>, F.regexp_replace(F.col(<span class="hljs-string">"TIMESTAMP"</span>), <span class="hljs-string">'"'</span>, <span class="hljs-string">''</span>))
df = df.withColumn(<span class="hljs-string">"CA"</span>, F.regexp_replace(F.col(<span class="hljs-string">"CA"</span>), <span class="hljs-string">'"'</span>, <span class="hljs-string">''</span>))
result = df.select([field.name <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> schema])
result = result.dropDuplicates(["ID","STATION","TIMESTAMP"])
result.printSchema()
<span class="hljs-keyword">return</span> result
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">agg_messages</span><span class="hljs-params">(self, df: DataFrame, window_duration: str, window_slide: str)</span> -> DataFrame:</span>
<span class="hljs-string">"""
Window for n-minute aggregations grouped by CA, UNIT, SCP, STATION, LINENAME, DIVISION, DATE, DESC
"""</span>
<span class="hljs-comment"># Ensure TIMESTAMP is in the correct format (timestamp type) </span>
date_format = <span class="hljs-string">"yyyy-MM-dd HH:mm:ss"</span>
df = df.withColumn(<span class="hljs-string">"TS"</span>, F.to_timestamp(<span class="hljs-string">"TIMESTAMP"</span>, date_format))
df_windowed = df \
.withWatermark(<span class="hljs-string">"TS"</span>, window_duration) \
.groupBy(F.window(<span class="hljs-string">"TS"</span>, window_duration, window_slide),<span class="hljs-string">"CA"</span>, <span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"DESC"</span>) \
.agg(
F.sum(<span class="hljs-string">"ENTRIES"</span>).alias(<span class="hljs-string">"ENTRIES"</span>),
F.sum(<span class="hljs-string">"EXITS"</span>).alias(<span class="hljs-string">"EXITS"</span>)
).withColumn(<span class="hljs-string">"START"</span>,F.col(<span class="hljs-string">"window.start"</span>)) \
.withColumn(<span class="hljs-string">"END"</span>, F.col(<span class="hljs-string">"window.end"</span>)) \
.withColumn(<span class="hljs-string">"TIME"</span>, F.date_format(<span class="hljs-string">"window.end"</span>, <span class="hljs-string">"HH:mm:ss"</span>)) \
.drop(<span class="hljs-string">"window"</span>) \
.select(<span class="hljs-string">"CA"</span>,<span class="hljs-string">"UNIT"</span>,<span class="hljs-string">"SCP"</span>,<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"LINENAME"</span>,<span class="hljs-string">"DIVISION"</span>,<span class="hljs-string">"DATE"</span>,<span class="hljs-string">"DESC"</span>,<span class="hljs-string">"TIME"</span>,<span class="hljs-string">"START"</span>,<span class="hljs-string">"END"</span>,<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>)
df_windowed.printSchema()
<span class="hljs-keyword">return</span> df_windowed
</code></pre>
<p>The Spark consumer class initiates the consumer by loading the Kafka settings, reading from the data stream, parsing the messages, and ultimately aggregating the information using various categorical fields from the data.</p>
<p>The <code>agg_messages</code> function performs windowed aggregations on a DataFrame containing message data. It takes three parameters: <code>df</code> (the input DataFrame with message information), <code>window_duration</code> (the duration of each aggregation window in minutes), and <code>window_slide</code> (the sliding interval for the window). The function ensures the 'TIMESTAMP' column is in the correct timestamp format and applies windowed aggregations grouped by the 'CA', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', and 'DESC' columns. The resulting DataFrame includes the aggregated entries and exits for each window and group, providing insights into activity patterns over the specified time intervals. The function also prints the schema of the resulting DataFrame, making it convenient to understand the structure of the aggregated data.</p>
<blockquote>
<p>π The <code>agg_messages</code> function verifies that the timestamp from the data is in the correct Spark timestamp format. An incorrect format will prevent Spark from aggregating the messages, resulting in empty data files.</p>
</blockquote>
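<p>One way to sanity-check the expected format before running the consumer is a small standalone snippet (a sketch, not part of the consumer code). With Spark's default settings, <code>to_timestamp</code> yields <code>null</code> for values that do not match the supplied pattern, which leaves the windowed aggregation with no valid event time to group on:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("timestamp-format-check").getOrCreate()

# One value in the expected format and one that cannot be parsed
df = spark.createDataFrame(
    [("2023-09-23 16:54:00",), ("not-a-timestamp",)], ["TIMESTAMP"]
)
df.withColumn("TS", F.to_timestamp("TIMESTAMP", "yyyy-MM-dd HH:mm:ss")).show(truncate=False)
# The first row parses into a timestamp; the second becomes null and would not contribute to the aggregation.
</code></pre>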
<h4 id="main-application-entry-point">Main application entry point</h4>
<pre><code class="lang-python"><span class="hljs-comment"># @task(name="Stream write GCS", description='Write stream file to GCS', log_prints=False)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">stream_write_gcs</span><span class="hljs-params">(local_path: Path, file_name: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Upload local parquet file to GCS
Args:
path: File location
prefix: the folder location on storage
"""</span>
block_name = get_block_name()
prefix = get_prefix()
gcs_path = f<span class="hljs-string">'{prefix}/{file_name}'</span>
print(f<span class="hljs-string">'{block_name} {local_path} {gcs_path}'</span>)
gcs_block = GcsBucket.load(block_name)
gcs_block.upload_from_path(from_path=local_path, to_path=gcs_path)
<span class="hljs-keyword">return</span>
<span class="hljs-comment"># @task (name="MTA Spark Data Stream - Process Mini Batch", description="Write batch to the data lake")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_mini_batch</span><span class="hljs-params">(df, batch_id, path)</span>:</span>
<span class="hljs-string">"""Processes a mini-batch, converts to Pandas, and writes to GCP Storage as CSV.gz."""</span>
<span class="hljs-comment"># Check if DataFrame is empty</span>
<span class="hljs-keyword">if</span> df.count() == <span class="hljs-number">0</span>:
print(f<span class="hljs-string">"DataFrame for batch {batch_id} is empty. Skipping processing."</span>)
<span class="hljs-keyword">return</span>
<span class="hljs-comment"># Convert to Pandas DataFrame</span>
df_pandas = df.toPandas()
<span class="hljs-comment"># Convert 'DATE' column to keep the same date format</span>
df_pandas[<span class="hljs-string">'DATE'</span>] = pd.to_datetime(df_pandas[<span class="hljs-string">'DATE'</span>], format=<span class="hljs-string">'%m-%d-%y'</span>).dt.strftime(<span class="hljs-string">'%m/%d/%Y'</span>)
print(df_pandas.head())
<span class="hljs-comment"># Get the current timestamp</span>
timestamp = datetime.now()
<span class="hljs-comment"># Format the timestamp as needed</span>
time = timestamp.strftime(<span class="hljs-string">"%Y%m%d_%H%M%S"</span>)
<span class="hljs-comment"># Write to Storage as CSV.gz </span>
file_name = f<span class="hljs-string">"batch_{batch_id}_{time}.csv.gz"</span>
file_path = f<span class="hljs-string">"{path}/{file_name}"</span>
df_pandas.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
<span class="hljs-comment"># send to the data lake</span>
stream_write_gcs(file_path, file_name)
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Aggregate messages", description="Aggregate the data in time windows")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">aggregate_messages</span><span class="hljs-params">(consumer, df_messages, window_duration, window_slide)</span> -> DataFrame:</span>
df_windowed = consumer.agg_messages(df_messages, window_duration, window_slide)
<span class="hljs-keyword">return</span> df_windowed
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Read data stream", description="Read the data stream")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_data_stream</span><span class="hljs-params">(consumer, spark_session)</span> -> <span class="hljs-keyword">None</span>:</span>
consumer.read_kafka_stream(spark_session)
<span class="hljs-comment"># write a streaming data frame to storage ./storage</span>
<span class="hljs-meta">@task (name="MTA Spark Data Stream - Write to Storage", description="Write batch to the data lake")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_to_storage</span><span class="hljs-params">(df: DataFrame, output_mode: str = <span class="hljs-string">'append'</span>, processing_time: str = <span class="hljs-string">'60 seconds'</span>)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Output stream values to the console
"""</span>
df_csv = df.select(
<span class="hljs-string">"CA"</span>, <span class="hljs-string">"UNIT"</span>, <span class="hljs-string">"SCP"</span>, <span class="hljs-string">"STATION"</span>, <span class="hljs-string">"LINENAME"</span>, <span class="hljs-string">"DIVISION"</span>, <span class="hljs-string">"DATE"</span>, <span class="hljs-string">"TIME"</span>, <span class="hljs-string">"DESC"</span>,<span class="hljs-string">"ENTRIES"</span>, <span class="hljs-string">"EXITS"</span>
)
path = <span class="hljs-string">"./storage/"</span>
folder_path = Path(path)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(folder_path):
folder_path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
storage_query = df_csv.writeStream \
.outputMode(output_mode) \
.trigger(processingTime=processing_time) \
.format(<span class="hljs-string">"csv"</span>) \
.option(<span class="hljs-string">"header"</span>, <span class="hljs-keyword">True</span>) \
.option(<span class="hljs-string">"path"</span>, path) \
.option(<span class="hljs-string">"checkpointLocation"</span>, <span class="hljs-string">"./checkpoint"</span>) \
.foreachBatch(<span class="hljs-keyword">lambda</span> batch, id: process_mini_batch(batch, id, path)) \
.option(<span class="hljs-string">"truncate"</span>, <span class="hljs-keyword">False</span>) \
.start()
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># Wait for the streaming query to terminate</span>
storage_query.awaitTermination()
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-comment"># Handle keyboard interrupt (e.g., Ctrl+C)</span>
storage_query.stop()
<span class="hljs-meta">@flow (name="MTA Spark Data Stream flow", description="Data Streaming Flow")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
main flow to process stream messages with spark
"""</span>
topic = params.topic
group_id = params.group
client_id = params.client
config_path = params.config
<span class="hljs-comment"># define a window for n minutes aggregations group by station</span>
default_span = <span class="hljs-string">'5 minutes'</span>
window_duration = default_span <span class="hljs-keyword">if</span> params.duration <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span> <span class="hljs-keyword">else</span> f<span class="hljs-string">'{params.duration} minutes'</span>
window_slide = default_span <span class="hljs-keyword">if</span> params.slide <span class="hljs-keyword">is</span> <span class="hljs-keyword">None</span> <span class="hljs-keyword">else</span> f<span class="hljs-string">'{params.slide} minutes'</span>
<span class="hljs-comment"># create the consumer settings</span>
spark_settings = SparkSettings(config_path, topic, group_id, client_id)
<span class="hljs-comment"># create the spark consumer</span>
spark_session = SparkSession.builder \
.appName(<span class="hljs-string">"turnstiles-consumer"</span>) \
.config(<span class="hljs-string">"spark.sql.adaptive.enabled"</span>, <span class="hljs-string">"false"</span>) \
.getOrCreate()
spark_session.sparkContext.setLogLevel(<span class="hljs-string">"WARN"</span>)
<span class="hljs-comment"># create an instance of the consumer class</span>
consumer = SparkConsumer(spark_settings, topic, group_id, client_id)
<span class="hljs-comment"># set the data frame stream</span>
read_data_stream(consumer, spark_session)
<span class="hljs-comment"># consumer.read_kafka_stream(spark_session) </span>
<span class="hljs-comment"># parse the messages</span>
df_messages = consumer.parse_messages(schema=turnstiles_schema)
df_windowed = aggregate_messages(consumer, df_messages, window_duration, window_slide)
<span class="hljs-comment"># df_windowed = consumer.agg_messages(df_messages, window_duration, window_slide)</span>
write_to_storage(df=df_windowed, output_mode=<span class="hljs-string">'append'</span>,processing_time=window_duration)
spark_session.streams.awaitAnyTermination()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
<span class="hljs-string">"""
Main entry point for streaming data between kafka and spark
"""</span>
os.system(<span class="hljs-string">'clear'</span>)
print(<span class="hljs-string">'Spark streaming running...'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Producer : --topic mta-turnstile --group spark_group --client app1 --config path-to-config'</span>)
parser.add_argument(<span class="hljs-string">'--topic'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'kafka topics'</span>)
parser.add_argument(<span class="hljs-string">'--group'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'consumer group'</span>)
parser.add_argument(<span class="hljs-string">'--client'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'client id group'</span>)
parser.add_argument(<span class="hljs-string">'--config'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'cloud settings'</span>)
parser.add_argument(<span class="hljs-string">'--duration'</span>, required=<span class="hljs-keyword">False</span>, help=<span class="hljs-string">'window duration for aggregation 5 mins'</span>)
parser.add_argument(<span class="hljs-string">'--slide'</span>, required=<span class="hljs-keyword">False</span>, help=<span class="hljs-string">'window slide 5 mins'</span>)
args = parser.parse_args()
main_flow(args)
print(<span class="hljs-string">'end'</span>)
</code></pre>
<p>This is a summary of the main functions that start and run the consumer application:</p>
<ul>
<li><p><strong><code>stream_write_gcs</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Uploads a local Parquet file to Google Cloud Storage (GCS).</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>process_mini_batch</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Processes a mini-batch from a Spark DataFrame, converts it to a Pandas DataFrame, and writes it to GCP Storage as a compressed CSV file.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>aggregate_messages</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Aggregates data in time windows based on specific columns.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>read_data_stream</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Reads the data stream from Kafka.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>write_to_storage</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Writes a streaming DataFrame to storage (./storage) and triggers the processing of mini-batches.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect task (<code>@task</code>) for monitoring.</li>
</ul>
</li>
<li><p><strong><code>main_flow</code></strong></p>
<ul>
<li><strong>Purpose:</strong> Defines the main flow to process stream messages with Spark.</li>
<li><strong>Prefect Cloud Monitoring:</strong> Marked as a Prefect flow (<code>@flow</code>) for orchestration and monitoring.</li>
</ul>
</li>
<li><p><strong>Main Entry Point:</strong></p>
<ul>
<li><strong>Purpose:</strong> Parses command-line arguments and invokes the <code>main_flow</code> function to execute the streaming data processing.</li>
</ul>
</li>
</ul>
<blockquote>
<p>π These decorators (<code>@flow</code> and <code>@task</code>) are employed for Prefect Cloud Monitoring, orchestration, and task management.</p>
</blockquote>
<h2 id="how-to-runt-it-">How to runt it!</h2>
<p>With all the requirements completed and the code review done, we are ready to run our solution. Let's follow these steps to ensure our apps run properly.</p>
<h3 id="start-the-container-services">Start the Container Services</h3>
<p>Initiate the container services from the command line by executing the following script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_services.sh
</code></pre>
<h3 id="install-dependencies-and-run-the-apps">Install dependencies and run the apps</h3>
<blockquote>
<p>π These applications depend on the Kafka and Spark services to be running. Be sure to start those services first.</p>
</blockquote>
<h4 id="kafka-producer">Kafka Producer</h4>
<p>Execute the producer with the following command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_producer.sh
</code></pre>
<h4 id="spark-kafka-consumer">Spark - Kafka Consumer</h4>
<p>Execute the Spark consumer from the command line:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>start_consumer.sh
</code></pre>
<h3 id="execution-results">Execution Results</h3>
<p>After the producer and consumer are running, the following results should be observed:</p>
<h3 id="kafka-producer-log">Kafka Producer Log</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-kafka-log.png" alt="Data Engineering Process Fundamentals - Data Streaming Kafka Producer Log" title="Data Engineering Process Fundamentals - Data Streaming Kafka Producer Log"></p>
<p>As messages are sent by the producer, we should observe the activity in the console or log file.</p>
<h3 id="spark-consumer-log">Spark Consumer Log</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-spark-log.png" alt="Data Engineering Process Fundamentals - Data Streaming Spark Consumer Log" title="Data Engineering Process Fundamentals - Data Streaming Spark Consumer Log"></p>
<p>Spark parses the messages in real-time, displaying the message schemas for both the individual message from Kafka and the aggregated message. Once the time window is complete, it serializes the message from memory into a compressed CSV file.</p>
<h3 id="cloud-monitor">Cloud Monitor</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-prefect-monitor.png" alt="Data Engineering Process Fundamentals - Data Streaming Cloud Monitor" title="Data Engineering Process Fundamentals - Cloud Monitor"></p>
<p>As the application runs, the flows and tasks are tracked on our cloud console. This tracking is utilized to monitor for any failures.</p>
<h3 id="data-lake-integration">Data Lake Integration</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-data-lake.png" alt="Data Engineering Process Fundamentals - Data Streaming Data Lake" title="Data Engineering Process Fundamentals - Data Lake"></p>
<p>Upon serializing the data aggregation, a compressed CSV file is uploaded to the data lake with the purpose of making it visible to our data warehouse integration process.</p>
<h3 id="data-warehouse-integration">Data Warehouse Integration</h3>
<p><img src="////www.ozkary.dev/assets/2024/ozkary-data-engineering-process-stream-data-warehouse.png" alt="Data Engineering Process Fundamentals - Data Streaming Data Warehouse" title="Data Engineering Process Fundamentals - Data Warehouse"></p>
<p>Once the data has been transferred to the data lake, we can initiate the integration from the data warehouse. A quick way to check is to query the external table using the test station name.</p>
<blockquote>
<p>π Our weekly batch process is scheduled once per week. However, in a data stream use case, where the data arrives more frequently, we need to update the schedule to an hourly or minute window.</p>
</blockquote>
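<p>As an illustration of that change, a Prefect 2.x deployment for the existing data warehouse flow could be given an hourly cron schedule. This is a hedged sketch only: the flow module path, deployment name, and work queue are placeholders, and the <code>CronSchedule</code> import path shown here is the one used by the 2.7 line of Prefect referenced in the Dockerfiles below.</p>
<pre><code class="lang-python">from prefect.deployments import Deployment
from prefect.orion.schemas.schedules import CronSchedule

# Placeholder import path for the existing data warehouse incremental flow
from flows.etl_data_warehouse import main_flow

deployment = Deployment.build_from_flow(
    flow=main_flow,
    name="mta-data-warehouse-hourly",
    schedule=CronSchedule(cron="0 * * * *", timezone="America/New_York"),  # top of every hour
    work_queue_name="default",
)

if __name__ == "__main__":
    deployment.apply()
</code></pre>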
<h2 id="deployment">Deployment</h2>
<p>For our deployment process, we can follow these steps:</p>
<ul>
<li>Define the Docker files for each component</li>
<li>Build and push the apps to DockerHub</li>
<li>Deploy the Kafka and Spark services</li>
<li>Deploy the Kafka producer and Spark consumer apps</li>
</ul>
<h3 id="define-the-docker-files-for-each-component">Define the Docker files for each component</h3>
<p>To facilitate each deployment, we aim to run our applications within a Docker container. In each application folder, you'll find a Dockerfile. This file installs the application dependencies, copies the necessary files, and runs the specific command to load the application.</p>
<blockquote>
<p>π Noteworthy is the use of the <code>VOLUME</code> command in these files, enabling us to map a VM hosting folder to an image folder. The objective is to share a common configuration file for the containers.</p>
</blockquote>
<ul>
<li>Kafka Producer Docker file</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Use a base image with Prefect and Python</span>
<span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-comment"># Set the working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app
</span>
<span class="hljs-comment"># Copy the requirements file to the working directory</span>
<span class="hljs-keyword">COPY</span><span class="bash"> requirements.txt .
</span>
<span class="hljs-comment"># Install dependencies</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-comment"># Copy the entire current directory into the container</span>
<span class="hljs-keyword">COPY</span><span class="bash"> *.py .
</span>
<span class="hljs-comment"># Specify the default command to run when the container starts</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"python3"</span>, <span class="hljs-string">"program.py"</span>, <span class="hljs-string">"--topic"</span>,<span class="hljs-string">"mta-turnstile"</span>,<span class="hljs-string">"--config"</span>,<span class="hljs-string">"/config/docker-kafka.properties"</span>]
</span>
<span class="hljs-comment"># Create a directory for Kafka configuration</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /config
</span>
<span class="hljs-comment"># Create a volume mount for Kafka configuration</span>
<span class="hljs-keyword">VOLUME</span><span class="bash"> [<span class="hljs-string">"/config"</span>]
</span>
<span class="hljs-comment"># push the ~/.kafka/docker-kafka.properties to the target machine</span>
<span class="hljs-comment"># run as to map the volume to the target machine:</span>
<span class="hljs-comment"># docker run -v ~/.kafka:/config your-image-name</span>
</code></pre>
<ul>
<li>Spark Consumer Docker file</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-comment"># Use a base image with Prefect and Python</span>
<span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-comment"># Set the working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /app
</span>
<span class="hljs-comment"># Copy the requirements file to the working directory</span>
<span class="hljs-keyword">COPY</span><span class="bash"> requirements.txt .
</span>
<span class="hljs-comment"># Install dependencies</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-comment"># Copy the entire current directory into the container</span>
<span class="hljs-keyword">COPY</span><span class="bash"> *.py .
</span><span class="hljs-keyword">COPY</span><span class="bash"> *.sh .
</span>
<span class="hljs-comment"># Create a directory for Kafka configuration</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /config
</span>
<span class="hljs-comment"># Create a volume mount for Kafka configuration</span>
<span class="hljs-keyword">VOLUME</span><span class="bash"> [<span class="hljs-string">"/config"</span>]
</span>
<span class="hljs-comment"># Set the entry point script as executable</span>
<span class="hljs-keyword">RUN</span><span class="bash"> chmod +x submit-program.sh
</span>
<span class="hljs-comment"># Specify the default command to run when the container starts</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"/bin/bash"</span>, <span class="hljs-string">"submit-program.sh"</span>, <span class="hljs-string">"program.py"</span>, <span class="hljs-string">"/config/docker-kafka.properties"</span>]
</span>
<span class="hljs-comment"># push the ~/.kafka/docker-kafka.properties to the target machine</span>
<span class="hljs-comment"># run as to map the volume to the target machine:</span>
<span class="hljs-comment"># docker run -v ~/.kafka:/config your-image-name</span>
</code></pre>
<h3 id="build-and-push-the-apps-to-dockerhub">Build and push the apps to DockerHub</h3>
<p>To build the apps in Docker containers, execute the following script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span><span class="hljs-keyword">build_push_apps.sh</span>
</code></pre>
<h3 id="deploy-the-kafka-and-spark-services">Deploy the Kafka and Spark services</h3>
<p>For Kafka and Spark services, we are utilizing predefined Bitnami images from DockerHub. Deploy these images by running the following script on the target environment:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>deploy_kafka_spark.sh
</code></pre>
<p>This script utilizes a Docker Compose file to pull the Bitnami images and subsequently run them.</p>
<blockquote>
<p>π Docker Compose is a tool for defining and running multi-container Docker applications. It can define the services, networks, and volumes needed for the application in a single docker-compose.yml file.</p>
</blockquote>
<h3 id="deploy-the-kafka-producer-and-spark-consumer-apps">Deploy the Kafka producer and Spark consumer apps</h3>
<p>Once the app images are available from DockerHub, initiate the deployment against a new environment by executing this script:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">bash </span>deploy_publisher_consumer_apps.sh
</code></pre>
<p>This script pulls the app images from DockerHub and runs them independently.</p>
<blockquote>
<p>π It's important to note that while we've covered a local and a GitHub Actions deployment, deploying to a cloud provider environment involves additional considerations.</p>
</blockquote>
<h2 id="deployment-strategy">Deployment Strategy</h2>
<p>In this guide, we've explored a two-fold approach to deploying our Kafka and Spark-based data streaming solution. Initially, we used the manual deployment process, demonstrating how to execute bash scripts for building and deploying our application. This hands-on method provides a detailed understanding of the steps involved, giving users complete control over the deployment process.</p>
<p>Moving forward, we showcased a more streamlined and automated approach by integrating GitHub Actions into our workflow. By leveraging GitHub Actions, we can trigger builds and deployments with a simple push to dedicated branches (<code>deploy-bitnami</code> and <code>deploy-apps</code>). This automation not only simplifies the deployment process but also enhances efficiency, ensuring consistency across environments.</p>
<h2 id="summary">Summary</h2>
<p>The integration of Kafka and Spark in a data streaming architecture involves producers publishing data to Kafka topics, consumers subscribing to these topics, Spark consuming data from Kafka, parsing and aggregating messages, and finally, writing the processed data to a data lake or other storage for further processing.</p>
<p>Once the data is available in the data lake, the data warehouse process can pick up the new files and continue its incremental update process, ultimately reflecting on the analysis and visualization layer. This architecture enables real-time data processing and analytics in a scalable and fault-tolerant manner.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-11861948700895265582023-08-05T11:44:00.008-04:002023-09-15T11:34:32.688-04:00Data Engineering Process Fundamentals - Data Streaming<p>In modern data engineering solutions, handling streaming data is very important. Businesses often need real-time insights to promptly monitor and respond to operational changes and performance trends. A data streaming pipeline facilitates the integration of real-time data into data warehouses and visualization dashboards. To achieve this integration in a data engineering solution, understanding the principles of data streaming is essential, and how technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://spark.apache.org/">Apache Spark</a> play a key role in building efficient streaming data pipelines.</p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/07/data-engineering-process-fundamentals-data-analysis-visualization.html">Data Engineering Process Fundamentals - Data Analysis and Visualization</a></p>
</blockquote>
<h2 id="what-is-data-streaming">What is Data Streaming</h2>
<p>Data streaming enables us to build data integration in real-time. Unlike traditional batch processing, where data is collected and processed periodically, streaming data arrives continuously and is processed on-the-fly. This kind of integration empowers organizations to:</p>
<ul>
<li>React Instantly: Timely responses to events and anomalies become possible</li>
<li>Predict Trends: Identify patterns and trends as they emerge</li>
<li>Enhance User Experience: Provide real-time updates and personalization</li>
<li>Optimize Operations: Streamline processes and resource allocation</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-messages.png" alt="ozkary-data-engineering-design-data-streaming-messages" title="Data Engineering Process Fundamentals - Data Streaming Kafka Topics"></p>
<h3 id="data-streaming-channels">Data Streaming Channels</h3>
<p>Data streaming is a continuous data flow that usually arrives from a channel hosted on an HTTP endpoint. The channel technology depends on the provider's technology stack and can be any of the following:</p>
<ul>
<li><p>Web Hooks: Web hooks are like virtual messengers that notify us when something interesting happens on the web. They are HTTP callbacks triggered by specific events, such as a change in a system. To harness data from web hooks, we set up endpoints that listen for these notifications, allowing us to react instantly to changes.</p>
</li>
<li><p>Events: Events are a fundamental concept in data streaming. They represent occurrences in a system or application, such as a user click, a sensor detecting a temperature change, or a train arriving at a station. Events can be collected and processed in real-time by using a middleware platform like Apache Kafka or RabbitMQ, providing insights into user behavior, system health, and more.</p>
</li>
<li><p>API Integration: APIs (Application Programming Interfaces) are bridges between different software systems. Through API integration, we can fetch data from external services, social media platforms, IoT devices, or any source that exposes an API. This seamless connectivity enables us to incorporate external data into our applications and processes by scheduling calls to the API at a certain frequency.</p>
</li>
</ul>
<blockquote>
<p>π Events are used for a wide range of real-time applications, including IoT data collection, application monitoring, and user behavior tracking. Web hooks are typically employed for integrating external services, automating workflows, and receiving notifications from third-party platforms.</p>
</blockquote>
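<p>To make the web hook channel concrete, the sketch below exposes a small HTTP endpoint that receives event notifications and would hand them off to the streaming pipeline. It assumes Flask is installed; the route name and payload fields are hypothetical.</p>
<pre><code class="lang-python"># Minimal web hook receiver sketch. The route and payload fields are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/turnstile", methods=["POST"])
def turnstile_webhook():
    event = request.get_json(silent=True) or {}
    # In a real pipeline, the event would be validated and forwarded
    # to a message broker such as a Kafka topic
    print("Received event:", event.get("station"), event.get("entries"), event.get("exits"))
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=8080)
</code></pre>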
<h3 id="scaling-to-handle-a-data-stream">Scaling to Handle a Data Stream</h3>
<p>Data streaming sources often produce small payloads at a very high message volume. This introduces scalability concerns that should be addressed with essential components like the following:</p>
<ul>
<li><p>Streaming Infrastructure: Robust streaming infrastructure is the backbone of data streaming. This includes systems like Apache Kafka, AWS Kinesis, or Azure Stream Analytics, which facilitate the ingestion, processing, and routing of data streams</p>
</li>
<li><p>Real-Time Processing: Traditional batch processing won't cut it for data streaming. We need real-time processing frameworks like <a href="https://storm.apache.org/">Apache Storm</a>, or Apache Spark Streaming to handle data as it flows</p>
</li>
<li><p>Data Storage: Storing and managing streaming data is crucial. We might use data lakes for long-term storage and databases optimized for real-time access. Cloud storage solutions offer scalability and reliability</p>
</li>
<li><p>Analytics and Visualization: To derive meaningful insights, we need analytics tools capable of processing streaming data. Visualization platforms like PowerBI, Looker, or custom dashboards can help you make sense of the information in real time</p>
</li>
<li><p>Monitoring and Alerts: Proactive monitoring ensures that your data streaming pipeline is healthy. Implement alerts and triggers to respond swiftly to anomalies or critical events</p>
</li>
<li><p>Scalable Compute Resources: As data volumes grow, compute resources should be able to scale horizontally to handle increased data loads. Cloud-based solutions are often used for this purpose</p>
</li>
</ul>
<h2 id="data-streaming-components">Data Streaming Components</h2>
<p>At the heart of data streaming solutions lie technologies like Apache Kafka, a distributed event streaming platform, and Apache Spark, a versatile data processing engine. Together, they form a powerful solution that ingests, processes, and analyzes streaming data at scale.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-kafka-spark.png" alt="ozkary-data-engineering-design-data-streaming" title="Data Engineering Process Fundamentals - Data Streaming Design Kafka and Spark"></p>
<h3 id="apache-kafka">Apache Kafka</h3>
<p>Kafka acts as the ingestion layer or message broker in the streaming pipeline. It serves as a highly durable, fault-tolerant, and scalable event streaming platform. Data producers, which can be various sources like applications, sensors, or webhooks, publish events (messages) to Kafka topics. These events are typically small pieces of data containing information such as transactions, logs, or sensor readings. Let's look at a simplified overview of how Kafka works:</p>
<ul>
<li><p>Kafka organizes events into topics. A topic is a logical channel or category to which records (messages) are sent by producers and from which records are consumed by consumers. Topics serve as the central mechanism for organizing and categorizing data within Kafka. Each topic can have multiple partitions to support parallelism and fail-over scenarios</p>
<ul>
<li>Kafka is distributed and provides fault tolerance. If a broker (Kafka server) fails, partitions can be replicated across multiple brokers</li>
</ul>
</li>
<li><p>Kafka follows a publish-subscribe model. Producers send records to topics, and consumers subscribe to one or more topics to receive and process those records</p>
<ul>
<li><p>A producer is a program or process responsible for publishing records to Kafka topics. Producers generate data, which is then sent to one or more topics. Each message in a topic is identified by an offset, which represents its position within its partition.</p>
</li>
<li><p>A consumer is a program or process that subscribes to one or more topics and processes the records within them. Consumers can read data from topics in real-time and perform various operations on it, such as analytics, storage, or forwarding to other systems</p>
</li>
</ul>
</li>
</ul>
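<p>To make the publish-subscribe model tangible, here is a minimal consumer sketch using the kafka-python library. The topic name matches the examples used later in this post, while the consumer group id is a hypothetical value.</p>
<pre><code class="lang-python"># Minimal Kafka consumer sketch using kafka-python. The group id is hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "turnstile-stream",                   # topic to subscribe to
    bootstrap_servers="localhost:9092",   # replace with your broker address
    group_id="turnstile-consumers",       # consumers in the same group share partitions
    auto_offset_reset="latest",           # start from the newest messages
    value_deserializer=lambda v: v.decode("utf-8")
)

# Each iteration blocks until a new record arrives on the subscribed topic
for message in consumer:
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
</code></pre>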
<h3 id="apache-spark-structured-streaming">Apache Spark Structured Streaming</h3>
<p>Apache Spark Structured Streaming is a micro-batch processing framework built on top of Apache Spark. It enables real-time data processing by ingesting data from Kafka topics in mini-batches. Here's how the process works:</p>
<blockquote>
<p>π Apache Spark offers a unified platform for both batch and stream processing. If your application requires seamless transitions between batch and stream processing modes, Spark can be a good fit.</p>
</blockquote>
<ul>
<li><p>Kafka Integration: Spark Streaming integrates with Kafka using the Kafka Direct API. It can consume data directly from Kafka topics, leveraging Kafka's parallelism and fault tolerance features</p>
</li>
<li><p>Mini-Batch Processing: Spark Streaming reads data from Kafka topics in mini-batches, typically ranging from milliseconds to seconds. Each mini-batch of data is treated as a Resilient Distributed Dataset (RDD) within the Spark ecosystem</p>
</li>
<li><p>Data Transformation: Once the data is ingested into Spark Streaming, we can apply various transformations, computations, and analytics on the mini-batches of data. Spark provides a rich set of APIs for tasks like filtering, aggregating, joining, and machine learning</p>
</li>
<li><p>Windowed Operations: Spark Streaming allows us to perform windowed operations, such as sliding windows or tumbling windows, to analyze data within specific time intervals. This is useful for aggregating data over fixed time periods (e.g., hourly, daily) or for tracking patterns over sliding windows</p>
</li>
<li><p>Output: After processing, the results can be stored in various destinations, such as a data lake (e.g., Hadoop HDFS), a data warehouse (e.g., BigQuery, Redshift), or other external systems. Spark provides connectors to these storage solutions for seamless data persistence</p>
</li>
</ul>
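<p>The following sketch puts these steps together with the Structured Streaming API: it reads the Kafka topic as a streaming DataFrame, parses the comma-delimited message, and aggregates entries and exits over a tumbling one-hour window. The topic, column positions, window size, and checkpoint path are assumptions for illustration only.</p>
<pre><code class="lang-python"># Sketch: Kafka source, parsing, and a tumbling window with Structured Streaming.
# Topic, column positions, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, window, sum as sum_

spark = SparkSession.builder.appName("StructuredStreamSketch").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "turnstile-stream")
       .load())

# Parse the comma-delimited message value into typed columns
parts = split(col("value").cast("string"), ",")
events = raw.select(
    parts.getItem(1).alias("station"),
    parts.getItem(4).cast("long").alias("entries"),
    parts.getItem(5).cast("long").alias("exits"),
    col("timestamp")  # Kafka record timestamp used for windowing
)

# Tumbling 1-hour window per station, tolerating 15 minutes of late data
hourly = (events
          .withWatermark("timestamp", "15 minutes")
          .groupBy("station", window("timestamp", "1 hour"))
          .agg(sum_("entries").alias("entries"), sum_("exits").alias("exits")))

# Write to the console for illustration; a real pipeline would use a data lake sink
query = (hourly.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/turnstile-checkpoint")
         .start())
query.awaitTermination()
</code></pre>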
<h3 id="benefits-of-a-kafka-and-spark-integration">Benefits of a Kafka and Spark Integration</h3>
<p>A Kafka and Spark integration enables us to build solutions with High Availability requirements due to the following features:</p>
<ul>
<li><p>Fault Tolerance: Kafka ensures that events are not lost even in the face of hardware failures, making it a reliable source of data</p>
</li>
<li><p>Scalability: Kafka scales horizontally, allowing you to handle increasing data volumes by adding more Kafka brokers</p>
</li>
<li><p>Flexibility: Spark Streaming's flexibility in data processing and windowing operations enables a wide range of real-time analytics</p>
</li>
<li><p>End-to-End Pipeline: By combining Kafka's ingestion capabilities with Spark's processing power, you can create end-to-end data streaming pipelines that handle real-time data ingestion, processing, and storage</p>
</li>
</ul>
<h3 id="supported-programming-languages">Supported Programming Languages</h3>
<p>When it comes to programming language support, both Kafka and Spark allow developers to choose the language that aligns best with their skills and project requirements.</p>
<ul>
<li><p>Kafka supports multiple programming languages, including Python, C#, and Java</p>
</li>
<li><p>Spark also supports multiple programming languages like PySpark (Python), Scala, and even R for data processing tasks. Additionally, Spark allows users to work with SQL-like expressions, making it easier to perform complex data transformations and analysis</p>
</li>
</ul>
<h4 id="sample-python-code-for-a-kafka-producer">Sample Python Code for a Kafka Producer</h4>
<p>This is a very simple implementation of a Kafka producer using Python as the programming language. This code does not consume a data feed from a provider. It only shows how a producer sends messages to a Kafka topic.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">from</span> kafka <span class="hljs-keyword">import</span> KafkaProducer
<span class="hljs-keyword">import</span> time
kafka_broker = <span class="hljs-string">"localhost:9092"</span>
<span class="hljs-comment"># Create a Kafka producer instance</span>
producer = KafkaProducer(
bootstrap_servers=kafka_broker, <span class="hljs-comment"># Replace with your Kafka broker's address</span>
value_serializer=<span class="hljs-keyword">lambda</span> v: str(v).encode(<span class="hljs-string">'utf-8'</span>)
)
<span class="hljs-comment"># Sample data message (comma-delimited)</span>
sample_message = <span class="hljs-string">"timestamp,station,turnstile_id,device_id,entry,exit,entry_datetime"</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># continue to run until the instance is shutdown</span>
<span class="hljs-keyword">while</span> <span class="hljs-keyword">True</span>:
<span class="hljs-comment"># Simulate generating a new data message. This data should come from the data provider</span>
data_message = sample_message + f<span class="hljs-string">"\n{int(time.time())},StationA,123,456,10,15,'2023-07-12 08:30:00'"</span>
<span class="hljs-comment"># Send the message to the Kafka topic</span>
producer.send(<span class="hljs-string">'turnstile-stream'</span>, value=data_message)
<span class="hljs-comment"># add logging information for tracking</span>
print(<span class="hljs-string">"Message sent:"</span>, data_message)
time.sleep(<span class="hljs-number">1</span>) <span class="hljs-comment"># Sending messages every second</span>
<span class="hljs-keyword">except</span> KeyboardInterrupt:
<span class="hljs-keyword">pass</span>
<span class="hljs-keyword">finally</span>:
producer.close()
</code></pre>
<p>This Kafka producer code initializes a producer and sends a sample message to the specified Kafka topic. Let's review each code segment:</p>
<ul>
<li><p>Create Kafka Producer Configuration:</p>
<ul>
<li>Set up the Kafka producer configuration</li>
<li>Specify the Kafka broker(s) to connect to <code>(bootstrap.servers)</code></li>
</ul>
</li>
<li><p>Define Kafka Topic: Define the Kafka topic to send messages (turnstile-stream in this example)</p>
</li>
<li><p>Create a Kafka Producer:</p>
<ul>
<li>Create an instance of the Kafka producer with the broker end-point</li>
<li>Use a <code>value_serializer</code> to encode the string message with unicode utf-8</li>
</ul>
</li>
<li><p>Define Message Contents:</p>
<ul>
<li>Prepare the message to send as a CSV string with the header and value information </li>
</ul>
</li>
<li><p>Produce Messages:</p>
<ul>
<li>Use the send method of the Kafka producer to send messages to the Kafka topic, turnstile-stream</li>
</ul>
</li>
<li><p>Close the Kafka Producer:</p>
<ul>
<li>Always remember to close the Kafka producer when the application terminates to avoid leaving open connections on the broker</li>
</ul>
</li>
</ul>
<h4 id="sample-python-code-for-a-kafka-consumer-and-spark-client">Sample Python Code for a Kafka Consumer and Spark Client</h4>
<p>After looking at the Kafka producer code, let's take a look at how a Kafka consumer on Spark would consume and process the messages.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.streaming <span class="hljs-keyword">import</span> StreamingContext
<span class="hljs-keyword">from</span> pyspark.streaming.kafka <span class="hljs-keyword">import</span> KafkaUtils
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> <span class="hljs-built_in">window</span>, sum
<span class="hljs-comment"># Create a Spark session</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"TurnstileStreamApp"</span>).getOrCreate()
<span class="hljs-comment"># Create a StreamingContext with a batch interval of 5 seconds</span>
ssc = StreamingContext(spark.sparkContext, <span class="hljs-number">5</span>)
kafka_broker = <span class="hljs-string">"localhost:9092"</span>
<span class="hljs-comment"># Define the Kafka broker and topic to consume from</span>
kafkaParams = {
<span class="hljs-string">"bootstrap.servers"</span>: kafka_broker, <span class="hljs-comment"># Replace with your Kafka broker's address</span>
<span class="hljs-string">"auto.offset.reset"</span>: <span class="hljs-string">"latest"</span>,
}
topics = [<span class="hljs-string">"turnstile-stream"</span>]
<span class="hljs-comment"># Create a Kafka stream</span>
kafkaStream = KafkaUtils.createDirectStream(ssc, topics, kafkaParams)
<span class="hljs-comment"># Parse the Kafka stream as a DataFrame</span>
lines = kafkaStream.map(lambda x: x[<span class="hljs-number">1</span>])
df = spark.read.csv(lines)
<span class="hljs-comment"># Define a window for aggregation (4-hour window)</span>
windowed_df = df
.withWatermark(<span class="hljs-string">"entry_datetime"</span>, <span class="hljs-string">"4 hours"</span>) <span class="hljs-string">\</span>
<span class="hljs-comment"># 4-hour window with a 4-hour sliding interval</span>
.groupBy(<span class="hljs-string">"station"</span>, <span class="hljs-built_in">window</span>(<span class="hljs-string">"entry_datetime"</span>, <span class="hljs-string">"4 hours"</span>))
.agg(
sum(<span class="hljs-string">"entries"</span>).alias(<span class="hljs-string">"entries"</span>),
sum(<span class="hljs-string">"exits"</span>).alias(<span class="hljs-string">"exits"</span>)
)
<span class="hljs-comment"># Write the aggregated data to a blob storage as compressed CSV files</span>
query = windowed_df.writeStream<span class="hljs-string">\</span>
.outputMode(<span class="hljs-string">"update"</span>)<span class="hljs-string">\</span>
.foreachBatch(lambda batch_df, batch_id: batch_df.write<span class="hljs-string">\</span>
.mode(<span class="hljs-string">"overwrite"</span>)<span class="hljs-string">\</span>
.csv(<span class="hljs-string">"gs://your-bucket-name/"</span>) <span class="hljs-comment"># Replace with your blob storage path</span>
)<span class="hljs-string">\</span>
.start()
query.awaitTermination()
</code></pre>
<p>This simple example shows how to write a Kafka consumer, use Spark to process and aggregate the data, and finally write a CSV file to the data lake. Let's look at each code segment for more details:</p>
<ul>
<li><p>Create a Spark Session: </p>
<ul>
<li>Initialize a Spark session with the name "TurnstileStreamApp"</li>
</ul>
</li>
<li><p>Create a StreamingContext:</p>
<ul>
<li>Set up a StreamingContext with a batch interval of 5 seconds. This determines how often Spark will process incoming data</li>
</ul>
</li>
<li><p>Define Kafka Broker and Topic:</p>
<ul>
<li>Specify the Kafka broker's address (localhost:9092 in this example) and the topic to consume data from ("turnstile-stream")</li>
</ul>
</li>
<li><p>Create a Kafka Stream:</p>
<ul>
<li>Use KafkaUtils to create a direct stream from Kafka</li>
<li>The stream will consume data from the specified Kafka topic</li>
</ul>
</li>
<li><p>Parse the Kafka Stream:</p>
<ul>
<li>Extract the message values from the Kafka stream</li>
<li>Read these messages into a DataFrame (<code>df</code>) using Spark's CSV reader</li>
</ul>
</li>
<li><p>Define a Window for Aggregation:</p>
<ul>
<li>We specify the watermark for late data using <code>withWatermark</code>. This ensures that any data arriving later than the specified window is still considered for aggregation</li>
<li>Create a windowed DataFrame (<code>windowed_df</code>) by grouping data based on "station" and a 4-hour window</li>
<li>The "entry_datetime" column is used as the timestamp for windowing</li>
<li>Aggregations are performed to calculate the sum of "entries" and "exits" within each window</li>
</ul>
</li>
<li><p>Write the Aggregated Data to a Data Lake:</p>
<ul>
<li>Define a streaming query (<code>query</code>) to write the aggregated data to a blob storage path</li>
<li>The "update" output mode indicates that only updated results will be written</li>
<li>A function is applied to each batch of data, which specifies how to write the data</li>
<li>In this case, data is written as compressed CSV files to a data lake location</li>
<li>The <code>awaitTermination</code> method ensures the query continues to run and process data until manually terminated.</li>
</ul>
</li>
</ul>
<p>This Spark example processes data from Kafka, aggregates it in 4-hour windows, and writes the results to blob storage. The code is structured to efficiently handle real-time streaming data and organize the output into folders in the data lake based on station names and time windows. In each folder, Spark generates filenames automatically based on the default naming convention. Typically, it uses a combination of a unique identifier and partition number to create filenames. The exact format of the file name might vary depending on the Spark version and configuration. This approach is used to send the information to a data lake, so the data transformation process can pick it up and send it to a data warehouse.</p>
<p>Alternatively, the Spark client can send the aggregated results directly to the data warehouse. The Spark client can connect to the data warehouse and insert the information directly, without using a data lake as a staging step.</p>
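<p>As a rough sketch of that alternative, the streaming query from the previous example could write each micro-batch directly to the warehouse, assuming the spark-bigquery connector is available on the cluster. The dataset, table, and staging bucket names are placeholders.</p>
<pre><code class="lang-python"># Sketch: write each processed micro-batch directly to the data warehouse.
# Assumes the spark-bigquery connector is on the classpath; names are placeholders.
def write_to_warehouse(batch_df, batch_id):
    (batch_df.write
     .format("bigquery")
     .option("table", "mta_data.turnstile_agg")           # hypothetical dataset.table
     .option("temporaryGcsBucket", "mta-staging-bucket")  # staging bucket used by the connector
     .mode("append")
     .save())

query = (windowed_df.writeStream
         .outputMode("update")
         .foreachBatch(write_to_warehouse)
         .start())
</code></pre>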
<h2 id="solution-design-and-architecture">Solution Design and Architecture</h2>
<p>For our solution strategy, we followed the design shown below. This design helps us ensure a smooth flow, efficient processing, and storage of data so that it is immediately available in our data warehouse and, consequently, in the visualization tools. Let's break down each component and explain its purpose.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-streaming-design.png" alt="ozkary-data-engineering-design-data-streaming" title="Data Engineering Process Fundamentals - Data Streaming Design"></p>
<h3 id="components">Components</h3>
<ul>
<li><p>Real-Time Data Source: This is an external data source, which continuously emits data as events or messages</p>
</li>
<li><p>Message Broker Layer: Our message broker layer is the central hub for data ingestion and distribution. It consists of two vital components:</p>
<ul>
<li>Kafka Broker Instance: Kafka acts as a scalable message broker and entry point for data ingestion. It efficiently collects and organizes data in topics from the source</li>
<li>Kafka Producer (Python): To bridge the gap between the data source and Kafka, we write a Python-based Kafka producer. This component is responsible for capturing data from the real-time source and forwarding it to the Kafka instance and corresponding topic</li>
</ul>
</li>
<li><p>Stream Processing Layer: The stream processing layer is where the messages from Kafka are processed, aggregated and sent to the corresponding data storage. This layer also consists of two key components: </p>
<ul>
<li>Spark Instance: Apache Spark, a high-performance stream processing framework, is responsible for processing and transforming data in real-time</li>
<li>Stream Consumer (Python): In order to consume the messages from a Kafka topic, we write a Python component that acts as both a Kafka Consumer and Spark application. <ul>
<li>The Kafka consumer retrieves data from the Kafka topic, ensuring that the data is processed as soon as it arrives</li>
<li>The Spark application processes the messages, aggregates the data, and saves the results in the data warehouse. This dual role ensures efficient data processing and storage.</li>
</ul>
</li>
</ul>
</li>
<li><p>Data Warehouse: As the final destination for our processed data, the data warehouse provides a reliable and structured repository for storing the results of our real-time data processing, so visualization tools like Looker and PowerBI can display the data as soon as the dashboards are refreshed</p>
</li>
</ul>
<blockquote>
<p>π We should note that dashboards query the data from the database. For near real-time data to be available, the dashboard data needs to be refreshed at certain intervals (e.g., minutes or hourly). To make real-time data available to the dashboard, there needs to be a live connection (socket) between the dashboard and the streaming platform, handled by another system component, like <a href="https://redis.com/">Redis Cache</a> or a custom service, that can push those events over a socket channel.</p>
</blockquote>
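<p>One way such a component could push events toward connected dashboards is through a publish/subscribe channel. The sketch below uses the redis-py client to publish a processed event; the channel name and payload shape are hypothetical.</p>
<pre><code class="lang-python"># Sketch: publish a processed event to a Redis pub/sub channel so a socket layer
# can forward it to dashboards. The channel name is hypothetical.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def push_live_update(station: str, entries: int, exits: int) -> None:
    payload = json.dumps({"station": station, "entries": entries, "exits": exits})
    # Subscribers (e.g., a WebSocket service) receive this message in real time
    r.publish("turnstile-live", payload)

# Example usage
# push_live_update("StationA", 10, 15)
</code></pre>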
<h3 id="devops-support">DevOps Support</h3>
<ul>
<li><p>Containerization: In order to continue to meet our DevOps requirements, enhance scalability and manageability, and follow enterprise-level best practices, we use Docker containers for all of our components. Each component, our Kafka and Spark instances as well as our two Python-based components, runs in a separate Docker container. This ensures modularity, easy deployment, and resource isolation</p>
<ul>
<li>This approach also enables us to use a Kubernetes cluster, a container orchestration platform that can help us manage and deploy Docker containers at scale, to run our components. We could use Minikube for local development or use a cloud-managed Kubernetes service like Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), or Azure Kubernetes Service (AKS)</li>
</ul>
</li>
<li><p>Virtual Machine (VM): Our components need to run on a VM, so we follow the same approach and create a VM instance using a Terraform script, similar to how it was done for our batch data pipeline during our planning and design phase</p>
</li>
</ul>
<h3 id="advantages">Advantages</h3>
<p>Our data streaming design offers several advantages:</p>
<ul>
<li>Real-time Processing: Data is processed as it arrives, enabling timely insights and rapid response to changing conditions</li>
<li>Scalability: The use of Kafka and Spark allows us to scale our architecture effortlessly to handle growing data volumes</li>
<li>Containerization: Docker containers simplify deployment and management, making our system highly portable and maintainable</li>
<li>Integration: The seamless integration of Kafka, Spark, and the Kafka consumer as a Spark client ensures data continuity and efficient processing</li>
</ul>
<p>This data streaming strategy, powered by Kafka and Spark, empowers us to unlock real-time insights from our data streams, providing valuable information for rapid decision-making, analytics, and storage. </p>
<h2 id="summary">Summary</h2>
<p>In today's data-driven landscape, data streaming solutions are an absolute necessity, enabling the rapid processing and analysis of vast amounts of real-time data. Technologies like Kafka and Spark play a pivotal role in empowering organizations to harness real-time insights from their data streams.</p>
<p>Kafka and Spark work together seamlessly to enable real-time data processing and analytics. Kafka handles the reliable ingestion of events, while Spark Streaming provides the tools for processing, transforming, analyzing, and storing the data in a data lake or data warehouse in near real-time, allowing businesses to make decisions at a much faster pace.</p>
<h2 id="exercise-data-streaming-with-apache-kafka">Exercise - Data Streaming with Apache Kafka</h2>
<p>Now that we have defined the data streaming strategy, we can continue our journey and build a containerized Kafka producer that can consume real-time data sources. Let's take a look at that next.</p>
<p>Coming soon!</p>
<blockquote>
<p>π Data Engineering Process Fundamentals - Data Streaming With Apache Kafka Exercise</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-68532005543889197762023-07-08T11:51:00.010-04:002023-08-25T10:40:45.107-04:00Data Engineering Process Fundamentals - Data Analysis and Visualization Exercise<p>Data analysis and visualization are fundamental to a data-driven decision-making process. To grasp the best strategy for our scenario, we delve into the data analysis and visualization phase of the process, making data models, analyzes and diagrams that allow us to tell stories from the data.</p>
<p>With the understanding of best practices for data analysis and visualization, we start by creating a code-based dashboard using Python, Pandas and Plotly. We then follow up by using a high-quality enterprise tool, such as Looker, to construct a low-code cloud-hosted dashboard, providing us with insights into the type of effort each method takes.</p>
<blockquote>
<p>π This is a dashboard created with Looker. Similar dashboards can be created with PowerBI and Tableau</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization-dashboard.png" alt="ozkary-data-engineering-analysis-visualization-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Dashboard"></p>
<p>Once we have designed our dashboard, we can align it with our initial requirements and proceed to formulate the data analysis conclusions, thereby facilitating informed business decisions for stakeholders. However, before delving into coding, let's commence by reviewing the data analysis specifications, which provide the blueprint for our implementation effort.</p>
<h2 id="specifications">Specifications</h2>
<p>At this stage of the process, we have a clear grasp of the requirements and a deep familiarity with the data. With these insights, we can now define our specifications as outlined below:</p>
<ul>
<li>Identify pertinent measures such as exits and entries</li>
<li>Conduct distribution analysis based on station<ul>
<li>This analysis delineates geographical boundary patterns</li>
</ul>
</li>
<li>Conduct distribution analysis based on days of the week and time slots</li>
</ul>
<p>By calculating the total count of passengers for arrivals and departures, we gain a holistic comprehension of passenger flow dynamics. Furthermore, we can employ distribution analysis to investigate variations across stations, days of the week, and time slots. These analyses provide essential insights for business strategy and decision-making, allowing us to identify peak travel periods, station preferences, and time-specific trends that can help us make informed decisions.</p>
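<p>These aggregations can be prototyped quickly with Pandas before building any dashboard. The sketch below assumes a DataFrame with the created_dt, station_name, entries, and exits columns used throughout this exercise; the file name is a placeholder.</p>
<pre><code class="lang-python"># Quick Pandas sketch of the totals and distribution analysis described above.
# Assumes created_dt, station_name, entries, and exits columns; file name is a placeholder.
import pandas as pd

df = pd.read_csv("turnstile_sample.csv", parse_dates=["created_dt"])

# Total passenger flow for the selected period
totals = df[["entries", "exits"]].sum()

# Distribution by station (top 10 busiest by entries)
by_station = df.groupby("station_name")[["entries", "exits"]].sum().nlargest(10, "entries")

# Distribution by day of the week
by_weekday = df.groupby(df["created_dt"].dt.day_name())[["entries", "exits"]].sum()

print(totals, by_station, by_weekday, sep="\n\n")
</code></pre>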
<h3 id="data-analysis-requirements">Data Analysis Requirements</h3>
<p>In our analysis process, we can adhere to these specified requirements:</p>
<ul>
<li>Determine distinct time slots for morning and afternoon analysis:<pre><code>12:00 AM - 3:59 AM
04:00 AM - 7:59 AM
08:00 AM - 11:59 AM
12:00 PM - 3:59 PM
04:00 PM - 7:59 PM
08:00 PM - 11:59 PM
</code></pre>
</li>
<li>Examine data regarding commuter exits (arrivals) and entries (departures)</li>
<li>Implement a master filter for date ranges, which exerts control over all charts</li>
<li>Incorporate a secondary filter component to facilitate station selection</li>
<li>Display the aggregate counts of entries and exits for the designated date range<ul>
<li>Employ score card components for this purpose</li>
</ul>
</li>
<li>Investigate station distributions to identify the most frequented stations<ul>
<li>Utilize donut charts, with the subway station name as the primary dimension</li>
</ul>
</li>
<li>Analyze distributions using the day of the week to unveil peak traffic days<ul>
<li>Employ bar charts to visualize entries and exits per day</li>
</ul>
</li>
<li>Explore distributions based on time slots to uncover daily peak hours<ul>
<li>Integrate bar charts to illustrate entries and exits within each time slot</li>
</ul>
</li>
</ul>
<h2 id="dashboard-design">Dashboard Design</h2>
<p>In the dashboard design, we can utilize a two-column layout, positioning the exits charts in the left column and the entries charts in the right column of the dashboard. Additionally, we can incorporate a header container to encompass the filters, date range, and station name. To support multiple devices, we need a responsive layout. We should note that a platform like Looker does not offer a truly responsive layout; instead, we define separate layouts for mobile and desktop.</p>
<p>Layout Configuration:</p>
<ul>
<li>Desktop 1200px by 900px</li>
<li>Mobile 360px by 1980px</li>
</ul>
<h3 id="ui-components">UI Components</h3>
<p>For our dashboard components, we should incorporate the following:</p>
<ul>
<li>Date range picker</li>
<li>Station name list box</li>
<li>For each selected measure (exits, entries), we should employ a set of the following components:<ul>
<li>Score cards for the total numbers</li>
<li>Donut charts for station distribution</li>
<li>Bar charts for day of the week distribution</li>
<li>Bar charts for time slot distribution</li>
</ul>
</li>
</ul>
<h2 id="review-the-code-code-centric">Review the Code - Code Centric</h2>
<p>The dashboard layout is done using HTML for the presentation and Python to build the different HTML elements using the <a href="https://dash.plotly.com/">Dash</a> library. All the charts are generated with Plotly.</p>
<pre><code class="lang-javascript"># <span class="hljs-type">Define</span> the layout <span class="hljs-keyword">of</span> the app
app.layout = html.<span class="hljs-type">Div</span>([
html.<span class="hljs-type">H4</span>(<span class="hljs-string">"MTA Turnstile Data Dashboard"</span>),
dcc.<span class="hljs-type">DatePickerRange</span>(
id=<span class="hljs-symbol">'date</span>-range',
start_date=data[<span class="hljs-symbol">'created_dt'</span>].min<span class="hljs-literal">()</span>,
end_date=data[<span class="hljs-symbol">'created_dt'</span>].max<span class="hljs-literal">()</span>,
display_format=<span class="hljs-symbol">'YYYY</span>-<span class="hljs-type">MM</span>-<span class="hljs-type">DD'</span>
),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dbc.<span class="hljs-type">Card</span>(
dbc.<span class="hljs-type">CardBody</span>([
html.<span class="hljs-type">P</span>(<span class="hljs-string">"Total Entries"</span>),
html.<span class="hljs-type">H5</span>(id=<span class="hljs-symbol">'total</span>-entries')
]),
className=<span class="hljs-symbol">'score</span>-card'
),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dbc.<span class="hljs-type">Card</span>(
dbc.<span class="hljs-type">CardBody</span>([
html.<span class="hljs-type">P</span>(<span class="hljs-string">"Total Exits"</span>),
html.<span class="hljs-type">H5</span>(id=<span class="hljs-symbol">'total</span>-exits')
]),
className=<span class="hljs-symbol">'score</span>-card'
),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'score</span>-cards'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'top</span>-entries-stations', className=<span class="hljs-symbol">'donut</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'top</span>-exits-stations', className=<span class="hljs-symbol">'donut</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'donut</span>-charts'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'exits</span>-by-day', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'entries</span>-by-day', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'bar</span>-charts'),
dbc.<span class="hljs-type">Row</span>([
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'exits</span>-by-time', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
),
dbc.<span class="hljs-type">Col</span>(
dcc.<span class="hljs-type">Graph</span>(id=<span class="hljs-symbol">'entries</span>-by-time', className=<span class="hljs-symbol">'bar</span>-chart'),
width=<span class="hljs-number">6</span>
)
], className=<span class="hljs-symbol">'bar</span>-charts')
])
</code></pre>
<p>The provided Python code is building a web application dashboard layout using Dash, a Python framework for creating interactive web applications. This dashboard is designed to showcase insights and visualizations derived from MTA Turnstile Data. Here's a breakdown of the main components:</p>
<ul>
<li><p>App Layout: The <code>app.layout</code> defines the overall structure of the dashboard using the <code>html.Div</code> component. It acts as a container for all the displayed components</p>
</li>
<li><p>Title: <code>html.H4("MTA Turnstile Data Dashboard")</code> creates a header displaying the title of the dashboard</p>
</li>
<li><p>Date Picker Range: The <code>dcc.DatePickerRange</code> component allows users to select a date range for analysis. It's a part of Dash Core Components (<code>dcc</code>)</p>
</li>
<li><p>Score Cards: The <code>dbc.Row</code> and <code>dbc.Col</code> components create rows and columns for displaying score cards using <code>dbc.Card</code> and <code>dbc.CardBody</code>. These cards show metrics like "Total Entries" and "Total Exits"</p>
</li>
<li><p>Donut Charts: Another set of <code>dbc.Row</code> and <code>dbc.Col</code> components creates columns for displaying donut charts using the <code>dcc.Graph</code> component. These charts visualize the distribution of top entries and exits by station</p>
</li>
<li><p>Bar Charts: Similar to the previous sections, <code>dbc.Row</code> and <code>dbc.Col</code> components are used to create columns for displaying bar charts using the <code>dcc.Graph</code> component. These charts showcase the distribution of exits and entries by day of the week and time slot</p>
</li>
<li><p>CSS Classnames: The <code>className</code> attribute is used to apply CSS class names to the components, allowing for custom styling using CSS</p>
</li>
</ul>
<p>In summary, the code establishes the layout of the dashboard with distinct sections for date selection, score cards, donut charts, and bar charts. The various visualizations and metrics offer valuable insights into MTA Turnstile Data, enabling users to comprehend passenger flow patterns and trends effectively.</p>
<pre><code class="lang-python">
def update_dashboard(start_date, end_date):
filtered_data = data[(data[<span class="hljs-string">'created_dt'</span>] >= start_date) & (data[<span class="hljs-string">'created_dt'</span>] <= end_date)]
total_entries = filtered_data[<span class="hljs-string">'entries'</span>].sum() / <span class="hljs-number">1e12</span> # <span class="hljs-symbol">Convert</span> to trillions
total_exits = filtered_data[<span class="hljs-string">'exits'</span>].sum() / <span class="hljs-number">1e12</span> # <span class="hljs-symbol">Convert</span> to trillions
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
filtered_data[<span class="hljs-string">"created_dt"</span>] = pd.to_datetime(filtered_data[<span class="hljs-string">'created_dt'</span>])
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
exits_chart , entries_chart = create_station_donut_chart(filtered_data)
exits_chart_by_day ,entries_chart_by_day = create_day_bar_chart(filtered_data, measures)
exits_chart_by_time, entries_chart_by_time = create_time_bar_chart(filtered_data, measures)
return (
f<span class="hljs-string">"{total_entries:.2f}T"</span>,
f<span class="hljs-string">"{total_exits:.2f}T"</span>,
entries_chart,
exits_chart,
exits_chart_by_day,
entries_chart_by_day,
exits_chart_by_time,
entries_chart_by_time
)
</code></pre>
<p>The <code>update_dashboard</code> function is responsible for updating and refreshing the dashboard. It handles the date range change event. As the user changes the date range, this function takes in the start and end dates as inputs. The function then filters the dataset, retaining only the records falling within the specified date range. Subsequently, the function calculates key metrics for the dashboard's score cards. It computes the total number of entries and exits during the filtered time period, and these values are converted to trillions for better readability.</p>
<p>The code proceeds to generate various visual components for the dashboard. These components include donut charts illustrating station-wise entries and exits, bar charts showcasing entries and exits by day of the week, and another set of bar charts displaying entries and exits by time slot. Each of these visualizations is created using specialized functions like create_station_donut_chart, create_day_bar_chart, and create_time_bar_chart.</p>
<p>Finally, the function compiles all the generated components and calculated metrics into a tuple. This tuple is then returned by the update_dashboard function, containing values like total entries, total exits, and the various charts. </p>
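<p>For the function to receive the date range and feed the layout components, it has to be registered as a Dash callback. The sketch below shows one way the wiring could look, using the component ids defined in the layout; the exact callback definition in the project may differ.</p>
<pre><code class="lang-python"># Sketch: wire update_dashboard to the layout components with a Dash callback.
# Component ids come from the layout above; the exact wiring may differ.
from dash import Input, Output

app.callback(
    Output("total-entries", "children"),
    Output("total-exits", "children"),
    Output("top-entries-stations", "figure"),
    Output("top-exits-stations", "figure"),
    Output("exits-by-day", "figure"),
    Output("entries-by-day", "figure"),
    Output("exits-by-time", "figure"),
    Output("entries-by-time", "figure"),
    Input("date-range", "start_date"),
    Input("date-range", "end_date"),
)(update_dashboard)
</code></pre>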
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_station_donut_chart</span><span class="hljs-params">(df: pd.DataFrame )</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
creates the station distribution donut chart
"""</span>
top_entries_stations = df.groupby(<span class="hljs-string">'station_name'</span>).agg({<span class="hljs-string">'entries'</span>: <span class="hljs-string">'sum'</span>}).nlargest(<span class="hljs-number">10</span>, <span class="hljs-string">'entries'</span>)
top_exits_stations = df.groupby(<span class="hljs-string">'station_name'</span>).agg({<span class="hljs-string">'exits'</span>: <span class="hljs-string">'sum'</span>}).nlargest(<span class="hljs-number">10</span>, <span class="hljs-string">'exits'</span>)
entries_chart = px.pie(top_entries_stations, names=top_entries_stations.index, values=<span class="hljs-string">'entries'</span>,
title=<span class="hljs-string">'Top 10 Stations by Entries'</span>, hole=<span class="hljs-number">0.3</span>)
exits_chart = px.pie(top_exits_stations, names=top_exits_stations.index, values=<span class="hljs-string">'exits'</span>,
title=<span class="hljs-string">'Top 10 Stations by Exits'</span>, hole=<span class="hljs-number">0.3</span>)
entries_chart.update_traces(marker=dict(colors=px.colors.qualitative.Plotly))
exits_chart.update_traces(marker=dict(colors=px.colors.qualitative.Plotly))
<span class="hljs-keyword">return</span> entries_chart, exits_chart
</code></pre>
<p>The <code>create_station_donut_chart</code> function is responsible for generating donut charts to visualize the distribution of entries and exits across the top stations. It starts by selecting the top stations based on the highest entries and exits from the provided DataFrame. Using Plotly Express, the function then constructs two separate donut charts: one for the top stations by entries and another for the top stations by exits.</p>
<p>Each donut chart provides a graphical representation of the distribution, where each station is represented by a segment based on the number of entries or exits it recorded. The charts are presented in a visually appealing manner with a central hole for a more focused view.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_day_bar_chart</span><span class="hljs-params">(df: pd.DataFrame, measures: List[str])</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
Creates a bar chart using the week days from the given dataframe.
"""</span>
measures = [<span class="hljs-string">'exits'</span>,<span class="hljs-string">'entries'</span>]
day_categories = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
group_by_date = df.groupby([<span class="hljs-string">"created_dt"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
df[<span class="hljs-string">'weekday'</span>] = pd.Categorical(df[<span class="hljs-string">'created_dt'</span>].dt.strftime(<span class="hljs-string">'%a'</span>),
categories=day_categories,
ordered=<span class="hljs-keyword">True</span>)
group_by_weekday = df.groupby(<span class="hljs-string">'weekday'</span>, as_index=<span class="hljs-keyword">False</span>)[measures].sum()
exits_chart_by_day = px.bar(group_by_weekday, x=<span class="hljs-string">'weekday'</span>, y=<span class="hljs-string">'exits'</span>, color=<span class="hljs-string">'weekday'</span>,
title=<span class="hljs-string">'Exits by Day of the Week'</span>, labels={<span class="hljs-string">'weekday'</span>: <span class="hljs-string">'Day of the Week'</span>, <span class="hljs-string">'exits'</span>: <span class="hljs-string">'Exits'</span>},
color_discrete_sequence=[<span class="hljs-string">'green'</span>])
entries_chart_by_day = px.bar(group_by_weekday, x=<span class="hljs-string">'weekday'</span>, y=<span class="hljs-string">'entries'</span>, color=<span class="hljs-string">'weekday'</span>,
title=<span class="hljs-string">'Entries by Day of the Week'</span>, labels={<span class="hljs-string">'weekday'</span>: <span class="hljs-string">'Day of the Week'</span>, <span class="hljs-string">'entries'</span>: <span class="hljs-string">'Entries'</span>},
color_discrete_sequence=[<span class="hljs-string">'orange'</span>])
<span class="hljs-comment"># Hide the legend on the side</span>
exits_chart_by_day.update_layout(showlegend=<span class="hljs-keyword">False</span>)
entries_chart_by_day.update_layout(showlegend=<span class="hljs-keyword">False</span>)
<span class="hljs-comment"># Return the chart</span>
<span class="hljs-keyword">return</span> exits_chart_by_day, entries_chart_by_day
</code></pre>
<p>The <code>create_day_bar_chart</code> function is responsible for generating bar charts that illustrate the distribution of data based on the day of the week. Due to the limitations of the date-time data type not inherently containing day information, the function maps the data to the corresponding day category. </p>
<p>To begin, the function calculates the sum of the specified measures (entries and exits) for each date in the DataFrame using group_by_date. Next, it creates a new column named 'weekday' that holds the abbreviated day names (Sun, Mon, Tue, etc.) by applying the strftime method to the 'created_dt' column. This column is then transformed into a categorical variable using predefined day categories, ensuring that the order of days is preserved.</p>
<p>Using the grouped data by 'weekday', the function constructs two separate bar charts using Plotly Express. One chart visualizes the distribution of exits by day of the week, while the other visualizes the distribution of entries by day of the week.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create_time_bar_chart</span><span class="hljs-params">(df: pd.DataFrame, measures : List[str] )</span> -> Tuple[go.Figure, go.Figure]:</span>
<span class="hljs-string">"""
Creates a bar chart using the time slot category
"""</span>
<span class="hljs-comment"># Define time (hr) slots</span>
time_slots = {
<span class="hljs-string">'12:00-3:59am'</span>: (<span class="hljs-number">0</span>, <span class="hljs-number">3</span>, <span class="hljs-number">0</span>),
<span class="hljs-string">'04:00-7:59am'</span>: (<span class="hljs-number">4</span>, <span class="hljs-number">7</span>, <span class="hljs-number">1</span>),
<span class="hljs-string">'08:00-11:59am'</span>: (<span class="hljs-number">8</span>, <span class="hljs-number">11</span>, <span class="hljs-number">2</span>),
<span class="hljs-string">'12:00-3:59pm'</span>: (<span class="hljs-number">12</span>, <span class="hljs-number">15</span>, <span class="hljs-number">3</span>),
<span class="hljs-string">'04:00-7:59pm'</span>: (<span class="hljs-number">16</span>, <span class="hljs-number">19</span>, <span class="hljs-number">4</span>),
<span class="hljs-string">'08:00-11:59pm'</span>: (<span class="hljs-number">20</span>, <span class="hljs-number">23</span>, <span class="hljs-number">5</span>)
}
<span class="hljs-comment"># Add a new column 'time_slot' based on time ranges</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">categorize_time</span><span class="hljs-params">(row)</span>:</span>
<span class="hljs-keyword">for</span> slot, (start, end, order) <span class="hljs-keyword">in</span> time_slots.items():
<span class="hljs-keyword">if</span> start <= row.hour <= end:
<span class="hljs-keyword">return</span> slot
df[<span class="hljs-string">'time_slot'</span>] = df[<span class="hljs-string">'created_dt'</span>].apply(categorize_time)
group_by_time = df.groupby(<span class="hljs-string">'time_slot'</span>, as_index=<span class="hljs-keyword">False</span>)[measures].sum()
<span class="hljs-comment"># Sort the grouped_data DataFrame based on the sorting value</span>
group_by_time_sorted = group_by_time.sort_values(by=[<span class="hljs-string">'time_slot'</span>], key=<span class="hljs-keyword">lambda</span> x: x.map({slot: sort_order <span class="hljs-keyword">for</span> slot, (_, _, sort_order) <span class="hljs-keyword">in</span> time_slots.items()}))
exits_chart_by_time = px.bar(group_by_time_sorted, x=<span class="hljs-string">'time_slot'</span>, y=<span class="hljs-string">'exits'</span>, color=<span class="hljs-string">'time_slot'</span>,
title=<span class="hljs-string">'Exits by Day of the Week'</span>, labels={<span class="hljs-string">'time_slot'</span>: <span class="hljs-string">'Time of Day'</span>, <span class="hljs-string">'exits'</span>: <span class="hljs-string">'Exits'</span>},
color_discrete_sequence=[<span class="hljs-string">'green'</span>])
entries_chart_by_time = px.bar(group_by_time_sorted, x=<span class="hljs-string">'time_slot'</span>, y=<span class="hljs-string">'entries'</span>, color=<span class="hljs-string">'time_slot'</span>,
title=<span class="hljs-string">'Entries by Day of the Week'</span>, labels={<span class="hljs-string">'time_slot'</span>: <span class="hljs-string">'Time of Day'</span>, <span class="hljs-string">'entries'</span>: <span class="hljs-string">'Entries'</span>},
color_discrete_sequence=[<span class="hljs-string">'orange'</span>])
<span class="hljs-comment"># Hide the legend on the side</span>
exits_chart_by_time.update_layout(showlegend=<span class="hljs-keyword">False</span>)
entries_chart_by_time.update_layout(showlegend=<span class="hljs-keyword">False</span>)
<span class="hljs-keyword">return</span> exits_chart_by_time, entries_chart_by_time
</code></pre>
<p>The <code>create_time_bar_chart</code> function is responsible for generating bar charts that depict the data distribution at specific times of the day. Just as with days of the week, the function maps and labels time ranges to create a new series, enabling the creation of these charts.</p>
<p>First, the function defines time slots using a dictionary, where each slot corresponds to a specific time range. For each data row, a new column named 'time_slot' is added based on the time ranges defined. This is achieved by using the categorize_time function, which checks the hour of the row's timestamp and assigns it to the appropriate time slot.</p>
<p>The data is then grouped by 'time_slot', and the sum of the specified measures (exits and entries) is calculated for each slot. To ensure that the time slots are displayed in the correct order, the grouped data is sorted based on a sorting value derived from the time slots' dictionary.</p>
<p>Using the grouped and sorted data, the function constructs two bar charts using Plotly Express. One chart visualizes the distribution of exits by time of day, while the other visualizes the distribution of entries by time of day. Each bar in the chart represents the sum of exits or entries for a specific time slot.</p>
<p>Once the implementation of this Python dashboard is complete, we can run it and see the following dashboard load in our browser:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-python-dash.png" alt="ozkary-data-engineering-analysis-visualization-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Python Dashboard"></p>
<h3 id="requirements">Requirements</h3>
<p>These are the requirements to be able to run the Python dashboard.</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step5-Analysis" target="_repo">Clone this repo</a> or copy the files from this folder. We could also create a GitHub CodeSpace and run this online.</p>
</blockquote>
<ul>
<li>Use the analysis_data.csv file for test data<ul>
<li>Use the local file for this implementation</li>
</ul>
</li>
<li>Install the Python dependencies <ul>
<li>Type the following from the terminal</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ pip <span class="hljs-keyword">install</span> pandas
$ pip <span class="hljs-keyword">install</span> plotly
$ pip <span class="hljs-keyword">install</span> dash
$ pip <span class="hljs-keyword">install</span> dash_bootstrap_components
</code></pre>
<h3 id="how-to-run-it">How to Run It</h3>
<p>After installing the dependencies and downloading the code, we should be able to run the code from a terminal by typing:</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">python3</span> dashboard.<span class="hljs-keyword">py</span>
</code></pre>
<p>We should note that this is a simple implementation to illustrate the amount of effort it takes to build a dashboard using code. The code uses a local CSV file. If we need to connect to the data warehouse, we need to expand this code to make an authorized API call to the data warehouse (see the sketch below). Writing Python dashboards or creating Jupyter charts works well for small teams that work closely together and run experiments on the data. However, for an enterprise solution, we should look at using a tool like Looker or PowerBI. Let's take a look at that next.</p>
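<p>For reference, this is a minimal sketch of that authorized warehouse call, assuming the data warehouse is BigQuery and the service account credentials are already configured; the project and view names below are placeholders:</p>
<pre><code class="lang-python">from google.cloud import bigquery
import pandas as pd

def load_data_from_warehouse(project_id: str, query: str) -> pd.DataFrame:
    """Run an authorized query against the data warehouse and return a DataFrame."""
    client = bigquery.Client(project=project_id)
    return client.query(query).to_dataframe()

# Hypothetical usage: replace the local CSV read in the dashboard code
# df = load_data_from_warehouse('ozkary-de-101', 'SELECT * FROM mta_data.rpt_turnstile')
</code></pre>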
<h2 id="review-the-code-low-code">Review the Code - Low-Code</h2>
<p>Tools like Looker and PowerBI excel in data visualization, requiring little to no coding. These tools offer a plethora of visual elements for configuring dashboards, minimizing the need for extensive coding. For instance, these platforms effortlessly handle tasks like automatically displaying the day of the week from a date-time field.</p>
<p>In cases where an out-of-the-box solution is lacking, we might need to supplement it with a code snippet. For instance, consider our time range requirement. Since this is quite specific to our project, we must generate a new series with our desired labels. To achieve this, we introduce a new field that corresponds to the date-time hour value. When the field is created, we are essentially implementing a function.</p>
<p>The provided code reads the hour value from the date-time field and maps it to its corresponding label.</p>
<pre><code class="lang-python">CASE
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">3</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">4</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">8</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">11</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">12</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">15</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">16</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">20</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">20</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">23</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59pm"</span>
<span class="hljs-keyword">END</span>
</code></pre>
<h3 id="requirements">Requirements</h3>
<p>The only requirement here is to sign up with Looker Studio and have access to a data warehouse or database that can serve data and is accessible from external sources.</p>
<blockquote>
<p>π <a href="https://lookerstudio.google.com/">Sign-up for Looker Studio</a></p>
</blockquote>
<p>Other visualization tools: </p>
<ul>
<li><a href="https://powerbi.microsoft.com/">PowerBI</a></li>
<li><a href="https://www.tableau.com/">Tableau</a></li>
</ul>
<h3 id="looker-ui">Looker UI</h3>
<p>Take a look at the image below. This is the Looker UI. We should familiarize ourselves with the following areas:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-looker-design.png" alt="ozkary-data-engineering-analysis-visualization-looker" title="Data Engineering Process Fundamentals - Analysis and Visualization Looker design"></p>
<ul>
<li>Theme and Layout: Use it to configure the theme and change the layout for mobile or desktop</li>
<li>Add data: Use this to add a new data source</li>
<li>Add a chart: This allows us to add new charts</li>
<li>Add a control: Here, we can add the date range and station name list</li>
<li>Canvas: This is where we place all the components</li>
<li>Setup Pane: This allows us to configure the date range, dimension, measures, and sorting settings</li>
<li>Style Pane: Here, we can configure the colors and font</li>
<li>Data Pane: This displays the data sources with their fields. New fields are created as functions. When we hover over a field, we can see a function (fx) icon, which indicates that we can edit the function and configure our snippet</li>
</ul>
<h3 id="how-to-build-it">How to Build it</h3>
<p>Sign up for a Looker account or use another BI tool and follow these steps:</p>
<ul>
<li>Create a new dashboard</li>
<li>Click on the "Add Data" button</li>
<li>Use the connector for our data source:<ul>
<li>This should allow us to configure the credentials for access</li>
<li>Select the "rpt_turnstile" view, which already includes joins with the fact_table and dimension tables</li>
</ul>
</li>
<li>Once the data is loaded, we can see the dimensions and measures</li>
<li>Add the dashboard filters:<ul>
<li>Include a date range control for the filter, using the "created_dt" field</li>
<li>Add a list control and associate it with the station name</li>
</ul>
</li>
<li>Proceed to add the remaining charts:<ul>
<li>Ensure that all charts are associated with the date range dimension</li>
<li>This enables filtering to cascade across all the charts</li>
</ul>
</li>
<li>Utilize the "entries" and "exits" measures for all dashboards:<ul>
<li>Integrate two scorecards for the sum of entries and exits</li>
<li>Incorporate a donut chart to display exits and entries distribution by stations</li>
<li>Incorporate two bar charts (entries and exits) and use the weekday value from the "created_dt" dimension<ul>
<li>Sort them by the weekday. Use the day number (0-6), not the name (Sun-Sat). This is achieved by adding a new field with the following code and using it for sorting:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-title">WEEKDAY</span><span class="hljs-params">(created_dt)</span></span>
</code></pre>
<ul>
<li>Create the time slot dimension field (click "Add Field" and enter this definition):</li>
</ul>
<pre><code class="lang-python">CASE
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">0</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">3</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">4</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">7</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">8</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">11</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59am"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">12</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">15</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"12:00-3:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">16</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">19</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"04:00-7:59pm"</span>
<span class="hljs-keyword">WHEN</span> <span class="hljs-keyword">HOUR</span>(created_dt) BETWEEN <span class="hljs-number">20</span> <span class="hljs-keyword">AND</span> <span class="hljs-number">23</span> <span class="hljs-keyword">THEN</span> <span class="hljs-string">"08:00-11:59pm"</span>
<span class="hljs-keyword">END</span>
</code></pre>
<ul>
<li>Add two bar charts (entries and exits) and use the time slot dimension:<ul>
<li>Use the hour value from the "created_dt" dimension for sorting by adding a new field and using it as your sorting criteria:</li>
</ul>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-title">HOUR</span><span class="hljs-params">(created_dt)</span></span>
</code></pre>
<h3 id="view-the-dashboard">View the Dashboard</h3>
<p>After following all the specifications, we should be able to preview the dashboard in the browser. We can load an example of a dashboard by clicking on the link below:</p>
<blockquote>
<p>π <a href="https://lookerstudio.google.com/reporting/94749e6b-2a1f-4b41-aff6-35c6c33f401e/">View the dashboard online</a></p>
<p>π <a href="https://lookerstudio.google.com/s/qv_IQAC-gKU">View the mobile dashboard online</a></p>
</blockquote>
<p>This is an image of the mobile dashboard.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-looker-mobile.png" alt="ozkary-data-engineering-analysis-visualization-mobile-dashboard" title="Data Engineering Process Fundamentals - Analysis and Visualization Mobile Dashboard"></p>
<h2 id="data-analysis-conclusions">Data Analysis Conclusions</h2>
<p>By examining the dashboard, we can draw the following conclusions:</p>
<ul>
<li>Stations with the highest distribution represent the busiest locations</li>
<li>The busiest time slot for both exits and entries is between 4pm and 9pm</li>
<li>Every day of the week exhibits a high volume of commuters</li>
<li>Businesses can choose the stations near their locations for further analysis</li>
</ul>
<p>With these insights, strategies can be devised to optimize marketing campaigns, targeting users within geo-fenced areas close to the corresponding business locations and during specific hours of the day.</p>
<h2 id="summary">Summary</h2>
<p>We utilize our expertise in data analysis and visualization to construct charts and build them into dashboards. We adopt two distinct approaches for dashboard creation: a code-centric method and a low-code enterprise solution like Looker. After a comprehensive comparison, we deduce that the code-centric approach is optimal for small teams, whereas it might not suffice for enterprise users, especially when targeting executive stakeholders.</p>
<p>Lastly, as the dashboard becomes operational, we transition into the role of business analysts, deciphering insights from the data. This enables us to offer answers aligned with our original requirements.</p>
<h2 id="next">Next</h2>
<p>We have successfully completed our data pipeline from CSV files to our data warehouse and dashboard. Now, let's explore an advanced concept in data engineering: data streaming, which facilitates real-time data integration. This involves the continuous and timely processing of incoming data. Technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://spark.apache.org/">Apache Spark</a> play a crucial role in enabling efficient data streaming processes. Let's take a closer look at these components next.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Real-Time Data]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-69055272830426215752023-07-01T11:49:00.008-04:002023-08-25T10:41:40.118-04:00Data Engineering Process Fundamentals - Data Analysis and Visualization<p>After completing our data warehouse design and implementation, our data pipeline should be fully operational. We can move forward with the analysis and visualization step of our process. Data analysis entails exploring, comprehending, and reshaping data to yield insights, thereby enabling stakeholders to make informed business decisions. Conversely, data visualization employs these insights to adeptly convey information via visual elements, encompassing charts and dashboards.</p>
<blockquote>
<p>π <a href="https://www.ozkary.dev/data-engineering-process-fundamentals-data-warehouse-transformation/">Data Engineering Process Fundamentals - Data Warehouse and Transformation</a></p>
</blockquote>
<p>Data analysis entails utilizing guidelines and patterns to guide the selection of appropriate analyses tailored to the specific use case. For instance, a Business Analyst (BA) focuses on examining data summations and aggregations across categorical dimensions such as date or station name. Conversely, a Manufacturing Quality Engineer (MQE) prioritizes the examination of statistical data, encompassing metrics like the mean and standard deviation.</p>
<p>In data visualization, we follow guidelines and design patterns to determine the appropriate chart for our data. For instance, a Business Intelligence (BI) dashboard may employ bar and pie charts to monitor sales performance in specific regions, while a Quality Control (QA) dashboard might utilize box plots, bell curves, and control charts to assess manufacturing process quality.</p>
<p>Data analysis and visualization are fundamental to a data-driven decision-making process. To grasp the best strategy for our scenario, we now dive deeper into this process by using a sample dataset from our data warehouse to illustrate the approach with examples.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-analysis-visualization-flow.png" alt="ozkary-data-engineering-analysis-visualization" title="Data Engineering Process Fundamentals - Analysis and Visualization"></p>
<h2 id="data-analysis">Data Analysis</h2>
<p>Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as analysis to identify outliers, trends, and distributions, and to perform hypothesis testing. We can accomplish these activities by writing code snippets with Python and Pandas, using Visual Studio Code or Jupyter Notebooks. What's more, we can use libraries such as Plotly to generate visuals that help us further analyze the data and create prototypes.</p>
<p>For low-code tools, the analysis can be done using a smart and rich user interface that automatically discovers the metadata to identify dataset properties like dimensions and measures. With little to no code, these tools can help us model the data and create charts and dashboards.</p>
<h3 id="data-profiling">Data Profiling</h3>
<p>Data profiling is the process of identifying the data types, dimensions, measures, and quantitative values, which allows the analyst to understand the characteristics of the data and decide how to group the information. </p>
<ul>
<li><p>Data Types: This is the type classification of the data fields. It enables us to identify categorical (text), numeric and date-time values. The date-time data type is especially important as it provides us with the ability to slice the numeric values with a date range, specific dates and times (e.g., hourly)</p>
</li>
<li><p>Dimensions: Dimensions are textual, and categorical attributes that describe business entities. They are often discrete and used for grouping, filtering, and organizing data</p>
</li>
<li><p>Measures: Measures are the quantitative values that are subject to calculations such as sum, average, minimum, maximum, etc. They represent the KPIs that the organization wants to track and analyze</p>
</li>
</ul>
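<p>Before running any aggregations, a quick way to surface these properties is to inspect the DataFrame schema and summary statistics. This is a minimal sketch using the sample dataset:</p>
<pre><code class="lang-python">import pandas as pd

df = pd.read_csv('./analysis_data.csv')

# Data types: categorical (object), numeric and date-time fields
print(df.dtypes)

# Candidate dimensions (text) and measures (numeric)
dimensions = df.select_dtypes(include='object').columns.tolist()
measures = df.select_dtypes(include='number').columns.tolist()
print('Dimensions:', dimensions)
print('Measures:', measures)

# Basic summary statistics for the measures
print(df[measures].describe())
</code></pre>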
<p>As an example of data profiling, we can inspect the average of arrivals and departures at certain time slots. This can help us identify patterns at different times.</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step5-Analysis" target="_repo">Clone this repo</a> or copy the files from this folder. Use Jupyter Notebook file.</p>
</blockquote>
<pre><code class="lang-python">
import pandas <span class="hljs-keyword">as</span> pd
# use the sample dataset in this path Step5-Analysis/analysis_data.csv
df = pd.read_csv(<span class="hljs-string">'./analysis_data.csv'</span>, iterator=False)
df.head(<span class="hljs-number">10</span>)
# Define time (hr) slots
time_slots = {
<span class="hljs-string">'morning'</span>: (<span class="hljs-number">8</span>, <span class="hljs-number">11</span>),
<span class="hljs-string">'afternoon'</span>: (<span class="hljs-number">12</span>, <span class="hljs-number">15</span>),
<span class="hljs-string">'night'</span>: (<span class="hljs-number">16</span>, <span class="hljs-number">20</span>)
}
# cast the date column <span class="hljs-keyword">to</span> datetime
df[<span class="hljs-string">"created_dt"</span>] = pd.to_datetime(df[<span class="hljs-string">'created_dt'</span>])
df[<span class="hljs-string">"exits"</span>] = df[<span class="hljs-string">"exits"</span>].astype(<span class="hljs-keyword">int</span>)
df[<span class="hljs-string">"entries"</span>] = df[<span class="hljs-string">"entries"</span>].astype(<span class="hljs-keyword">int</span>)
# Calculate average arrivals (exits) <span class="hljs-built_in">and</span> departures (entries) <span class="hljs-keyword">for</span> each time slot
<span class="hljs-keyword">for</span> slot, (start_hour, end_hour) in time_slots.<span class="hljs-built_in">items</span>():
slot_data = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= start_hour) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < end_hour)]
avg_arrivals = slot_data[<span class="hljs-string">'exits'</span>].mean()
avg_departures = slot_data[<span class="hljs-string">'entries'</span>].mean()
<span class="hljs-keyword">print</span>(<span class="hljs-keyword">f</span><span class="hljs-string">"{slot.capitalize()} - Avg Arrivals: {avg_arrivals:.2f}, Avg Departures: {avg_departures:.2f}"</span>)
# output
Morning - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">30132528.64</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">37834954.08</span>
Afternoon - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">30094161.08</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">37482421.78</span>
Night - Avg Arrival<span class="hljs-variable">s:</span> <span class="hljs-number">29513309.25</span>, Avg Departure<span class="hljs-variable">s:</span> <span class="hljs-number">36829260.66</span>
</code></pre>
<p>The code calculates the average arrivals and departures for each time slot. It prints out the results for each time slot, helping us identify the patterns of commuter activity during different times of the day.</p>
<h3 id="data-cleaning-and-preprocessing">Data Cleaning and Preprocessing</h3>
<p>Data cleaning and preprocessing is the process of finding bad data and outliers that can affect the results. Bad data could be null values or values that are not within the range of the average trend for that day. These kinds of data problems should have been identified during the data load process, but it is always a best practice to repeat this check, even when the data comes from a trusted source.</p>
<blockquote>
<p>πOutliers are values that are notably different from the other data points in terms of magnitude or distribution. They can be either unusually high (positive outliers) or unusually low (negative outliers) in comparison to the majority of data points.</p>
</blockquote>
<p>For example, we might want to look at stations where the average number of arrivals in the morning differs unusually from the average number of departures in the evening. In a normal pattern, both values should fall within a similar threshold.</p>
<pre><code class="lang-python"><span class="hljs-comment"># get the departures and arrivals for each station at the morning and night time slots</span>
df_morning_arrivals = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= time_slots[<span class="hljs-string">'morning'</span>][<span class="hljs-number">0</span>]) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < time_slots[<span class="hljs-string">'morning'</span>][<span class="hljs-number">1</span>])]
df_night_departures = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= time_slots[<span class="hljs-string">'night'</span>][<span class="hljs-number">0</span>] ) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour < time_slots[<span class="hljs-string">'night'</span>][<span class="hljs-number">1</span>])]
<span class="hljs-comment"># Calculate the mean arrivals and departures for each station</span>
mean_arrivals_by_station = df_morning_arrivals.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'exits'</span>].mean()
mean_departures_by_station = df_night_departures.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'entries'</span>].mean()
<span class="hljs-comment"># Calculate the z-scores for the differences between mean arrivals and departures</span>
z_scores = (mean_arrivals_by_station - mean_departures_by_station) / np.sqrt(mean_arrivals_by_station.<span class="hljs-keyword">var</span>() + mean_departures_by_station.<span class="hljs-keyword">var</span>())
<span class="hljs-comment"># Set a z-score threshold to identify outliers</span>
z_score_threshold = <span class="hljs-number">1.95</span> <span class="hljs-comment"># You can adjust this value based on your needs</span>
<span class="hljs-comment"># Identify stations with outliers</span>
outlier_stations = z_scores[abs(z_scores) > z_score_threshold]
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Stations with outliers:"</span>)
<span class="hljs-built_in">print</span>(outlier_stations)
<span class="hljs-comment"># output</span>
Stations <span class="hljs-keyword">with</span> outliers:
station_name
<span class="hljs-number">183</span> ST -<span class="hljs-number">3.170777</span>
BAYCHESTER AV -<span class="hljs-number">4.340479</span>
JACKSON AV -<span class="hljs-number">4.215668</span>
NEW LOTS <span class="hljs-number">3.124990</span>
</code></pre>
<p>The output shows that there is a significant difference in the number of arrivals (morning) at these stations compared to departures later in the evening. This issue could be a result of some missing data or perhaps an event that caused the difference in commuters.</p>
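<p>As a follow-up, this is a minimal cleaning sketch, assuming we decide to drop rows with missing values and exclude the flagged stations from further analysis; the right remediation ultimately depends on the root cause:</p>
<pre><code class="lang-python"># Drop rows with missing values in the key columns
df_clean = df.dropna(subset=['created_dt', 'station_name', 'entries', 'exits'])

# Exclude the stations flagged as outliers from further analysis
df_clean = df_clean[~df_clean['station_name'].isin(outlier_stations.index)]
print(f"Rows before: {len(df)}, after cleaning: {len(df_clean)}")
</code></pre>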
<h3 id="statistical-analysis">Statistical Analysis</h3>
<p>Statistical analysis focuses on applying statistical techniques in order to draw meaningful conclusions about a set of data. It involves mathematical computations, probability theory, correlation analysis, and hypothesis testing to make inferences and predictions based on the data. </p>
<p>An example of statistical analysis is to describe the statistics for the numeric data and plot the relationships between two measures.</p>
<pre><code class="lang-python"># <span class="hljs-symbol">Summary</span> statistics
measures = [<span class="hljs-string">'entries'</span>,<span class="hljs-string">'exits'</span>]
dims = [<span class="hljs-string">'station_name'</span>]
# <span class="hljs-symbol">Filter</span> rows for the month of <span class="hljs-symbol">July</span> for morning and night time slots
df_morning_july = df_morning_arrivals[df_morning_arrivals[<span class="hljs-string">'created_dt'</span>].dt.month == <span class="hljs-number">7</span>][measures + dims]
df_night_july = df_night_departures[df_night_departures[<span class="hljs-string">'created_dt'</span>].dt.month == <span class="hljs-number">7</span>][measures + dims]
correlation_data = []
for station in df_morning_july[<span class="hljs-string">'station_name'</span>].unique():
morning_arrival = df_morning_july[df_morning_july[<span class="hljs-string">'station_name'</span>] == station][<span class="hljs-string">'exits'</span>].values[<span class="hljs-number">0</span>]
evening_departure = df_night_july[df_night_july[<span class="hljs-string">'station_name'</span>] == station][<span class="hljs-string">'entries'</span>].values[<span class="hljs-number">0</span>]
correlation_data.append({<span class="hljs-string">'station_name'</span>: station, <span class="hljs-string">'arrivals'</span>: morning_arrival, <span class="hljs-string">'departures'</span>: evening_departure})
df_correlation = pd.<span class="hljs-symbol">DataFrame</span>(correlation_data)
# <span class="hljs-symbol">Select</span> top <span class="hljs-number">10</span> stations with most morning arrivals
top_stations = df_correlation.groupby(<span class="hljs-string">'station_name'</span>)[<span class="hljs-string">'arrivals'</span>].sum().nlargest(<span class="hljs-number">10</span>).index
df_top_stations = df_correlation[df_correlation[<span class="hljs-string">'station_name'</span>].isin(top_stations)]
print(<span class="hljs-string">"Summary Statistics:"</span>)
print(df_top_stations[measures].describe() / <span class="hljs-number">10000</span>)
#output
<span class="hljs-symbol">Summary</span> <span class="hljs-symbol">Statistics</span>:
entries exits
count <span class="hljs-number">10.000000</span> <span class="hljs-number">10.000000</span>
mean <span class="hljs-number">3691.269728</span> <span class="hljs-number">2954.513148</span>
std <span class="hljs-number">20853.999335</span> <span class="hljs-number">18283.964419</span>
min <span class="hljs-number">0.000000</span> <span class="hljs-number">0.000000</span>
<span class="hljs-number">25</span><span class="hljs-comment">% 27.126200 19.537525</span>
<span class="hljs-number">50</span><span class="hljs-comment">% 135.898650 100.470600</span>
<span class="hljs-number">75</span><span class="hljs-comment">% 615.586650 445.015200</span>
max <span class="hljs-number">214717.057100</span> <span class="hljs-number">212147.622600</span>
# <span class="hljs-symbol">Create</span> a scatter matrix to visualize relationships between numeric columns
fig_scatter = plotly_x.scatter(df_top_stations, x=<span class="hljs-string">'arrivals'</span>, y=<span class="hljs-string">'departures'</span>, color=<span class="hljs-string">'station_name'</span>,
title=<span class="hljs-string">'Morning Arrivals vs Evening Departures'</span>)
fig_scatter.show()
</code></pre>
<ul>
<li><code>df_top_stations.describe()</code> provides summary statistics for the numerical columns</li>
<li><code>plotly_x.scatter()</code> creates scatter plots to visualize relationships between numerical columns</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-analysis-visualization-jupyter-scatter-chart.png" alt="ozkary-data-engineering-analysis-visualization-jupyter" title="Data Engineering Process Fundamentals - Analysis and Visualization Jupyter Scatter Chart"></p>
<p>These statistics can help us identify trends, correlations, and relationships in our data, allowing us to gain insights and make informed decisions about further analysis or modeling.</p>
<blockquote>
<p>π Data correlation refers to the degree to which two or more variables change together. It indicates the strength and direction of the linear relationship between variables. The correlation coefficient is a value between -1 and 1. A 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation. A correlation coefficient near 0 suggests a weak or no linear relationship between the variables.</p>
</blockquote>
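<p>As a quick check of this definition, pandas can compute the pairwise Pearson correlation for our measures directly:</p>
<pre><code class="lang-python"># Pairwise Pearson correlation between morning arrivals and evening departures
print(df_correlation[['arrivals', 'departures']].corr())
</code></pre>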
<h4 id="hypothesis-testing">Hypothesis Testing</h4>
<p>In hypothesis testing, we use statistical methods to validate assumptions and draw conclusions. On the previous scatter chart, we can see that there appears to <strong>not be</strong> a strong correlation between arrivals and departures for the top 10 stations with most arrivals. This fact could be an area of interest for the analysis, and we may want to take a deeper look by running a test.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Perform Pearson correlation test</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_arrival_departure_correlation</span><span class="hljs-params">(df: pd.DataFrame, label: str)</span> -> <span class="hljs-keyword">None</span>:</span>
corr_coefficient, p_value = pearsonr(df[<span class="hljs-string">'arrivals'</span>], df[<span class="hljs-string">'departures'</span>])
p_value = round(p_value, <span class="hljs-number">5</span>)
<span class="hljs-keyword">if</span> p_value < <span class="hljs-number">0.05</span>:
conclusion = f<span class="hljs-string">"The correlation {label} is statistically significant."</span>
<span class="hljs-keyword">else</span>:
conclusion = f<span class="hljs-string">"The correlation {label} is not statistically significant."</span>
print(f<span class="hljs-string">"Pearson Correlation {label} - Coefficient : {corr_coefficient} P-Value : {p_value}"</span>)
print(f<span class="hljs-string">"Conclusion: {conclusion}"</span>)
test_arrival_departure_correlation(df_top_stations, <span class="hljs-string">'top-10 stations'</span>)
test_arrival_departure_correlation(df_correlation, <span class="hljs-string">'all stations'</span>)
<span class="hljs-comment"># output</span>
Pearson Correlation top<span class="hljs-number">-10</span> stations - Coefficient : <span class="hljs-number">-0.14112</span> P-Value : <span class="hljs-number">0.69738</span>
Conclusion: The correlation top<span class="hljs-number">-10</span> stations <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> statistically significant.
Pearson Correlation all stations - Coefficient : <span class="hljs-number">0.73803</span> P-Value : <span class="hljs-number">0.0</span>
Conclusion: The correlation all stations <span class="hljs-keyword">is</span> statistically significant.
</code></pre>
<p>Let's take a look at the output and explain what is going on. A correlation coefficient of -0.14 suggests a weak negative correlation between the variables being analyzed. The p-value of 0.69 is relatively high, which indicates that the correlation is not statistically significant. This means that a high number of arrivals in the morning does not correspond to a similar number of departures in the evening for the top-10 stations.</p>
<p>If we compare the entire data frame with all the stations, we see a correlation of .73 (close to 1) and a p-value of 0, which indicates a statistically significant correlation for the entire dataset. By looking at the full data sample, we can see that there is in fact a correlation: an increase in morning arrivals is associated with an increase in departures later in the day.</p>
<h3 id="business-intelligence-and-reporting">Business Intelligence and Reporting</h3>
<p>Business intelligence (BI) is a strategic approach that involves the collection, analysis, and presentation of data to facilitate informed decision-making within an organization. In the context of business analytics, BI is a powerful tool for extracting meaningful insights from data and turning them into actionable strategies. </p>
<p>A Business Analyst (BA) uses a systematic approach to uncover valuable insights from data. For example, by calculating the total number of passengers for arrivals and departures, we gain a comprehensive understanding of passenger flow dynamics. Furthermore, we can employ distribution analysis to investigate variations across stations, days of the week, and time slots. These analyses provide essential insights for business strategy and decision-making, allowing us to identify peak travel periods, station preferences, and time-specific trends that directly influence business operations.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Calculate total passengers for arrivals and departures</span>
total_arrivals = df[<span class="hljs-string">'exits'</span>].sum()
total_departures = df[<span class="hljs-string">'entries'</span>].sum()
print(f<span class="hljs-string">"Total Arrivals: {total_arrivals} Total Departures: {total_departures}"</span>)
<span class="hljs-comment"># output</span>
Total Arrivals: <span class="hljs-number">2954513147693</span> Total Departures: <span class="hljs-number">3691269727684</span>
<span class="hljs-comment"># Create distribution analysis by station</span>
df_by_station = df.groupby([<span class="hljs-string">"station_name"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
print(df_by_station.head(<span class="hljs-number">5</span>))
<span class="hljs-comment">#output</span>
station_name entries exits
<span class="hljs-number">0</span> <span class="hljs-number">1</span> AV <span class="hljs-number">41921835330</span> <span class="hljs-number">4723874242</span>
<span class="hljs-number">1</span> <span class="hljs-number">103</span> ST <span class="hljs-number">1701063755</span> <span class="hljs-number">1505114656</span>
<span class="hljs-number">2</span> <span class="hljs-number">104</span> ST <span class="hljs-number">60735889120</span> <span class="hljs-number">35317207533</span>
<span class="hljs-number">3</span> <span class="hljs-number">111</span> ST <span class="hljs-number">1856383672</span> <span class="hljs-number">840818137</span>
<span class="hljs-number">4</span> <span class="hljs-number">116</span> ST <span class="hljs-number">7419106031</span> <span class="hljs-number">8292936323</span>
<span class="hljs-comment"># Create distribution analysis by day of the week</span>
df_by_date = df.groupby([<span class="hljs-string">"created_dt"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
day_order = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
df_by_date[<span class="hljs-string">"weekday"</span>] = pd.Categorical(df_by_date[<span class="hljs-string">"created_dt"</span>].dt.strftime(<span class="hljs-string">'%a'</span>), categories=day_order, ordered=<span class="hljs-keyword">True</span>)
df_entries_by_date = df_by_date.groupby([<span class="hljs-string">"weekday"</span>], as_index=<span class="hljs-keyword">False</span>)[measures].sum()
print(df_entries_by_date.head(<span class="hljs-number">5</span>))
<span class="hljs-comment"># output</span>
weekday entries exits
<span class="hljs-number">0</span> Sun <span class="hljs-number">83869272617</span> <span class="hljs-number">53997290047</span>
<span class="hljs-number">1</span> Mon <span class="hljs-number">839105447014</span> <span class="hljs-number">667971771875</span>
<span class="hljs-number">2</span> Tue <span class="hljs-number">723988041023</span> <span class="hljs-number">592238758942</span>
<span class="hljs-number">3</span> Wed <span class="hljs-number">728728461351</span> <span class="hljs-number">594670413050</span>
<span class="hljs-number">4</span> Thu <span class="hljs-number">80966812864</span> <span class="hljs-number">51232966458</span>
<span class="hljs-comment"># Create distribution analysis time slots</span>
<span class="hljs-keyword">for</span> slot, (start_hour, end_hour) <span class="hljs-keyword">in</span> time_slots.items():
slot_data = df[(df[<span class="hljs-string">'created_dt'</span>].dt.hour >= start_hour) & (df[<span class="hljs-string">'created_dt'</span>].dt.hour <= end_hour)]
arrivals = slot_data[<span class="hljs-string">'exits'</span>].sum()
departures = slot_data[<span class="hljs-string">'entries'</span>].sum()
print(f<span class="hljs-string">"{slot.capitalize()} - Arrivals: {arrivals:.2f}, Departures: {departures:.2f}"</span>)
<span class="hljs-comment"># output</span>
Morning - Arrivals: <span class="hljs-number">494601773970.00</span>, Departures: <span class="hljs-number">619832037915.00</span>
Afternoon - Arrivals: <span class="hljs-number">493029769709.00</span>, Departures: <span class="hljs-number">615375337214.00</span>
Night - Arrivals: <span class="hljs-number">814729184132.00</span>, Departures: <span class="hljs-number">1008230417627.00</span>
</code></pre>
<p>BI analysis is important in helping us understand the data, which can then be communicated to stakeholders so that we can make decisions based on which information is more relevant to the organization.</p>
<h2 id="data-visualization">Data Visualization</h2>
<p>Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance. Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface.</p>
<p>Imagine a scenario where we have analyzed passenger data using Python and determined that certain stations experience higher passenger volumes during specific times of the day. Translating this into a dashboard, we can use donut graphs to show the distribution of passenger counts on stations, bar graphs to visualize passenger trends over different times of the day, and scorecards to highlight key metrics like total passengers.</p>
<p>Such a dashboard offers a comprehensive view of the data, enabling quick comparisons, trend identification, and actionable insights. Instead of sifting through numbers, stakeholders can directly observe the patterns, correlations, and anomalies, leading to informed decision-making. This visualization approach enhances communication, collaboration, and comprehension among teams, making it an essential tool for data-driven organizations.</p>
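<p>As an illustration, this is a minimal Plotly sketch of two of those components, reusing the station-level aggregation (<code>df_by_station</code>) from the analysis above:</p>
<pre><code class="lang-python">import plotly.express as px

# Donut chart: distribution of arrivals (exits) by station
donut = px.pie(df_by_station, names='station_name', values='exits',
               hole=0.4, title='Arrivals Distribution by Station')
donut.show()

# Scorecard-style metric: total passengers (arrivals)
total_arrivals = df_by_station['exits'].sum()
print(f"Total Arrivals: {total_arrivals}")
</code></pre>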
<h3 id="types-of-data-visualizations">Types of Data Visualizations</h3>
<p>There are a few terms that are used interchangeably when it comes to data visualization, but in reality there are subtle differences and specific uses between them. Let's review them in more detail.</p>
<ul>
<li><p>Chart: A chart is a visual representation that displays data points, trends, and patterns. It uses graphical elements such as bars, lines, or pie charts to depict data relationships. Charts are focused on illustrating specific data comparisons or distributions, making it easier for viewers to understand data at a glance. </p>
</li>
<li><p>Graph: A graph is a broader term that encompasses both charts and diagrams. It's used to represent data visually and can include a variety of visual elements, including nodes, edges, bars, lines, and more. Graphs are often used to showcase relationships and connections among various data points, allowing viewers to understand complex structures or networks</p>
</li>
<li><p>Report: A report is a structured document that provides a comprehensive overview of data analysis, findings, and insights. It typically includes a mix of textual explanations, tables, charts, and graphs. Reports are designed to convey detailed information and can be several pages long. They often include an executive summary, methodology, results, and recommendations</p>
</li>
<li><p>Dashboard: A dashboard is a visual display of key performance indicators (KPIs) and metrics that offers a real-time snapshot of business data. Dashboards consolidate multiple visual elements like charts, graphs, gauges, and tables onto a single screen. They are interactive and customizable, allowing users to monitor trends, and identify anomalies. Dashboards provide a quick and holistic view of business performance</p>
</li>
</ul>
<p>In summary, a chart is a specific type of visual representation focusing on data points, a graph represents broader data relationships, a report is a structured document presenting detailed analysis, and a dashboard is an interactive screen displaying real-time KPIs and metrics. Each serves a unique purpose in effectively communicating information to different types of audiences.</p>
<h3 id="dashboard-design-principles">Dashboard Design Principles</h3>
<p>Designing effective dashboards requires attention to several key principles to ensure clarity, usability, and the ability to convey insights. Here's a short list of essential Dashboard Design Principles:</p>
<ul>
<li><p>User-Centered Design: Understand our audience and their needs. Design the dashboard to provide relevant and actionable information to specific user roles; executives, for example, only want to see the big picture, not the details</p>
</li>
<li><p>Clarity and Simplicity: Keep the design clean and uncluttered. Use a simple layout, meaningful titles, and avoid unnecessary decorations </p>
</li>
<li><p>Consistency: Maintain a consistent design across all dashboard components. Use the same color schemes, fonts, and visual styles to create a cohesive experience</p>
</li>
<li><p>Master Filter: Include a master filter that allows users to select a date range, segment, or other criteria. This synchronizes data across all components, ensuring a unified view</p>
</li>
<li><p>Data Context and Relationships: Clearly label components and provide context to explain data relationships. Help users understand the significance of each element</p>
</li>
<li><p>Whitespace: Use whitespace effectively to separate components and enhance readability. Proper spacing reduces visual clutter</p>
</li>
<li><p>Real-Time Updates: If applicable, ensure that the dashboard provides real-time or near-real-time data updates for accurate decision-making</p>
</li>
<li><p>Mobile Responsiveness: Design the dashboard to be responsive across various devices and screen sizes, ensuring usability on both desktop and mobile</p>
</li>
<li><p>Testing and Iteration: Test the dashboard with actual users and gather feedback. Iterate on the design based on user insights and preferences</p>
</li>
</ul>
<p>Effective dashboard design not only delivers data but also tells a story. It guides users through insights, highlights trends, and supports data-driven decision-making. Applying these principles will help create dashboards that are intuitive, informative, and impactful.</p>
<h3 id="data-visualization-tools">Data Visualization Tools</h3>
<p>The data visualization tools can be divided into code-centric and low-code solutions. A code-centric solution involves writing programs to manage the data analysis and visuals. A low-code solution uses cloud-hosted tools that accelerate the data analysis and visualization. Instead of focusing on code, a low-code tool enables data professionals to focus on the data. Let's review some of these tools in more detail:</p>
<ul>
<li><p>Python, coupled with libraries like Plotly, offers a versatile platform for data visualization that comes with its own set of advantages and limitations. This code-centric approach enables data professionals to integrate data analysis and visualization seamlessly, and it is particularly suited for individual research, in-depth analysis, and presentations in a controlled setting</p>
</li>
<li><p><a href="https://lookerstudio.google.com/">Looker Studio</a> is a powerful low-code, cloud-hosted business intelligence and data visualization platform that empowers organizations to explore, analyze, and share insights from their data. It offers a user-friendly interface that allows users to create interactive reports, dashboards, and visualizations </p>
</li>
<li><a href="https://powerbi.microsoft.com/">Microsoft Power BI</a> is a widely-used low-code, cloud-hosted data visualization and business intelligence tool. It seamlessly integrates with other Microsoft tools and services, making it a popular choice for organizations already in the Microsoft ecosystem. Power BI offers an intuitive drag-and-drop interface for building interactive reports and dashboards. Its extensive library of visuals and custom visuals allows users to create compelling data representations</li>
<li><a href="https://www.tableau.com/">Tableau</a>, acquired by Salesforce, is renowned for its cloud-hosted and low-code data visualization capabilities. It provides users with an array of options for creating dynamic and interactive visuals. Tableau's "drag-and-drop" approach simplifies the process of connecting to various data sources and creating insightful dashboards.</li>
</ul>
<p>Each of these tools offers unique features and benefits, catering to different user preferences and organizational needs. Whether it's Looker's focus on data modeling, Power BI's integration with Microsoft products, or Tableau's flexibility and advanced analytics capabilities, these tools play a significant role in empowering users to unlock insights from their data.</p>
<h2 id="summary">Summary</h2>
<p>Data analysis involves meticulous exploration, transformation, and comprehension of raw data to identify meaningful insights. There are guidelines and design patterns to follow for each specific use case. A BA might focus on KPIs, while a QAE might focus on statistical analysis of process quality. These insights, however, find their true value through data visualization. A code-centric approach with Python, aided by Plotly, offers potent tools for crafting analyses and visuals, but a low-code, cloud-hosted platform is often the better choice for broader sharing and enterprise scenarios. </p>
<p>In conclusion, the synergy between data analysis and visualization is pivotal for data-driven projects. Navigating data analysis with established principles and communicating insights through visually engaging dashboards empowers us to extract value from data. Whether opting for code-centric or low-code solutions, the choice of tooling and platform hinges on the balance between team expertise and target audience.</p>
<h2 id="exercise-data-analysis-and-visualization">Exercise - Data Analysis and Visualization</h2>
<p>With a better understanding of the data analysis and visualization process, the next step is to put these concepts into practice through a hands-on exercise. In this lab, we can continue our data engineering process and create a dashboard that will meet the requirements established in the discovery phase. </p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/07/data-engineering-process-fundamentals-data-analysis-visualization-exercise.html">Data Engineering Process Fundamentals - Data Analysis and Visualization Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-64713372728681874152023-06-17T11:47:00.014-04:002023-08-04T11:34:05.895-04:00Data Engineering Process Fundamentals - Data Warehouse and Transformation Exercise<p>In this hands-on lab, we build upon our data engineering process where we previously focused on defining a data pipeline orchestration process. Now, we should focus on storing and making the data accessible for visualization and analysis. So far, our data is stored in a Data Lake, while Data Lakes excel at handling vast volumes of data, they are not optimized for query performance, so our step is to enable the bulk data processing and analytics by working on our Data Warehouse (DW).</p>
<p>During this exercise, we delve into the data warehouse design and implementation step, crafting robust data models, and designing transformation tasks. We explore how to efficiently load, cleanse, and merge data, ultimately creating dimension and fact tables. Additionally, we discuss areas like query performance, testability, and source control of our code, ensuring a reliable and scalable data solution. By leveraging incremental models, we continuously update our data warehouse with only the deltas (new updates), optimizing query performance and enhancing the overall data pipeline. By the end, we have a complete data pipeline, taking data from CSV to our data warehouse, equipped for seamless visualization and analysis.</p>
<h2 id="data-warehouse-design">Data Warehouse Design</h2>
<p>A data warehouse is an OLAP system, which serves as the central data repository for historical and aggregated data. In contrast to the ETL process employed by data lakes with Python code, a data warehouse relies on the ELT process. This fundamental distinction emphasizes the need for well-defined and optimized models within the database, enabling efficient data access and exceptional performance. </p>
<blockquote>
<p>π For the ETL process, the data is transformed before adding it to storage. For the ELT process, the data is first loaded in storage in its raw format, the transformation is then done before inserting into the dimension and fact tables.</p>
</blockquote>
<p>Before building the concrete tables, our initial focus is on creating precise data models based on thorough analysis and specific requirements. To achieve this, we leverage SQL (Structured Query Language) and tools that facilitate model development in an automated, testable, and repeatable manner. By incorporating such tools into our project, we build the data services area in which we manage the data modeling and transformation to expand our architecture into the following:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-architecture.png" alt="ozkary-data-engineering-data-warehouse-architecture" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Architecture"></p>
<blockquote>
<p>π For our use case, we are using <a href="https://cloud.google.com/bigquery/">Google BigQuery</a> as our data warehouse system. Make sure to review the Data Engineering Process - Design and Planning section and run the Terraform script to provision this resource.</p>
</blockquote>
<h3 id="external-tables">External Tables</h3>
<p>An external table is not physically hosted within the data warehouse database. Since our raw data is stored in a data lake, we can reference that location and load those files as an external table, using a file pattern to select all the compressed files as the source. </p>
<p>The following SQL can be executed as a query on the data warehouse. Access to the data lake should already be configured when the service accounts were assigned to the resources during the design and planning phase.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">OR</span> <span class="hljs-keyword">REPLACE</span> <span class="hljs-keyword">EXTERNAL</span> <span class="hljs-keyword">TABLE</span> mta_data.ext_turnstile
OPTIONS (
<span class="hljs-keyword">format</span> = <span class="hljs-string">'CSV'</span>,
uris = [<span class="hljs-string">'gs://ozkary_data_lake_ozkary-de-101/turnstile/*.csv.gz'</span>]
);
</code></pre>
<p>When this SQL script is executed, and the external table is created, the data warehouse retrieves the metadata about the external data, such as the schema, column names, and data types, without actually moving the data into the data warehouse storage. Once the external table is created, we can query the data using SQL as if it were a regular table. </p>
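<p>For example, a quick validation query against the external table can be run from Python with the BigQuery client; this is a minimal sketch that only assumes the <code>mta_data.ext_turnstile</code> table created above:</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# Validate the external table by counting the rows read from the data lake files
sql = "SELECT COUNT(*) AS total_rows FROM mta_data.ext_turnstile"
for row in client.query(sql).result():
    print(f"ext_turnstile rows: {row.total_rows}")
</code></pre>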
<h2 id="design-and-architecture">Design and Architecture</h2>
<p>During the design and architecture stage of our data warehouse project, our primary objective is to transition from conceptual ideas to concrete designs. Here, we make pivotal technical choices that pave the way for building the essential resources and defining our data warehouse approach. </p>
<h3 id="star-schema">Star Schema</h3>
<p>We start by selecting the Star Schema model. This model consists of a central fact table that is connected to multiple dimension tables via foreign key relationships. The fact table contains the measures or metrics, while the dimension tables hold descriptive attributes. </p>
<h3 id="infrastructure">Infrastructure</h3>
<p>For the infrastructure, we are using a cloud hosted OLAP system, Google BigQuery. This is a system that can handle petabytes of data. It also provides MPP (Massively Parallel Processing), built-in indexing and caching, which improve query performance and reduce compute by caching query results. The serverless architecture of these systems helps us reduce cost. Because the system is managed by the cloud provider, we can focus on the data analysis instead of infrastructure management.</p>
<h3 id="technology-stack">Technology Stack</h3>
<p>For the technology stack, we are using a SQL-centric approach. We want to be able to manage our models and transformation tasks within the memory context and processing power of the database, which tends to work best for large datasets and faster processing. In addition, this approach fits well with batch processing.</p>
<p><a href="https://www.getdbt.com/">dbt</a> (data build tool) is a SQL-centric framework which at its core is primarily focused on transforming data using SQL-based queries. It allows us to define data models and transformation logic using SQL and Jinja, a templating language with data transformation capabilities, such as loops, conditionals, and macros, within our SQL code. This framework enables us to build the actual data models as views, tables and SQL based transformation that are hosted on the data warehouse. </p>
<p>As we build code for our data model and transformation tasks, we need to track it, manage the different versions and automate the deployments to our database. To manage this, we use <a href="https://github.com/">GitHub</a>, which is a web-based platform that provides version control and collaborative features for software development and management. It also provides CI/CD capabilities to help us execute test plans, build releases and deploy them. dbt connects with GitHub to manage deployments. This enables the dbt orchestration features to run the latest code as part of the pipeline. </p>
<blockquote>
<p>π A deployment consists of getting the latest model metadata, building it on the database, and running the incremental data tasks when new data is available in the data lake.</p>
</blockquote>
<h2 id="data-warehouse-implementation">Data Warehouse Implementation</h2>
<p>The data warehouse implementation is the stage where the conceptual data model and design plans are transformed into a functional system by implementing the data models and writing the code for our transformation tasks.</p>
<h3 id="data-modeling">Data Modeling</h3>
<p>Data modeling is the implementation of the structure of the data warehouse, creating models (views) and entities (tables), defining attributes (columns), and establishing data relationships to ensure efficient querying and reporting. It is also important to identify the primary keys, foreign keys, and indexes to improve data retrieval performance. </p>
<p>To build our models, we should follow these specifications:</p>
<ul>
<li>Create an external table using the Data Lake folder and *.csv.gz file pattern as a source<ul>
<li>ext_turnstile</li>
</ul>
</li>
<li>Create the staging models<ul>
<li>Create the station view (stg_station) from the external table as source<ul>
<li>Get the unique stations </li>
<li>Create a surrogate key using the station name </li>
</ul>
</li>
<li>Create the booth view (stg_booth) from the external table as source <ul>
<li>Create a surrogate key using the booth UNIT and CA fields </li>
</ul>
</li>
<li>Create the fact view (stg_turnstile) from the external table as source<ul>
<li>Create a surrogate key using CA, UNIT, SCP, DATE, time</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="data-transformation">Data Transformation</h3>
<p>The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, applying naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.</p>
<p>For our transformation services, we follow these specifications:</p>
<ul>
<li>Use the staging models to build the physical models<ul>
<li>Map all the columns to our naming conventions, lowercase with underscores between words</li>
<li>Create the station dimension table (dim_station) from the stg_station model <ul>
<li>Add incremental strategy for ongoing new data </li>
</ul>
</li>
<li>Create the booth dimension table (dim_booth) from the stg_booth model <ul>
<li>Add incremental strategy for ongoing new data </li>
<li>Use the station_name to get the foreign key, station_id</li>
<li>Cluster the table by station_id </li>
</ul>
</li>
<li>Create the fact table (fact_turnstile) from the stg_turnstile model<ul>
<li>Add incremental strategy for ongoing new data </li>
<li>Partition the table by created_dt and use day granularity</li>
<li>Cluster the table by station_id</li>
<li>Join on dimension tables to use id references instead of text</li>
</ul>
</li>
</ul>
</li>
<li>Remove rows with null values for the required fields<ul>
<li>Station, CA, UNIT, SCP, DATE, TIME</li>
</ul>
</li>
<li>Cast columns to the correct data types<ul>
<li>created</li>
</ul>
</li>
<li>Continuously run all the models with an incremental strategy to append new records</li>
</ul>
<p>Our physical data model should look like this:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-star-schema.png" alt="ozkary-data-engineering-data-warehouse-star-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Star Schema"></p>
<h4 id="why-do-we-use-partitions-and-cluster">Why do we use partitions and cluster</h4>
<blockquote>
<p>π We should always review the technical specifications of the database system to find out what other best practices are recommended to improve performance.</p>
</blockquote>
<ul>
<li><p>Partitioning is the process of dividing a large table into smaller, more manageable parts based on a specified column. Each partition contains rows that share a common value, like a specific date. Partitioning improves performance and reduces query cost</p>
</li>
<li><p>When we run a query in BigQuery, it gets executed by a distributed computing infrastructure that spans multiple machines. Clustering is an optional feature in BigQuery that allows us to organize the data within each partition. The purpose of clustering is to physically arrange data within a partition in a way that is conducive to efficient query processing (see the SQL sketch after this list)</p>
</li>
</ul>
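<p>The following BigQuery DDL sketch shows what a table partitioned by day and clustered by station looks like. In our project, dbt generates the equivalent statement from the model configuration shown later, so this is for illustration only; the column types are assumptions based on the casts used in the staging models.</p>
<pre><code class="lang-sql">-- Illustrative DDL for a partitioned and clustered fact table
CREATE TABLE IF NOT EXISTS mta_data.fact_turnstile
(
  log_id      STRING,
  station_id  STRING,
  booth_id    STRING,
  scp         STRING,
  line_name   STRING,
  division    STRING,
  created_dt  TIMESTAMP,
  entries     INT64,
  exits       INT64
)
PARTITION BY DATE(created_dt)
CLUSTER BY station_id;
</code></pre>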
<h4 id="sql-server-and-big-query-concept-comparison">SQL Server and Big Query Concept Comparison</h4>
<ul>
<li><p>In SQL Server, a clustered index defines the physical order of data in a table. In BigQuery, clustering refers to the organization of data within partitions based on one or more columns. Clustering in BigQuery does not impact the physical storage order like a clustered index in SQL Server</p>
</li>
<li><p>Both SQL Server and BigQuery support table partitioning. The purpose is similar, allowing for better data management and performance optimization </p>
</li>
</ul>
<h2 id="install-system-requirements-and-frameworks">Install System Requirements and Frameworks</h2>
<p>Before looking at the code, we need to set up our environment with all the necessary dependencies, so we can build our models.</p>
<h3 id="requirements">Requirements</h3>
<blockquote>
<p>π Verify that there are files on the data lake. If not, run the data pipeline process to download the files into the data lake.</p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step4-Data-Warehouse" target="_repo">Clone this repo</a> or copy the files from this folder, dbt and sql.</p>
</blockquote>
<ul>
<li>Must have CSV files in the data lake</li>
<li>Create a <a href="https://www.getdbt.com/">dbt</a> cloud account <ul>
<li>Link dbt with your GitHub project (Not needed when running locally)</li>
<li>Create a scheduled job on dbt Cloud for every Saturday at 9am</li>
<li>Or install locally (VM) and run from CLI</li>
</ul>
</li>
<li>GitHub account</li>
<li>Google BigQuery resource </li>
</ul>
<h4 id="configure-the-cli">Configure the CLI</h4>
<h5 id="install-dbt-core-and-bigquery-dependencies">Install dbt core and BigQuery dependencies</h5>
<p>Run these commands from the Step4-Data-Warehouse/dbt folder to install the dependencies and initialize the project.</p>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd Step4-Data-Warehouse/dbt
<span class="hljs-variable">$ </span>pip install dbt-core dbt-bigquery
<span class="hljs-variable">$ </span>dbt init
<span class="hljs-variable">$ </span>dbt deps
</code></pre>
<h5 id="create-a-profile-file">Create a profile file</h5>
<p>Create the profiles file in the ~/.dbt folder of your home directory by running the following commands.</p>
<pre><code class="lang-bash">$ cd ~
$ mkdir <span class="hljs-selector-class">.dbt</span>
$ cd <span class="hljs-selector-class">.dbt</span>
$ touch profiles<span class="hljs-selector-class">.yml</span>
$ nano profiles.yml
</code></pre>
<ul>
<li>Paste the profiles file content</li>
</ul>
<blockquote>
<p>π Use your dbt cloud project information and cloud key file</p>
</blockquote>
<ul>
<li>Run this command to see the project configuration folder location</li>
</ul>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">debug</span> <span class="hljs-comment">--config-dir</span>
</code></pre>
<ul>
<li>Update the content of the file to match your project information</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-symbol">Analytics:</span>
<span class="hljs-symbol"> outputs:</span>
<span class="hljs-symbol"> dev:</span>
<span class="hljs-symbol"> dataset:</span> mta_data
<span class="hljs-symbol"> job_execution_timeout_seconds:</span> <span class="hljs-number">300</span>
<span class="hljs-symbol"> job_retries:</span> <span class="hljs-number">1</span>
<span class="hljs-symbol"> keyfile:</span> <span class="hljs-meta-keyword">/home/</span>.gcp/your-file.json
<span class="hljs-symbol"> location:</span> us-east1
<span class="hljs-symbol"> method:</span> service-account
<span class="hljs-symbol"> priority:</span> interactive
<span class="hljs-symbol"> project:</span> your-gcp-project
<span class="hljs-symbol"> threads:</span> <span class="hljs-number">2</span>
<span class="hljs-symbol"> type:</span> bigquery
<span class="hljs-symbol"> target:</span> dev
</code></pre>
<h5 id="validate-the-project-configuration">Validate the project configuration</h5>
<p>This should list all the assets that will be generated in the project, including the constraints.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">list</span> <span class="hljs-comment">--profile Analytics</span>
</code></pre>
<h2 id="review-the-code">Review the Code</h2>
<p>With a dev environment ready and clear specifications about how to build the models and our transformations, we can now look at the code and review the approach. We can use <a href="https://code.visualstudio.com/">Visual Studio Code</a> or a similar tool to edit the source code and open a terminal to run the CLI commands.</p>
<p>Start by navigating to the dbt project folder.</p>
<pre><code class="lang-bash">$ cd Step4-<span class="hljs-meta">Data</span>-Warehouse/dbt
</code></pre>
<p>Project tree:</p>
<pre><code>- dbt
 │
 ├── models
 │   │
 │   ├── core
 │   │   ├── schema.yml
 │   │   ├── dim_booth.sql
 │   │   ├── dim_station.sql
 │   │   ├── fact_turnstile.sql
 │   │   ├── ...
 │   ├── staging
 │   │   ├── schema_*.yml
 │   │   ├── stg_booth.sql
 │   │   ├── stg_station.sql
 │   │   ├── stg_turnstile.sql
 │   │   ├── ...
 │   ├── target
 │   │   ├── compile
 │   │   ├── run
 │   │   ├── ...
 ├── dbt_project.yml
</code></pre><p>The dbt folder contains the SQL-based source code. The staging folder contains the view definitions. The core folder contains the table definitions. The schema files in those folders have test rules and data constraints that are used to validate the models. This is how we are able to test our models. </p>
<p>The schema.yml files are used as configurations to define the schema of the final output of the models. It provides the ability to explicitly specify the column names, data types, and other properties of the resulting table created by each dbt model. This file allows dbt to generate the appropriate SQL statements for creating or altering tables in the target data warehouse.</p>
<blockquote>
<p>π All these files are executed using the dbt CLI. The files are compiled into SQL statements that are deployed to the database or just executed in memory to run the test, validation and insert scripts. The compiled SQL is stored in the target folder and these are assets deployed to the database. The transformation tasks are compiled into the run folder and are only executed on the database.</p>
</blockquote>
<h3 id="lineage">Lineage</h3>
<p>Data lineage is the documentation and tracking of the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes. In this case, we show how the external table is the source for the fact table and the dimension table dependencies.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-lineage.png" alt="ozkary-data-engineering-data-warehouse-lineage" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Lineage"></p>
<h3 id="staging-data-models-views">Staging Data Models - Views</h3>
<p>We use the view strategy to build our staging models. When these files are executed (via CLI commands), the SQL DDL (Data Definition Language) is generated and deployed to the database, essentially building the views. We also add a test parameter to limit the number of rows to 100 during the development process only. This is removed when it is deployed. Notice how the Jinja directives are enclosed in double curly braces {{ }} and handle conditional logic, configure the build process, or call user-defined functions.</p>
<blockquote>
<p>π DDL (Data Definition Language) is used to create objects. DML (Data Manipulation Language) is used to query the data.</p>
</blockquote>
<ul>
<li>stg_station.sql</li>
</ul>
<pre><code class="lang-sql">{{ config(materialized=<span class="hljs-string">'view'</span>) }}
<span class="hljs-function">with stations <span class="hljs-title">as</span>
(<span class="hljs-params">
<span class="hljs-keyword">select</span>
Station,
row_number(</span>) <span class="hljs-title">over</span>(<span class="hljs-params">partition <span class="hljs-keyword">by</span> Station</span>) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> </span>{{ source(<span class="hljs-string">'staging'</span>,<span class="hljs-string">'ext_turnstile'</span>) }}
<span class="hljs-keyword">where</span> Station <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
-- create a unique key based <span class="hljs-keyword">on</span> the station name
{{ dbt_utils.generate_surrogate_key([<span class="hljs-string">'Station'</span>]) }} <span class="hljs-keyword">as</span> station_id,
Station <span class="hljs-keyword">as</span> station_name
<span class="hljs-keyword">from</span> stations
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
-- use is_test_run <span class="hljs-literal">false</span> to disable the test limit
-- dbt build --m <model.sql> --<span class="hljs-keyword">var</span> <span class="hljs-string">'is_test_run: false'</span>
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">var</span>(<span class="hljs-params"><span class="hljs-string">'is_test_run'</span>, <span class="hljs-keyword">default</span>=<span class="hljs-literal">true</span></span>) %}
limit 100
</span>{% endif %}
</code></pre>
<ul>
<li>stg_booth.sql</li>
</ul>
<pre><code class="lang-sql">{{ config(materialized=<span class="hljs-string">'view'</span>) }}
<span class="hljs-function">with booths <span class="hljs-title">as</span>
(<span class="hljs-params">
<span class="hljs-keyword">select</span>
UNIT,
CA,
Station,
row_number(</span>) <span class="hljs-title">over</span>(<span class="hljs-params">partition <span class="hljs-keyword">by</span> UNIT, CA</span>) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> </span>{{ source(<span class="hljs-string">'staging'</span>,<span class="hljs-string">'ext_turnstile'</span>) }}
<span class="hljs-keyword">where</span> Unit <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span> and CA <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span> and Station <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
-- create a unique key
{{ dbt_utils.generate_surrogate_key([<span class="hljs-string">'UNIT'</span>, <span class="hljs-string">'CA'</span>]) }} <span class="hljs-keyword">as</span> booth_id,
UNIT <span class="hljs-keyword">as</span> remote,
CA <span class="hljs-keyword">as</span> booth_name,
Station <span class="hljs-keyword">as</span> station_name
<span class="hljs-keyword">from</span> booths
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
-- dbt build --m <model.sql> --<span class="hljs-keyword">var</span> <span class="hljs-string">'is_test_run: false'</span>
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">var</span>(<span class="hljs-params"><span class="hljs-string">'is_test_run'</span>, <span class="hljs-keyword">default</span>=<span class="hljs-literal">true</span></span>) %}
limit 100
</span>{% endif %}
</code></pre>
<ul>
<li>stg_turnstile.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized='view') }}
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span>
(
select
CA,
UNIT,
STATION,
concat(CA,UNIT,SCP) <span class="hljs-keyword">as</span> REF,
SCP,
LINENAME,
DIVISION,
concat(<span class="hljs-built_in">log</span>.DATE,<span class="hljs-string">" "</span>, <span class="hljs-built_in">log</span>.TIME) <span class="hljs-keyword">as</span> CREATED,
ENTRIES,
EXITS,
row_number() <span class="hljs-keyword">over</span>(partition <span class="hljs-keyword">by</span> CA, UNIT, SCP, DATE, TIME) <span class="hljs-keyword">as</span> rn
<span class="hljs-keyword">from</span> {{ source('staging','ext_turnstile') }} <span class="hljs-keyword">as</span> <span class="hljs-built_in">log</span>
<span class="hljs-keyword">where</span> Station <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null <span class="hljs-keyword">and</span> DATE <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null <span class="hljs-keyword">and</span> TIME <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null
)
select
<span class="hljs-comment">-- create a unique key </span>
{{ dbt_utils.generate_surrogate_key(['REF', 'CREATED']) }} <span class="hljs-keyword">as</span> log_id,
CA <span class="hljs-keyword">as</span> booth,
UNIT <span class="hljs-keyword">as</span> remote,
STATION <span class="hljs-keyword">as</span> station,
<span class="hljs-comment">-- unit and line information</span>
SCP <span class="hljs-keyword">as</span> scp,
LINENAME AS line_name,
DIVISION AS division,
<span class="hljs-comment">-- timestamp</span>
cast(CREATED <span class="hljs-keyword">as</span> timestamp) <span class="hljs-keyword">as</span> created_dt,
<span class="hljs-comment">-- measures</span>
cast(entries <span class="hljs-keyword">as</span> <span class="hljs-built_in">integer</span>) <span class="hljs-keyword">as</span> entries,
cast(exits <span class="hljs-keyword">as</span> <span class="hljs-built_in">integer</span>) <span class="hljs-keyword">as</span> exits
<span class="hljs-keyword">from</span> turnstile
<span class="hljs-keyword">where</span> rn = <span class="hljs-number">1</span>
<span class="hljs-comment">-- dbt build --m <model.sql> --var 'is_test_run: false'</span>
{% <span class="hljs-keyword">if</span> var('is_test_run', default=<span class="hljs-literal">true</span>) %}
limit <span class="hljs-number">100</span>
{% endif %}
</code></pre>
<h3 id="physical-data-models-tables">Physical Data Models - Tables</h3>
<p>We use the incremental strategy to build our tables. This enables us to continuously append data to our tables when there is new information. This strategy creates both DDL and DML scripts, which build the tables and also merge the new data into them. </p>
<p>We use the models (views) to build the actual tables. When these scripts are executed (via CLI commands), the process checks if the object exists; if it does not exist, it creates it. It then reads the data from the views using CTEs (common table expressions) and appends all the records that are not already in the table.</p>
<ul>
<li>dim_station.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized=<span class="hljs-string">'incremental'</span>) }}
<span class="hljs-function">with stations <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span>
station_id,
station_name
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'stg_station'</span></span>) }} <span class="hljs-keyword">as</span> d
<span class="hljs-keyword">where</span> station_id <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
)
<span class="hljs-keyword">select</span>
ns.station_id,
ns.station_name
<span class="hljs-keyword">from</span> stations ns
</span>{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">is_incremental</span>(<span class="hljs-params"></span>) %}
-- logic <span class="hljs-keyword">for</span> incremental models <span class="hljs-keyword">this</span> </span>= dim_station table
left outer <span class="hljs-keyword">join</span> {{ <span class="hljs-keyword">this</span> }} dim
<span class="hljs-keyword">on</span> ns.station_id = dim.station_id
<span class="hljs-keyword">where</span> dim.station_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
{% endif %}
</code></pre>
<ul>
<li>dim_booth.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized=<span class="hljs-string">'incremental'</span>,
cluster_by = <span class="hljs-string">"station_id"</span>
)}}
<span class="hljs-function">with booth <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span>
booth_id,
remote,
booth_name,
station_name
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'stg_booth'</span></span>) }}
<span class="hljs-keyword">where</span> booth_id <span class="hljs-keyword">is</span> not <span class="hljs-literal">null</span>
),
dim_station <span class="hljs-title">as</span> (<span class="hljs-params">
<span class="hljs-keyword">select</span> station_id, station_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>(<span class="hljs-string">'dim_station'</span></span>) }}
)
<span class="hljs-keyword">select</span>
b.booth_id,
b.remote,
b.booth_name,
st.station_id
<span class="hljs-keyword">from</span> booth b
inner <span class="hljs-keyword">join</span> dim_station st
<span class="hljs-keyword">on</span> b.station_name </span>= st.station_name
{% <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">is_incremental</span>(<span class="hljs-params"></span>) %}
-- logic <span class="hljs-keyword">for</span> incremental models <span class="hljs-keyword">this</span> </span>= dim_booth table
left outer <span class="hljs-keyword">join</span> {{ <span class="hljs-keyword">this</span> }} s
<span class="hljs-keyword">on</span> b.booth_id = s.booth_id
<span class="hljs-keyword">where</span> s.booth_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
{% endif %}
</code></pre>
<ul>
<li>fact_turnstile.sql</li>
</ul>
<pre><code class="lang-sql">
{{ config(materialized='incremental',
partition_by={
<span class="hljs-string">"field"</span>: <span class="hljs-string">"created_dt"</span>,
<span class="hljs-string">"data_type"</span>: <span class="hljs-string">"timestamp"</span>,
<span class="hljs-string">"granularity"</span>: <span class="hljs-string">"day"</span>
},
cluster_by = <span class="hljs-string">"station_id"</span>)
}}
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span> (
select
log_id,
remote,
booth,
station,
scp,
line_name,
division,
created_dt,
entries,
exits
<span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('stg_turnstile') }}
<span class="hljs-keyword">where</span> log_id <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> null
),
dim_station <span class="hljs-keyword">as</span> (
select station_id, station_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('dim_station') }}
),
dim_booth <span class="hljs-keyword">as</span> (
select booth_id, remote, booth_name <span class="hljs-keyword">from</span> {{ <span class="hljs-keyword">ref</span>('dim_booth') }}
)
select
<span class="hljs-built_in">log</span>.log_id,
st.station_id,
booth.booth_id,
<span class="hljs-built_in">log</span>.scp,
<span class="hljs-built_in">log</span>.line_name,
<span class="hljs-built_in">log</span>.division,
<span class="hljs-built_in">log</span>.created_dt,
<span class="hljs-built_in">log</span>.entries,
<span class="hljs-built_in">log</span>.exits
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> <span class="hljs-built_in">log</span>
left join dim_station <span class="hljs-keyword">as</span> st
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.station = st.station_name
left join dim_booth <span class="hljs-keyword">as</span> booth
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.remote = booth.remote <span class="hljs-keyword">and</span> <span class="hljs-built_in">log</span>.booth = booth.booth_name
{% <span class="hljs-keyword">if</span> is_incremental() %}
<span class="hljs-comment">-- logic for incremental models this = fact_turnstile table</span>
left outer join {{ this }} fact
<span class="hljs-keyword">on</span> <span class="hljs-built_in">log</span>.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id <span class="hljs-keyword">is</span> null
{% endif %}
</code></pre>
<ul>
<li>schema.yml</li>
</ul>
<pre><code class="lang-yml"><span class="hljs-attribute">version</span>: 2
<span class="less"><span class="hljs-attribute">models</span>:
- <span class="hljs-attribute">name</span>: dim_station
<span class="hljs-attribute">description</span>: >
    List of unique stations identified by station_id.
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: The station identifier
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_name
<span class="hljs-attribute">description</span>: the station name
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: dim_booth
<span class="hljs-attribute">description</span>: >
    List of unique booths identified by booth_id.
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: booth_id
<span class="hljs-attribute">description</span>: The booth identifier
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: remote
<span class="hljs-attribute">description</span>: the remote gate name
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: booth_name
<span class="hljs-attribute">description</span>: the station booth
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: the station id
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_station'</span>)
<span class="hljs-attribute">field</span>: station_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: fact_turnstile
<span class="hljs-attribute">description</span>: >
    Represents the daily entries and exits associated with booths in subway stations
<span class="hljs-attribute">columns</span>:
- <span class="hljs-attribute">name</span>: log_id
<span class="hljs-attribute">description</span>: Primary key for this table, generated with a concatenation CA, SCP,UNIT, STATION CREATED
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">unique</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: booth_id
<span class="hljs-attribute">description</span>: foreign key to the booth dimension
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_booth'</span>)
<span class="hljs-attribute">field</span>: booth_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: station_id
<span class="hljs-attribute">description</span>: The foreign key to the station dimension
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">relationships</span>:
<span class="hljs-attribute">to</span>: ref(<span class="hljs-string">'dim_station'</span>)
<span class="hljs-attribute">field</span>: station_id
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: scp
<span class="hljs-attribute">description</span>: The device address
- <span class="hljs-attribute">name</span>: line_name
<span class="hljs-attribute">description</span>: The subway line
- <span class="hljs-attribute">name</span>: division
<span class="hljs-attribute">description</span>: The subway division
- <span class="hljs-attribute">name</span>: created_dt
<span class="hljs-attribute">description</span>: The date time for the activity
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: entries
<span class="hljs-attribute">description</span>: The number of entries
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn
- <span class="hljs-attribute">name</span>: exits
<span class="hljs-attribute">description</span>: the number of exits
<span class="hljs-attribute">tests</span>:
- <span class="hljs-attribute">not_null</span>:
<span class="hljs-attribute">severity</span>: warn</span>
</code></pre>
<h4 id="incremental-models">Incremental Models</h4>
<p>In dbt, an incremental model uses a merge operation to update a data warehouse's tables incrementally rather than performing a full reload of the data each time. This approach is particularly useful when dealing with large datasets and when the source data has frequent updates or inserts. Incremental models help optimize data processing and reduce the amount of data that needs to be processed during each run, resulting in faster data updates. </p>
<ul>
<li>SQL merge query for the station dimension table (generated code)</li>
</ul>
<pre><code class="lang-sql">
<span class="hljs-keyword">merge</span> <span class="hljs-keyword">into</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`dim_station`</span> <span class="hljs-keyword">as</span> DBT_INTERNAL_DEST
<span class="hljs-keyword">using</span> (
<span class="hljs-keyword">with</span> stations <span class="hljs-keyword">as</span> (
<span class="hljs-keyword">select</span>
station_id,
station_name
<span class="hljs-keyword">from</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`stg_station`</span> <span class="hljs-keyword">as</span> d
)
<span class="hljs-keyword">select</span>
ns.station_id,
ns.station_name
<span class="hljs-keyword">from</span> stations ns
<span class="hljs-comment">-- logic for incremental models</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> <span class="hljs-string">`ozkary-de-101`</span>.<span class="hljs-string">`mta_data`</span>.<span class="hljs-string">`dim_station`</span> s
<span class="hljs-keyword">on</span> ns.station_id = s.station_id
<span class="hljs-keyword">where</span> s.station_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
<span class="hljs-comment">-- </span>
) <span class="hljs-keyword">as</span> DBT_INTERNAL_SOURCE
<span class="hljs-keyword">on</span> (<span class="hljs-literal">FALSE</span>)
<span class="hljs-keyword">when</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">matched</span> <span class="hljs-keyword">then</span> <span class="hljs-keyword">insert</span>
(<span class="hljs-string">`station_id`</span>, <span class="hljs-string">`station_name`</span>)
<span class="hljs-keyword">values</span>
(<span class="hljs-string">`station_id`</span>, <span class="hljs-string">`station_name`</span>)
</code></pre>
<ul>
<li>SQL merge query for the fact table (generated code)</li>
</ul>
<pre><code class="lang-sql">merge into `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`fact_turnstile` <span class="hljs-keyword">as</span> DBT_INTERNAL_DEST
using (
<span class="hljs-keyword">with</span> turnstile <span class="hljs-keyword">as</span> (
select
log_id,
remote,
booth,
station,
scp,
line_name,
division,
created_dt,
entries,
exits
<span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`stg_turnstile`
<span class="hljs-keyword">where</span> log_id is not null
),
dim_station <span class="hljs-keyword">as</span> (
select station_id, station_name <span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`dim_station`
),
dim_booth <span class="hljs-keyword">as</span> (
select booth_id, remote, booth_name <span class="hljs-keyword">from</span> `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`dim_booth`
)
select
log.log_id,
st.station_id,
booth.booth_id,
log.scp,
log.line_name,
log.division,
log.created_dt,
log.entries,
log.exits
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> log
left join dim_station <span class="hljs-keyword">as</span> st
on log.station = st.station_name
left join dim_booth <span class="hljs-keyword">as</span> booth
on log.remote = booth.remote and log.booth = booth.booth_name
-- logic for incremental models this = fact_turnstile table
left outer join `ozkary-de<span class="hljs-number">-101</span>`.`mta_data`.`fact_turnstile` fact
on log.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id is null
) <span class="hljs-keyword">as</span> DBT_INTERNAL_SOURCE
on (FALSE)
when not matched then insert
(`log_id`, `station_id`, `booth_id`, `scp`, `line_name`, `division`, `created_dt`, `entries`, `exits`)
values
(`log_id`, `station_id`, `booth_id`, `scp`, `line_name`, `division`, `created_dt`, `entries`, `exits`)
</code></pre>
<h2 id="how-to-run-it">How to Run It</h2>
<p>We are ready to see this in action. We first need to build the data models on our database by running the following steps:</p>
<h3 id="validate-the-project">Validate the project</h3>
<p>Debug the project to make sure there are no compilation errors.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-keyword">debug</span>
</code></pre>
<h3 id="run-the-test-cases">Run the test cases</h3>
<p>All tests should pass.</p>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">test</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-tests.png" alt="ozkary-data-engineering-data-warehouse-tests" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Tests"></p>
<h3 id="build-the-models">Build the models</h3>
<p>Set the test run variable to false. This allows for the full dataset to be created without limiting the rows.</p>
<pre><code class="lang-bash">$ cd Step4-Data-Warehouse/dbt
$ dbt build --select stg_booth<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select stg_station<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select stg_turnstile<span class="hljs-selector-class">.sql</span> --<span class="hljs-selector-tag">var</span> <span class="hljs-string">'is_test_run: false'</span>
$ dbt build --select dim_booth<span class="hljs-selector-class">.sql</span>
$ dbt build --select dim_station<span class="hljs-selector-class">.sql</span>
$ dbt build --select fact_turnstile.sql
</code></pre>
<p>After running these commands, the following resources should be in the data warehouse:</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-schema.png" alt="ozkary-data-engineering-data-warehouse-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Schema"></p>
<blockquote>
<p>π The build command compiles the SQL code for our dbt project and executes it against the data warehouse, running the models together with their tests. The run command only executes the models to update the data. Typically, we use dbt build to create and validate the models, and then dbt run for the ongoing incremental updates.</p>
</blockquote>
<h3 id="generate-documentation">Generate documentation</h3>
<p>Run generate to create the documentation. We can then run serve to view the documentation in the browser.</p>
<pre><code class="lang-bash">$ dbt docs <span class="hljs-keyword">generate</span>
$ dbt docs serve
</code></pre>
<p>The entire project is documented. The image below shows the documentation for the fact table with the lineage graph showing how it was built.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-docs.png" alt="ozkary-data-engineering-data-warehouse-docs" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Documents"></p>
<h3 id="manually-test-the-incremental-updates">Manually test the incremental updates</h3>
<p>We can run our updates on demand by using the CLI. To be able to run the updates, we should first run the data pipeline and import a new CSV file into the data lake. We can then run our updates as follows:</p>
<pre><code class="lang-bash">$ cd Step4-Data-Warehouse/dbt
$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model dim_booth.sql
</span>$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model dim_station.sql
</span>$ dbt <span class="hljs-keyword">run</span><span class="bash"> --model fact_turnstile.sql</span>
</code></pre>
<p>We should notice that we are "running" the model, which only runs the incremental (merge) updates.</p>
<h3 id="schedule-the-job">Schedule the job</h3>
<p>Log in to dbt Cloud and set up this scheduled job:</p>
<ul>
<li>On dbt Cloud, set up the scheduled job to run every Saturday at 9am</li>
<li>Use the production environment</li>
<li>Use the following command</li>
</ul>
<pre><code class="lang-bash">$ dbt <span class="hljs-built_in">run</span> <span class="hljs-comment">--model fact_turnstile.sql</span>
</code></pre>
<p>After running the cloud job, the log should show the following information with the number of rows affected. </p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-jobs.png" alt="ozkary-data-engineering-data-warehouse-job" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Job"></p>
<blockquote>
<p>π There should be files on the data lake for the job to insert any new records. </p>
</blockquote>
<h3 id="manually-query-the-data-lake-for-new-data">Manually Query the data lake for new data</h3>
<p>To check for new records, we can manually run this query on the database. </p>
<pre><code class="lang-sql">with turnstile as (
<span class="hljs-keyword">select</span>
log_id
<span class="hljs-keyword">from</span> mta_data.stg_turnstile
)
<span class="hljs-keyword">select</span>
log.log_id
<span class="hljs-keyword">from</span> turnstile <span class="hljs-keyword">as</span> <span class="hljs-keyword">log</span>
<span class="hljs-comment">-- logic for incremental models find new rows that are not in the fact table</span>
<span class="hljs-keyword">left</span> <span class="hljs-keyword">outer</span> <span class="hljs-keyword">join</span> mta_data.fact_turnstile fact
<span class="hljs-keyword">on</span> log.log_id = fact.log_id
<span class="hljs-keyword">where</span> fact.log_id <span class="hljs-keyword">is</span> <span class="hljs-literal">null</span>
</code></pre>
<h3 id="validate-the-data">Validate the data</h3>
<p>To validate the number of records in our database, we can run these queries:</p>
<pre><code class="lang-sql">--<span class="hljs-built_in"> check </span>station dimension table
select count(*) from mta_data.dim_station;
--<span class="hljs-built_in"> check </span>booth dimension table
select count(*) from mta_data.dim_booth;
--<span class="hljs-built_in"> check </span>the fact table
select count(*) from mta_data.fact_turnstile;
--<span class="hljs-built_in"> check </span>the staging fact data
select count(*) from mta_data.stg_turnstile;
</code></pre>
<p>After following all these instructions, we should see data in our data warehouse, which closes the loop on the entire data pipeline for data ingestion from a CSV file to our data warehouse. We should also note that we could have done this process using a Python-Centric approach with Apache Spark, and we will discuss that in a later section.</p>
<h2 id="summary">Summary</h2>
<p>During this data warehouse exercise, we delve into the design and implementation step, crafting robust data models, and designing transformation tasks. Carefully selecting a star schema design and utilizing BigQuery as our OLAP system, we optimize performance and handle large datasets efficiently. Leveraging SQL for coding and a SQL-Centric framework, we ensure seamless data modeling and transformation. We use GitHub for our source code management and CI/CD tool integration, so the latest changes can be built and deployed. Thorough documentation and automated data transformations underscore our commitment to data governance and streamlined operations. The result is a resilient and future-ready data warehouse capable of meeting diverse analytical needs.</p>
<h2 id="next-step">Next Step</h2>
<p>With our data warehouse design and implementation complete, we have laid a solid foundation to unleash the full potential of our data. Now, we venture into the realm of data analysis and visualization, where we can leverage powerful tools like Power BI and Looker to transform raw data into actionable insights.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Data Analysis and Visualization]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-64739404928915423042023-06-10T11:45:00.018-04:002023-08-04T11:34:41.189-04:00Data Engineering Process Fundamentals - Data Warehouse and Transformation<p>After completing the pipeline and orchestration phase in the data engineering process, our pipeline should be fully operational and loading data into our data lake. The compressed CSV files in our data lake, even though is optimized for storage, are not designed for easy access for analysis and visualization tools. Therefore, we should transition into moving the data from the files into a data warehouse, so we can facilitate the access for the analysis process.</p>
<p>The process of sending the data into a data warehouse requires a few essential design activities before we can migrate the data into tables. Like any process, before any implementation is done, we first need to define the database system and schema, identify the programming language, frameworks, and tools to use, and cover the CI/CD and supporting requirements that keep our data warehouse operational.</p>
<p>Once the data warehouse design is in place, we can move into the implementation stage of the process, where we turn concepts into concrete structures, including dimension and fact tables, while also defining the data transformation tasks that load the data into the data warehouse. </p>
<p>To get a better understanding of the data warehouse process, let's first do a refresh on some important concepts related to data warehouse systems. As we cover these concepts, we can relate them to the activities that we need to take on to deliver a solution that can scale according to our data demands.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-steps.png" alt="ozkary-data-engineering-data-warehouse-transformation-steps" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation"></p>
<h2 id="olap-vs-oltp-database-systems">OLAP vs OLTP Database Systems</h2>
<p>An Online Analytical Processing (OLAP) and an Online Transaction Processing (OLTP) are two different types of database systems with distinct purposes and characteristics:</p>
<h3 id="olap">OLAP</h3>
<ul>
<li>It is designed for complex analytical queries and data analysis</li>
<li>It is optimized for read-heavy workloads and aggregates large volumes of data to support business intelligence (BI), reporting, and data analysis.</li>
<li>These databases store historical data and facilitate data exploration, trend analysis, and decision-making</li>
<li>Data is typically denormalized and organized in a multidimensional structure like a star schema or snowflake schema to enable efficient querying and aggregation.</li>
<li>Some examples include data warehouses and analytical databases like Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics.</li>
</ul>
<h3 id="oltp">OLTP</h3>
<ul>
<li>It is designed for transactional processing and handling frequent, real-time, and high-throughput transactions</li>
<li>It focuses on transactional operations like inserting, updating, and deleting individual records</li>
<li>Databases are typically normalized to minimize redundancy and ensure data integrity during frequent transactions</li>
<li>The data is organized in a relational structure and optimized for read and write operations</li>
<li>Some examples include traditional relational databases like MySQL, PostgreSQL, Microsoft SQL Server, and Oracle</li>
</ul>
<blockquote>
<p>π OLAP databases (e.g., BigQuery) are used for analytical processing. OLTP databases (e.g., SQL Server) are used for transaction processing</p>
</blockquote>
<p>In summary, OLAP and OLTP serve different purposes in the database world. OLAP databases are used for analytical processing, supporting complex queries and data analysis, while OLTP databases are used for transaction processing, managing high-frequency and real-time transactional operations. Depending on the needs of the solution, we would choose the appropriate type of database system to achieve the desired performance and functionality. In our case, an OLAP system aligns with the requirements for our solution.</p>
<h2 id="what-is-a-data-warehouse">What is a Data Warehouse</h2>
<p>A Data Warehouse is an OLAP system, which serves as the central data repository for historical and aggregated data. A data warehouse is designed to support complex analytical queries, reporting, and data analysis for Big Data use cases. It typically adopts a denormalized entity structure, such as a star schema or snowflake schema, to facilitate efficient querying and aggregations. Data from various OLTP sources is extracted, loaded and transformed (ELT) into the data warehouse to enable analytics and business intelligence. The data warehouse acts as a single source of truth for business users to obtain insights from historical data.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-design.png" alt="ozkary-data-engineering-data-warehouse-transformation-design" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Design"></p>
<h3 id="elt-vs-etl">ELT vs ETL</h3>
<p>An extract, load and transform (ELT) process differs from the extract, transform and load (ETL) process in the data transformation approach. For some solutions, a flow task may transform (ETL) the data prior to loading it into storage, so it can then be inserted into the data warehouse directly. This approach increases the amount of Python code and hardware resources used by the VM environments. </p>
<p>For the ELT process, the transformation may be done using SQL (Structured Query Language) code and the data warehouse resources, which tends to perform well for Big Data scenarios. This is usually done by defining the data model with views over some external tables and running the transformation using SQL for bulk data processing. In our case, we can expose the data lake files as external tables and use the power of the data warehouse to read and transform the data, which aligns with the ELT approach since the data is first loaded in the data lake.</p>
<blockquote>
<p>π For the ETL process, the data is transformed before adding it to storage. For the ELT process, the data is first loaded into storage in its raw format, and the transformation is then done before inserting into the dimension and fact tables.</p>
</blockquote>
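<p>As a minimal ELT sketch, the sequence below loads nothing into the warehouse storage up front: the raw files stay in the data lake, and the transformation runs inside the warehouse with SQL. The dataset name, bucket path, and column names are illustrative assumptions.</p>
<pre><code class="lang-sql">-- Reference the raw files in the data lake as an external table
CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
OPTIONS (
  format = 'CSV',
  uris = ['gs://your-data-lake/turnstile/*.csv.gz']
);

-- Transform with SQL inside the warehouse, here as a simple staging view
CREATE OR REPLACE VIEW mta_data.stg_station AS
SELECT DISTINCT STATION AS station_name
FROM mta_data.ext_turnstile
WHERE STATION IS NOT NULL;
</code></pre>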
<h3 id="external-tables">External Tables</h3>
<p>An external table in the context of a data warehouse refers to a table that is not physically stored within the data warehouse's database but instead references data residing in an external storage location. The data in an external table can be located in cloud storage (e.g., Azure Blob Storage, AWS S3) or on-premises storage. When querying an external table, the data warehouse's query engine accesses the data in the external location on-the-fly without physically moving or copying it into the data warehouse's database. </p>
<p>Advantages of using external tables in a data warehouse include:</p>
<ul>
<li>Cost Savings: External tables allow us to store data in cost-effective storage solutions like cloud object storage</li>
<li>Data Separation: By keeping the data external to the data warehouse, we can maintain a clear separation between compute and storage. We can scale them independently, optimizing costs and performance</li>
<li>Data Freshness: External tables provide real-time access to data, as changes made to the external data source are immediately reflected when queried. There's no need for <strong>raw data ingestion</strong> processes to load the data into the data warehouse.</li>
<li>Data Variety and Integration: You can have external tables referencing data in various formats (e.g., CSV, Parquet, JSON), enabling seamless integration of diverse data sources without the need for complex data transformations</li>
<li>Data Archiving and Historical Analysis: External tables allow you to store historical data in an external location, reducing the data warehouse's storage requirements. You can keep archived data accessible without impacting the performance of the main data warehouse.</li>
<li>Rapid Onboarding: Setting up external tables is often quicker and more straightforward than traditional data ingestion processes. This allows for faster onboarding of new data sources into the data warehouse.</li>
<li>Reduced ETL Complexity: External tables can reduce the need for complex ETL (Extract, Transform, Load) processes as the data doesn't need to be physically moved or transformed before querying.</li>
</ul>
<h3 id="data-mart">Data Mart</h3>
<p>Depending on the use case, the analytical tools can connect directly to the data warehouse for data analysis and reporting. In other scenarios, it may be better to create a data mart, which is a smaller, focused subset of a data warehouse that is designed to serve the needs of a specific business unit within an organization. The data mart stores its data in separate storage.</p>
<p>There are two main types of data marts:</p>
<ul>
<li>Dependent Data Mart: This type of data mart is derived directly from the data warehouse. It extracts and transforms data from the centralized data warehouse and optimizes it for a specific business unit. </li>
<li>Independent Data Mart: An independent data mart is created separately from the data warehouse, often using its own ELT processes to extract and transform data from the source systems. It is not directly connected to the data warehouse</li>
</ul>
<p>By providing a more focused view of the data, data marts enable faster and more efficient decision-making within targeted business areas. </p>
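<p>As an example of a dependent data mart, a separate dataset could hold pre-aggregated tables derived from the warehouse. The sketch below assumes a hypothetical mta_mart dataset and reuses the fact and dimension tables from our star schema.</p>
<pre><code class="lang-sql">-- Monthly entries and exits per station, materialized for a specific business unit
CREATE OR REPLACE TABLE mta_mart.station_monthly_activity AS
SELECT
  st.station_name,
  DATE_TRUNC(DATE(f.created_dt), MONTH) AS activity_month,
  SUM(f.entries) AS total_entries,
  SUM(f.exits) AS total_exits
FROM mta_data.fact_turnstile AS f
JOIN mta_data.dim_station AS st
  ON f.station_id = st.station_id
GROUP BY st.station_name, activity_month;
</code></pre>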
<h2 id="data-warehouse-design-and-architecture">Data Warehouse Design and Architecture</h2>
<p>During the design and architecture stage of our data warehouse project, our primary objective is to transition from conceptual ideas to concrete designs. With a clear understanding of the business requirements, data sources and their update frequencies, we can move forward with the design of the data warehouse architecture. To start, we need to define the data warehouse models such as star schema, snowflake schema, or hybrid models based on data relationships and query patterns. We should also determine the infrastructure and technology stack for the data warehouse, considering factors like data volume, frequency of updates, query performance requirements, source control, and CI/CD activities.</p>
<h3 id="schema-design">Schema Design</h3>
<p>The Star and Snowflake Schemas are two common data warehouse modeling techniques. The Star Schema consists of a central fact table that is connected to multiple dimension tables via foreign key relationships. The fact table contains the measures or metrics, while the dimension tables hold descriptive attributes. The Snowflake Schema is a variation of the Star Schema, but with normalized dimension tables. This means that dimension tables are further divided into multiple related tables, reducing data redundancy, but increasing SQL joins.</p>
<h4 id="star-schema-pros-and-cons">Star Schema Pros and Cons</h4>
<ul>
<li>Simplicity: The Star Schema is straightforward and easy to understand, making it user-friendly for both data engineers and business analysts</li>
<li>Performance: Star Schema typically delivers faster query performance because it denormalizes data, reducing the number of joins required to retrieve data</li>
<li>Data Redundancy: Due to denormalization, there might be some data redundancy in dimension tables, which can lead to increased storage requirements</li>
<li>Maintenance: The Star Schema is relatively easier to maintain and modify since changes in dimension tables don't affect the fact table</li>
</ul>
<h4 id="snowflake-schema-pros-and-cons">Snowflake Schema Pros and Cons</h4>
<ul>
<li>Normalization: The Snowflake Schema reduces data redundancy and optimizes storage by normalizing dimension data</li>
<li>Complexity: Compared to the Star Schema, the Snowflake Schema is more complex due to the presence of multiple normalized dimension tables</li>
<li>Performance: The Snowflake Schema requires more joins, which can impact query performance compared to the Star Schema. However, modern data warehouses are optimized to handle Snowflake Schemas efficiently</li>
<li>Maintenance: The Snowflake Schema might be slightly more challenging to maintain and modify due to the normalized structure and the need for more joins</li>
</ul>
<p>In summary, we can use the Star Schema when query performance is a primary concern and data model simplicity is essential. We can use the Snowflake Schema when storage optimization is crucial and the data model involves high-cardinality dimension attributes with potential data redundancy.</p>
<h3 id="infrastructure">Infrastructure</h3>
<p>Cloud-based OLAP systems like Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics are built to scale with growing data volumes. They can handle petabytes of data, making them a great fit for Big Data scenarios. These systems also support MPP (Massively Parallel Processing) and provide built-in indexing and caching, which improve query performance and reduce compute by reusing cached query results. The serverless architecture of these systems helps reduce cost. Because the system is managed by the cloud provider, we can focus on the data analysis instead of infrastructure management.</p>
<p>OLAP systems also support data governance by providing a structured and controlled environment for managing data, ensuring data quality, enforcing security and access controls, and promoting consistency and trust in the data across the organization. These systems also implement robust security measures and auditing capabilities for tracking data lineage and changes, which are crucial for compliance requirements.</p>
<p>All in all, OLAP systems are well-equipped to handle big data scenarios, offering scalability, high-performance querying, cost-effectiveness, and data governance, which is a critical business requirement.</p>
<h3 id="technology-stack">Technology Stack</h3>
<p>When it comes to the technology stack, we have to decide which programming language, frameworks, and platforms to use for our solution. For example, Python is a versatile programming language with an extensive ecosystem of libraries for data modeling and transformation. But when using Python, we need to parse the CSV files, and model and transform the data in memory before it can be sent to the database. This tends to increase the amount of Python code, Docker containers, VM resources, and overall DevOps activities. </p>
<p>Alternatively, we can leverage the memory and processing power of the data warehouse itself and use SQL to create the models and run the transformations, which tends to work best for large datasets and faster processing. Because the files already reside in the data lake, the CSV files can be modeled as external tables within the data warehouse. SQL can then be used to create models as views that enforce the data types. In addition, the transformation can be done right in the database using SQL statements with batch queries, which tends to perform much better than using Python.</p>
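<p>As a rough sketch of this SQL-centric approach, the snippet below uses the BigQuery Python client to define an external table over the CSV files in the data lake and a typed view on top of it. The dataset, bucket, and column names are illustrative assumptions, not the project's actual resources.</p>
<pre><code class="lang-python">from google.cloud import bigquery

# assumption: BigQuery is the data warehouse and the CSV files live in a GCS data lake
client = bigquery.Client()

ddl = """
CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://mta_data_lake/turnstile/*.csv.gz']
);

-- view that enforces the data types on top of the external table
CREATE OR REPLACE VIEW mta_data.vw_turnstile AS
SELECT
  CAST(station AS STRING)    AS station_name,
  CAST(created AS TIMESTAMP) AS created_datetime,
  CAST(entries AS INT64)     AS entries,
  CAST(exits   AS INT64)     AS exits
FROM mta_data.ext_turnstile;
"""

# run the DDL as a single batch script on the warehouse
client.query(ddl).result()
</code></pre>
<p>The same pattern applies to other warehouses that support external tables; only the DDL syntax changes.</p>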
<h4 id="frameworks">Frameworks</h4>
<p>Frameworks provide libraries to handle specific technical concerns. In the case of a Python-centric solution, we can use the <a href="https://pandas.pydata.org/">Pandas</a> library, an open-source data manipulation, cleaning, transformation, and analysis library widely used by data engineers and scientists. Pandas supports DataFrame-based modeling and transformation. A DataFrame is a two-dimensional, table-like data structure. It can hold data with different data types and allows us to perform operations like filtering, grouping, joining, and aggregating. Pandas also offers functions for handling missing data, removing duplicates, and converting data types, making data cleaning tasks easier.</p>
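<p>As a short illustration of these operations, the hedged example below loads a CSV file, applies basic cleaning rules, and aggregates the measures; the file and column names are placeholders for the turnstile dataset.</p>
<pre><code class="lang-python">import pandas as pd

# load a raw CSV file (placeholder name) and apply basic cleaning rules
df = pd.read_csv("turnstile_230318.csv")
df = df.drop_duplicates()                        # remove duplicate rows
df = df.dropna(subset=["ENTRIES", "EXITS"])      # drop rows with missing measures
df["CREATED"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])  # convert to a timestamp

# aggregate the measures by station for a simple summary
summary = (
    df.groupby("STATION")[["ENTRIES", "EXITS"]]
      .sum()
      .reset_index()
)
print(summary.head())
</code></pre>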
<p>There are also frameworks that generate SQL code to build the models and process the transformation. <a href="https://www.getdbt.com/">dbt</a> (data build tool) is a SQL-centric framework that is primarily focused on transforming data using SQL-based queries. It allows us to define data transformation logic using SQL and Jinja, a templating language that adds capabilities such as loops, conditionals, and macros to our SQL code. dbt enables us to build the actual data models as views, entities (tables), and SQL-based transformations that are hosted on the data warehouse. </p>
<h4 id="apache-spark-platform">Apache Spark Platform</h4>
<p><a href="https://spark.apache.org/">Apache Spark</a> is a widely used open-source distributed computing system designed for big data processing and analytics. It provides a fast, scalable, and versatile platform for handling large-scale data workloads. While it can be used for data modeling and transformation, it serves a broader range of use cases, including batch processing, real-time processing and machine learning. There are many popular cloud platforms that use Spark as their core engine. Some of them include: Databricks, Azure Synapse Analytics, Google Dataproc, Amazon EMR. </p>
<p>Spark supports multiple programming languages like Scala, Python, and SQL. Since Spark requires a runtime environment to manage the execution of a task, the programming model is very similar to running applications on a VM. The Spark application connects to a Spark cluster to create a session, and it can then perform data processing and run Spark SQL queries. Let's look at what a Python and a SQL application look like with Spark.</p>
<p>Data Modeling and Transformation with PySpark and SQL:</p>
<p>The next examples (one for Python and one for SQL) show us how to create a Spark session, join two DataFrames using the station_id as the related column, and then select and display the result of the query.</p>
<ul>
<li>PySpark: PySpark provides a high-level API for Spark, allowing us to write Spark applications using Python. It exposes the core Spark functionalities and supports DataFrame and Dataset APIs for working with structured data. PySpark is popular among data engineers and data scientists.</li>
</ul>
<p><strong>PySpark Code Sample:</strong></p>
<pre><code class="lang-python">from pyspark.sql <span class="hljs-built_in">import</span> SparkSession
<span class="hljs-comment"># Assuming you already have the two DataFrames `dim_station` and `fact_turnstile`</span>
<span class="hljs-comment"># Create a SparkSession (if not already created)</span>
<span class="hljs-attr">spark</span> = SparkSession.builder.appName(<span class="hljs-string">"JoinEntities"</span>).getOrCreate()
<span class="hljs-comment"># Join the two DataFrames on the 'station_id' column</span>
<span class="hljs-attr">joined_df</span> = fact_turnstile.join(dim_station, <span class="hljs-attr">on="station_id")</span>
<span class="hljs-comment"># Select the desired columns</span>
<span class="hljs-attr">result_df</span> = joined_df.select(<span class="hljs-string">"station_name"</span>, <span class="hljs-string">"created_datetime"</span>, <span class="hljs-string">"entries"</span>, <span class="hljs-string">"exits"</span>)
<span class="hljs-comment"># Show the result</span>
result_df.show()
</code></pre>
<ul>
<li>SQL: Spark includes a SQL module that allows us to run SQL queries directly on data. This makes it convenient for those familiar with SQL to leverage their SQL skills to perform data modeling and transformation tasks using Spark.</li>
</ul>
<p><strong>PySpark and SQL Code Sample:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-comment"># Assuming you already have the two DataFrames `dim_station` and `fact_turnstile`</span>
<span class="hljs-comment"># Create a SparkSession (if not already created)</span>
spark = SparkSession.builder.appName(<span class="hljs-string">"JoinEntities"</span>).getOrCreate()
<span class="hljs-comment"># Register the DataFrames as temporary views</span>
dim_station.createOrReplaceTempView(<span class="hljs-string">"dim_station_view"</span>)
fact_turnstile.createOrReplaceTempView(<span class="hljs-string">"fact_turnstile_view"</span>)
<span class="hljs-comment"># Write the SQL query for joining and selecting the desired columns</span>
sql_query = <span class="hljs-string">"""
SELECT s.station_name, t.created, t.entries, t.exits
FROM fact_turnstile_view t
JOIN dim_station_view s ON t.station_id = s.station_id
"""</span>
<span class="hljs-comment"># Execute the SQL query</span>
result_df = spark.sql(sql_query)
<span class="hljs-comment"># Show the result</span>
result_df.show()
</code></pre>
<h5 id="sample-output">Sample Output</h5>
<table>
<thead>
<tr>
<th>station_name</th>
<th>created</th>
<th>entries</th>
<th>exits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Central Station</td>
<td>2023-02-13 12:00:00</td>
<td>10000</td>
<td>5000</td>
</tr>
<tr>
<td>Times Square</td>
<td>2023-02-13 12:10:00</td>
<td>8000</td>
<td>3000</td>
</tr>
<tr>
<td>Union Square</td>
<td>2023-02-13 12:20:00</td>
<td>12000</td>
<td>7000</td>
</tr>
<tr>
<td>Grand Central</td>
<td>2023-02-13 12:30:00</td>
<td>9000</td>
<td>4000</td>
</tr>
<tr>
<td>Penn Station</td>
<td>2023-02-13 12:40:00</td>
<td>11000</td>
<td>6000</td>
</tr>
</tbody>
</table>
<p>By supporting multiple languages like PySpark and SQL, Apache Spark caters to a broader audience, making it easier for developers, data engineers, and data scientists to leverage its capabilities effectively. Apache Spark provides a unified and flexible platform for data modeling and transformation at scale.</p>
<h4 id="source-control-and-ci-cd">Source Control and CI/CD</h4>
<p>As we build code for our data model and transformation tasks, we need to track it, manage the different versions and automate the deployments to our data warehouse. Storing the source code on systems like GitHub offers several benefits that enhance governance, version control, collaboration, and continuous integration/continuous deployment (CI/CD) on a data engineering project. Some of these benefits include:</p>
<ul>
<li><p>Governance and Version Control for Data Models: <a href="https://github.com/">GitHub</a> provides version control, ensuring that all changes to data models are tracked, audited, and properly managed, ensuring compliance with regulatory requirements and business standards</p>
</li>
<li><p>CI/CD for Data Transformation: CI/CD pipelines ensure that changes to data transformation code are thoroughly tested and safely deployed, reducing errors and improving data accuracy</p>
</li>
<li><p>Collaboration and Teamwork on Data Assets: GitHub's collaborative features enable data engineers and analysts to work together on data models and transformations code</p>
</li>
<li><p>Reusability and Flexibility in Data Transformation: Storing data transformation code on GitHub promotes the reuse of code snippets and best practices across the data warehouse solution</p>
</li>
<li><p>Disaster Recovery and Redundancy: GitHub acts as a secure backup for data transformation logic, ensuring redundancy and disaster recovery capabilities. In case of any issues, the data transformation code can be restored, minimizing downtime and data inconsistencies</p>
</li>
</ul>
<p>In the context of a data warehouse solution, using GitHub, or similar systems, as a version control system for managing data models and transformation assets brings numerous advantages that improve governance, collaboration, and code quality. It ensures that the data warehouse solution remains agile, reliable, and capable of adapting to changes in business requirements and data sources.</p>
<h2 id="data-warehouse-implementation">Data Warehouse Implementation</h2>
<p>The data warehouse implementation is the stage where the conceptual data model and design plans are transformed into a functional system. During this critical phase, data engineers and architects convert the abstract data model into concrete structures, including dimension and fact tables, while also defining the data transformation tasks to cleanse, integrate, and load data into the data warehouse. This implementation process lays the foundation for data accessibility, efficiency, and accuracy, ensuring that the data warehouse becomes a reliable and valuable source of insights for analytical purposes. </p>
<h3 id="data-modeling">Data Modeling</h3>
<p>Data modeling is the implementation of the structure of the data warehouse: creating models (views) and entities (tables), defining attributes (columns), and establishing data relationships to ensure efficient querying and reporting. It is also important to identify the primary keys, foreign keys, and indexes to improve data retrieval performance. This is also the area where data needs to be normalized or denormalized based on query patterns and analytical needs.</p>
<p>When using the Star Schema model, we need to carefully understand the data, so we can identify the dimensions and fact tables that need to be created. Dimension tables represent descriptive attributes or context data (e.g., train stations, commuters), while fact tables contain quantitative data or measures (e.g., number of stations or passengers). Dimensions are used for slicing data, providing business context to the measures, whereas fact tables store numeric data that can be aggregated to derive KPIs (Key Performance Indicators).</p>
<p>To help us define the data models, we can follow these simple rules:</p>
<ul>
<li><p>Dimensions: Dimensions are textual, and categorical attributes that describe business entities. They are often discrete and used for grouping, filtering, and organizing data.</p>
</li>
<li><p>Fact Tables: Fact tables contain numeric data that can be aggregated. They hold the measurable data and are related to dimensions through foreign keys.</p>
</li>
<li><p>Measures: Measures are the quantitative values that are subject to calculations such as sum, average, minimum, maximum, etc. They represent the KPIs that organizations want to track and analyze.</p>
</li>
<li><p>ERD: Create an Entity Relationship Diagram to visualize the models and their relationships</p>
</li>
</ul>
<blockquote>
<p>π Simple Star Schema ERD with dimension and fact tables</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-data-warehouse-star-schema.png" alt="ozkary-data-engineering-data-warehouse-star-schema" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation Star Schema"></p>
<p>For reporting and dashboards, additional models can be created to accelerate the data analysis process. This is usually done to capture common queries and abstract the join complexity with SQL views. Alternatively, data scientists can choose to connect directly to the entities and create their data models using their analytical tools, which handle the building of SQL queries. The approach really depends on the expertise of the team and the data modeling standards of the organization.</p>
<p>By defining clear dimension and fact tables with appropriate measures, a well-structured data model can enable effective analysis and visualization, supporting the generation of insightful KPIs for data-driven decision-making.</p>
<h3 id="data-transformation">Data Transformation</h3>
<p>The data transformation phase is a critical stage in a data warehouse project, where raw data is processed, cleansed, mapped to proper naming conventions, and loaded into the data warehouse to create a reliable dataset for analysis. Additionally, implementing incremental loads that continuously insert the new information since the last update via batch processes ensures that the data warehouse stays up-to-date with the latest data.</p>
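<p>As a hedged sketch of an incremental load, the statement below only inserts records that are newer than the latest row already in the fact table. It assumes BigQuery as the warehouse and reuses the illustrative table and column names from the earlier external-table sketch.</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# insert only the rows that arrived since the last load (illustrative names)
merge_sql = """
MERGE mta_data.fact_turnstile AS target
USING (
  SELECT station_name, created_datetime, entries, exits
  FROM mta_data.vw_turnstile
  WHERE created_datetime > (
    SELECT IFNULL(MAX(created_datetime), TIMESTAMP '1900-01-01')
    FROM mta_data.fact_turnstile)
) AS source
ON target.station_name = source.station_name
   AND target.created_datetime = source.created_datetime
WHEN NOT MATCHED THEN
  INSERT (station_name, created_datetime, entries, exits)
  VALUES (source.station_name, source.created_datetime, source.entries, source.exits)
"""

client.query(merge_sql).result()
</code></pre>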
<p>To help us define the data transformation tasks, we should do the following activities:</p>
<ul>
<li><p>Data Dictionary, Mapping and Transformation Rules: Develop a clear and comprehensive data dictionary and mapping document that outlines how source data fields correspond to target data warehouse tables and columns</p>
</li>
<li><p>Data Profiling: Identify data patterns, anomalies, and potential issues that need to be addressed during the transformation process, like removing null values, duplicates, and invalid data</p>
</li>
<li><p>Transformation Logic: Apply data transformation logic to standardize formats, resolve data inconsistencies, calculate derived measures, and define the incremental data rules</p>
</li>
<li><p>Data Validation and Testing: Validate the transformed data against predefined business rules and requirements to ensure its accuracy and alignment with expectations</p>
</li>
<li><p>Complete the Orchestration: Schedule the transformation tasks to automate the data loading process</p>
</li>
<li><p>Monitor and Operations: Monitor the transformation tasks to check for failures. Track incomplete data and notify the team of errors</p>
</li>
<li><p>Database Tuning: Involves making adjustments to the database system itself to optimize query execution and overall system performance.</p>
</li>
</ul>
<p>A well-executed implementation phase ensures that the data warehouse aligns with the business requirements and enables stakeholders to make informed decisions based on comprehensive and organized data, thus playing a fundamental role in the success of the overall data warehouse project.</p>
<h2 id="summary">Summary</h2>
<p>Before we can move data into a data warehouse system, we explore two pivotal phases for our data warehouse solution: design and implementation. In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse's implementation and operations. This stage ensures a solid infrastructure for data storage and management.</p>
<p>Moving on to the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. With this approach, we successfully complete the entire data pipeline and orchestration, seamlessly moving data from CSV files to the data warehouse. </p>
<h2 id="exercise-data-warehouse-model-and-transformation">Exercise - Data Warehouse Model and Transformation</h2>
<p>With a solid understanding of the data warehouse design and implementation, the next step is to put these concepts into practice through a hands-on exercise. In this lab, we build a cloud data warehouse system, applying the knowledge gained to create a powerful and efficient analytical platform.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/06/data-engineering-process-fundamentals-data-warehouse-transformation-exercise.html">Data Engineering Process Fundamentals - Data Warehouse Model and Transformation Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-59937635014228582562023-06-03T15:50:00.011-04:002023-06-29T16:00:34.435-04:00Azure OpenAI API Service with CSharp<p>The OpenAI Service is a cloud-based API that provides access to Large Language Models (LLM) and Artificial Intelligence (AI) Capabilities. This API allows developers to leverage the LLM models to create AI application that can perform Natural Language Processing (NLP) tasks such as text generation, code generation, language translation and others.</p>
<p>Azure provides the Azure OpenAI services which integrates the OpenAI API in Azure infrastructure. This enables us to create custom hosting resources and access the OpenAI API with a custom domain and deployment configuration. There are API client libraries to support different programming languages. To access the Azure OpenAI API using .NET, we could use the OpenAI .NET client library and access an OpenAI resource in Azure. As an alternative, we could use the HttpClient class from the System.Net.Http namespace and code the HTTP requests.</p>
<blockquote>
<p>π The OpenAI client libraries are available for Python, JavaScript, .NET, and Java</p>
</blockquote>
<p>In this article, we take a look at using the OpenAI API to generate code from a GitHub user story using an Azure OpenAI resource with the .NET client library. </p>
<blockquote>
<p>π An Azure OpenAI resource can be created by visiting <a href="https://oai.azure.com/portal">Azure OpenAI Portal</a></p>
</blockquote>
<p> <img src="//www.ozkary.dev/assets/2023/ozkary-openai-csharp-flow.png" alt="ozkary generate code from github user story"></p>
<h2 id="install-the-openai-api-client-dependencies">Install the OpenAI APi Client Dependencies</h2>
<p>To use the client library, we first need to install the dependencies and configure some environment variables. </p>
<pre><code>$ dotnet add package Azure<span class="hljs-selector-class">.AI</span><span class="hljs-selector-class">.OpenAI</span> --prerelease
</code></pre><h3 id="install-the-openai-dependencies-restoring-the-project-file-from-this-project">Install the OpenAI dependencies restoring the project file from this project</h3>
<ul>
<li>Clone this GitHub code repo: - <a href="https://github.com/ozkary/ai-engineering/tree/main/csharp/CodeGeneration">LLM Code Generation</a></li>
<li>Open a terminal and navigate to the CSharp folder<ul>
<li>Use the dotnet restore command when cloning the repository.</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-built_in">cd</span> csharp/CodeGeneration
$ dotnet <span class="hljs-built_in">restore</span>
</code></pre>
<p>This should download the code to your workstation.</p>
<h3 id="add-the-azure-openai-environment-configurations">Add the Azure OpenAI environment configurations</h3>
<p>Gather the following configuration information from GitHub and your Azure OpenAI resource.</p>
<blockquote>
<p>π This example uses a custom Azure OpenAI resource hosted at <a href="https://oai.azure.com/portal">Azure OpenAI Portal</a></p>
</blockquote>
<ul>
<li>GitHub Repo API Token with write permissions to push comments to an issue</li>
<li>Get an OpenAI API key</li>
<li>If you are using an Azure OpenAI resource, get your custom end-point and deployment<ul>
<li>The deployment should have the code-davinci-002 model</li>
</ul>
</li>
</ul>
<h3 id="set-the-linux-environment-variables-with-these-commands-">Set the linux environment variables with these commands:</h3>
<pre><code class="lang-bash">$ echo export AZURE_OpenAI_KEY=<span class="hljs-string">"OpenAI-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export GITHUB_TOKEN=<span class="hljs-string">"github-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_DEPLOYMENT=<span class="hljs-string">"deployment-name"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_ENDPOINT=<span class="hljs-string">"https://YOUR-END-POINT.OpenAI.azure.com/"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
</code></pre>
<h2 id="build-and-run-the-code">Build and Run the Code</h2>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>dotnet build
</code></pre>
<h3 id="describe-the-code">Describe the code</h3>
<p>The code should run this workflow:</p>
<ul>
<li>Get a list of open GitHub issues with the label user-story</li>
<li>Each issue content is sent to the OpenAI API to generate the code</li>
<li>The generated code is posted as a comment on the user-story for the developers to review</li>
</ul>
<blockquote>
<p>π The following code uses a simple API call implementation for the GitHub and OpenAI APIs. Use the code from this repo: - <a href="https://github.com/ozkary/ai-engineering/tree/main/csharp/CodeGeneration">LLM Code Generation</a></p>
</blockquote>
<pre><code class="lang-csharp"> <span class="hljs-comment">// Get environment variables</span>
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiApiKey = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_KEY"</span>) ?? String.Empty;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiBase = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_ENDPOINT"</span>) ?? String.Empty;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> openaiEngine = Environment.GetEnvironmentVariable(<span class="hljs-string">"AZURE_OPENAI_DEPLOYMENT"</span>) ?? String.Empty;
<span class="hljs-comment">// GitHub API endpoint and authentication headers </span>
<span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">readonly</span> <span class="hljs-keyword">string</span> githubToken = Environment.GetEnvironmentVariable(<span class="hljs-string">"GITHUB_TOKEN"</span>) ?? String.Empty;
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> Process a GitHub issue by label.</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">async</span> Task <span class="hljs-title">ProcessIssueByLabel</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> repo, <span class="hljs-keyword">string</span> label</span>)
</span>{
<span class="hljs-keyword">try</span>
{
<span class="hljs-comment">// Get the issues from the repo</span>
<span class="hljs-keyword">var</span> @<span class="hljs-keyword">params</span> = <span class="hljs-keyword">new</span> Parameter { Label = label, State = <span class="hljs-string">"open"</span> };
List<Issue> issues = <span class="hljs-keyword">await</span> GitHubService.GetIssues(repo, @<span class="hljs-keyword">params</span>, githubToken);
<span class="hljs-keyword">if</span> (issues != <span class="hljs-literal">null</span>)
{
<span class="hljs-keyword">foreach</span> (<span class="hljs-keyword">var</span> issue <span class="hljs-keyword">in</span> issues)
{
<span class="hljs-comment">// Generate code using OpenAI</span>
Console.WriteLine(<span class="hljs-string">$"Generating code from GitHub issue: <span class="hljs-subst">{issue.title}</span> to <span class="hljs-subst">{openaiBase}</span>"</span>);
OpenAIService openaiService = <span class="hljs-keyword">new</span> OpenAIService(openaiApiKey, openaiBase, openaiEngine);
<span class="hljs-keyword">string</span> generatedCode = <span class="hljs-keyword">await</span> openaiService.Create(issue.body ?? String.Empty);
<span class="hljs-keyword">if</span> (!<span class="hljs-keyword">string</span>.IsNullOrEmpty(generatedCode))
{
<span class="hljs-comment">// Post a comment with the generated code to the GitHub issue</span>
<span class="hljs-keyword">string</span> comment = <span class="hljs-string">$"Generated code:\n\n```<span class="hljs-subst">{generatedCode}</span>\n```"</span>;
<span class="hljs-keyword">bool</span> commentPosted = <span class="hljs-keyword">await</span> GitHubService.PostIssueComment(repo, issue.number, comment, githubToken);
<span class="hljs-keyword">if</span> (commentPosted)
{
Console.WriteLine(<span class="hljs-string">"Code generated and posted as a comment on the GitHub issue."</span>);
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to post the comment on the GitHub issue."</span>);
}
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to generate code from the GitHub issue."</span>);
}
}
}
<span class="hljs-keyword">else</span>
{
Console.WriteLine(<span class="hljs-string">"Failed to retrieve the GitHub issue."</span>);
}
}
<span class="hljs-keyword">catch</span> (Exception ex)
{
Console.WriteLine(<span class="hljs-string">$"Error: <span class="hljs-subst">{ex.Message}</span>"</span>);
}
}
</code></pre>
<p>The OpenAI service class handles the OpenAI API details. It takes default parameters for the model deployment (engine), temperature, and token limits, which control the cost and the amount of text (roughly four characters per token) that should be allowed. For this service, we use the "Completion" model, which allows developers to interact with OpenAI's language models and generate text-based completions.</p>
<pre><code class="lang-csharp">
<span class="hljs-keyword">internal</span> <span class="hljs-keyword">class</span> <span class="hljs-title">OpenAIService</span>
{
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> apiKey;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> engine;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> endPoint;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">float</span> temperature;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> maxTokens;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">int</span> n;
<span class="hljs-keyword">private</span> <span class="hljs-keyword">string</span> stop;
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> OpenAI client</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-keyword">private</span> OpenAIClient? client;
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">OpenAIService</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> apiKey, <span class="hljs-keyword">string</span> endPoint, <span class="hljs-keyword">string</span> engine = <span class="hljs-string">"text-davinci-003"</span>, <span class="hljs-keyword">float</span> temperature = <span class="hljs-number">0.5</span>f, <span class="hljs-keyword">int</span> maxTokens = <span class="hljs-number">350</span>, <span class="hljs-keyword">int</span> n = <span class="hljs-number">1</span>, <span class="hljs-keyword">string</span> stop = <span class="hljs-string">""</span></span>)
</span>{
<span class="hljs-comment">// Configure the OpenAI client with your API key and endpoint </span>
client = <span class="hljs-keyword">new</span> OpenAIClient(<span class="hljs-keyword">new</span> Uri(endPoint), <span class="hljs-keyword">new</span> AzureKeyCredential(apiKey));
<span class="hljs-keyword">this</span>.apiKey = apiKey;
<span class="hljs-keyword">this</span>.endPoint = endPoint;
<span class="hljs-keyword">this</span>.engine = engine;
<span class="hljs-keyword">this</span>.temperature = temperature;
<span class="hljs-keyword">this</span>.maxTokens = maxTokens;
<span class="hljs-keyword">this</span>.n = n;
<span class="hljs-keyword">this</span>.stop = stop;
}
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"><summary></span></span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> Create a completion from a prompt</span>
<span class="hljs-comment"><span class="hljs-doctag">///</span> <span class="hljs-doctag"></summary></span></span>
<span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">async</span> Task<<span class="hljs-keyword">string</span>> <span class="hljs-title">Create</span>(<span class="hljs-params"><span class="hljs-keyword">string</span> prompt</span>)
</span>{
<span class="hljs-keyword">var</span> result = String.Empty;
<span class="hljs-keyword">if</span> (!String.IsNullOrEmpty(prompt) && client != <span class="hljs-literal">null</span>)
{
Response<Completions> completionsResponse = <span class="hljs-keyword">await</span> client.GetCompletionsAsync(engine, prompt);
Console.WriteLine(completionsResponse);
result = completionsResponse.Value.Choices[<span class="hljs-number">0</span>].Text.Trim();
Console.WriteLine(result);
}
<span class="hljs-keyword">return</span> result;
}
}
</code></pre>
<h3 id="run-the-code">Run the code</h3>
<p>After configuring your environment and downloading the code, we can run the code from a terminal by typing the following command from the project folder:</p>
<blockquote>
<p>π Make sure to enter your repo name and label your issues with either user-story or any other label you would rather use.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-comment">#</span> <span class="hljs-comment">dotnet</span> <span class="hljs-comment">run</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">repo</span> <span class="hljs-comment">ozkary/ai</span><span class="hljs-literal">-</span><span class="hljs-comment">engineering</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">label</span> <span class="hljs-comment">user</span><span class="hljs-literal">-</span><span class="hljs-comment">story</span>
</code></pre>
<p>After running the code successfully, we should be able to see the generated code as a comment on the GitHub issue.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-ai-engineering-code-generated.png" alt="ozkary-ai-engineering-generate-code-from-user-stories" title="Generate Code from User Stories Back to GitHub"></p>
<h2 id="summary">Summary</h2>
<p>The Azure OpenAI Service provides a seamless integration of OpenAI models into the Azure platform, offering the benefits of Azure's security, compliance, management, and billing capabilities. On the other hand, using the OpenAI API directly allows for a more direct and independent integration with OpenAI services. It may be a preferable option if you have specific requirements, and you do not want to use Azure resources.</p>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-17482205044427293372023-05-27T11:22:00.009-04:002023-08-05T12:26:47.334-04:00Data Engineering Process Fundamentals - Pipeline and Orchestration Exercise<p>Once we have gained an understanding of data pipelines and their orchestration, along with the various programming options and technical tools at our disposal, we can proceed with the implementation and configuration of our own data pipeline. We have the flexibility to adopt either a code-centric approach, leveraging languages like Python, or a low-code approach, utilizing tools such as Azure Data Factory. This allows us to evaluate and compare the effectiveness of each approach based on our team's expertise and the operational responsibilities involved. Before diving into the implementation, let's first review our pipeline process to ensure a clear road map for our journey ahead.</p>
<h2 id="data-flow-process">Data Flow Process</h2>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration-architecture.png" alt="ozkary-data-engineering-pipeline-orchestration-flow" title="Data Engineering Process Fundamentals - Pipeline and Orchestration Flow"></p>
<p>Our basic data flow can be defined as the following:</p>
<ul>
<li>Define the date when a new CSV becomes available</li>
<li>Perform an HTTP Get request to download the CSV file for the selected date<ul>
<li>Example: <a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a></li>
</ul>
</li>
<li>Compress the text file and upload it in chunks to the data lake container</li>
</ul>
<p>After the file is copied to our data lake, the data transformation service picks up the file, identifies new data, and inserts it into the data warehouse. We will take a look at the Data Warehouse and Transformation services in the next step of the process. </p>
<blockquote>
<p>π Since a new file is available weekly, this data integration project fits into the batch processing model. For real-time scenarios, we should use data streaming technologies like <a href="https://kafka.apache.org/">Apache Kafka</a> with <a href="https://spark.apache.org/">Apache Spark</a> </p>
</blockquote>
<h3 id="initial-data-load">Initial Data Load</h3>
<p>When there are requirements to load previous data, we first need to run a batch process to load all the previous months of data. Since the files are available weekly, we need to write code that can accept a date range, identify all the past Saturdays, and copy each file into our data lake. The process can be executed in parallel by running different years or months (if only one year is selected) in each process. This way, multiple threads can be used to copy the data, which should reduce the processing time.</p>
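<p>A minimal sketch of how the backfill could resolve every Saturday within a date range is shown below; the function name is illustrative and not part of the repo.</p>
<pre><code class="lang-python">from datetime import date, timedelta

def past_saturdays(start: date, end: date) -> list:
    """Return every Saturday between start and end (inclusive)."""
    offset = (5 - start.weekday()) % 7       # Monday=0 ... Saturday=5
    first = start + timedelta(days=offset)   # first Saturday on or after the start date
    return [first + timedelta(weeks=w)
            for w in range(((end - first).days // 7) + 1)
            if first <= end]

# example: all the Saturdays for the first quarter of 2023
for saturday in past_saturdays(date(2023, 1, 1), date(2023, 3, 31)):
    print(saturday.strftime("%y%m%d"))       # e.g. 230318, matching the file name suffix
</code></pre>
<p>Each resolved date can then be handed to a separate process or thread, which downloads and uploads its file in parallel.</p>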
<p>Moving forward, the process will target the specific date on which the file becomes available. The process does not allow the download of future data files, so attempts to pass future dates are rejected.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipepine-data-lake.png" alt="ozkary-data-engineering-data-lake-files" title="Data Engineering Process Fundamentals- Data Lake Files"></p>
<h3 id="weekly-automation">Weekly Automation</h3>
<p>Since the files are available on a weekly basis, we use a batch processing approach to process those files. For that, we create a scheduled job on our automation tool. This trigger should run on the day that the file is available, so a dynamic parameter can be created based on the current date value. The code can then parse this date and resolve the file name format to download the corresponding file.</p>
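<p>A small sketch of how the trigger date could be parsed into the weekly file URL is shown below; the helper function is hypothetical, but the URL format matches the example file above.</p>
<pre><code class="lang-python">from datetime import date

BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/"

def resolve_file_url(run_date: date) -> str:
    """Build the weekly turnstile file URL from the trigger date."""
    return f"{BASE_URL}turnstile_{run_date.strftime('%y%m%d')}.txt"

# a scheduled trigger can pass the current date as a dynamic parameter
print(resolve_file_url(date(2023, 3, 18)))
# -> http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt
</code></pre>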
<h3 id="monitor-the-jobs">Monitor the jobs</h3>
<p>It is very important to be able to monitor the jobs and create alerts in case there are failures. This should allow the teams to identify and address problems quickly. Therefore, it is important that we select a code-centric framework or a platform that provides an integrated monitoring and alerting system.</p>
<h2 id="programming-language-and-tooling">Programming Language and Tooling</h2>
<p>A code-centric data pipeline involves a higher coding effort, using a programming language, supporting libraries, and a cloud platform that enable us to quickly implement our pipelines and collect telemetry to monitor our jobs. In our case, Python provides a versatile and powerful programming language for building data pipelines, with various frameworks available to streamline the process. Three popular options for Python-based data pipelines are Prefect, Apache Airflow, and Apache Spark. </p>
<ul>
<li><p><a href="https://airflow.apache.org/">Apache Airflow</a> is a robust platform for creating, scheduling, and monitoring complex workflows. It uses Directed Acyclic Graphs (DAGs) to define pipelines and supports a rich set of operators for different data processing tasks.</p>
</li>
<li><p><a href="https://spark.apache.org/">Apache Spark</a> is a distributed data processing engine that provides high-speed data processing capabilities. It supports complex transformations, real-time streaming, and advanced analytics, making it suitable for large-scale data processing.</p>
</li>
<li><p><a href="https://www.prefect.io/">Prefect</a> is a modern workflow management system that enables easy task scheduling, dependency management, and error handling. It emphasizes code-driven workflows and offers a user-friendly interface.</p>
</li>
</ul>
<p>For low-code efforts, <a href="https://azure.microsoft.com/en-us/products/data-factory/">Azure Data Factory</a> is a cloud-based data integration service provided by Microsoft. It offers a visual interface for building and orchestrating data pipelines, making it suitable for users with less coding experience.</p>
<blockquote>
<p>π There are several platforms for low-code solutions. Some of them provide a total enterprise turn-key solution to build the entire pipeline and orchestration. These platforms, however, come at a higher financial cost.</p>
</blockquote>
<p>When choosing between these options, we should consider factors such as the complexity of the pipeline, scalability requirements, ease of use, and integration with other tools and systems. Each framework has its strengths and use cases, so selecting the most suitable one depends on your specific project needs.</p>
<h2 id="pipeline-implementation-requirements">Pipeline Implementation Requirements</h2>
<p>For our example, we will take on a code-centric approach and use Python as our programming language. In addition, we use the Prefect libraries and cloud services to manage the orchestration. After we are done with the code-centric approach, we take a look at using a low-code approach with Azure Data Factory, so we can compare between the two different approaches. </p>
<p>Before we get started, we need to setup our environment with all the necessary dependencies.</p>
<h3 id="requirements">Requirements</h3>
<ul>
<li>Docker and Docker Hub<ul>
<li><a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Docker">Install Docker</a></li>
<li><a href="https://hub.docker.com/">Create a Docker Hub Account</a></li>
</ul>
</li>
<li>Prefect dependencies and cloud account<ul>
<li>Install the Prefect Python library dependencies</li>
<li><a href="https://www.prefect.io/">Create a Prefect Cloud account</a></li>
</ul>
</li>
<li>Data Lake for storage<ul>
<li>Make sure to have the storage account and access ready</li>
</ul>
</li>
</ul>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step3-Orchestration/" target="_pipeline">Clone this repo or copy the files from this folder
</a></p>
<h3 id="prefect-configuration">Prefect Configuration</h3>
<ul>
<li>Install the Python libraries using the requirements file from the repo</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd Step3-Orchestration
<span class="hljs-variable">$ </span>pip install -r prefect-requirements.txt
</code></pre>
<ul>
<li>Make sure to run the terraform script to build the VM, Data lake and BigQuery resources as shown on the Design and Planning exercise</li>
<li>Copy the GCP credentials file to follow this format</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-keyword">cd</span> ~ && <span class="hljs-built_in">mkdir</span> -<span class="hljs-keyword">p</span> ~/.gcp/
$ <span class="hljs-keyword">cp</span> <path <span class="hljs-keyword">to</span> JSON <span class="hljs-keyword">file</span>> ~/.gcp/credentials.json
</code></pre>
<h4 id="create-the-prefect-cloud-account">Create the PREFECT Cloud Account</h4>
<blockquote>
<p>π Log in to Prefect Cloud. API keys can be created from the user profile configuration (click your profile picture)</p>
</blockquote>
<ul>
<li><p>From a terminal, login with Prefect cloud to host the blocks, deployments, and dashboards on the Cloud</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">cloud </span>login
<span class="hljs-comment"># or use an API key to login instead</span>
<span class="hljs-comment"># prefect cloud login -k API_KEY_FROM_PREFECT</span>
</code></pre>
<p>The login creates a key file ~/.prefect/profiles.toml which the framework looks for to authenticate the pipeline.</p>
</li>
<li><p>Install the Prefect code blocks dependencies and run the "block ls" command to check that there are none installed</p>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">block </span>register -m <span class="hljs-keyword">prefect_gcp
</span>$ <span class="hljs-keyword">prefect </span><span class="hljs-keyword">block </span>ls
</code></pre>
<h3 id="list-of-resources-that-are-needed">List of resources that are needed</h3>
<p> These are the resource names that are used by the code. </p>
<ul>
<li>Data lake name<ul>
<li>mta_data_lake</li>
</ul>
</li>
<li>Prefect Account block name<ul>
<li>blk-gcp-svc-acc</li>
</ul>
</li>
<li>Prefect GCS (storage) block name<ul>
<li>blk-gcs_name</li>
</ul>
</li>
<li>Prefect Deployments<ul>
<li>dep-docker-mta </li>
</ul>
</li>
<li>Docker container name after pushing to Docker Hub<ul>
<li>ozkary/prefect:mta-de-101</li>
</ul>
</li>
</ul>
<h2 id="review-the-code">Review the Code</h2>
<p>After setting up all the dependencies, we can move forward to look at the actual code. We can start by reviewing the code blocks or components. We can then view the actual pipeline code, and how it is wired, so we can enable the flow telemetry in our pipeline.</p>
<h3 id="code-blocks-or-components">Code Blocks or Components</h3>
<blockquote>
<p>π Blocks are secured and reusable components that manage a single technical concern and can be used by our applications</p>
</blockquote>
<h4 id="credentials-component">Credentials Component</h4>
<p>Since we need secured access to cloud resources, we first need to create a credentials component to store the cloud key file. We can then use this component in other areas of the code whenever we need to do a cloud operation. The save operation done by the code pushes the component to the cloud, so it is centralized.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> prefect_gcp <span class="hljs-keyword">import</span> GcpCredentials
<span class="hljs-comment"># insert your own service_account_file path or service_account_info dictionary from the json file</span>
<span class="hljs-comment"># IMPORTANT - do not store credentials in a publicly available repository!</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""entry point to create the prefect block for GCP service account"""</span>
gcp_file_path = params.file_path
account_block_name = params.gcp_acc_block_name
file_handle = Path(gcp_file_path) <span class="hljs-comment">#.read_text()</span>
print(file_handle.read_text())
<span class="hljs-keyword">if</span> file_handle.exists() :
content = file_handle.read_text()
<span class="hljs-keyword">if</span> content :
credentials_block = GcpCredentials(
service_account_info=content <span class="hljs-comment"># set the file credential</span>
)
credentials_block.save(account_block_name, overwrite=<span class="hljs-keyword">True</span>)
print(<span class="hljs-string">'block was saved'</span>)
<span class="hljs-keyword">else</span>:
print(F<span class="hljs-string">'{gcp_file_path} not found'</span>)
os.system(<span class="hljs-string">'prefect block ls'</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable Credential block'</span>)
parser.add_argument(<span class="hljs-string">'--file_path'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'key file path for the service account'</span>)
parser.add_argument(<span class="hljs-string">'--gcp_acc_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'prefect block name to hold the service account setting'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="cloud-storage-component">Cloud Storage Component</h4>
<p>The cloud storage component enables us to reuse the credentials component, so applications can be authenticated and authorized to access it. This component also has support for uploading files to the storage container, thus simplifying our code. Similar to the credentials component, this component is saved on the cloud.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect_gcp <span class="hljs-keyword">import</span> GcpCredentials
<span class="hljs-keyword">from</span> prefect_gcp.cloud_storage <span class="hljs-keyword">import</span> GcsBucket
<span class="hljs-comment"># insert your own service_account_file path or service_account_info dictionary from the json file</span>
<span class="hljs-comment"># IMPORTANT - do not store credentials in a publicly available repository!</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""entry point to create the prefect block for GCS"""</span>
account_block_name = params.gcp_acc_block_name
gcs_bucket_name = params.gcs_bucket_name
gcs_block_name = params.gcs_block_name
credentials = GcpCredentials.load(account_block_name)
<span class="hljs-keyword">if</span> credentials :
bucket_block = GcsBucket(
gcp_credentials=credentials,
bucket=gcs_bucket_name <span class="hljs-comment"># insert your GCS bucket name</span>
)
<span class="hljs-comment"># save the bucket</span>
bucket_block.save(gcs_block_name, overwrite=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Ingest CSV data to storage'</span>)
parser.add_argument(<span class="hljs-string">'--gcp_acc_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'prefect block name which holds the service account'</span>)
parser.add_argument(<span class="hljs-string">'--gcs_bucket_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'GCS bucket name'</span>)
parser.add_argument(<span class="hljs-string">'--gcs_block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'GCS block name'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="docker-container-component">Docker Container Component</h4>
<p>Since we are running our pipeline in a Docker container, we also want to write a component that manages that technical concern. This allows us to pull the Docker image from Docker Hub when we are ready to deploy and run the pipeline. We will learn more about deployments as we create our Docker deployment definition.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect.infrastructure.docker <span class="hljs-keyword">import</span> DockerContainer
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a Docker prefect block"""</span>
block_name = params.block_name
image_name = params.image_name
<span class="hljs-comment"># alternative to creating DockerContainer block in the UI</span>
docker_block = DockerContainer(
image=image_name, <span class="hljs-comment"># insert your image here</span>
image_pull_policy=<span class="hljs-string">"ALWAYS"</span>,
auto_remove=<span class="hljs-keyword">True</span>,
)
docker_block.save(block_name, overwrite=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable Docker image block from Docker Hub'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect block name'</span>)
parser.add_argument(<span class="hljs-string">'--image_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Docker image name used when the image was build'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h3 id="deployments">Deployments</h3>
<p>Cloud deployments are used to deploy and manage pipelines in a production environment. Deployments provide a centralized way to run and monitor pipelines across multiple execution environments, such as local machines, cloud-based infrastructure, and on-premises clusters. </p>
<h4 id="docker-deployment">Docker Deployment</h4>
<p>With a deployment definition, we can associate a Docker image that is hosted on Docker Hub with the pipeline. This enables us to automate the deployment of that image to other environments when we are ready to run the pipeline. The code below associates a Docker component with a deployment definition on the cloud. It also defines the main flow entry point (main_flow) from the etl_web_to_gcs.py file, so it can be easily executed as a scheduled task from the terminal.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> prefect.deployments <span class="hljs-keyword">import</span> Deployment
<span class="hljs-keyword">from</span> prefect.infrastructure.docker <span class="hljs-keyword">import</span> DockerContainer
sys.path.append(os.path.join(os.path.dirname(__file__), <span class="hljs-string">'..'</span>, <span class="hljs-string">'flows'</span>))
<span class="hljs-keyword">from</span> etl_web_to_gcs <span class="hljs-keyword">import</span> main_flow
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a prefect deployment"""</span>
block_name = params.block_name
deploy_name = params.deploy_name
<span class="hljs-comment"># use the prefect block name for the container</span>
docker_block = DockerContainer.load(block_name)
docker_dep = Deployment.build_from_flow(
flow=main_flow,
name=deploy_name,
infrastructure=docker_block
)
docker_dep.apply()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable prefect deployment script'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect Docker block name'</span>)
parser.add_argument(<span class="hljs-string">'--deploy_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect deployment name'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h4 id="github-deployment">GitHub Deployment</h4>
<p>In cases where a Docker image is not used, we can instead create a deployment definition that uses GitHub as the code storage. This allows us to download the code to other environments, where the dependencies need to be installed before running the code. The build_from_flow operation defines which file and which entry point (function) of that file to use. In this example, we are using the etl_web_to_gcs.py file and the main_flow function.</p>
<pre><code class="lang-python">
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> prefect.deployments <span class="hljs-keyword">import</span> Deployment
<span class="hljs-keyword">from</span> etl_web_to_gcs <span class="hljs-keyword">import</span> main_flow
<span class="hljs-keyword">from</span> prefect.filesystems <span class="hljs-keyword">import</span> GitHub
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span><span class="hljs-params">(params)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""Create a prefect deployment with github"""</span>
block_name = params.block_name
deploy_name = params.deploy_name
github_path = params.github_path
github_block = GitHub.load(block_name)
deployment = Deployment.build_from_flow(
flow=main_flow,
name=deploy_name,
storage=github_block,
entrypoint=f<span class="hljs-string">"{github_path}/etl_web_to_gcs.py:main_flow"</span>)
deployment.apply()
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Create a reusable prefect deployment script'</span>)
parser.add_argument(<span class="hljs-string">'--block_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Github block name'</span>)
parser.add_argument(<span class="hljs-string">'--deploy_name'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Prefect deployment name'</span>)
parser.add_argument(<span class="hljs-string">'--github_path'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'Github folder path where the pipeline file is located'</span>)
args = parser.parse_args()
main(args)
</code></pre>
<h3 id="pipeline-flows-and-tasks">Pipeline Flows and Tasks</h3>
<p>A pipeline is implemented by defining flows and tasks, which can be written in Python, C#, or other languages. Flows are composed of multiple tasks and define the sequence and dependencies between them. A flow is marked with the @flow function decorator (or attribute), which is specific to the Python library we are using here (Prefect). The decorator also allows us to define the flow's name, description, and other attributes, such as the number of retries in case of failures.</p>
<p>Tasks are defined with the @task function decorator (or attribute). Tasks are individual units of work that can be combined to form a data pipeline. They represent the different steps or operations that need to be performed within a workflow, and each task is responsible for executing a specific action or computation.</p>
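<p>As a quick illustration of those decorator attributes, here is a minimal, hypothetical sketch (not part of the project code) that shows how a task and a flow can declare a name, description, and retry settings with Prefect:</p>
<pre><code class="lang-python">from prefect import flow, task

@task(name="download_file", description="Download a source file", retries=3, retry_delay_seconds=60)
def download_file(url: str) -> str:
    # the download logic would go here; the return value is illustrative
    return "local_path.csv.gz"

@flow(name="sample_flow", description="Minimal flow showing retry settings", retries=1)
def sample_flow(url: str) -> None:
    # the flow coordinates the task calls and their dependencies
    local_path = download_file(url)
    print(local_path)

if __name__ == '__main__':
    sample_flow("https://example.com/source-file")
</code></pre>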
<p>In our example, we have the main_flow function, which uses another flow (etl_web_to_local) to handle the file download from the Web to local storage. The main flow also uses tasks to handle the input validation and file name formatting to make sure the values match the specific dates for which a new CSV file is available for download. Finally, there is a task to write a compressed CSV file to the data lake using our components.</p>
<p>By putting together flows and tasks that handle a specific workflow, we build a pipeline that enables us to download files into our data lake. At the same time, by using those function decorators, we are enabling the Prefect framework to use its internal classes to track telemetry information for each flow and task in our pipeline, which enables us to monitor and track failures at a specific point in the pipeline. Let's see what our pipeline implementation looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> flow, task
<span class="hljs-keyword">from</span> prefect_gcp.cloud_storage <span class="hljs-keyword">import</span> GcsBucket
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List
<span class="hljs-comment"># from prefect.tasks import task_input_hash</span>
<span class="hljs-keyword">from</span> settings <span class="hljs-keyword">import</span> get_block_name, get_min_date, get_max_date, get_prefix, get_url
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> timedelta, date
<span class="hljs-meta">@task(name="write_gcs", description='Write file gcs', log_prints=False)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_gcs</span><span class="hljs-params">(local_path: Path, file_name: str, prefix: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Upload the local compressed CSV file to GCS
Args:
path: File location
prefix: the folder location on storage
"""</span>
block_name = get_block_name()
gcs_path = f<span class="hljs-string">'{prefix}/{file_name}.csv.gz'</span>
print(f<span class="hljs-string">'{block_name} {local_path} {gcs_path}'</span>)
gcs_block = GcsBucket.load(block_name)
gcs_block.upload_from_path(from_path=local_path, to_path=gcs_path)
<span class="hljs-keyword">return</span>
<span class="hljs-meta">@task(name='write_local', description='Writes the file into a local folder')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_local</span><span class="hljs-params">(df: pd.DataFrame, folder: str, file_path: Path)</span> -> Path:</span>
<span class="hljs-string">"""
Write DataFrame out locally as csv file
Args:
df: dataframe chunk
folder: the download data folder
file_path: the local file path
"""</span>
path = Path(folder)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(path):
path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
df = df.rename(columns={<span class="hljs-string">'C/A'</span>: <span class="hljs-string">'CA'</span>})
df = df.rename(columns=<span class="hljs-keyword">lambda</span> x: x.strip().replace(<span class="hljs-string">' '</span>, <span class="hljs-string">''</span>))
<span class="hljs-comment"># df = df.rename_axis('row_no').reset_index()</span>
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.isfile(file_path):
df.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
<span class="hljs-comment"># df.to_parquet(file_path, compression="gzip", engine='fastparquet')</span>
print(<span class="hljs-string">'new file'</span>, flush=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">else</span>:
df.to_csv(file_path, header=<span class="hljs-keyword">None</span>, compression=<span class="hljs-string">"gzip"</span>, mode=<span class="hljs-string">"a"</span>)
<span class="hljs-comment"># df.to_parquet(file_path, compression="gzip", engine='fastparquet', append=True) </span>
print(<span class="hljs-string">'chunk appended'</span>, flush=<span class="hljs-keyword">True</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-meta">@flow(name='etl_web_to_local', description='Download MTA File in chunks')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_web_to_local</span><span class="hljs-params">(name: str, prefix: str)</span> -> Path:</span>
<span class="hljs-string">"""
Download a file
Args:
name : the file name
prefix: the file prefix
"""</span>
<span class="hljs-comment"># skip an existent file</span>
path = f<span class="hljs-string">"//www.ozkary.dev/data/"</span>
file_path = Path(f<span class="hljs-string">"{path}/{name}.csv.gz"</span>)
<span class="hljs-keyword">if</span> os.path.exists(file_path):
print(f<span class="hljs-string">'{name} already processed'</span>)
<span class="hljs-keyword">return</span> file_path
url = get_url()
file_url = f<span class="hljs-string">'{url}/{prefix}_{name}.txt'</span>
print(file_url)
<span class="hljs-comment"># os.system(f'wget {url} -O {name}.csv')</span>
<span class="hljs-comment"># return</span>
df_iter = pd.read_csv(file_url, iterator=<span class="hljs-keyword">True</span>, chunksize=<span class="hljs-number">5000</span>)
<span class="hljs-keyword">if</span> df_iter:
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
write_local(df, path, file_path)
<span class="hljs-keyword">except</span> StopIteration <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Finished reading file {ex}"</span>)
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was downloaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">"dataframe failed"</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-meta">@task(name='get_file_date', description='Resolves the last file drop date') </span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_file_date</span><span class="hljs-params">(curr_date: date = date.today<span class="hljs-params">()</span>)</span> -> str:</span>
<span class="hljs-keyword">if</span> curr_date.weekday() != <span class="hljs-number">5</span>:
days_to_sat = (curr_date.weekday() - <span class="hljs-number">5</span>) % <span class="hljs-number">7</span>
curr_date = curr_date - timedelta(days=days_to_sat)
year_tag = str(curr_date.year)[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>]
file_name = f<span class="hljs-string">'{year_tag}{curr_date.month:02}{curr_date.day:02}'</span>
<span class="hljs-keyword">return</span> file_name
<span class="hljs-meta">@task(name='get_the_file_dates', description='Downloads the file in chunks')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_the_file_dates</span><span class="hljs-params">(year: int, month: int, day: int = <span class="hljs-number">1</span>, limit: bool = True )</span> -> List[str]:</span>
<span class="hljs-string">"""
Process all the Saturdays of the month
Args:
year : the selected year
month : the selected month
day: the file day
"""</span>
date_list = []
curr_date = date(year, month, day)
<span class="hljs-keyword">while</span> curr_date.month == month <span class="hljs-keyword">and</span> curr_date <= date.today():
<span class="hljs-comment"># print(f'Current date {curr_date}') </span>
<span class="hljs-keyword">if</span> curr_date.weekday() == <span class="hljs-number">5</span>:
<span class="hljs-comment"># add the date filename format yyMMdd</span>
year_tag = str(curr_date.year)[<span class="hljs-number">2</span>:<span class="hljs-number">4</span>]
file_name = f<span class="hljs-string">'{year_tag}{curr_date.month:02}{curr_date.day:02}'</span>
date_list.append(file_name)
curr_date = curr_date + timedelta(days=<span class="hljs-number">7</span>)
<span class="hljs-keyword">if</span> limit:
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">else</span>:
<span class="hljs-comment"># find next week</span>
days_to_sat = (<span class="hljs-number">5</span> - curr_date.weekday()) % <span class="hljs-number">7</span>
curr_date = curr_date + timedelta(days=days_to_sat)
<span class="hljs-keyword">return</span> date_list
<span class="hljs-meta">@task(name='valid_task', description='Validate the tasks input parameter')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">valid_task</span><span class="hljs-params">(year: int, month: int, day: int = <span class="hljs-number">1</span>)</span> -> bool:</span>
<span class="hljs-string">"""
Validates the input parameters for the request
Args:
year : the selected year
month : the selected month
day: file day
"""</span>
isValid = <span class="hljs-keyword">False</span>
<span class="hljs-keyword">if</span> month > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> month < <span class="hljs-number">13</span>:
curr_date = date(year, month, day)
min_date = get_min_date()
max_date = get_max_date()
isValid = curr_date >= min_date <span class="hljs-keyword">and</span> curr_date < max_date <span class="hljs-keyword">and</span> curr_date <= date.today()
print(f<span class="hljs-string">'task request status {isValid} input {year}-{month}'</span>)
<span class="hljs-keyword">return</span> isValid
<span class="hljs-meta">@flow (name="MTA Batch flow", description="MTA Multiple File Batch Data Flow. Defaults to the last Saturday date")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(year: int = <span class="hljs-number">0</span> , month: int = <span class="hljs-number">0</span>, day: int = <span class="hljs-number">0</span>, limit_one: bool = True)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Entry point to download the data
"""</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># if no params provided, resolve to the last saturday </span>
file_list: List[str] = []
<span class="hljs-keyword">if</span> (year == <span class="hljs-number">0</span>):
file_dt = get_file_date()
file_list.append(file_dt)
<span class="hljs-keyword">elif</span> valid_task(year, month, day):
file_list = get_the_file_dates(year, month, day, limit_one)
prefix = get_prefix()
<span class="hljs-keyword">for</span> file_name <span class="hljs-keyword">in</span> file_list:
print(file_name)
local_file_path = etl_web_to_local(file_name, prefix)
write_gcs(local_file_path, file_name, prefix)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"error found {ex}"</span>)
</code></pre>
<h4 id="function-decorators">Function Decorators</h4>
<p>In some programming languages, we can create function decorators or attributes that enables to enhance a specific function without altering its purpose. In Python, this can be done by defining a class with a <code>__call__</code> method, which allows instances of the class to be callable like functions. Within the <code>__call__</code> method, logic can be implemented to track telemetry data and then return the original function unchanged. Here's an example of a simple telemetry function decorator class:</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TelemetryDecorator</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, tracking_type)</span></span>:
<span class="hljs-keyword">self</span>.tracking_type = tracking_type
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__call__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, func)</span></span>:
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">wrapped_func</span><span class="hljs-params">(*args, **kwargs)</span></span>:
<span class="hljs-comment"># Track telemetry data here</span>
print(f<span class="hljs-string">"Tracking {self.tracking_type} for function {func.__name__}"</span>)
<span class="hljs-comment"># Call the original function with its parameters</span>
<span class="hljs-keyword">return</span> func(*args, **kwargs)
<span class="hljs-keyword">return</span> wrapped_func
<span class="hljs-comment"># Usage example:</span>
@TelemetryDecorator(tracking_type=<span class="hljs-string">"performance"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">my_task</span><span class="hljs-params">(x, y)</span></span>:
<span class="hljs-keyword">return</span> x + y
result = my_task(<span class="hljs-number">3</span>, <span class="hljs-number">5</span>)
</code></pre>
<h2 id="how-to-run-it">How to Run It</h2>
<p>After installing the pre-requisites and reviewing the code, we are ready to run our pipeline and set up our orchestration by configuring our components, deployment image and scheduling the runs. </p>
<h3 id="install-the-code-blocks-or-components-for-our-credentials-and-data-lake-access">Install the code blocks or components for our credentials and data lake access</h3>
<p>We should first authenticate our terminal with the cloud instance, which enables us to call the APIs needed to register our components. Next, we register the block dependencies. From the blocks folder, we register our components by running the Python scripts. We can then run the "block ls" command to see the components that have been registered.</p>
<blockquote>
<p>π Components are a secure way to store and retrieve the credentials and secrets that are used by your applications.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>prefect cloud login
<span class="hljs-variable">$ </span>prefect block register -m prefect_gcp
<span class="hljs-variable">$ </span>cd ./blocks
<span class="hljs-variable">$ </span>python3 gcp_acc_block.py --file_path=~<span class="hljs-regexp">/.gcp/credentials</span>.json --gcp_acc_block_name=blk-gcp-svc-acc
<span class="hljs-variable">$ </span>python3 gcs_block.py --gcp_acc_block_name=blk-gcp-svc-acc --gcs_bucket_name=mta_data_lake --gcs_block_name=blk-gcs-name
<span class="hljs-variable">$ </span>prefect block ls
</code></pre>
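<p>The gcp_acc_block.py and gcs_block.py scripts referenced above live in the project repo and are not listed here. As a rough sketch of what such a registration script typically looks like with the prefect_gcp library (the block and bucket names match the commands above, but treat the code as illustrative rather than the repo's exact implementation):</p>
<pre><code class="lang-python">from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

def register_blocks(credentials_file: str, acc_block_name: str, bucket_name: str, gcs_block_name: str) -> None:
    # register the service account credentials as a reusable block
    credentials = GcpCredentials(service_account_file=credentials_file)
    credentials.save(acc_block_name, overwrite=True)

    # register the GCS bucket block using the saved credentials block
    gcs_bucket = GcsBucket(bucket=bucket_name, gcp_credentials=GcpCredentials.load(acc_block_name))
    gcs_bucket.save(gcs_block_name, overwrite=True)

if __name__ == '__main__':
    register_blocks('~/.gcp/credentials.json', 'blk-gcp-svc-acc', 'mta_data_lake', 'blk-gcs-name')
</code></pre>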
<h3 id="create-a-docker-image-and-push-to-docker-hub">Create a docker image and push to Docker Hub</h3>
<p>We are adding our Python scripts to a Docker container, so we can create and push the image (ozkary/prefect:mta-de-101) to Docker Hub. This enables us to later create a deployment definition that references that image, so it can be downloaded from a centralized hub location to one or more environments.</p>
<blockquote>
<p>π Make sure to run the Docker build command from the directory where the Dockerfile is located, or use -f with the file path. Ensure Docker is also running.</p>
</blockquote>
<pre><code>$ docker login --username <span class="hljs-keyword">USER</span> <span class="hljs-title">--password</span> PW
$ docker image build -t ozkary/prefect:mta-de-<span class="hljs-number">101</span> .
$ docker image push ozkary/prefect:mta-de-<span class="hljs-number">101</span>
</code></pre><p>The Dockerfile defines the base image with Python already installed. We also copy a requirements file, which contains additional dependencies that need to be installed in the container image. Finally, we copy our code into the container, so when it runs, it is able to find the pipeline entry point, main_flow.</p>
<pre><code class="lang-yml"><span class="hljs-keyword">FROM</span> prefecthq/prefect:<span class="hljs-number">2.7</span>.<span class="hljs-number">7</span>-python3.<span class="hljs-number">9</span>
<span class="hljs-keyword">COPY</span><span class="bash"> docker-requirements.txt .
</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pip install -r docker-requirements.txt --trusted-host pypi.python.org --no-cache-dir
</span>
<span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /opt/prefect/data/
</span><span class="hljs-keyword">RUN</span><span class="bash"> mkdir -p /opt/prefect/flows/
</span>
<span class="hljs-keyword">COPY</span><span class="bash"> flows opt/prefect/flows
</span><span class="hljs-keyword">COPY</span><span class="bash"> data opt/prefect/data</span>
</code></pre>
<h3 id="create-the-prefect-block-with-the-docker-image">Create the prefect block with the docker image</h3>
<p>After creating the Docker image, we can register the Docker component (blk-docker-mta-de-101) with the image name reference, which is what allows us to pull that image from Docker Hub during a new deployment.</p>
<pre><code class="lang-bash">$ <span class="hljs-keyword">cd</span> ./blocks
$ <span class="hljs-keyword">python3</span> docker_block.<span class="hljs-keyword">py</span> --block_name=blk-docker-mta-de-<span class="hljs-number">101</span> --image_name=ozkary/prefec<span class="hljs-variable">t:mta</span>-de-<span class="hljs-number">101</span>
</code></pre>
<h3 id="create-the-deployment-with-the-docker-image">Create the deployment with the docker image</h3>
<p>We can now configure a cloud deployment by running our deployment definition file (docker_deploy_etl_web_to_gcs.py). For this configuration, we associate the Docker component (blk-docker-mta-de-101) with our definition. The configuration uses the component, which in turn defines where to get the Docker image from. We also set up a cron job to schedule the deployment to run on Saturdays at 9am. This scheduling of the deployments is an orchestration task. To verify everything is configured properly, we list the deployment configurations by running the "deployment ls" command. The listing of the deployments also enables us to confirm the deployment name and id, which can be used when we test run the deployment.</p>
<pre><code class="lang-bash">$ <span class="hljs-string">cd </span>./<span class="hljs-string">deployments
</span>$ <span class="hljs-string">python3 </span><span class="hljs-string">docker_deploy_etl_web_to_gcs.</span><span class="hljs-string">py </span><span class="hljs-built_in">--block_name=blk-docker-mta-de-101</span> <span class="hljs-built_in">--deploy_name=dep-docker-mta-de-101</span>
$ <span class="hljs-string">prefect </span><span class="hljs-string">deployments </span><span class="hljs-string">build </span><span class="hljs-string">etl_web_to_gcs.</span><span class="hljs-string">py:main_flow </span><span class="hljs-built_in">--name</span> <span class="hljs-string">dep-docker-</span><span class="hljs-string">mta-de-</span><span class="hljs-string">101 </span><span class="hljs-built_in">--tag</span> <span class="hljs-string">mta </span><span class="hljs-built_in">--work-queue</span> <span class="hljs-string">default </span><span class="hljs-built_in">--cron</span> <span class="hljs-string">'0 9 * * 6'</span>
$ <span class="hljs-string">prefect </span><span class="hljs-string">deployments </span><span class="hljs-string">ls</span>
</code></pre>
<blockquote>
<p>π Scheduled jobs can also be managed from the cloud dashboards</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-job.png" alt="ozkary-data-engineering-pipeline-jobs" title="Data Engineering Process Fundamentals- Pipeline Jobs"></p>
<h3 id="start-the-prefect-agent">Start the Prefect agent</h3>
<p>The agent should be running so that the scheduled deployments can be executed. If the Docker image has not been downloaded yet, the agent pulls it before executing the code.</p>
<pre><code class="lang-bash">$ prefect agent start -q <span class="hljs-keyword">default</span>
</code></pre>
<h3 id="test-run-the-prefect-deployments-with-the-docker-image">Test run the prefect deployments with the docker image</h3>
<p>This next command will download the Docker image and run the entry point, main_flow. The additional parameters are also provided, so the pipeline can download the file for the specified year, month and day.</p>
<pre><code class="lang-bash">$ prefect deployment <span class="hljs-keyword">run</span><span class="bash"> <span class="hljs-string">"MTA Batch flow/dep-docker-mta-de-101"</span> -p <span class="hljs-string">"year=2023 month=3 day=25"</span></span>
</code></pre>
<h3 id="manual-test-run-can-be-done-from-a-terminal">Manual test run can be done from a terminal</h3>
<p>A manual test run can also be executed from the command line to help us identify any possible bugs without having to run the app from the container. Run the code directly from the terminal by typing this command:</p>
<pre><code class="lang-bash"><span class="hljs-comment">$</span> <span class="hljs-comment">python3</span> <span class="hljs-comment">etl_web_to_gcs</span><span class="hljs-string">.</span><span class="hljs-comment">py</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">year</span> <span class="hljs-comment">2023</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">month</span> <span class="hljs-comment">5</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">day</span> <span class="hljs-comment">6</span>
</code></pre>
<h3 id="see-the-flow-runs-from-the-cli">See the flow runs from the CLI</h3>
<p>To check the actual flow runs, we can use the "flow-run ls" command. This should show the date and time when the flow has been executed.</p>
<pre><code class="lang-bash">$ prefect flow-<span class="hljs-keyword">run</span><span class="bash"> ls</span>
</code></pre>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-console-flows.png" alt="ozkary-data-engineering-prefect-flow-run" title="Data Engineering Process Fundamentals- Pipeline Runs CLI"></p>
<blockquote>
<p>π Flow runs can also be visualized from the cloud dashboards. To get more telemetry details about the pipeline, we can look at the flow dashboards on the cloud.</p>
</blockquote>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-pipeline-dashboard-runs.png" alt="ozkary-data-engineering-prefect-flow-run" title="Data Engineering Process Fundamentals- Pipeline Runs Dashboard"></p>
<h3 id="github-action-to-build-and-deploy-the-docker-image-to-docker-hub">GitHub Action to build and deploy the Docker image to Docker Hub</h3>
<p>So far, we have shown how to build and push our Docker images via the CLI. A more mature approach is to enable that process in a deployment pipeline. With GitHub, we have CI/CD pipelines that can automate this process and that are triggered when a change is made to the code and a pull request (PR) is merged into the branch. This is called a GitHub Action. A simple script to handle that automation is shown below:</p>
<pre><code class="lang-yml">
<span class="hljs-attribute">name</span>: Build and Push Docker Image
<span class="yaml"><span class="hljs-attr">on:</span>
<span class="hljs-attr"> push:</span>
<span class="hljs-attr"> branches:</span>
<span class="hljs-bullet"> -</span> main
<span class="hljs-attr">jobs:</span>
<span class="hljs-attr"> build-and-push:</span>
<span class="hljs-attr"> runs-on:</span> ubuntu-latest
<span class="hljs-attr"> steps:</span>
<span class="hljs-attr"> - name:</span> Checkout repository
<span class="hljs-attr"> uses:</span> actions/checkout@v2
<span class="hljs-attr"> - name:</span> Set up Docker Buildx
<span class="hljs-attr"> uses:</span> docker/setup-buildx-action@v1
<span class="hljs-attr"> - name:</span> Login to Docker Hub
<span class="hljs-attr"> uses:</span> docker/login-action@v1
<span class="hljs-attr"> with:</span>
<span class="hljs-attr"> username:</span> ${{ secrets.DOCKERHUB_USERNAME }}
<span class="hljs-attr"> password:</span> ${{ secrets.DOCKERHUB_PASSWORD }}
<span class="hljs-attr"> - name:</span> Build and push Docker image
<span class="hljs-attr"> env:</span>
<span class="hljs-attr"> DOCKER_REPOSITORY:</span> ${{ secrets.DOCKERHUB_USERNAME }}/prefect:mta-de<span class="hljs-bullet">-101</span>
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
docker buildx create --use
docker buildx build --push --platform linux/amd64,linux/arm64 -t $DOCKER_REPOSITORY .</span></span>
</code></pre>
<h2 id="low-code-data-pipeline">Low-Code Data Pipeline</h2>
<p>After learning about a code-centric pipeline, we can transition into a low-code approach, which marks a significant evolution in the way data engineering projects are implemented. In the code-centric approach, engineers create and manage every aspect of the pipeline through code, providing maximum flexibility and control. On the other hand, the low-code approach, exemplified by platforms like Azure Data Factory, empowers data engineers to design and orchestrate pipelines with visual interfaces and pre-built components. This results in faster development and a more streamlined pipeline creation process. The low-code approach is especially beneficial for less experienced developers or projects where speed and simplicity are essential.</p>
<h3 id="pipeline-with-azure-data-factory">Pipeline with Azure Data Factory</h3>
<blockquote>
<p>π <a href="https://learn.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-azure-cli">Setup an Azure Data Factory Resource</a></p>
</blockquote>
<p>To show a low-code approach, we will write our data pipeline using Azure Data Factory. Following a similar approach, we can design an efficient data ingestion process that involves compressing and copying CSV files to Blob storage. The pipeline consists of two essential steps to streamline the process.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-azure-data-factory.png" alt="ozkary-data-engineering-azure-data-factory" title="Data Engineering Process Fundamentals- Azure Data Factory"></p>
<ul>
<li><p>Set Pipeline Variable - To ensure proper file naming, we use a code snippet to dynamically set a pipeline variable with today's date in the format "yymmdd.txt". This allows us to create a file name for a specific drop date. This variable is then used by the Copy Data activity.</p>
</li>
<li><p>Copy with Compression - We initiate a data copy action from the website "<a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a>". This action has a source configuration where we can define the file to download dynamically. There is also a destination configuration, which links to our Blob storage and has a setting to compress and parse the CSV file. As the data is copied, the CSV file is compressed into the GZ format, optimizing storage and reducing costs. The compressed file is then stored in the designated Blob container in our Data Lake.</p>
</li>
</ul>
<p>By implementing this data pipeline, we achieve a seamless and automated data ingestion process, ensuring that data is efficiently transferred and stored in a cost-effective manner. The platform also manages all the orchestration concerns like monitoring, scheduling, logging, and integration. We should also note that this is a third-party managed service, and there is a cost based on resource usage. Depending on the project, this cost could be lower than the coding effort or higher compared to the code-centric approach.</p>
<h2 id="summary">Summary</h2>
<p>For our code-centric approach, we used Python to code each step of the pipeline to meet our specific requirements. Python allows us to create custom tasks and workflows, providing flexibility and control over the pipeline process. We deploy our pipeline within Docker containers, ensuring consistency across different environments. This facilitates seamless deployment and scalability, making it easier to manage the pipeline as it grows in complexity and volume.</p>
<p>For the pipeline orchestration, we are using the power of cloud technologies to host our code for deployments and execution, log the telemetry data to track the performance and health of the process, schedule and monitor our deployments to manage our operational concerns. </p>
<p>While the code-centric approach offers more granular control, it also demands more development and DevOps activities. On the other hand, a low-code approach, like Azure Data Factory, abstracts some complexity, making it faster and simpler to set up data pipelines.</p>
<p>The choice between a code-centric and low-code approach depends on the team's expertise, project requirements, and long-term goals. Python, combined with Docker and CI/CD, empowers data engineers to create sophisticated pipelines, while platforms like Azure Data Factory offer a faster and more accessible solution for specific use cases.</p>
<h2 id="next-step">Next Step</h2>
<p>Having successfully established a robust data pipeline and data orchestration, it is now time to embark on the next phase of our data engineering process: the design and implementation of a data warehouse.
</p>
<blockquote>
<p>π <a href="https://www.ozkary.dev/data-engineering-process-fundamentals-data-warehouse-transformation/" title="Data Engineering Process Fundamentals - Data Warehouse and Transformation by ozkary">Data Engineering Process Fundamentals - Data Warehouse and Transformation</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-85830960650247131552023-05-20T14:08:00.034-04:002023-07-27T11:35:07.035-04:00Data Engineering Process Fundamentals - Pipeline and Orchestration<p>After completing the Design and Planning phase in the data engineering process, we can transition into the implementation and orchestration of our data pipeline. For this step, it is important to have a clear understanding on what is the implementation and orchestration effort, as well as what are the programming languages and tooling that are available to enable us to complete those efforts. </p>
<p>It is also important to understand some of the operational requirements, so we can choose the correct platform to help us deliver on those requirements. Additionally, this is the time to leverage the cloud resources we have provisioned to support an operational pipeline. But before we get deep into those concepts, let's review some background information: what exactly is a pipeline, and how can it be implemented and executed with orchestration? </p>
<p><img alt="ozkary-data-engineering-design-planning" height="578" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-pipeline-orchestration.png" title="Data Engineering Process Fundamentals - Pipeline and Orchestration" width="640" /></p>
<h2 id="data-pipelines">Data Pipelines</h2>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of Python code and hardware resources used by the hosting environment. For the ELT process, the transformation may be done using SQL code and the data warehouse resources, which often perform well for big data scenarios.</p>
<h3 id="pipeline-implementation">Pipeline Implementation</h3>
<p>The implementation of a pipeline refers to the building and/or coding of each task in the pipeline. A task can be implemented using a programming language like Python or SQL. It can also be implemented using a no-code or low-code tool, which provides a visual interface that allows the engineer to connect to Web services, databases, data lakes and other sources that provide access via API. The choice of technology depends on the skill set of the engineering team and a cost analysis of the tools under consideration. Let's compare some of these options in more detail:</p>
<ul>
<li><p>Python is a versatile programming language widely used in data engineering. It offers robust libraries and frameworks like Apache Airflow, Apache Beam, and Pandas that provide powerful capabilities for building and managing data pipelines. With Python, we have granular control over pipeline logic, allowing for complex transformations and custom data processing. It is ideal for handling diverse data sources and implementing advanced data integration scenarios. Even in some low-code scenarios, Python is used to build components that handle special transformations or logic that may not be available out of the box in the low-code tool.</p>
</li>
<li><p>SQL (Structured Query Language) is a standard language for interacting with relational databases. Many data pipeline frameworks, such as Apache NiFi and Azure Data Factory, offer SQL-based transformation capabilities. SQL allows for declarative and set-based operations, making it efficient for querying and transforming structured data. It is well-suited for scenarios where the data transformations align closely with SQL operations and can be expressed concisely.</p>
</li>
<li><p>Low-code tools, such as Azure Logic Apps, Power Platform Automate, provide a visual interface for designing and orchestrating data pipelines. They offer a no-code or low-code approach, making it easier for non-technical users to build pipelines with drag-and-drop functionality. These tools abstract the underlying complexity, enabling faster development and easier maintenance. Low-code tools are beneficial when simplicity, speed, and ease of use are prioritized over fine-grained control and advanced data processing capabilities.</p>
</li>
</ul>
<p>The choice between Python, SQL, or low-code tools depends on specific project requirements, team skills, and the complexity of the data processing tasks. Python offers flexibility and control, SQL excels in structured data scenarios, while low-code tools provide rapid development and simplicity.</p>
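<p>To make the code-centric option a bit more concrete, here is a small, hypothetical sketch (not tied to any specific project) of a Python task that reads a CSV file with pandas, applies a simple transformation, and writes the result as a compressed file, the kind of unit of work a pipeline framework would schedule and monitor:</p>
<pre><code class="lang-python">import pandas as pd

def transform_csv(source_url: str, target_path: str) -> bool:
    """Minimal example of a code-centric pipeline task."""
    try:
        # extract: read the source file into a data frame
        df = pd.read_csv(source_url)
        # transform: normalize the column names
        df = df.rename(columns=lambda col: col.strip().replace(' ', '_').lower())
        # load: write the result as a compressed CSV file
        df.to_csv(target_path, compression='gzip', index=False)
        return True
    except Exception as ex:
        print(f'transform_csv failed: {ex}')
        return False
</code></pre>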
<h3 id="pipeline-orchestration">Pipeline Orchestration</h3>
<p>Pipeline orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
<p>Similar to the implementation effort, there are several options for the orchestration approach. There are code-centric, low-code and no-code platforms. Let's take a look at some of those options.</p>
<h4 id="orchestration-tooling">Orchestration Tooling</h4>
<p>When it comes to orchestrating data pipelines, there are several options available. </p>
<ul>
<li><p>One popular choice is <a href="https://airflow.apache.org/">Apache Airflow</a>, an open-source platform that provides workflow automation, task scheduling, and monitoring capabilities. With Airflow, engineers can define complex workflows using Python code, allowing for flexibility and customization. Apache Airflow requires an active service or process to be running. It operates as a centralized service that manages and schedules workflows. A minimal DAG sketch is shown after this list.</p>
</li>
<li><p><a href="https://spark.apache.org/">Apache Spark</a> can be a good choice for batch processing tasks that involve calling APIs and downloading files using Python. Spark provides a distributed processing framework that can handle large-scale data processing and analysis efficiently. Spark provides a Python API (PySpark) that allows you to write Spark applications using Python. Spark runs as a distributed processing engine that provides high-performance data processing capabilities. To use Spark for data pipeline processing, we need to set up and run a Spark cluster or Spark standalone server.</p>
</li>
<li><p>For those who prefer a code-centric approach, frameworks like <a href="https://www.prefect.io/">Prefect</a> can be a good choice. Prefect is an open-source workflow management system that allows us to define and manage data pipelines as code. It provides a Python-native API for building workflows, allowing for version control, testing, and collaboration in addition to the monitoring and reporting capabilities. Prefect requires an agent to be running in order to execute scheduled jobs. The agent acts as the workflow engine that coordinates the execution of tasks and manages the scheduling and orchestration of workflows.</p>
</li>
<li><p>For low-code and no-code efforts, <a href="https://azure.microsoft.com/en-us/products/data-factory/">Azure Data Factory</a> is a cloud-based data integration service provided by Microsoft. It offers a visual interface for building and orchestrating data pipelines, making it suitable for users with less coding experience. Data Factory supports a wide range of data sources and provides features like data movement, transformation, and scheduling. It also integrates well with other Azure services, enabling seamless data integration within the Microsoft ecosystem.</p>
</li>
</ul>
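<p>To illustrate the code-centric orchestration style, the sketch below shows a minimal Apache Airflow DAG with two Python tasks chained together. The DAG id, schedule, and task logic are hypothetical and only meant to show the shape of a workflow defined as code:</p>
<pre><code class="lang-python">from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data():
    # call an API or download a file here
    print("extracting data")

def load_data():
    # write the results to the data lake here
    print("loading data")

with DAG(
    dag_id="web_to_data_lake",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 9 * * 6",  # run on Saturdays at 9am
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract_data", python_callable=extract_data)
    load_task = PythonOperator(task_id="load_data", python_callable=load_data)
    extract_task >> load_task
</code></pre>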
<p>When comparing these options, it's essential to consider factors like ease of use, scalability, extensibility, and integration with other tools and systems.</p>
<h4 id="orchestration-operations">Orchestration Operations</h4>
<p>In addition to the technical skill set requirements, there is an operational requirement that should be highly considered. Important aspects include automation and monitoring:</p>
<ul>
<li><p>Automation allows us to streamline and automate repetitive tasks, ensures consistent execution of tasks and workflows, thus eliminating the human factor, and enables us to scale up or down the data workflows based on demand. </p>
</li>
<li><p>Monitoring plays a critical role in identifying issues, errors, or bottlenecks in data pipelines. We can also gather insights into the performance of the data pipelines. This information helps identify areas for improvement, optimize resource utilization, and enhance overall pipeline efficiency. </p>
</li>
</ul>
<p>Automation and monitoring contribute to compliance and governance requirements. By tracking and documenting data lineage, monitoring data quality, and implementing data governance policies, engineers can ensure regulatory compliance and maintain data integrity and security.</p>
<h2 id="cloud-resources">Cloud Resources</h2>
<p>When it comes to cloud resources, there are often two components that play a significant role in this process: a Virtual Machine (VM) and the Data Lake.</p>
<p><img alt="ozkary-data-engineering-design-planning" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-orchestration-flow.png" title="Data Engineering Process Fundamentals - Orchestration Flow" /></p>
<ul>
<li><p>A Virtual Machine (VM) serves as the compute power for the pipelines. It is responsible for executing the pipeline workflows and managing the overall orchestration. It provides the computational resources needed to process and transform data, ensuring the smooth execution of data pipeline tasks. The code executed on this resource is often running on Docker containers, which enables the use of automated deployments when code changes become available. In addition, containers can be deployed on Kubernetes clusters to support high availability and automated management use cases.</p>
</li>
<li><p>A Data Lake acts as a central repository for storing vast amounts of raw and unprocessed data. It offers a scalable and cost-effective solution for capturing and storing data from various sources. The Data Lake allows for easy integration and flexibility to support evolving data requirements. There are also data retention policies that can be implemented to manage old files.</p>
</li>
</ul>
<p>Together, a VM and Data Lake are the backbone of the data pipeline infrastructure. They enable efficient data processing, facilitate data integration, and lay the foundation for seamless data analysis and visualization. By leveraging these components, we can stage the data flow into other resources like a data warehouse, which in turn enables the analytical process.</p>
<h2 id="summary">Summary</h2>
<p>A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake cloud resources, which we can also automate with Terraform. By selecting the appropriate programming language and orchestration tools, we can construct resilient pipelines capable of scaling and meeting evolving data demands effectively.</p>
<h2 id="exercise-data-pipeline-and-orchestration">Exercise - Data Pipeline and Orchestration</h2>
<p>Now that we understand the concepts of a pipeline and its orchestration, we should dive into a hands-on exercise in which we can implement a pipeline to extract CSV data from a source and send it to our data lake.</p>
<blockquote>
<p>π <a href="https://www.ozkary.com/2023/05/data-engineering-process-fundamentals-pipeline-orchestration-exercise.html">Data Engineering Process Fundamentals - Pipeline and Orchestration Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send question or comment at Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-6044111302674273452023-05-06T10:15:00.028-04:002023-06-29T15:49:54.455-04:00AI Engineering Generate Code from User Stories<h1 id="ai-engineering-introduction">Introduction</h1>
<p>A Large Language Model (LLM) is a class of AI models designed to understand and generate human-like text based on large amounts of training data. LLMs can play a significant role in generating code by leveraging their language understanding and generative capabilities. Users can simply create text prompts in a user story format and provide enough context, such as technologies, requirements, and technical specifications, and the model can generate code snippets that match what is requested by the prompt.</p>
<blockquote>
<p>π Note that LLM-generated code may not always be perfect, and developers should manually review and validate the generated code to ensure its correctness, efficiency, and adherence to coding standards.</p>
</blockquote>
<h2>Benefits of LLM for code generation</h2>
<p>When it comes to generating code, LLMs can be used in various ways:</p>
<ul>
<li>Code completion: LLMs can assist developers by suggesting code completions as they write code. By providing a partial code snippet or a description of the desired functionality, developers can prompt the LLM to generate the remaining code, saving time and reducing manual effort.</li>
<li>Code synthesis: LLMs can synthesize code based on high-level descriptions or requirements. Developers can provide natural language specifications or user stories, and the LLM can generate code that implements the desired functionality. This can be particularly useful in the early stages of development or when exploring different approaches to solve a problem.</li>
<li>Code refactoring: LLMs can help with code refactoring by suggesting improvements or alternative implementations. By analyzing existing code snippets or code bases, LLMs can generate refactored versions that follow best practices, optimize performance, or enhance readability.</li>
<li>Documentation generation: LLMs can assist in generating code documentation or comments. Developers can provide descriptions of functions, classes, or code blocks, and the LLM can generate the corresponding documentation or comments that describe the code's purpose and usage.</li>
<li>Code translation: LLMs can be utilized for translating code snippets between different programming languages. By providing a code snippet in one programming language, developers can prompt the LLM to generate the equivalent code in another programming language.</li>
</ul>
<h2 id="prompt-engineering">Prompt Engineering</h2>
<p>Prompt engineering is the process of designing and optimizing prompts to better utilize LLMs. Well-described prompts can help the AI models better understand the context and generate more accurate responses. It is also helpful to provide some labels or expected results as examples, as this helps the models evaluate their responses and provide more accurate results.</p>
<h3 id="improve-code-completeness">Improve Code Completeness</h3>
<p>Due to API configuration limits, such as the maximum token setting, the generated code may be incomplete. To improve the completeness of the generated code when using the API, you can experiment with the following (a minimal API call sketch follows this list):</p>
<ul>
<li>Simplify or shorten your prompt to ensure it fits within the token limit</li>
<li>Split your prompt into multiple API calls if it exceeds the response length limit</li>
<li>Try refining and iterating on your prompt to provide clearer instructions to the model</li>
<li>Experiment with different temperature and max tokens settings to influence the output</li>
</ul>
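<p>As a minimal sketch of how those settings are applied, the code below calls the OpenAI Completion API with explicit temperature and max_tokens values. The environment variable, engine/deployment name, prompt, and parameter values are only examples and should be adjusted to your own account; when using an Azure OpenAI resource, the API base, type, and version also need to be configured:</p>
<pre><code class="lang-python">import os
import openai

# assumes the API key is available as an environment variable
openai.api_key = os.environ.get("OPENAI_API_KEY")

def generate_code(prompt: str) -> str:
    # request a completion with explicit output controls
    response = openai.Completion.create(
        engine="code-davinci-002",   # model or Azure deployment name
        prompt=prompt,
        temperature=0.2,             # lower values produce more deterministic code
        max_tokens=1024,             # caps the length of the generated output
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response.choices[0].text

print(generate_code("Write a Python function that adds two numbers."))
</code></pre>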
<h2 id="generate-code-from-user-stories">Generate code from user stories</h2>
<p><img alt="ozkary-openai-generate-code-from-user-stories" src="//www.ozkary.dev/assets/2023/ozkary-openai-user-story-flow.png" title="Generate Code from User Stories" /></p>
<p>In the Agile development methodology, user stories are used to capture a requirement or feature from the perspective of the end user or customer. For code generation, developers can write user stories to capture the context, requirements, and technical specifications necessary to generate code. This user story can then be processed by the LLM to generate the code. As an example, a user story could be written in this way:</p>
<pre><code>
As <span class="hljs-keyword">a</span> data scientist, I want <span class="hljs-built_in">to</span> generate code <span class="hljs-keyword">using</span> <span class="hljs-keyword">the</span> following technologies, requirements, <span class="hljs-keyword">and</span> specifications:
Technologies:
- Python
Requirements:
- Transform <span class="hljs-keyword">a</span> data frame <span class="hljs-keyword">by</span> consolidating <span class="hljs-keyword">the</span> <span class="hljs-string">'date'</span> <span class="hljs-keyword">and</span> <span class="hljs-string">'time'</span> columns <span class="hljs-keyword">into</span> <span class="hljs-keyword">a</span> <span class="hljs-built_in">date</span> <span class="hljs-built_in">time</span> field column named <span class="hljs-string">'created'</span>.
Specifications:
- Create <span class="hljs-keyword">a</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">with</span> <span class="hljs-title">the</span> <span class="hljs-title">name</span> <span class="hljs-string">'transform_data'</span></span>
- Use pandas <span class="hljs-built_in">to</span> perform <span class="hljs-keyword">the</span> data transformation
- Load <span class="hljs-keyword">the</span> data <span class="hljs-built_in">from</span> <span class="hljs-keyword">a</span> parameter <span class="hljs-keyword">with</span> <span class="hljs-keyword">the</span> CSV <span class="hljs-built_in">file</span> path
- Save <span class="hljs-keyword">the</span> resulting data frame <span class="hljs-built_in">to</span> disk <span class="hljs-keyword">in</span> Parquet <span class="hljs-built_in">format</span>
- Return <span class="hljs-literal">true</span> <span class="hljs-keyword">if</span> successful <span class="hljs-keyword">or</span> <span class="hljs-literal">false</span> <span class="hljs-keyword">if</span> is <span class="hljs-keyword">not</span>
- For coding standards, use <span class="hljs-keyword">the</span> guidelines outlined <span class="hljs-keyword">in</span> <span class="hljs-keyword">the</span> [PEP <span class="hljs-number">8</span> style guide <span class="hljs-keyword">for</span> Python](<span class="hljs-keyword">https</span>://www.python.org/dev/peps/pep<span class="hljs-number">-0008</span>/).
- Create <span class="hljs-keyword">a</span> unit test <span class="hljs-keyword">for</span> <span class="hljs-keyword">the</span> `transform_data` <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">to</span> <span class="hljs-title">verify</span> <span class="hljs-title">its</span> <span class="hljs-title">correctness</span>. <span class="hljs-title">The</span> <span class="hljs-title">unit</span> <span class="hljs-title">test</span> <span class="hljs-title">should</span> <span class="hljs-title">cover</span> <span class="hljs-title">different</span> <span class="hljs-title">scenarios</span> <span class="hljs-title">and</span> <span class="hljs-title">assert</span> <span class="hljs-title">the</span> <span class="hljs-title">expected</span> <span class="hljs-title">behavior</span> <span class="hljs-title">of</span> <span class="hljs-title">the</span> <span class="hljs-title">function</span>.</span>
</code></pre><p>From this user story, we can see that the end user of the story is a data scientist who would like to generate Python code. The requirement is basically to transform two columns from a data frame into a single date-time column. There are also additional technical specifications that provide more context to the AI model, so it is able to generate the code with the specifics, including which coding standards to use.</p>
<p>Let's take a look at a Python example to see how we can generate code from a user story:</p>
<h3 id="install-the-openai-dependencies">Install the OpenAI dependencies</h3>
<pre><code>$ pip <span class="hljs-keyword">install</span> openai
</code></pre><h3 id="add-the-openai-environment-configurations">Add the OpenAI environment configurations</h3>
<p>Get the following configuration information:</p>
<blockquote>
<p>π This example can run directly on the OpenAI APIs or it can use a custom Azure OpenAI resource.</p>
</blockquote>
<ul>
<li>GitHub Repo API Token with write permissions to push comments to an issue</li>
<li>Get an OpenAI API key</li>
<li>If you are using an Azure OpenAI resource, get your custom end-point and deployment<ul>
<li>The deployment should have the code-davinci-002 model</li>
</ul>
</li>
</ul>
<p>Set the linux environment variables with these commands:</p>
<pre><code class="lang-bash">$ echo export AZURE_OpenAI_KEY=<span class="hljs-string">"OpenAI-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export GITHUB_TOKEN=<span class="hljs-string">"github-key-here"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
<span class="hljs-comment"># only set these variables when using your Azure OpenAI resource</span>
$ echo export AZURE_OpenAI_DEPLOYMENT=<span class="hljs-string">"deployment-name"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
$ echo export AZURE_OpenAI_ENDPOINT=<span class="hljs-string">"https://YOUR-END-POINT.OpenAI.azure.com/"</span> <span class="hljs-meta">>> </span>~<span class="hljs-regexp">/.bashrc && source ~/</span>.bashrc
</code></pre><h3 id="describe-the-code">Describe the code</h3>
<p> The code should run this workflow: (see the diagram for a visual reference)</p>
<ul>
<li>1 Get a list of open GitHub issues with the label user-story</li>
<li>2 Each issue content is sent to the OpenAI API to generate the code</li>
<li>3 The generated code is posted as a comment on the user-story for the developers to review</li>
</ul>
<blockquote>
<p>π The following code is a simple implementation for the GitHub and OpenAI APIs. Use the code from this GitHub repo: <a title="ozkary LLM code generation" href="https://github.com/ozkary/ai-engineering/tree/main/python/code_generation">LLM Code Generation</a></p>
</blockquote>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_issue_by_label</span><span class="hljs-params">(repo: str, label: str)</span>:</span>
<span class="hljs-keyword">try</span>:
<span class="hljs-comment"># get the issues from the repo</span>
issues = GitHubService.get_issues(repo=repo, label=label, access_token=github_token)
<span class="hljs-keyword">if</span> issues <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
<span class="hljs-keyword">for</span> issue <span class="hljs-keyword">in</span> issues:
<span class="hljs-comment"># Generate code using OpenAI </span>
print(f<span class="hljs-string">'Generating code from GitHub issue: {issue.title}'</span>)
openai_service = OpenAIService(api_key=openai_api_key, engine=openai_api_deployment, end_point=openai_api_base)
generated_code = openai_service.create(issue.body)
<span class="hljs-keyword">if</span> generated_code <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">None</span>:
<span class="hljs-comment"># Post a comment with the generated code to the GitHub issue</span>
comment = f<span class="hljs-string">'Generated code:\n\n```{generated_code}\n```'</span>
comment_posted = GitHubService.post_issue_comment(repo, issue.id, comment, access_token=github_token)
<span class="hljs-keyword">if</span> comment_posted:
print(<span class="hljs-string">'Code generated and posted as a comment on the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to post the comment on the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to generate code from the GitHub issue.'</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">'Failed to retrieve the GitHub issue.'</span>)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error: {ex}"</span>)
</code></pre><p>The OpenAI service handles the API details. It takes default parameters for the model (engine), temperature, and token limits, which control the cost and the amount of text allowed (a token is roughly four characters). For this service, we use the "Completion" endpoint, which allows developers to interact with OpenAI's language models and generate text-based completions.</p>
<p>Other parameters include (a short example call follows the list):</p>
<ul>
<li><p>Temperature: Controls the randomness of the model's output. Higher values like 0.8 make the output more diverse and creative, while lower values like 0.2 make it more focused and deterministic.</p>
</li>
<li><p>Max Tokens: Specifies the maximum length of the response generated by the model, measured in tokens. It helps to limit the length of the output and prevent excessively long responses.</p>
</li>
<li><p>Top-p (Nucleus Sampling): Limits the next-token choices to the smallest set of tokens whose cumulative probability reaches the threshold p. It helps control the diversity of the generated output.</p>
</li>
<li><p>Frequency Penalty: A parameter used to discourage repetitive or redundant responses by penalizing the model for repeating the same tokens too often.</p>
</li>
<li><p>Presence Penalty: Penalizes tokens that have already appeared in the output, encouraging the model to introduce new words and topics. A higher value, such as 0.6, reduces repetition in the response.</p>
</li>
<li><p>Stop Sequences: Optional tokens or phrases that can be specified to guide the model to stop generating output. It can be used to control the length of the response or prevent the model from continuing beyond a certain point.</p>
</li>
</ul>
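<p>As a point of reference, this is how those parameters might be passed to the legacy Completion endpoint. The values are illustrative only; the service class shown next wraps this same call with defaults.</p>
<pre><code class="lang-python">import openai

# illustrative values; tune per use case
response = openai.Completion.create(
    engine='code-davinci-002',    # model or Azure deployment name
    prompt='Write a Python function that adds two numbers.',
    temperature=0.2,              # low randomness works well for code generation
    max_tokens=350,               # cap the length of the response
    top_p=0.95,                   # nucleus sampling threshold
    frequency_penalty=0.0,        # discourage repeated tokens
    presence_penalty=0.0,         # encourage new tokens and topics
    stop=None                     # optional stop sequences
)
print(response.choices[0].text.strip())
</code></pre>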
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OpenAIService</span>:</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, <span class="hljs-symbol">api_key:</span> str, <span class="hljs-symbol">engine:</span> str = <span class="hljs-string">'code-davinci-002'</span>, <span class="hljs-symbol">end_point:</span> str = None, <span class="hljs-symbol">temperature:</span> float = <span class="hljs-number">0</span>.<span class="hljs-number">5</span>, <span class="hljs-symbol">max_tokens:</span> int = <span class="hljs-number">350</span>, <span class="hljs-symbol">n:</span> int = <span class="hljs-number">1</span>, <span class="hljs-symbol">stop:</span> str = None)</span></span>:
openai.api_key = api_key
<span class="hljs-comment"># Azure OpenAI API custom resource</span>
<span class="hljs-comment"># Use these settings only when using a custom endpoint like https://ozkary.openai.azure.com </span>
<span class="hljs-keyword">if</span> end_point is <span class="hljs-keyword">not</span> <span class="hljs-symbol">None:</span>
openai.api_base = end_point
openai.api_type = <span class="hljs-string">'azure'</span>
openai.api_version = <span class="hljs-string">'2023-05-15'</span> <span class="hljs-comment"># this will change as the API evolves</span>
<span class="hljs-keyword">self</span>.engine = engine
<span class="hljs-keyword">self</span>.temperature = temperature
<span class="hljs-keyword">self</span>.max_tokens = max_tokens
<span class="hljs-keyword">self</span>.n = n
<span class="hljs-keyword">self</span>.stop = stop
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">create</span><span class="hljs-params">(<span class="hljs-keyword">self</span>, <span class="hljs-symbol">prompt:</span> str)</span></span> -> <span class="hljs-symbol">str:</span>
<span class="hljs-string">""</span><span class="hljs-string">"Create a completion using the OpenAI API."</span><span class="hljs-string">""</span>
response = openai.Completion.create(
engine=<span class="hljs-keyword">self</span>.engine,
prompt=prompt,
max_tokens=<span class="hljs-keyword">self</span>.max_tokens,
n=<span class="hljs-keyword">self</span>.n,
stop=<span class="hljs-keyword">self</span>.stop,
temperature=<span class="hljs-keyword">self</span>.temperature
)
print(response)
<span class="hljs-keyword">return</span> response.choices[<span class="hljs-number">0</span>].text.strip()
</code></pre><h3 id="run-the-code">Run the code</h3>
<p>After configuring your environment and downloading the code, we can run the code from a terminal by typing the following command:</p>
<blockquote>
<p>π Make sure to enter your repo name and label your issues with either user-story or any other label you would rather use.</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-comment">#</span> <span class="hljs-comment">python3</span> <span class="hljs-comment">gen_code_from_issue</span><span class="hljs-string">.</span><span class="hljs-comment">py</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">repo</span> <span class="hljs-comment">ozkary/ai</span><span class="hljs-literal">-</span><span class="hljs-comment">engineering</span> <span class="hljs-literal">-</span><span class="hljs-literal">-</span><span class="hljs-comment">label</span> <span class="hljs-comment">user</span><span class="hljs-literal">-</span><span class="hljs-comment">story</span>
</code></pre><p>After running the code successfully, we should be able to see the generated code as a comment on the GitHub issue.</p>
<p><img alt="ozkary-ai-engineering-generate-code-from-user-stories" src="//www.ozkary.dev/assets/2023/ozkary-ai-engineering-code-generated.png" title="Generate Code from User Stories Back to GitHub" /></p>
<h3 id="summary">Summary</h3>
<p>LLMs play a crucial role in code generation by harnessing their language understanding and generative capabilities. Developers, data engineers, and scientists can use AI models to quickly generate scripts in various programming languages, streamlining their programming tasks. Moreover, documenting user stories with detailed specifications is an integral part of the brainstorming process before writing a single line of code.</p>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-26183705534802228482023-04-22T12:40:00.007-04:002023-07-21T11:43:43.405-04:00Data Engineering Process Fundamentals - Design and Planning Exercise<p>Having laid a strong design foundation, it's time to embark on a hands-on exercise that's crucial to our data engineering project's success. Our immediate focus is on building the essential cloud resources that will serve as the backbone for our data pipelines, data lake, and data warehouse. Taking a cloud-agnostic approach ensures our implementation remains flexible and adaptable across different cloud providers, enabling us to leverage the advantages of multiple platforms or switch providers seamlessly if required. By completing this step, we set the stage for efficient and effective coding of our solutions. Let's get started on this vital infrastructure-building journey.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" alt="ozkary-data-engineering-design-planning-docker-terraform" title="Data Engineering Process Fundamentals- Design and Planning Docker Terraform"></p>
<h2 id="cloud-infrastructure-planning">Cloud Infrastructure Planning</h2>
<p>Infrastructure planning is a critical aspect of every technical project, laying the foundation for successful project delivery. In the case of a Data Engineering project, it becomes even more crucial. To support our project's objectives, we need to carefully consider and provision specific resources:</p>
<ul>
<li>VM instance: This serves as the backbone for hosting our data pipelines and orchestration, ensuring efficient execution of our data workflows.</li>
<li>Data Lake: A vital component for storing various data formats, such as CSV or Parquet files, in a scalable and flexible manner.</li>
<li>Data Warehouse: An essential resource that hosts transformed and curated data, providing a high-performance environment for easy access by visualization tools.</li>
</ul>
<p>Infrastructure automation, facilitated by tools like Terraform, is important in modern data engineering projects. It enables the provisioning and management of cloud resources, such as virtual machines and storage, in a consistent and reproducible manner. Infrastructure as Code (IaC) allows teams to define their infrastructure declaratively, track it in source control, version it, and apply changes as needed. Automation reduces manual efforts, ensures consistency, and enables infrastructure to be treated as a code artifact, improving reliability and scalability.</p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-terraform.png" alt="ozkary-data-engineering-terraform" title="Data Engineering Process - Terraform"></p>
<h1 id="infrastructure-implementation-requirements">Infrastructure Implementation Requirements</h1>
<p>When using Terraform with any cloud provider, there are several key artifacts and considerations to keep in mind for successful configuration and security. Terraform needs access to the cloud account where it can provision the resources. The account token or credentials can vary based on your cloud configuration. For our purpose, we will use a GCP (Google) project to build our resources, but first we need to install the Terraform dependencies for the development environment.</p>
<h2 id="install-terraform">Install Terraform</h2>
<p>To install Terraform, open a bash terminal and run the commands below:</p>
<ul>
<li>Download the package file</li>
<li>Unzip the file</li>
<li>Copy the Terraform binary from the extracted file to the /usr/bin/ directory</li>
<li>Verify the version</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>wget <span class="hljs-symbol">https:</span>/<span class="hljs-regexp">/releases.hashicorp.com/terraform</span><span class="hljs-regexp">/1.3.7/terraform</span>_1.<span class="hljs-number">3.7_</span>linux_amd64.zip
<span class="hljs-variable">$ </span>unzip terraform_1.<span class="hljs-number">1.2_</span>linux_amd64.zip
<span class="hljs-variable">$ </span>cp terraform /usr/bin/
<span class="hljs-variable">$ </span>terraform -v
</code></pre>
<p>We should get an output similar to this:</p>
<pre><code class="lang-bash">Terraform v1<span class="hljs-number">.3</span><span class="hljs-number">.7</span>
<span class="hljs-keyword">on</span> linux_amd64
</code></pre>
<h2 id="configure-a-cloud-account">Configure a Cloud Account</h2>
<h3 id="create-a-google-account-here-https-cloud-google-com-">Create a Google account. <a href="https://cloud.google.com/">Here</a></h3>
<ul>
<li>Create a new project</li>
<li>Make sure to keep track of the Project ID and the location for your project</li>
</ul>
<h3 id="create-a-service-account">Create a service account</h3>
<ul>
<li>In the left-hand menu, click on "IAM & Admin" and then select "Service accounts"</li>
<li>Click on the "Create Service Account" button at the top of the page</li>
<li>Enter a name for the service account and an optional description</li>
<li>Add the BigQuery Admin, Storage Admin, and Storage Object Admin roles to the service account, then click the save button.</li>
</ul>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-gcp-roles.png" alt="ozkary gcp roles"></p>
<ul>
<li>Enable IAM APIs by clicking the following links:<ul>
<li><a href="https://console.cloud.google.com/apis/library/iam.googleapis.com">IAM-API</a></li>
<li><a href="https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com">IAM-credential-API</a></li>
</ul>
</li>
</ul>
<h3 id="authenticate-the-vm-or-local-environment-with-gcp">Authenticate the VM or Local environment with GCP</h3>
<ul>
<li>In the left navigation menu (GCP), click on "IAM & Admin" and then "Service accounts"</li>
<li>Click on the three vertical dots under the action section for the service account you just created</li>
<li>Then click Manage keys, Add key, Create new key. Select JSON option and click Create</li>
<li>Move the key file to a system folder</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>mkdir -p <span class="hljs-variable">$HOME</span>/.gcp/
<span class="hljs-variable">$ </span>mv ~<span class="hljs-regexp">/Downloads/</span>{xxxxxx}.json ~<span class="hljs-regexp">/.gcp/</span>{acc_credentials}.json
</code></pre>
<ul>
<li>install the SDK and CLI Tools<ul>
<li><a href="https://cloud.google.com/sdk/docs/install-sdk" target="_new">Follow the instruction here</a></li>
</ul>
</li>
<li>Validate the installation and login to GCP with the following commands<pre><code>$ echo 'export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/{acc_credentials}.json"' >> ~/.bashrc
$ export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.gcp/{acc_credentials}.json"
$ gcloud auth application-<span class="hljs-keyword">default</span> login
</code></pre></li>
<li>Follow the login process on the browser; a quick Python check of the credentials is shown below</li>
</ul>
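<p>With the credentials in place, a quick way to confirm that the service account and roles work is to list the project's storage buckets from Python. This assumes the google-cloud-storage package is installed; replace the project id with yours.</p>
<pre><code class="lang-python">from google.cloud import storage

# GOOGLE_APPLICATION_CREDENTIALS must point to the service account key file
client = storage.Client(project='your-gcp-project-id')

# an empty list is fine on a new project; an authorization error means the key or roles need review
for bucket in client.list_buckets():
    print(bucket.name)
</code></pre>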
<h1 id="review-the-code">Review the Code</h1>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step2-Cloud-Infrastructure/terraform" target="_terraform">Clone this repo or copy the files from this folder
</a></p>
<p>Terraform uses declarative configuration files written in a domain-specific language (DSL) called HCL (HashiCorp Configuration Language). It provides a concise and human-readable syntax for defining resources, dependencies, and configurations, enabling us to provision, modify, and destroy infrastructure in a predictable and reproducible manner.</p>
<p>At a minimum, we should define a variables file, which contains the cloud provider information, and a resource file, which defines what resources should be provisioned in the cloud. There could be a file for each resource, or a single file can define multiple resources.</p>
<h2 id="variables-file">Variables File</h2>
<p>The variables file is used to define a set of variables that can be used in the resource file. This allows for the use of those variables in one or more resource configuration files. The format looks as follows:</p>
<pre><code class="lang-terraform">locals {
data_lake_bucket = <span class="hljs-string">"mta_data_lake"</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-string">"project"</span> {
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Enter Your GCP Project ID"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"region"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"us-east1"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"storage_class"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Storage class type for your bucket. Check official docs for more info."</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"STANDARD"</span>
type <span class="hljs-comment">= string</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"stg_dataset"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"BigQuery Dataset that raw data (from GCS) will be written to"</span>
type <span class="hljs-comment">= string</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"mta_data"</span>
}
<span class="hljs-keyword">variable</span> <span class="hljs-comment">"vm_image"</span><span class="hljs-comment"> {</span>
description <span class="hljs-comment">=</span> <span class="hljs-comment">"Base image for your Virtual Machine."</span>
type <span class="hljs-comment">= string</span>
default <span class="hljs-comment">=</span> <span class="hljs-comment">"ubuntu-os-cloud/ubuntu-2004-lts"</span>
}
</code></pre>
<h2 id="resource-file">Resource File</h2>
<p>The resource file is where we define what should be provisioned on the cloud. This is also the file where we need to define the provider specific resources that need to be created.</p>
<pre><code class="lang-terraform">terraform {
<span class="hljs-attr">required_version</span> = <span class="hljs-string">">= 1.0"</span>
backend <span class="hljs-string">"local"</span> {} <span class="hljs-comment"># Can change from "local" to "gcs" (for google) or "s3" (for aws), if you would like to preserve your tf-state online</span>
required_providers {
<span class="hljs-attr">google</span> = {
<span class="hljs-attr">source</span> = <span class="hljs-string">"hashicorp/google"</span>
}
}
}
provider <span class="hljs-string">"google"</span> {
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">region</span> = var.region
// <span class="hljs-attr">credentials</span> = file(var.credentials) <span class="hljs-comment"># Use this if you do not want to set env-var GOOGLE_APPLICATION_CREDENTIALS</span>
}
<span class="hljs-comment"># Data Lake Bucket</span>
<span class="hljs-comment"># Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket</span>
resource <span class="hljs-string">"google_storage_bucket"</span> <span class="hljs-string">"data-lake-bucket"</span> {
<span class="hljs-attr">name</span> = <span class="hljs-string">"<span class="hljs-subst">${local.data_lake_bucket}</span>_<span class="hljs-subst">${var.project}</span>"</span> <span class="hljs-comment"># Concatenating DL bucket & Project name for unique naming</span>
<span class="hljs-attr">location</span> = var.region
<span class="hljs-comment"># Optional, but recommended settings:</span>
<span class="hljs-attr">storage_class</span> = var.storage_class
<span class="hljs-attr">uniform_bucket_level_access</span> = <span class="hljs-literal">true</span>
versioning {
<span class="hljs-attr">enabled</span> = <span class="hljs-literal">true</span>
}
lifecycle_rule {
action {
<span class="hljs-attr">type</span> = <span class="hljs-string">"Delete"</span>
}
condition {
<span class="hljs-attr">age</span> = <span class="hljs-number">15</span> // days
}
}
<span class="hljs-attr">force_destroy</span> = <span class="hljs-literal">true</span>
}
<span class="hljs-comment"># BigQuery data warehouse</span>
<span class="hljs-comment"># Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset</span>
resource <span class="hljs-string">"google_bigquery_dataset"</span> <span class="hljs-string">"stg_dataset"</span> {
<span class="hljs-attr">dataset_id</span> = var.stg_dataset
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">location</span> = var.region
}
<span class="hljs-comment"># VM instance</span>
resource <span class="hljs-string">"google_compute_instance"</span> <span class="hljs-string">"vm_instance"</span> {
<span class="hljs-attr">name</span> = <span class="hljs-string">"mta-instance"</span>
<span class="hljs-attr">project</span> = var.project
<span class="hljs-attr">machine_type</span> = <span class="hljs-string">"e2-standard-4"</span>
<span class="hljs-attr">zone</span> = var.region
boot_disk {
initialize_params {
<span class="hljs-attr">image</span> = var.vm_image
}
}
network_interface {
<span class="hljs-attr">network</span> = <span class="hljs-string">"default"</span>
access_config {
// Ephemeral public IP
}
}
}
</code></pre>
<p>This Terraform file defines the infrastructure components to be provisioned on the Google Cloud Platform (GCP). It includes the configuration for the Terraform backend, required providers, and the resources to be created.</p>
<ul>
<li>The backend section specifies the backend type as "local" for storing the Terraform state locally. It can be changed to "gcs" or "s3" for cloud storage if desired.</li>
<li>The required_providers block defines the required provider and its source, in this case, the Google Cloud provider.</li>
<li>The provider block configures the Google provider with the project and region specified as variables. It can use either environment variable GOOGLE_APPLICATION_CREDENTIALS or the credentials file defined in the variables.</li>
<li>The resource blocks define the infrastructure resources to be created, such as a Google Storage Bucket for the data lake, Google BigQuery datasets for staging and production, and a Google Compute Engine instance named "mta-instance" with specific configuration settings.</li>
</ul>
<p>Overall, this Terraform file automates the provisioning of a data lake bucket, BigQuery datasets, and a virtual machine instance on the Google Cloud Platform.</p>
<h1 id="how-to-run-it-">How to run it!</h1>
<ul>
<li>Refresh service-account's auth-token for this session</li>
</ul>
<pre><code class="lang-bash">$ gcloud auth application-<span class="hljs-keyword">default</span> login
</code></pre>
<ul>
<li>Set the credentials file on the bash configuration file<ul>
<li>Add the export line and replace filename-here with your file</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash">$ <span class="hljs-built_in">echo</span> <span class="hljs-built_in">export</span> GOOGLE_APPLICATION_CREDENTIALS=<span class="hljs-string">"<span class="hljs-variable">${HOME}</span>/.gcp/filename-here.json"</span> >> ~/.bashrc && <span class="hljs-built_in">source</span> ~/.bashrc
</code></pre>
<ul>
<li><p>Open the terraform folder in your project</p>
</li>
<li><p>Initialize state file (.tfstate) by running terraform init</p>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>cd ./terraform
<span class="hljs-variable">$ </span>terraform init
</code></pre>
<ul>
<li>Check changes to new infrastructure plan before applying them</li>
</ul>
<p>It is important to always review the plan to make sure no unwanted changes are showing up.</p>
<blockquote>
<p>π Get the project id from your GCP cloud console and replace it on the next command</p>
</blockquote>
<pre><code class="lang-bash"><span class="hljs-string">$ </span>terraform plan -var=<span class="hljs-comment">"project=<your-gcp-project-id>"</span>
</code></pre>
<ul>
<li>Apply the changes </li>
</ul>
<p>This provisions the resources on the cloud project.</p>
<pre><code class="lang-bash">$ terraform <span class="hljs-built_in">apply</span> -<span class="hljs-built_in">var</span>=<span class="hljs-string">"project=<your-gcp-project-id>"</span>
</code></pre>
<ul>
<li>(Optional) Delete infrastructure after your work, to avoid costs on any running services</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>terraform destroy
</code></pre>
<h4 id="terraform-lifecycle">Terraform Lifecycle</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-Engineering-terraform-lifecycle.png" alt="ozkary-data-engineering-terraform-lifecycle" title="Data Engineering Process - Terraform Lifecycle"></p>
<h3 id="github-action">GitHub Action</h3>
<p>To automate the provisioning of infrastructure with GitHub, we need to store the cloud provider credentials as a GitHub secret. This can be done by following the steps from this link:</p>
<blockquote>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/GitHub-Configure-Secrets-for-Build-Actions">Configure GitHub Secrets</a></p>
</blockquote>
<p>Once the secret has been configured, we can create a build action script with the cloud provider secret information as shown with this GitHub Action workflow YAML file:</p>
<pre><code class="lang-yml">
<span class="hljs-attribute">name</span>: Terraform Deployment
<span class="yaml"><span class="hljs-attr">on:</span>
<span class="hljs-attr"> push:</span>
<span class="hljs-attr"> branches:</span>
<span class="hljs-bullet"> -</span> main
<span class="hljs-attr">jobs:</span>
<span class="hljs-attr"> deploy:</span>
<span class="hljs-attr"> runs-on:</span> ubuntu-latest
<span class="hljs-attr"> steps:</span>
<span class="hljs-attr"> - name:</span> Checkout repository
<span class="hljs-attr"> uses:</span> actions/checkout@v2
<span class="hljs-attr"> - name:</span> Set up Terraform
<span class="hljs-attr"> uses:</span> hashicorp/setup-terraform@v1
<span class="hljs-attr"> - name:</span> Terraform Init
<span class="hljs-attr"> env:</span>
<span class="hljs-attr"> GOOGLE_APPLICATION_CREDENTIALS:</span> ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
cd Step2-Cloud-Infrastructure/terraform
terraform init
</span><span class="hljs-attr"> - name:</span> Terraform Apply
<span class="hljs-attr"> run:</span> <span class="hljs-string">|
cd Step2-Cloud-Infrastructure/terraform
terraform apply -auto-approve</span></span>
</code></pre>
<h1 id="conclusion">Conclusion</h1>
<p>With this exercise, we gain practical experience in using tools like Terraform to automate the provisioning of resources, such as a VM, a data lake, and other components essential to our data engineering system. By following cloud-agnostic practices, we can achieve interoperability and avoid vendor lock-in, ensuring our project remains scalable, cost-effective, and adaptable to future requirements.</p>
<h1 id="next-step">Next Step</h1>
<p>After building our cloud infrastructure, we are now ready to talk about the implementation and orchestration of a data pipeline and review some of the operational requirements that can enable us to make decisions.</p>
<p>Coming Soon!</p>
<blockquote>
<p>π [Data Engineering Process Fundamentals - Data Pipeline and Orchestration]</p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-10610907557039081432023-04-15T11:17:00.026-04:002023-07-21T11:36:03.815-04:00Data Engineering Process Fundamentals - Design and Planning<p>Now that we have completed the discovery step and the scope of work on the project is clearly defined, we move on to the design and planning step. The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.</p>
<p><img alt="ozkary-data-engineering-design-planning" height="640" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-design-planning.png" title="Data Engineering Process Fundamentals- Design and Planning" width="576" /></p>
<h3 id="data-engineering-design">Data Engineering Design</h3>
<p>A data engineering design is the actual plan to build the technical solution. It includes the system architecture, data integration, flow and pipeline orchestration, the data storage platform, transformation and management, and data processing and analytics tooling. This is where we need to clearly define the technologies to be used for each of these areas.</p>
<h4 id="system-architecture">System Architecture</h4>
<p>The system architecture is a critical high-level design that encompasses various components and their integration within the solution. This includes data sources, data ingestion resources, workflow and data orchestration frameworks, storage resources, data services for transformation, continuous data ingestion, and validation, as well as data analysis and visualization tools. Properly designing the system architecture ensures a robust and efficient data engineering solution.</p>
<h4 id="data-pipelines">Data Pipelines</h4>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. Properly designed pipelines ensure data integrity, quality, and consistency throughout the system.</p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of Python code and the hardware resources used by the hosting environment. In the ELT process, the transformation can be done with SQL code using the data warehouse resources, which often perform well for big data scenarios.</p>
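<p>To make the trade-off concrete, here is a minimal sketch of both patterns. The file, table, and column names are illustrative, and writing Parquet assumes the pyarrow package is available.</p>
<pre><code class="lang-python">import pandas as pd

# ETL: transform in the pipeline code before loading the data into storage
df = pd.read_csv('turnstile_230318.csv')
df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
df.to_parquet('turnstile_230318.parquet')  # load the transformed data

# ELT: load the raw file first, then transform with SQL inside the data warehouse
elt_sql = """
CREATE TABLE rpt_turnstile AS
SELECT STATION,
       CAST(CONCAT(DATE, ' ', TIME) AS TIMESTAMP) AS CREATED,
       ENTRIES,
       EXITS
FROM ext_turnstile;  -- exact date functions depend on the warehouse
"""
</code></pre>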
<h4 id="data-orchestration">Data Orchestration</h4>
<p>Data orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
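<p>As an illustration, an orchestration framework such as Prefect expresses these concerns declaratively. This minimal sketch uses Prefect 2.x syntax to show a named flow with task retries; it is not the project's actual pipeline.</p>
<pre><code class="lang-python">from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract(url: str) -> str:
    # download the weekly CSV file; the task is retried automatically on failure
    ...

@task
def load(file_path: str) -> None:
    # upload the file to the data lake
    ...

@flow(name="mta-weekly-ingest")
def weekly_ingest(url: str) -> None:
    # the orchestrator tracks the state, logs, and failures of each task run
    file_path = extract(url)
    load(file_path)
</code></pre>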
<h3 id="source-control-and-deployment">Source Control and Deployment</h3>
<p>Source control is an essential practice for managing code and configuration files. Utilizing version control systems allows teams to collaborate effectively, track changes, and revert to previous states if necessary. It is important to properly define the tooling that should be used for source control and deployment automation. Source code should include the Python code, Terraform scripts, and Docker files, as well as any deployment automation scripts.</p>
<h4 id="source-control">Source Control</h4>
<p>Version control systems such as Git help track and maintain the source code for our projects. Hosted platforms such as GitHub enable the team to push their code and configuration changes to a remote repository, so the work can be shared with other team members.</p>
<h4 id="continuous-integration-and-continuous-delivery-ci-cd-">Continuous Integration and Continuous Delivery (CI/CD)</h4>
<p> A remote code repository like GitHub also provides deployment automation pipelines that enable us to push changes to other environments for testing and finally production releases. </p>
<p>Continuous Integration (CI) is the practice to continuously integrate the code changes into the central repository, so it can be built and tested to validate the latest change and provide feedback in case of problems. Continuous Deployment (CD) is the practice to automate the deployment of the latest code builds to other environments like staging and production.</p>
<h4 id="docker-containers-and-docker-hub">Docker Containers and Docker Hub</h4>
<p>When deploying a new build, we need to also deploy the environment dependencies to avoid any run-time errors. Docker containers enable us to automate the management of the application by creating a self-contained environment with the build and its dependencies. A data pipeline can be built and imported into a container image, which should contain everything needed for the pipeline to reliably execute.</p>
<p>Docker Hub is a container registry which allows us to push our pipeline images into a cloud repository. The goal is to provide the ability to download those images from the repository as part of the new environment provisioning process.</p>
<h4 id="terraform-for-cloud-infrastructure">Terraform for Cloud Infrastructure</h4>
<p>Terraform is an Infrastructure as Code (IaC) tool that enables us to manage cloud resources across multiple cloud providers. By creating resource definition scripts and tracking them under version control, we can automate the creation, modification and deletion of resources. Terraform tracks the state of the infrastructure, so when changes are made, they can be applied to the environments as part of a CI/CD process. </p>
<p><img alt="ozkary-data-engineering-design-planning-docker-terraform" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-design-terraform-docker.png" title="Data Engineering Process Fundamentals- Design and Planning Docker Terraform" /></p>
<h2 id="data-analysis-and-visualization-tools">Data Analysis and Visualization Tools</h2>
<p>The selection of an analytical and visualization tool is very important in any data engineering project. Tools like Looker and Power BI, among others, enable businesses to gain insights from their data by visualizing the information on easy-to-read dashboards. By selecting the right tool for the job, organizations can transform complex data into actionable insights, empowering users across the business to uncover valuable information and drive strategic outcomes.</p>
<h2 id="summary">Summary</h2>
<p>The design and planning phase of a data engineering project sets the stage for success. From designing the system architecture and data pipelines to implementing source control, CI/CD, Docker, and infrastructure automation with Terraform, every aspect contributes to efficient and reliable deployment. Infrastructure automation, in particular, plays a critical role by simplifying provisioning of cloud resources, ensuring consistency, and enabling scalability, ultimately leading to a robust and manageable data engineering system. </p>
<h2 id="exercise-infrastructure-planning-and-automation">Exercise - Infrastructure Planning and Automation</h2>
<p>Having established a solid design foundation, it's time to put that into practice with a hands-on exercise. Our objective is to construct the necessary infrastructure that will serve as the hosting environment for our solution. Let's delve into the practical implementation to bring our data engineering project to life.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-design-planning-exercise.html">Data Engineering Process Fundamentals - Design and Planning Exercise</a></p>
</blockquote>
<p>Thanks for reading.</p>
Send questions or comments on Twitter @ozkary
<p>π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-67262915856193500442023-04-08T15:31:00.060-04:002023-07-14T14:57:01.838-04:00Data Engineering Process Fundamentals - Discovery Exercise<p>In this discovery exercise lab, we review a problem statement and do the analysis to define the scope of work and requirements.</p>
<div class="separator" style="clear: both; display: none; text-align: center;"><a href="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-vscode.png" style="margin-left: 1em; margin-right: 1em;"><img alt="ozkary Data engineering MTA jupyter notebook loaded" border="0" data-original-height="551" data-original-width="800" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-notepbook.png" title="ozkary Data engineering MTA jupyter notebook loaded with vscode" /></a></div>
<h2 id="problem-statement">Problem Statement</h2>
<p>In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. There are millions of people that use this system every day; therefore, businesses around the subway stations would like to be able to use Geofencing advertisement to target those commuters or possible consumers and attract them to their business locations at peak hours of the day.</p>
<p>Geofencing is a location-based technology service in which a mobile device's electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) around a geographical location. Businesses around those locations would like to use this technology to increase their sales.</p>
<p><img alt="ozkary-data-engineering-mta-geo-fence" src="https://github.com/ozkary/data-engineering-mta-turnstile/raw/main/images/mta-geo-fencing.png" title="Data Engineering Process - Problem Statement" /></p>
<p>The MTA subway system has stations around the city. All the stations are equipped with turnstiles or gates, which track each person entering or leaving the station. MTA provides this information in CSV files, which can be imported into a data warehouse, enabling an analytical process that identifies patterns and helps these businesses understand how to best target consumers.</p>
<h2 id="analytical-approach">Analytical Approach</h2>
<h3 id="dataset-criteria">Dataset Criteria</h3>
<p>We are using the MTA Turnstile data for 2023. Using this data, we can investigate the following criteria:</p>
<ul>
<li>Stations with a high number of exits by day and hour</li>
<li>Stations with a high number of entries by day and hour</li>
</ul>
<p>Exits indicate that commuters are arriving at those locations. Entries indicate that commuters are departing from those locations.</p>
<h3 id="data-analysis-criteria">Data Analysis Criteria</h3>
<p>The data can be grouped by station, date, and time of day. The data is audited in blocks of four hours; for example, one interval covers 8am to 12pm. We analyze the data within those time-block intervals to help identify the best times, both in the morning and in the afternoon, for each station location. This should allow businesses to target a particular geo-fence that is close to their business.</p>
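<p>A small pandas sketch shows how the audit timestamps can be grouped into those four-hour slots. It assumes the DATE and TIME columns have already been merged into a CREATED datetime column, as described in the conclusions below.</p>
<pre><code class="lang-python">import pandas as pd

# assumes df has a CREATED datetime column and the STATION, ENTRIES, EXITS columns
df['TIMESLOT'] = df['CREATED'].dt.floor('4H')     # 00:00, 04:00, 08:00, 12:00, ...
df['PERIOD'] = df['CREATED'].dt.strftime('%p')    # AM or PM

# totals by station and four-hour slot
slot_totals = df.groupby(['STATION', 'TIMESLOT'], as_index=False)[['ENTRIES', 'EXITS']].sum()
</code></pre>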
<p> In the discovery process, we take a look at the data that is available for our analysis. We are using the MTA turnstiles information which is available at this location:</p>
<p>π <a href="http://web.mta.info/developers/turnstile.html">New York Metropolitan Transportantion Authority Turnstile Data</a></p>
<p>We can download a single file to take a look at the data structure and make the following observations about the data:</p>
<h3 id="observations">Observations</h3>
<ul>
<li>It is available in weekly batches every Sunday</li>
<li>The information is audited in blocks of four hours</li>
<li>The date and time fields are in separate columns</li>
<li>The cumulative entries are in the ENTRIES field</li>
<li>The cumulative exits are in the EXITS field</li>
</ul>
<p><img alt="ozkary-data-engineering-mta-discovery" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-mta-discovery.png" title="Data Engineering Process - Discovery" /></p>
<h3 id="field-description">Field Description</h3>
<table>
<thead>
<tr>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>C/A</td>
<td>Control Area (A002) (Booth)</td>
</tr>
<tr>
<td>UNIT</td>
<td>Remote Unit for a station (R051)</td>
</tr>
<tr>
<td>SCP</td>
<td>Subunit Channel Position represents a specific address for a device (02-00-00)</td>
</tr>
<tr>
<td>STATION</td>
<td>Represents the station name the device is located at</td>
</tr>
<tr>
<td>LINENAME</td>
<td>Represents all train lines that can be boarded at this station. Normally lines are represented by one character. LINENAME 456NQR represents train service for the 4, 5, 6, N, Q, and R trains.</td>
</tr>
<tr>
<td>DIVISION</td>
<td>Represents the division the station originally belonged to: BMT, IRT, or IND</td>
</tr>
<tr>
<td>DATE</td>
<td>Represents the date (MM-DD-YY)</td>
</tr>
<tr>
<td>TIME</td>
<td>Represents the time (hh:mm:ss) for a scheduled audit event</td>
</tr>
<tr>
<td>DESC</td>
<td>Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours). Audits may occur more than 4 hours apart due to planning or troubleshooting activities. Additionally, there may be a "RECOVR AUD" entry, which refers to a missed audit that was recovered.</td>
</tr>
<tr>
<td>ENTRIES</td>
<td>The cumulative entry register value for a device</td>
</tr>
<tr>
<td>EXITS</td>
<td>The cumulative exit register value for a device</td>
</tr>
</tbody>
</table>
<h3 id="data-example">Data Example</h3>
<p>The data below shows the entry/exit register values for one turnstile at control area A002, from 09/27/14 at 00:00 hours to 09/27/14 at 08:00 hours. Since these registers are cumulative, a short sketch for deriving per-interval counts follows the table.</p>
<table>
<thead>
<tr>
<th>C/A</th>
<th>UNIT</th>
<th>SCP</th>
<th>STATION</th>
<th>LINENAME</th>
<th>DIVISION</th>
<th>DATE</th>
<th>TIME</th>
<th>DESC</th>
<th>ENTRIES</th>
<th>EXITS</th>
</tr>
</thead>
<tbody>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>00:00:00</td>
<td>REGULAR</td>
<td>0004800073</td>
<td>0001629137</td>
</tr>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>04:00:00</td>
<td>REGULAR</td>
<td>0004800125</td>
<td>0001629149</td>
</tr>
<tr>
<td>A002</td>
<td>R051</td>
<td>02-00-00</td>
<td>LEXINGTON AVE</td>
<td>456NQR</td>
<td>BMT</td>
<td>09-27-14</td>
<td>08:00:00</td>
<td>REGULAR</td>
<td>0004800146</td>
<td>0001629162</td>
</tr>
</tbody>
</table>
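<p>Because ENTRIES and EXITS are cumulative register values, the counts for each four-hour interval come from differencing consecutive audits for the same device. A small pandas sketch, assuming the file has been loaded into df and a CREATED datetime column exists, illustrates the idea:</p>
<pre><code class="lang-python"># a device (turnstile) is identified by the C/A, UNIT, and SCP columns
device_keys = ['C/A', 'UNIT', 'SCP']
df = df.sort_values(device_keys + ['CREATED'])

# difference the cumulative registers per device to get the counts for each audit interval
df['ENTRIES_DELTA'] = df.groupby(device_keys)['ENTRIES'].diff()
df['EXITS_DELTA'] = df.groupby(device_keys)['EXITS'].diff()
</code></pre>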
<h3 id="conclusions">Conclusions</h3>
<p>Based on these observations, the following conclusions can be made (a short pandas sketch follows the list):</p>
<ul>
<li>Merge the DATE and TIME columns and create a date time column, CREATED</li>
<li>The STATION column is a location dimension</li>
<li>The CREATED column is the datetime dimension to enable the morning and afternoon timeframes</li>
<li>The ENTRIES column is the measure for entries</li>
<li>The EXITS column is the measure for exits</li>
<li>A gate can be identified by using the C/A, SCP and UNIT columns</li>
</ul>
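<p>The first and last conclusions translate directly into pandas. The GATE format below is only an assumption used for illustration:</p>
<pre><code class="lang-python">import pandas as pd

# merge DATE and TIME into the CREATED datetime dimension
df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')

# a gate is identified by the combination of control area, unit, and device address
df['GATE'] = df['C/A'] + '-' + df['UNIT'] + '-' + df['SCP']
</code></pre>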
<h3 id="requirements">Requirements</h3>
<p>These observations can be used to define technical requirements that can enable us to deliver a successful project.</p>
<ul>
<li>Define the infrastructure requirements to host the technology<ul>
<li>Automate the provisioning of the resources using Terraform</li>
<li>Deploy the technology on a cloud platform</li>
</ul>
</li>
<li>Define the data orchestration process<ul>
<li>On the original pipeline, load the initial data for 2023</li>
<li>Create a data pipeline that runs every week after a new file has been published</li>
<li>Copy the unstructured CSV files into a Data Lake</li>
</ul>
</li>
<li>Define a well-structured and optimized model on a Data Warehouse<ul>
<li>Keep the source code for the models under source control</li>
<li>Copy the data into the Data Warehouse</li>
<li>Allow access to the Data Warehouse, so visualization tools can consume the data.</li>
</ul>
</li>
<li>Create Data Analysis dashboard with the following information <ul>
<li>Data Analysis dashboard</li>
<li>Identify the time slots for morning and afternoon analysis</li>
<li>Look at the distribution by stations</li>
<li>Look at the daily models</li>
<li>Look at the time slot models</li>
</ul>
</li>
</ul>
<h2 id="review-the-code">Review the Code</h2>
<p>In order to do our data analysis, we first need to download some sample data by writing a Python script. We can then analyze this data by writing some code snippets that use the power of the Python Pandas library. We can also use Jupyter Notebooks to quickly manipulate the data and create some charts that can serve as baseline requirements for the final visualization dashboard.</p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step1-Discovery/" target="_pipeline">Clone this repo or copy the files from this folder
</a></p>
<h3 id="download-a-csv-file-from-the-mta-site">Download a CSV File from the MTA Site</h3>
<p>With this Python script (mta_discovery.py), we download a CSV file from the URL <a href="http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt">http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt</a>. The code creates a data stream to download the file in chunks and avoid timeouts, appending each chunk to a local compressed file to reduce its size. To make the code reusable, we use a command-line parser, so the URL can be passed as a parameter.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">read_local</span><span class="hljs-params">(file_path: str)</span> -> Path:</span>
<span class="hljs-string">"""
Reads a local file
Args:
file_path: local file
"""</span>
print(F<span class="hljs-string">'Reading local file {file_path}'</span>)
df_iter = pd.read_csv(file_path, iterator=<span class="hljs-keyword">True</span>,compression=<span class="hljs-string">"gzip"</span>, chunksize=<span class="hljs-number">10000</span>)
<span class="hljs-keyword">if</span> df_iter:
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
print(<span class="hljs-string">'File headers'</span>,df.columns)
print(<span class="hljs-string">'Top 10 rows'</span>,df.head(<span class="hljs-number">10</span>))
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was loaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(F<span class="hljs-string">"failed to read file {file_path}"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_local</span><span class="hljs-params">(df: pd.DataFrame, folder: str, file_name: str)</span> -> Path:</span>
<span class="hljs-string">"""
Write DataFrame out locally as csv file
Args:
df: dataframe chunk
folder: the download data folder
file_name: the local file name
"""</span>
path = Path(f<span class="hljs-string">"{folder}"</span>)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(path):
path.mkdir(parents=<span class="hljs-keyword">True</span>, exist_ok=<span class="hljs-keyword">True</span>)
file_path = Path(f<span class="hljs-string">"{folder}/{file_name}"</span>)
<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.isfile(file_path):
df.to_csv(file_path, compression=<span class="hljs-string">"gzip"</span>)
print(<span class="hljs-string">'new file'</span>)
<span class="hljs-keyword">else</span>:
df.to_csv(file_path, header=<span class="hljs-keyword">None</span>, compression=<span class="hljs-string">"gzip"</span>, mode=<span class="hljs-string">"a"</span>)
print(<span class="hljs-string">'chunk appended'</span>)
<span class="hljs-keyword">return</span> file_path
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">etl_web_to_local</span><span class="hljs-params">(url: str, name: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Download a file
Args:
url : The file url
name : the file name
"""</span>
print(url, name)
<span class="hljs-comment"># skip an existent file</span>
path = f<span class="hljs-string">"../data/"</span>
file_path = Path(f<span class="hljs-string">"{path}/{name}.csv.gz"</span>)
<span class="hljs-keyword">if</span> os.path.exists(file_path):
read_local(file_path)
<span class="hljs-keyword">return</span>
df_iter = pd.read_csv(url, iterator=<span class="hljs-keyword">True</span>, chunksize=<span class="hljs-number">10000</span>)
<span class="hljs-keyword">if</span> df_iter:
file_name = f<span class="hljs-string">"{name}.csv.gz"</span>
<span class="hljs-keyword">for</span> df <span class="hljs-keyword">in</span> df_iter:
<span class="hljs-keyword">try</span>:
write_local(df, path, file_name)
<span class="hljs-keyword">except</span> StopIteration <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Finished reading file {ex}"</span>)
<span class="hljs-keyword">break</span>
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> ex:
print(f<span class="hljs-string">"Error found {ex}"</span>)
<span class="hljs-keyword">return</span>
print(f<span class="hljs-string">"file was loaded {file_path}"</span>)
<span class="hljs-keyword">else</span>:
print(<span class="hljs-string">"dataframe failed"</span>)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main_flow</span><span class="hljs-params">(params: str)</span> -> <span class="hljs-keyword">None</span>:</span>
<span class="hljs-string">"""
Process a CSV file from a url location with the goal to understand the data structure
"""</span>
url = params.url
prefix = params.prefix
<span class="hljs-keyword">try</span>:
start_index = url.index(<span class="hljs-string">'_'</span>)
end_index = url.index(<span class="hljs-string">'.txt'</span>)
file_name = F<span class="hljs-string">"{prefix}{url[start_index:end_index]}"</span>
<span class="hljs-comment"># print(file_name)</span>
etl_web_to_local(url, file_name)
<span class="hljs-keyword">except</span> ValueError:
print(<span class="hljs-string">"Substring not found"</span>)
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
os.system(<span class="hljs-string">'clear'</span>)
parser = argparse.ArgumentParser(description=<span class="hljs-string">'Process CSV data to understand the data'</span>)
parser.add_argument(<span class="hljs-string">'--url'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'url of the csv file'</span>)
parser.add_argument(<span class="hljs-string">'--prefix'</span>, required=<span class="hljs-keyword">True</span>, help=<span class="hljs-string">'the file prefix or group name'</span>)
args = parser.parse_args()
print(<span class="hljs-string">'running...'</span>)
main_flow(args)
print(<span class="hljs-string">'end'</span>)
</code></pre>
<h3 id="analyze-the-data">Analyze the Data</h3>
<p>With some sample data, we can now take a look at the data and make some observations. There are a few ways to approach the analysis. We could create another Python script and play with the data, but this would require running the script from the console after every code change. A more productive way is to use Jupyter Notebooks. This tool enables us to edit and run code snippets in cells without having to run the entire script. It is a friendlier analysis tool that helps us focus on the data analysis instead of coding and running the script. In addition, once we are happy with our changes, the notebook can be exported into a Python file. Let's look at that file, discovery.ipynb:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> time <span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment"># read the file and display the top 10 rows</span>
df = pd.read_csv(<span class="hljs-string">'../data/230318.csv.gz'</span>, iterator=<span class="hljs-keyword">False</span>,compression=<span class="hljs-string">"gzip"</span>)
df.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Create a new DateTime column and merge the DATE and TIME columns</span>
df[<span class="hljs-string">'CREATED'</span>] = pd.to_datetime(df[<span class="hljs-string">'DATE'</span>] + <span class="hljs-string">' '</span> + df[<span class="hljs-string">'TIME'</span>], format=<span class="hljs-string">'%m/%d/%Y %H:%M:%S'</span>)
df = df.drop(<span class="hljs-string">'DATE'</span>, axis=<span class="hljs-number">1</span>).drop(<span class="hljs-string">'TIME'</span>,axis=<span class="hljs-number">1</span>)
df.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Aggregate the information by station and datetime</span>
df[<span class="hljs-string">"ENTRIES"</span>] = df[<span class="hljs-string">"ENTRIES"</span>].astype(int)
df[<span class="hljs-string">"EXITS"</span>] = df[<span class="hljs-string">"EXITS"</span>].astype(int)
df_totals = df.groupby([<span class="hljs-string">"STATION"</span>,<span class="hljs-string">"CREATED"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>]].sum()
df_totals.head(<span class="hljs-number">10</span>)
df_station_totals = df.groupby([<span class="hljs-string">"STATION"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>,<span class="hljs-string">"EXITS"</span>]].sum()
df_station_totals.head(<span class="hljs-number">10</span>)
<span class="hljs-comment"># Show the total entries by station, use a subset of data</span>
<span class="hljs-keyword">import</span> plotly.express <span class="hljs-keyword">as</span> px
<span class="hljs-keyword">import</span> plotly.graph_objects <span class="hljs-keyword">as</span> go
df_stations = df_station_totals.head(<span class="hljs-number">25</span>)
donut_chart = go.Figure(data=[go.Pie(labels=df_stations[<span class="hljs-string">"STATION"</span>], values=df_stations[<span class="hljs-string">"ENTRIES"</span>], hole=<span class="hljs-number">.2</span>)])
donut_chart.update_layout(title_text=<span class="hljs-string">'Entries Distribution by Station'</span>, margin=dict(t=<span class="hljs-number">40</span>, b=<span class="hljs-number">0</span>, l=<span class="hljs-number">10</span>, r=<span class="hljs-number">10</span>))
donut_chart.show()
<span class="hljs-comment"># Show the data by the day of the week</span>
df_by_date = df_totals.groupby([<span class="hljs-string">"CREATED"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>]].sum()
day_order = [<span class="hljs-string">'Sun'</span>, <span class="hljs-string">'Mon'</span>, <span class="hljs-string">'Tue'</span>, <span class="hljs-string">'Wed'</span>, <span class="hljs-string">'Thu'</span>, <span class="hljs-string">'Fri'</span>, <span class="hljs-string">'Sat'</span>]
df_by_date[<span class="hljs-string">"WEEKDAY"</span>] = pd.Categorical(df_by_date[<span class="hljs-string">"CREATED"</span>].dt.strftime(<span class="hljs-string">'%a'</span>), categories=day_order, ordered=<span class="hljs-keyword">True</span>)
df_entries_by_date = df_by_date.groupby([<span class="hljs-string">"WEEKDAY"</span>], as_index=<span class="hljs-keyword">False</span>)[[<span class="hljs-string">"ENTRIES"</span>]].sum()
df_entries_by_date.head(<span class="hljs-number">10</span>)
bar_chart = go.Figure(data=[go.Bar(x=df_entries_by_date[<span class="hljs-string">"WEEKDAY"</span>], y=df_entries_by_date[<span class="hljs-string">"ENTRIES"</span>])])
bar_chart.update_layout(title_text=<span class="hljs-string">'Total Entries by Week Day'</span>)
bar_chart.show()
</code></pre>
<h2 id="how-to-run-it-">How to Run it!</h2>
<p>With an understanding of the code and tools, let's run the process.</p>
<h3 id="requirements">Requirements</h3>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/wiki/Configure-Python-Dependencies" target="_python">Install Python, Pandas and Jupyter notebook
</a></p>
<p>π <a target="_vscode" href="https://code.visualstudio.com/download">
Install Visual Studio Code
</a></p>
<p>π <a href="https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/Step1-Discovery" target="_python">Clone this repo or copy the files from this folder
</a></p>
<h3 id="follow-these-steps-to-run-the-analysis">Follow these steps to run the analysis</h3>
<ul>
<li>Download a file to look at the data<ul>
<li>This should create a gz file under the ../data folder</li>
</ul>
</li>
</ul>
<pre><code><span class="hljs-variable">$ </span>python3 mta_discovery.py --url <span class="hljs-symbol">http:</span>/<span class="hljs-regexp">/web.mta.info/developers</span><span class="hljs-regexp">/data/nyct</span><span class="hljs-regexp">/turnstile/turnstile</span>_230318.txt
</code></pre><p>Run the Jupyter notebook (dicovery.ipynb) to do some analysis on the data. </p>
<ul>
<li>Load the Jupyter notebook to do analysis<ul>
<li>First start the Jupyter server from the terminal</li>
</ul>
</li>
</ul>
<pre><code class="lang-bash"><span class="hljs-variable">$ </span>jupyter notebook
</code></pre>
<ul>
<li>See the URL on the terminal and click it to load it on the browser<ul>
<li>Click the discovery.ipynb file link</li>
</ul>
</li>
<li>Or open the file with VSCode and enter the URL when prompted to select a kernel</li>
<li>Run every cell from the top down, as this is required to load the dependencies</li>
</ul>
<p>The following images show the Jupyter notebook loaded in the browser or directly in VSCode.</p>
<h4 id="jupyter-notebook-loaded-on-the-browser">Jupyter Notebook loaded on the browser</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-mta.png" alt="ozkary-data-engineering-jupyter-notebook" title="Data Engineering Process - Discovery"></p>
<p></p>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-notepbook.png" alt="ozkary-data-engineering-discovery-query" title="ozkary MTA jupyter notebook loaded"></p>
<h4 id="using-vscode-to-load-the-data-and-create-charts">Using VSCode to load the data and create charts</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-vscode.png" alt="ozkary-data-engineering-discovery-jupyter-vscode" title="ozkary MTA jupyter vscode"></p>
<h4 id="show-the-total-entries-by-station-using-a-subset-of-data-using-vscode">Show the total entries by station using a subset of data using VSCode</h4>
<p><img src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-jupyter-pie-chart.png" alt="ozkary-data-engineering-discovery-donut-chart" title="ozkary MTA jupyter donut chart"></p>
<h1 id="next-step">Next Step</h1>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-design-planning.html" title="Data Engineering Process Fundamentals - Design and Planning">Data Engineering Process Fundamentals - Design and Planning</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="Software and data engineering professional information">ozkary.com</a></p>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-61761556361879937112023-04-01T12:22:00.044-04:002023-07-14T14:59:17.480-04:00Data Engineering Process Fundamentals - Discovery<h2 id="introduction">Introduction</h2>
<p>In this series of Data Engineering Process Fundamentals, we explore the Data Engineering Process (DEP) with key concepts, principles and relevant technologies, and explain how they are being used to help us deliver solutions. The first step in this process, and one we should never skip, is the Discovery step.</p>
<p>During the discovery step of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us understand what we are trying to solve. We also define our analytical approach to make observations about the data, its structure and source. This leads us into defining the requirements for the project, so we can determine the scope, design and architecture of the solution.</p>
<p><img alt="ozkary-data-engineering-process-discovery" src="//www.ozkary.dev/assets/2023/ozkary-data-engineering-process-discovery.png" title="Data Engineering Process - Discovery" /></p>
<h3 id="problem-statement">Problem Statement</h3>
<p>A Problem Statement is a description of what it is that we are trying to solve. As part of the problem statement, we should provide some background or context on how the data is processed or collected. We should also provide a specific description of what the data engineering process is looking to solve by taking a specific approach to integrate the data. Finally, the objective and goals should also be described with information about how this data will be made available for analytical purposes.</p>
<h3 id="analytical-approach">Analytical Approach</h3>
<p>The Analytical Approach is a systematic method to observe the data and arrive at insights from it. It involves the use of different techniques, tools and frameworks to make sense of the data in order to arrive at conclusions and actionable requirements. </p>
<h4 id="dataset-criteria">Dataset Criteria</h4>
<p>A Dataset Criteria technique refers to the set of characteristics used to evaluate the data, so we can determine its quality and accuracy. </p>
<p>In the data collection process, we should identify the various sources that can provide us with accurate and complete information. Data cleaning and preprocessing needs to be done to identify and eliminate missing values, invalid data and outliers that can skew the information. In addition, we should understand how this data is available for the ongoing integration. Some integrations may require a batch process integration at a scheduled interval. Others may require a real-time integration and/or a combination of batch and real-time processing.</p>
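<p>To make these checks concrete, the following is a minimal pandas sketch of a data quality pass. It assumes the same MTA turnstile sample file used later in this exercise; the column list and checks are illustrative only.</p>
<pre><code class="lang-python">import pandas as pd

# load a sample weekly file (path follows the exercise layout)
df = pd.read_csv('../data/230318.csv.gz', compression='gzip')

# count missing values per column
print(df.isnull().sum())

# drop rows that are missing the key fields and remove duplicates
df = df.dropna(subset=['STATION', 'DATE', 'TIME', 'ENTRIES', 'EXITS'])
df = df.drop_duplicates()

# coerce the counters to numbers and flag invalid (negative) readings
df['ENTRIES'] = pd.to_numeric(df['ENTRIES'], errors='coerce')
df['EXITS'] = pd.to_numeric(df['EXITS'], errors='coerce')
invalid = df[df['ENTRIES'].lt(0) | df['EXITS'].lt(0)]
print(f'Invalid rows: {len(invalid)}')
</code></pre>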
<h4 id="exploratory-data-analysis">Exploratory Data Analysis</h4>
<p>We should conduct exploratory data analysis to understand the structure, patterns and characteristics of the data. We need to make observations about the data, identify the valuable fields, create statistical summaries, and run some data profiling to identify trends, metrics and correlations that are relevant to the main objective of the project.</p>
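<p>A quick profiling pass with pandas supports these observations. This is only a sketch against the sample file; a full analysis would cover every relevant field for the project.</p>
<pre><code class="lang-python">import pandas as pd

df = pd.read_csv('../data/230318.csv.gz', compression='gzip')

# structure: columns, data types and non-null counts
df.info()

# statistical summary of the numeric fields
print(df.describe())

# cardinality of a categorical field, e.g. the number of stations
print(df['STATION'].nunique())

# correlation between the numeric counters
print(df[['ENTRIES', 'EXITS']].corr())
</code></pre>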
<h4 id="tools-and-framework">Tools and Framework</h4>
<p>Depending on the size and budget of the organization, the solution can be built with lots of coding and integration, or a low-code, turn-key solution that provides enterprise-quality resources can be used instead. Regardless of the approach, Python is a popular language for data science and engineering, and it is always applicable. The Pandas library is great for data manipulation and analysis, and Jupyter Notebooks with Python scripts are great for experiments and discovery.</p>
<p>
To run our Python scripts and Jupyter notebooks, we can use Visual Studio Code (VSCode), which is a cross-platform Integrated Development Environment (IDE). This tool also integrates with source control and deployment platforms like GitHub, so we can maintain version control and automate the deployment and testing of our code changes.
</p>
<p>To orchestrate the pipelines, we often use a workflow framework like Apache Airflow or Prefect. To host the data, we use data lakes (blob storage) and a relational data warehouse. For data modeling, incremental data loads, and continuous testing and data ingestion, Apache Spark or dbt Cloud are used.</p>
<p>For the final data analysis and visualization, we can use tools like Looker, Power BI and Tableau. These tools can connect to a data warehouse and consume the data models, so they can visualize the data in ways that enable stakeholders to make decisions based on the story provided by the data.</p>
<h3 id="requirements">Requirements</h3>
<p>Requirements refer to the needs, capabilities and constraints that are needed to deliver a data engineering solution. They should outline the project deliverables that are required to meet the main objectives. The requirements should include data related areas like: </p>
<ul>
<li>Sources and integration</li>
<li>Modeling and transformation</li>
<li>Quality and validation</li>
<li>Storage and infrastructure </li>
<li>Processing and Analytics</li>
<li>Governance and Security</li>
<li>Scalability and performance</li>
<li>Monitoring</li>
</ul>
<h3 id="design">Design</h3>
<p>A data engineering design is the actual plan to build the technical solution. It includes the system architecture, data integration, flow and pipeline orchestration, the data storage platform, transformation and management, data processing and analytics tooling. This is the area where we need to clearly define the different technologies that should be used for each area. </p>
<h4 id="system-architecture">System Architecture</h4>
<p>The system architecture is a high-level design of the solution, its components and how they integrate with each other. This often includes the data sources, data ingestion resources, workflow and data orchestration resources and frameworks, storage resources, data services for data transformation and continuous data ingestion and validation, and data analysis and visualization tooling.</p>
<h4 id="data-pipelines">Data Pipelines</h4>
<p>A data pipeline refers to a series of connected tasks that handles the extract, transform and load (ETL) as well as the extract, load and transform (ELT) operations and integration from a source to a target storage like a data lake or data warehouse. </p>
<p>The use of ETL or ELT depends on the design. For some solutions, a flow task may transform the data prior to loading it into storage. This approach tends to increase the amount of python code and hardware resources used by the hosting environment. For the ELT process, the transformation may be done using SQL code and the data warehouse resources, which often tend to perform great for big data scenarios.</p>
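<p>As a rough sketch of the difference, the example below uses pandas for the ETL-style transformation and plain SQL for the ELT-style transformation. SQLite stands in for the data warehouse here, and the table names are placeholders for whatever the design defines.</p>
<pre><code class="lang-python">import pandas as pd
from sqlalchemy import create_engine, text

# SQLite is only a stand-in for a real data warehouse
engine = create_engine('sqlite:///warehouse.db')

def run_etl(source_file: str) -> None:
    """ETL: transform with pandas first, then load the curated result."""
    df = pd.read_csv(source_file, compression='gzip')
    df['CREATED'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'], format='%m/%d/%Y %H:%M:%S')
    totals = df.groupby(['STATION', 'CREATED'], as_index=False)[['ENTRIES', 'EXITS']].sum()
    totals.to_sql('station_totals', engine, if_exists='replace', index=False)

def run_elt(source_file: str) -> None:
    """ELT: load the raw data first, then transform with SQL in the warehouse."""
    df = pd.read_csv(source_file, compression='gzip')
    df.to_sql('staging_turnstile', engine, if_exists='replace', index=False)
    sql = """
        CREATE TABLE IF NOT EXISTS station_entries AS
        SELECT STATION, SUM(ENTRIES) AS ENTRIES, SUM(EXITS) AS EXITS
        FROM staging_turnstile
        GROUP BY STATION
    """
    with engine.begin() as conn:
        conn.execute(text(sql))
</code></pre>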
<h4 id="data-orchestration">Data Orchestration</h4>
<p>Data orchestration refers to the automation, management and coordination of the data pipeline tasks. It involves the scheduling, workflows, monitoring and recovery of those tasks. The orchestration ensures the execution of those tasks, and it takes care of error handling, retry and the alerting of problems in the pipeline.</p>
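<p>As an illustration, a minimal Prefect 2.x-style flow could wire these concerns together. The retry settings and local file path are assumptions for this sketch; a production flow would write to the data lake and alert on failures.</p>
<pre><code class="lang-python">import requests
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def download_file(url: str) -> bytes:
    """Download the source file; the task retries on transient failures."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content

@task
def save_file(content: bytes, path: str) -> None:
    """Persist the raw file; a real flow would upload to blob storage."""
    with open(path, 'wb') as file:
        file.write(content)

@flow(name='mta-turnstile-ingestion')
def ingest(url: str, path: str) -> None:
    content = download_file(url)
    save_file(content, path)

if __name__ == '__main__':
    ingest(
        'http://web.mta.info/developers/data/nyct/turnstile/turnstile_230318.txt',
        '../data/turnstile_230318.txt',
    )
</code></pre>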
<h2 id="summary">Summary</h2>
<p>The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks, or other tools, to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.</p>
<h2 id="exercise-hands-on-use-case">Exercise - Hands-on Use Case</h2>
<p>Since we now understand the discovery step, we should be able to put that into practice. Let's move on to a hands-on use case and see how we apply those concepts.</p>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamental-discovery-exercise.html" title="Data Engineering Process Fundamentals - Discovery Exercise">Data Engineering Process Fundamentals - Discovery Exercise</a></p>
</blockquote>
<p>
Thanks for reading.
</p><p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="ozkary.com AI, Data Engineering">ozkary.com</a></p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-85146545861634795852023-03-25T11:41:00.040-04:002023-06-30T12:49:46.181-04:00Data Engineering Process Fundamentals<h2 id="introduction">Introduction</h2>
<p>Data Engineering is changing constantly. From cloud data platforms and pipeline automation to data streaming and visualization tools, new innovations are impacting the way we build today's data and analytical solutions. </p>
<p>In this series of Data Engineering Process Fundamentals, we explore the Data Engineering Process (DEP) with key concepts, principles and relevant technologies, and explain how they are being used to help us deliver the solution. We discuss concepts and take on a real use case where we execute an end-to-end process from downloading data to visualizing the results. </p>
<p>
The end goal of this series is to take us through a process in which we deliver an architecture that facilitates the ongoing analysis of big data via analytical and visualization tools. In the following images, we can get a preview of what we will be delivering as we execute each step of the process. </p>
<h3 id="data-engineering-process-architecture">Data Engineering - Architecture</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process-architecture" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process-architecture.png" title="Data Engineering Process Architecture" />
</p>
<h3 id="data-engineering-process-Analysis-Results">Data Engineering - Analysis Results</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process-analysis-results" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process-dashboard.png" title="Data Engineering Process Analysis Results" />
</p>
<h3 id="data-engineering-process">Data Engineering Process</h3>
<p class="separator" style="clear: both; text-align: center;">
<img alt="ozkary-data-engineering-process" height="562" src="https://www.ozkary.dev/assets/2023/ozkary-data-engineering-process.png" title="Data Engineering Process" width="640" />
</p>
<p>A Data Engineering Process follows a series of steps that should be executed to properly understand the problem statement, scope of work, design and architecture that should be used to create the solution. Some of these steps include the following:</p>
<blockquote>
<p>π Join this newsletter to receive updates <a href="https://maven.com/forms/56ae79">Sign up here</a></p>
</blockquote>
<ul>
<li><a href="//www.ozkary.dev/data-engineering-process-foundamentals-discovery">Discovery</a><ul>
<li>Problem Statement</li>
<li>Data Analysis</li>
<li>Define the Requirements and Scope of Work</li>
<li>Discovery Exercise</li>
</ul>
</li>
<li><a href="//www.ozkary.dev/data-engineering-process-foundamentals-design-planning">Design and Planning</a><ul>
<li>Design Approach</li>
<li>System Architecture</li>
<li>Cloud Engineering and Automation</li>
<li>Design Exercise</li>
</ul>
</li>
</ul>
<ul>
<li>Data Orchestration and Operations<ul>
<li>Pipeline Orchestration<ul>
<li>Batch Processing</li>
</ul>
</li>
<li>Workflow Automation</li>
<li>Deployment, Schedules and Monitoring</li>
</ul>
</li>
<li>Data Warehouse and Modeling<ul>
<li>Data modeling</li>
<li>Data Warehouse Design</li>
<li>Continuous Integration</li>
</ul>
</li>
<li>Data Analysis and Visualization<ul>
<li>Analyze the data</li>
<li>Visualization Concepts</li>
<li>Create a Dashboard<ul>
<li>Provide answers to the problem statement</li>
</ul>
</li>
</ul>
</li>
<li>Streaming Data<ul>
<li>Data Warehouse Integration</li>
<li>Real-time dashboard</li>
</ul>
</li>
</ul>
<h2 id="concepts">Concepts</h2>
<h3 id="what-is-data-engineering-">What is Data Engineering?</h3>
<p>Data Engineering is the practice of designing and building solutions by integrating, transforming and consolidating various data sources into a centralized and structured system, such as a Data Warehouse, at scale, so the data becomes available for building analytics solutions.</p>
<h3 id="what-is-a-data-engineering-process-">What is a Data Engineering Process?</h3>
<p>A Data Engineering Process (DEP) is the sequence of steps that engineers should follow in order to build a testable, robust and scalable solution. This process starts early with a problem statement to understand what the team is trying to solve. It is then followed by data analysis and requirements discovery, which leads to a design and architecture approach in which the different applicable technologies are identified.</p>
<h3 id="operational-and-analytical-data">Operational and Analytical data</h3>
<p>Operational data is often generated by applications, and it is stored in transactional databases like SQL Server, CosmosDB, Firebase and others. This is the data that is created after an application saves a user transaction like contact information, a purchase or other activities that are available from the application. This system is not designed to support Big Data query scenarios, so the reporting system should not overload its resources with large queries.</p>
<p>Analytical data is the transactional data that has been processed and optimized for analytical and visualization purposes. This data is often processed via Data Lakes and stored in a Data Warehouse.</p>
<h3 id="data-pipelines-and-orchestration">Data Pipelines and Orchestration</h3>
<p>Data Pipelines are used to orchestrate and automate workflows that move and process the transactional data into Data Lakes and Data Warehouses. The pipelines execute repeatable Extract Transform and Load (ETL) or Extract Load and Transform (ELT) processes that can be triggered by a schedule or a data event. </p>
<h3 id="data-lakes">Data Lakes</h3>
<p>A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. This can include structured data like CSV files, semi-structured data like JSON and XML documents, or column-based data like parquet files.</p>
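<p>For example, a short sketch of landing a raw file into a cloud data lake might look like the following. The bucket and blob names are placeholders, and it assumes the google-cloud-storage client library and credentials are already configured.</p>
<pre><code class="lang-python">from google.cloud import storage

def upload_to_data_lake(local_file: str, bucket_name: str, blob_name: str) -> None:
    """Store the file in its raw format, without any transformation."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_file)

# example call with placeholder bucket and path names
upload_to_data_lake('../data/230318.csv.gz', 'example-data-lake', 'turnstile/230318.csv.gz')
</code></pre>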
<h3 id="data-warehouse">Data Warehouse</h3>
<p>A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. This system stores historical data in relational tables with an optimized schema, which enables the data analysis process. This system can also integrate external resources like CSV and parquet files that are stored in Data Lakes as external tables. The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system. </p>
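<p>As a sketch of the external table idea, the snippet below runs BigQuery-style DDL over parquet files in a data lake. The dataset, table and bucket names are placeholders, and it assumes the google-cloud-bigquery client library is installed and authenticated.</p>
<pre><code class="lang-python">from google.cloud import bigquery

client = bigquery.Client()

# expose the parquet files in the data lake as an external table
ddl = """
    CREATE OR REPLACE EXTERNAL TABLE mta_data.ext_turnstile
    OPTIONS (
        format = 'PARQUET',
        uris = ['gs://example-data-lake/turnstile/*.parquet']
    )
"""
client.query(ddl).result()
</code></pre>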
<h3 id="data-batch-processing">Data Batch Processing</h3>
<p>Batch Processing is a method often used to run high-volume, repetitive data jobs. It is usually scheduled during certain time windows that do not impact the application operations, as these processes are often used to export the data from transactional systems. A batch job is an automated software task that may include one or more workflows. These workflows can often run without supervision, and they are monitored by other tools to ensure that the process is not failing. </p>
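<p>A simple illustration of a batch job is a script that processes every pending weekly file in one run; a scheduler such as cron or an orchestration tool would trigger it during an off-peak window. The folder layout mirrors the earlier exercise and is only an assumption.</p>
<pre><code class="lang-python">from pathlib import Path
import pandas as pd

def process_batch(data_folder: str = '../data') -> None:
    """Process all pending weekly files in a single scheduled run."""
    for file in sorted(Path(data_folder).glob('*.csv.gz')):
        df = pd.read_csv(file, compression='gzip')
        # a real job would transform and load the data here
        print(f'{file.name}: {len(df)} rows processed')

if __name__ == '__main__':
    process_batch()
</code></pre>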
<h3 id="streaming-data">Streaming Data</h3>
<p>Streaming Data is a data source that sends small messages in real-time but at a high volume. This data often comes from Internet-of-Things (IoT) devices, manufacturing equipment or social media sources, often producing a high volume of information per second. This information is often captured in aggregated time windows and then stored in a Data Warehouse, so it can be combined with other analytical data. It can also be sent to monitoring and/or real-time systems to show the current system KPIs or any type of variance in the system.</p>
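<p>As a sketch of the aggregated time window idea, the example below uses Spark Structured Streaming to count messages from a Kafka topic in one-minute windows. The broker address and topic name are placeholders, and it assumes the Spark Kafka connector package is available on the cluster.</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName('streaming-windows').getOrCreate()

# read the message stream from a Kafka topic (placeholder broker and topic)
events = (spark.readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'localhost:9092')
    .option('subscribe', 'device-events')
    .load())

# aggregate the message count into one-minute tumbling windows
counts = (events
    .groupBy(window(col('timestamp'), '1 minute'))
    .agg(count('*').alias('messages')))

# write the aggregates to the console; a real pipeline would target the warehouse
query = (counts.writeStream
    .outputMode('complete')
    .format('console')
    .start())

query.awaitTermination()
</code></pre>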
<h2 id="next-step">Next Step</h2>
<blockquote>
<p>π <a href="//www.ozkary.com/2023/04/data-engineering-process-fundamentals-discovery.html" title="">Data Engineering Process Fundamentals - Discovery</a></p>
</blockquote>
<p>Thanks for reading.</p>
<p>Send questions or comments on Twitter @ozkary</p>
<p>π Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a></p><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-4992832309936880342023-03-18T12:37:00.027-04:002023-05-18T12:54:53.133-04:00GitHub Codespaces Quick ReviewThis is a quick review about GitHub Codespaces, which you can load right on the browser.
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZhUFYrHSkwyafbtfVFdmWXg55DOfl5EjaLwnFq-klw4dsVXPTfj4NV3VvK9v874keEZZmOyss_6BayXCTfWMR9ziCcIJRZG-ztaBobhI7PIuIpp40WfOVGxIQv2mZk3uuHbXwaf-VztNyG5BUQ4W6mxEhSYbDgkWQ7mIlWDTZYfbNbBPnKl_Izlv1ag/s1294/ozkary-github-codespaces-review.png" style="display: block; padding: 1em 0px; text-align: center;"><img alt="ozkary github codespaces review" border="0" data-original-height="676" data-original-width="1294" height="334" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZhUFYrHSkwyafbtfVFdmWXg55DOfl5EjaLwnFq-klw4dsVXPTfj4NV3VvK9v874keEZZmOyss_6BayXCTfWMR9ziCcIJRZG-ztaBobhI7PIuIpp40WfOVGxIQv2mZk3uuHbXwaf-VztNyG5BUQ4W6mxEhSYbDgkWQ7mIlWDTZYfbNbBPnKl_Izlv1ag/w640-h334/ozkary-github-codespaces-review.png" title="ozkary github codespaces review" width="640" /></a></div>
<p>
In this video, we talk about creating a Codespaces instance from a GitHub Repository. We load a GitHub project using the VM instance that is provisioned for us when a GitHub Codespace is added to the repo. To edit the project files, the browser loads a VS Code online version of the IDE, which then uses the SSH protocol to virtualize the code from the VM onto our browser. Since we are basically using a VM, we can open a terminal from the browser session and run NPM and Git commands. All these commands are executed on the VM.
GitHub Codespaces are a quick way to provision a VM instance without the complexity of manually building one on a cloud account.
</p>
<div>
<p style="text-align: center;">
<iframe frameborder="0" height="270" src="https://youtube.com/embed/woDCq_gHh94" width="480"></iframe>
</p>
</div>
<p>
Thanks for reading.
</p>
<p>Send questions or comments on Twitter @ozkary</p>
Originally published by <a href="https://www.ozkary.com"> ozkary.com </a>
<div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-59910205595504359412023-02-11T15:55:00.076-05:002023-03-16T16:34:04.830-04:00React Suspense Fallback Keeps Rendering When Using Lazy-Loaded Routes<div><span id="docs-internal-guid-559a116c-7fff-3e4c-7f6f-2402dce5e123"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When a React application is loading a new lazy-loaded route or a component, it is a common practice to show a loading component with an animation to provide feedback to the user. In React, we use the Suspense component to handle this scenario, but there are times when the fallback component never unloads, and it keeps rendering on the browser.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When a Suspense fallback component never unloads, it is because there must be a child component that keeps rendering. This causes the Suspense component to continue to run, thinking that a child component in still in suspense or loading state. This gives us the wrong sense that the Suspense component is misbehaving, when it is in fact the child components that is at fault. 
For a deeper understanding, letβs take a look at a real example.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqPXM_8xNoiFLgBIvmKByqJdIy9U3WzoNMx6H7IQUo8ZF3lRc5bR8511EG5WBjMJk0PoI8BKoeCtlgM9TXDIEtqVyqYA3SizWAchZLuk66BYE8aH8AEx7tfWTkY-7-Pv8CqSFxiUh4FEHZndtV0VUJTTP3uOsBYSCO8GtY5Ob9tZXcQqZDeDPXoCO35A/s531/Ozkary-react-suspend.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="ozkary react suspend component" border="0" data-original-height="411" data-original-width="531" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqPXM_8xNoiFLgBIvmKByqJdIy9U3WzoNMx6H7IQUo8ZF3lRc5bR8511EG5WBjMJk0PoI8BKoeCtlgM9TXDIEtqVyqYA3SizWAchZLuk66BYE8aH8AEx7tfWTkY-7-Pv8CqSFxiUh4FEHZndtV0VUJTTP3uOsBYSCO8GtY5Ob9tZXcQqZDeDPXoCO35A/w400-h310/Ozkary-react-suspend.png" title="ozkary react suspend component" width="400" /></a></div><br /><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Suspense with Declarative Routes</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">To demonstrate this problem, letβs first take a look at an implementation of a React application that loads some route information in a declarative approach, which is simple enough and does not introduce any rendering problems. We should also notice how the components are lazy-loaded to help us do code splitting for performance improvements on the load time and enable us to trigger the Suspense component automatically.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=App-Declarative-Routes.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π This is a typical approach when creating apps with a simple routing structure.</p></div><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Suspense with a Router Component</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Letβs now take a look at a more complex scenario where we need to load the route information from a JSON configuration file. This introduces a variation on the process by adding the routes using a function. </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=App-Imperative-Routes.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=router-with-rendering-problem.tsx"></script>
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">This code change introduces new behavior, and it causes the component to re-render multiple times. We can trace that by adding a console.log operation before returning the content. If we look at the browser console, we should see the output of our console.log call. From the application standpoint, this behavior is noticeable because the fallback component continues to show on the browser as the Suspense component does not detect that the child component is done rendering. Now that we understand the problem, how should we correct it?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π A child component continues to be in suspense until it stops rendering.</p></div><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Adding State Management</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">To correct this behavior, we should clearly understand the root cause. Since we introduce a dynamic way to load the routes, the component has no way to understand its current state. It only knows that some data is being loaded every time it calls the function, and the data seems new or different. To avoid this, we need to add state management to the component, which is a React best practice when writing data driven components. Letβs refactor our code to see how state management can make a difference.</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p>
<script src="https://gist.github.com/ozkary/f01785e0ee46935d02f5cff9b3d31df2.js?file=router-with-state.tsx"></script>
<p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">By looking at our new implementation, we can see that the route collection is now managed in a state variable. We also use an </span><span style="font-family: Arial; font-size: 11pt; font-style: italic; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">useEffect </span><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">hook to load the data. This enables us to initiate the state of the component with some valid data. We also track the route collection as a dependency, and since there are no changes to the data, the state does not change, and the component completes its suspense state thus allowing the app to complete its rendering process.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Conclusion</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The React Suspense component is a great feature to provide feedback to users while a component is rendering. When we add dynamic data to the application, we should understand that this impacts the state management process, which is used to signal when a component is in a suspense state. Depending on how nested your components are, a child component can continue to render, causing the Suspense fallback UI to continue to load non-stop. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"></p><div class="pro-tip"><p>π To diagnose a component, you can add a suspense component around it, and we should notice only that component content to continue to use the suspense fallback animation.</p></div><p></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Thanks for reading.</span></p><br /><br /></span></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0tag:blogger.com,1999:blog-2250750882473644809.post-29518630821332416642023-01-14T14:00:00.082-05:002023-02-24T12:03:13.182-05:00Use Remote Dev Container with GitHub Codespaces<div><span id="docs-internal-guid-566b8831-7fff-7fa2-8b46-7b0101fc685c"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><br /></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">As Software Engineers, we usually work with multiple projects in parallel. This forces us to configure our work stations with multiple software development tools, which eventually leaves our workstation performing poorly. To overcome this problem, we often use virtual machine (VM) instances that run in our workstations or a cloud provider like Azure. Setting up those VMs also introduces some overhead into our software development process. As engineers, we want to be able to accelerate this process by using a remote development environment provider like GitHub Codespaces.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 503px; overflow: hidden; width: 353px;"><img alt="ozkary-github-codespaces-layers" height="400" src="https://lh4.googleusercontent.com/9TxffnPtSbWGLsG5kZ5BLTUxLffULu19FHsyTrwrXdt4ahhj1PrAI-dzFNL4maf8kLL6nOcMjQ9ysOdA0bw2MshnraaDvhcv5xopbm3dKEcgDDtCvSAzwNS8KAOlALmdP8PjAbEU6LKsJ4ILMjaSYOI=w281-h400" style="margin-left: 0px; margin-top: 0px;" title="GitHub Codespaces layers" width="281" /></span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">What is GitHub Codespaces?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">GitHub Codespaces is a cloud hosted development environment that is associated to a GitHub repository. Each environment or Dev Container is cloud hosted in a Docker container with the core dependencies that are required for that project. The container is hosted on a VM running Ubuntu Linux. The hardware is also configurable. It starts with a minimum of 2 cores, 8 GB of RAMs and 32 GB of storage, which should be a good foundation to run small projects. 
In addition, the hardware resources can be increased up to 32 cores, 64 GB RAM and 128 GB of storage which matches a very good workstation configuration.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π There are monthly quotas for using the remote environments of 120 hrs for personal accounts, and 180 hrs for the PRO account.</span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">How to use Codespaces?</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Codespaces leverages the Secure Shell (SSH) protocol, which provides a secure channel between client and server. This protocol is used to provide remote access to resources like VMs that are hosted on cloud platforms. This protocol makes it possible for browsers and IDE tools like VS Code to connect remotely and manage the projects.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When using the browser, the VS Code Browser IDE is loaded. We can also use a local installation of VS Code or any IDE that support SSH. The development process works the same way as if running locally with the exception that the files are hosted remotely, and we can also use a terminal window to execute build commands within the VM space.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">How to start a project with GitHub Codespaces</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">We can start a Codespaces environment right from GitHub. Open a repo on GitHub and click on the Code tab and then click the Code button from the toolbar. This opens up the options to create a new Codespaces, connect to an existent one, or even configure your Codespaces resources, more on this later. 
</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π You can use this repo if you do not have one </span><a href="https://github.com/ozkary/Data-Engineering-Bootcamp" style="text-decoration-line: none;"><span style="color: #1155cc; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">https://github.com/ozkary/Data-Engineering-Bootcamp</span></a></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 535px; overflow: hidden; width: 624px;"><img alt="ozkary-create-github-code-space" height="549" src="https://lh6.googleusercontent.com/S7r8X3fX9XKmQrIQXBLZCrH3tZc1tCLEA2ICPqzL3xUkAtixdz-al1rz6ZGo_Bh3SAahmYztlOvySVpqHgvrjRpKN6mVQlVTMbRvNfUBYCWr2D9WebYUmceQ44-NkHRFqSVgW0_OkD9IDVJNPephbNY=w640-h549" style="margin-left: 0px; margin-top: 0px;" title="ozkary create github code space" width="640" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When you add a new environment, GitHub essentially provisions a VM on Azure. It loads a Docker image with some of the dependencies of your project. For example, if your code is .Net Core or a TypeScript with React project, a Docker image with those dependencies is built and provisioned into the VM.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><b>π The Docker images are preconfigured. We can also build a custom image to meet specific requirements.</b></span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Once the environment is provisioned, we can open the project using any of the options listed on the image below. I prefer to use my local VS Code instance, as I often have all the tools needed to work on my projects. Once the project is open on VS Code, the project connection is cached, and we only need to open VS Code again to load the remote project. 
The browser feature is also very useful, so do take it for a spin and see how you like it.</span></p><div><span><br /></span></div><span id="docs-internal-guid-b522682d-7fff-3b57-f984-b0c834ef8d25"><div style="text-align: center;"><img alt="ozkary how to open github codespaces" height="510" src="https://lh3.googleusercontent.com/T0uB4aFukF5smoA7KsCJRLHRa7cg3x-6Ann8SFvWq7WEw4Fp9uz2yYHKANqNT3GsMbAcq2nkbpm-CF4PmcLGoNYIodi5MtO6TL-17D-bg4bJS1uJivzAXZRklg8PycIZZUCf1oi6WkjWpHCYlviDedM=w640-h510" style="font-family: Arial; font-size: 11pt; margin-left: 0px; margin-top: 0px; white-space: pre-wrap;" title="ozkary how to open github codespaces" width="640" /></div></span><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><br /></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Use a Terminal to Manage the Project</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">When the project is open remotely, we can run common activities like adding additional dependencies, building and debugging the project. Since the environment is running on Ubuntu, we can open a terminal window on VS Code. This enables us to run the CLI commands that we need in order to manage our project. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In the case of Web projects, we can run the project remotely using our browser. Even though the project runs remotely on the VM, port forwarding is used for secured remote access, so we can open our local browser and load the app. 
We can see the forwarded ports for our application on the ports tab of VS Code.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 109px; overflow: hidden; width: 624px;"><img alt="ozkary vscode port forwarding" height="113" src="https://lh3.googleusercontent.com/bVLQzf8bxQoo_xTBgLnjGUc5kh2hDWbGBXqYGu2brBw7uW2me5cLdo7ApqTOQEgAqa8I8iIRyZvUx6oJkYKkMWeH0DR_cpE8gHue0KVYYNyPKspBwk0lbBnhyt4MDWXybDCeRSkNCQ1_Lo26RQ2-bR4=w640-h113" style="margin-left: 0px; margin-top: 0px;" title="ozkary vscode port forwarding" width="640" /></span></span></p><br /><br /><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Managing your Codespaces Instance</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In some cases, we may see some performance issue on our remote environment. If this is the case, we need to inspect the current instance configuration and if possible upgrade the resources. Since this is an Ubuntu instance, we can use the terminal to run commands like lscpu to check the current configuration like cpuβs and memory. We can also use the Codespaces command toolbar and change the machine type, which provides a quick shortcut to change the machine type or configure the container. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">The Dev Container can also be customized by making changes to the devcontainer.json file. 
Additional customization can be done by building a custom Docker image to meet specific development environment requirements.</span></p><br /><div class="pro-tip"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">π When the Dev Container is changed, the VM requires a re-start, which is done automatically</span></p></div><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: center;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 359px; overflow: hidden; width: 503px;"><img alt="ozkary github codespaces vscode commands" height="286" src="https://lh5.googleusercontent.com/TLvL6p8svKt70N2hgGsnc2fWWaRpUWk9QA1G2KT07BvwZsklH0SVD4_K20FCuD5UYFWE7EhZNSdeKCeLhrjlGXKdccFrt1KGu-8r13S0YIHggc2__b4v8sJhCkzaisFb-hmRL2JgUHvnw27mOZr4H6Q=w400-h286" style="margin-left: 0px; margin-top: 0px;" title="ozkary github codespaces vscode commands" width="400" /></span></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; font-weight: 700; vertical-align: baseline; white-space: pre-wrap;">Summary</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">By leveraging the use of remote managed development environments, software engineers can save time by not having to work on a development environment configuration, we can instead use GitHub Codespaces to quickly provision Dev Containers that get us up and running in a short time, thus allowing us to focus on our development tasks instead of environment managing tasks.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Thanks for reading.</span></p><br /></span></div><div><br /></div>Send question or comment at Twitter @ozkary
<h4>Originally published by <a href="https://www.ozkary.com" title="oscar garcia, ozkary">ozkary.com</a>
</h4><div class="blogger-post-footer">Originally Published by https://ozkary.com</div>Oscar Garcia @ozkaryhttp://www.blogger.com/profile/13427831719934915504noreply@blogger.com0