3/31/26

Architecting an Agentic Data Pipeline - From Data Lake Discovery to Managed Orchestration

Overview

This session explores the strategy of leveraging AI to move beyond manual implementation and into the next level of data engineering. We dive into a process that positions the AI not as a syntax generator, but as a cognitive partner in the engineering lifecycle. We will examine the architectural shift required to transform raw data lake assets into high-performance, orchestrated systems, focusing on the strategic collaboration between human intent and agentic design.


🚀 Featured Open Source Projects

Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models

💡 Contribute: Found a bug or have a suggestion? Open an issue and be part of the open-source project!

🔗 Related Repository: AI Agents for Data Engineering

Explore the full implementation of the AI Agents used in this workflow:
https://github.com/ozkary/data-engineering-mta-turnstile/tree/main/ai-agents

YouTube Video

👍 Subscribe to the channel to get notified of new events!

📅 Agenda

  • Data Lake Discovery: The strategy of deploying discovery agents to autonomously identify patterns and define the foundation of the data grain.
  • Governance & Requirements: Establishing the strategic guardrails and requirements that empower an "Architect" agent to maintain system consistency.
  • Logical Design for the Staging Area: A deep dive into using AI to propose and build a logical abstraction layer, separating raw sources from core business logic.
  • Designing and Implementing the Physical Model: How agents navigate the transition to physical storage, building Dimension and Fact tables while maintaining referential integrity.
  • Incremental Update Strategy: Developing a sustainable approach to support continuous data feeds from the data lake using idempotent, self-healing processes.
  • Pipeline Design and Orchestration: The coordination of complex tasks to manage the relationship between dimensions and facts, ensuring strict lineage and integrated observability.

⭐ Why Attend?

  • Elevate Your Role: Learn how to shift your focus from writing repetitive code to defining high-level architectural intent and performing strategic design reviews.
  • Master Systemic Reasoning: Understand how to leverage AI to solve complex engineering challenges like referential integrity and dependency management at scale.
  • Build for Operations: Move toward a model where system health and observability are built-in byproducts of the design process, not afterthoughts.

👥 Who Is This For?

  • Data Engineers & Architects: Looking to evolve their workflow from manual scripting to high-level systemic design.
  • Engineering Leaders: Interested in the ROI and reliability of integrating autonomous agents into the development lifecycle.
  • AI Enthusiasts: Wanting to see a practical, "beyond-the-chatbot" application of agentic reasoning in a production environment.
  • Technical Decision Makers: Seeking a strategy for maintaining governance and referential integrity in an AI-augmented organization.

Presentation

Automating the Data Engineering Lifecycle

We are running a modern Data Engineering process by combining the reasoning power of AI Agents with the standardized connectivity of MCP (Model Context Protocol) tools.

  • Goal: Move from manual scripting to an intelligent, agent-led pipeline.
  • Outcome: A system that can discover, map, and orchestrate data across the cloud.

How do we leverage these tools?

The "Brains" and the "Hands" of the process.

  • AI Agents: Use Large Language Models (LLMs) to understand complex system instructions and specific user prompts. They provide the "logic" behind the process.
  • MCP Tools: Provide the "connectivity." They expose metadata to the agent, which allows the AI to understand exactly what actions are available and how to execute them correctly.

How does this all work?

The Execution Loop

  • The Model: The agent calls a managed LLM service in the cloud (Gemini) for high-level reasoning.
  • Discovery: The agent "sees" the available MCP tools and automatically understands how to use them to interact with GCS or BigQuery.
  • Governance: System Prompts provide the guardrails, core requirements, and engineering standards the agent must follow.
  • Action: The User Prompt provides the specific task (e.g., "Find today's files"). The agent then executes the work.
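The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the session's implementation: the `call_llm` helper and the `list_lake_files` tool are hypothetical stand-ins for the managed LLM call and an MCP-exposed tool.

```python
# Sketch of the execution loop: governance (system prompt), tool
# discovery, reasoning, and action. Tool and helper names are invented.

SYSTEM_PROMPT = "You are a data engineering agent. Follow naming standards."

# MCP tools expose metadata (name, description) so the agent can
# discover what actions are available and how to call them.
TOOLS = {
    "list_lake_files": {
        "description": "List files in the data lake for a given prefix.",
        "fn": lambda prefix: [f"{prefix}/telemetry-001.csv.gz"],
    },
}

def call_llm(system, user, tool_catalog):
    # Stand-in for the reasoning step: a real agent would send both
    # prompts plus the tool catalog to the model (e.g., Gemini) and
    # receive a structured tool call back.
    return {"tool": "list_lake_files", "args": {"prefix": "gs://factory-dl/us"}}

def run_agent(user_prompt):
    catalog = {name: meta["description"] for name, meta in TOOLS.items()}
    decision = call_llm(SYSTEM_PROMPT, user_prompt, catalog)  # Governance + Reasoning
    tool = TOOLS[decision["tool"]]                            # Discovery
    return tool["fn"](**decision["args"])                     # Action

print(run_agent("Find today's files"))
```

In a real deployment, the tool catalog comes from the MCP server's metadata rather than a hard-coded dictionary.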

Intelligent Orchestration

We build an AI-powered Data Engineering process that successfully handles:

  • Data Lake Discovery: Automatically identifying patterns and namespaces in GCS.
  • Data Warehouse Orchestration: Mapping those discoveries directly into BigQuery and creating the data models for analysis.

AI-Driven Data Engineering

  • Agents can connect to a data lake and run discovery on the files.
  • Agents can use the discovery results to build external tables, views, native tables, and even stored procedures for the incremental update process.


🌟 Let's Connect & Build Together

Thanks for reading! 😊 If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

  • GDG Broward: Join our local dev community for meetups and workshops.
  • Global AI Events: Join Global AI Events.
  • LinkedIn: Let's connect professionally! I share insights on engineering.
  • GitHub: Follow my open-source journey and star the repos you find useful.
  • YouTube: Watch step-by-step tutorials on the projects listed above.
  • BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

👉 Originally published at ozkary.com

2/25/26

AI Driven App Architecture - Smart Development Life Cycle Governance

Overview

As development teams scale, maintaining architectural consistency becomes the biggest bottleneck. Documents are ignored, and linters only catch syntax errors, not design patterns.

In this session, we will demonstrate how to transform AI from a passive coding assistant into an active Architectural Enforcer. By embedding your "unwritten rules" directly into the repository configuration, you create a developer experience where the AI enforces your patterns in real-time.

We will explore how this shifts the workflow: new developers are guided by the AI from day one, preventing architectural leakage before a pull request is ever opened.


Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models


💡 Contribute: Found a bug or have a suggestion? Open an issue and be part of the open-source project!

YouTube Video

👍 Subscribe to the channel to get notified of new events!

Video Agenda

The Problem: Architectural Drift

Why strict rules (Controller-View, Pascal/camelCase) degrade over time and how AI can fix it.

The Intelligence Engine

Breakdown of the core components: Global Rules, Contextual Guardrails, Agent Tools, and Directory Structure.

Configuration: Global Governance

Setting up global "system prompts" for the repository to enforce tech stack and naming conventions.

Configuration: Contextual Guardrails

Creating "firewalls" for specific folders (e.g., preventing logic in views, preventing API calls in Controllers).

Configuration: The Tooling

Building custom Slash Commands (/new-module) to automate "Vertical Slice" scaffolding.

Configuration: The Auditor Agent

Implementing a specialized "Gatekeeper" persona that scans imports to ensure strict layer separation.

Agent Mapping

A conceptual framework comparing repository configuration to autonomous agent architecture.

💡 Why Attend?

  • Stop writing boilerplate: Learn to automate complex folder structures with one command.
  • Reduce PR Reviews: Shift governance "left" by having the AI catch architectural errors instantly.
  • Interactive Demo: See the .github configuration in action on a real codebase.
  • Takeaway Code: Leave with the copy-paste markdown templates to implement this in your own repo tomorrow.

Target Audience

  • Tech Leads & Architects who need to enforce standards across scaling teams.
  • Developers who are tired of correcting the same patterns in code reviews.
  • DevOps Engineers interested in "Governance as Code."
  • Leadership teams that are trying to raise standards and productivity in their organizations.

Presentation

SETTING THE STAGE

The Context

  • We enforce a strict pattern using the ViCSA (View-Controller-Service-API) architecture.
  • PascalCase for UI Components.
  • camelCase for Logic & Services.
  • Separation of Concerns (SoC) is non-negotiable.

The Problem

  • Architectural Drift: Patterns degrade over time.
  • Passive Docs: Wiki pages are ignored.
  • Linter Limits: Linters catch syntax, not architecture.
  • Solution: Active Governance via AI.

THE INTELLIGENCE ENGINE

Core AI Policies

  • Centralized Config: Rules live in the repo, not the user's IDE.
  • Global Rules: Applied to every interaction (System Prompt).
  • Contextual Rules: Triggered only when specific files are opened.
  • Agent Tools: Custom commands to scaffold new components, controllers or services.

AI Driven App Architecture - Smart Development Life Cycle Governance - Project Structure

CONFIGURATION: GLOBAL GOVERNANCE

Global Instructions

File: .github/copilot-instructions.md

This acts as the System Prompt for the entire repository. It is silently added to every interaction.

  • Tech Stack: TS, Tailwind, Hooks.
  • Naming: Pascal vs camelCase.
  • Flow: View → Controller → Service → API.

AI Driven App Architecture - Smart Development Life Cycle Governance - Global Governance

DEV EXPERIENCE: THE SILENT ENFORCER

Without Config

A developer asks:

How do I create a new service?

  • AI suggests a generic Class-based service.
  • Suggests creating a utils.js file.
  • Ignores project folder structure.

With Config

A developer asks: How do I create a new service?

  • AI reads the Governance.
  • Response: Create src/services/userAuth/index.ts using a functional export, as per project standards.

CONFIGURATION: CONTEXTUAL GUARDRAILS

View Layer Rules

File: .github/instructions/view-layer.md

Trigger: Opening any **/*.tsx file.

  • "You are a View."
  • "No Logic allowed."
  • "No direct API calls."

Controller Layer Rules

File: .github/instructions/controller-layer.md

Trigger: Opening any **/controller.ts file.

  • "You are a Controller."
  • "Use Services, NOT Fetch."
  • "Manage State here."

DEV EXPERIENCE: REAL-TIME INTERVENTION

The Scenario

  • A developer tries to write fetch() inside a UI Component (index.tsx).
  • They ask Copilot: "Write a fetch call here for me."

The Intervention

Ghost Text: Copilot refuses to autocomplete the network call.

Chat Reply:

I cannot. This is a View file. Please move this logic to the sibling Controller (controller.ts) and import it.

CONFIGURATION: THE TOOLING

Prompt Library

File: .github/prompts/new-module.md

These act as Agent Tools or "Slash Commands".

  • Goal: Automate the "Vertical Slice".
  • Benefit: Complex scaffolding logic is stored in the repo, not in the developer's head.
  • Usage: /new-module
# Prompt Library (The Scaffolder)
File: `.github/prompts/new-component.md`
Goal: Automate the creation of a standalone UI Component with optional Service/API layers.

# Create New Component
I need to generate a new component following our **Folder-as-Namespace** pattern.
**Command:** `/new-component:{{componentName}} {{args}}`

Please generate the code blocks for the layers requested in the arguments (service, api). 
*Note: Logic folders must be camelCase. UI folders must be PascalCase.*

---

### Component Layer (Required)
**Folder:** `src/components/{{componentName (PascalCase)}}/`
- **File:** `controller.ts` (Controller): Logic and State only.
- **File:** `index.tsx` (View): Pure UI. Imports Controller.
---


### Service Layer (Optional)
*Condition: Generate only if 'service' is present in {{args}}.*

**File:** `src/services/{{componentName (camelCase)}}/index.ts`
- **Role:** Business logic and data transformation.
- **Code:** Import the API (if requested). Export a service object or functional exports.

---

### API Layer (Optional)
*Condition: Generate only if 'api' is present in {{args}}.*

**File:** `src/apis/{{componentName (camelCase)}}/index.ts`
- **Role:** Define specific endpoints.
- **Code:** Import `coreClient` from `src/apis/index.ts`. Export async functions with typed responses.

---

### Style Guidelines
- **Typing:** Use TypeScript interfaces for all Props and Data models.
- **Separation:** Logic stays in `controller.ts`, JSX stays in `index.tsx`.
- **Naming:** Components use PascalCase; Services/APIs use camelCase.

DEV EXPERIENCE: THE SCAFFOLDING

The Command

Starting a new feature called "Sales Dashboard".

Action:

/new-module featureName:Sales Dashboard

The Execution

  • Analyzes the request.
  • Applies PascalCase to Containers/Components folders.
  • Applies camelCase to api/service folders.
  • Generates the Controller-View pair instantly.
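The naming logic behind that scaffolding step can be sketched as a small script. This is an illustrative sketch only; the `scaffold` helper and the generated paths are assumptions based on the folder conventions described in this post, not the actual /new-module implementation.

```python
import re

def pascal_case(name: str) -> str:
    # "Sales Dashboard" -> "SalesDashboard" (UI folders)
    return "".join(w.capitalize() for w in re.split(r"[\s_-]+", name) if w)

def camel_case(name: str) -> str:
    # "Sales Dashboard" -> "salesDashboard" (logic folders)
    p = pascal_case(name)
    return p[0].lower() + p[1:] if p else p

def scaffold(feature: str) -> list[str]:
    # Emit the Controller-View pair plus optional service/api layers,
    # applying the PascalCase/camelCase rules from the governance files.
    ui, logic = pascal_case(feature), camel_case(feature)
    return [
        f"src/components/{ui}/index.tsx",      # View: pure UI
        f"src/components/{ui}/controller.ts",  # Controller: logic and state
        f"src/services/{logic}/index.ts",      # Service: business logic
        f"src/apis/{logic}/index.ts",          # API: endpoint definitions
    ]

for path in scaffold("Sales Dashboard"):
    print(path)
```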

THE RESULT: GENERATED ARCHITECTURE

The Results

  • Layers generated instantly.
  • Correct naming conventions applied.
  • Zero manual boilerplate.

AI Driven App Architecture - Smart Development Life Cycle Governance - Project Structure

CONFIGURATION: THE AUDITOR AGENT

Specialized Persona

File: .github/agents/arch-auditor.md

This creates a named Agent that acts as a Gatekeeper. It doesn't write features; it verifies them.

  • Role: Architecture Enforcer.
  • Task: Scans imports to ensure strict layer separation.
  • Rule: "Views never talk to APIs."
# Custom AI Agent (The Reviewer)
Agent ID: `@vicsa-auditor`

Context: A bot that ensures the chain of command is respected using the ViCSA architecture (View Controller Service API)

## Primary Objective
name: Architecture Auditor
description: Verifies strict separation of Controller, Service, and View layers.
tools: [code-search]

---
## Role
You ensure the integrity of the data flow: View -> Controller -> Service -> API.

## Audit Logic
When asked to "Audit this feature":

1. **Check the View (.tsx):**
   - FAIL if it imports `src/services`.
   - FAIL if it imports `src/apis`.
   - PASS only if it imports `./controller`.

2. **Check the Controller (.ts):**
   - FAIL if it uses `fetch` or `axios`.
   - PASS only if it delegates to `src/services`.

3. **Check the Service:**
   - FAIL if it defines its own URL logic.
   - PASS only if it imports `src/apis/index.ts`.
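The audit logic above amounts to a few path and import checks. As a rough sketch, assuming a pre-extracted import list per file (the `audit_imports` function and its return strings are illustrative, not the agent's actual behavior):

```python
def audit_imports(file_path: str, imports: list[str]) -> str:
    # Enforce the View -> Controller -> Service -> API chain of command
    # by inspecting what each layer imports.
    if file_path.endswith(".tsx"):  # View layer
        if any(i.startswith(("src/services", "src/apis")) for i in imports):
            return "FAIL: View must only import its sibling controller."
        return "PASS"
    if file_path.endswith("controller.ts"):  # Controller layer
        if any(i in ("fetch", "axios") for i in imports):
            return "FAIL: Controller must delegate to src/services."
        return "PASS"
    return "PASS"

print(audit_imports("src/components/SalesDashboard/index.tsx", ["./controller"]))
```

The real auditor persona performs these checks with the code-search tool instead of a pre-built import list.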

DEV EXPERIENCE: THE CODE REVIEW

The Interaction

Before raising a pull request, the developer invokes the auditor.

Prompt:

@vicsa-auditor check this component for violations.

Response:

✅ PASS: SalesDashboard/index.tsx imports only from its sibling controller. No direct API calls found.

AI Driven App Architecture - Smart Development Life Cycle Governance - Review Process

THE AUTONOMY ADVANTAGE

AI enforces the ViCSA architecture through continuous observation and autonomous execution.

  • Perception: Continuously observes the active workspace, file paths (e.g., src/components/), and context to understand the developer's structural intent.
  • Reasoning: Evaluates the perceived context against the repository's .github Guardrails, determining if a View is bypassing a Controller or violating Separation of Concerns (SoC).
  • Action: Executes autonomous scaffolding, enforces strict ViCSA governance, and provides feedback with recommended fixes.

SUMMARY & AGENT MAPPING

Embedding governance directly into the repository transforms the development lifecycle. It replaces passive wiki pages with active, real-time enforcement, ensuring that every AI suggestion aligns with architectural standards. This eliminates "drift", accelerates onboarding, and turns Copilot into a domain-expert partner.

| Agent Component | GitHub Implementation |
| --- | --- |
| System Prompt | Global Instructions (copilot-instructions.md) |
| Context / RAG | Modular Instructions (instructions/*.md) |
| Tools / Functions | Prompt Library (prompts/*.md) |
| Human Prompt | Chat Window |
| Persona | Agent Personas (e.g., agents/arch-auditor.md) |

RAG: Retrieval-Augmented Generation

🌟 Let's Connect & Build Together

Thanks for reading! 😊 If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

  • GDG Broward: Join our local dev community for meetups and workshops.
  • Global AI Events: Join Global AI Events.
  • LinkedIn: Let's connect professionally! I share insights on engineering.
  • GitHub: Follow my open-source journey and star the repos you find useful.
  • YouTube: Watch step-by-step tutorials on the projects listed above.
  • BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

1/21/26

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment

Overview

In the modern data landscape, the wall between "where data lives" and "how we get insights" is crumbling. This session focuses on the Cognitive Data Lakehouse: a paradigm shift that allows developers to treat a fragmented data lake as a unified, high-performance warehouse.

We will explore how to move beyond brittle ETL pipelines using Zero-ETL architecture in the cloud. The core of our discussion will center on using integrated AI capabilities and semantic modeling to solve the "Metadata Mess" inherent in global manufacturing feeds without moving a single byte of data. From raw telemetry in object storage to semantic intelligence via large language models, we’ll show you the real-world application of AI in modern data engineering.


Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models


💡 Contribute: Found a bug or have a suggestion? Open an issue and be part of the open-source project!

YouTube Video

Video Agenda

Phase 1: Foundations & The Zero-ETL Strategy

We kick off with the infrastructure layer. We'll discuss the design of cross-region telemetry tables and how modern cloud engines allow us to query raw files in object storage with the performance of a native table. We'll establish why zero data movement is the goal for modern scalability.

Phase 2: Confronting the Metadata Mess

Schema drift and inconsistent naming across global regions are the enemies of unified analytics. We will look at why traditional manual mapping fails and how we can use AI inference to bridge these gaps and standardize naming conventions automatically.

Phase 3: AI-Driven Unification & Semantic Modeling

The "Cognitive" part of the Lakehouse. We’ll dive into the technical implementation of registering AI models directly within your data warehouse environment. You'll see how to create an abstraction layer that uses AI to normalize data on the fly, creating a robust semantic model.

Phase 4: Scaling to a Global Feed

Finally, we’ll demonstrate the DevOps workflow for integrating a new international factory feed into a global telemetry view. We'll show how to maintain a "Single Source of Intelligence" that BI tools and analysts can consume without needing to know the complexities of the underlying lake.

💡 Why Attend?

  • Master Modern Architecture: Learn the "Abstraction Layer" design pattern that is replacing traditional, slow ETL/ELT processes.
  • Hands-on AI for Data Ops: See exactly how to use AI and semantic modeling within SQL-based workflows to automate data cleaning and schema mapping.
  • Scale Without Pain: Discover how to manage global data sources (multi-region, multi-format) through a single governing layer.
  • Developer Networking: Connect with other data architects, engineering leaders, and professionals solving similar scale and complexity challenges.

Target Audience: Data Engineers, Analytics Architects, Cloud Developers, and anyone interested in the intersection of Big Data and Generative AI.

Presentation

Phase 1: The Zero-ETL Strategy

INFRASTRUCTURE: DATA STAYS LOCAL

Architecting for Scale

  • Storage Decoupling: Raw files remain in the Data Lake, eliminating replication overhead.
  • Virtual Access: Data Warehouse external tables allow immediate querying of CSV, Parquet, and JSON.
  • Minimal Latency: No waiting for ingest pipelines; analysis starts upon file arrival.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment -  Medallion Architecture Design Diagram

UNMATCHED STORAGE EFFICIENCY

Zero Data Replication

  • Traditional ETL requires moving data across multiple tiers. Our architecture ensures a single source of truth with zero data movement between GCS and BigQuery compute.
  • This is similar to the Bronze Zone in a Medallion Architecture.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment -  Medallion Architecture Design Diagram

Phase 2: The Metadata Mess

CHALLENGES OF UNIFICATION

Schema Friction

  • Feeds arrive with inconsistent headers (e.g., 'Device Number' vs 'deviceNo'). Manual aliasing is fragile and slow.

Entity Drift

  • Names and IDs vary across systems, preventing standard joins from matching records effectively.

Type Mismatches

  • Varying data types for the same concept (Integer vs String) crash standard SQL aggregation views.
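To make the header friction concrete, here is a small sketch of the kind of normalization the AI is asked to infer. The `normalize_header` function and its alias table are illustrative assumptions, not part of the session's code:

```python
import re

def normalize_header(name: str) -> str:
    # Map inconsistent feed headers ('Device Number', 'deviceNo',
    # 'device_number') onto one snake_case convention. In the session
    # this mapping is inferred by the model; here it is plain string logic.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase
    s = re.sub(r"[\s\-]+", "_", s.strip()).lower()    # spaces/dashes -> _
    aliases = {"device_no": "device_number"}          # known synonyms
    return aliases.get(s, s)

for h in ["Device Number", "deviceNo", "device_number"]:
    print(normalize_header(h))
```

All three variants collapse to `device_number`, which is what lets a unified view join the regional feeds.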

Phase 3: The AI Solution

BIGQUERY STUDIO: THE AI INTERFACE

Remote AI Registration

  • Register Gemini Pro directly inside BigQuery to enable cognitive functions within your SQL workspace.
CREATE MODEL `gemini_remote`
REMOTE WITH CONNECTION `bq_connection`
OPTIONS(endpoint = 'gemini-1.5-pro');

Automated Inference

  • AI "reads" information schemas to infer mapping logic, moving you from Code Author to Logic Approver.
SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Compare Source A and B schemas. Write a SQL view to unify them." AS prompt)
);

AI-ASSISTED SCHEMA DISCOVERY

Prompting for Base Tables

  • Using AI to generate the DDL for external tables by pointing to compressed feeds in the lake (USA & MEX factories).
SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.us_telemetry with path 'gs://factory-dl/us/dev-540/telemetry-*.csv.gz'. Include option CSV, GZIP compression and skip 1 row. Infer and add the schema using lower case" AS prompt));

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.mx_telemetry with path 'gs://factory-dl/mx/dev-940/telemetry-*.csv.gz'. Include option CSV, GZIP compression and skip 1 row. Use schema device_number STRING, bay_id INT64, factory STRING, created STRING" AS prompt));

Generated BigLake DDL

-- USA Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.us_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/us/dev-540/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

-- MEX Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.mx_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/mx/dev-940/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

AI-ABSTRACTION: THE VIEW LAYER

Generating the Interface

  • AI creates a clean abstraction view for each external table, decoupling raw storage from the analytics model.
    -- AI Instruction
    "Create a view named smart_factory.vw_us_telemetry selecting all columns from the us_telemetry table. Safe cast the created column as DATETIME."

Abstraction Layer DDL

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_us_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  SAFE_CAST(created as DATETIME) AS created
FROM `smart_factory.us_telemetry`;

COGNITIVE UNIFICATION

The Multi-Region Model

  • The unified view now consumes from the abstraction layer, ensuring that changes to raw storage don't break the views downstream.
-- AI Instruction
"Create a view with name smart_factory.vw_telemetry that creates a union of all the fields from the views vw_[region]_telemetry. The regions include us and mx. List out all the field names. Never use * for field names."

Unified Global View

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_us_telemetry`
UNION ALL
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_mx_telemetry`

SCALING TO CHINA FACTORY

Evolving the Model

  • Adding the new China feed by generating the External Table definition via AI.
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.cn_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/cn/dev-900/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

Human-in-the-Loop DevOps

  • Use AI to update the unified view with the new data feed. The DevOps team reviews and applies the changes, since changes to a production view require approval.

Manufacturing SPC & Root Cause Analysis

  • This query calculates a rolling mean and standard deviation over the current and 20 preceding readings per machine, within the past hour, to detect anomalies, or “Out of Control” conditions.
WITH TelemetryStats AS (
  SELECT
    machine_id,
    timestamp,
    sensor_reading,
    -- Calculate rolling stats for the "Control Chart"
    AVG(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_avg,
    STDDEV(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_stddev
  FROM `production_data.mx_telemetry_stream`
  WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
),
Anomalies AS (
  SELECT *,
    -- Define "Out of Control" (Reading > 3 Sigma from mean)
    ABS(sensor_reading - rolling_avg) > (3 * rolling_stddev) AS is_out_of_control
  FROM TelemetryStats
)
SELECT * FROM Anomalies WHERE is_out_of_control = TRUE;
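The same control-chart logic can be sanity-checked outside BigQuery. The sketch below mirrors the SQL's rolling window (current reading plus the 20 preceding ones) in plain Python; it uses the population standard deviation, while the SQL's STDDEV is the sample form, which behaves the same for this demo:

```python
from statistics import mean, pstdev

def out_of_control(readings, window=21, sigma=3.0):
    # Rolling mean/stddev over the current reading plus the 20 preceding
    # ones (window=21), flagging readings beyond 3 sigma, as in the SQL.
    flags = []
    for i, r in enumerate(readings):
        w = readings[max(0, i - window + 1): i + 1]
        mu, sd = mean(w), pstdev(w)
        flags.append(sd > 0 and abs(r - mu) > sigma * sd)
    return flags

# A stable feed with a single spike at index 15.
data = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9,
        10.1, 10.0, 9.8, 10.2, 10.0, 9.9, 50.0, 10.1, 10.0]
print([i for i, f in enumerate(out_of_control(data)) if f])
```

Because the window includes the current reading, a large spike also inflates the rolling stddev for the next 20 readings; that property holds for the SQL version as well.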

Control Chart Visualization

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment - Control Charts

ADVANTAGE COMPARISON MATRIX

| Metric | Manual Data Engineering | AI-Augmented Zero-ETL |
| --- | --- | --- |
| Unification Speed | Days/Weeks per Source | Minutes via Generative AI |
| Schema Drift | Manual Script Rewrites | Adaptive AI View Discovery |
| Infrastructure Cost | High (Data Redundancy) | Minimal (In-place on GCS) |

Strategic Intelligence ROI:

ROI(ai) = Insights Velocity / (Movement Cost + Labor Hours)

FINAL THOUGHTS: STRATEGIC SUMMARY

Legacy Challenges

  • Brittle ETL: Manual pipelines break with every schema change.
  • Cost Inefficiency: Redundant storage for processed data.
  • Semantic Silos: Hard-coded aliases for disparate naming conventions.
  • Slow Time-to-Insight: Weeks spent on manual schema alignment.

AI-Assisted Solutions

  • Zero-ETL Arch: Cost-effective storage with Data Lake virtual access.
  • Automated Inference: Vertex AI handles the "heavy lifting" of mapping.
  • Adaptive DevOps: Scalable model evolution (USA → MEX → China).
  • Unified Intelligence: One virtual source of truth for global analytics.

Moving from Data Reporting to Active Semantic Intelligence.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals.' It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia


📅 Upcoming Sessions

Our upcoming series expands beyond data engineering to bridge the gap between AI, Machine Learning, and modern cloud architecture. Using our Data, AI, and ML GitHub blueprints, we provide the code-first patterns needed to build everything from Zero-ETL pipelines to scalable LLM-powered systems. Join us to explore how these integrated disciplines work together to turn raw data into production-ready intelligence.


🌟 Let's Connect & Build Together

If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

  • GDG Broward: Join our local dev community for meetups and workshops.
  • LinkedIn: Let's connect professionally! I share insights on engineering.
  • GitHub: Follow my open-source journey and star the repos you find useful.
  • YouTube: Watch step-by-step tutorials on the projects listed above.
  • BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

👉 Originally published at ozkary.com

12/10/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Dec 2025

Overview

Build upon your existing data engineering expertise and discover how Medallion Architecture can transform your data strategy. This session provides a hands-on approach to implementing Medallion principles, empowering you to create a robust, scalable, and governed data platform.

We'll explore how to align data engineering processes with Medallion Architecture, identifying opportunities for optimization and improvement. By understanding the core principles and practical implementation steps, you'll learn how to optimize data pipelines, enhance data quality, and unlock valuable insights through a structured, layered approach to drive business success.


  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video - Dec 2025

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level. Key benefits include:

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.
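
To make the layered approach concrete, here is a minimal sketch in Python. The zone names map to the zones covered below, but the storage prefixes and the date-partitioned path convention are illustrative assumptions, not a standard:

```python
from datetime import date

# Hypothetical storage prefixes for each Medallion zone.
ZONES = {
    "raw": "lake/raw",        # unprocessed landing area
    "bronze": "lake/bronze",  # cleansed and standardized
    "silver": "lake/silver",  # aggregated and query-optimized
    "gold": "lake/gold",      # curated, consumption-ready
}

def zone_path(zone: str, dataset: str, day: date) -> str:
    """Build a date-partitioned path for a dataset within a zone."""
    prefix = ZONES[zone]
    return f"{prefix}/{dataset}/year={day.year}/month={day.month:02d}/day={day.day:02d}"

print(zone_path("raw", "turnstile", date(2025, 11, 19)))
# lake/raw/turnstile/year=2025/month=11/day=19
```

A convention like this keeps every dataset addressable by zone and date, which later makes reprocessing and lineage tracking straightforward.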

Figure: Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability
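
As-is ingestion can be sketched as a plain file copy into a date-partitioned landing path. The file name and layout below are illustrative; the key point is that the bytes are never parsed or altered:

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

def land_raw_file(source: Path, raw_root: Path, ingest_day: date) -> Path:
    """Copy a source file into the Raw Zone unchanged, partitioned by ingestion date."""
    target_dir = (raw_root
                  / f"year={ingest_day.year}"
                  / f"month={ingest_day.month:02d}"
                  / f"day={ingest_day.day:02d}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # byte-for-byte copy, no parsing or transformation
    return target

# Simulate landing a CSV exactly as received from the source system.
work = Path(tempfile.mkdtemp())
src = work / "turnstile_251119.csv"
src.write_text("station,entries,exits\nA001,100,80\n")
landed = land_raw_file(src, work / "raw" / "turnstile", date(2025, 11, 19))
```

Because the original file is preserved untouched, any downstream bug can be fixed by reprocessing from this zone.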

Figure: Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery
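
A minimal sketch of this cleansing step using the csv module. The column names follow the MTA turnstile example; the specific casting and null-handling rules are illustrative:

```python
import csv
import io

def to_bronze(raw_csv: str) -> list[dict]:
    """Cast raw string fields to typed values, standardize casing, handle nulls."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "station": rec["station"].strip().upper(),  # standardize identifiers
            "entries": int(rec["entries"] or 0),        # empty string -> 0
            "exits": int(rec["exits"] or 0),
        })
    return rows

raw = "station,entries,exits\n a001 ,100,80\nR170,,25\n"
print(to_bronze(raw))
# [{'station': 'A001', 'entries': 100, 'exits': 80},
#  {'station': 'R170', 'entries': 0, 'exits': 25}]
```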

Figure: Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized or denormalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs
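
The aggregation step can be sketched as follows, summing entries per station so downstream queries read pre-summarized data. The grouping key and input shape are illustrative:

```python
from collections import defaultdict

def to_silver(bronze_rows: list[dict]) -> dict[str, int]:
    """Summarize total entries per station for fast analytical queries."""
    totals: dict[str, int] = defaultdict(int)
    for row in bronze_rows:
        totals[row["station"]] += row["entries"]
    return dict(totals)

bronze = [
    {"station": "A001", "entries": 100, "exits": 80},
    {"station": "A001", "entries": 150, "exits": 90},
    {"station": "R170", "entries": 60, "exits": 25},
]
print(to_silver(bronze))  # {'A001': 250, 'R170': 60}
```

Pre-aggregating here is what reduces query costs: consumers scan a handful of summary rows instead of the full event history.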

Figure: Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.

  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency
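
As an illustrative sketch of Gold-level curation, the Silver summary below is materialized into a small, validated, consumption-ready ranking. The non-negative guard stands in for the rigorous quality checks this layer requires:

```python
def to_gold(silver_totals: dict[str, int], top_n: int = 3) -> list[dict]:
    """Materialize a curated top-N busiest-stations dataset with a quality check."""
    for station, total in silver_totals.items():
        # Quality gate: reject impossible values before publishing.
        assert total >= 0, f"negative count for {station}"
    ranked = sorted(silver_totals.items(), key=lambda kv: kv[1], reverse=True)
    return [{"rank": i + 1, "station": s, "entries": t}
            for i, (s, t) in enumerate(ranked[:top_n])]

print(to_gold({"A001": 250, "R170": 60, "B020": 400}, top_n=2))
# [{'rank': 1, 'station': 'B020', 'entries': 400},
#  {'rank': 2, 'station': 'A001', 'entries': 250}]
```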

Figure: Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools
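
To make self-service consumption concrete, this sketch loads a curated dataset into an in-memory SQLite table so it can be queried with plain SQL, much as a BI tool would query the Gold layer. The table and column names are hypothetical:

```python
import sqlite3

# A small curated Gold dataset: (station, total_entries).
gold = [("A001", 250), ("B020", 400), ("R170", 60)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station_traffic (station TEXT, entries INTEGER)")
conn.executemany("INSERT INTO station_traffic VALUES (?, ?)", gold)

# A self-service query an analyst or BI tool might run.
busiest = conn.execute(
    "SELECT station, entries FROM station_traffic ORDER BY entries DESC LIMIT 1"
).fetchone()
print(busiest)  # ('B020', 400)
```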

Figure: Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. It is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.
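
These components can be captured as structured metadata stored alongside each dataset. The record layout below is an illustrative sketch, not a formal standard:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetGovernance:
    """Illustrative governance record attached to a zone dataset."""
    dataset: str
    owner: str            # accountable for accuracy and usage
    steward: str          # day-to-day quality and compliance
    classification: str   # e.g., "public", "internal", "restricted"
    lineage: list = field(default_factory=list)  # upstream sources, in order

record = DatasetGovernance(
    dataset="gold.station_traffic",
    owner="data-platform-team",
    steward="analytics-steward",
    classification="internal",
    lineage=["raw.turnstile", "bronze.turnstile", "silver.station_daily"],
)
print(record.lineage[0])  # raw.turnstile
```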

Figure: Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. It involves several key steps: data extraction, cleaning, loading, data type casting, consistent naming conventions, and incremental loads that insert only the records added since the last update via batch processes.
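
The incremental piece can be sketched with a simple high-watermark: each batch inserts only rows newer than the last processed timestamp. The field names are illustrative:

```python
from datetime import datetime

def incremental_load(target: list[dict], source: list[dict], watermark: datetime) -> datetime:
    """Append only rows newer than the watermark; return the new watermark."""
    new_rows = [r for r in source if r["updated_at"] > watermark]
    target.extend(new_rows)
    return max((r["updated_at"] for r in new_rows), default=watermark)

warehouse: list[dict] = []
batch = [
    {"station": "A001", "updated_at": datetime(2025, 11, 18)},
    {"station": "R170", "updated_at": datetime(2025, 11, 19)},
]
wm = incremental_load(warehouse, batch, watermark=datetime(2025, 11, 18))
print(len(warehouse), wm)  # 1 2025-11-19 00:00:00
```

Persisting the returned watermark between runs makes the batch process idempotent: replaying the same source rows adds nothing.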

Figure: Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance: Metadata

Metadata assigns the owner, steward, and responsibilities for the data.

Figure: Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost efficiency

Figure: Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals.' It provides in-depth explanations, code samples, and practical exercises to support your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

11/19/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Nov 2025
