I am Oscar Garcia, Ozkary™. I author this site, speak at conferences and events, contribute to OSS, and mentor people. I use this blog to post ideas and experiences about software development, with the goal of both learning from and helping the technology communities around the world.
Dive into the future of web applications. We're moving beyond traditional API polling and embracing real-time integration. Imagine your client app maintaining a persistent connection with the server, enabling bidirectional communication and live data streaming. We'll also tackle scalability challenges and integrate Redis as our in-memory data solution.
Follow this GitHub repo during the presentation: (Give it a star)
This presentation explores strategies for building highly responsive and interactive live dashboards. We'll delve into the challenges of traditional API polling and demonstrate how to leverage Node.js, Angular, Socket.IO, and Redis to achieve real-time updates and a seamless user experience.
Introduction:
Understanding telemetry data and the importance of monitoring it
Challenges of traditional API polling for real-time data.
Design patterns to enhance an app with minimum changes
Traditional Solution Architecture
SQL Database Integration.
RESTful API
Angular and Node.js Integration
Real-Time Integration with Web Sockets
Database Optimization Challenges
Introduction to Web Sockets for bidirectional communication.
Implementing Web Sockets in a Web application.
Handling data synchronization and consistency.
Distributed Caching with Redis:
Benefits of in-memory caching for improving performance and scalability.
Integrating Redis into your Node.js application.
Caching strategies for distributed systems.
Case Study: Building a Live Telemetry Dashboard
Step-by-step demonstration of the implementation.
Performance comparison with and without optimization techniques.
User experience benefits of real-time updates.
Benefits and Considerations
Improved dashboard performance and responsiveness.
Reduced server load and costs.
Scalability considerations.
Best practices for implementing real-time updates.
Why Attend:
Gain a deep understanding of real-time data integration for your Web application.
Presentation
Telemetry Data Story
Devices send telemetry data via API integration with SQL Server. There are inherent performance problems with a disk-based database. We progressively enhance the system with minimal changes by adding real-time integration and an in-memory cache system.
Database Integration
Solution Architecture
Disk-based Storage
Web apps and APIs query database to get the data
Applications can do both high reads and writes
Web components, charts polling back-end database for reads
Let’s Start our Journey
Review our API integration and talk about concerns
Do not refactor everything
Enhance to real-time integration with sockets
Add Redis as the distributed cache
Add the service broker strategy to sync the data sources
Centralize the real-time integration with Redis
RESTful API Integration
Applied Technologies
REST API Written with Node.js
TypeORM Library Repository
Angular Client Application with Plotly.js Charts
Disk-based storage – SQL Server
API Telemetry (GET, POST) route
Use Case
IoT devices report telemetry information via API
Dashboard reads the most recent data via API calls, which query the storage service
Redis supports a set of atomic operations on its data types
Other features include transactions, publish/subscribe, and a limited time to live (TTL)
You can use Redis from most of today's programming languages via client libraries
Use Case
As application load and data frequency increase, we need to use a cache for performance. We also need to centralize the events, so all the socket servers behind a load balancer can notify the clients. Writes update both the storage and the cache, as sketched below.
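The pattern can be sketched in a few lines; this is a minimal Python example using redis-py (the talk's implementation uses Node.js with Socket.IO), and the save_telemetry_to_sql helper plus the table:data and telemetry:data key names mirror the demo but are otherwise assumptions, not the repo's actual code.

import json
import time
import redis

# Connect to the Redis instance used as the distributed cache (localhost is an assumption)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_telemetry_to_sql(reading: dict) -> None:
    # Placeholder for the existing SQL Server write (handled by the Node.js API in the talk)
    pass

def ingest_telemetry(reading: dict) -> None:
    """Write-through: update storage and cache, then publish a sync event."""
    save_telemetry_to_sql(reading)                        # 1. keep the system of record
    score = int(time.time())                              # 2. score cached entries by timestamp
    cache.zadd("table:data", {json.dumps(reading): score})
    cache.publish("telemetry:data", json.dumps(reading))  # 3. notify every socket server instance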
Demo
Start Redis-cli on Ubuntu and show some inserts, reads and sync events.
sudo service redis-server restart
redis-cli -c -p 6379 -h localhost
zadd table:data 100 "{data:'100'}"
zrangebyscore table:data 100 200
subscribe telemetry:data
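As a companion to the CLI demo, here is a hedged redis-py sketch of the read side: a socket server subscribes to the telemetry:data channel and can also pull recent entries from the sorted set with zrangebyscore, mirroring the commands above. Broadcasting to connected dashboard clients is left as a placeholder.

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Read a range of cached telemetry by score, mirroring ZRANGEBYSCORE in the demo
recent = cache.zrangebyscore("table:data", 100, 200)
print(recent)

# Subscribe to sync events, mirroring SUBSCRIBE telemetry:data in the demo
pubsub = cache.pubsub()
pubsub.subscribe("telemetry:data")

for message in pubsub.listen():
    if message["type"] == "message":
        # Placeholder: broadcast the event to connected dashboard clients over Web Sockets
        print("telemetry event:", message["data"])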
Summary: Boosting Your App Performance
When your application starts to slow down due to heavy read and writes on your database, consider moving the read operations to a cache solution and broadcasting the data to your application via a real-time integration using Web Sockets. This approach can significantly enhance performance and user experience.
Key Benefits
Improved Performance: Offloading reads to a cache system like Redis reduces load on the database.
Real-Time Updates: Using Web Sockets ensures that your application receives updates in real-time, with no need for manual refreshes.
Scalability: By reducing the database load, your application can handle more concurrent users.
Efficient Resource Utilization: Leveraging caching and real-time technologies optimizes the use of server resources, leading to savings and better performance.
We've covered a lot today, but this is just the beginning!
If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.
Thanks for reading.
Send questions or comments on Twitter @ozkary
👍 Originally published by ozkary.com
Gain an understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.
Follow this GitHub repo during the presentation: (Give it a star)
Importance of data governance in Medallion Architecture
Implementing data ownership and stewardship
Ensuring data quality and security
Why Attend:
Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.
Presentation
Introducing Medallion Architecture
Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.
Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
Scalability: The layered approach can accommodate growing data volumes and complexity.
Cost Efficiency: Optimized data storage and processing can reduce costs.
The Raw Zone: Foundation of Your Data Lake
The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.
Key Characteristics:
Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
Data is ingested as-is, without any cleaning or transformation
High volume and velocity
Data retention policies are crucial
Benefits:
Preserves original data for potential future analysis
Enables data reprocessing
Supports data lineage and auditability
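As an illustration only, here is a minimal sketch of landing a source file in the Raw Zone as-is, using a date-partitioned path. The local paths and file name are assumptions; in practice this would target cloud storage such as a data lake bucket.

import shutil
from datetime import date
from pathlib import Path

def land_raw_file(source_file: str, raw_zone_root: str = "/data/lake/raw") -> Path:
    """Copy a source file into the Raw Zone unchanged, partitioned by ingestion date."""
    ingestion_date = date.today().isoformat()
    target_dir = Path(raw_zone_root) / f"ingestion_date={ingestion_date}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / Path(source_file).name
    shutil.copy2(source_file, target_path)  # no cleaning or transformation at this stage
    return target_path

# Example: land a daily MTA turnstile extract (file name is hypothetical)
# land_raw_file("turnstile_240518.csv")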
Use case Background
The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates that track each person as they enter (departure) or exit (arrival) the station.
The MTA subway system has stations around the city.
All the stations are equipped with turnstiles or gates that track each person as they enter or leave the station.
CSV files provide information about the number of commuters per station at different time slots.
Problem Statement
In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. Millions of people use this system every day; therefore, businesses around the subway stations would like to use geofencing advertisement to target those commuters, or potential consumers, and attract them to their business locations at peak hours of the day.
Geofencing is a location-based technology service in which a mobile device's electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) around a geographical location.
Businesses around those locations would like to use this technology to increase their sales by pushing ads to potential customers at specific times.
The Bronze Zone: Transforming Raw Data
The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.
Key Characteristics:
Data is cleansed and standardized
Basic transformations are applied (e.g., data type conversions, null handling)
Data is structured into tables or views
Data quality checks are implemented
Data retention policies may be shorter than the Raw Zone
Benefits:
Improves data quality and consistency
Provides a foundation for further analysis
Enables data exploration and discovery
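Using the MTA turnstile use case, a hedged pandas sketch of Bronze-Zone processing might look like the following; the column names, paths, and file name are assumptions based on the CSV files described in this series, not the exact pipeline code.

import pandas as pd

# Read a file previously landed in the Raw Zone (path and file name are assumptions)
df = pd.read_csv("/data/lake/raw/ingestion_date=2024-05-18/turnstile_240518.csv")

# Standardize column names to lowercase snake_case
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

# Basic transformations: type casting and null handling
df["created_dt"] = pd.to_datetime(df["date"] + " " + df["time"], errors="coerce")
df[["entries", "exits"]] = df[["entries", "exits"]].fillna(0).astype(int)

# Simple data quality check: drop rows without a usable timestamp
df = df.dropna(subset=["created_dt"])

# Write the structured result to the Bronze Zone as Parquet
df.to_parquet("/data/lake/bronze/turnstiles/turnstile_240518.parquet", index=False)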
The Silver Zone: A Foundation for Insights
The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.
Key Characteristics:
Data is cleansed, standardized, and enriched
Data is structured for analytical purposes (e.g., normalized, de-normalized)
Data is optimized for query performance (e.g., partitioning, indexing)
Data is aggregated and summarized for specific use cases
Benefits:
Improved query performance
Supports self-service analytics
Enables advanced analytics and machine learning
Reduces query costs
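A hedged continuation of the same example: aggregating Bronze data into a Silver table optimized for station-level analysis and partitioned by date for query performance. Paths and column names are assumptions.

import pandas as pd

# Read the Bronze-Zone table (path is an assumption)
bronze = pd.read_parquet("/data/lake/bronze/turnstiles/")

# Enrich with a date column and aggregate to station/day grain
bronze["created_date"] = bronze["created_dt"].dt.date
silver = (
    bronze.groupby(["station", "created_date"], as_index=False)[["entries", "exits"]]
          .sum()
)

# Write to the Silver Zone partitioned by date to speed up time-based queries
silver.to_parquet("/data/lake/silver/station_daily/", partition_cols=["created_date"], index=False)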
The Gold Zone: Your Data's Final Destination
The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
Key Characteristics:
Data is highly refined, aggregated, and optimized for specific use cases
Data is often materialized for performance
Data is subject to rigorous quality checks and validation
Data is secured and governed
Benefits:
Enables rapid insights and decision-making
Supports self-service analytics and reporting
Provides a foundation for advanced analytics and machine learning
Reduces query latency
The Gold Zone: Empowering Insights and Actions
The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.
Key Characteristics:
Data is accessible and easily consumable
Supports various analytical tools and platforms (BI, ML, data science)
Enables self-service analytics
Drives business decisions and actions
Examples of Consumption Tools:
Business Intelligence (BI) tools (Looker, Tableau, Power BI)
Data science platforms (Python, R, SQL)
Machine learning platforms (TensorFlow, PyTorch)
Advanced analytics tools
Data Governance: The Cornerstone of Data Management
Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.
Key components of data governance include:
Data Lineage: Tracking data's journey from source to consumption.
Data Ownership: Defining who is responsible for data accuracy and usage.
Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.
By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.
Data Transformation and Incremental Strategy
The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.
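As an illustration of the incremental piece only (not the exact implementation used in the series), a watermark-based batch load can be sketched as follows; the paths and the created_dt watermark column are assumptions.

from pathlib import Path
import pandas as pd

def incremental_load(source_path: str, target_path: str, watermark_col: str = "created_dt") -> int:
    """Append only the rows that arrived since the last load (a minimal watermark-based sketch)."""
    source = pd.read_parquet(source_path)

    if Path(target_path).exists():
        target = pd.read_parquet(target_path)
        last_update = target[watermark_col].max()
        new_rows = source[source[watermark_col] > last_update]
        combined = pd.concat([target, new_rows], ignore_index=True)
    else:
        new_rows = source          # first run: take everything
        combined = source

    combined.to_parquet(target_path, index=False)
    return len(new_rows)           # number of rows inserted in this batch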
Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.
Data Governance: Metadata
Assigns the owner, steward and responsibilities of the data.
Summary: Leverage Medallion Architecture for Success
Key Benefits:
Improved data quality
Enhanced governance
Accelerated insights
Scalability
Cost Efficiency.
We've covered a lot today, but this is just the beginning!
If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.
Thanks for reading.
Send questions or comments on Twitter @ozkary
👍 Originally published by ozkary.com
In modern data engineering solutions, handling streaming data is very important. Businesses often need real-time insights to promptly monitor and respond to operational changes and performance trends. A data streaming pipeline facilitates the integration of real-time data into data warehouses and visualization dashboards.
Follow this GitHub repo during the presentation: (Give it a star)
Spark Structured Streaming for real-time processing
Writing processed data to the data lake
Q&A Session
Get your questions answered by the presenters.
Why Join This Session?
Stay Ahead of the Curve: Gain a comprehensive understanding of data streaming, a crucial aspect of modern data engineering.
Unlock Real-Time Insights: Learn how to leverage data streaming for immediate processing and analysis, enabling faster decision-making.
Learn Kafka and Spark: Explore the power of Apache Kafka as a message broker and Apache Spark Structured Streaming for real-time data processing.
Build a Robust Data Lake: Discover how to integrate real-time data into your data lake for a unified data repository.
Presentation
Introduction - What is Data Streaming?
Data streaming enables us to build data integration in real-time. Unlike traditional batch processing, where data is collected and processed periodically, streaming data arrives continuously, and it is processed on-the-fly.
Understanding the concept of continuous data flow
Real-time, uninterrupted transfer of data from various channels.
Allows for immediate processing and analysis of data as it is generated.
Real-time vs. batch processing
Data is collected and processed in chunks at certain times
Processing can take hours or even days, depending on the source
Benefits and use cases of data streaming
React instantly to events
Predict trends with real-time updates
Update dashboards with up-to-the-minute (or second) data
Data Streaming Channels
Data streams can arrive from various channels, often hosted on HTTP endpoints. The specific channel technology depends on the provider. Generally, the integration involves either a push or a pull connection.
Events (Push Model): These can be delivered using a subscription model like Pub/Sub, where your system subscribes to relevant topics and receives data "pushed" to it whenever events occur. Examples include user clicks, sensor readings, or train arrivals.
Webhooks (Push-Based on Events): These are HTTP callbacks triggered by specific events on external platforms. You set up endpoints that listen for these notifications to capture the data stream.
APIs (Pull Model): Application Programming Interfaces are used to actively fetch data from external services, like social media platforms. Scheduled calls are made to the API at specific intervals to retrieve the data.
Data Streaming System
Powering real-time data pipelines, Apache Kafka efficiently ingests data streams, while Apache Spark analyzes and transforms them, enabling large-scale insights.
Apache Kafka:
Apache Kafka: The heart of the data stream. It's a high-performance platform that acts as a message broker, reliably ingesting data (events) from various sources like applications, sensors, and webhooks. These events are published to categorized channels (topics) within Kafka for further processing.
Spark Structured Streaming:
Built on Spark, it processes Kafka data streams in real-time. Unlike simple ingestion, it allows for transformations, filtering, and aggregations on the fly, enabling real-time analysis of streaming data.
Data Streaming Components
Apache Kafka acts as the central message broker, facilitating real-time data flow. Producers, like applications or sensors, publish data (events) to categorized channels (topics) within Kafka. Spark then subscribes as a consumer, continuously ingesting and processing these data streams in real-time.
Message Broker (Kafka): Routes real-time data streams.
Producers & Consumers: Producers send data to topics; consumers receive and process it.
Topics (Categories): Organize data streams by category.
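A hedged Python sketch of the producer side using the kafka-python library; the broker address, topic name, and payload are assumptions (the repo may use different names or a different Kafka client).

import json
from kafka import KafkaProducer

# Connect to the Kafka broker (address is an assumption)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

# Publish a turnstile event to a topic (topic name and payload are illustrative)
event = {"STATION": "34 ST-PENN STA", "ENTRIES": 120, "EXITS": 95, "TIMESTAMP": "2024-05-18T08:00:00"}
producer.send("mta-turnstile", value=event)
producer.flush()  # ensure the message is delivered before exiting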
The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates that track each person as they enter (departure) or exit (arrival) the station.
The MTA subway system has stations around the city.
All the stations are equipped with turnstiles or gates that track each person as they enter or leave the station.
CSV files provide information about the number of commuters per station at different time slots.
Data Specifications
Since we already have a data transformation layer that incrementally updates the data warehouse, our real-time integration will focus on leveraging this existing pipeline. We'll achieve this by aggregating data from the stream and writing the results directly to the data lake.
Group by these categorical fields: "AC", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION", "DATE", "DESC"
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema for the incoming data
turnstiles_schema = StructType([
    StructField("AC", StringType()),
    StructField("UNIT", StringType()),
    StructField("SCP", StringType()),
    StructField("STATION", StringType()),
    StructField("LINENAME", StringType()),
    StructField("DIVISION", StringType()),
    StructField("DATE", StringType()),
    StructField("TIME", StringType()),
    StructField("DESC", StringType()),
    StructField("ENTRIES", IntegerType()),
    StructField("EXITS", IntegerType()),
    StructField("ID", StringType()),
    StructField("TIMESTAMP", StringType())
])
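Building on the schema above, a hedged sketch of the Spark Structured Streaming consumer might look like this. The broker address, topic name, and data lake paths are assumptions, and the repo's actual consumer may differ.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as sum_

spark = SparkSession.builder.appName("turnstiles-stream").getOrCreate()

# Subscribe to the Kafka topic (broker and topic names are assumptions)
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "mta-turnstile")
    .load()
)

# Parse the JSON payload with the schema defined above
events = (
    raw_stream.select(from_json(col("value").cast("string"), turnstiles_schema).alias("event"))
    .select("event.*")
)

# Aggregate by the categorical fields before writing to the data lake
aggregated = (
    events.groupBy("AC", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION", "DATE", "DESC")
    .agg(sum_("ENTRIES").alias("ENTRIES"), sum_("EXITS").alias("EXITS"))
)

def write_to_data_lake(batch_df, batch_id):
    # Append each micro-batch so the existing pipeline can pick it up (path is an assumption)
    batch_df.write.mode("append").parquet("/data/lake/turnstiles")

query = (
    aggregated.writeStream.outputMode("update")
    .foreachBatch(write_to_data_lake)
    .option("checkpointLocation", "/data/checkpoints/turnstiles")
    .start()
)
query.awaitTermination()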
Solution Architecture for Real-time Data Integration
Data streams are captured by the Kafka producer and sent to Kafka topics. The Spark-based stream consumer retrieves and processes the data in real-time, aggregating it for storage in the data lake.
Components:
Real-Time Data Source: Continuously emits data streams (events or messages).
Message Broker Layer:
Kafka Broker Instance: Acts as a central hub, efficiently collecting and organizing data into topics.
Kafka Producer (Python): Bridges the gap between the source and Kafka.
Stream Processing Layer:
Spark Instance: Processes and transforms data in real-time using Apache Spark.
Stream Consumer (Python): Consumes messages from Kafka and acts as both a Kafka consumer and Spark application:
Retrieves data as soon as it arrives.
Processes and aggregates data.
Saves results to a data lake.
Data Storage: Stores the transformed data for visualization tools (Looker, Power BI) to access.
Docker Containers: Containers are used to deploy each component.
Data Transformation and Incremental Strategy
The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.
Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.
Impact on Data Visualization
Our architecture efficiently processes real-time data by leveraging our existing data transformation layer.
This optimized flow enables significantly faster data visualization.
The dashboard refresh frequency can be increased to load the new data sooner.
For real-time updates directly on the dashboard, a socket-based integration would be necessary.
Key Takeaways: Real-Time Integration
Data streaming solutions are an absolute necessity, enabling the rapid processing and analysis of vast amounts of real-time data. Technologies like Kafka and Spark play a pivotal role in empowering organizations to harness real-time insights from their data streams.
Real-time Power: Kafka handles various data streams, feeding them to data topics.
Spark Processing Power: Spark reads from these topics, analyzes messages in real-time, and aggregates the data to our specifications.
Existing Pipeline Integration: Leverages existing pipelines to write data to data lakes for transformation.
Faster Insights: Delivers near real-time information for quicker data analysis and visualization.
We've covered a lot today, but this is just the beginning!
If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.
Thanks for reading.
Send questions or comments on Twitter @ozkary
👍 Originally published by ozkary.com
Curious about the possibilities and where your passion fits in the ever-evolving world of technology? Join us as we decode your unique technical journey! This presentation is designed to equip you with the knowledge and confidence to navigate your path in the exciting world of technology.
YouTube Video
Video Agenda
What's Next?:
Understanding the Technical Landscape.
Continuous Learning.
Exploring Industry Trends and Job Market.
Explore Your Passion: Diverse Areas of Specialization:
Showcase different areas of CS specialization (e.g., web development, data science, artificial intelligence, cybersecurity).
Building Blocks of Tech: Programming Languages:
Showcase and explain some popular programming languages used in different areas.
Beyond Coding: Programming vs. Non-Programming Roles:
Debunk the myth that all CS careers involve coding.
Introduce non-programming roles in tech.
Code-Centric vs. Low-Code/No-Code Development:
Explain the concept of code-centric and low-code/no-code development approaches.
Discuss the advantages and disadvantages of each approach.
The Future is Bright:
Discuss emerging technologies like AI, cloud computing, and automation, and their impact on the future of CS careers.
Emphasize the importance of continuous learning and adaptability in this ever-changing landscape.
Why Attend?
In-demand skills: Discover the technical and soft skills sought after by employers in today's tech industry.
Matching your passion with a career: Explore diverse areas of specialization and identify the one that aligns with your interests and strengths.
Career paths beyond coding: Uncover a range of opportunities in tech, whether you're a coding whiz or have a different area of expertise.
Future-proofing your career: Gain knowledge of emerging technologies and how they'll shape the future of computer science.
By attending, you'll leave equipped with the knowledge and confidence to make informed decisions about your future in the ever-evolving world of technology.
Presentation
What's Next for Your Tech Career?
Feeling overwhelmed by the possibilities after graduation? You're not alone! Learning never ends, and there are several technical foundation (hard-skill) areas to consider as you embark on a tech career.
Understanding the Technical Landscape
Stay Informed: Keep up with the latest trends and advancements in technology
Broaden Your Horizons: Look beyond your core area of study. Explore other fields
Continuous Learning and Skill Development
Adapt and Evolve: The tech industry is constantly changing
Technical Skills: Focus on in-demand skills such as Cloud Computing, Cybersecurity, and Data Science
Technical skills are crucial, but success in the tech industry also hinges on strong soft skills, which are essential in today's collaborative tech environment:
Networking and Professional Growth:
Build Your Tech Network: Connect and collaborate with online and offline tech communities.
Invest in Your Soft Skills: Enhance your communication, teamwork, and problem-solving skills.
Find Your Tech Mentor: Seek guidance and support from experienced professionals.
The tech industry is bursting with opportunities. To navigate this exciting landscape and land your dream job, consider these key areas as you craft your career roadmap:
Work style Preferences:
Remote vs. Relocation: Do you thrive in a remote work environment, or are you open to relocating for exciting opportunities?
Big Companies vs. Startups: Compare the established structure and resources of large companies or the fast-paced, dynamic culture of startups.
Explore an Industry Specialization:
Healthcare: Revolutionize patient care by contributing to advancements in medical technology and data analysis.
Manufacturing: Fuel innovation by optimizing production processes and integrating automation through industrial tech.
Diverse Areas of Specialization
Do you like creating websites? Web development might be your calling. Do you dream of building mobile apps? Mobile development could be your fit. Are you intrigued by the power of data and its ability to unlock valuable insights? Data science might be your ideal path.
Web Development: Build user interfaces and functionalities for websites and web applications.
Mobile Development: Create applications specifically designed for smartphones and tablets.
Data Engineering: Build complex data pipelines and data storage solutions.
Data Analyst: Process data, discover insights, create visualizations
Data Science: Analyze large datasets to extract valuable insights and inform decision-making.
Artificial Intelligence: Develop intelligent systems that can learn and make decisions.
Cloud Engineering: Design, build, and manage applications and data in the cloud.
Cybersecurity: Protect computer systems and networks from digital threats
Game Development: Create video games and AR experiences
Building Blocks of Tech: Programming Languages
The world of software development hinges on a powerful tool: programming languages. These languages, with their unique syntax and functionalities, have advantages for certain platforms like web, data, and mobile.
Versatile Languages:
JavaScript (JS): The king of web development, also used for building interactive interfaces and mobile apps (React Native).
Python: A beginner-friendly language, popular for data science, machine learning, web development (Django), and automation.
Java: An industry standard, widely used for enterprise applications, web development (Spring), and mobile development (Android), high-level programming.
C#: A powerful language favored for game development (Unity), web development (ASP.NET), and enterprise applications.
SQL: A powerful language essential for interacting with relational databases, widely used in web development, data analysis, and business intelligence.
Specialized Languages:
PHP: Primarily used for server-side scripting and web development (WordPress).
C++: A high-performance language for system programming, game development, and scientific computing, low-level programming.
Mobile-Centric Languages:
Swift: The go-to language for native iOS app development.
Objective-C: The predecessor to Swift, still used in some legacy iOS apps.
JavaScript Extensions:
TypeScript: A superset of JavaScript, adding optional static typing for larger web applications.
Beyond Coding: Programming vs. Non-Programming Roles
Programming roles involve writing code to create apps and systems. Non-programming tech roles, like project managers, QA, UX designers, and technical writers, use their skills to guide the development process, design user experiences, and document technical information.
Programming Roles: Developers, software engineers, data engineers
Non-Programming Roles: Project managers, systems analysts, user experience (UX) designers, QA, DevOps, technical writers.
The industry continues to define new specialized roles.
Empowering Everyone: Code-Centric vs. Low-Code/No-Code Development
Do you enjoy diving into the code itself using tools like Visual Studio Code? Or perhaps you prefer a more visual approach, leveraging designer tools and writing code snippets when needed?
Code-Centric Development:
Traditional approach where developers write code from scratch using programming languages like Python, C#, or C++.
Offers maximum flexibility and control over the application's functionality and performance.
Requires strong programming skills and a deep understanding of software development principles.
Low-Code/No-Code Development:
User-friendly platforms that enable rapid application development with minimal coding or no coding required.
Utilize drag-and-drop interfaces, pre-built components, and templates to streamline the development process.
Ideal for building simple applications, automating workflows, or creating prototypes.
Evolving with Technology
The landscape of software development is constantly transforming, with new technologies like AI, low-code/no-code platforms, automation, and cloud engineering emerging. Keep evolving!
AI as a Co-Pilot: AI won't replace programmers; it will become a powerful collaborator. Imagine AI tools that:
Generate code snippets based on your requirements.
Refactor and debug code for efficiency and security.
Automate repetitive tasks, freeing you for more creative problem-solving.
Low-Code/No-Code Democratization: These platforms will empower citizen developers to build basic applications, streamlining workflows. Programmers will focus on complex functionalities and integrating these solutions.
Automation Revolution: Repetitive coding tasks will be automated, allowing programmers to focus on higher-level logic, system design, and innovation.
Cloud Engineering Boom: The rise of cloud platforms will create a demand for skilled cloud engineers who can design, build, and manage scalable applications in the cloud.
Final Thoughts: Your Future in Tech Awaits
The tech world is yours to explore! Keep learning, join a community, choose your path in tech and industry, and build your roadmap. Find a balance between your professional pursuits and personal well-being.
Thanks for reading.
Send questions or comments on Twitter @ozkary
👍 Originally published by ozkary.com
Delve into unlocking insights from our data with data analysis and visualization. In this continuation of our data engineering process series, we focus on visualizing insights. We learn about best practices for data analysis and visualization, then move into a code-centric dashboard implementation using Python, Pandas, and Plotly. We follow up by using a high-quality enterprise tool, such as Looker, to construct a low-code, cloud-hosted dashboard, providing insight into the type of effort each method takes.
Follow this GitHub repo during the presentation: (Give it a star)
Recap the importance of data warehousing, data modeling and transition to data analysis and visualization.
Data Analysis Foundations:
Data Profiling: Understand the structure and characteristics of your data.
Data Preprocessing: Clean and prepare data for analysis.
Statistical Analysis: Utilize statistical techniques to extract meaningful patterns.
Business Intelligence: Define key metrics and answer business questions.
Identifying Data Analysis Requirements: Explore filtering criteria, KPIs, data distribution, and time partitioning.
Mastering Data Visualization:
Common Chart Types: Explore a variety of charts and graphs for effective data visualization.
Designing Powerful Reports and Dashboards: Understand user-centered design principles for clarity, simplicity, consistency, filtering options, and mobile responsiveness.
Layout Configuration and UI Components: Learn about dashboard design techniques for impactful presentations.
Implementation Showcase:
Code-Centric Dashboard: Build a data dashboard using Python, Pandas, and Plotly (demonstrates code-centric approach).
Low-Code Cloud-Hosted Dashboard: Explore a high-quality enterprise tool like Looker to construct a dashboard (demonstrates low-code efficiency).
Effort Comparison: Analyze the time and effort required for each development approach.
Conclusion:
Recap key takeaways and the importance of data analysis and visualization for data-driven decision-making.
Why Join This Session?
Learn best practices for data analysis and visualization to unlock hidden insights in your data.
Gain hands-on experience through code-centric and low-code dashboard implementations using popular tools.
Understand the effort involved in different dashboard development approaches.
Discover how to create user-centered, impactful visualizations for data-driven decision-making.
This session empowers data engineers and analysts with the skills and tools to transform data into actionable insights that drive business value.
Presentation
How Do We Gather Insights From Data?
We leverage the principles of data analysis and visualization. Data analysis reveals patterns and trends, while visualization translates these insights into clear charts and graphs. It's the approach to turning raw data into actionable insights for smarter decision-making.
Let’s Explore More About:
Data Modeling
Data Analysis
Python and Jupyter Notebook
Statistical Analysis vs Business Intelligence
Data Visualization
Chart Types and Design Principles
Code-centric with Python Graphs
Low-code with tools like Looker, PowerBI, Tableau
Data Modeling
Data modeling lays the foundation for a data warehouse. It starts with modeling raw data into a logical model outlining the data and its relationships, with a focus on data requirements. This model is then translated, using DDL, into the specific views, tables, columns (data types), and keys that make up the physical model of the data warehouse, with a focus on technical requirements.
Importance of a Date Dimension
A date dimension allows us to analyze data across different time granularities (e.g., year, quarter, month, day). By storing dates and related attributes in a separate table, you can efficiently join it with your fact tables containing metrics. When filtering or selecting dates for analysis, it's generally better to choose options from the dimension table rather than directly filtering the date column in the fact table.
CREATE TABLE dim_date (
    date_id INT NOT NULL PRIMARY KEY,       -- Surrogate key for the date dimension
    full_date DATE NOT NULL,                -- Full date in YYYY-MM-DD format
    year INT NOT NULL,                      -- Year (e.g., 2024)
    quarter INT NOT NULL,                   -- Quarter of the year (1-4)
    month INT NOT NULL,                     -- Month of the year (1-12)
    month_name VARCHAR(20) NOT NULL,        -- Name of the month (e.g., January)
    day INT NOT NULL,                       -- Day of the month (1-31)
    day_of_week INT NOT NULL,               -- Day of the week (1-7, where 1=Sunday)
    day_of_week_name VARCHAR(20) NOT NULL,  -- Name of the day of the week (e.g., Sunday)
    is_weekend BOOLEAN NOT NULL,            -- Flag indicating weekend (TRUE) or weekday (FALSE)
    is_holiday BOOLEAN NOT NULL,            -- Flag indicating holiday (TRUE) or not (FALSE)
    fiscal_year INT,                        -- Fiscal year (optional)
    fiscal_quarter INT                      -- Fiscal quarter (optional)
);
Data Analysis
Data analysis is the practice of exploring data and understanding its meaning. It involves activities that help us achieve a specific goal, such as identifying data dimensions and measures, as well as identifying outliers, trends, and distributions.
We can accomplish these activities by writing code with Python and Pandas, SQL, and Jupyter Notebooks.
We can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.
The use of low-code tools also aids in the Exploratory Data Analysis (EDA) process
Data Analysis - Profiling
Data profiling is the process of identifying the data types, dimensions, measures, and quantitative values, which allows the analyst to view the characteristics of the data and understand how to group the information.
Data Types: This is the type classification of the data fields. It enables us to identify categorical (text), numeric and date-time values, which define the schema
Dimensions: Dimensions are textual, categorical attributes that describe business entities. They are often discrete and used for grouping, filtering, organizing, and partitioning the data
Measures: Measures are the quantitative values that are subject to calculations such as sum, average, minimum, maximum, etc. They represent the KPIs that the organization wants to track and analyze
field_name     dimension   data_type   measure   datetime_dimension
station_name   True        object      False     False
created_dt     True        object      False     True
entries        False       int64       True      False
exits          False       int64       True      False
Data Analysis - Cleaning and Preprocessing
Data cleaning is the process of finding bad data and outliers that can affect the results. In preprocessing, we set the data types, combine or split columns, and rename columns to follow our standards.
Bad Data:
Bad data could be null values
Values that are not within the range of the average trend for that day
Pre-Process:
Cast fields with the correct type
Rename columns, following naming conventions
Transform values from labels to numbers when applicable
import pandas as pd
import numpy as np

# Check for null values in each column
null_counts = df.isnull().sum()
null_counts.head()

# Fill null values with a specific value
df = df.fillna(0)

# Cast a column to a specific data type
df['created_dt'] = pd.to_datetime(df['created_dt'])

# Get the numeric column names and cast them to int
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].astype(int)

# Rename all columns to lowercase
df.columns = [col.lower() for col in df.columns]
Data Analysis - Preprocess Outliers
Outliers are values that are notably different from the other data points in terms of magnitude or distribution. They can be either unusually high (positive outliers) or unusually low (negative outliers) in comparison to the majority of data points.
Process:
Calculate the z-score for numeric values, which describes how far a data point is from the group mean
Define a threshold
Choose a value that determines when a z-score is considered high enough to be labeled as an outlier (2 or 3)
Identify the outliers based on the z-score
# measure outliers for entries and exits
# Calculate z-scores within each station group
z_scores = df.groupby('station_name')[numeric_cols] \
.transform(lambda x: (x - x.mean()) / x.std())
# Set a threshold for outliers
threshold = 3
# Identify outliers based on z-scores within each station
outliers = (z_scores.abs() > threshold)
# Print the count of outliers for each station
outliers_by_station = outliers.groupby(df['station_name']).sum()
print(outliers_by_station)
Data Analysis - Statistical Analysis
Statistical analysis focuses on applying statistical techniques in order to draw meaningful conclusions about a set of data. It involves mathematical computations, probability theory, correlation analysis, and hypothesis testing to make inferences and predictions based on the data. This is used in manufacturing, data science, and machine learning.
Pearson Correlation Coefficient and p-value are statistical measures used to assess the strength and significance of the linear relationship between two variables.
P-Value: measures the statistical significance of the correlation
Interpretation:
If the p-value is small (below 0.05), the linear correlation is statistically significant. Otherwise, there is no evidence of a correlation
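As a hedged example with SciPy, correlating the entries (departures) and exits (arrivals) measures from the use case; df is assumed to be the preprocessed DataFrame from the earlier steps.

from scipy.stats import pearsonr

# Assumed: df is the preprocessed turnstile DataFrame with entries and exits columns
r_value, p_value = pearsonr(df['entries'], df['exits'])

print(f"Pearson r: {r_value:.3f}, p-value: {p_value:.5f}")
if p_value < 0.05:
    print("The linear correlation is statistically significant.")
else:
    print("No evidence of a linear correlation.")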
Business intelligence (BI) is a strategic approach that involves the collection, analysis, and presentation of data to facilitate informed decision-making within an organization. In the context of business analytics, BI is a powerful tool for extracting meaningful insights from data and turning them into actionable strategies.
Analysts:
Look at data distribution
Understanding of data variations
Focus analysis based on locations, date and time periods
Provide insights that impact business operations
Provide insights for business strategy and decision-making
# Calculate total passengers for arrivals and departures
total_arrivals = df['exits'].sum()/divisor_t
total_departures = df['entries'].sum()/divisor_t
print(f"Total Arrivals: {total_arrivals} Total Departures: {total_departures}")
# Create distribution analysis by station
df_by_station = analyze_distribution(df,'station_name',measures,divisor_t)
# Create distribution analysis by day of the week
df_by_date = df.groupby(["created_dt"], as_index=False)[measures].sum()
day_order = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
df_by_date["weekday"] = pd.Categorical(df_by_date["created_dt"].dt.strftime('%a'), categories=day_order, ordered=True)
df_entries_by_date = analyze_distribution(df_by_date,'weekday',measures,divisor_t)
# Create distribution analysis time slots
for slot, (start_hour, end_hour) in time_slots.items():
    slot_data = df[(df['created_dt'].dt.hour >= start_hour) & (df['created_dt'].dt.hour <= end_hour)]
    arrivals = slot_data['exits'].sum()/divisor_t
    departures = slot_data['entries'].sum()/divisor_t
    print(f"{slot.capitalize()} - Arrivals: {arrivals:.2f}, Departures: {departures:.2f}")
What is Data Visualization?
Data visualization is a practice that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance with the use of charts, controls and colors.
Visualization Solutions:
A code-centric solution involves writing programs with a language like Python or JavaScript to manage the data analysis and create the visuals
A low-code solution uses cloud-hosted tools like Looker, PowerBI and Tableau to accelerate the data analysis and visualization by using a design approach
Data Visualization - Design Principles
These design principles prioritize the user's experience by ensuring clarity, simplicity, and consistency.
User-centered design: Focus on the needs and preferences of your audience when designing your visualizations.
Clarity: Ensure your visualizations are easy to understand, even for people with no prior knowledge of the data.
Simplicity: Avoid using too much clutter or complex charts.
Consistency: Maintain a consistent visual style throughout your visualizations.
Filtering options: Allow users to filter the data based on their specific interests.
Device responsiveness: Design your visualizations to be responsive and viewable on all devices, including mobile phones and tablets.
Visual Perception
Over half of our brain is dedicated to processing visual information. This means our brains are constantly working to interpret and make sense of what we see.
Key elements influencing visual perception:
Color: Colors evoke emotions, create hierarchy, and guide the eye.
Size: Larger elements are perceived as more important. (Use different sized circles or bars to show emphasis)
Position: Elements placed at the top or center tend to grab attention first.
Shape: Different shapes can convey specific meanings or represent categories. (Use icons or charts with various shapes)
Statistical Analysis - Basic Charts
Control Charts: Monitor process stability over time, identifying potential variations or defects.
Histograms: Depict the frequency distribution of data points, revealing patterns and potential outliers.
Box Plots: Summarize the distribution of data using quartiles, providing a quick overview of central tendency and variability.
Business Intelligence Charts
Scorecards: Provide a concise overview of key performance indicators (KPIs) at a glance, enabling performance monitoring.
Pie Charts: Illustrate proportional relationships between parts of a whole, ideal for composition comparisons.
Doughnut Charts: Similar to pie charts but emphasize a specific category by leaving a blank center space.
Bar Charts: Represent comparisons between categories using rectangular bars, effective for showcasing differences in magnitude.
Line Charts: Reveal trends or patterns over time by connecting data points with a line, useful for visualizing continuous changes.
Area charts: Can be helpful for visually emphasizing the magnitude of change over time.
Stacked area charts: can be used to show multiple data series.
Data Visualization - Code Centric
Python, coupled with libraries like Plotly and Seaborn, offers a versatile platform for data visualization that comes with its own set of advantages and limitations. It is great for team sharing, but it is heavier on code and deployment tasks.
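A minimal Plotly Express sketch of the code-centric approach, reusing the df_by_station aggregation from the analysis step; the column names are assumptions based on the earlier examples, not the repo's exact dashboard code.

import plotly.express as px

# Grouped bar chart of arrivals (exits) and departures (entries) by station
fig = px.bar(
    df_by_station,
    x="station_name",
    y=["entries", "exits"],
    barmode="group",
    title="Commuter Arrivals and Departures by Station",
    labels={"value": "Passengers", "variable": "Measure"},
)
fig.show()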
Data Visualization - Low Code
Instead of focusing on code, a low-code tool enables data professionals to focus on the data by using design tools with prebuilt components and connectors. The hosting and deployment is mostly managed by the providers. This is often the solution for broader sharing and enterprise solutions.
Final Thoughts
The synergy between data analysis and visualization is pivotal for data-driven projects. Navigating data analysis with established principles and communicating insights through visually engaging dashboards empowers us to extract value from data.
The Future is Bright
Augmented Reality (AR) and Virtual Reality (VR): Imagine exploring a dataset within a 3D environment and having charts and graphs overlaid on the real world
Artificial Intelligence (AI) and Machine Learning (ML): AI can automate data analysis tasks like identifying patterns and trends, while ML can personalize visualizations based on user preferences or past interactions.
Tools will focus on creating visualizations that are accessible to people with disabilities
We've covered a lot today, but this is just the beginning!
If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.
Thanks for reading.
Send questions or comments on Twitter @ozkary
👍 Originally published by ozkary.com
Oscar Garcia is a Principal Software Architect and VP of Product Development. He is a Microsoft MVP and certified solutions developer with many years of experience building software solutions. He specializes in building cloud solutions using technologies like GCP, AWS, Azure, ASP.NET, Node.js, Angular, React, SharePoint, Microsoft 365, and Firebase, as well as BI projects for data visualization using tools like Tableau, Power BI, and Spark. You can follow Oscar on Twitter @ozkary or his blog at ozkary.com
VP of Product Development
Recipient of 5 consecutive Microsoft Most Valuable Professional (MVP) awards.