
Image by Author
# Introduction
You’ve likely heard the cliche: “Data is the backbone of modern organizations.” This holds true, but only if you can rely on that backbone. I’m not necessarily talking about the condition of the data itself, but rather the system that produces and moves the data.
If dashboards break, pipelines fail, and metrics change seemingly at random, the problem isn’t a lack of data quality but a lack of observability.
# What Is Data Observability?
Data observability is the process of monitoring the health and reliability of data systems.
This process helps data teams detect, diagnose, and prevent issues across the analytics stack — from ingestion to storage to analysis — before they impact decision-making.
With data observability, you monitor the following aspects of your data and the systems that move it. A short code sketch after the list shows what a few of these checks can look like in practice.

Image by Author
- Data Freshness: Tracks how current the data is compared to the expected update schedule. Example: If a daily sales table hasn’t been updated by 7 a.m. as scheduled, observability tools raise an alert before business users open their sales reports.
- Data Volume: Measures how much data is being ingested or processed at each stage. Example: A 38% drop in transaction records overnight might mean a broken ingestion job.
- Data Schema: Detects changes to column names, data types, or table structures. Example: A data producer pushes an updated schema to production without notice, and downstream transformations that expect the old structure start failing.
- Data Distribution: Checks the statistical shape of the data, i.e., whether values fall within their expected ranges. Example: The percentage of premium customers drops from 29% to 3% overnight. Observability tools flag this as an anomaly and prevent a misleading churn rate analysis.
- Data Lineage: Visualizes the flow of data across the ecosystem, from ingestion through transformation to final dashboards. Example: A source table in Snowflake fails, and the lineage view shows that three Looker dashboards and two machine learning models depend on it.
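To make these pillars more concrete, here is a minimal sketch of what freshness, volume, and distribution checks can look like in plain Python. The daily_sales table, its loaded_at and customer_tier columns, the thresholds, and the DB-API connection are all hypothetical; dedicated observability tools run this kind of logic automatically across every table.

```python
# Minimal sketch of freshness, volume, and distribution checks.
# Assumes a DB-API style connection (Snowflake, Postgres, etc.) and a
# hypothetical daily_sales table with loaded_at (timezone-aware timestamp)
# and customer_tier columns. Thresholds are illustrative, not prescriptive.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)    # table must be updated at least daily
MIN_EXPECTED_ROWS = 10_000             # rough lower bound from past loads
PREMIUM_SHARE_RANGE = (0.20, 0.40)     # expected share of premium customers


def run_pillar_checks(conn):
    cur = conn.cursor()
    issues = []

    # Freshness: how long since the table was last loaded?
    cur.execute("SELECT MAX(loaded_at) FROM daily_sales")
    last_load = cur.fetchone()[0]
    if last_load is None or datetime.now(timezone.utc) - last_load > FRESHNESS_SLA:
        issues.append(f"Freshness: last load at {last_load}, SLA is {FRESHNESS_SLA}")

    # Volume: did yesterday's batch arrive in full?
    cur.execute("SELECT COUNT(*) FROM daily_sales WHERE loaded_at >= CURRENT_DATE - 1")
    row_count = cur.fetchone()[0]
    if row_count < MIN_EXPECTED_ROWS:
        issues.append(f"Volume: {row_count} rows loaded, expected at least {MIN_EXPECTED_ROWS}")

    # Distribution: does the premium-customer share look like it usually does?
    cur.execute(
        "SELECT AVG(CASE WHEN customer_tier = 'premium' THEN 1.0 ELSE 0.0 END) "
        "FROM daily_sales WHERE loaded_at >= CURRENT_DATE - 1"
    )
    premium_share = cur.fetchone()[0] or 0.0
    low, high = PREMIUM_SHARE_RANGE
    if not low <= premium_share <= high:
        issues.append(f"Distribution: premium share {premium_share:.1%} outside {low:.0%}-{high:.0%}")

    return issues
```

In practice, you would schedule checks like these alongside the pipeline and alert whenever the returned list is not empty.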
# Why Data Observability Matters
The benefits of data observability in analytics are shown below.

Image by Author
Each of the pillars described earlier plays a specific role in achieving these benefits.
- Fewer Bad Decisions: Data observability ensures that analytics reflect current business conditions (freshness) and that numbers and patterns make sense before they’re used for insights (distribution), so fewer decisions are based on bad data.
- Faster Issue Detection: Early-warning alerts flag incomplete or duplicated loads (volume) and structural changes that would silently break pipelines (schema), so anomalies are caught before business users even notice them.
- Improved Data Team Productivity: The data lineage pillar maps how data flows across systems, making it easy to trace where an error started and which assets are affected. The data team can focus on development instead of firefighting.
- Better Stakeholder Trust: This is the final boss of data observability benefits, the ultimate outcome of the previous three. If stakeholders can trust that the data is current, complete, stable, and accurate, and everyone knows where it came from, confidence in analytics follows naturally.
# Data Observability Lifecycle & Techniques
As mentioned earlier, data observability is a process. Its continuous lifecycle consists of three stages.

Image by Author
// 1. Monitoring and Detection Stage
Goal: A reliable early-warning system that checks in real time whether something in your data has drifted, broken, or deviated from expectations.
What happens here:

Image by Author
- Automated Monitoring: Observability tools automatically monitor data health across all five pillars
- Anomaly Detection: Machine learning is used to detect statistical anomalies in data, e.g. unexpected drops in the number of rows (a simple version is sketched after this list)
- Alerting Systems: Whenever a violation occurs, the system sends alerts to Slack, PagerDuty, or email
- Metadata & Metrics Tracking: The system also tracks information such as job duration, success rate, and last update time to learn what “normal” behavior looks like
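To illustrate the anomaly detection and alerting steps above, here is a minimal sketch that applies a simple z-score to daily row counts and posts to a Slack incoming webhook. The history values, threshold, and webhook URL are hypothetical; commercial tools learn these baselines from your metadata automatically.

```python
# Minimal sketch of anomaly detection and alerting on daily row counts.
# The history list, z-score threshold, and Slack webhook URL are placeholders.
import statistics
import requests  # any HTTP client works; requests is assumed to be installed

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def is_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's count if it is more than z_threshold standard deviations
    away from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold


def alert(message: str) -> None:
    """Send the alert to a Slack channel via an incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


# Example: the last 30 days of row counts vs. today's load
history = [98_500, 101_200, 99_800, 100_400, 102_100] * 6
today = 61_300
if is_anomalous(history, today):
    alert(f"Row-count anomaly in daily_sales: {today} rows today vs. ~{int(statistics.mean(history))} on average")
```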
// Monitoring and Detection Techniques
Here is an overview of the common techniques used in this stage.

// 2. Diagnosis and Understanding Stage
Goal: Understanding where the issue started and which systems it impacted, so that recovery is fast and, if there are several issues, they can be prioritized by the severity of their impact.
What happens here:

Image by Author
- Data Lineage Analysis: Observability tools visualize how data flows from raw sources to final dashboards, making it easier to locate where the issue occurred
- Metadata Correlation: Metadata such as job logs, run times, and recent changes is correlated to pinpoint the problem and its location
- Impact Assessment: Tools identify downstream assets (e.g. dashboards or models) that rely on the affected data (a toy version of this traversal is sketched after the list)
- Root Cause Investigation: Lineage and metadata are used to determine the root cause of the problem
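A core part of this stage is walking the lineage graph to see what sits downstream of a failure. Below is a toy version of that impact assessment; the asset names and edges are made up, and real observability tools construct this graph automatically from query logs and orchestrator metadata.

```python
# Minimal sketch of downstream impact assessment on a lineage graph.
# Asset names and edges are hypothetical examples.
from collections import deque

# Directed edges: upstream asset -> assets that consume it
LINEAGE = {
    "snowflake.raw_orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_sales", "ml.churn_model"],
    "dbt.fct_sales": ["looker.sales_dashboard", "looker.exec_dashboard"],
    "ml.churn_model": ["looker.retention_dashboard"],
}


def downstream_assets(failed_asset: str) -> set[str]:
    """Breadth-first traversal to find everything that depends on the failed asset."""
    impacted, queue = set(), deque([failed_asset])
    while queue:
        current = queue.popleft()
        for consumer in LINEAGE.get(current, []):
            if consumer not in impacted:
                impacted.add(consumer)
                queue.append(consumer)
    return impacted


# Example: a raw source table fails to load; everything below it is impacted
print(downstream_assets("snowflake.raw_orders"))
# Impacted (order may vary): stg_orders, fct_sales, churn_model,
# sales_dashboard, exec_dashboard, retention_dashboard
```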
// Diagnosis and Understanding Techniques
Here is an overview of techniques used in this stage.

// 3. Prevention and Improvement Stage
Goal: Learning from what broke and making data systems more resilient with every incident by establishing standards, automating enforcement, and monitoring compliance.
What happens here:

Image by Author
- Data Contracts: Agreements between producers and consumers define acceptable schema and quality standards, so there are no unannounced changes to data
- Testing & Validation: Automated tests (e.g. through dbt tests or Great Expectations) check that new data meets defined thresholds before going live (a toy contract check is sketched after this list). For teams strengthening their data analytics and SQL debugging skills, platforms like StrataScratch can help practitioners develop the analytical rigor needed to identify and prevent data quality issues
- SLA & SLO Tracking: Teams define and monitor measurable reliability goals (Service Level Agreements and Service Level Objectives), e.g. 99% of pipelines complete on time
- Incident Postmortems: Each issue is reviewed, helping to improve monitoring rules and observability in general
- Governance & Version Control: Changes are tracked, documentation is created, and ownership is assigned
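As a small illustration of data contracts and automated validation, the sketch below checks an incoming batch against an agreed schema plus one quality rule. The column names, dtypes, and the use of pandas are assumptions; in practice the same checks are usually expressed declaratively in dbt tests, Great Expectations, or Soda.

```python
# Minimal sketch of a data contract check, run before a new batch is published.
# The expected schema, the quality rule, and the pandas DataFrame are hypothetical.
import pandas as pd

# Contract agreed between the data producer and its consumers
EXPECTED_SCHEMA = {
    "order_id": "int64",
    "customer_tier": "object",
    "amount": "float64",
    "loaded_at": "datetime64[ns]",
}


def validate_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations; an empty list means the batch passes."""
    violations = []
    for column, expected_dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            violations.append(f"Missing column: {column}")
        elif str(df[column].dtype) != expected_dtype:
            violations.append(
                f"Type change in {column}: expected {expected_dtype}, got {df[column].dtype}"
            )
    # Example quality rule from the contract: no negative order amounts
    if "amount" in df.columns and (df["amount"] < 0).any():
        violations.append("Quality rule violated: negative values in amount")
    return violations
```

Wired into CI/CD, a non-empty result blocks the release and notifies the producer, which is exactly the kind of enforcement data contracts are meant to automate.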
// Prevention and Improvement Techniques
Here is an overview of the techniques.

# Data Observability Tools
Now that you understand what data observability does and how it works, it’s time to introduce you to the tools that you’ll use to implement it.
The most commonly used tools are shown below.

Image by Author
We will explore each of these tools in more detail.
// 1. Monte Carlo
Monte Carlo is an industry standard and was the first to formalize the five-pillar model. It provides complete visibility into data health across the pipeline.
Key strengths:
- Covers all data observability pillars
- Anomaly and schema-change detection is automatic, i.e. no manual rule setup is needed
- Detailed data lineage mapping and impact analysis
Limitations:
- Less suitable for smaller teams, as it’s designed for large-scale deployments
- Enterprise pricing
// 2. Datadog
Datadog started as a tool for monitoring servers, applications, and infrastructure. It now provides unified observability across infrastructure, applications, and data pipelines.
Key strengths:
- Correlates data issues with infrastructure metrics (CPU, latency, memory)
- Real-time dashboards and alerts
- Integrates, for example, with Apache Airflow, Apache Spark, Apache Kafka, and most cloud platforms
Limitations:
- Focus is more on operational health and less on deep data quality checks
- Lacks the advanced anomaly detection and schema validation found in specialized tools
// 3. Bigeye
Bigeye automates data quality monitoring through machine learning and statistical baselines.
Key strengths:
- Automatically generates hundreds of metrics for freshness, volume, and distribution
- Allows users to set and monitor data SLAs/SLOs visually
- Easy setup with minimal engineering overhead
Limitations:
- Less focus on deep lineage visualization or system-level monitoring
- Smaller feature set for diagnosing root causes compared to Monte Carlo
// 4. Soda
Soda is an open-source tool that connects directly to databases and data warehouses to test and monitor data quality in real time.
Key strengths:
- Developer-friendly with SQL-based tests that integrate into CI/CD workflows
- Open-source version available for smaller teams
- Strong collaboration and governance features
Limitations:
- Requires manual setup for complex test coverage
- Limited automation capabilities
// 5. Acceldata
Acceldata is a tool that combines data quality, performance, and cost checks.
Key strengths:
- Monitors data reliability, pipeline performance, and cloud cost metrics together
- Handles hybrid and multi-cloud environments
- Integrates easily with Spark, Hadoop, and modern data warehouses
Limitations:
- Enterprise-focused and complex setup
- Less focused on column-level data quality or anomaly detection
// 6. Anomalo
Anomalo is an AI-powered platform focused on automated anomaly detection requiring minimal configuration.
Key strengths:
- Automatically learns expected behavior from historical data, no rules needed
- Excellent for monitoring schema changes and value distributions
- Detects subtle, non-obvious anomalies at scale
Limitations:
- Limited customization and manual rule creation for advanced use cases
- Focused on detection, with fewer diagnostic or governance tools
# Conclusion
Data observability is an essential process that will make your analytics trustworthy. The process is built on five pillars: freshness, volume, schema, distribution, and data lineage.
Its thorough implementation will help your organization make fewer bad decisions, because you’ll be able to prevent many data pipeline issues and diagnose the rest faster. This improves the data team’s efficiency and enhances the trustworthiness of their insights.
Nate Rosidi is a data scientist working in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.