All posts

Data Observability: A Practical Approach

Data observability tracks quality, freshness, and volume at every stage of the pipeline. How to implement practical monitoring without changing your stack.

Nov 7, 2024

Data observability dashboard with data quality metrics, freshness metrics, and pipeline alerts

We know how important it is to implement data observability mechanisms. In this article, we’ll cover a quick and simple way to put that into practice. Best of all, with very little change to your stack or integration flows.

Let’s start with a medallion architecture approach, where we have multiple layers that represent data from the rawest (in green) to analysis-ready (in orange).

It’s useful to understand that each layer in the medallion architecture acts as a safety level—a kind of logical separation in our analytics environment. This way, if an error happens in a rawer layer, we can prevent dependent layers from showing the same error. In a practical case: if we have an ETL error, it’s better for your dashboard to not show today’s data than to show no information at all.

A great way to implement data modeling following medallion architecture is DBT Core. This project provides a range of tools that speed up development, such as data models that can be materialized in different formats, extensive function customization, documentation, data lineage, and several ways to run tests on our data. This last capability is what we’ll explore in this article.

There are two main ways to define data tests in DBT. The first is called singular data tests, and it is nothing more than an SQL query that returns invalid records from the perspective of the test (if the result is empty, the test passes). To implement this type of test, just create a file in your DBT project’s tests directory, like the following example that tests sales with a positive total amount:

-- ./testes/validar_valor_total_positivo.sql
select
    venda_id,
    sum(valor) as valor_total
from {{ ref('fct_vendas' )}}
group by 1
having valor_total < 0

However, there are data tests that can be applied across multiple models, right? To handle this, we use generic tests, which are created using jinja in a format similar to macro creation. A null-check test implementation would look like this:

{% test nao_nulo(tabela, coluna) %}

    select *
    from {{ tabela }}
    where {{ coluna }} is null

{% endtest %}

For this test to run on the model and column we want, you need to include it in the project schema YML. To test whether there are sales records with a null identifier (venda_id), we would implement something like:

version: 2

models:
  - name: fct_vendas
    columns:
      - name: venda_id
        tests

Finally, we use the following command to validate the tests configured in the project.

I want to emphasize that thinking through every data testing use case can be a lot of work, and many times when I implemented tests with DBT, I felt like I was reinventing the wheel. To handle this, I suggest using the dbt_expectations package, which brings the data observability power of Great Expectations to the practicality of DBT.

To do this, just add the package to the dbt project’s packages.yml file:

packages:
  -

Now we have access to a large number of generic tests for different scenarios, such as:

Table formats;
Data types, null values, and unique values;
Values within ranges or sets;
Aggregation functions;
Multiple columns;
Distribution functions;
String comparison.

The best part is that we use it exactly the same way we use generic tests: just declare the test to be executed for the desired column in the schema YML. The null-value test we built earlier can be implemented as follows:

version: 2

models:
  - name: fct_vendas
    columns:
      - name: venda_id
        tests

For more details on which tests can be used, I recommend checking the dbt_expectations documentation.

Let’s start with a medallion architecture approach, where we have multiple layers that represent data from the rawest (in green) to analysis-ready (in orange).

-- ./testes/validar_valor_total_positivo.sql
select
    venda_id,
    sum(valor) as valor_total
from {{ ref('fct_vendas' )}}
group by 1
having valor_total < 0

{% test nao_nulo(tabela, coluna) %}

    select *
    from {{ tabela }}
    where {{ coluna }} is null

{% endtest %}

version: 2

models:
  - name: fct_vendas
    columns:
      - name: venda_id
        tests

Finally, we use the following command to validate the tests configured in the project.

To do this, just add the package to the dbt project’s packages.yml file:

packages:
  -

Now we have access to a large number of generic tests for different scenarios, such as:

Table formats;
Data types, null values, and unique values;
Values within ranges or sets;
Aggregation functions;
Multiple columns;
Distribution functions;
String comparison.

version: 2

models:
  - name: fct_vendas
    columns:
      - name: venda_id
        tests

For more details on which tests can be used, I recommend checking the dbt_expectations documentation.

Ingest data into your data warehouse - reliably

Start a free trial