
Hidden Costs of In-House ETL Solutions

In-house ETL costs more than it seems: maintenance, incidents, and technical debt all add up. What to consider before deciding to build pipelines in-house.

Sep 18, 2024

Iceberg representing the hidden costs of in-house ETL solutions

What are the costs of having in-house ETL solutions?

In sales meetings, people often compare their cloud costs with the price of our subscription, as if cloud spend were the only cost of doing it in-house. That’s why I decided to talk about the hidden costs of building and maintaining in-house data pipelines.

First, a few important disclaimers before people come for me in the comments:

I’m not bashing internal libs that I know you DEs are addicted to using or building
I’m also not criticizing Airflow or using DAGs
I’m specifically saying that the data ingestion process can be optimized.

So let’s go, starting with the basics.

⏰ Team hours:

It’s easy to forget to account for your team’s hourly cost when building and maintaining data pipelines. How many hours will someone on your data team spend on them each month?

🔍 Hiring and retention:

Data engineers are hard to find and expensive, great ones are rare, and let’s face it: you’re competing with international companies to hire and retain them. Beyond the engineers’ salaries, building this team brings recruiting and hiring costs that are rarely taken into account.

🔁 Turnover:

Data teams often have high turnover. Who hasn’t had to fix a stone-age pipeline built by someone who’s no longer at the company, with zero documentation of what was done? Many data teams lose that knowledge entirely when someone leaves, because no one else knew how the process worked. Of course, good docs would solve the problem, but in reality there are so many demands that teams leave documentation for later 😩 Another point: engineers like to be challenged, and no one can stand doing tiny ETL tasks forever.

⚙️ Pipeline maintenance:

Your team needs to keep pipelines running (maintenance is forever), and that means keeping up with API changes in every system your company uses. Are you ready to rebuild the Google Ads integration every 3 months? Outsourcing this problem sounds like a much better alternative.

🔥 Firefighting:

Not to mention the surprise maintenance incidents: CxOs coming in hot because the dashboard didn’t refresh. And there goes another day debugging the pipeline to understand what happened, derailing the entire development plan.

👎 Downtime:

Speaking of firefighting, how long does it take to detect and fix errors? Obviously, there’s no world without downtime, but ideally it should go unnoticed by the people consuming the data, because your team detects the issue and resolves it before anyone complains. Data observability makes ALL the difference.
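To make "detect it before anyone complains" concrete, here is a minimal sketch of a freshness check in Python. It is only an illustration: the table names, the timestamps, and the 24-hour threshold are all hypothetical, and in a real setup the last-load times would come from your own pipeline metadata rather than a hard-coded dictionary.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_loaded, max_age, now=None):
    """Return the names of tables whose last load is older than max_age, most stale first."""
    now = now or datetime.now(timezone.utc)
    overdue = {name: now - ts for name, ts in last_loaded.items() if now - ts > max_age}
    return sorted(overdue, key=overdue.get, reverse=True)


# Hypothetical load timestamps (assumption: you record these somewhere per table).
now = datetime(2024, 9, 18, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "ads_spend": now - timedelta(hours=30),
    "crm_contacts": now - timedelta(hours=2),
    "billing_invoices": now - timedelta(hours=50),
}

# Assumed freshness target: every table should have loaded in the last 24 hours.
late = stale_tables(last_loaded, max_age=timedelta(hours=24), now=now)
print(late)  # ['billing_invoices', 'ads_spend']
```

Run on a schedule and wired to an alert, a check like this is what lets your team hear about a late table before the CxO does.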

💰 Opportunity cost:

Would you really rather have an engineer reading API docs than solving a core business problem? They could be ingesting a business-specific data source, like raw open-data files, building and optimizing transformations inside your analytics environment, or even optimizing the embeddings generation process if you use LLMs.

Of course, in some contexts doing this internally makes sense, at least for some data sources. But in most cases, the hidden costs end up not justifying the choice. Evaluate it with your data team, finance, etc., before making a decision, and don’t rush. In your company’s best interest, don’t ignore the indirect costs.
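If you want a starting point for that conversation with finance, here is a rough sketch of the arithmetic in Python. Every number below is a placeholder, not a benchmark, and the `InHouseCosts` class and its field names are made up for illustration. Replace them with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class InHouseCosts:
    """Monthly inputs for in-house pipelines; every field is an estimate you supply."""
    cloud_cost: float          # infrastructure bill
    build_hours: float         # hours spent building new pipelines
    maintenance_hours: float   # hours spent keeping up with API changes
    firefighting_hours: float  # hours spent on unplanned incidents
    hourly_rate: float         # loaded hourly cost of an engineer
    hiring_cost: float = 0.0   # recruiting cost, amortized per month

    def labor_cost(self):
        hours = self.build_hours + self.maintenance_hours + self.firefighting_hours
        return hours * self.hourly_rate

    def total(self):
        return self.cloud_cost + self.labor_cost() + self.hiring_cost


# Placeholder figures only (assumptions, not real data).
costs = InHouseCosts(
    cloud_cost=500.0,
    build_hours=20,
    maintenance_hours=15,
    firefighting_hours=5,
    hourly_rate=60.0,
    hiring_cost=300.0,
)

print(costs.labor_cost())  # 2400.0
print(costs.total())       # 3200.0
```

With these made-up inputs, the cloud bill is the smallest part of the total, which is the point: comparing only cloud cost with a subscription price leaves most of the picture out. Opportunity cost and turnover are not modeled here because they are harder to put a number on, but they belong in the discussion too.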

If you’re unsure whether to build or buy data pipelines, get in touch with our team!

Gelson Bagetti
