Data Engineer Interviews: Pipelines Under The Microscope
The Mythic Intel Team · Jan 29, 2026 · 8 min read
A data engineer interview puts your pipelines under a microscope. Panels want to know whether you can write SQL that holds up under real data, model a warehouse that answers the questions the business actually asks, and build pipelines that stay correct when data arrives late, gets reprocessed, or has to be backfilled across months of history. If you are preparing for data engineer interview questions, the recurring theme is correctness under messy, real-world conditions, not happy-path ETL.
This guide covers the rounds you will face, what each probes, and the technical depth a data engineering interview expects in 2026. The loop usually includes a recruiter screen, a SQL and data-modeling round, a pipeline or systems design round, and a behavioral conversation.
The recruiter screen
The screen confirms your stack and scope: the warehouses you have used, the orchestration tool, batch versus streaming experience, and the scale of data you have moved. Be specific about volumes, SLAs, and the worst data-quality incident you owned. Vague pipeline ownership gets exposed quickly in the technical rounds.
SQL depth
SQL is non-negotiable, and the bar is higher than for most roles. Expect live problems that go well past basic joins:
- Window functions: running totals, rank and dense_rank, lag and lead, and partitioned aggregates.
- Common table expressions and recursive CTEs for hierarchical data.
- Deduplication, gaps-and-islands, and sessionization problems.
- Query performance: reading an execution plan, spotting a full scan, and reasoning about partitioning and clustering.
-- second-most-recent order per customer, a common window-function probe
SELECT customer_id, order_id, ordered_at
FROM (
    SELECT customer_id, order_id, ordered_at,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY ordered_at DESC) AS rn
    FROM orders
) ranked
WHERE rn = 2;
Interviewers care that you reach for a window function instead of a self-join, and that you can explain the partition and ordering.
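Gaps-and-islands is the other pattern worth having in muscle memory. Below is a minimal sketch, run through Python's built-in sqlite3 module so it is easy to try locally. The logins table, its columns, and the sample rows are hypothetical, and the query assumes a SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

# Hypothetical table and data: one row per (user_id, login_date).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2026-01-01"), (1, "2026-01-02"), (1, "2026-01-03"),
        (1, "2026-01-06"), (1, "2026-01-07"),
        (2, "2026-01-02"),
    ],
)

# Consecutive dates share the same (date - row_number) anchor, so grouping
# by that anchor collapses each unbroken streak (island) into one row.
ISLANDS_SQL = """
WITH numbered AS (
    SELECT user_id,
           login_date,
           julianday(login_date)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS anchor
    FROM logins
)
SELECT user_id,
       MIN(login_date) AS streak_start,
       MAX(login_date) AS streak_end,
       COUNT(*)        AS days
FROM numbered
GROUP BY user_id, anchor
ORDER BY user_id, streak_start
"""

islands = conn.execute(ISLANDS_SQL).fetchall()
for row in islands:
    print(row)
```

The date arithmetic is the only dialect-specific part: julianday is SQLite's function, and other warehouses have their own date-difference equivalents. The row-number trick itself carries over unchanged.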
Data modeling
Dimensional modeling shows up in nearly every data engineering interview. Know the patterns cold:
- Star schema: a central fact table surrounded by denormalized dimension tables. Simple, fast for analytical queries, the default for most warehouses.
- Snowflake schema: dimensions normalized into sub-dimensions, reducing redundancy at the cost of more joins.
- Fact table grain: defining exactly what one row represents is the first decision and the one candidates most often get wrong.
- Slowly changing dimensions, especially Type 2, where you keep history by versioning dimension rows.
Expect a follow-up asking you to choose between star and snowflake for a given query pattern and justify it on join cost and maintainability.
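Type 2 is easier to explain once you have written it out. The sketch below is one way to model it in plain Python, not a warehouse implementation: the CustomerVersion shape, the city attribute, and both helper functions are hypothetical names chosen for illustration, and validity ranges are assumed to be half-open (valid_from inclusive, valid_to exclusive).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerVersion:
    customer_id: int
    city: str                  # the tracked attribute
    valid_from: date
    valid_to: Optional[date]   # None means "still current"


def apply_scd2(history, customer_id, new_city, as_of):
    """Close the current version and append a new one if the attribute changed."""
    current = next(
        (v for v in history if v.customer_id == customer_id and v.valid_to is None),
        None,
    )
    if current is not None and current.city == new_city:
        return history  # no change: re-applying the same update is a no-op
    if current is not None:
        current.valid_to = as_of  # expire the old version
    history.append(CustomerVersion(customer_id, new_city, as_of, None))
    return history


def city_on(history, customer_id, day):
    """Point-in-time lookup: which version was valid on `day`?"""
    for v in history:
        if v.customer_id == customer_id and v.valid_from <= day and (
            v.valid_to is None or day < v.valid_to
        ):
            return v.city
    return None


history = []
apply_scd2(history, 42, "Lisbon", date(2025, 1, 1))
apply_scd2(history, 42, "Berlin", date(2025, 6, 1))
apply_scd2(history, 42, "Berlin", date(2025, 6, 1))  # rerun adds no new row

print(city_on(history, 42, date(2025, 3, 1)))  # version valid in March 2025
```

In a warehouse the same logic is usually a merge that expires the current row and inserts the new version. The point-in-time lookup is what a fact table join against the dimension has to reproduce.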
Batch versus streaming
You will be asked when to use which, and the answer hinges on latency, throughput, and cost rather than novelty. Batch processes bounded data on a schedule and is simpler to reason about and backfill. Streaming processes unbounded data continuously for low-latency needs. Be ready to discuss:
- Event time versus processing time, and why the gap between them is the source of most streaming complexity.
- Watermarks and windowing for handling unbounded streams.
- Where a micro-batch or a lambda/kappa-style architecture fits.
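To make event time, watermarks, and windowing concrete, here is a toy sketch in plain Python. It is not how any particular streaming engine implements them. The 60-second tumbling window, the 30-second lateness allowance, and the function names are all assumptions for illustration, and timestamps are plain integer seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # tumbling window size (assumed)
ALLOWED_LATENESS = 30    # how far the watermark trails the max event time (assumed)


def window_start(event_time):
    return event_time - (event_time % WINDOW_SECONDS)


def process(events):
    """events: iterable of (event_time, value) in arrival (processing) order.

    Returns (per-window sums keyed by window start, events that arrived too late).
    """
    open_windows = defaultdict(int)
    closed = {}
    late = []
    watermark = float("-inf")

    for event_time, value in events:
        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            late.append((event_time, value))  # its window was already emitted
            continue
        open_windows[start] += value
        # Watermark: "we believe no event older than this is still coming".
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        for s in sorted(open_windows):
            if s + WINDOW_SECONDS <= watermark:
                closed[s] = open_windows.pop(s)

    closed.update(open_windows)  # flush whatever is still open at end of input
    return closed, late


# Arrival order differs from event-time order: 58 arrives after 70, 10 after 95.
events = [(5, 1), (50, 1), (70, 1), (58, 1), (95, 1), (10, 1)]
sums, late_events = process(events)
print(sums, late_events)
```

The event at time 58 arrives out of order but before the watermark passes its window, so it is still counted. The event at time 10 arrives after its window closed, so it is set aside and has to be handled by a correction path.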
Idempotency, late data, and backfills
This cluster is where data engineering interviews separate strong candidates, because it is where pipelines actually break.
- Idempotency: a pipeline run must produce the same result whether it runs once or five times. The common technique is to make writes idempotent with a MERGE (upsert) keyed on a stable identifier, or to overwrite a partition wholesale rather than appending. This is what makes retries and reruns safe.
- Late-arriving data: events that show up after their window has closed. Handle them by keying on event time, keeping raw immutable source data so you can reprocess, and using merge logic that updates the correct historical partition. For streams, replay from Kafka offsets or route stragglers through a dead-letter queue.
- Backfills: reprocessing historical data after a logic change or a fix. The reason idempotency matters so much is that a backfill reruns partitions that may already hold data, so the write must overwrite or merge cleanly without duplicating. Partitioned tables make this tractable by letting you target specific time windows.
A clean answer ties all three together: immutable raw data, partitioned tables, and idempotent merge writes make late data and backfills routine instead of dangerous.
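A small sketch of the idempotent-write idea, using SQLite's upsert syntax through Python's sqlite3 module as a stand-in for a warehouse MERGE. The daily_revenue table, its columns, and the load helper are hypothetical, and the ON CONFLICT clause assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_revenue (
        order_id   TEXT PRIMARY KEY,  -- stable identifier the upsert is keyed on
        event_date TEXT NOT NULL,     -- derived from event time, not load time
        amount     REAL NOT NULL
    )
    """
)

UPSERT_SQL = """
INSERT INTO daily_revenue (order_id, event_date, amount)
VALUES (?, ?, ?)
ON CONFLICT (order_id) DO UPDATE SET
    event_date = excluded.event_date,
    amount     = excluded.amount
"""


def load(rows):
    """Write a batch; safe to call repeatedly with the same rows."""
    with conn:  # one transaction per batch: all rows land or none do
        conn.executemany(UPSERT_SQL, rows)


batch = [("o-1", "2026-01-10", 20.0), ("o-2", "2026-01-10", 35.0)]
load(batch)
load(batch)                           # retry or backfill: no duplicates
load([("o-3", "2026-01-10", 5.0)])    # late event lands on its event-time date

totals = conn.execute(
    "SELECT event_date, COUNT(*), SUM(amount) FROM daily_revenue GROUP BY event_date"
).fetchall()
print(totals)
```

Had load used a plain INSERT into a table without the key, the second call would have doubled the day's revenue. That failure mode is the one interviewers are listening for.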
Orchestration with Airflow
Orchestration questions usually center on Airflow. Know the model:
- DAGs as the unit of a pipeline, tasks as nodes, and dependencies as edges.
- Why tasks should be idempotent and atomic, so a retry or a backfill does not corrupt state.
- Scheduling, execution dates (called logical dates in newer Airflow releases), and how Airflow's backfill mechanics line up with partitioned outputs.
- Sensors, retries, and SLAs, plus how you alert when a pipeline misses its window.
Modern variants like Dagster and Prefect come up, but the concepts transfer.
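The sketch below shows the underlying model without using Airflow at all: tasks as nodes, dependencies as edges, and retries wrapped around each task. It relies only on Python's standard graphlib module (Python 3.9 or newer). The task names, the run_pipeline helper, and the flaky task are hypothetical, and a real orchestrator adds scheduling, state, and parallelism on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_raw": {"extract_orders", "extract_customers"},
    "transform": {"load_raw"},
    "publish": {"transform"},
}


def run_pipeline(deps, run_task, max_retries=2):
    """Run tasks in dependency order, retrying each one up to max_retries times."""
    completed = []
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    for task in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                run_task(task)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: fail the run, downstream tasks never start
        completed.append(task)
    return completed


attempts = {}


def flaky_task(name):
    """Stand-in task body that fails once for load_raw, then succeeds."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "load_raw" and attempts[name] == 1:
        raise RuntimeError("transient failure")


order = run_pipeline(dependencies, flaky_task)
print(order, attempts["load_raw"])
```

The retry only stays safe because each task can run twice without changing the outcome, which is why orchestration questions keep circling back to idempotent tasks.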
dbt and the warehouse
dbt is now standard for transformation, so expect questions on it. Cover:
- The ELT model: load raw data into the warehouse first, then transform with dbt using SQL and Jinja.
- Models, refs, and the dependency graph dbt builds from them.
- Incremental models, which process only new or changed rows instead of full reloads, and how they pair with merge logic for idempotency.
- Tests and documentation as first-class parts of the pipeline, since data quality is the job.
Warehouse-specific knowledge (Snowflake, BigQuery, Redshift) helps: partitioning, clustering, columnar storage, and separation of storage and compute.
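The incremental idea can be sketched outside dbt as a high-water-mark filter followed by a keyed merge. This is not dbt syntax or dbt's generated SQL, only the shape of the logic, run through Python's sqlite3 module. The raw_orders and fct_orders tables and their columns are hypothetical, timestamps are assumed to be ISO-8601 strings that sort correctly as text, and the upsert assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id TEXT, status TEXT, updated_at TEXT);
    CREATE TABLE fct_orders (order_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT);
    """
)

# Only rows newer than what the target already holds are read, then merged by key.
INCREMENTAL_SQL = """
INSERT INTO fct_orders (order_id, status, updated_at)
SELECT order_id, status, updated_at
FROM raw_orders
WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '') FROM fct_orders)
ON CONFLICT (order_id) DO UPDATE SET
    status     = excluded.status,
    updated_at = excluded.updated_at
"""


def run_incremental():
    with conn:
        conn.execute(INCREMENTAL_SQL)


conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o-1", "placed", "2026-01-10T09:00"), ("o-2", "placed", "2026-01-10T10:00")],
)
run_incremental()  # first run: target is empty, so every raw row qualifies

conn.execute("INSERT INTO raw_orders VALUES ('o-1', 'shipped', '2026-01-11T08:00')")
run_incremental()  # second run: only the new o-1 row passes the filter
run_incremental()  # rerun: nothing newer than the high-water mark, no change

result = conn.execute(
    "SELECT order_id, status, updated_at FROM fct_orders ORDER BY order_id"
).fetchall()
print(result)
```

A strict high-water mark skips rows that arrive late with an older timestamp, so a common refinement is to re-read a lookback window and let the merge absorb the overlap.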
Behavioral
The behavioral round probes how you handle a data-quality incident: a dashboard showing wrong numbers, a silent pipeline failure, a schema change upstream that broke everything downstream. Walk the detection, the fix, and the safeguard you added so it cannot recur. Ownership of correctness is what panels want to hear.
Rehearse out loud
The idempotency and late-data answers are the ones that tangle when spoken cold, because they connect several ideas at once. Practice saying your backfill story end to end, out loud, until immutable raw data, partitioning, and merge writes come out as one coherent argument. A voice-driven trainer like Mythic Intel can build a verified data-engineering room and grade your spoken answers on accuracy, completeness, and structure.