Description
Incremental models using INCREMENTAL_BY_TIME_RANGE with batch_size 1 (or any batch size creating single-day intervals) generate incorrect SQL with identical start and end dates, resulting in:
Zero rows being inserted in all batches
Boundary rows at exact-midnight timestamps being lost (especially critical for rows timestamped 1 day before model execution)
Environment
SQLMesh Version: All versions (tested on 0.0.1.dev4573)
Database: All vendors (PostgreSQL, MySQL, Oracle, Snowflake, BigQuery, etc.; Db2 testing is in progress in my dev environment)
Impact: Critical - causes complete data loss in incremental models
Reproduction Steps
Note: These steps work on any database - PostgreSQL, MySQL, Snowflake, etc. The bug is in SQLMesh core, not database-specific.
- Create Source Table with Boundary Data
PostgreSQL/MySQL/Snowflake/DuckDB:
CREATE TABLE sales_source (
id INT,
customer_name VARCHAR(100),
amount DECIMAL(10,2),
updated_at TIMESTAMP
);
Oracle:
CREATE TABLE sales_source (
id NUMBER,
customer_name VARCHAR2(100),
amount NUMBER(10,2),
updated_at TIMESTAMP
);
BigQuery:
CREATE TABLE sales_source (
id INT64,
customer_name STRING,
amount NUMERIC(10,2),
updated_at TIMESTAMP
);
Db2 (my environment):
CREATE TABLE sales_source (
id INT,
customer_name VARCHAR(100),
amount DECIMAL(10,2),
updated_at TIMESTAMP
);
Insert test data (works on all databases):
-- Rows 1-4: March 1st
INSERT INTO sales_source VALUES
(1, 'Customer A', 100.00, '2026-03-01 00:00:00'),
(2, 'Customer B', 200.00, '2026-03-01 12:30:00'),
(3, 'Customer C', 150.00, '2026-03-01 18:45:00'),
(4, 'Customer D', 250.00, '2026-03-01 23:59:59');
-- Rows 5-8: March 2nd (includes boundary row at midnight)
INSERT INTO sales_source VALUES
(5, 'Customer E', 300.00, '2026-03-02 00:00:00'), -- Boundary row!
(6, 'Customer F', 175.00, '2026-03-02 09:15:00'),
(7, 'Customer G', 225.00, '2026-03-02 15:30:00'),
(8, 'Customer H', 275.00, '2026-03-02 21:45:00');
-- Continue inserting data for more days...
-- Total: 50 rows in my test case (any number works), spanning March 1-18
-- Row 50: CRITICAL BOUNDARY ROW
-- This row is timestamped at midnight on March 18th
-- When you run the model on March 19th, this row from "yesterday at midnight"
-- should be included in the March 18th batch, but the bug causes it to be lost!
INSERT INTO sales_source VALUES
(50, 'Customer Z', 500.00, '2026-03-18 00:00:00');
-- ⚠️ This is 1 day before model execution (March 19th)
-- ⚠️ Represents data that arrived exactly at midnight
-- ⚠️ This row will be LOST due to the bug!
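For convenience, the test data can be generated with a short helper. This is a sketch: the exact schedule of intermediate timestamps is my assumption; only the midnight boundary rows and row 50 (Customer Z at 2026-03-18 00:00:00) match the report exactly.

```python
from datetime import datetime, timedelta

def generate_rows(n=50, start=datetime(2026, 3, 1), days=18):
    """Generate n test rows spanning `days` days, each day opening with a
    midnight (00:00:00) boundary row, ending with the critical row 50."""
    rows = []
    for i in range(n - 1):
        day = start + timedelta(days=i * (days - 1) // (n - 1))
        ts = day + timedelta(hours=(i % 3) * 8)  # 00:00, 08:00, 16:00
        rows.append((i + 1, f"Customer {i + 1}", 100.00 + i, ts))
    # Row 50: the critical boundary row at midnight on March 18th
    rows.append((n, "Customer Z", 500.00, start + timedelta(days=days - 1)))
    return rows

rows = generate_rows()
for r in rows:
    print(f"INSERT INTO sales_source VALUES ({r[0]}, '{r[1]}', {r[2]:.2f}, "
          f"'{r[3]:%Y-%m-%d %H:%M:%S}');")
```

The emitted INSERT statements use the same `sales_source` columns as the DDL above and work on any of the listed databases.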
- Create Incremental Model
Works on all databases:
-- models/sales_incremental.sql
MODEL (
name sales_incremental,
kind INCREMENTAL_BY_TIME_RANGE (
time_column updated_at,
batch_size 1 -- Single day batches trigger the bug
)
);
SELECT
id,
customer_name,
amount,
updated_at
FROM sales_source
WHERE updated_at >= CAST(@start_ds AS TIMESTAMP)
AND updated_at < CAST(@end_ds AS TIMESTAMP);
- Run Plan on March 19th
# Running on March 19th, 2026
sqlmesh plan --auto-apply prod
Expected Behavior
Batch 1 (2026-03-01):
WHERE updated_at >= '2026-03-01' AND updated_at < '2026-03-02'
-- Should insert 4 rows
Batch 2 (2026-03-02):
WHERE updated_at >= '2026-03-02' AND updated_at < '2026-03-03'
-- Should insert 4 rows (including row 5 at midnight)
Batch 18 (2026-03-18) - The Critical Batch:
WHERE updated_at >= '2026-03-18' AND updated_at < '2026-03-19'
-- Should insert row 50 at '2026-03-18 00:00:00' ✅
-- This is yesterday's midnight data!
Total: All 50 rows inserted ✅
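The expected batching above can be sketched as plain half-open day intervals. This is a sketch of the intended semantics, not SQLMesh's actual scheduler code:

```python
from datetime import date, timedelta

def daily_batches(start: date, end: date, batch_size: int = 1):
    """Yield half-open [batch_start, batch_end) intervals covering start..end."""
    step = timedelta(days=batch_size)
    cur = start
    while cur < end:
        yield cur, min(cur + step, end)
        cur += step

batches = list(daily_batches(date(2026, 3, 1), date(2026, 3, 19)))
# 18 one-day batches; the last is [2026-03-18, 2026-03-19), which keeps
# row 50 at '2026-03-18 00:00:00' inside the final batch.
print(len(batches), batches[-1])
```

With half-open intervals, every midnight row belongs to exactly one batch, so no boundary data can be dropped or double-counted.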
Actual Behavior
All batches generate identical start/end dates:
WHERE updated_at >= '2026-03-01' AND updated_at < '2026-03-01' -- Impossible!
WHERE updated_at >= '2026-03-02' AND updated_at < '2026-03-02' -- Impossible!
WHERE updated_at >= '2026-03-18' AND updated_at < '2026-03-18' -- Impossible!
-- ... all batches have this pattern
Result:
❌ 0 rows inserted in all batches
❌ Row 50 at midnight (1 day before execution) is LOST
❌ All boundary rows at 00:00:00 are excluded
The Critical Loss:
Row 50: updated_at = '2026-03-18 00:00:00'
Model run date: 2026-03-19
Expected: Row 50 should be in the 2026-03-18 batch (yesterday's data)
Actual: Row 50 is LOST because the WHERE clause is impossible
Log Evidence
DEBUG date_dict: start_ds=2026-03-01, end_ds=2026-03-01 ❌ Both identical!
INSERT INTO ...
WHERE "UPDATED_AT" >= CAST('2026-03-01' AS TIMESTAMP)
AND "UPDATED_AT" < CAST('2026-03-01' AS TIMESTAMP)
[1/18] sales_incremental [insert 2026-03-01 - 2026-03-01 (0 rows)] ❌
[2/18] sales_incremental [insert 2026-03-02 - 2026-03-02 (0 rows)] ❌
[18/18] sales_incremental [insert 2026-03-18 - 2026-03-18 (0 rows)] ❌ Row 50 lost!
Why Row 50 Is Critical
Row 50 represents a common real-world scenario:
- Data arrives at exactly midnight (00:00:00)
- This is 1 day before the model execution
- In production, this could be:
- End-of-day batch processing
- Scheduled data loads
- Timezone boundary data
- Daily cutoff timestamps
This bug causes systematic loss of midnight boundary data!
The Problem:
- SQLMesh passes millisecond timestamps (e.g., 1772409600000 for 2026-03-02 00:00:00)
- is_date(1772409600000) returns False (integers aren't recognized as dates)
- As a result, the function doesn't add 1 day to the end date
- make_inclusive() subtracts 1 microsecond from the wrong date
- Both @start_ds and @end_ds end up with the same value
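This chain can be reproduced outside SQLMesh with a minimal sketch; `make_inclusive_end` below is a simplified stand-in for the behavior described, not SQLMesh's actual source:

```python
from datetime import datetime, timedelta, timezone

def make_inclusive_end(end: datetime) -> datetime:
    # Stand-in: the exclusive end of a range is pulled back by 1 microsecond.
    return end - timedelta(microseconds=1)

# Epoch-ms boundaries for the March 1st batch: Mar 1 00:00 and Mar 2 00:00 UTC
start_ms, end_ms = 1772323200000, 1772409600000
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
end = make_inclusive_end(datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc))

# Rendering both as %Y-%m-%d collapses them onto the same calendar day.
start_ds, end_ds = start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")
print(start_ds, end_ds)  # '2026-03-01' '2026-03-01'
# Substituted into `>= @start_ds AND < @end_ds`, this selects zero rows.
```

The 1-microsecond pullback lands the end at 2026-03-01 23:59:59.999999, which renders as the same date string as the start, producing the impossible WHERE clause seen in the logs.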
Why Boundary Rows Are Lost:
Even when the start and end dates differ, midnight timestamps (00:00:00) aren't recognized as categorical dates, so the range calculation excludes boundary rows.
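A minimal illustration of the categorical-date distinction; `is_categorical_date` and `exclusive_end` are hypothetical stand-ins for the behavior described, not SQLMesh functions:

```python
from datetime import date, datetime, timedelta

def is_categorical_date(value) -> bool:
    # Stand-in: only plain dates count; a datetime at midnight does not.
    return type(value) is date

def exclusive_end(value):
    # A categorical date D means "the whole day", so its exclusive end is D+1.
    # Anything else is treated as an exact point in time.
    if is_categorical_date(value):
        return datetime.combine(value, datetime.min.time()) + timedelta(days=1)
    return value

# A plain date covers the full day...
print(exclusive_end(date(2026, 3, 18)))      # 2026-03-19 00:00:00
# ...but a midnight datetime is treated as a point, so a row at exactly
# 2026-03-18 00:00:00 sits on the boundary and falls out of the range.
print(exclusive_end(datetime(2026, 3, 18)))  # 2026-03-18 00:00:00
```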