There was a nice stack overflow question today, on one of my favorite SQL themes of overlapping date ranges.
I was quite pleased with my “find distinct’s, join, and pick the winners per block” code:
with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
select * from values
('A', 1.00, '2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
('A', 2.00, '2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
('A', 3.00, '2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
('A', 4.00, '2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
('B', 4.00, '2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), dist_dates as (
select distinct item_id, start_date from data
union
select distinct item_id, end_date from data
), id_date_ranges as (
select
item_id,
start_date,
lead(start_date) over (partition by item_id
order by start_date) as end_date
from dist_dates
qualify end_date is not null
)
select
dr.*,
d.price,
d.loaded_datetime
from id_date_ranges as dr
join data as d
on d.item_id = dr.item_id
and dr.start_date < d.end_date
and d.start_date < dr.end_date
qualify row_number() over (partition by dr.item_id, dr.start_date
order by d.loaded_datetime desc) = 1
order by 1,2;
But if you use highly overlapping data like the above, you get results like:
This could be worked around with a blind merge, but that might merge rows that appear the same but where not part of the same event/streak to be merged. The workaround is to allocate a sequence value to each row, before chopping, and then after the join, only merges rows of the same source id.
so using SQE8() and CONDITIONAL_CHANGE_EVENT we get:
with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
select * from values
('A', 1.00, '2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
('A', 2.00, '2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
('A', 3.00, '2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
('A', 4.00, '2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
('B', 4.00, '2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), sdata as (
select
*,
seq8() as seq_id
from data
), dist_dates as (
select distinct item_id, start_date from data
union
select distinct item_id, end_date from data
), id_date_ranges as (
select
item_id,
start_date,
lead(start_date) over (partition by item_id
order by start_date) as end_date
from dist_dates
qualify end_date is not null
), over_sliced as (
select
dr.*,
d.price,
d.loaded_datetime
,d.seq_id
from id_date_ranges as dr
join sdata as d
on d.item_id = dr.item_id
and dr.start_date < d.end_date
and d.start_date < dr.end_date
qualify row_number() over (partition by dr.item_id, dr.start_date
order by d.loaded_datetime desc) = 1
)
select
*
,conditional_change_event(seq_id) over (partition by item_id
order by start_date) as cce
from over_sliced
order by 1,2;
So now we can min/max those dates, to get a single row back:
with data (ITEM_ID, PRICE, START_DATE, END_DATE, LOADED_DATETIME) as (
select * from values
('A', 1.00, '2023-01-01'::date, '2023-01-31'::date, '2023-11-01'::date),
('A', 2.00, '2023-01-10'::date, '2023-01-15'::date, '2023-11-02'::date),
('A', 3.00, '2023-01-14'::date, '2023-01-17'::date, '2023-11-03'::date),
('A', 4.00, '2023-01-15'::date, '2023-01-18'::date, '2023-11-04'::date),
('B', 4.00, '2023-01-10'::date, '2023-01-31'::date, '2023-11-01'::date)
), sdata as (
select
*,
seq8() as seq_id
from data
), dist_dates as (
select distinct item_id, start_date from data
union
select distinct item_id, end_date from data
), id_date_ranges as (
select
item_id,
start_date,
lead(start_date) over (partition by item_id
order by start_date) as end_date
from dist_dates
qualify end_date is not null
), over_sliced as (
select
dr.*,
d.price,
d.loaded_datetime
,d.seq_id
from id_date_ranges as dr
join sdata as d
on d.item_id = dr.item_id
and dr.start_date < d.end_date
and d.start_date < dr.end_date
qualify row_number() over (partition by dr.item_id, dr.start_date
order by d.loaded_datetime desc) = 1
), slices_grouped as (
select
*
,conditional_change_event(seq_id) over (partition by item_id
order by start_date) as cce
from over_sliced
)
select
item_id
,min(start_date) as start_date
,max(end_date) as end_date
,any_value(price) as price
,any_value(loaded_datetime) as loaded_datetime
from slices_grouped
group by item_id, cce
order by 1,2;