Impute with “most recent” available value in a time series
I want to impute missing values in two columns (called x
and y
below) with the “most recent” (most recent as defined in a temporal sense) non-null value.
I have this:
| customer_id | t | x | y
| ----------- | ---------- | ---- | -
| abc123 | 2021-04-01 | 0 | 633
| abc123 | 2021-05-01 | 1 | 648
| abc123 | 2021-06-01 | 2 | 653
| abc123 | 2021-07-01 | null | null <--- missing x and y
| abc123 | 2021-08-01 | null | null <--- missing x and y
| abc123 | 2021-09-01 | null | null <--- missing x and y
| abc123 | 2021-10-01 | 3 | 642
| def456 | 2021-07-01 | 0 | 711
| def456 | 2021-08-01 | 1 | 719
| def456 | 2021-09-01 | 2 | 707
| def456 | 2021-10-01 | null | null <--- missing x and y
Note how rows 4-6 and 11 are null
for x
and y
, I want to fill them in like so:
| customer_id | t | x | y
| ----------- | ---------- | ---- | -
| abc123 | 2021-04-01 | 0 | 633
| abc123 | 2021-05-01 | 1 | 648
| abc123 | 2021-06-01 | 2 | 653
| abc123 | 2021-07-01 | 2 | 653 <--- x and y filled with most recent values
| abc123 | 2021-08-01 | 2 | 653 <--- x and y filled with most recent values
| abc123 | 2021-09-01 | 2 | 653 <--- x and y filled with most recent values
| abc123 | 2021-10-01 | 3 | 642
| def456 | 2021-07-01 | 0 | 711
| def456 | 2021-08-01 | 1 | 719
| def456 | 2021-09-01 | 2 | 707
| def456 | 2021-10-01 | 2 | 707 <--- x and y filled with most recent values
Notice how it’s already ordered by customer_id and t. The future data I do this on may not be ordered like so.