Fri. May 27th, 2022

Impute with “most recent” available value in a time series

I want to impute missing values in two columns (called x and y below) with the “most recent” (most recent as defined in a temporal sense) non-null value.

I have this:

| customer_id |     t      | x    | y
| ----------- | ---------- | ---- | -
| abc123      | 2021-04-01 | 0    | 633
| abc123      | 2021-05-01 | 1    | 648
| abc123      | 2021-06-01 | 2    | 653
| abc123      | 2021-07-01 | null | null   <--- missing x and y
| abc123      | 2021-08-01 | null | null   <--- missing x and y
| abc123      | 2021-09-01 | null | null   <--- missing x and y
| abc123      | 2021-10-01 | 3    | 642
| def456      | 2021-07-01 | 0    | 711
| def456      | 2021-08-01 | 1    | 719
| def456      | 2021-09-01 | 2    | 707
| def456      | 2021-10-01 | null | null   <--- missing x and y

Note how rows 4-6 and 11 are null for x and y, I want to fill them in like so:

| customer_id |     t      | x    | y
| ----------- | ---------- | ---- | -
| abc123      | 2021-04-01 | 0    | 633
| abc123      | 2021-05-01 | 1    | 648
| abc123      | 2021-06-01 | 2    | 653
| abc123      | 2021-07-01 | 2    | 653   <--- x and y filled with most recent values 
| abc123      | 2021-08-01 | 2    | 653   <--- x and y filled with most recent values
| abc123      | 2021-09-01 | 2    | 653   <--- x and y filled with most recent values
| abc123      | 2021-10-01 | 3    | 642   
| def456      | 2021-07-01 | 0    | 711
| def456      | 2021-08-01 | 1    | 719
| def456      | 2021-09-01 | 2    | 707
| def456      | 2021-10-01 | 2    | 707   <--- x and y filled with most recent values

Notice how it’s already ordered by customer_id and t. The future data I do this on may not be ordered like so.