
Evaluation code for the univariate forecasting task may not be right: there is information leakage. #35

Open
wiwi opened this issue May 26, 2023 · 4 comments

Comments

@wiwi

wiwi commented May 26, 2023

As the title says, the univariate forecasting results provided by this repository and the related paper may not be right, because the evaluation code for univariate forecasting leads to information leakage. The following figure shows the evaluation code for univariate forecasting in this repository.

[Figure: screenshot of the univariate forecasting evaluation code]

The main problem is that eval_forecasting() encodes the data with a sliding window (sliding step = 1) and only then splits the result into training/validation/test sets. Hence, it can lead to information leakage. The general way to do this is to split the data first and then encode each split separately.

This bug was found in another repository that inherits the evaluation code of TS2Vec, where we found the unfair evaluation contributed about a 10% improvement on the Electricity dataset. We then checked the evaluation code for univariate forecasting in this repository and found the same operation.

So, I hope the authors check the evaluation code for univariate forecasting, and especially the results presented in the TS2Vec paper, because those results may be wrong. If this problem is confirmed, I hope the authors will revise the results in the paper to avoid unfair comparisons.
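For concreteness, here is a minimal sketch of the two pipelines I am contrasting; encode() here is just an illustrative stand-in, not this repository's actual code:

# Illustrative sketch only: encode() is a dummy stand-in for the model.
import numpy as np

def encode(x):
    # Dummy per-timestamp "representation": cumulative sum along time.
    return np.cumsum(x, axis=1)

data = np.random.randn(1, 100, 1)            # (instances, timestamps, features)
train_slice, test_slice = slice(0, 60), slice(60, 100)

# Pipeline A (this repository): encode the whole series once,
# then split the per-timestamp representations.
all_repr = encode(data)
train_repr_a = all_repr[:, train_slice]
test_repr_a = all_repr[:, test_slice]

# Pipeline B (what I expected): split the raw data first,
# then encode each split independently.
train_repr_b = encode(data[:, train_slice])
test_repr_b = encode(data[:, test_slice])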

@zhihanyue
Owner

zhihanyue commented May 26, 2023

The figure you provided is wrong. We perform causal inference: for timestamp t, the sliding window spans from t-T+1 to t. Therefore, there is no information leakage. The following is the corrected figure.
[Figure: corrected diagram of the causal sliding-window encoding]
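To make the windowing concrete, here is a toy sketch (not our encoder, just the indexing) showing that the window producing the representation at timestamp t covers [t-T+1, t] only:

# Toy sketch of the causal sliding window (not the repository's encoder).
import numpy as np

def causal_windows(x, T):
    # Row t holds x[t-T+1 .. t], left-padded with zeros near the start,
    # so no row ever contains a value from after its own timestamp.
    padded = np.concatenate([np.zeros(T - 1), x])
    return np.stack([padded[t : t + T] for t in range(len(x))])

x = np.arange(10, dtype=float)
print(causal_windows(x, T=4)[5])   # [2. 3. 4. 5.] -> values at or before t=5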

@zhihanyue
Owner

zhihanyue commented May 26, 2023

I have checked your other issue xingyu617/SimTS_Representation_Learning#5 and I think I understand your misunderstanding. The code you provided there is inappropriate.

# original code
# all_repr = model.encode(
#     data[:, :, n_covariate_cols:],
#     casual=True,
#     sliding_length=1,
#     sliding_padding=padding,
#     batch_size=128
# )
# print("all_repr",data.shape, all_repr.shape)
# train_repr = all_repr[:, train_slice]
# valid_repr = all_repr[:, valid_slice]
# test_repr = all_repr[:, test_slice]

# your code
train_data = data[:, train_slice, n_covariate_cols:]
valid_data = data[:, valid_slice, n_covariate_cols:]
test_data = data[:, test_slice, n_covariate_cols:]
print("data shape:", train_data.shape, valid_data.shape, test_data.shape)

train_repr = model.encode(
    train_data,
    casual=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)
valid_repr = model.encode(
    valid_data,
    casual=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)
test_repr = model.encode(
    test_data,
    casual=True,
    sliding_length=1,
    sliding_padding=padding,
    batch_size=128
)

For example, assume t is the first timestamp in the test set. For the first sample in the test set, the original input is [t-padding, t]. However, your input is [t, t], which feeds only one timestamp as input, resulting in poor performance and a biased distribution.
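To put concrete (hypothetical) numbers on it:

# Hypothetical numbers: padding = 200, test split starts at t0 = 1000.
padding, t0 = 200, 1000

# Encoding the full series (original code): the first test representation
# is computed from the window [t0 - padding, t0] -- past observations only,
# so no label information leaks.
print("full-series window:", (t0 - padding, t0))   # (800, 1000)

# Encoding the test split alone (your code): nothing before t0 exists in
# the input, so the window collapses to a single timestamp.
print("split-only window:", (t0, t0))              # (1000, 1000)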

@wiwi
Author

wiwi commented May 26, 2023

For example, assume t is the first timestamp in the test set. For the first sample in the test set, the original input is [t-padding, t]. However, your input is [t, t], which feeds only one timestamp as input, resulting in poor performance and a biased distribution.

You mean that you use padding to separate the training set and the test set. But I don't think I misunderstood, because the whole sequence is fed into eval_forecasting().
In train.py:

elif task_type == 'forecasting':
    out, eval_res = tasks.eval_forecasting(model, data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols)

Here, 'data' comes from:

elif args.loader == 'forecast_csv_univar':
    task_type = 'forecasting'
    data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols = datautils.load_forecast_csv(args.dataset, univar=True)
    train_data = data[:, train_slice]

If you still think I misunderstand, please point me to the code that ensures the training set and the test set are independent. Thanks.

@zhihanyue
Owner

zhihanyue commented May 26, 2023

1. Please read the code of model.encode() carefully; you can confirm that the real behavior matches our corrected figure.
2. I have already explained why your code is inappropriate. The following is what I think you misunderstood:

For example, assume t is the first timestamp in the test set. For the first sample in the test set, the original input is [t-padding, t]. However, your input is [t, t], which feeds only one timestamp as input, resulting in poor performance and a biased distribution.

We confirm that our code ensures the following behavior:

  • Behavior 1: In our implementation, assuming t denotes the first timestamp in the test set, the input of the first sample in the test set is [t-padding, t], and the label of the last sample in the validation set is [t-pred_len, t-1]. Only the information in "the input" is used to generate the predicted values.

From Behavior 1, we conclude that there is no data leakage. If you think Behavior 1 does not hold, please show evidence rather than asking for a code walkthrough. I'm sorry, I don't have that much time.

Note that we stand behind Behavior 1. This means that if you find a specific case where Behavior 1 does not hold, please reopen the issue. Thanks.
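If you want to check Behavior 1 empirically, a sketch along these lines should work inside eval_forecasting(), where model, data, test_slice, padding, and n_covariate_cols are already in scope (assuming test_slice is a Python slice and encode() returns a NumPy array):

# Sketch of an empirical causality check: corrupt everything at and after
# the first test timestamp; representations before it must not change.
import numpy as np

t = test_slice.start
corrupted = data.copy()
corrupted[:, t:] = np.random.randn(*corrupted[:, t:].shape)

kwargs = dict(casual=True, sliding_length=1, sliding_padding=padding, batch_size=128)
repr_orig = model.encode(data[:, :, n_covariate_cols:], **kwargs)
repr_corr = model.encode(corrupted[:, :, n_covariate_cols:], **kwargs)

# If no future information leaks into past representations, this holds.
assert np.allclose(repr_orig[:, :t], repr_corr[:, :t], atol=1e-5)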
