Test/train data split overlap concern #16

@iAmGiG

Description

Problem

Both the training and testing scripts shuffle and split the data with the same call, using random_state=17. Unless the two runs are kept consistent, the test script could end up evaluating on rows that were used for training.

Code:

x_train, x_opt, x_test = np.split(df.sample(frac=1, random_state=17), ...)

Appears in:

  • train_og.py:26-27
  • test.py:37-38

Concern

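If test.py is run on the same combined dataset used during training, it will test on data the model has already seen unless its split reproduces the training split exactly. The same seed only yields the same partition when the rows, their order, and the cut points are identical in both scripts, and when test.py evaluates on x_test alone.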
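A quick way to check for this kind of leak is to intersect the row indices of the training and test slices. The sketch below is only an illustration: the toy DataFrame, the cut points at rows 8 and 5, and the helper name `overlapping_rows` are assumptions, not taken from train_og.py or test.py, and it assumes the DataFrame index is unique.

```python
import pandas as pd


def overlapping_rows(train: pd.DataFrame, test: pd.DataFrame) -> pd.Index:
    """Return the index labels present in both frames (assumes a unique index)."""
    return train.index.intersection(test.index)


# Toy stand-in for the combined dataset (hypothetical).
df = pd.DataFrame({"feature": range(10)})

# "Training" run: shuffle with the seed from the issue, cut at row 8 (assumed).
shuffled = df.sample(frac=1, random_state=17)
train_rows, test_rows = shuffled.iloc[:8], shuffled.iloc[8:]

# "Testing" run on identical data with an identical cut: same permutation, no overlap.
reshuffled = df.sample(frac=1, random_state=17)
same_cut_test = reshuffled.iloc[8:]
print(len(overlapping_rows(train_rows, same_cut_test)))  # 0

# "Testing" run whose cut point drifted to row 5: rows 5..7 were trained on.
drifted_test = reshuffled.iloc[5:]
print(len(overlapping_rows(train_rows, drifted_test)))  # 3
```

Running a check like this at the top of test.py, against indices saved during training, would turn a silent leak into a visible failure.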

Recommendation

Use proper train/test split methodology:

  • Separate hold-out test set
  • Time-based split for network traffic
  • Or different random seeds
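As one possible shape for the time-based option, here is a minimal sketch that trains on the earliest rows and holds out the most recent ones. The column name `timestamp`, the 20% test fraction, and the function name `time_based_split` are assumptions for illustration; the actual dataset may use a different time field or none at all.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str = "timestamp",
                     test_frac: float = 0.2):
    """Split chronologically: earliest rows for training, latest rows for testing."""
    ordered = df.sort_values(time_col, kind="stable")
    n_test = int(round(len(ordered) * test_frac))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy traffic log with hourly timestamps, deliberately shuffled (hypothetical data).
log = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="h"),
    "bytes": range(10),
}).sample(frac=1, random_state=17)

train_df, test_df = time_based_split(log)

# Every training row precedes every test row, so the two sets cannot share rows.
print(train_df["timestamp"].max() < test_df["timestamp"].min())
```

Writing the held-out slice to its own file once, and having test.py read only that file, would also remove the dependence on both scripts repeating the same shuffle.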

Priority

MODERATE - Could affect validity of test results

Metadata

Labels

  • archive: Related to archiving old research code
  • technical-debt: Technical debt and code quality
