Data Leakage #249

nova-land · 2023-05-28T12:22:16Z

Using tf.keras.utils.normalize on the whole dataset will produce invalid test results, because the test data is normalised together with the training data.

An evaluation script is required to verify the accuracy of the model.

kyleskom · 2023-06-04T15:48:03Z

I don't understand what the issue here is?

chriseling · 2023-06-04T17:56:54Z

I think the worry is that normalize is applied to the whole dataset, which could potentially overfit the model because the validation data is normalized along with the training data. Best, Chris

kyleskom · 2023-06-05T15:18:26Z

I'll take a look when I revisit this next season.

kyleskom · 2023-10-08T22:38:49Z

Hi, I'm looking for more info on what the potential fix for this would be. Thank you.

nova-land · 2023-10-09T11:29:53Z

You will need to split the data into train and test sets before you use tf.keras.utils.normalize. Normally, though, you would use a scaler from scikit-learn: split the data into train and test sets, fit the scaler on the train data only, then transform both the train and test data with it.
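A minimal sketch of the fit-on-train, transform-both pattern described above. It assumes scikit-learn's MinMaxScaler and train_test_split; the feature matrix is made-up toy data, not this project's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (hypothetical values, for illustration only).
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)

# Split first, so the test rows never influence the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train min/max
```

With this pattern, test values can land slightly outside the 0 to 1 range when they fall outside the range seen in training. That is expected and is not a sign of a bug.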

STRATZ-Ken · 2023-10-11T15:33:31Z

I am not sure I agree with @nova-land. The idea of normalize is to scale the entire dataset in the same way. Imagine you have a dataset with values of [3, 1, 0.50] and you normalize it. It would change to [1, 0.33, 0.167]. If your next dataset has a higher value, everything would be rescaled based on the highest value in the column.
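To make the arithmetic concrete, here is a small sketch that assumes "normalize" means dividing each value by the column maximum, as in the example in this comment. The helper name scale_by_max is made up for illustration. It shows that the scaled values of the first three entries change once a larger value joins the column.

```python
def scale_by_max(values, max_value=None):
    """Divide every value by max_value (defaults to the largest value given)."""
    if max_value is None:
        max_value = max(values)
    return [round(v / max_value, 3) for v in values]


first = [3, 1, 0.5]
scaled_alone = scale_by_max(first)           # every value divided by 3
scaled_with_new = scale_by_max(first + [6])  # a larger value rescales everything
```

Passing a fixed max_value (for example, one computed from the training data) keeps the scaling stable regardless of which values arrive later.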

There are Keras layers that will normalize the data inside the model itself, which would not require this function to be called. Or you can normalize the data when it comes in by setting max values. For example, if a player scores 56 points and your goal is to predict how many points a player is going to score from 0 to 50 (you're force-normalizing here), then the max he can score is 50. Just an example.

I am not an expert here, but you have to make sure you have this code in your training script. Then, when you're ready to predict, you load the saved scaler and send the prediction inputs through the normalize function as well.
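The "setting max values" idea can be sketched as a clamp followed by a division. The function name clip_and_scale and the default cap of 50 are illustrative assumptions taken from the example in this comment, not code from the project.

```python
def clip_and_scale(points, cap=50.0):
    """Clamp points to the range [0, cap], then scale to the range 0..1."""
    clipped = min(max(points, 0.0), cap)
    return clipped / cap


capped = clip_and_scale(56)   # 56 is capped at 50 before scaling
halfway = clip_and_scale(25)  # 25 out of a cap of 50
```

Because the cap is a fixed constant rather than a statistic computed from the data, it is the same at training and prediction time.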

import os
import joblib
if not os.path.exists(model_dir + '/scaler.pkl'):  # save the fitted scaler only once
    joblib.dump(min_max_scaler, model_dir + '/scaler.pkl')
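A hedged sketch of the full round trip: save the fitted scaler at training time, then load it at prediction time. It assumes min_max_scaler is a scikit-learn MinMaxScaler; the temporary directory and the toy values stand in for the real model directory and data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

model_dir = tempfile.mkdtemp()  # stand-in for the real model directory
scaler_path = os.path.join(model_dir, "scaler.pkl")

# Training time: fit on training features only, then persist the scaler.
X_train = np.array([[3.0], [1.0], [0.5]])
min_max_scaler = MinMaxScaler().fit(X_train)
if not os.path.exists(scaler_path):
    joblib.dump(min_max_scaler, scaler_path)

# Prediction time: load the saved scaler and apply it to new inputs.
loaded_scaler = joblib.load(scaler_path)
X_new = np.array([[2.0]])
X_new_scaled = loaded_scaler.transform(X_new)
```

Note that the os.path.exists guard means a stale scaler.pkl is never overwritten, so after retraining on new data the old file would need to be deleted first.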

STRATZ-Ken · 2023-10-11T15:39:23Z

Here is information on the normalize layer. You would add this before your first dense layer, this will normalize the incoming data and store its weights inside the model file itself. Then you would not have to make any changes to the data or even call MinMax normalize within the file itself.

https://keras.io/api/layers/normalization_layers/batch_normalization/

Also worth noting, this is for the NN model, not XGBoost.
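As a rough sketch of that suggestion, the model below places a BatchNormalization layer before the first dense layer. The feature count, layer sizes, and random data are placeholders, not the project's actual architecture or dataset.

```python
import numpy as np
import tensorflow as tf

n_features = 3  # hypothetical feature count

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.BatchNormalization(),  # normalizes the incoming features
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random placeholder data, only to show that the model trains and predicts.
X_train = np.random.rand(16, n_features).astype("float32")
y_train = np.random.randint(0, 2, size=(16, 1)).astype("float32")
model.fit(X_train, y_train, epochs=1, verbose=0)

preds = model.predict(X_train[:4], verbose=0)
```

The layer's moving mean and variance are updated only during fit, so they come from the training data and are saved with the model. Keras also provides tf.keras.layers.Normalization, whose adapt() method computes fixed statistics from data you pass in; that may be closer to a plain feature scaler.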

Gxent · 2023-10-11T15:47:08Z

But then which would be better, the XGBoost or the NN model?

STRATZ-Ken · 2023-10-11T15:49:00Z

Better is not a good word at all to use in models. There are a million factors. That question cannot be answered.

Gxent · 2023-10-11T15:54:04Z

Okay, put another way. Which model's probability would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

Gxent · 2023-10-11T15:55:06Z

so I always relied on over and under

cafeTechne · 2024-01-02T02:12:22Z

Okay, put another way. Which model's probability would be closest? I made $2,000 in two weeks via XGBoost with just a $10 stake at the end of the season in May, and I didn't pay attention to the NN model...

How's this working out for you now?

Gxent · 2024-01-02T02:14:42Z

this year wasn't so good

cafeTechne · 2024-01-02T02:16:28Z

this year wasn't so good

So you're not seeing 55% win rates with this strategy?
