[QUESTION] Chapter 2: Definition of `similarities` is subject to information leakage? #125

liganega · 2024-03-19T12:08:02Z

This question refers to the Jupyter notebook for Chapter 2.

===

The code below creates 10 new similarity features based on the location of the districts.
However, it also passes "median_house_value" as the sample weight.

housing_labels = strat_train_set["median_house_value"].copy()
...
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

Isn't this a form of information leakage into the model?
The model is trained to predict the median house value, so it should NOT have any direct information about it.


liganega · 2024-03-19T12:25:21Z

Using "median_house_value" as the sample weight makes no sense, because it will not be available when making predictions in the future.
The "median_income" feature, on the other hand, would be a suitable sample weight.

liganega · 2024-03-31T11:15:00Z

In fact, the sample_weight option is used only to demonstrate how to use the ClusterSimilarity class and is not used after that. There is therefore no information leakage during training.

However, using "median_house_value" as the value for sample_weight is still misleading. Using "median_income" instead results in almost the same clustering.
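
To check the point about training, here is a minimal sketch, not the notebook's actual code. The ClusterSimilarity class is my reconstruction of the one in the notebook (an assumption), the `received_weights_` attribute is a hypothetical flag added only for this check, and the data is random stand-in data rather than the housing set. It shows that the weights reach KMeans only when sample_weight is passed explicitly, and that a plain pipeline `fit(X, y)` call does not forward the labels as weights.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import make_pipeline


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Assumed reconstruction of the notebook's transformer."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Hypothetical flag, added only to record whether weights arrived.
        self.received_weights_ = sample_weight is not None
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self

    def transform(self, X):
        # RBF similarity of each sample to each cluster center.
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)


# Random stand-ins for (latitude, longitude) and median_house_value.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
labels = rng.uniform(1.0, 5.0, size=200)

# Demonstration-style call: weights are passed explicitly.
cluster_simil = ClusterSimilarity(n_clusters=3, random_state=42)
similarities = cluster_simil.fit_transform(coords, sample_weight=labels)

# Training-style call: fit(X, y) passes y, but no sample_weight.
pipeline = make_pipeline(ClusterSimilarity(n_clusters=3, random_state=42),
                         LinearRegression())
pipeline.fit(coords, labels)

print(cluster_simil.received_weights_)  # weights reached fit()
print(pipeline[0].received_weights_)    # no weights reached fit()
```

With scikit-learn's default settings (metadata routing disabled), the second flag stays False because the pipeline only forwards fit parameters that are explicitly addressed to a step.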
