Homework

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Solution: homework.ipynb

Dataset

In this homework, we will use the Laptops price dataset from Kaggle.

Here's a wget-able link:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/laptops.csv

The goal of this homework is to create a regression model for predicting the prices (column 'Final Price').

Preparing the dataset

First, we'll normalize the names of the columns:

df.columns = df.columns.str.lower().str.replace(' ', '_')

Now, instead of 'Final Price', we have 'final_price'.

Next, use only the following columns:

'ram',
'storage',
'screen',
'final_price'

EDA

Look at the final_price variable. Does it have a long tail?

Question 1

There's one column with missing values. What is it?

'ram'
'storage'
'screen'
'final_price'

Question 2

What's the median (50% percentile) for variable 'ram'?

8
16
24
32

Prepare and split the dataset

Shuffle the dataset (the filtered one you created above), use seed 42.
Split your data in train/val/test sets, with 60%/20%/20% distribution.

Use the same code as in the lectures

Question 3

We need to deal with missing values for the column from Q1.
We have two options: fill it with 0 or with the mean of this variable.
Try both options. For each, train a linear regression model without regularization using the code from the lessons.
For computing the mean, use the training only!
Use the validation dataset to evaluate the models and compare the RMSE of each option.
Round the RMSE scores to 2 decimal digits using round(score, 2)
Which option gives better RMSE?

Options:

With 0
With mean
Both are equally good

Question 4

Now let's train a regularized linear regression.
For this question, fill the NAs with 0.
Try different values of r from this list: [0, 0.01, 0.1, 1, 5, 10, 100].
Use RMSE to evaluate the model on the validation dataset.
Round the RMSE scores to 2 decimal digits.
Which r gives the best RMSE?

If there are multiple options, select the smallest r.

Options:

0
0.01
1
10
100

Question 5

We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
Try different seed values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
For each seed, do the train/validation/test split with 60%/20%/20% distribution.
Fill the missing values with 0 and train a model without regularization.
For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
Round the result to 3 decimal digits (round(std, 3))

What's the value of std?

19.176
29.176
39.176
49.176

Note: Standard deviation shows how different the values are. If it's low, then all values are approximately the same. If it's high, the values are different. If standard deviation of scores is low, then our model is stable.

Question 6

Split the dataset like previously, use seed 9.
Combine train and validation datasets.
Fill the missing values with 0 and train a model with r=0.001.
What's the RMSE on the test dataset?

Options:

598.60
608.60
618.60
628.60

Submit the results

Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw02
If your answer doesn't match options exactly, select the closest one

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

homework.md

homework.md

Homework

Dataset

Preparing the dataset

EDA

Question 1

Question 2

Prepare and split the dataset

Question 3

Question 4

Question 5

Question 6

Submit the results

Files

homework.md

Latest commit

History

homework.md

File metadata and controls

Homework

Dataset

Preparing the dataset

EDA

Question 1

Question 2

Prepare and split the dataset

Question 3

Question 4

Question 5

Question 6

Submit the results