\n",
+ "\n",
+ "\n",
+ "[**This lab is best done on Capytale**](https://capytale2.ac-paris.fr/web/c/5ee4-1148472). Otherwise, use Pyzo and download [the data](https://raw.githubusercontent.com/cpge-itc/itc2/6fff3c359a4761aab625b2adb8e5b83697d5c72f/titanic.csv) (to be placed in the same folder as your Python file).\n",
+ "\n",
+ "We want to predict whether a passenger of the Titanic survived the accident, using the nearest neighbors algorithm. You may draw inspiration from the [course example on iris classification](https://cpge-itc.github.io/itc2/4_knn/exemple/knn_iris.html).\n",
+ "\n",
+ "Here is the information available for each passenger:\n",
+ "- `Survived`: 0 = No, 1 = Yes\n",
+ "- `Pclass`: Ticket class (1 = 1st class, 2 = 2nd, 3 = 3rd)\n",
+ "- `Sex`: Gender of the passenger (`male` or `female`)\n",
+ "- `Age`: Age of the passenger (in years)\n",
+ "- `Fare`: Ticket fare (in dollars)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading the data with Pandas\n",
+ "\n",
+ "Pandas is a Python module for manipulating data as a table called a **DataFrame** (somewhat similar to an SQL table):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "   Survived  Pclass     Sex   Age     Fare\n",
+ "0         0       3    male  22.0   7.2500\n",
+ "1         1       1  female  38.0  71.2833\n",
+ "2         1       3  female  26.0   7.9250\n",
+ "3         1       1  female  35.0  53.1000\n",
+ "4         0       3    male  35.0   8.0500"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "df = pd.read_csv('titanic.csv')  # df is a DataFrame\n",
+ "df.head()  # display the first 5 rows"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So `df` is a table with 5 columns (`Survived`, `Pclass`, `Sex`, `Age`, `Fare`), and each row corresponds to one Titanic passenger. The number of rows is given by `len(df)`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "891"
+ ]
+ },
+ "execution_count": 2,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(df)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Each row is identified by an index (the label of the row), here $0, 1, 2, ...$. The row with index $i$ can be accessed with `df.loc[i]`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Survived 0\n",
+ "Pclass 3\n",
+ "Sex male\n",
+ "Age 22.0\n",
+ "Fare 7.25\n",
+ "Name: 0, dtype: object"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.loc[0]"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`df.loc[i]` actually returns a Series, which can be seen as a one-dimensional array.\n",
+ "\n",
+ "A column can be retrieved (also as a Series), for example `Age`, with `df[\"Age\"]`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 22.0\n",
+ "1 38.0\n",
+ "2 26.0\n",
+ "3 35.0\n",
+ "4 35.0\n",
+ " ... \n",
+ "886 27.0\n",
+ "887 19.0\n",
+ "888 28.0\n",
+ "889 26.0\n",
+ "890 32.0\n",
+ "Name: Age, Length: 891, dtype: float64"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df[\"Age\"]"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These two methods can be combined to retrieve a specific value, for example the age of the 3rd passenger:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "26.0"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.loc[2, \"Age\"]"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A value can also be modified with, for example, `df.loc[2, \"Age\"] = ...`."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The indices of a DataFrame can be iterated over with `df.index`. For example, to find the age of the oldest passenger:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "80.0"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "maxi_age = 0\n",
+ "for i in df.index:\n",
+ "    if df.loc[i, \"Age\"] > maxi_age:\n",
+ "        maxi_age = df.loc[i, \"Age\"]\n",
+ "maxi_age"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Note**: with Pandas, one should normally rely on vectorized operations as much as possible, so that the computations run in optimized compiled code (which may exploit the processor's parallelism) rather than in a Python loop. However, since Pandas is not part of the syllabus, we will stick to an elementary approach."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Statistics"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `moyenne(df, c)` that returns the mean of the values in column `c` of the DataFrame `df`. What is the mean age of the Titanic passengers? The mean ticket price?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean age: 29.36158249158249\n",
+ "Mean ticket price: 32.2042079685746\n"
+ ]
+ }
+ ],
+ "source": [
+ "def moyenne(df, col):\n",
+ "    m = 0\n",
+ "    for i in df.index:\n",
+ "        m += df.loc[i, col]\n",
+ "    return m / len(df)\n",
+ "\n",
+ "print(\"Mean age:\", moyenne(df, \"Age\"))\n",
+ "print(\"Mean ticket price:\", moyenne(df, \"Fare\"))"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `ecart_type(df, c)` that returns the standard deviation of the values in column `c` of the DataFrame `df`. Recall that the standard deviation of a series of values $x_1, \\ldots, x_n$ is given by:\n",
+ "\n",
+ "$$\\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2}$$\n",
+ "\n",
+ "where $\\bar{x}$ is the mean of the values $x_1, \\ldots, x_n$.\n",
+ "\n",
+ "**Note: avoid computing the same mean more than once.**"
+ ]
+ },
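As a quick sanity check of the formula above, here is a minimal sketch on a plain Python list rather than a DataFrame. The helper name `population_std` and the sample values are illustrative assumptions, not part of the notebook. The result is compared with `statistics.pstdev` from the standard library, which uses the same $1/n$ convention.

```python
import math
import statistics

def population_std(values):
    """Population standard deviation: divides by n, not n - 1."""
    n = len(values)
    mean = sum(values) / n  # computed once, then reused in the sum
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

# Hypothetical values: mean is 5, squared deviations sum to 32, 32 / 8 = 4
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(sample))     # 2.0
print(statistics.pstdev(sample))  # same convention, for cross-checking
```

Note that `Series.std()` in Pandas divides by $n - 1$ by default (`ddof=1`), so it would give a slightly different value than this formula unless `ddof=0` is passed.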
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def ecart_type(df, col):\n",
+ "    m = moyenne(df, col)\n",
+ "    s = 0\n",
+ "    for i in df.index:\n",
+ "        s += (df.loc[i, col] - m)**2\n",
+ "    return (s / len(df))**0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "13.01238827279366"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ecart_type(df, \"Age\")"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Display the survival rate (proportion of survivors) among:\n",
+ "- men\n",
+ "- women\n",
+ "- 1st class passengers\n",
+ "- 3rd class passengers"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Survival rate for Sex = male : 0.18890814558058924\n",
+ "Survival rate for Sex = female : 0.7420382165605095\n",
+ "Survival rate for Pclass = 1 : 0.6296296296296297\n",
+ "Survival rate for Pclass = 3 : 0.24236252545824846\n"
+ ]
+ }
+ ],
+ "source": [
+ "def survivants(col, val):\n",
+ "    n_survivants = 0\n",
+ "    n = 0\n",
+ "    for i in df.index:\n",
+ "        if df.loc[i, col] == val:\n",
+ "            n_survivants += df.loc[i, \"Survived\"]\n",
+ "            n += 1\n",
+ "    return n_survivants / n\n",
+ "\n",
+ "for c, v in [(\"Sex\", \"male\"), (\"Sex\", \"female\"), (\"Pclass\", 1), (\"Pclass\", 3)]:\n",
+ "    print(\"Survival rate for\", c, \"=\", v, \":\", survivants(c, v))"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Categorical variables\n",
+ "\n",
+ "We want to model each passenger as a vector of $\\mathbb{R}^4$ (since there are $4$ pieces of information per passenger: age, gender, class and ticket price). However, gender is a categorical variable that must be converted into a numerical variable:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "   Survived  Pclass  Sex   Age     Fare\n",
+ "0         0       3    0  22.0   7.2500\n",
+ "1         1       1    1  38.0  71.2833\n",
+ "2         1       3    1  26.0   7.9250\n",
+ "3         1       1    1  35.0  53.1000\n",
+ "4         0       3    0  35.0   8.0500"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})  # replaces male with 0 and female with 1\n",
+ "df.head()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Standardization\n",
+ "\n",
+ "We notice that the attributes are on very different scales (for example, age ranges from 0 to 80, whereas ticket class ranges from 1 to 3). \n",
+ "Differences in age then contribute much more to the distance computations, so age would carry more weight than ticket class in the prediction. \n",
+ "To avoid this, we standardize the data, that is, we transform it so that each attribute has zero mean and a standard deviation equal to 1. \n",
+ "\n",
+ "If an attribute $x$ has mean $\\bar{x}$ and standard deviation $\\sigma$, it can be standardized by replacing it with:\n",
+ "\n",
+ "$$\\frac{x - \\bar{x}}{\\sigma}$$"
+ ]
+ },
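To make the effect of this transformation concrete, here is a minimal sketch on plain Python lists. The function name `standardize` and the two toy columns are assumptions made for illustration; they are not taken from `titanic.csv`. The sketch checks that each standardized column ends up with mean 0 and standard deviation 1, whatever its original scale.

```python
import math

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]

def mean_and_std(values):
    """Return (mean, population standard deviation) of a list."""
    n = len(values)
    mean = sum(values) / n
    return mean, math.sqrt(sum((x - mean) ** 2 for x in values) / n)

# Hypothetical toy columns on very different scales
ages = [22.0, 38.0, 26.0, 35.0, 54.0]
classes = [3, 1, 3, 1, 2]

z_ages = standardize(ages)
z_classes = standardize(classes)

# Up to rounding errors, both columns now have mean 0 and standard deviation 1
print(mean_and_std(z_ages))
print(mean_and_std(z_classes))
```

After this step, a difference of one standard deviation in age weighs as much in a Euclidean distance as a difference of one standard deviation in ticket class.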
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `standardiser(df, c)` that standardizes column `c` of the DataFrame `df`. Use it to standardize the columns `Age`, `Fare`, `Pclass` and `Sex`. Recall that the element on row `i` and column `c` can be modified with `df.loc[i, c] = ...`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def standardiser(df, col):\n",
+ "    m = moyenne(df, col)\n",
+ "    s = ecart_type(df, col)\n",
+ "    for i in df.index:\n",
+ "        df.loc[i, col] = (df.loc[i, col] - m) / s"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "   Survived     Pclass       Sex       Age      Fare\n",
+ "0         0   0.827377 -0.737695 -0.565736 -0.502445\n",
+ "1         1  -1.566107  1.355574  0.663861  0.786845\n",
+ "2         1   0.827377  1.355574 -0.258337 -0.488854\n",
+ "3         1  -1.566107  1.355574  0.433312  0.420730\n",
+ "4         0   0.827377 -0.737695  0.433312 -0.486337"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "for c in [\"Age\", \"Fare\", \"Pclass\", \"Sex\"]:\n",
+ "    standardiser(df, c)\n",
+ "df.head()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Distance\n",
+ "\n",
+ "For the next question, recall how to access the attributes of a data point:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(-0.565736461074875,\n",
+ " -0.5024451714361915,\n",
+ " 0.8273772438659676,\n",
+ " -0.7376951317802897)"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "p = df.loc[0]  # 1st passenger\n",
+ "p[\"Age\"], p[\"Fare\"], p[\"Pclass\"], p[\"Sex\"]  # attributes of p"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `distance(p1, p2)` that computes the Euclidean distance between passengers `p1` and `p2`. All attributes except `Survived` are taken into account."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def distance(p1, p2):\n",
+ "    d = 0\n",
+ "    for c in [\"Pclass\", \"Sex\", \"Age\", \"Fare\"]:\n",
+ "        d += (p1[c] - p2[c])**2\n",
+ "    return d**0.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3.644820996221396"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "distance(df.loc[0], df.loc[1])  # distance between the first two passengers"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Splitting the data\n",
+ "\n",
+ "We split the data into two parts: a `train` part used to make predictions, and a `test` part used to evaluate the quality of the predictions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "number of rows in train: 802\n",
+ "number of rows in test: 89\n"
+ ]
+ }
+ ],
+ "source": [
+ "train = df.sample(frac=0.9, random_state=0)\n",
+ "test = df.drop(train.index)\n",
+ "print(\"number of rows in train:\", len(train))\n",
+ "print(\"number of rows in test:\", len(test))"
+ ]
+ },
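The same idea can be sketched with the standard library alone: shuffle the row indices with a fixed seed, keep a fraction for training, and use the remaining indices for testing. The function name `split_indices` is a hypothetical helper, and the shuffle produced by `random.Random(0)` is not the one Pandas produces with `random_state=0`, so the selected rows differ; only the sizes and the disjointness are comparable.

```python
import random

def split_indices(indices, frac, seed=0):
    """Split indices into (train, test), with about `frac` of them in train."""
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train_idx, test_idx = split_indices(range(891), 0.9)
print(len(train_idx), len(test_idx))  # 802 89
```

Keeping the two parts disjoint matters: a test passenger that also appears in `train` would be its own nearest neighbor, which would inflate the measured quality of the prediction.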
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Nearest neighbors algorithm"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `voisins(x, k)` that returns the indices of the $k$ nearest neighbors of `x` in `train`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def voisins(x, k):\n",
+ "    indices = sorted(train.index, key=lambda i: distance(x, train.loc[i]))\n",
+ "    return indices[:k]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[446, 651, 546, 427, 389]"
+ ]
+ },
+ "execution_count": 46,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "voisins(test.iloc[0], 5)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `plus_frequent(L)` that returns the most frequent element of a list `L`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def plus_frequent(L):  # returns the class that appears most often in L\n",
+ "    compte = {}\n",
+ "    for e in L:\n",
+ "        compte[e] = compte.get(e, 0) + 1\n",
+ "    return max(compte, key=compte.get)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "plus_frequent([2, 1, 5, 1, 2, 5, 5])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `knn(x, k)` that returns the survival prediction for `x` using the $k$ nearest neighbors algorithm."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def knn(x, k):\n",
+ "    return plus_frequent([train.loc[i, \"Survived\"] for i in voisins(x, k)])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "knn(test.iloc[0], 5)"
+ ]
+ },
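The three functions `voisins`, `plus_frequent` and `knn` together form the whole classifier. As a self-contained illustration that does not depend on the DataFrame, here is a minimal sketch of the same pipeline on 2D points. The name `knn_predict` and the toy points and labels are assumptions made for this example; `math.dist` (Python 3.8+) computes the Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(points, labels, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    # Sort the training positions by distance to x, then keep the k closest
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], x))
    votes = [labels[i] for i in order[:k]]
    # Most common label among the k neighbors
    return Counter(votes).most_common(1)[0][0]

# Hypothetical training data: two points near the origin, three near (1, 1)
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = [0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.2, 0.1), 3))  # 0
print(knn_predict(points, labels, (1.0, 0.9), 3))  # 1
```

With two classes, choosing an odd `k` avoids ties in the vote. Sorting all the training points costs $O(n \log n)$ per prediction, which is one reason the evaluation over the whole test set takes a noticeable amount of time.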
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Analysis of the results"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `precision(k)` that returns the accuracy of the $k$ nearest neighbors algorithm, that is, the proportion of passengers in `test` whose survival is predicted correctly."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def precision(k):\n",
+ "    n = 0\n",
+ "    for i in test.index:\n",
+ "        if knn(test.loc[i], k) == test.loc[i, \"Survived\"]:\n",
+ "            n += 1\n",
+ "    return n / len(test)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0.8314606741573034"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "precision(3)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Question**: Write a function `plot_precision(kmax)` that plots the accuracy for $k$ ranging from $1$ to `kmax`. What is the best accuracy obtained for $k$ between 1 and 5 (this takes about 20 seconds)? What is the optimal number of neighbors?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "def plot_precision(kmax):\n",
+ "    import matplotlib.pyplot as plt\n",
+ "    R = range(1, kmax + 1)  # k from 1 to kmax inclusive\n",
+ "    plt.plot(R, [precision(k) for k in R])\n",
+ "    plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {
+ "tags": [
+ "cor"
+ ]
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "