Lecture 3 – Data 100, Fall 2024¶
Data 100, Fall 2024
+A demonstration of advanced pandas
syntax to accompany Lecture 3.
import numpy as np
+
+import numpy as np
import pandas as pd
-
-
# Loading the elections DataFrame
-elections = pd.read_csv("data/elections.csv")
+
+# Loading the elections DataFrame
+elections = pd.read_csv("data/elections.csv")
elections.head()
-
-
- | Year | -Candidate | -Party | -Popular vote | -Result | -% | -
---|---|---|---|---|---|---|
0 | -1824 | -Andrew Jackson | -Democratic-Republican | -151271 | -loss | -57.210122 | -
1 | -1824 | -John Quincy Adams | -Democratic-Republican | -113142 | -win | -42.789878 | -
2 | -1828 | -Andrew Jackson | -Democratic | -642806 | -win | -56.203927 | -
3 | -1828 | -John Quincy Adams | -National Republican | -500897 | -loss | -43.796073 | -
4 | -1832 | -Andrew Jackson | -Democratic | -702735 | -win | -54.574789 | -
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
iloc
selects items by row and column integer position.
Arguments to .iloc
can be:
-
@@ -15383,44 +7665,35 @@
- A single value.
Integer-Based Extraction Using A slice (syntax is exclusive of the right hand side of the slice).
# Select the rows at positions 1, 2, and 3.
+
+# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]
-
-
- | Year | -Candidate | -Party | -
---|---|---|---|
1 | -1824 | -John Quincy Adams | -Democratic-Republican | -
2 | -1828 | -Andrew Jackson | -Democratic | -
3 | -1828 | -John Quincy Adams | -National Republican | -
+ | Year | +Candidate | +Party | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +
2 | +1828 | +Andrew Jackson | +Democratic | +
3 | +1828 | +John Quincy Adams | +National Republican | +
# Index-based extraction using a list of rows and a slice of column indices
+
+# Index-based extraction using a list of rows and a slice of
+# column integer positions
elections.iloc[[1, 2, 3], 0:3]
-
-
- | Year | -Candidate | -Party | -
---|---|---|---|
1 | -1824 | -John Quincy Adams | -Democratic-Republican | -
2 | -1828 | -Andrew Jackson | -Democratic | -
3 | -1828 | -John Quincy Adams | -National Republican | -
+ | Year | +Candidate | +Party | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +
2 | +1828 | +Andrew Jackson | +Democratic | +
3 | +1828 | +John Quincy Adams | +National Republican | +
# Selecting all rows using a colon
+
+# Selecting all rows using a colon
elections.iloc[:, 0:3]
-
-
- | Year | -Candidate | -Party | -
---|---|---|---|
0 | -1824 | -Andrew Jackson | -Democratic-Republican | -
1 | -1824 | -John Quincy Adams | -Democratic-Republican | -
2 | -1828 | -Andrew Jackson | -Democratic | -
3 | -1828 | -John Quincy Adams | -National Republican | -
4 | -1832 | -Andrew Jackson | -Democratic | -
... | -... | -... | -... | -
177 | -2016 | -Jill Stein | -Green | -
178 | -2020 | -Joseph Biden | -Democratic | -
179 | -2020 | -Donald Trump | -Republican | -
180 | -2020 | -Jo Jorgensen | -Libertarian | -
181 | -2020 | -Howard Hawkins | -Green | -
+ | Year | +Candidate | +Party | +
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +
2 | +1828 | +Andrew Jackson | +Democratic | +
3 | +1828 | +John Quincy Adams | +National Republican | +
4 | +1832 | +Andrew Jackson | +Democratic | +
... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +Green | +
178 | +2020 | +Joseph Biden | +Democratic | +
179 | +2020 | +Donald Trump | +Republican | +
180 | +2020 | +Jo Jorgensen | +Libertarian | +
181 | +2020 | +Howard Hawkins | +Green | +
182 rows × 3 columns
elections.iloc[[1, 2, 3], 1]
+
+# Extracting data from one column
+elections.iloc[[1, 2, 3], 1]
-
-
1 John Quincy Adams 2 Andrew Jackson 3 John Quincy Adams Name: Candidate, dtype: object
# Extracting the value at row 0 and the second column
+
+# Extracting the value at row 0 and the second column
elections.iloc[0,1]
-
-
'Andrew Jackson'+
'Andrew Jackson'
We could technically do anything we want using loc
or iloc
. However, in practice, the []
operator is often used instead to yield more concise code.
[]
is a bit trickier to understand than loc
or iloc
, but it achieves essentially the same functionality. The difference is that []
is context-dependent.
[]
only takes one argument, which may be:
Context-dependent Extraction
If we provide a slice of row numbers, we get the numbered rows.
-elections[3:7]
+
+elections[3:7]
-
-
- | Year | -Candidate | -Party | -Popular vote | -Result | -% | -
---|---|---|---|---|---|---|
3 | -1828 | -John Quincy Adams | -National Republican | -500897 | -loss | -43.796073 | -
4 | -1832 | -Andrew Jackson | -Democratic | -702735 | -win | -54.574789 | -
5 | -1832 | -Henry Clay | -National Republican | -484205 | -loss | -37.603628 | -
6 | -1832 | -William Wirt | -Anti-Masonic | -100715 | -loss | -7.821583 | -
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
5 | +1832 | +Henry Clay | +National Republican | +484205 | +loss | +37.603628 | +
6 | +1832 | +William Wirt | +Anti-Masonic | +100715 | +loss | +7.821583 | +
If we provide a list of column names, we get the listed columns.
-elections[["Year", "Candidate", "Result"]]
+
+elections[["Year", "Candidate", "Result"]]
-
-
- | Year | -Candidate | -Result | -
---|---|---|---|
0 | -1824 | -Andrew Jackson | -loss | -
1 | -1824 | -John Quincy Adams | -win | -
2 | -1828 | -Andrew Jackson | -win | -
3 | -1828 | -John Quincy Adams | -loss | -
4 | -1832 | -Andrew Jackson | -win | -
... | -... | -... | -... | -
177 | -2016 | -Jill Stein | -loss | -
178 | -2020 | -Joseph Biden | -win | -
179 | -2020 | -Donald Trump | -loss | -
180 | -2020 | -Jo Jorgensen | -loss | -
181 | -2020 | -Howard Hawkins | -loss | -
+ | Year | +Candidate | +Result | +
0 | +1824 | +Andrew Jackson | +loss | +
1 | +1824 | +John Quincy Adams | +win | +
2 | +1828 | +Andrew Jackson | +win | +
3 | +1828 | +John Quincy Adams | +loss | +
4 | +1832 | +Andrew Jackson | +win | +
... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +loss | +
178 | +2020 | +Joseph Biden | +win | +
179 | +2020 | +Donald Trump | +loss | +
180 | +2020 | +Jo Jorgensen | +loss | +
181 | +2020 | +Howard Hawkins | +loss | +
182 rows × 3 columns
And if we provide a single column name we get back just that column, stored as a Series
.
elections["Candidate"]
+
+elections["Candidate"]
-
-
0 Andrew Jackson 1 John Quincy Adams 2 Andrew Jackson @@ -16112,59 +8296,45 @@Context-dependent Extraction 181 Howard Hawkins Name: Candidate, Length: 182, dtype: object
weird = pd.DataFrame({
- 1:["topdog","botdog"],
- "1":["topcat","botcat"]
-})
+
+weird = pd.DataFrame({"a":["one fish", "two fish"],
+ "b":["red fish", "blue fish"]})
weird
-
-
- | 1 | -1 | -
---|---|---|
0 | -topdog | -topcat | -
1 | -botdog | -botcat | -
+ | a | +b | +
0 | +one fish | +red fish | +
1 | +two fish | +blue fish | +
weird[1]
+
+weird.iloc[1, 1]
-
-
'blue fish'+
# weird.loc[1, 1]
+
weird.iloc[1, :]
+
0 topdog -1 botdog +Out[14]:++-a two fish +b blue fish Name: 1, dtype: object
weird["1"]
+
+weird.loc[1, 'b']
-
-
0 topcat -1 botcat -Name: 1, dtype: object+
'blue fish'
weird[1:]
+
+weird.iloc[1, 0]
-
-
- | 1 | -1 | -
---|---|---|
1 | -botdog | -botcat | -
'two fish'
Dataset: California baby names¶
In today's lecture, we'll work with the babynames
dataset, which contains information about the names of infants born in California.
Dataset: California baby names¶
In today's lecture, we'll work with the babynames
dataset, which contains information about the names of infants born in California.
The cell below pulls census data from a government website and then loads it into a usable form. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!
-import urllib.request
+
+import urllib.request
import os.path
import zipfile
-data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
-local_filename = "data/babynamesbystate.zip"
-if not os.path.exists(local_filename): # If the data exists don't download again
- with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
+data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
+local_filename = "data/babynamesbystate.zip"
+if not os.path.exists(local_filename): # If the data exists don't download again
+ with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
f.write(resp.read())
-zf = zipfile.ZipFile(local_filename, 'r')
+zf = zipfile.ZipFile(local_filename, 'r')
-ca_name = 'STATE.CA.TXT'
-field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
+ca_name = 'STATE.CA.TXT'
+field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
babynames = pd.read_csv(fh, header=None, names=field_names)
babynames.head()
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
+
+# Ask yourself: Why is :9 is the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]
babynames_first_10_rows
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
5 | -CA | -F | -1910 | -Ruth | -128 | -
6 | -CA | -F | -1910 | -Evelyn | -126 | -
7 | -CA | -F | -1910 | -Alice | -118 | -
8 | -CA | -F | -1910 | -Virginia | -101 | -
9 | -CA | -F | -1910 | -Elizabeth | -93 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
5 | +CA | +F | +1910 | +Ruth | +128 | +
6 | +CA | +F | +1910 | +Evelyn | +126 | +
7 | +CA | +F | +1910 | +Alice | +118 | +
8 | +CA | +F | +1910 | +Virginia | +101 | +
9 | +CA | +F | +1910 | +Elizabeth | +93 | +
By passing in a sequence (list, array, or Series
) of boolean values, we can extract a subset of the rows in a DataFrame
. We will keep only the rows that correspond to a boolean value of True
.
# Notice how we have exactly 10 elements in our boolean array argument.
+
+# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
6 | -CA | -F | -1910 | -Evelyn | -126 | -
8 | -CA | -F | -1910 | -Virginia | -101 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
6 | +CA | +F | +1910 | +Evelyn | +126 | +
8 | +CA | +F | +1910 | +Virginia | +101 | +
# Or using .loc to filter a DataFrame by a Boolean array argument.
+
+# Or using .loc to filter a DataFrame by a Boolean array argument.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
6 | -CA | -F | -1910 | -Evelyn | -126 | -
8 | -CA | -F | -1910 | -Virginia | -101 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
6 | +CA | +F | +1910 | +Evelyn | +126 | +
8 | +CA | +F | +1910 | +Virginia | +101 | +
Oftentimes, we'll use boolean selection to check for entries in a DataFrame
that meet a particular condition.
# First, use a logical condition to generate a boolean Series
-logical_operator = (babynames["Sex"] == "F")
+
+# First, use a logical condition to generate a boolean Series
+logical_operator = (babynames["Sex"] == "F")
logical_operator
-
-
# Then, use this boolean Series to filter the DataFrame
+
+# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
... | -... | -... | -... | -... | -... | -
239532 | -CA | -F | -2022 | -Zemira | -5 | -
239533 | -CA | -F | -2022 | -Ziggy | -5 | -
239534 | -CA | -F | -2022 | -Zimal | -5 | -
239535 | -CA | -F | -2022 | -Zosia | -5 | -
239536 | -CA | -F | -2022 | -Zulay | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
... | +... | +... | +... | +... | +... | +
239532 | +CA | +F | +2022 | +Zemira | +5 | +
239533 | +CA | +F | +2022 | +Ziggy | +5 | +
239534 | +CA | +F | +2022 | +Zimal | +5 | +
239535 | +CA | +F | +2022 | +Zosia | +5 | +
239536 | +CA | +F | +2022 | +Zulay | +5 | +
239537 rows × 5 columns
Boolean selection also works with loc
!
# Notice that we did not have to specify columns to select
+
+# Notice that we did not have to specify columns to select
# If no columns are referenced, pandas will automatically select all columns
-babynames.loc[babynames["Sex"] == "F"]
+babynames.loc[babynames["Sex"] == "F"]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
... | -... | -... | -... | -... | -... | -
239532 | -CA | -F | -2022 | -Zemira | -5 | -
239533 | -CA | -F | -2022 | -Ziggy | -5 | -
239534 | -CA | -F | -2022 | -Zimal | -5 | -
239535 | -CA | -F | -2022 | -Zosia | -5 | -
239536 | -CA | -F | -2022 | -Zulay | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
... | +... | +... | +... | +... | +... | +
239532 | +CA | +F | +2022 | +Zemira | +5 | +
239533 | +CA | +F | +2022 | +Ziggy | +5 | +
239534 | +CA | +F | +2022 | +Zimal | +5 | +
239535 | +CA | +F | +2022 | +Zosia | +5 | +
239536 | +CA | +F | +2022 | +Zulay | +5 | +
239537 rows × 5 columns
To filter on multiple conditions, we combine boolean operators using bitwise comparisons.
Symbol | -Usage | -Meaning | +Symbol | +Usage | +Meaning |
---|---|---|---|---|---|
~ | -~p | -Returns negation of p | +~ | +~p | +Returns negation of p |
| | -p | q | -p OR q | +| | +p | q | +p OR q |
& | -p & q | -p AND q | +& | +p & q | +p AND q |
^ | -p ^ q | -p XOR q (exclusive or) | +^ | +p ^ q | +p XOR q (exclusive or) |
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
+
+babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
... | -... | -... | -... | -... | -... | -
149050 | -CA | -F | -1999 | -Zareen | -5 | -
149051 | -CA | -F | -1999 | -Zeinab | -5 | -
149052 | -CA | -F | -1999 | -Zhane | -5 | -
149053 | -CA | -F | -1999 | -Zoha | -5 | -
149054 | -CA | -F | -1999 | -Zoila | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
... | +... | +... | +... | +... | +... | +
149050 | +CA | +F | +1999 | +Zareen | +5 | +
149051 | +CA | +F | +1999 | +Zeinab | +5 | +
149052 | +CA | +F | +1999 | +Zhane | +5 | +
149053 | +CA | +F | +1999 | +Zoha | +5 | +
149054 | +CA | +F | +1999 | +Zoila | +5 | +
149055 rows × 5 columns
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
+
+babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
... | -... | -... | -... | -... | -... | -
342435 | -CA | -M | -1999 | -Yuuki | -5 | -
342436 | -CA | -M | -1999 | -Zakariya | -5 | -
342437 | -CA | -M | -1999 | -Zavier | -5 | -
342438 | -CA | -M | -1999 | -Zayn | -5 | -
342439 | -CA | -M | -1999 | -Zayne | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
... | +... | +... | +... | +... | +... | +
342435 | +CA | +M | +1999 | +Yuuki | +5 | +
342436 | +CA | +M | +1999 | +Zakariya | +5 | +
342437 | +CA | +M | +1999 | +Zavier | +5 | +
342438 | +CA | +M | +1999 | +Zayn | +5 | +
342439 | +CA | +M | +1999 | +Zayne | +5 | +
342440 rows × 5 columns
babynames.iloc[[0, 233, 484], [3, 4]]
+
+babynames.iloc[[0, 233, 484], [3, 4]]
-
-
- | Name | -Count | -
---|---|---|
0 | -Mary | -295 | -
233 | -Mary | -390 | -
484 | -Mary | -534 | -
+ | Name | +Count | +
0 | +Mary | +295 | +
233 | +Mary | +390 | +
484 | +Mary | +534 | +
babynames.loc[[0, 233, 484]]
+
+babynames.loc[[0, 233, 484]]
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
233 | -CA | -F | -1911 | -Mary | -390 | -
484 | -CA | -F | -1912 | -Mary | -534 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
233 | +CA | +F | +1911 | +Mary | +390 | +
484 | +CA | +F | +1912 | +Mary | +534 | +
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
+
+babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
-
-
- | Name | -Count | -
---|---|---|
0 | -Mary | -295 | -
233 | -Mary | -390 | -
484 | -Mary | -534 | -
+ | Name | +Count | +
0 | +Mary | +295 | +
233 | +Mary | +390 | +
484 | +Mary | +534 | +
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
+
+babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
-
-
- | Name | -Count | -
---|---|---|
0 | -Mary | -295 | -
233 | -Mary | -390 | -
+ | Name | +Count | +
0 | +Mary | +295 | +
233 | +Mary | +390 | +
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability
+
+# Note: The parentheses surrounding the code make it possible to
+# break the code into multiple lines for readability
(
- babynames[(babynames["Name"] == "Bella") |
- (babynames["Name"] == "Alex") |
- (babynames["Name"] == "Narges") |
- (babynames["Name"] == "Lisa")]
+ babynames[(babynames["Name"] == "Bella") |
+ (babynames["Name"] == "Alex") |
+ (babynames["Name"] == "Narges") |
+ (babynames["Name"] == "Lisa")]
)
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
6289 | -CA | -F | -1923 | -Bella | -5 | -
7512 | -CA | -F | -1925 | -Bella | -8 | -
12368 | -CA | -F | -1932 | -Lisa | -5 | -
14741 | -CA | -F | -1936 | -Lisa | -8 | -
17084 | -CA | -F | -1939 | -Lisa | -5 | -
... | -... | -... | -... | -... | -... | -
393248 | -CA | -M | -2018 | -Alex | -495 | -
396111 | -CA | -M | -2019 | -Alex | -438 | -
398983 | -CA | -M | -2020 | -Alex | -379 | -
401788 | -CA | -M | -2021 | -Alex | -333 | -
404663 | -CA | -M | -2022 | -Alex | -344 | -
+ | State | +Sex | +Year | +Name | +Count | +
6289 | +CA | +F | +1923 | +Bella | +5 | +
7512 | +CA | +F | +1925 | +Bella | +8 | +
12368 | +CA | +F | +1932 | +Lisa | +5 | +
14741 | +CA | +F | +1936 | +Lisa | +8 | +
17084 | +CA | +F | +1939 | +Lisa | +5 | +
... | +... | +... | +... | +... | +... | +
393248 | +CA | +M | +2018 | +Alex | +495 | +
396111 | +CA | +M | +2019 | +Alex | +438 | +
398983 | +CA | +M | +2020 | +Alex | +379 | +
401788 | +CA | +M | +2021 | +Alex | +333 | +
404663 | +CA | +M | +2022 | +Alex | +344 | +
317 rows × 5 columns
# A more concise method to achieve the above: .isin
-names = ["Bella", "Alex", "Narges", "Lisa"]
-display(babynames["Name"].isin(names))
-display(babynames[babynames["Name"].isin(names)])
+
+# A more concise method to achieve the above: .isin
+names = ["Bella", "Alex", "Narges", "Lisa"]
+display(babynames["Name"].isin(names))
+display(babynames[babynames["Name"].isin(names)])
-
-
# What if we only want names that start with "N"?
-display(babynames["Name"].str.startswith("N"))
-display(babynames[babynames["Name"].str.startswith("N")])
-
# What if we only want names that start with "N"?
+display(babynames["Name"].str.startswith("N"))
-
To add a column, use []
to reference the desired new column, then assign it to a Series
or array of appropriate length.
# Create a Series of the length of each name
-babyname_lengths = babynames["Name"].str.len()
+
+# Create a Series of the length of each name.
+babyname_lengths = babynames["Name"].str.len()
-# Add a column named "name_lengths" that includes the length of each name
-babynames["name_lengths"] = babyname_lengths
+# Add a column named "name_lengths" that includes the length of each name.
+babynames["name_lengths"] = babyname_lengths
babynames
-
-
- | State | -Sex | -Year | -Name | -Count | -name_lengths | -
---|---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -4 | -
1 | -CA | -F | -1910 | -Helen | -239 | -5 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -7 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -8 | -
4 | -CA | -F | -1910 | -Frances | -134 | -7 | -
... | -... | -... | -... | -... | -... | -... | -
407423 | -CA | -M | -2022 | -Zayvier | -5 | -7 | -
407424 | -CA | -M | -2022 | -Zia | -5 | -3 | -
407425 | -CA | -M | -2022 | -Zora | -5 | -4 | -
407426 | -CA | -M | -2022 | -Zuriel | -5 | -6 | -
407427 | -CA | -M | -2022 | -Zylo | -5 | -4 | -
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
0 | +CA | +F | +1910 | +Mary | +295 | +4 | +
1 | +CA | +F | +1910 | +Helen | +239 | +5 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +7 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +8 | +
4 | +CA | +F | +1910 | +Frances | +134 | +7 | +
... | +... | +... | +... | +... | +... | +... | +
407423 | +CA | +M | +2022 | +Zayvier | +5 | +7 | +
407424 | +CA | +M | +2022 | +Zia | +5 | +3 | +
407425 | +CA | +M | +2022 | +Zora | +5 | +4 | +
407426 | +CA | +M | +2022 | +Zuriel | +5 | +6 | +
407427 | +CA | +M | +2022 | +Zylo | +5 | +4 | +
407428 rows × 6 columns
To modify a column, use []
to access the desired column, then re-assign it to a new array or Series.
# Modify the "name_lengths" column to be one less than its original value
-babynames["name_lengths"] = babynames["name_lengths"] - 1
+
+# Modify the "name_lengths" column to be one less than its original value.
+babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames
-
-
- | State | -Sex | -Year | -Name | -Count | -name_lengths | -
---|---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -3 | -
1 | -CA | -F | -1910 | -Helen | -239 | -4 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -6 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -7 | -
4 | -CA | -F | -1910 | -Frances | -134 | -6 | -
... | -... | -... | -... | -... | -... | -... | -
407423 | -CA | -M | -2022 | -Zayvier | -5 | -6 | -
407424 | -CA | -M | -2022 | -Zia | -5 | -2 | -
407425 | -CA | -M | -2022 | -Zora | -5 | -3 | -
407426 | -CA | -M | -2022 | -Zuriel | -5 | -5 | -
407427 | -CA | -M | -2022 | -Zylo | -5 | -3 | -
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
0 | +CA | +F | +1910 | +Mary | +295 | +3 | +
1 | +CA | +F | +1910 | +Helen | +239 | +4 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +6 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +7 | +
4 | +CA | +F | +1910 | +Frances | +134 | +6 | +
... | +... | +... | +... | +... | +... | +... | +
407423 | +CA | +M | +2022 | +Zayvier | +5 | +6 | +
407424 | +CA | +M | +2022 | +Zia | +5 | +2 | +
407425 | +CA | +M | +2022 | +Zora | +5 | +3 | +
407426 | +CA | +M | +2022 | +Zuriel | +5 | +5 | +
407427 | +CA | +M | +2022 | +Zylo | +5 | +3 | +
407428 rows × 6 columns
Rename a column using the .rename()
method.
# Rename "name_lengths" to "Length"
-babynames = babynames.rename(columns={"name_lengths":"Length"})
+
+# Rename "name_lengths" to "Length".
+babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames
-
-
- | State | -Sex | -Year | -Name | -Count | -Length | -
---|---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -3 | -
1 | -CA | -F | -1910 | -Helen | -239 | -4 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -6 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -7 | -
4 | -CA | -F | -1910 | -Frances | -134 | -6 | -
... | -... | -... | -... | -... | -... | -... | -
407423 | -CA | -M | -2022 | -Zayvier | -5 | -6 | -
407424 | -CA | -M | -2022 | -Zia | -5 | -2 | -
407425 | -CA | -M | -2022 | -Zora | -5 | -3 | -
407426 | -CA | -M | -2022 | -Zuriel | -5 | -5 | -
407427 | -CA | -M | -2022 | -Zylo | -5 | -3 | -
+ | State | +Sex | +Year | +Name | +Count | +Length | +
0 | +CA | +F | +1910 | +Mary | +295 | +3 | +
1 | +CA | +F | +1910 | +Helen | +239 | +4 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +6 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +7 | +
4 | +CA | +F | +1910 | +Frances | +134 | +6 | +
... | +... | +... | +... | +... | +... | +... | +
407423 | +CA | +M | +2022 | +Zayvier | +5 | +6 | +
407424 | +CA | +M | +2022 | +Zia | +5 | +2 | +
407425 | +CA | +M | +2022 | +Zora | +5 | +3 | +
407426 | +CA | +M | +2022 | +Zuriel | +5 | +5 | +
407427 | +CA | +M | +2022 | +Zylo | +5 | +3 | +
407428 rows × 6 columns
Remove a column using .drop()
.
# Remove our new "Length" column
-babynames = babynames.drop("Length", axis="columns")
+
+# Remove our new "Length" column.
+babynames = babynames.drop("Length", axis="columns")
babynames
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
0 | -CA | -F | -1910 | -Mary | -295 | -
1 | -CA | -F | -1910 | -Helen | -239 | -
2 | -CA | -F | -1910 | -Dorothy | -220 | -
3 | -CA | -F | -1910 | -Margaret | -163 | -
4 | -CA | -F | -1910 | -Frances | -134 | -
... | -... | -... | -... | -... | -... | -
407423 | -CA | -M | -2022 | -Zayvier | -5 | -
407424 | -CA | -M | -2022 | -Zia | -5 | -
407425 | -CA | -M | -2022 | -Zora | -5 | -
407426 | -CA | -M | -2022 | -Zuriel | -5 | -
407427 | -CA | -M | -2022 | -Zylo | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
... | +... | +... | +... | +... | +... | +
407423 | +CA | +M | +2022 | +Zayvier | +5 | +
407424 | +CA | +M | +2022 | +Zia | +5 | +
407425 | +CA | +M | +2022 | +Zora | +5 | +
407426 | +CA | +M | +2022 | +Zuriel | +5 | +
407427 | +CA | +M | +2022 | +Zylo | +5 | +
407428 rows × 5 columns
yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
+
+yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
yash_counts
-
-
331824 8 334114 9 336390 11 @@ -19298,219 +11163,154 @@
NumPy
¶ 406006 13 Name: Count, dtype: int64
# Average number of babies named Yash each year
+
+# Average number of babies named Yash each year
np.mean(yash_counts)
-
-
17.142857142857142
# Max number of babies named Yash born in any single year
+
+# Max number of babies named Yash born in any single year
max(yash_counts)
-
-
29
Built-In pandas
Methods¶
There are many, many utility functions built into pandas
, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas
documentation.
Built-In pandas
Methods¶
There are many, many utility functions built into pandas
, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the pandas
documentation.
# Returns the shape of the object in the format (num_rows, num_columns)
+
+# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape
-
-
(407428, 5)
# Returns the total number of entries in the object, equal to num_rows * num_columns
+
+# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size
-
-
2037140
# What summary statistics can we describe?
+
+# What summary statistics can we describe?
babynames.describe()
-
-
- | Year | -Count | -
---|---|---|
count | -407428.000000 | -407428.000000 | -
mean | -1985.733609 | -79.543456 | -
std | -27.007660 | -293.698654 | -
min | -1910.000000 | -5.000000 | -
25% | -1969.000000 | -7.000000 | -
50% | -1992.000000 | -13.000000 | -
75% | -2008.000000 | -38.000000 | -
max | -2022.000000 | -8260.000000 | -
+ | Year | +Count | +
count | +407428.000000 | +407428.000000 | +
mean | +1985.733609 | +79.543456 | +
std | +27.007660 | +293.698654 | +
min | +1910.000000 | +5.000000 | +
25% | +1969.000000 | +7.000000 | +
50% | +1992.000000 | +13.000000 | +
75% | +2008.000000 | +38.000000 | +
max | +2022.000000 | +8260.000000 | +
# Our statistics are slightly different when working with a Series
-babynames["Sex"].describe()
+
+# Our statistics are slightly different when working with a Series.
+babynames["Sex"].describe()
-
-
count 407428 unique 2 top F freq 239537 Name: Sex, dtype: object
# Randomly sample row(s) from the DataFrame
+
+# Randomly sample row(s) from the DataFrame.
babynames.sample()
-
-
- | State | -Sex | -Year | -Name | -Count | -
---|---|---|---|---|---|
71422 | -CA | -F | -1974 | -Terasa | -5 | -
+ | State | +Sex | +Year | +Name | +Count | +
154936 | +CA | +F | +2001 | +Darya | +10 | +
# Rerun this cell a few times – you'll get different results!
+
+# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]
-
-
- | Year | -Name | -Count | -
---|---|---|---|
261171 | -1949 | -Gerardo | -10 | -
172906 | -2006 | -Lila | -114 | -
199106 | -2012 | -Estefani | -17 | -
140347 | -1997 | -Adele | -10 | -
100699 | -1986 | -Dayana | -20 | -
+ | Year | +Name | +Count | +
339370 | +1998 | +Vito | +7 | +
99351 | +1985 | +Janea | +5 | +
386376 | +2015 | +Reilly | +8 | +
15857 | +1938 | +Sherry | +40 | +
118930 | +1991 | +Tatum | +8 | +
# Sampling with replacement
-babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]
+
+# Sampling with replacement.
+babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]
-
-
- | Year | -Name | -Count | -
---|---|---|---|
344805 | -2000 | -Kaine | -5 | -
344678 | -2000 | -Cris | -5 | -
343876 | -2000 | -Alen | -9 | -
150843 | -2000 | -Cleo | -12 | -
+ | Year | +Name | +Count | +
344801 | +2000 | +Justine | +5 | +
151714 | +2000 | +Cloey | +7 | +
342700 | +2000 | +Troy | +158 | +
152033 | +2000 | +Evelina | +6 | +
# Count the number of times each unique value occurs in a Series
-babynames["Name"].value_counts()
+
+# Count the number of times each unique value occurs in a Series.
+babynames["Name"].value_counts()
-
-
Name Jean 223 Francis 221 @@ -19931,85 +11669,59 @@Built-In
pandas
Methods -++-In [46]:+In [48]:-- +-# Return an array of all unique values in the Series -babynames["Name"].unique() ++- -# Return an array of all unique values in the Series. +babynames["Name"].unique()- -- -- --Out[46]:- - - - ---array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'], +Out[48]:++-array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'], dtype=object)-++-In [47]:+In [49]:-- +-# Sort a Series -babynames["Name"].sort_values() ++- -# Sort a Series. +babynames["Name"].sort_values()- -- -Out[47]:- - - - -+Out[49]:+366001 Aadan 384005 Aadan 369120 Aadan @@ -20023,45 +11735,33 @@Built-In
pandas
Methods -++-In [48]:+In [50]:-- +-# Sort a DataFrame – there are lots of Michaels in California -babynames.sort_values(by="Count", ascending=False) ++- -# Sort a DataFrame – there are lots of Michaels in California. +babynames.sort_values(by="Count", ascending=False)-- --- --Out[48]:- - - -+-Out[50]:+--- -
- - - -- State -Sex -Year -Name -Count -- -268041 -CA -M -1957 -Michael -8260 -- -267017 -CA -M -1956 -Michael -8258 -- -317387 -CA -M -1990 -Michael -8246 -- -281850 -CA -M -1969 -Michael -8245 -- -283146 -CA -M -1970 -Michael -8196 -- -... -... -... -... -... -... -- -317292 -CA -M -1989 -Olegario -5 -- -317291 -CA -M -1989 -Norbert -5 -- -317290 -CA -M -1989 -Niles -5 -- -317289 -CA -M -1989 -Nikola -5 -- - + +407427 -CA -M -2022 -Zylo -5 -+ + + ++ State +Sex +Year +Name +Count ++ +268041 +CA +M +1957 +Michael +8260 ++ +267017 +CA +M +1956 +Michael +8258 ++ +317387 +CA +M +1990 +Michael +8246 ++ +281850 +CA +M +1969 +Michael +8245 ++ +283146 +CA +M +1970 +Michael +8196 ++ +... +... +... +... +... +... ++ +317292 +CA +M +1989 +Olegario +5 ++ +317291 +CA +M +1989 +Norbert +5 ++ +317290 +CA +M +1989 +Niles +5 ++ +317289 +CA +M +1989 +Nikola +5 ++ +407427 +CA +M +2022 +Zylo +5 +407428 rows × 5 columns
-++ --++-In [49]:+In [51]:-- +-# Create a Series of the length of each name -babyname_lengths = babynames["Name"].str.len() ++- -# Create a Series of the length of each name. +babyname_lengths = babynames["Name"].str.len() -# Add a column named "name_lengths" that includes the length of each name -babynames["name_lengths"] = babyname_lengths +# Add a column named "name_lengths" that includes the length of each name. +babynames["name_lengths"] = babyname_lengths babynames.head(5)- -- -- --Out[49]:- - - -+-Out[51]:+--- -
- - - -- State -Sex -Year -Name -Count -name_lengths -- -0 -CA -F -1910 -Mary -295 -4 -- -1 -CA -F -1910 -Helen -239 -5 -- -2 -CA -F -1910 -Dorothy -220 -7 -- -3 -CA -F -1910 -Margaret -163 -8 -- - + +4 -CA -F -1910 -Frances -134 -7 -+ + + ++ State +Sex +Year +Name +Count +name_lengths ++ +0 +CA +F +1910 +Mary +295 +4 ++ +1 +CA +F +1910 +Helen +239 +5 ++ +2 +CA +F +1910 +Dorothy +220 +7 ++ +3 +CA +F +1910 +Margaret +163 +8 ++ +4 +CA +F +1910 +Frances +134 +7 +-++-In [50]:+In [52]:-- +-# Sort by the temporary column -babynames = babynames.sort_values(by="name_lengths", ascending=False) ++- -# Sort by the temporary column. +babynames = babynames.sort_values(by="name_lengths", ascending=False) babynames.head(5)- -- -- --Out[50]:- - - -+-Out[52]:+--- -
- - - -- State -Sex -Year -Name -Count -name_lengths -- -334166 -CA -M -1996 -Franciscojavier -8 -15 -- -337301 -CA -M -1997 -Franciscojavier -5 -15 -- -339472 -CA -M -1998 -Franciscojavier -6 -15 -- -321792 -CA -M -1991 -Ryanchristopher -7 -15 -- - + +327358 -CA -M -1993 -Johnchristopher -5 -15 -+ + + ++ State +Sex +Year +Name +Count +name_lengths ++ +334166 +CA +M +1996 +Franciscojavier +8 +15 ++ +337301 +CA +M +1997 +Franciscojavier +5 +15 ++ +339472 +CA +M +1998 +Franciscojavier +6 +15 ++ +321792 +CA +M +1991 +Ryanchristopher +7 +15 ++ +327358 +CA +M +1993 +Johnchristopher +5 +15 +-++-In [51]:+In [53]:-- +-# Drop the 'name_length' column -babynames = babynames.drop("name_lengths", axis="columns") ++- -# Drop the 'name_length' column. +babynames = babynames.drop("name_lengths", axis="columns") babynames.head(5)-- --- --Out[51]:- - - -+-Out[53]:+--- -
- - - -- State -Sex -Year -Name -Count -- -334166 -CA -M -1996 -Franciscojavier -8 -- -337301 -CA -M -1997 -Franciscojavier -5 -- -339472 -CA -M -1998 -Franciscojavier -6 -- -321792 -CA -M -1991 -Ryanchristopher -7 -- - + +327358 -CA -M -1993 -Johnchristopher -5 -+ + + ++ State +Sex +Year +Name +Count ++ +334166 +CA +M +1996 +Franciscojavier +8 ++ +337301 +CA +M +1997 +Franciscojavier +5 ++ +339472 +CA +M +1998 +Franciscojavier +6 ++ +321792 +CA +M +1991 +Ryanchristopher +7 ++ +327358 +CA +M +1993 +Johnchristopher +5 +-+-++-In [52]:+In [54]:-- +-babynames.sort_values("Name", key=lambda x:x.str.len(), ascending=False).head() ++- -babynames.sort_values("Name", key=lambda x:x.str.len(), ascending=False).head()-- --- --Out[52]:- - - -+-Out[54]:+--- -
- - - -- State -Sex -Year -Name -Count -- -334166 -CA -M -1996 -Franciscojavier -8 -- -327472 -CA -M -1993 -Ryanchristopher -5 -- -337301 -CA -M -1997 -Franciscojavier -5 -- -337477 -CA -M -1997 -Ryanchristopher -5 -- - + +312543 -CA -M -1987 -Franciscojavier -5 -+ + + ++ State +Sex +Year +Name +Count ++ +334166 +CA +M +1996 +Franciscojavier +8 ++ +327472 +CA +M +1993 +Ryanchristopher +5 ++ +337301 +CA +M +1997 +Franciscojavier +5 ++ +337477 +CA +M +1997 +Ryanchristopher +5 ++ +312543 +CA +M +1987 +Franciscojavier +5 +-++---Approach 3: Sorting Using the
map
Function¶We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".
- ++Approach 3: Sorting Using the
map
Function¶We can also use the Python map function if we want to use an arbitrarily defined function. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurences of "ea".
-++-In [53]:+In [55]:-- +-# First, define a function to count the number of times "dr" or "ea" appear in each name ++- -# First, define a function to count +# the number of times "dr" or "ea" appear in each name. def dr_ea_count(string): - return string.count('dr') + string.count('ea') + return string.count('dr') + string.count('ea') -# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column -babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count) +# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column. +babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count) -# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork -babynames = babynames.sort_values(by="dr_ea_count", ascending=False) +# Sort the DataFrame by the new "dr_ea_count" column +# so we can see our handiwork. +babynames = babynames.sort_values(by="dr_ea_count", ascending=False) babynames.head()- -- -- --Out[53]:- - - -+-Out[55]:+--- -
- - - -- State -Sex -Year -Name -Count -dr_ea_count -- -115957 -CA -F -1990 -Deandrea -5 -3 -- -101976 -CA -F -1986 -Deandrea -6 -3 -- -131029 -CA -F -1994 -Leandrea -5 -3 -- -108731 -CA -F -1988 -Deandrea -5 -3 -- - + +308131 -CA -M -1985 -Deandrea -6 -3 -+ + + ++ State +Sex +Year +Name +Count +dr_ea_count ++ +115957 +CA +F +1990 +Deandrea +5 +3 ++ +101976 +CA +F +1986 +Deandrea +6 +3 ++ +131029 +CA +F +1994 +Leandrea +5 +3 ++ +108731 +CA +F +1988 +Deandrea +5 +3 ++ +308131 +CA +M +1985 +Deandrea +6 +3 +-++-In [54]:+In [56]:-- +-# Drop the `dr_ea_count` column -babynames = babynames.drop("dr_ea_count", axis="columns") ++- -# Drop the `dr_ea_count` column. +babynames = babynames.drop("dr_ea_count", axis="columns") babynames.head(5)-- --- --Out[54]:- - - -+-Out[56]:+--- -
- - - -- State -Sex -Year -Name -Count -- -115957 -CA -F -1990 -Deandrea -5 -- -101976 -CA -F -1986 -Deandrea -6 -- -131029 -CA -F -1994 -Leandrea -5 -- -108731 -CA -F -1988 -Deandrea -5 -- - + +308131 -CA -M -1985 -Deandrea -6 -+ + + ++ State +Sex +Year +Name +Count ++ +115957 +CA +F +1990 +Deandrea +5 ++ +101976 +CA +F +1986 +Deandrea +6 ++ +131029 +CA +F +1994 +Leandrea +5 ++ +108731 +CA +F +1988 +Deandrea +5 ++ +308131 +CA +M +1985 +Deandrea +6 +-++---Grouping¶
Group rows that share a common feature, then aggregate data across the group.
++Grouping¶
Group rows that share a common feature, then aggregate data across the group.
In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).
- - +-++-In [55]:+In [57]:-- +-# The code below uses the full babynames dataset, which is why some numbers are different relative to the diagram -babynames[["Year", "Count"]].groupby("Year").agg(sum) ++- -# The code below uses the full babynames dataset, which is why +# some numbers are different relative to the diagram. +babynames[["Year", "Count"]].groupby("Year").agg(sum)-- --- --Out[55]:- - - -+-Out[57]:+--- -
- -- Count -- - - -Year -- - -1910 -9163 -- -1911 -9983 -- -1912 -17946 -- -1913 -22094 -- -1914 -26926 -- -... -... -- -2018 -395436 -- -2019 -386996 -- -2020 -362882 -- -2021 -362582 -- - + +2022 -360023 -+ ++ Count ++ + + +Year ++ + +1910 +9163 ++ +1911 +9983 ++ +1912 +17946 ++ +1913 +22094 ++ +1914 +26926 ++ +... +... ++ +2018 +395436 ++ +2019 +386996 ++ +2020 +362882 ++ +2021 +362582 ++ +2022 +360023 +113 rows × 1 columns
-++--+There are many different aggregation functions we can use, all of which are useful in different applications.
--++-In [56]:+In [58]:-- +-# What is the earliest year in which each name appeared? -babynames.groupby("Name")[["Year"]].agg(min) ++- -# What is the earliest year in which each name appeared? +babynames.groupby("Name")[["Year"]].agg(min)- -- -- --Out[56]:- - - -+-Out[58]:+--- -
- -- Year -- - - -Name -- - -Aadan -2008 -- -Aadarsh -2019 -- -Aaden -2007 -- -Aadhav -2014 -- -Aadhini -2022 -- -... -... -- -Zymir -2020 -- -Zyon -1999 -- -Zyra -2012 -- -Zyrah -2011 -- - + +Zyrus -2021 -+ ++ Year ++ + + +Name ++ + +Aadan +2008 ++ +Aadarsh +2019 ++ +Aaden +2007 ++ +Aadhav +2014 ++ +Aadhini +2022 ++ +... +... ++ +Zymir +2020 ++ +Zyon +1999 ++ +Zyra +2012 ++ +Zyrah +2011 ++ +Zyrus +2021 +20437 rows × 1 columns
-++-In [57]:+In [59]:--# What is the largest single-year count of each name? -babynames.groupby("Name")[["Count"]].agg(max) ++