\n",
- "The official Python documentation includes a discussion of Py2 vs. Py3, including guidance on which to use. \n",
+ "
\n",
+ "The Standard Library refers to everything that is part of a standard version and installation of Python.\n",
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "
\n",
+ "The Python \n",
+ "
Standard Library\n",
+ "comes with a lot of basic functionality. \n",
"
"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Part of what makes Python a powerful language is the standard library itself, which is a rich set of tools for programming. However, the standard library does not include data science tools, and much of the power of Python stems from a rich ecosystem of packages that can be added and used with Python. "
+ ]
+ },
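As a small illustrative sketch (not part of the original notebooks), here is the kind of basic functionality the standard library provides out of the box, with no extra installs:

```python
# A few standard library modules, available in any Python install
import math        # mathematical functions
import random      # random number generation
import datetime    # dates and times

print(math.sqrt(16))                   # square root
print(datetime.date(2020, 1, 1).year)  # extract a component of a date
print(random.choice(['a', 'b', 'c']))  # pick a random element
```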
{
"cell_type": "markdown",
"metadata": {},
@@ -115,21 +159,34 @@
"metadata": {},
"source": [
"
\n",
- "Packages are basically just collections of code. The anaconda distribution comes with all the core packages you will need for this class. \n",
+ "Packages are collections of code. Packages from outside the standard library can be installed and added to Python.\n",
"
\n",
"\n",
"
\n",
- "For getting other packes, anaconda comes with\n",
+ "For managing and installing packages, Anaconda comes with the \n",
"
conda\n",
- "a package manager, with support for downloading and installing other packages.\n",
+ "package manager.\n",
"
"
]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Scientific Python\n",
+ "\n",
+ "When we say that Python is good for data science and scientific computing, what we really mean is that there is a rich ecosystem of available open-source external packages that greatly expand the capabilities of the language beyond the standard library. \n",
+ "\n",
+ "This set of packages, which we will introduce as we go through these materials, is sometimes referred to as 'Scientific Python', or the 'Scipy' ecosystem. \n",
+ "\n",
+ "For the purposes of these materials, the Anaconda distribution that we are using contains all the packages you need. "
+ ]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
- "display_name": "Python [default]",
+ "display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -143,7 +200,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.3"
+ "version": "3.7.4"
}
},
"nbformat": 4,
diff --git a/04-DataSciencePython.ipynb b/04-DataSciencePython.ipynb
index 3aa2629..5010181 100644
--- a/04-DataSciencePython.ipynb
+++ b/04-DataSciencePython.ipynb
@@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Data Science in Python\n",
+ "# Scientific Python\n",
"\n",
"Python has a large number of tools available for doing data science. \n",
"\n",
- "The core of Data Science in Python revolves around some a set of core modules, typically comprising {scipy, numpy, pandas, matplotlib and scikit-learn}. \n",
+ "The core of Data Science in Python revolves around a set of core modules, typically comprising {`scipy`, `numpy`, `pandas`, `matplotlib` and `scikit-learn`}. \n",
"\n",
"Here we will explore the basics of these modules and what they do. "
]
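As a minimal sketch of how these core modules fit together (assuming they are installed, e.g. via the Anaconda distribution; the column name `x` is made up for illustration):

```python
# Core data science modules, by their conventional import aliases
import numpy as np
import pandas as pd

# A tiny DataFrame, and a numpy computation on one of its columns
data = pd.DataFrame({'x': [1, 2, 3]})
print(np.mean(data['x']))  # the mean of column 'x'
```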
@@ -100,7 +100,7 @@
],
"source": [
"# Let's flip a bunch of coins!\n",
- "coin_flips = [sts.bernoulli.rvs(0.5) for i in range(100)]\n",
+ "coin_flips = [sts.bernoulli.rvs(0.5) for ind in range(100)]\n",
"print('The first ten coin flips are: ', coin_flips[:10])\n",
"print('The percent of heads from this sample is: ', sum(coin_flips) / len(coin_flips) * 100, '%')"
]
diff --git a/05-DataGathering.ipynb b/05-DataGathering.ipynb
index 1536f13..a8916f8 100644
--- a/05-DataGathering.ipynb
+++ b/05-DataGathering.ipynb
@@ -12,7 +12,7 @@
"metadata": {},
"source": [
"
\n",
- "Data Gathering is simply the process of collecting your data together.\n",
+ "Data Gathering is the process of accessing data and collecting it together.\n",
"
"
]
},
@@ -20,13 +20,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This notebook covers strategies you can use to gather data for an analysis. \n",
+ "This notebook covers strategies for finding and gathering data.\n",
"\n",
- "If you want to move on to first working on data analyses (with provided data) you can move onto the next tutorials, and come back to this one later.\n",
+ "If you want to start by working on data analyses (with provided data) you can move onto the next tutorials, and come back to this one later.\n",
"\n",
- "Data gathering can encompass anything from launching a data collection project, web scraping, pulling from a database, downloading data in bulk. \n",
- "\n",
- "It might even include simply calling someone to ask if you can use some of their data. "
+ "Data gathering can encompass many different strategies, including data collection, web scraping, accessing data from databases, and downloading data in bulk. Sometimes it even includes things like calling someone to ask if you can use some of their data, and asking them to send it over. "
]
},
{
@@ -35,32 +33,49 @@
"source": [
"## Where to get Data\n",
"\n",
- "### The Web \n",
+ "There are lots of ways to get data, and lots of places to get it from. Typically, most of this data will be accessed through the internet, in one way or another, especially when pursuing independent research projects. \n",
+ "\n",
+ "### Institutional Access\n",
"\n",
- "The web is absolutely full of data or ways to get data, either by hosting **data repositories** from which you can download data, by offering **APIs** through which you can request specific data from particular applications, or as data itself, such that you can use **web scraping** to extract data directly from websites. \n",
+ "If you are working with data as part of an institution, such as a company or research lab, the institution will typically have data it collects in various ways and needs analyzed. Keep in mind that even people working inside institutions, with access to local data, will still seek to find and incorporate external datasets. \n",
"\n",
- "### Other than the Web\n",
+ "### Data Repositories\n",
"\n",
- "Not all data is indexed or accessible on the web, at least not publicly. \n",
+ "**Data repositories** are databases from which you can download data. Some data repositories allow you to explore available datasets and download datasets in bulk. Others may also offer **APIs**, through which you can request specific data from particular databases.\n",
"\n",
- "Sometimes finding data means chasing down data wherever it might be. \n",
+ "### Web Scraping\n",
"\n",
- "If there is some particular data you need, you can try to figure out who might have it, and get in touch to see if it might be available.\n",
+ "The web itself is full of unstructured data. **Web scraping** can be used to extract and collect data directly from websites.\n",
"\n",
+ "### Asking People for Data\n",
+ "\n",
+ "Not all data is indexed or accessible on the web, at least not publicly. Sometimes finding data means figuring out if any data is available, figuring out where it might be, and then reaching out and asking people directly about data access. If there is some particular data you need, you can try to figure out who might have it, and get in touch to see if it might be available."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
"### Data Gathering Skills\n",
+ "\n",
"Depending on your gathering method, you will likely have to do some combination of the following:\n",
- "- Download data files from repositories\n",
- "- Read data files into python\n",
- "- Use APIs \n",
- "- Query databases\n",
- "- Call someone and ask them to send you a harddrive"
+ "\n",
+ "- Directly download data files from repositories\n",
+ "- Query databases & use APIs to extract and collect data of interest\n",
+ "- Ask people for data, perhaps going to pick it up with a hard drive\n",
+ "\n",
+ "Ultimately, the goal is to collect and curate data files, hopefully structured, that you can read into Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Data Repositories"
+ "## Definitions: Databases & Query Languages\n",
+ "\n",
+ "Here, we will introduce some useful definitions you will likely encounter when exploring how to gather data. \n",
+ "\n",
+ "Other than these definitions, we will not cover databases & query languages more in these tutorials. "
]
},
{
@@ -68,12 +83,7 @@
"metadata": {},
"source": [
"
\n",
- "A Data Repository is basically just a place that data is stored. For our purposes, it is a place you can download data from. \n",
- "
\n",
- "\n",
- "
\n",
- "There is a curated list of good data source included in the \n",
- "
project materials.\n",
+ "A database is an organized collection of data. More formally, 'database' refers to a set of related data, and the way it is organized. \n",
"
"
]
},
@@ -81,7 +91,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Databases"
+ "
\n",
+ "A query language is a language for interacting with databases, such as retrieving, and sometimes modifying, the information they contain.\n",
+ "
"
]
},
{
@@ -89,7 +101,16 @@
"metadata": {},
"source": [
"
\n",
- "A database is an organized collection of data. More formally, 'database' refers to a set of related data, and the way it is organized. \n",
+ "SQL (pronounced 'sequel') is a common query language used to interact with databases, and request data.\n",
+ "
\n",
+ "\n",
+ "
\n",
+ "If you are interested, there is a useful introduction and tutorial to SQL\n",
+ "
here\n",
+ "as well as some useful 'cheat sheets' \n",
+ "
here\n",
+ "and\n",
+ "
here.\n",
"
"
]
},
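Although these tutorials do not cover SQL further, a query can be sketched entirely in the standard library with `sqlite3`, using an in-memory database (the table and column names here are made up for illustration):

```python
import sqlite3

# An in-memory SQLite database: run SQL queries without any setup
con = sqlite3.connect(':memory:')
cur = con.cursor()

# Create a table, insert some rows, and query them back
cur.execute('CREATE TABLE subjects (id INTEGER, age INTEGER)')
cur.executemany('INSERT INTO subjects VALUES (?, ?)', [(1, 20), (2, 25)])
rows = cur.execute('SELECT id, age FROM subjects WHERE age > 21').fetchall()
print(rows)  # rows matching the WHERE clause

con.close()
```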
@@ -97,7 +118,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Structured Query Language - SQL"
+ "## Data Repositories"
]
},
{
@@ -105,16 +126,12 @@
"metadata": {},
"source": [
"
\n",
- "SQL (pronounced 'sequel') is a language used to 'communicate' with databases, making queries to request particular data from them.\n",
+ "A Data Repository is basically just a place that data is stored. For our purposes, it is a place you can download data from. \n",
"
\n",
"\n",
"
\n",
- "There is a useful introduction and tutorial to SQL\n",
- "
here\n",
- "as well as some useful 'cheat sheets' \n",
- "
here\n",
- "and\n",
- "
here.\n",
+ "There is a curated list of good data sources included in the \n",
+ "
project materials.\n",
"
"
]
},
@@ -122,11 +139,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "SQL is the standard, and most popular, way to interface with relational databases.\n",
- "\n",
- "Note: None of the rest of the tutorials presume or require any knowledge of SQL. \n",
- "\n",
- "You can look into it if you want, or if it is relevant to accessing some data you want to analyze, but it is not required for this set of tutorials. "
+ "For our purposes, data repositories are places you can download data directly from, for example [data.gov](https://www.data.gov/)."
]
},
{
@@ -149,6 +162,12 @@
"
here\n",
"or for a much broader, more technical, overview try\n",
"
here.\n",
+ "
\n",
+ "\n",
+ "\n",
- "Python has some basic tools for I/O (input / output). \n",
- "
\n",
- "\n",
"\n",
- "There are many different standardized (and un-standardized) file types in which data may be stored. Here, we will start by examing CSV and JSON files. \n",
- "
"
+ "## File types\n",
+ "\n",
+ "There are many different file types in which data may be stored. \n",
+ "\n",
+ "Here, we will start by examining CSV and JSON files. "
]
},
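Both formats can be parsed with the standard library alone; the sketch below uses in-memory text as a stand-in for real files (the field names are made up for illustration):

```python
import csv
import io
import json

# CSV: tabular rows of comma-separated values
csv_text = 'name,age\nAlice,20\nBob,25\n'
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[0])  # the header row

# JSON: nested key-value data
json_text = '{"name": "Alice", "age": 20}'
record = json.loads(json_text)
print(record['age'])  # access a field of the parsed record
```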
{
@@ -210,7 +207,7 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -225,7 +222,7 @@
],
"source": [
"# Let's have a look at a csv file (printed out in plain text)\n",
- "!cat files/dat.csv"
+ "!cat files/data.csv"
]
},
{
@@ -237,10 +234,8 @@
},
{
"cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "collapsed": true
- },
+ "execution_count": 18,
+ "metadata": {},
"outputs": [],
"source": [
"# Python has a module devoted to working with csv's\n",
@@ -249,7 +244,7 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -264,8 +259,8 @@
],
"source": [
"# We can read through our file with the csv module\n",
- "with open('files/dat.csv') as csvfile:\n",
- " csv_reader = csv.reader(csvfile, delimiter=',')\n",
+ "with open('files/data.csv') as csv_file:\n",
+ " csv_reader = csv.reader(csv_file, delimiter=',')\n",
" for row in csv_reader:\n",
" print(', '.join(row))"
]
@@ -279,10 +274,8 @@
},
{
"cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "collapsed": true
- },
+ "execution_count": 20,
+ "metadata": {},
"outputs": [],
"source": [
"# Pandas also has functions to directly load csv data\n",
@@ -291,13 +284,26 @@
},
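As a sketch of the pandas loading step (assuming pandas is installed; `read_csv` also accepts file-like objects, so an in-memory stand-in with made-up column names is used here instead of a file on disk):

```python
import io

import pandas as pd

# read_csv accepts a path or any file-like object
csv_text = 'first_name,age\nAlice,20\nBob,25\n'
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (rows, columns) of the loaded data
print(list(df.columns))  # column names taken from the header row
```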
{
"cell_type": "code",
- "execution_count": 11,
+ "execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"