diff --git a/.gitignore b/.gitignore index cc5c9d120..5747a1fc0 100644 --- a/.gitignore +++ b/.gitignore @@ -61,3 +61,4 @@ cache/ docs/_build/ docs/_autosummary/ +docs/normal_data.csv diff --git a/.travis.yml b/.travis.yml index 630648fcd..3ec57d10c 100644 --- a/.travis.yml +++ b/.travis.yml @@ -26,7 +26,7 @@ addons: before_install: - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh - chmod +x miniconda.sh - - ./miniconda.sh -b -p $HOME/miniconda + - ./miniconda.sh -b -f -p $HOME/miniconda - export PATH=/home/travis/miniconda/bin:$PATH - conda update --yes conda @@ -35,14 +35,18 @@ install: # TODO(sam): Add --upgrade flag when it works again - python3 setup.py install +# https://docs.travis-ci.com/user/gui-and-headless-browsers/#Using-xvfb-to-Run-Tests-That-Require-a-GUI +# sam: Not exactly sure why we need to initialize a display for this but it +# helps the tutorial plots build on Travis +before_script: + - "export DISPLAY=:99.0" + - "sh -e /etc/init.d/xvfb start" + - sleep 3 # give xvfb some time to start + script: - coverage run setup.py test - - cd docs && make html-raise-on-warning && cd .. + - make docs after_success: - coveralls - bash tools/deploy_docs.sh - -cache: - directories: - - /home/travis/virtualenv/python3.4.2/ diff --git a/docs/Makefile b/docs/Makefile index fd5eddeb2..b05545c01 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -52,11 +52,13 @@ clean: rm -rf $(BUILDDIR)/* html: + mkdir -p $(BUILDDIR)/html/_images $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 
html-raise-on-warning: + mkdir -p $(BUILDDIR)/html/_images $(SPHINXBUILD) -W -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html dirhtml: diff --git a/docs/conf.py b/docs/conf.py index 34943d339..1e5fceac7 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -35,6 +35,21 @@ 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.viewcode', + # These IPython extensions allow for embedded IPython code that gets rerun + # at build time. + 'IPython.sphinxext.ipython_console_highlighting', + 'IPython.sphinxext.ipython_directive' +] + +# The following lines silence the matplotlib.use warnings since we import +# matplotlib in each ipython directive block +ipython_mplbackend = None +ipython_execlines = [ + 'import matplotlib', + 'matplotlib.use("Agg", warn=False)', + 'import numpy as np', + 'import matplotlib.pyplot as plt', + 'plt.style.use("fivethirtyeight")', ] # Config autosummary @@ -147,6 +162,7 @@ # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = [] +ipython_savefig_dir = './_build/html/_images' # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied diff --git a/docs/sample.csv b/docs/sample.csv new file mode 100644 index 000000000..ecee95a65 --- /dev/null +++ b/docs/sample.csv @@ -0,0 +1,4 @@ +x,y,z +1,10,100 +2,11,101 +3,12,102 diff --git a/docs/tutorial.rst b/docs/tutorial.rst index 7aaed0f49..20cd18c1c 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -1,16 +1,396 @@ Start Here: ``datascience`` Tutorial ==================================== -In progress. +This is a brief introduction to the functionality in +:py:mod:`datascience`. For a complete reference guide, please see +:ref:`tables-overview`. -Introduction ------------- +For other useful tutorials and examples, see: + +- `The textbook introduction to Tables`_ +- `Example notebooks`_ + +.. 
_The textbook introduction to Tables: http://data8.org/text/1_data.html#tables +.. _Example notebooks: https://github.com/deculler/TableDemos + +.. contents:: Table of Contents + :depth: 2 + :local: + +Getting Started +--------------- + +The most important functionality in the package is the :py:class:`Table` +class, which is the structure used to represent columns of data. You may load +the class with: + +.. ipython:: python + + from datascience import Table + +In the IPython notebook, type ``Table.`` followed by the TAB-key to see a list +of members. + +Note that for the Data Science 8 class we also import additional packages and +settings for all assignments and labs. This is so that plots and other available +packages mirror the ones in the textbook more closely. The exact code we use is: + +.. code-block:: python + + # HIDDEN + + import matplotlib + matplotlib.use('Agg') + from datascience import Table + %matplotlib inline + import matplotlib.pyplot as plt + import numpy as np + plt.style.use('fivethirtyeight') + +In particular, the lines involving ``matplotlib`` allow for plotting within the +IPython notebook. + +Creating a Table +---------------- + +A Table is a sequence of labeled columns of data. + +The basic Table constructor works as follows: + +.. ipython:: python + + letters = ['a', 'b', 'c', 'z'] + counts = [9, 3, 3, 1] + points = [1, 2, 2, 10] + + t = Table(columns=[letters, counts, points], + labels=['letter', 'count', 'points']) + + print(t) + +Note how the first keyword, ``columns``, specifies the contents of the table, +and how the second, ``labels``, gives a name to each column. See +:meth:`~datascience.tables.Table.__init__` for more details. + +------ + +A table could also be read from a CSV file (that can be exported from an Excel +spreadsheet, for example). Here's the content of an example file: + +.. 
ipython:: python + + cat sample.csv + +And this is how we load it in as a :class:`Table` using +:meth:`~datascience.tables.Table.read_table`: + +.. ipython:: python + + Table.read_table('sample.csv') + +CSVs from URLs are also valid inputs to +:meth:`~datascience.tables.Table.read_table`: + +.. ipython:: python + + Table.read_table('http://data8.org/text/sat2014.csv') + +------ -Basic Table Usage +For convenience, you can also initialize a Table from a dictionary of column +names using +:meth:`~datascience.tables.Table.from_columns_dict`. + +.. ipython:: python + + Table.from_columns_dict({ + 'letter': letters, + 'count': counts, + 'points': points, + }) + +This example illustrates the fact that built-in Python dictionaries don't +preserve their key order -- the dictionary keys are ordered ``'letter'``, +``'count'``, then ``'points'``, but the table columns are ordered ``'points'``, +``'count'``, then ``'letter'``. If you want to ensure the order of your +columns, use an ``OrderedDict``. + +Accessing Values +---------------- + +To access values of columns in the table, use +:meth:`~datascience.tables.Table.values`. + +.. ipython:: python + + t + + t.values('letter') + t.values('count') + + t['letter'] # This is a shorthand for t.values('letter') + +To access values by row, :meth:`~datascience.tables.Table.rows` returns a +list-like :class:`~datascience.tables.Table.Rows` object that contains +tuple-like :class:`~datascience.tables.Table.Row` objects. + +.. ipython:: python + + t.rows + t.rows[0] + + second = t.rows[1] + second + second[0] + second[1] + +To get the number of rows, use :attr:`~datascience.tables.Table.num_rows`. + +.. ipython:: python + + t.num_rows + + +Manipulating Data ----------------- -More Advanced Table Usage -------------------------- +Here are some of the most common operations on data. For the rest, see the +reference (:ref:`tables-overview`). + +Adding a column with :meth:`~datascience.tables.Table.with_column`: + +.. 
ipython:: python + + t + t.with_column('vowel?', ['yes', 'no', 'no', 'no']) + t # .with_column returns a new table without modifying the original + + t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns + +Selecting columns with :meth:`~datascience.tables.Table.select`: + +.. ipython:: python + + t.select('letter') + t.select(['letter', 'points']) + +Renaming columns with :meth:`~datascience.tables.Table.with_relabeling`: + +.. ipython:: python + + t + t.with_relabeling('points', 'other name') + t + t.with_relabeling(['letter', 'count', 'points'], ['x', 'y', 'z']) + +Selecting out rows by index with :meth:`~datascience.tables.Table.take` and +conditionally with :meth:`~datascience.tables.Table.where`: + +.. ipython:: python + + t + t.take(2) # the third row + t.take[0:2] # the first and second rows + +.. ipython:: python + + t.where('points', 2) # rows where points == 2 + t.where(t['count'] < 8) # rows where count < 8 + + t['count'] < 8 # .where actually takes in an array of booleans + t.where([False, True, True, True]) # same as the last line + +Operate on table data with :meth:`~datascience.tables.Table.sort`, +:meth:`~datascience.tables.Table.group`, and +:meth:`~datascience.tables.Table.pivot` + +.. ipython:: python + + t + t.sort('count') + t.sort('letter', descending = True) + +.. ipython:: python + + t.group('count') + + # You may pass a reducing function into the collect arg + # Note the renaming of the points column because of the collect arg + t.select(['count', 'points']).group('count', collect = sum) + +.. 
ipython:: python + + other_table = Table([ + ['married', 'married', 'partner', 'partner', 'married'], + ['Working as paid', 'Working as paid', 'Not working', 'Not working', 'Not working'], + [1, 1, 1, 1, 1] + ], + ['mar_status', 'empl_status', 'count']) + other_table + + other_table.pivot('mar_status', 'empl_status', 'count', collect = sum) + +Visualizing Data +---------------- + +We'll start with some data drawn at random from two normal distributions: + +.. ipython:: python + + normal_data = Table( + [ np.random.normal(loc = 1, scale = 2, size = 100), + np.random.normal(loc = 4, scale = 3, size = 100) ], + ['data1', 'data2'] + ) + + normal_data + +Draw histograms with :meth:`~datascience.tables.Table.hist`: + +.. ipython:: python + + @savefig hist.png width=4in + normal_data.hist() + +.. ipython:: python + + @savefig hist_binned.png width=4in + normal_data.hist(bins = range(-5, 10)) + +.. ipython:: python + + @savefig hist_overlay.png width=4in + normal_data.hist(bins = range(-5, 10), overlay = True) + +If we treat the ``normal_data`` table as a set of x-y points, we can +:meth:`~datascience.tables.Table.plot` and +:meth:`~datascience.tables.Table.scatter`: + +.. ipython:: python + + @savefig plot.png width=4in + normal_data.sort('data1').plot('data1') # Sort first to make plot nicer + +.. ipython:: python + + @savefig scatter.png width=4in + normal_data.scatter('data1') + +.. ipython:: python + + @savefig scatter_line.png width=4in + normal_data.scatter('data1', fit_line = True) + +Use :meth:`~datascience.tables.Table.barh` to display categorical data. + +.. ipython:: python + + t + @savefig barh.png width=4in + t.barh('letter') + +Exporting +--------- + +Exporting to CSV is the most common operation and can be done by first +converting to a pandas dataframe with :meth:`~datascience.tables.Table.to_df`: + +.. 
ipython:: python + + normal_data + + # index = False prevents row numbers from appearing in the resulting CSV + normal_data.to_df().to_csv('normal_data.csv', index = False) + +An Example +---------- + +We'll recreate the steps in `Chapter 3 of the textbook`_ to see if there is a +significant difference in birth weights between smokers and non-smokers using a +bootstrap test. + +For more examples, check out `the TableDemos repo`_. + +.. _Chapter 3 of the textbook: http://data8.org/text/3_inference.html#Using-the-Bootstrap-Method-to-Test-Hypotheses +.. _the TableDemos repo: https://github.com/deculler/TableDemos + +From the text: + + The table ``baby`` contains data on a random sample of 1,174 mothers and + their newborn babies. The column ``birthwt`` contains the birth weight of + the baby, in ounces; ``gest_days`` is the number of gestational days, that + is, the number of days the baby was in the womb. There is also data on + maternal age, maternal height, maternal pregnancy weight, and whether or not + the mother was a smoker. + +.. ipython:: python + + baby = Table.read_table('http://data8.org/text/baby.csv') + baby # Let's take a peek at the table + + # Select out columns we want. + smoker_and_wt = baby.select(['m_smoker', 'birthwt']) + smoker_and_wt + +Let's compare the number of smokers to non-smokers. + +.. ipython:: python + + @savefig m_smoker.png width=4in + smoker_and_wt.select('m_smoker').hist(bins = [0, 1, 2]); + +We can also compare the distribution of birthweights between smokers and +non-smokers. + +.. ipython:: python + + # Non smokers + # We do this by grabbing the rows that correspond to mothers that don't + # smoke, then plotting a histogram of just the birthweights. 
+ @savefig not_m_smoker_weights.png width=4in + smoker_and_wt.where('m_smoker', 0).select('birthwt').hist() + + # Smokers + @savefig m_smoker_weights.png width=4in + smoker_and_wt.where('m_smoker', 1).select('birthwt').hist() + +What's the difference in mean birth weight of the two categories? + +.. ipython:: python + + nonsmoking_mean = smoker_and_wt.where('m_smoker', 0).values('birthwt').mean() + smoking_mean = smoker_and_wt.where('m_smoker', 1).values('birthwt').mean() + + observed_diff = nonsmoking_mean - smoking_mean + observed_diff + +Let's do the bootstrap test on the two categories. + +.. ipython:: python + + num_nonsmokers = smoker_and_wt.where('m_smoker', 0).num_rows + def bootstrap_once(): + """ + Computes one bootstrapped difference in means. + The table.sample method lets us take random samples. + We then split according to the number of nonsmokers in the original sample. + """ + resample = smoker_and_wt.sample(with_replacement = True) + bootstrap_diff = resample.values('birthwt')[:num_nonsmokers].mean() - \ + resample.values('birthwt')[num_nonsmokers:].mean() + return bootstrap_diff + + repetitions = 1000 + bootstrapped_diff_means = np.array( + [ bootstrap_once() for _ in range(repetitions) ]) + + bootstrapped_diff_means[:10] + + num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum() + p_value = num_diffs_greater / len(bootstrapped_diff_means) + p_value + Drawing Maps ------------ +To come.
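
Note on the bootstrap example added to docs/tutorial.rst: the procedure (pool the rows, resample with replacement, split by the original number of non-smokers, compare the difference in means with the observed one) can be checked without the ``datascience`` package or network access. The following is a minimal sketch in plain NumPy, not the tutorial's own code. The weights are synthetic stand-ins generated here for illustration; the group sizes, means, and spreads are assumptions and do not come from ``baby.csv``.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, NOT taken from baby.csv).
rng = np.random.default_rng(0)
nonsmoker_wt = rng.normal(loc=123, scale=17, size=700)
smoker_wt = rng.normal(loc=114, scale=18, size=450)

observed_diff = nonsmoker_wt.mean() - smoker_wt.mean()

# Pool both groups, mirroring how the tutorial resamples the whole table.
pooled = np.concatenate([nonsmoker_wt, smoker_wt])
num_nonsmokers = len(nonsmoker_wt)


def bootstrap_once():
    """Resample the pooled weights with replacement, then split by group size."""
    resample = rng.choice(pooled, size=len(pooled), replace=True)
    return resample[:num_nonsmokers].mean() - resample[num_nonsmokers:].mean()


repetitions = 1000
bootstrapped_diffs = np.array([bootstrap_once() for _ in range(repetitions)])

# Two-sided p-value: fraction of resampled differences more extreme than observed.
p_value = (np.abs(bootstrapped_diffs) > np.abs(observed_diff)).mean()
print(p_value)
```

Because the resample is drawn from the pooled data, the group labels carry no information in each replicate, so the resampled differences approximate what would be seen if smoking had no effect on birth weight.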