Python for Data Analysis

Wes McKinney

September 2017


Python 2.7 | Python 3.6

  • Python 2.7 is the default on my machine.
  • To switch to Python 3.6 via the terminal:
source activate python3
  • To switch back to Python 2.7 via the terminal:
source deactivate
  • To view all environments (i.e. Python 2.7 and Python 3):
conda env list

Import Conventions

The Python community has adopted the following naming conventions for commonly-used modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
  • Thus, functions will be accessed via something similar to np.arange.
    • This is done instead of importing everything with from numpy import *, which is considered bad practice.

Terminology

Munge/Munging/Wrangling: Describes the overall process of manipulating unstructured and/or messy data into a structured or clean form.

Pseudocode: A description of an algorithm or process that takes a code-like form while likely not being actual valid source code.

  • Example:
If student's grade is greater than or equal to 60
    print 'passed'
else
    print 'failed'

Syntactic sugar: Programming syntax which does not add new features, but makes something more convenient or easier to type.
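  • For example, Python's augmented assignment is syntactic sugar for an ordinary assignment:
x = 5
x += 1     # syntactic sugar for the line below
x = x + 1  # the equivalent long form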

DataFrame: The main pandas data structure is the DataFrame, which you can think of as representing a table or spreadsheet of data.

Vectorization: Batch operations on data (arrays) without writing any for loops.

  • Any arithmetic operations between equal-size arrays applies the operation elementwise.
  • Operations between differently sized arrays are called broadcasting.
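  • A minimal sketch of both ideas (array values are illustrative):
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr * arr                        # elementwise: each value multiplied by itself, no loop
arr + np.array([10., 20., 30.])  # broadcasting: the 1D array is added to each row of arr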

Universal Functions: A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays.


Chapter 3: IPython: An Interactive Computing and Development Environment

Launching IPython on the command line:

$ ipython

Tab Completion: A feature common to most interactive data analysis environments. While entering expressions in the shell, pressing <Tab> will search the namespace for any variable (objects, functions, etc.) matching the characters you have typed so far:

an_apple = 27
an_example = 42
an<Tab>
    an_apple and an_example
  • You can also complete methods and attributes on any object after typing a period:
b = [1, 2, 3]
b.<Tab>
    b.append    b.extend    b.insert    b.remove    b.sort
    b.count     b.index     b.pop       b.reverse
  • The same goes for modules:
import datetime
datetime.<Tab>
    datetime.date   datetime.MAXYEAR    datetime.timedelta
    datetime.time   datetime.MINYEAR    datetime.tzinfo
  • IPython by default hides methods and attributes starting with underscores, such as magic methods and internal 'private' methods and attributes. To access those via tab completion:
import datetime
datetime._<Tab>
    datetime.__doc__    datetime.__name__
    datetime.__file__   datetime.__package__
  • Tab completion also completes anything that looks like a file path on your computer's file system matching what you've typed:
path = 'book_scripts/<Tab>
    book_scripts/cprof_example.py   book_scripts/ipython_script_test.py
    book_scripts/ipython_bug.py     book_scripts/prof_mod.py

Object Introspection: Typing a ? before or after a variable will display some general information about the object:

a = [1, 2, 3]
a?
  • Using ?? will also show the function's source code if possible.

The %run command: To run a python script within the IPython console:

%run fivethirtyeight.py
  • This works identically to running $ python fivethirtyeight.py on the command line.
  • Once you've run a .py file, all of the variables (imports, functions, and globals) defined in the file will be accessible in the IPython shell.

Interrupting Running Code: Pressing <Ctrl-C> while any code is running will cause nearly all Python programs to stop.

Executing Code From the Clipboard: To execute code copied on the clipboard:

%paste

Keyboard Shortcuts:

| Command | Description |
| --- | --- |
| Ctrl-p or up-arrow | Search backward in command history |
| Ctrl-n or down-arrow | Search forward in command history |
| Ctrl-r | Readline-style reverse history search |
| Command-v | Paste text from clipboard |
| Ctrl-c | Interrupt currently-executing code |
| Ctrl-a | Move cursor to the beginning of the line |
| Ctrl-e | Move cursor to the end of the line |
| Ctrl-k | Delete text from cursor until end of line |
| Ctrl-u | Discard all text on current line |
| Ctrl-f | Move cursor forward one character |
| Ctrl-b | Move cursor back one character |
| Ctrl-l | Clear screen |

Magic Commands: Designed to facilitate common tasks and enable you to easily control the behavior of the IPython system.

| Command | Description |
| --- | --- |
| %quickref | Display the IPython Quick Reference Card |
| %magic | Display detailed documentation for all of the available magic commands |
| %debug | Enter the interactive debugger at the bottom of the last exception traceback |
| %hist | Print command input (and optionally output) history |
| %pdb | Automatically enter debugger after any exception |
| %paste | Execute pre-formatted Python code from clipboard |
| %cpaste | Open a special prompt for manually pasting Python code to be executed |
| %reset | Delete all variables/names defined in interactive namespace |
| %page OBJECT | Pretty print the object and display it through a pager |
| %run script.py | Run a Python script inside IPython |
| %prun statement | Execute statement with cProfile and report the profiler output |
| %time statement | Report the execution time of a single statement |
| %timeit statement | Run a statement multiple times to compute an ensemble average execution time |
| %who, %who_ls, %whos | Display variables defined in interactive namespace, with varying levels of information/verbosity |
| %xdel variable | Delete a variable and attempt to clear any references to the object in the IPython internals |
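  • For example, the timing magics can be applied to any single statement inside IPython (a small sketch; the array and its size are arbitrary):
import numpy as np

a = np.random.randn(1000)
%time a.sum()    # times one execution of the statement
%timeit a.sum()  # runs the statement many times and reports an average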

Matplotlib Integration and Pylab Mode: If you create a matplotlib plot window in the regular IPython shell, you'll be sad to find that the GUI event loop 'takes control' of the IPython session until the window is closed. To avoid this, launch IPython with matplotlib integration on the command line:

$ ipython --pylab

Input and Output Variables: IPython stores references to both the input (the text that you type) and output (the object that is returned) in special variables:

  • The previous output is stored in the _ (underscore).
  • Input variables are stored in variables named _iX, where X is the input line number.
  • Thus, you can access any input or output via their line number:
_i27 # Returns the input at line 27
_27  # Returns the output at line 27

Logging the Input and Output: IPython is capable of logging the entire console session including input and output. Logging is turned on by typing:

%logstart
  • Companion functions are: %logoff, %logon, %logstate, and %logstop

Bookmarks: IPython has a simple directory bookmarking system to enable you to bookmark common directories so you can jump to them very easily. For example, if you use Dropbox a lot, it would make sense to bookmark that file path:

%bookmark db /home/wesm/Dropbox/
  • Then, whenever you need to quickly navigate to Dropbox you can:
cd db
  • Bookmarks are automatically persisted between IPython sessions.

Changing Working Directory

  • To change the working directory in IPython:
import os
os.chdir('/Users/chrisfeller/Desktop/Python_Code')

Viewing file contents in IPython:

  • To quickly view a file's contents in IPython:
!cat file_name.csv

Chapter 4: Numpy Basics: Arrays and Vectorized Computation

Numpy: Short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis. Here are some of the things it provides:

  • ndarray: a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities
  • Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations
  • Common array algorithms like sorting, unique, and set operations
  • Efficient descriptive statistics and aggregating/summarizing data
  • Data alignment and relational data manipulations for merging and joining together heterogeneous data sets
  • Expressing conditional logic as array expressions instead of loops with if-elif-else branches
  • Group-wise data manipulations (aggregation, transformation, function application).
    • While NumPy provides the computational foundation for these operations, you will likely want to use pandas as your basis for most kinds of data analysis (especially for structured or tabular data).

The NumPy ndarray: A Multidimensional Array Object: an N-dimensional array object, which is a fast, flexible container for large data sets in Python.

  • An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type.
  • Every ndarray has a shape, a tuple indicating the size of each dimension: data.shape
  • And a dtype, which describes the data type of the array: data.dtype
  • In most cases, 'array', 'NumPy array', and 'ndarray' are synonymous.
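  • A short sketch of both attributes (the reported dtype may vary by platform):
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
data.shape # (2, 3): two rows, three columns
data.dtype # dtype('int64') on most 64-bit platforms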

Creating ndarrays: The easiest way to create an array is to use the array function:

data = [6, 7.5, 8, 0, 1]
arr = np.array(data)
  • Nested sequences, like a list of equal-length lists will be converted into a multidimensional array.
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
  • zeros creates an array of 0's with a given length or shape:
np.zeros(10)
 # OR
np.zeros((3, 6))
  • ones creates an array of 1's with a given length or shape:
np.ones(10)
 # OR
np.ones((3, 6))
  • empty creates an array of random placeholder values:
np.empty((2, 3))
  • arange is an array-valued version of the built-in Python range function:
np.arange(15)

Specifying Data Types for ndarrays: To specify the data type or dtype for an array:

arr = np.array([1, 2, 3], dtype=np.float64)

Change dtype of array: To explicitly convert or cast an array from one dtype to another:

arr = np.array([1, 2, 3, 4, 5])
float_arr = arr.astype(np.float64)

Operations Between Arrays: Any arithmetic operations between equal-size arrays applies the operation elementwise:

arr = np.array([[1, 2, 3], [4, 5, 6]])
arr * arr
arr - arr
  • Arithmetic operations with scalars (a single number or non-vector item) are as you would expect, propagating the value to each element:
arr = np.array([[1, 2, 3], [4, 5, 6]])
1 / arr
arr ** 0.5

Basic Indexing and Slicing:

  • One-dimensional arrays act similarly to Python lists:
arr = np.arange(10)
arr[5] # Selects item at index 5
arr[5:8] # Selects items between index 5 and 7
  • If you assign a scalar value to a slice, the value is propagated (or broadcast) to the entire selection:
arr = np.arange(10)
arr[5:8] = 12 # returns array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])
  • An important first distinction from lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array (see the demonstration at the end of this list).
    • If you want to copy a slice of an ndarray instead of a view, you will need to explicitly copy the array:
arr[5:8].copy()
  • In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2] # returns array([7, 8, 9])
  • To access an individual element in a 2D array:
arr2d[0][2]
 # OR
arr2d[0, 2]
  • Helpful diagram on Page 86
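  • A short demonstration that slices are views:
arr = np.arange(10)
arr_slice = arr[5:8]  # a view, not a copy
arr_slice[1] = 12345  # modifying the view...
arr                   # ...changes the source: array([0, 1, 2, 3, 4, 5, 12345, 7, 8, 9])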

Boolean Indexing: To return a boolean for a certain condition:

names = np.array(['Bob', ' Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names == 'Bob' # returns array([ True, False, False,  True, False, False, False], dtype=bool)
  • To select everything but Bob, use !=:
names != 'Bob'
  • To select two of the three names, combine multiple boolean conditions using boolean arithmetic operators like & (and) and | (or):
mask = (names == 'Bob') | (names == 'Will')
  • The Python keywords and and or do not work with boolean arrays.
  • To set values with boolean arrays:
names[names == 'Bob'] = 'Chris'

Fancy Indexing: Term adopted by NumPy to describe indexing using integer arrays:

arr = np.empty((8, 4)) # Create an 8x4 array of placeholder values.

for i in range(8):
    arr[i] = i         # Fill each row with its row number, 0-7

arr[[4, 3, 0, 6]]      # Select the 4th, 3rd, 0th, and 6th rows from the array

Reshaping Arrays: To reshape an existing array:

arr = np.arange(10)
arr = arr.reshape((2, 5)) # Returns a 2x5 array

Transposing Arrays and Swapping Axes: Arrays have the transpose method and also the special T attribute.:

arr = np.arange(15).reshape((3, 5))
arr.T # Returns a 5x3 array
 # OR
arr.transpose() # Also returns a 5x3 array
  • Simple transposing with .T is just a special case of swapping axes.

Universal Functions: Fast Element-wise Array Functions: A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. Examples:

arr = np.arange(10)
np.sqrt(arr) # Returns an array with the square root of each element in arr
np.add(arr, 1) # Returns an array with each element in arr plus 1

| Function (unary) | Description |
| --- | --- |
| abs, fabs | Compute the absolute value element-wise for integer, floating point, or complex values. Use fabs as a faster alternative for non-complex-valued data |
| sqrt | Compute the square root of each element. Equivalent to arr ** 0.5 |
| square | Compute the square of each element. Equivalent to arr ** 2 |
| exp | Compute the exponent e^x of each element |
| log, log10, log2, log1p | Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively |
| sign | Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative) |
| ceil | Compute the ceiling of each element, i.e. the smallest integer greater than or equal to each element |
| floor | Compute the floor of each element, i.e. the largest integer less than or equal to each element |
| rint | Round elements to the nearest integer, preserving the dtype |
| modf | Return fractional and integral parts of array as separate arrays |
| isnan | Return a boolean array indicating whether each value is NaN (Not a Number) |
| isfinite, isinf | Return boolean arrays indicating whether each element is finite or infinite, respectively |
| cos, cosh, sin, sinh, tan, tanh | Regular and hyperbolic trigonometric functions |
| arccos, arccosh, arcsin, arcsinh, arctan, arctanh | Inverse trigonometric functions |
| logical_not | Compute truth value of not x element-wise. Equivalent to ~arr |

| Function (binary) | Description |
| --- | --- |
| add | Add corresponding elements in arrays |
| subtract | Subtract elements in second array from first array |
| multiply | Multiply array elements |
| divide, floor_divide | Divide or floor divide (truncating the remainder) |
| power | Raise elements in first array to powers indicated in second array |
| maximum, fmax | Element-wise maximum. fmax ignores NaN |
| minimum, fmin | Element-wise minimum. fmin ignores NaN |
| mod | Element-wise modulus (remainder of division) |
| copysign | Copy sign of values in second argument to values in first argument |
| greater, greater_equal, less, less_equal, equal, not_equal | Perform element-wise comparison, yielding boolean array |
| logical_and, logical_or, logical_xor | Compute element-wise truth value of logical operation |

Expressing Conditional Logic as Array Operations with np.where: The np.where function is a vectorized version of the ternary expression x if condition else y.

arr = np.random.randn(4, 4) # Creates 4x4 array w/ values from normal distribution
np.where(arr > 0, 2, -2) # If an element is greater than 0, then 2, else -2.
  • Nested where expression (for more complicated logic):
np.where(cond1 & cond2, 0, np.where(cond1, 1, np.where(cond2, 2, 3)))

Mathematical and Statistical Methods: A set of mathematical functions which compute statistics about an entire array or about the data along an axis are accessible as array methods.

  • Aggregations (often called reductions) like sum, mean, and std can either be used by calling the array instance method or using the top-level NumPy function:
arr = np.random.randn(5, 4) # 5x4 array of normally-distributed data
arr.mean()
    # OR
np.mean(arr)
  • Functions like mean and sum take an optional axis argument, which computes the statistic over the given axis.
arr.mean(axis=1) # Returns the mean of each row
arr.sum(axis=0)  # Returns the sum of each column

| Method | Description |
| --- | --- |
| sum | Sum of all elements in the array or along an axis |
| mean | Arithmetic mean |
| std, var | Standard deviation and variance, respectively, with optional degrees-of-freedom adjustment (default denominator n) |
| min, max | Minimum and maximum |
| argmin, argmax | Indices of minimum and maximum elements, respectively |
| cumsum | Cumulative sum of elements starting from 0 |
| cumprod | Cumulative product of elements starting from 1 |

Methods for Boolean Arrays: Boolean values are coerced to 1 (True) and 0 (False) in the above methods. Thus, sum is often used as a means of counting True values in a boolean array.

arr = np.random.randn(100)
(arr > 0).sum() # Returns the number of positive values
  • There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True.
bools = np.array([False, False, True, False])
bools.any() # Returns True because there is a True value in bools
bools.all() # Returns False because not all values in bools are True
  • These methods also work with non-boolean arrays, where non-zero elements evaluate to True.

Sorting: Like Python's built-in list type, NumPy arrays can be sorted in-place with the sort method, or out-of-place with the top-level np.sort function, which returns a sorted copy:

arr = np.random.randn(8)
np.sort(arr) # Returns a sorted copy of arr
arr.sort()   # Sorts arr in-place and does not return anything
  • Multidimensional arrays can have each 1D section of values sorted in-place along an axis by passing the axis number to sort:
arr = np.random.randn(5, 3)
arr.sort(1) # Sorts arr by each row
  • A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank:
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile

Unique and Other Set Logic: The most commonly used basic set operation in NumPy is np.unique, which returns the sorted unique values in an array:

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names) # Returns array(['Bob', 'Joe', 'Will'], dtype='|S4')

| Method | Description |
| --- | --- |
| unique(x) | Compute the sorted, unique elements in x |
| intersect1d(x, y) | Compute the sorted, common elements in x and y |
| union1d(x, y) | Compute the sorted union of elements |
| in1d(x, y) | Compute a boolean array indicating whether each element of x is contained in y |
| setdiff1d(x, y) | Set difference, elements in x that are not in y |
| setxor1d(x, y) | Set symmetric differences; elements that are in either of the arrays, but not both |

Saving and Loading Text Files: While the majority of loading and saving will be done through pandas, you can load files in NumPy via:

arr = np.loadtxt('array_ex.txt', delimiter=',')
  • To save a numpy array:
np.savetxt('array_output.txt', arr)

Linear Algebra: To compute the matrix product of two two-dimensional arrays:

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x.dot(y)
 # OR
np.dot(x, y)
  • More linear algebra on Page 102

Random Number Generators: The numpy.random module supplements the built-in Python random with functions for efficiently generating whole arrays or sample values from many kinds of probability distributions.

  • For example, you can get a 4 by 4 array of samples from the standard normal distribution using normal:
samples = np.random.normal(size=(4,4))
  • Python's built-in random, by contrast, only samples one value at a time.

| Function | Description |
| --- | --- |
| seed | Seed the random number generator |
| permutation | Return a random permutation of a sequence, or return a permuted range |
| shuffle | Randomly permute a sequence in place |
| rand | Draw samples from a uniform distribution |
| randint | Draw random integers from a given low-to-high range |
| randn | Draw samples from a normal distribution with mean 0 and standard deviation 1 |
| binomial | Draw samples from a binomial distribution |
| normal | Draw samples from a normal (Gaussian) distribution |
| beta | Draw samples from a beta distribution |
| chisquare | Draw samples from a chi-square distribution |
| uniform | Draw samples from a uniform [0, 1) distribution |
| gamma | Draw samples from a gamma distribution |

Chapter 5: Getting Started with pandas

  • pandas is built on top of numpy.
  • Import conventions:
from pandas import Series, DataFrame
import pandas as pd
  • The two main data structures in pandas:
    1. Series
    2. DataFrame

Series: One-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

  • The simplest Series is formed from only an array of data:
obj = Series([4, 7, -5, 3])
  • To access the Series' values and index:
obj.values

obj.index
  • To specify the index (it defaults to the integers 0 through N - 1):
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
  • In contrast to Numpy arrays, you can use the values in the index when selecting or assigning to single values or a set of values:
obj2['a']
obj2[['c', 'a', 'd']]
obj2['a'] = 10
obj2[['c', 'a', 'd']] = 10
  • Another way to think about a Series is a fixed-length, ordered dict, as it is a mapping of index to data values. It can be substituted into many functions that expect a dict
    • You can go from a Python dict to a Series by:
dictdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(dictdata)
  • To update a Series' index in place:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

DataFrame: A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

  • The DataFrame has both a row and a column index; it can be thought of as a dictionary of Series.
  • DataFrame sorts columns alphabetically by default.
  • Creating a DataFrame from a dict of equal-length lists or NumPy arrays:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
  • You can specify the sequence of the columns by:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data, columns=['year', 'state', 'pop'])
  • If you pass a column that isn't contained in the data, it will appear as NaN
  • To retrieve a Series from a DataFrame:
frame['state']
 # OR
frame.state
  • To assign to a Series in a DataFrame:
frame['state'] = 'Colorado'
  • Assigning a column that doesn't exist will create a new column:
frame['capital'] = ['Columbus', 'Columbus', 'Columbus', 'Reno', 'Reno']
  • To remove a column:
del frame['capital']
  • To access the column names:
frame.columns
  • To transpose a DataFrame:
frame.T
 # OR
frame.transpose()
  • To set a DataFrame's index or columns to a specific name:
frame.index.name = 'year'
frame.columns.name = 'state'

Missing Data: Pandas marks missing data or NA as NaN (Not a Number).

  • The isnull and notnull functions should be used to detect missing data:
dictdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(dictdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(dictdata, index=states)
obj4

Out[45]:
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

 # California does not have an accompanying value and is thus NaN.

pd.isnull(obj4)

Out[46]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4)

Out[47]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Reindexing: Create a new object with the data conformed to a new index:

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
  • Reindexing introduces missing values if any index values were not already present.
  • To fill missing values when reindexing:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
  • ffill forward fills (carries values forward); bfill back fills (carries values backward)

Dropping Entries from an Axis:

  • Example Data:

|          | one | two | three | four |
| -------- | --- | --- | ----- | ---- |
| Ohio     | 0   | 1   | 2     | 3    |
| Colorado | 4   | 5   | 6     | 7    |
| Utah     | 8   | 9   | 10    | 11   |
| New York | 12  | 13  | 14    | 15   |

  • Columns = axis=1, Rows = axis=0
    • drop defaults to axis=0
  • To drop a column:
data.drop('two', axis=1)
  • To drop multiple columns:
data.drop(['one', 'two'], axis = 1)
  • To drop a row:
data.drop('Colorado')
  • To drop multiple rows:
data.drop(['Colorado', 'Utah'])

Indexing, Selecting, and Filtering:

  • To select a row:
data.loc['row_two']
    • To select multiple rows:
data.loc[['row_two', 'row_three']]
  • To select a column:
data.loc[:, 'column_two']
  • To select multiple columns:
data.loc[:, ['column_two', 'column_three']]

Sorting and Ranking:

  • To sort a DataFrame by the index:
df.sort_index() # In ascending order
df.sort_index(ascending=False) # In descending order
  • to sort a DataFrame by the column names:
df.sort_index(axis=1) # In ascending order
df.sort_index(axis=1, ascending=False) # In descending order
  • To sort a DataFrame by a column:
df.sort_values('column_name') # In ascending order
df.sort_values('column_name', ascending=False) # In descending order
  • To sort a DataFrame by multiple columns:
df.sort_values(['column1', 'column2'])
  • NaN values are automatically sorted to the end of the column or row.
  • .rank() is similar to .sort_values() except it inserts the rank of the underlying value instead of the value itself.
    • By default .rank() breaks ties by assigning each group the mean rank (i.e. if two values are tied for 6th, they will each be ranked as 6.5); see the example after this list.
  • To rank a DataFrame by its column values:
df.rank() # In ascending order
df.rank(ascending=False) # In descending order
  • To rank a DataFrame by its row values:
df.rank(axis=1) # In ascending order
df.rank(axis=1, ascending=False) # In descending order
  • See page 132 for tie-breaking methods with .rank()
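  • For example, tied values share the mean of the ranks they would otherwise occupy:
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank() # The two 7s would occupy ranks 6 and 7, so each receives 6.5; the two 4s share 4.5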

Axis Indexes w/ Duplicate Values

  • To check if index values are unique:
df.index.is_unique

Summarizing and Computing Descriptive Statistics:

  • Most common pandas mathematical and statistical methods fall into two categories:
    1. reductions
    2. summary statistics
  • To sum each column:
df.sum()
  • To sum each row:
df.sum(axis=1)
  • To get the cumulative sum of a column:
df.cumsum()
  • To display descriptive statistics of a DataFrame:
df.describe()
  • Table of Summary Statistics available in pandas:
| Method | Description |
| --- | --- |
| count | Number of non-NA values |
| describe | Compute set of summary statistics for each DataFrame column |
| min, max | Compute minimum and maximum values |
| argmin, argmax | Compute index locations (integers) at which minimum or maximum value is obtained |
| idxmin, idxmax | Compute index values at which minimum or maximum value is obtained |
| quantile | Compute sample quantile ranging from 0 to 1 |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median (50% quantile) of values |
| mad | Mean absolute deviation from mean value |
| var | Sample variance of values |
| std | Sample standard deviation of values |
| skew | Sample skewness of values |
| kurt | Sample kurtosis of values |
| cumsum | Cumulative sum of values |
| cummin, cummax | Cumulative minimum or maximum of values |
| cumprod | Cumulative product of values |
| diff | Compute arithmetic difference |
| pct_change | Compute percent changes |

Correlation and Covariance

  • corr computes the correlation of the overlapping aligned-by-index values in two Series:
returns.MSFT.corr(returns.IBM)
  • cov computes the covariance of the overlapping aligned-by-index values in two Series:
returns.MSFT.cov(returns.IBM)
  • When used with DataFrames, corr and cov return a matrix of correlations or covariances between all columns.
df.corr()
df.cov()
  • To calculate the correlation between one column and the rest of the DataFrame:
df.corrwith(df['column'])

Unique Values, Value Counts, and Membership

  • To get the unique values of a column:
df.column.unique()
  • To get the count of each unique value in a column:
df.column.value_counts()
  • To test membership of an item in a column:
df.column.isin(['item'])

Handling Missing Data:

  • Pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays.
  • The built-in Python None value is also treated as NA in object arrays.
  • To see missing data in a DataFrame:
df.isnull()
  • To count the number of missing values:
df.isnull().sum()
  • To see non-missing data:
df.notnull()
  • To count the number of non-missing values:
df.notnull().sum()

Filtering Out Missing Data:

  • To drop missing data in a Series:
series.dropna()
  • With DataFrames, dropna() defaults to dropping any row containing a missing value.
  • To drop all rows with any missing data:
df.dropna()
  • To drop all rows with missing data in all columns:
df.dropna(how='all')
  • To drop all columns with any missing data:
df.dropna(axis=1)
  • To drop all columns with missing data in all rows:
df.dropna(how='all', axis=1)

Filling in Missing Data:

  • To replace all missing values with a specific value:
df.fillna(value)
  • To replace all missing values with a separate value for each column:
df.fillna({'column1': 0, 'column2': 1})
  • Add the inplace=True argument to modify the underlying DataFrame.
  • To forward fill missing values down each column:
df.fillna(method='ffill')
  • To replace all missing values with the mean of each column:
df.fillna(df.mean())

Hierarchical Indexing:

  • Hierarchical indexing enables you to have multiple index levels on an axis, which lets you work with higher dimensional data in a lower dimensional form.
  • Hierarchical indexes also allow for partial indexing.
  • With DataFrames, either axis can have a hierarchical index.
  • Examples on pages 144-145
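  • A minimal sketch of a Series with two index levels (values are random):
import numpy as np
from pandas import Series

data = Series(np.random.randn(6),
              index=[['a', 'a', 'b', 'b', 'c', 'c'],
                     [1, 2, 1, 2, 1, 2]])
data['b']      # partial indexing: all values under outer label 'b'
data.unstack() # pivot the inner index level into DataFrame columns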

Pandas Frequent Functions:

| Pandas Function | What It Does |
| --- | --- |
| import pandas as pd | Import pandas as pd |
| df = pd.read_csv('path-to-file.csv') | Load data into pandas |
| df.head(5) | Print the first n lines; in this case 5 lines |
| df.index | Print the index of your DataFrame |
| df.columns | Print the columns of your DataFrame |
| df.set_index('col') | Make the index (aka row names) the values of col |
| df.reset_index() | Reset the index |
| df.columns = ['new name1', 'new name2'] | Rename columns |
| df = df.rename(columns={'old name 1': 'new name 1'}) | Rename a specific column |
| df['col'] | Select one column |
| df[['col1', 'col2']] | Select more than one column |
| df['col'] = 1 | Set the entire column equal to 1 |
| df['empty col'] = np.nan | Make an empty column |
| df['col3'] = df['col1'] + df['col2'] | Create a new column, equal to the sum of other columns |
| df.loc['row 0'] or df.iloc[0] | Select row 0 |
| df.loc['row 5':'row 100'] or df.iloc[5:100] | Select rows 5 through 100 |
| df.loc[[2, 4, 6, 8]] | Select rows 2, 4, 6, 8 |
| df.loc[0]['col'] | Select row and column, retrieve cell value |
| del df['col'] | Delete or drop or remove a column |
| df.drop('col', axis=1) | Delete or drop or remove a column |
| df.drop('row') | Delete or drop or remove a row |
| df = df.sort_values(by='col') | Sort DataFrame on this column |
| df.sort_values(by=['col', 'col2']) | Sort data by col, then col2 |
| solo_col = df['col'] | Make a variable that is equal to the column |
| just_values = df['col'].values | Return an array with just the values, no index |
| df[(df['col'] == 'condition')] | Return df where col is equal to condition |
| df['col'][(df['col1'] == 'this') & (df['col2'...])] | Select from col where col1 == 'this', and c... |
| df.groupby('col').sum() | Group by a column and sum all others |
| df.plot(kind='bar') | Make a bar plot (kind='bar' or 'line') |
| alist = df['cols'].values | Extract just the values of a column into an array |
| a_matrix = df.as_matrix() | Extract just the values of a whole DataFrame |
| df.sort_index(axis=1) | Sort by column names |
| df.sort_values('col') | Sort by the 'col' column in ascending order |
| df.sort_values('col', ascending=False) | Sort by the 'col' column in descending order |
| df.sort_values(['col-1', 'col-b']) | Sort by more than one column |
| df.sort_index() | Sort the index |
| df.rank() | Keep your df in order, but rank the values |
| df = pd.DataFrame({'col-a': a_list, 'col-b': ...}) | How to put or insert lists into a DataFrame |
| df.dtypes | Print the type of value in each column |
| df['float-col'].astype(np.int) | Change a column's data type |
| joined = dfone.join(dftwo) | Join two DataFrames when the keys are in the index |
| merged = pd.merge(dfone, dftwo, on='key col') | Merge two DataFrames on a shared column |
| pd.concat([dfone, dftwo, series3]) | Append data to the end of the DataFrame |

Navigating the Pandas DataFrame:

  • df['colname']: select a single column or sequence of columns from the DataFrame
  • df.loc[val] or df.iloc[val]: select a single row or subset of rows from the DataFrame
  • df.loc[:, val] or df.iloc[:, val]: select a single column or subset of columns
  • df.loc[val1, val2] or df.iloc[val1, val2]: select both rows and columns
  • df.reset_index(): conform one or more axes to new indexes
  • df.xs(): select a single row or column as a Series by label

Counting with Pandas versus Counting without Pandas:

  • Counting without Pandas:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

OR

from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

OR

from collections import Counter
Counter(sequence)
  • Counting with Pandas:
import pandas as pd
data = pd.read_csv('2017_Season.csv')
counts = data['Tm'].value_counts()   

Chapter 6: Data Loading, Storage, and File Formats

  • Input and output typically fall into three main categories:
  1. Reading text files and other more efficient on-disk formats
  2. Loading data from databases
  3. Interacting with network sources like web APIs

Reading and Writing Data in Text Format:

  • The two most used functions for reading tabular data as a DataFrame object are:
  1. pd.read_csv()
  2. pd.read_table()
  • Functions for reading in tabular data:

| Function | Description |
| --- | --- |
| pd.read_csv() | Load delimited data from a file, URL, or file-like object. Uses comma as the default delimiter |
| pd.read_table() | Load delimited data from a file, URL, or file-like object. Uses tab ('\t') as the default delimiter |
| pd.read_fwf() | Read data in fixed-width column format (that is, no delimiter) |
| pd.read_clipboard() | Version of read_table that reads data from the clipboard. Useful for converting tables from web pages |
  • To import from a .csv file with header names:
df = pd.read_csv('filename.csv')
 #OR
df = pd.read_table('filename.csv', sep=',')
  • To import a .csv without header names:
df = pd.read_csv('filename.csv', header=None) # Pandas will assign default column names
 #OR
df = pd.read_csv('filename.csv', names=['column1', 'column2', 'column3'])
  • To specify a column as the index:
df = pd.read_csv('filename.csv', names=['column1', 'column2', 'column3'], index_col='column1')
  • To read a fixed-delimiter file (like .txt):
df = pd.read_table('filename.txt', sep='value')
  • To skip certain rows during import:
df = pd.read_csv('filename.csv', skiprows=[0, 2, 3])
  • Create a dataframe from your own clipboard:
df = pd.read_clipboard()
  • Import arguments:

| Argument | Description |
| --- | --- |
| path | String indicating filesystem location, URL, or file-like object |
| sep or delimiter | Character sequence or regular expression to use to split fields in each row |
| header | Row number to use as column names. Defaults to 0 (first row), but should be None if there is no header row |
| index_col | Column numbers or names to use as the row index in the result. Can be a single name/number or a list of them for a hierarchical index |
| names | List of column names for result; combine with header=None |
| skiprows | Number of rows at beginning of file to ignore, or list of row numbers (starting from 0) to skip |
| na_values | Sequence of values to replace with NA |
| comment | Character or characters to split comments off the end of lines |
| parse_dates | Attempt to parse data to datetime; False by default |
| keep_date_col | If joining columns to parse date, keep the joined columns. Default False |
| converters | Dict containing column number or name mapping to function. For example, {'foo': f} would apply the function f to all values in the 'foo' column |
| dayfirst | When parsing potentially ambiguous dates, treat as international format (e.g. 7/6/2012 -> June 7, 2012). Default False |
| date_parser | Function to use to parse dates |
| nrows | Number of rows to read from beginning of file |
| iterator | Return a TextParser object for reading file piecemeal |
| chunksize | For iteration, size of file chunks |
| skip_footer | Number of lines to ignore at end of file |
| verbose | Print various parser output information, like the number of missing values placed in non-numeric columns |
| encoding | Text encoding for unicode. For example, 'utf-8' for UTF-8 encoded text |
| squeeze | If the parsed data only contains one column, return a Series |
| thousands | Separator for thousands, e.g. ',' or '.' |

Reading Text Files in Pieces:

  • If you want to only read out a small number of rows (avoiding reading the entire file), specify that with nrows:
pd.read_csv('file_name.csv', nrows=100)
  • To split the file into iterable chunks:
chunks = pd.read_csv('ex6.csv', chunksize=1000)
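  • The chunks can then be iterated over to aggregate piece by piece (a sketch, assuming a hypothetical column named 'key'):
import pandas as pd

chunks = pd.read_csv('ex6.csv', chunksize=1000)

tot = pd.Series([], dtype=float)
for piece in chunks:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False) # value counts for 'key' across the whole file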

Writing Data Out to Text Format:

  • To write to a .csv file:
df.to_csv('file_name.csv')
  • To write to a .csv with other separators:
df.to_csv('file_name.csv', sep='|')
  • To write a subset of the DataFrame to a .csv file:
df.to_csv('file_name.csv', columns=['column1', 'column2', 'column3'])

Manually Working with Delimited Formats:

  • Pages 161-163
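  • A minimal sketch with Python's built-in csv module (the file name is hypothetical):
import csv

with open('ex7.csv') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:
        print(line) # each line arrives as a list of field strings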

JSON Data:

  • JSON is short for JavaScript Object Notation and has become one of the standard formats for sending data by HTTP request between web browsers and other applications.
    • It is a much more flexible data format than a tabular text form like .csv.
  • To convert a JSON string to Python:
import json
result = json.loads(obj)
  • To convert a Python object to JSON:
import json
asjson = json.dumps(result)
  • To load a file of line-delimited JSON records:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path, 'rb')]

XML and HTML: Web Scraping:

  • Many websites make data available in HTML tables for viewing in a browser, but not downloadable as an easily machine-readable format like JSON.
  • XML (extensible markup language) is another common structured data format supporting hierarchical, nested data with metadata.
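  • pandas can parse HTML <table> elements on a page directly into DataFrames via pd.read_html (a sketch; the URL is a placeholder and an HTML parser such as lxml must be installed):
import pandas as pd

tables = pd.read_html('https://example.com/page_with_tables.html')
df = tables[0] # read_html returns a list of DataFrames, one per table found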

Reading Microsoft Excel Files:

xls_file = pd.ExcelFile('data.xls')
table = xls_file.parse('Sheet1')


Chapter 7: Data Wrangling: Clean, Transform, Merge, Reshape

Combining and Merging Data Sets:

  • Data contained in pandas objects can be combined together in three separate ways:
    1. pandas.merge: connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
    2. pandas.concat: glues or stacks together objects along an axis.
    3. combine_first: instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

Database-style DataFrame Merges:

  • To merge two databases on a shared column:
pd.merge(df1, df2, on='column')
  • By default merge does an inner join; the keys in the result are the intersection of the keys in both tables.
    • Other possible options are 'left', 'right', 'outer'.
      • The outer join takes the union of the keys, combining the effect of applying both left and right joins.
  • To change the join type:
pd.merge(df1, df2, how='outer')
  • To merge with multiple keys, pass a list of column names:
pd.merge(df1, df2, on=['key1', 'key2'], how='outer')
  • If the two DataFrames you are attempting to merge have overlapping column names, pandas will automatically add on _x and _y to the end of the two column names.
  • To manually specify new names to overlapping columns:
pd.merge(left, right, on='key', suffixes=('_left', '_right'))
  • Merge Argument References:

| Argument | Description |
| --- | --- |
| left | DataFrame to be merged on the left side |
| right | DataFrame to be merged on the right side |
| how | One of 'inner', 'outer', 'left', or 'right'; 'inner' by default |
| on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys |
| left_on | Columns in left DataFrame to use as join keys |
| right_on | Columns in right DataFrame to use as join keys |
| left_index | Use row index in left as its join key |
| right_index | Use row index in right as its join key |
| sort | Sort merged data lexicographically by join keys; True by default |
| suffixes | Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y'). For example, if 'data' appears in both DataFrame objects, it would appear as 'data_x' and 'data_y' in the result |
| copy | If False, avoid copying data into resulting data structure in some exceptional cases. By default always copies |
  • To use indexes as the merge key:
pd.merge(left1, right1, left_on='key', right_index=True)
  • To join two or more DataFrames with similar indexes:
df1.join([df2, df3], how='outer')
  • To do a simple left join:
df1.join(df2, on='key')

Concatenating Along an Axis:

  • To concatenate two numpy arrays together by column (results in a fatter array):
np.concatenate([arr1, arr2], axis=1)
  • To concatenate two numpy arrays together by row (results in a longer array):
np.concatenate([arr1, arr2], axis=0) # axis defaults to zero
  • To concatenate multiple DataFrames together by columns (results in a fatter DataFrame):
pd.concat([df1, df2, df3], axis=1)
  • To concatenate multiple DataFrames together by row (results in a longer DataFrame):
pd.concat([df1, df2, df3], axis=0) # axis defaults to zero
  • To concatenate multiple DataFrames together by columns using an inner join (only the columns that are present in each DataFrame):
pd.concat([df1, df2], axis=1, join='inner')
  • Concat Function Arguments:

| Argument | Description |
| --- | --- |
| objs | List or dict of pandas objects to be concatenated. This is the only required argument |
| axis | Axis to concatenate along; defaults to 0 (rows) |
| join | One of 'inner' or 'outer'; defaults to 'outer'. Whether to intersect (inner) or union (outer) indexes along the other axes |
| join_axes | Specific indexes to use for the other n-1 axes instead of performing union/intersection logic |
| keys | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis. Can be a list of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays passed in levels) |
| levels | Specific indexes to use as hierarchical index level or levels if keys passed |
| names | Names for created hierarchical levels if keys or levels passed |
| verify_integrity | Check new axis in concatenated object for duplicates and raise an exception if so. By default (False) allows duplicates |
| ignore_index | Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index |

Reshaping with Hierarchical Indexing:

  • Hierarchical indexing provides a consistent way to rearrange data in a DataFrame in two ways:
    1. stack: this "rotates" or pivots from the columns in the data to the rows.
    2. unstack: this pivots from the rows into the columns.
  • To stack the data, pivoting the columns into the rows:
df.stack()
  • To unstack a DataFrame, pivoting the rows into the columns:
df.unstack()
  • Stacking filters out missing data by default.

Pivoting "long" to "wide" Format:

  • Great example on Pages 190-191
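  • A minimal sketch of pivoting with DataFrame.pivot (column names and values are illustrative):
from pandas import DataFrame

ldata = DataFrame({'date': ['1959-03-31', '1959-03-31', '1959-06-30', '1959-06-30'],
                   'item': ['realgdp', 'infl', 'realgdp', 'infl'],
                   'value': [2710.349, 0.0, 2778.801, 2.34]})

wide = ldata.pivot(index='date', columns='item', values='value') # one row per date, one column per item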

Removing Duplicates:

  • To see which rows in your DataFrame are duplicates:
df.duplicated() # Returns boolean True/False
  • To remove duplicate rows in your DataFrame:
df.drop_duplicates()
  • To drop rows that are duplicated in a specific column:
df.drop_duplicates(['column'])
  • To drop rows that are duplicated in specific columns:
df.drop_duplicates(['column1', 'column2'])
  • By default drop_duplicates keeps the first observed duplicate row. To instead keep the last duplicate row:
df.drop_duplicates(keep='last')

Replacing Values:

  • To replace a specific value within a column with another specific value:
df['column'].replace(old_value, new_value)
  • To replace multiple values within a column with a specific value:
df['column'].replace([old_value1, old_value2], new_value)
  • To replace multiple values within a column with multiple specific values:
df['column'].replace({old_value1: new_value1, old_value2: new_value2})

Renaming Axis Indexes:

  • To transform an axis index via a function:
df.index = df.index.map(f)
 #OR
df.rename(index=str.upper, inplace=True)
  • To rename certain axis values:
data.rename(index={'old_value1': 'new_value1', 'old_value2': 'new_value2'}, inplace=True)

Discretization and Binning:

  • To bin column values, in this case ages 18-25, 26-35, 36-60, 61-100:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
  • To bin column values into named bins:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'Young_Adult', 'Middle_Aged', 'Senior']
cats = pd.cut(ages, bins, labels=group_names)
  • To bin column values into n equal-length bins (based on the minimum and maximum values in the data):
pd.cut(column, number_of_bins)
  • To bin column values into n equally-distributed bins (based on quantiles):
pd.qcut(column, 4) # Cut into quartiles

Detecting and Filtering Outliers:

  • To filter out rows containing outliers; for example, to keep only the rows where every value is within +/-3 standard deviations of its column mean:
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Permutation and Random Sampling:

  • Randomly reordering (permuting) a Series or the rows in a DataFrame can be done by using np.random.permutation:
df.take(np.random.permutation(len(df)))
  • To select a random subset of a DataFrame without replacement, slice off the first k-elements of the array returned by permutation, where k is the desired subset:
df.take(np.random.permutation(len(df)))[:k]

Computing Indicator/Dummy Variables:

  • To convert a categorical variable into a "dummy" or "indicator" matrix:
pd.get_dummies(df['column'])
  • To add a prefix to the dummy columns:
pd.get_dummies(df['column'], prefix='dummy_')
  • To add the dummies to the DataFrame:
df[['column']].join(dummies)

String Object Methods:

  • split is often combined with strip to trim whitespace and break strings into separate objects:
val = 'a, b,     guido'
pieces = [x.strip() for x in val.split(',')]
  • To join separate strings with a specific delimiter (or space):
a = 'hello'
b = 'world'
c = '!'
answer = ' '.join([a, b, c])
  • To check if a substring is in a string:
a = 'hello'
sentence = 'hello world!'
a in sentence # Returns True
  • To find the index of a substring within a string:
a = 'hello'
sentence = 'hello world!'
sentence.index(a) # returns 0. If the subset is not in the sentence .index() will return a ValueError.
 #OR
sentence.find(a) # returns 0. If the subset is not in the sentence .find() will return -1.
  • To count the number of occurrences of a substring within a string:
a = 'hello'
sentence = 'hello world hello'
sentence.count(a) # returns 2
  • To substitute occurrences of one pattern for another:
a = 'hello'
sentence = 'hello world!'
sentence.replace('hello', 'goodbye') # returns 'goodbye world!'
  • You can use .replace() to also delete patterns by passing in an empty string:
a = 'cruel'
sentence = 'hello cruel world!'
sentence.replace(a, '') # returns 'hello world!'
  • Python built-in string methods:

| Argument | Description |
| --- | --- |
| count | Return the number of non-overlapping occurrences of substring in the string |
| endswith, startswith | Return True if string ends with suffix (or starts with prefix) |
| join | Use string as delimiter for concatenating a sequence of other strings |
| index | Return position of first character in substring if found in the string. Raises ValueError if not found |
| find | Return position of first character of first occurrence of substring. Like index, but returns -1 if not found |
| rfind | Return position of first character of last occurrence of substring in the string. Returns -1 if not found |
| replace | Replace occurrences of string with another string |
| strip, rstrip, lstrip | Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element |
| split | Break string into substrings using passed delimiter |
| lower, upper | Convert alphabet characters to lowercase or uppercase, respectively |
| ljust, rjust | Left justify or right justify, respectively. Pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width |

Regular Expressions:

  • Regular expressions provide a flexible way to search or match string patterns in text.
  • A regex is a single expression formed according to the regular expression language.
  • Python's built-in re module is responsible for applying regular expressions to strings.
    • The re module functions fall into three categories:
      1. Pattern matching
      2. Substitution
      3. Splitting
  • To split a string with a variable number of whitespace characters (tab, spaces, and newlines):
import re
text = "foo     bar\t baz    \tquax"
re.split(r'\s+', text) # r'\s+' is the regex describing one or more whitespace characters.
  • To get a list of all patterns matching the regex:
import re
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

text = """Dave dave@google.com
     Steve steve@gmail.com
     Rob rob@gmail.com
     Ryan ryan@yahoo.com"""

regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
  • To segment a pattern into separate parts, for example to identify an email address and then subset its three components (username, domain name, domain suffix):
import re
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
  • Regular Expression Methods:

| Argument | Description |
| --- | --- |
| findall, finditer | Return all non-overlapping matching patterns in a string. findall returns a list of all patterns while finditer returns them one by one from an iterator |
| match | Match pattern at start of string and optionally segment pattern components into groups. If the pattern matches, returns a match object, otherwise None |
| search | Scan string for match to pattern, returning a match object if so. Unlike match, the match can be anywhere in the string as opposed to only at the beginning |
| split | Break string into pieces at each occurrence of pattern |
| sub, subn | Replace all (sub) or first n (subn) occurrences of pattern in string with replacement expression. Use symbols \1, \2, ... to refer to match group elements in the replacement string |

Chapter 8: Plotting and Visualization

  • When using matplotlib in IPython, be sure to start IPython in the following manner:
ipython --pylab
  • To close a plot window in IPython:
close()
  • The import convention for matplotlib is:
import matplotlib.pyplot as plt
  • To use fivethirtyeight's plotting style:
plt.style.use('fivethirtyeight')
  • To create a blank figure:
fig = plt.figure()
  • To create a single blank graph:
fig, axes = plt.subplots()
  • To create a blank figure with four subplots:
fig, axes = plt.subplots(2, 2)
  • plt.subplots options:

| Argument | Description |
| --- | --- |
| nrows | Number of rows of subplots |
| ncols | Number of columns of subplots |
| sharex | All subplots should use the same X-axis ticks (adjusting the xlim will affect all subplots) |
| sharey | All subplots should use the same Y-axis ticks (adjusting the ylim will affect all subplots) |
| subplot_kw | Dict of keywords passed to add_subplot call used to create each subplot |
| **fig_kw | Additional keywords to subplots are used when creating the figure, such as plt.subplots(2, 2, figsize=(8, 6)) |

Adjusting the spacing around subplots:

  • By default matplotlib leaves a certain amount of padding around the outside of the subplots and spacing between subplots.
  • The spacing can be most easily changed using the subplots_adjust method:
subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)
  • wspace and hspace control the percent of the figure width and figure height, respectively, to use as spacing between subplots.
  • matplotlib doesn't check whether axis labels overlap, so you will need to fix the labels yourself on many occasions.

Colors, Markers, and Line Styles:

  • matplotlib's main plot function accepts arrays of X and Y coordinates and optionally a string abbreviation indicating color and line style.
    • For example, to plot x versus y with green dashes:
x = np.arange(10)
y = np.arange(10)
plt.plot(x, y, 'g--')
  • The same plot could be expressed more explicitly:
x = np.arange(10)
y = np.arange(10)
plt.plot(x, y, linestyle='--', color='g')
  • To add markers to highlight the actual data points:
plot(randn(30).cumsum(), color='k', linestyle='dashed', marker='o')

Ticks, Labels, and Legends:

  • To manually set the x-axis ticks:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
  • To manually set the x-axis labels:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
labels = ax.set_xticklabels(['one', 'two', 'three', 'four', 'five'], rotation=30, fontsize='small')
  • To give a name to the X axis:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
ax.set_xlabel('Stages')
  • To add a title to the plot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
ax.set_title('My first matplotlib plot')
  • To manually set the y-axis ticks:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
ticks = ax.set_yticks([0, 10, 20, 30])
  • To manually set the y-axis labels:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum())
labels = ax.set_yticklabels(['one', 'two', 'three'], fontsize='small')

Adding Legends:

  • To create a legend:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum(), 'k', label='one')
ax.plot(randn(1000).cumsum(), 'k--', label='two')
ax.plot(randn(1000).cumsum(), 'k.', label='three')
ax.legend(loc='best')
  • The loc argument tells matplotlib where to place the legend. If you aren't picky, 'best' is a good option, as it will choose a location that is most out of the way.
  • To exclude one or more of the elements from the legend, pass _nolegend_ to the label argument as we've done with the second element below:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(randn(1000).cumsum(), 'k', label='one')
ax.plot(randn(1000).cumsum(), 'k--', label='_nolegend_')
ax.plot(randn(1000).cumsum(), 'k.', label='three')
ax.legend(loc='best')

Annotations and Drawing on a Subplot:

  • To add an annotation to a plot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
x = np.arange(10)
y = np.arange(10)
plt.plot(x, y, 'g--')
ax.text(5, 5, 'Hello world!', family='monospace', fontsize=20)
  • The above will place 'Hello world!' at the coordinates (5, 5).
  • Detailed example of annotations and arrows:
from datetime import datetime

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

data = pd.read_csv('spx.csv', index_col=0, parse_dates=True)
spx = data['SPX']

spx.plot(ax=ax, style='k-')

crisis_data = [
    (datetime(2007, 10, 11), 'Peak of bull market'),
    (datetime(2008, 3, 12), 'Bear Stearns Fails'),
    (datetime(2008, 9, 15), 'Lehman Bankruptcy')
]

for date, label in crisis_data:
    ax.annotate(label, xy=(date, spx.asof(date) + 50),
    xytext=(date, spx.asof(date) + 200),
    arrowprops=dict(facecolor='black'),
    horizontalalignment='left', verticalalignment='top')

# Zoom in on 2007-2010
ax.set_xlim(['1/1/2007', '1/1/2011'])
ax.set_ylim([600, 1800])

ax.set_title('Important dates in 2008-2009 financial crisis')
  • matplotlib's common shapes are called patches. General patches such as Rectange and Circle are found within matplotlib.pyplot, but the full set is located in matplotlib.patches. To add a patch to a plot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
pgon = plt.Polygon([[0.15, 0.15], [0.35, 0.4], [0.2,0.6]], color='g', alpha=0.5)

ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)

Saving Plots to File:

  • The active figure can be saved to file using plt.savefig:
plt.savefig('plot1.png')
  • The file type is inferred from the extension. So to save as .pdf:
plt.savefig('plot1.pdf')
  • To increase the dpi (dots-per-inch) resolution and trim the whitespace around the figure with bbox_inches='tight':
plt.savefig('plot1.png', dpi=400, bbox_inches='tight')
  • Figure.savefig Options
Argument Description
fname String containing a filepath or a Python file-like object. The figure format is inferred from the file extension, e.g. .pdf for PDF or .png for PNG.
dpi The figure resolution in dots per inch; defaults to 100 out of the box but can be configured.
facecolor, edgecolor The color of the figure background outside the subplots; 'w' (white) by default.
format The explicit file format to use ('png', 'svg', 'ps', 'eps'...)
bbox_inches The portion of the figure to save. If 'tight' is passed, will attempt to trim the empty space around the figure.
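  • savefig can also write to any file-like object, not just a path on disk; a minimal sketch using an in-memory buffer:
from io import BytesIO
buffer = BytesIO()
plt.savefig(buffer, format='png') # format must be given explicitly, since there is no file extension
plot_data = buffer.getvalue()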

matplotlib Configuration:

  • There are two ways to change matplotlib's configuration:
  1. Programmatically:
plt.rc('component_to_change', parameter=value)
  • An example:
plt.rc('figure', figsize=(10, 10))
  2. Via the matplotlibrc configuration file.
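  • Returning to the programmatic route: the first argument to rc is the component to customize ('figure', 'axes', 'font', etc.), followed by keyword arguments for the new parameters. A convenient pattern is to collect them in a dict:
font_options = {'family': 'monospace',
                'weight': 'bold',
                'size': 'small'}
plt.rc('font', **font_options)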

Plotting Functions in pandas:

  • While matplotlib requires a lot of information to plot, pandas infers much of that information from the DataFrame. Thus, plotting from pandas only requires concise statements.
  • By default, the pandas plot method creates line plots.
  • To create a line plot of one column:
df['column'].plot()
  • To create a line plot of all columns in the DataFrame:
df.plot()
  • By default, the index is plotted as the x-axis. To disable this:
df.plot(use_index=False)
  • Series.plot Method Arguments:
Argument Description
label Label for plot legend
ax matplotlib subplot object to plot on. If nothing is passed, uses active matplotlib subplot.
style Style string, like 'k--', to be passed to matplotlib
alpha The plot fill opacity (from 0 to 1)
kind Can be line, bar, barh, kde
logy Use logarithmic scaling on the Y axis.
use_index Use the object index for tick labels
rot Rotation of tick labels (0 through 360)
xticks Values to use for X axis ticks
yticks Values to use for Y axis ticks
xlim X axis limits (e.g. [0, 10])
ylim Y axis limits
grid Display axis grid (on by default)
  • DataFrame.plot Method Arguments:
Argument Description
subplots Plot each DataFrame column in a separate subplot
sharex If subplots=True, share the same X axis, linking ticks and limits
sharey If subplots=True, share the same Y axis
figsize Size of figure to create as tuple
title Plot title as string
legend Add a subplot legend (True by default)
sort_columns Plot columns in alphabetical order; by default uses existing column order.
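  • A minimal sketch of a DataFrame line plot (the column names and index values are illustrative): each column becomes a line, and the legend is created automatically:
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
                  columns=['A', 'B', 'C', 'D'],
                  index=np.arange(0, 100, 10))
df.plot()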

Bar Plots:

  • To create a vertical-bar plot:
df.plot(kind='bar')
  • To create a horizontal-bar plot:
df.plot(kind='barh')
  • Bar plots group the values in each row together in a group of bars, side by side, for each value.
  • To create a stacked bar plot:
df.plot(kind='bar', stacked=True)
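  • A minimal sketch of a bar plot from a Series, where the index supplies the tick labels (the index letters are illustrative):
data = pd.Series(np.random.rand(6), index=list('abcdef'))
data.plot(kind='bar', color='k', alpha=0.7)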

Histograms and Density Plots:

  • To plot a histogram:
df.plot.hist()
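  • A related plot type is the kernel density estimate (KDE), available as plot.density or plot.kde (this requires scipy); a minimal sketch:
values = pd.Series(np.random.randn(200))
values.plot.density()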

Scatter Plots:

  • To create a scatter plot between two columns of a DataFrame:
plt.scatter(df['column1'], df['column2'])
  • To create a scatter plot between all columns within a DataFrame:
pd.scatter_matrix(df)

Chapter 9: Data Aggregation and Group Operations

Groupby Mechanics:

  • To compute the summary statistic of a column based on the value of a separate column:
df['column1'].groupby(df['column2']).mean()
  • To compute summary statistics of each column in the DataFrame based on the value of a separate column:
df.groupby('column').mean()
  • To compute summary statistics of each column in the DataFrame based on the value of multiple separate columns:
df.groupby(['column1', 'column2']).mean()
  • A useful groupby method is size, which returns a series containing group sizes of the resulting groups:
df.groupby(['column1']).size()
  • Missing values in a groupby are excluded from the result.
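  • A worked toy example (the keys and column names are illustrative):
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').mean()           # mean of data1 and data2 for each key1 value
df.groupby(['key1', 'key2']).size() # group sizes for each (key1, key2) pair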

Iterating Over Groups:

  • The groupby object supports iteration; the resulting pieces can be saved into a dictionary:
groupby_dict = dict(list(df.groupby('column')))
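  • Iterating directly yields 2-tuples of the group name and the chunk of data; a minimal sketch (Python 2 print syntax, as elsewhere in these notes):
for name, group in df.groupby('column'):
    print name   # the group key
    print group  # the DataFrame subset for that key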
  • By default, groupby groups on axis=0, but you can group along the columns instead; for example, to group the columns by their dtype:
df.groupby(df.dtypes, axis=1)

Selecting a Column or Subset of Columns:

  • Indexing a groupby object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation.
  • This means that:
df.groupby('column2')['column1']

is equal to:

df['column1'].groupby(df['column2'])
  • In terms of multiple column aggregation:
df.groupby(['column1', 'column2'])[['column3']].mean()

Data Aggregation:

  • Groupby Methods:
Function Name Description
count Number of non-NA values in the group
sum Sum of non-NA values
mean Mean of non-NA values
median Arithmetic median of non-NA values
std, var Unbiased (n-1 denominator) standard deviation and variance
min, max Minimum and maximum of non-NA values
prod Product of non-NA values
first, last First and last non-NA values

Column-wise and Multiple Function Application:

  • To apply multiple aggregate functions to one column:
df.groupby('column').agg(['min', 'max', 'mean'])
  • To apply multiple aggregate functions to multiple columns:
df.groupby(['column1', 'column2']).agg(['count', 'mean', 'max'])
  • To apply different functions to one or more columns:
df.groupby(['column1', 'column2']).agg({'column3': 'sum', 'column4': 'mean'})
  • To apply multiple different functions to one or more columns:
df.groupby(['column1', 'column2']).agg({'column3': ['min', 'max', 'mean'], 'column4': ['count', 'std']})
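  • If you pass a list of (name, function) tuples instead, the first element of each tuple is used as the resulting column name; a sketch using these notes' placeholder columns:
grouped = df.groupby(['column1', 'column2'])
grouped['column3'].agg([('minimum', 'min'), ('maximum', 'max')])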

Pivot Tables and Cross Tabulation:

  • A pivot table is a data summarization tool frequently found in spreadsheet programs like Excel.
    • It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
  • The following two commands are the same:
df.groupby(['column1', 'column2']).mean()
df.pivot_table(index=['column1', 'column2'])
  • pivot_table defaults to use mean.
  • To use different aggregation functions:
df.pivot_table(index=['column1', 'column2'], aggfunc=f)
  • If some combinations are empty or otherwise NA, pass the fill_value argument:
df.pivot_table(index=['column1', 'column2'], aggfunc=f, fill_value=0)
  • Pivot Table Options
Argument Description
values Column name or names to aggregate. By default aggregates all numeric columns.
index Column names or other group keys to group on the rows of the resulting pivot table.
columns Column names or other group keys to group on the columns of the resulting pivot table.
aggfunc Aggregation function or list of functions; 'mean' by default. Can be any function valid in a groupby context.
fill_value Replace missing values in result table.
margins Add row/column subtotals and grand total. False by default.

Cross-Tabulation: Crosstab:

  • A crosstab is a special case of a pivot table that computes group frequencies.
pd.crosstab(df['column1'], df['column2'], margins=True)

Chapter 10: Time Series

  • Four types of time series:
  1. Timestamps: specific instants in time.
  2. Fixed periods: such as the month January 2007 or the full year 2010.
  3. Intervals of time: indicated by a start and end timestamp. Periods can be thought of as special cases of intervals.
  4. Experiment or elapsed time: each timestamp is a measure of time relative to a particular start time. For example, the diameter of a cookie baking each second since being placed in the oven.
  • The simplest and most widely used kind of time series is one indexed by timestamp.

Date and Time Data Types and Tools:

  • The most important modules for dates and times are:
  1. datetime
    • Within the datetime module is the datetime type.
    • from datetime import datetime
  2. time
  3. calendar
  • To get the current date and time:
from datetime import datetime
now = datetime.now()
now.year # Returns the year
now.month # Returns the month
now.day # Returns the day
  • To get the difference between two dates and times:
datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
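  • The result is a datetime.timedelta object; its days and seconds attributes hold the two parts of the difference:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta.days    # 926
delta.seconds # 56700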
  • To add (or subtract) a timedelta to a datetime object:
from datetime import timedelta
start = datetime(2011, 1, 7) # January 7th, 2011
start + timedelta(12) # add 12 days
start - timedelta(12) # subtract 12 days
  • datetime module:
Type Description
date Store calendar date (year, month, day) using the Gregorian calendar.
time Store time of day as hours, minutes, seconds, and microseconds.
datetime Stores both date and time.
timedelta Represents the difference between two datetime values (as days, seconds, and microseconds)

Converting between string and datetime:

  • To convert a datetime object to a string:
stamp = datetime(2011, 1, 3)
str(stamp)
 #OR
stamp.strftime('%Y-%m-%d')
  • To convert a string to datetime:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')
  • To convert multiple strings to datetime:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
  • datetime.strptime is the best way to parse a date with a known format.
  • However, another option is to use parse:
from dateutil.parser import parse
parse('2011-01-03')
  • parse is capable of parsing almost any human-intelligible date representation:
parse('Jan 31, 1997 10:45 PM')
 #OR
parse('6/11/1991')
  • When dealing with international datetimes, where days appear before months:
parse('6/12/2011', dayfirst=True)
  • To parse an entire column, use pd.to_datetime:
pd.to_datetime(df['column'])
  • to_datetime changes all missing values to NaT
    • NaT (Not a Time) is the pandas NA value for timestamp data
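  • A minimal sketch of to_datetime, showing how missing values become NaT:
datestrs = ['7/6/2011', '8/6/2011']
pd.to_datetime(datestrs + [None]) # the last element comes back as NaT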
  • Datetime format specifications:
Type Description
%Y 4-digit year
%y 2-digit year
%m 2-digit month [01, 12]
%d 2-digit day [01, 31]
%H Hour (24-hour clock) [0, 23]
%I Hour (12-hour clock) [01, 12]
%M 2-digit minute [00-59]
%S Second [00, 61] (seconds 60, 61 account for leap seconds)
%w Weekday as integer [0(Sunday), 6]
%U Week number of the year [00, 53]. Sunday is considered the first day of the week, and days before the first Sunday of the year are "week 0"
%W Week number of the year [00, 53]. Monday is considered the first day of the week, and days before the first Monday of the year are "week 0"
%z UTC time zone offset as +HHMM or -HHMM, empty if time zone naive
%F Shortcut for %Y-%m-%d, for example 2012-04-18
%D Shortcut for %m/%d/%y, for example 04/18/12
  • Locale-specific date formatting:
Type Description
%a Abbreviated weekday name
%A Full weekday name
%b Abbreviated month name
%B Full month name
%c Full date and time, for example 'Tue 01 May 2012 04:20:57 PM'
%p Locale equivalent of AM or PM
%x Locale-appropriate formatted date; e.g. in US May 1, 2012 yields '05/01/2012'
%X Locale-appropriate time, e.g. '04:24:12 PM'

Time Series Basics:

  • TimeSeries is a subclass of Series and thus behaves in many of the same ways.
  • To create a TimeSeries with a list of datetimes as the index:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7), datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
  • To index a TimeSeries:
ts[2] # Returns the value at the third timestamp
 #OR
ts['01/10/2011'] # Returns the value at January 10th, 2011 (the fifth timestamp)
 #OR
ts['20110110'] # The same lookup with a differently formatted date string
  • For longer TimeSeries, you can index based on year:
long_ts['2001']
  • For longer TimeSeries, you can also index based on month:
long_ts['2001-05']
  • Slicing with dates works just like with a regular Series:
long_ts[datetime(2001, 1, 7):]

Time Series with Duplicate Indices:

  • To check if the index of a timeseries is unique:
ts.index.is_unique # a property, not a method
  • To see which items are duplicates:
grouped = ts.groupby(level=0)
grouped.count()
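  • To collapse the duplicates, aggregate the groups, for example by taking the mean of each timestamp's values:
grouped.mean()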

Date Ranges, Frequencies, and Shifting:

  • To create a fixed-frequency series (meaning the same time interval between each record, such as every day) from a non-fixed-frequency series:
ts.resample('D').asfreq()

Generating Date Ranges:

  • To create a range of dates:
pd.date_range('4/1/2012', '6/1/2012')
  • By default, date_range generates daily timestamps.
  • To create a range of dates with just a start or end date:
pd.date_range(start='4/1/2012', periods=20)

Frequencies and Data Offsets:

  • To create a date range based on a specified time frequency, in this case every four hours:
pd.date_range('1/1/2000', '1/3/2000', freq='4h')
 #OR
pd.date_range('1/1/2000', periods=10, freq='120min')
  • To create a date range including every third Friday of every month:
pd.date_range('1/1/2000', periods=10, freq='WOM-3FRI')
  • Entire freq options on Pages 296-297

Shifting (Leading and Lagging) Data:

  • "Shifting" refers to moving data backward and forward through time.
  • A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns:
ts/ts.shift(1) - 1
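  • Passing a frequency to shift moves the timestamps instead of the data; a minimal sketch:
ts.shift(2, freq='M')   # move each timestamp forward two month-end offsets
 #OR
ts.shift(1, freq='90T') # move each timestamp forward 90 minutes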

Time Zone Handling:

  • In Python, time zone information comes from the third-party pytz library.
  • To get a time zone object from pytz:
import pytz
tz = pytz.timezone('US/Mountain')
  • By default, time series in pandas are time zone naive, meaning they don't have a time zone attached to them.
  • To create a timeseries with a specified localized time zone:
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')
  • To localize the time zone of an existing time series:
ts_utc = ts.tz_localize('UTC')
  • Once a time series has been localized to a particular time zone, it can be converted to another time zone:
ts_mtn = ts_utc.tz_convert('US/Mountain')
  • If two time series with different time zones are combined, the result will be UTC.

Chapter 12: Advanced NumPy

Reshaping Arrays:

  • To convert an array from one shape to another:
arr = np.arange(8)
arr.reshape((4, 2)) # Reshapes the array into four rows and two columns.
  • To reshape a multidimensional array:
arr.reshape((4,2)).reshape((2,4))
  • To reshape an array by specifying only the column or row dimension and letting numpy figure out the best configuration, pass -1 to the argument:
arr = np.arange(15)
arr.reshape((5,-1))
  • To reshape an array to mirror the shape of another array:
arr = np.arange(15)
other_arr = np.ones((3,5))
arr.reshape(other_arr.shape)
  • To flatten (or ravel) an array, meaning to go from a higher-dimension to a one-dimension array:
arr = np.arange(15).reshape((5,3))
arr.ravel() # ravel does not produce a copy of the underlying data if it does not have to.
 #OR
arr.flatten() # flatten always returns a copy of the data.

Concatenating and Splitting Arrays:

  • numpy.concatenate takes a sequence (tuple, list, etc.) of arrays and joins them together in order along the input axis.
  • To concatenate the long way (along the rows):
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
np.concatenate([arr1, arr2], axis=0)
  • The above can also be done via vstack:
np.vstack((arr1, arr2))
  • To concatenate the wide way (along the columns):
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
np.concatenate([arr1, arr2], axis=1)
  • The above can also be done via hstack:
np.hstack((arr1, arr2))
  • To split an array:
from numpy.random import randn
arr = randn(3,2)
first, second, third = np.split(arr, [1, 2])
  • Array Concatenation Functions:
Function Description
concatenate Most general function, concatenates collection of arrays along one axis
vstack, row_stack Stack arrays row-wise (along axis 0)
hstack Stack arrays column-wise (along axis 1)
column_stack Like hstack, but converts 1D arrays to 2D column vectors first
dstack Stack arrays 'depth'-wise (along axis 2)
split Split array at passed locations along a particular axis
hsplit, vsplit, dsplit Convenience functions for splitting on axis 0, 1, and 2 respectively.

Stacking helpers: r_ and c_:

  • Similar to vstack, np.r_ stacks rows onto other rows:
arr1 = np.arange(6).reshape((3,2))
arr2 = randn(3,2)

np.r_[arr1, arr2]
  • Similar to hstack, np.c_ stacks columns onto other columns:
arr1 = np.arange(6).reshape((3,2))
arr2 = randn(3,2)

np.c_[arr1, arr2]
  • These helpers can also translate slices to arrays:
np.c_[1:6, -10:-5]

Repeating Elements: Tile and Repeat:

  • The two main tools for repeating and replicating arrays to produce larger arrays are the repeat and tile functions.
  • The repeat function replicates each element in an array some number of times, producing a larger array:
arr = np.arange(3)
arr.repeat(3)
  • By default, if you pass an integer, each element of the original array will be repeated that number of times.
  • If you pass an array of integers, each element can be repeated a different number of times.
  • Multidimensional arrays can have their elements repeated along a particular axis:
arr = randn(2, 2)
arr.repeat(2, axis=0)
  • The tile function is a shortcut for stacking copies of an array along an axis. Think of it as laying down tiles:
arr = randn(2,2)
np.tile(arr, 2)
  • To lay them down the long way:
np.tile(arr, (2,1))

Fancy Indexing Equivalents: Take and Put:

  • Fancy indexing is the practice of subsetting an array using integer arrays.
    • It is the same process as masking:
arr = np.arange(10)
arr[arr > 4]
  • NumPy offers other methods for fancy indexing, take and put:
arr = np.arange(10)
inds = [7, 1, 2, 6]
arr.take(inds) # Returns an array of the items at indices 7, 1, 2, and 6 of arr
  • To take along another axis (the array must be multidimensional):
arr = randn(2, 4)
inds = [2, 0, 2, 1]
arr.take(inds, axis=1) # Returns the columns at indices 2, 0, 2, and 1
  • To replace items in an array with fancy indexing:
arr = np.arange(10)
inds = [7, 1, 2, 6]
arr.put(inds, 69) # Modifies arr in place, setting the items at indices 7, 1, 2, and 6 to the number 69.

Broadcasting:

  • Broadcasting describes how arithmetic works between arrays of different shapes.
  • The simplest example of broadcasting occurs when combining a scalar value with an array:
arr = np.arange(5)
 # Output: array([0, 1, 2, 3, 4])
arr * 4
 # Output: array([ 0,  4,  8, 12, 16])
  • Here we say that the scalar value 4 has been broadcast to all of the other elements in the multiplication operation.
  • To use broadcasting to subtract the mean from each item in a column:
arr = randn(4,3)
arr_demeaned = arr - arr.mean(0)

The Broadcasting Rule: Two arrays are compatible for broadcasting if for each trailing dimension (that is, starting from the end), the axis lengths match or if either of the lengths is 1. Broadcasting is then performed over the missing and/or length 1 dimensions.
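  • Per the rule, demeaning each row (broadcasting over axis 1) requires reshaping the row means so the trailing dimension is 1:
arr = randn(4, 3)
row_means = arr.mean(1)                    # shape (4,)
demeaned = arr - row_means.reshape((4, 1)) # or: arr - arr.mean(1)[:, np.newaxis]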

ufunc Instance Methods:

  • reduce takes a single array and aggregates its values, optionally along an axis.
  • For example, an alternate way to sum elements in an array is to use np.add.reduce:
arr = np.arange(10)
np.add.reduce(arr) # Is the same as arr.sum()
  • accumulate is related to reduce the way cumsum is related to sum: it preserves the intermediate aggregates:
arr = np.arange(15).reshape((3,5))
np.add.accumulate(arr, axis=1)
  • ufunc methods:
Method Description
reduce(x) Aggregate values by successive applications of the operation
accumulate(x) Aggregate values, preserving all partial aggregates
reduceat(x, bins) 'Local' reduce or 'groupby'. Reduce contiguous slices of data to produce aggregated array.
outer(x, y) Apply operation to all pairs of elements in x and y. Result array has shape x.shape + y.shape
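  • A minimal sketch of outer, whose result has a dimension for each dimension of the inputs:
x = randn(3, 4)
y = np.arange(5)
np.subtract.outer(x, y).shape # (3, 4, 5)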

More About Sorting:

  • Like Python's built-in list, the ndarray sort instance method is an in-place sort, meaning that the array contents are rearranged without producing a new array:
arr = randn(6)
arr.sort()
  • To sort a multidimensional array in-place along an axis, for example sorting the values within each row (axis=1):
arr = randn(3, 5)
arr.sort(axis=1)
  • When sorting arrays in-place, remember that if the array is a view on a different ndarray, the original array will be modified.
  • To avoid this, use numpy.sort, which creates a new, sorted copy of an array:
arr = randn(6)
np.sort(arr)
  • To np.sort along the rows (axis=1):
arr = randn(3, 5)
np.sort(arr, axis=1)
  • Neither arr.sort() nor np.sort(arr) has a reverse option. Since arr.sort() sorts in place and returns None, reverse by slicing the sorted result:
arr.sort()
arr[::-1]
 #OR
np.sort(arr)[::-1]

numpy.searchsorted: Finding elements in a Sorted Array:

  • searchsorted is an array method that performs a binary search on a sorted array, returning the location in the array where the value would need to be inserted to maintain sortedness:
arr = np.array([0, 1, 7, 12, 15])
arr.searchsorted(9) # Returns 3
  • You can also pass an array of values to get an array of indices back:
arr.searchsorted([0, 8, 11, 16])
  • The default behavior is to return the index at the left side of a group of equal values:
    • To change this to the right side:
arr.searchsorted([0, 1], side='right')

Appendix: Python Language Essentials

The Python Interpreter:

  • Python is an interpreted language.
  • To run python from the console:
$ python
  • To run a python script from the console:
$ python script.py
  • To run ipython from the console:
$ ipython
  • To run a script within ipython:
%run script.py

Comments:

  • Any text preceded by the hash mark (pound sign) # is ignored by the Python interpreter.
# This is a comment
  • To comment out an entire section, highlight it and then press Command + /

Data Types:

  • To check if an object is an instance of a particular type:
a = 5
isinstance(a, int)
  • To check if an object's type is among multiple types:
a = 5
isinstance(a, (int, float))

Strings:

  • The backslash character \ is an escape character, meaning that it is used to specify special characters like newline \n or unicode characters. To write a string literal with backslashes, you need to escape them:
s = '12\\34'
print s # Returns 12\34
  • If you have a string with a lot of backslashes and no special characters, you might find this a bit annoying. Fortunately you can preface the leading quote of the string with r which means that the characters should be interpreted as is:
s = r'this\has\no\special\characters'
s # Returns 'this\\has\\no\\special\\characters'

Booleans:

  • You can see exactly what boolean value an object coerces to by invoking bool on it:
bool([]) # False
bool('Hello World') # True

Dates and Times:

  • The built-in Python datetime module provides datetime, date, and time types.
    • The datetime type, as you may imagine, combines the information stored in date and time and is the most commonly used.
  • To create a datetime object:
from datetime import datetime, date, time
dt = datetime(2011, 10, 29, 20, 30, 21)
  • To access various parts of a datetime object:
dt.year
dt.month
dt.day
dt.hour
dt.minute
dt.second
  • To extract just the date part of a datetime object:
dt.date()
  • To extract just the time part of a datetime object:
dt.time()
  • To format a datetime as a string:
dt.strftime('%m/%d/%Y')
  • To convert (parse) a string to a datetime object:
datetime.strptime('20091031', '%Y%m%d')
  • To replace certain parts of a datetime object:
dt.replace(minute=0, second=0)

for loops:

  • A for loop can be advanced to the next iteration, skipping the remainder of the block, using the continue keyword. For example, this code sums up integers in a list and skips None values:
sequence = [1, 2, None, 4, None, 5]
total = 0
for value in sequence:
    if value is None:
        continue
    total += value

Exception Handling:

  • To handle exceptions:
def attempt_float(x):
    try:
        return float(x)
    except:
        return 'float() only accepts numerical values.'
  • To handle a specific exception, for instance ValueError:
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return 'float() only accepts numerical values'
  • To handle more than one specific exception:
def attempt_float(x):
    try:
        return float(x)
    except (ValueError, TypeError):
        return 'float(x) only accepts numerical values'
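  • To have some code run regardless of whether the try block succeeds, use finally; a minimal sketch (the file name and data are placeholders):
f = open('tmp.txt', 'w')
try:
    f.write('some data')
finally:
    f.close() # always runs, even if the write raises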