Planning

No previous R experience is required, but some Python familiarity is expected.

We do require some basic programming experience (say, equivalent to some hypothetical "Programming 101"), but it doesn't have to be specifically in R/Python.

The course should focus on hands-on work rather than lectures followed by separate exercises (see the CodeRefinery approach).

Presentation technology bikeshedding

If this were a Python-only course, Jupyter notebooks would be the obvious choice. But what about R users? Jupyter isn't that popular there; R users tend to use RStudio, which provides "R Markdown" documents that support the same kind of "literate programming" as Jupyter notebooks.

Notes

Key topics:

IO, data storage formats (local disks, scratch, ...)
Comparison of the types of tools/libraries for different tasks
Filesystems (what we have available)?
matplotlib/ggplot
Optimizing memory usage
Parallelization - split, apply, combine, array jobs

Secondary topics:

profiling
slurm scripts/slurm history/array jobs
memory/object models
seff

Python specific

Should use Python 3.x (http://python3statement.org/). Python for Data Analysis, 2nd edition (Wes McKinney's pandas book) also uses Python 3.

R specific

How much do we want to teach Hadleyverse stuff vs. out-of-the-box R stuff?

ggplot at least is IMHO quite a lot better than the built-in plotting and widely used.

Outline

The general idea is that we do the same workshop/session/lecture/whatever twice, once with R and once with Python. That allows us to reuse lecture materials for both courses and share improvements.

Day 1

Introduction
- What does the course cover?
- Data Frames
  - What kind of data structure is it? Compare to the other usual suspects: lists, dicts, N-d arrays.
    - Special features: Categories/Factors, missing values
  - Useful for tabular data (CSV files, some similarities with RDBMS)
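
As a minimal sketch of the data frame features listed above, assuming pandas is the Python library used in the course (the outline does not fix this), the following builds a small table with a categorical column and a missing value. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one categorical column, one column with a missing value.
df = pd.DataFrame(
    {
        "species": pd.Categorical(["cat", "dog", "cat", "dog"]),
        "weight_kg": [4.2, np.nan, 3.9, 21.0],
        "age_years": [3, 7, 2, 5],
    }
)

# Each column has its own dtype, unlike a homogeneous N-d array.
print(df.dtypes)

# Missing values are represented as NaN and can be counted explicitly.
n_missing = df["weight_kg"].isna().sum()

# Reductions such as mean() skip missing values by default.
mean_weight = df["weight_kg"].mean()

# A categorical column stores a fixed set of levels (similar to an R factor).
categories = list(df["species"].cat.categories)

print(n_missing, mean_weight, categories)
```

The same table could equally be read from a CSV file with `pd.read_csv`; building it inline just keeps the sketch self-contained.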

Day 2

Split-apply-combine
- Motivation, why is this a common and useful workflow?
- Running on a parallel batch system
  - Small problem: Everything in one process
  - Medium: Run the apply step in parallel using multiprocessing or another simple technique.
  - Large: Run the apply step in parallel using Slurm array jobs, with job dependencies to correctly order the split, apply, and combine phases.
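
The three problem sizes above can be sketched with the standard library alone. This is an illustrative outline, not course material: the function names (`split`, `apply_group`, `combine`, `run_serial`, `run_parallel`, `run_array_task`) and the toy records are hypothetical. The large case assumes that the array job indices run from 0 to the number of groups minus one, so that each array task handles exactly one group; `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for array tasks.

```python
import os
from collections import defaultdict
from multiprocessing import Pool
from statistics import mean


def split(records):
    """Split: group (key, value) records by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return sorted(groups.items())


def apply_group(item):
    """Apply: summarize one group independently of all the others."""
    key, values = item
    return key, mean(values)


def combine(results):
    """Combine: gather the per-group results into one dict."""
    return dict(results)


def run_serial(records):
    """Small case: everything in one process."""
    return combine(map(apply_group, split(records)))


def run_parallel(records, processes=2):
    """Medium case: the apply step runs in a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return combine(pool.map(apply_group, split(records)))


def run_array_task(records, environ=os.environ):
    """Large case: one Slurm array task applies the function to one group.

    Assumes array indices 0..n_groups-1; falls back to 0 outside Slurm.
    """
    task_id = int(environ.get("SLURM_ARRAY_TASK_ID", "0"))
    return apply_group(split(records)[task_id])


if __name__ == "__main__":
    records = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0), ("c", 5.0)]
    print(run_serial(records))
    print(run_parallel(records))
```

In the large case each array task would typically write its result to its own file, and a separate combine job (submitted with a dependency on the array job) would read those files; that file handling is left out here to keep the sketch short.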

Day 3

Visualization with matplotlib & ggplot
- Seaborn could be interesting too (a statistics-focused layer on top of matplotlib), but I have no personal experience with it.
- For matplotlib, we could cover tricks such as using LaTeX to render math in axis labels etc.
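
A minimal sketch of the math-in-axis-labels trick, assuming NumPy and matplotlib are installed. It uses matplotlib's built-in mathtext renderer, which understands a subset of LaTeX math syntax and needs no LaTeX installation; setting `plt.rcParams["text.usetex"] = True` would hand rendering to a real LaTeX installation instead. The plotted function and the output file name are arbitrary choices for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, suitable for batch jobs

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x) ** 2)

# Raw strings keep the backslashes intact; text between $...$ is typeset as math.
ax.set_xlabel(r"$\theta$ (rad)")
ax.set_ylabel(r"$\sin^2(\theta)$")

# Hypothetical output file name.
fig.savefig("sin_squared.png")
```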