- No prerequisite for previous R use, but we do expect Python familiarity. We require some basic programming experience (say, equivalent to a hypothetical "Programming 101"), but it doesn't have to be specifically in R or Python.
- Should focus on hands-on doing rather than lectures with separate exercises (see the CodeRefinery approach).
- Presentation technology bikeshedding: if this were a Python-only course, Jupyter notebooks would be the obvious choice. But what about R users? Jupyter isn't that popular there; R users tend to use RStudio, which provides R Markdown documents that support the same kind of "literate programming" as Jupyter notebooks.
Key topics:
- IO, data storage formats (local disks, scratch, ...)
- Comparison of type of tools/libraries for different tasks
- Filesystems (what we have available)?
- matplotlib/ggplot
- Optimizing memory usage
- Parallelization: split, apply, combine, array jobs
Secondary topics:
- profiling
- slurm scripts/slurm history/array jobs
- memory/object models
- seff
- Should use Python 3.x (http://python3statement.org/). Python for Data Analysis, 2nd edition (Wes McKinney's pandas book) also uses Python 3.
How much do we want to teach Hadleyverse stuff vs. out-of-the-box R stuff?
- ggplot, at least, is IMHO considerably better than the built-in plotting and is widely used.
Unlike the outline below, these are the big lessons people should take away from the things we teach.
- use the right tools, data structures, and libraries
- automation of workflows. Don't do everything manually
- use good file formats
- good development environments, IDEs, ...
- profiling (and less debugging)
The general idea is that we do the same workshop/session/lecture/whatever twice, once with R and once with Python. That allows us to reuse lecture materials for both courses and share improvements.
- Introduction
- What does the course cover?
- Data Frames
- What kind of data structure is it? Compare to the other usual suspects, lists, dicts, N-d arrays.
- Special features: Categories/Factors, missing values
- Useful for tabular data (CSV files, some similarities with RDBMS)
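A minimal pandas sketch of these points (the R side would use a data.frame/tibble with factors and NA); the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# A DataFrame is a dict-like collection of named, typed columns (Series)
# that share one row index -- unlike a plain dict of lists or a 2-D array.
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "group": pd.Categorical(["a", "b", "a"]),   # Category / Factor column
    "score": [1.5, np.nan, 3.0],                # missing value
})

print(df.dtypes)            # per-column types, including 'category'
print(df["score"].mean())   # missing values are skipped by default
```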
- Get people set up
- Start an RStudio / Jupyter notebook session on a compute node via Slurm
- SSH keys (at least for R)
- Introductory exercises
- numpy/pandas beginnings (or the equivalent for R)
- Profiling, debugging
- A few more short exercises
- I/O
- HDF5 / pytables
- sqlite
- csv
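A rough sketch of reading and writing these formats from pandas (file and column names are hypothetical; to_hdf/read_hdf need PyTables installed):

```python
import sqlite3
import pandas as pd

# CSV: simple and portable, but text-based and slow for large data
df = pd.read_csv("measurements.csv")

# HDF5 (via PyTables): binary and typed, good for large numerical tables
df.to_hdf("measurements.h5", key="measurements", mode="w")
df2 = pd.read_hdf("measurements.h5", "measurements")

# SQLite: a single-file SQL database, queryable without loading everything
with sqlite3.connect("measurements.sqlite") as conn:
    df.to_sql("measurements", conn, if_exists="replace", index=False)
    subset = pd.read_sql_query(
        "SELECT * FROM measurements WHERE value > 0", conn)
```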
- Even more exercises
- Maybe move part of I/O from day 1 here?
- Split-apply-combine
- Motivation, why is this a common and useful workflow?
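As motivation, a tiny pandas version of the pattern (dplyr's group_by() + summarise() in R); the data are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5],
})

# split by "group", apply a mean to each piece, combine into one result
result = df.groupby("group")["value"].mean()
print(result)
```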
- Running on a parallel batch system
- Small problem: Everything in one process
- Medium: Apply part in parallel using multiprocessing or other simple technique.
- Large: Apply part in parallel using slurm array jobs, and using job dependencies to correctly order the split, apply, and combine phases.
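A sketch of the "medium" case with multiprocessing (the split and combine stay in the parent process; analyze_chunk is a stand-in for the real per-chunk work):

```python
from multiprocessing import Pool

def analyze_chunk(chunk):
    # placeholder for the real "apply" step on one piece of the data
    return sum(chunk) / len(chunk)

if __name__ == "__main__":
    data = list(range(1000))
    # split: cut the data into chunks
    chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
    # apply: run the work function on each chunk in parallel
    with Pool(processes=4) as pool:
        partial = pool.map(analyze_chunk, chunks)
    # combine: merge the per-chunk results
    print(sum(partial) / len(partial))
```

The "large" case replaces the Pool with one Slurm array task per chunk, plus dependent jobs for the split and combine phases.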
- Visualization with matplotlib & ggplot
- Seaborn could be interesting too (a statistics-focused layer on top of matplotlib), but I have no personal experience with it.
- For matplotlib we could cover tricks like using LaTeX to render math in axis labels, etc. (see the sketch below).
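A small sketch of the object-oriented matplotlib style with math in the labels (this uses the built-in mathtext; full LaTeX rendering would need rcParams["text.usetex"] = True and a LaTeX installation):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

# object-oriented API: keep explicit Figure and Axes handles
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label=r"$\sin(x)$")
ax.set_xlabel(r"$x$ (rad)")
ax.set_ylabel(r"$\sin(x)$")
ax.legend()
fig.savefig("sine.png", dpi=150)  # saving from a script keeps the figure reproducible
```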
- Workflows for visualization
- repeatability is important!
- putting plotting stuff into scripts vs. redoing it
- using make for managing workflows
- Day 1
- 30 min: general course intro and Jupyter notebook intro
- 30 min: data types and numpy
- ufuncs, broadcasting, ...
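For instance, a few lines showing ufuncs and broadcasting on toy data:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # shape (3, 4)
col_means = a.mean(axis=0)        # shape (4,)

# broadcasting: the (4,) vector of means is stretched across all 3 rows
centered = a - col_means

# ufuncs: elementwise functions applied to whole arrays at once
print(np.sqrt(a))
print(centered)
```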
- 30 min: pandas intro, some puzzles. Baby names example (from: pandas 100 puzzles)
- 30 min: advanced dataframe operations (PROFILING)
- 30 min: read data to pandas from sqlite (and other formats)
- 15 min: More about notebooks: publishing, version control, questions and so on.
- Day 2: data handling
- 30 min: Split-apply-combine (DEBUGGING)
- 15 min: small vs large files: intro and a basic benchmarking
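A rough benchmarking sketch for this slot (file names are hypothetical; the HDF5 part again needs PyTables):

```python
import time
import numpy as np
import pandas as pd

# a made-up table, just big enough that the storage format starts to matter
df = pd.DataFrame(np.random.random((1_000_000, 4)), columns=list("abcd"))
df.to_csv("big.csv", index=False)
df.to_hdf("big.h5", key="data", mode="w")

for label, read in [("csv", lambda: pd.read_csv("big.csv")),
                    ("hdf5", lambda: pd.read_hdf("big.h5", "data"))]:
    t0 = time.perf_counter()
    read()
    print(f"{label}: {time.perf_counter() - t0:.2f} s")
```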
- 30 min: advanced storage formats: HDF5, sqlite, containers for machine learning, etc.
- 30 min: basic automation with makefiles
- Day 3: visualization
- 10 min: Intro and video: a high-level motivational video on visualization (one that describes the four(?) basic ways of visually representing data and what each is good for).
- 30 min: matplotlib basic concepts: figures vs. axes vs. axis objects, object-orientedness, common arguments, the object-oriented API vs. the implicit global (pyplot) API, etc. Seaborn. The graphics-stack big picture: flexible but hard-to-use tools vs. limited-purpose but easy-to-use tools. Scriptability of graphics.
- ?? min: matplotlib/seaborn examples.
- (insert rest here)
- 30 min: interactive visualization with jupyter and widgets.
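A minimal ipywidgets sketch for that slot (runs inside a Jupyter notebook with the ipywidgets extension enabled):

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

x = np.linspace(0, 2 * np.pi, 200)

# interact() builds a slider for `freq` and re-runs the plot on every change
@interact(freq=(1, 10))
def plot_sine(freq=1):
    plt.plot(x, np.sin(freq * x))
    plt.show()
```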
- https://www.machinelearningplus.com/101-numpy-exercises-python/
- https://github.com/rougier/numpy-100/blob/master/100%20Numpy%20exercises.md (lots of overlap with first link above)
- https://pandas.pydata.org/pandas-docs/stable/cookbook.html
- https://github.com/ajcr/100-pandas-puzzles
- https://github.com/guipsamora/pandas_exercises
- Ubuntu IRC logs - https://irclogs.ubuntu.com/ - mirror locally in advance
- https://zenodo.org/record/1186215 - transport networks. In sqlite3 format and others.
- Iris dataset, http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html (famous example, used in some exercises)