On this page we have the course notes for the second part of the Data Science 871 module.
The first part is taught by Nico Katzke and focuses on data wrangling and good programming practices. You can find his notes here.
In this part of the course we will cover many different topics. The idea is to give you a whirlwind tour of many of the important ideas in Data Science. I want to thank Grant McDermott for making his data science notes freely available. Many of the notes here are just amended versions of his material. I have copied certain portions of his notes directly (with his permission). Make sure you go and check out his new book, Data Science for Economists and Other Animals.
We will be using R as our primary language for this course. However, I would like to introduce you to the basics of Julia.
I really enjoy coding in Julia, which is one of the main reasons I like to share it with others. Julia is used primarily in scientific computing, but recently there has been a shift toward using Julia for applications in data science. This will definitely be a language to keep an eye out for in the future. The syntax of the language feels like a combination of Matlab and Python. If you are thinking of taking the advanced time series course then you might want to pay careful attention to the Julia component of the course, since we will use it there. Julia is not used as much in industry; it is more common at research institutions such as universities.
The most popular language for machine learning is Python. I will provide some nice links to resources for getting started with Python. We will briefly talk about the differences between languages throughout the course. We hope that you realise after this course that the choice of language is not important to your success. Most languages are very similar and you should be able to switch pretty easily between languages once you know the basic principles of programming. I have found that a lot of people in industry use Python, especially engineers, applied mathematicians, computer scientists, etc. So getting to know some Python in the future is not a bad idea.
My advice for the beginner is to pick the language you can most easily express yourself in when attempting new projects (try to start with R). Become comfortable with that language, and then look to expand your portfolio once you feel that you can code in that one language. The reason for showing Julia in this course is so that you can compare it with R (and any other language you know) and decide which language you prefer. If you enjoy programming in a language you will do more of it, so it is worthwhile figuring out what you like. I mostly use Julia and Python, but you might find your niche somewhere else.
Below are my details. You can contact me via email. If I don't respond within a day or so, please remind me with a follow-up email.
Dawie van Lill | |
---|---|
Email | dvanlill@sun.ac.za |
Office | Schumann 511 |
GitHub | DawievLill |
If you are using Windows, please go to the following link and install all the software indicated on that page. This will all be useful for a data scientist.
If you are using macOS, you can go here.
If you are using Ubuntu or some other Linux distribution, you can go here.
The list of software that you need to have installed for this course is,
- Visual Studio Code (and its extensions)
- Git (for Windows we will use Git Bash -- NB to install!)
- Python (via Anaconda)
- JupyterLab
- R (and associated tools per OS such as IRkernel and RStudio)
- Julia (not on that page, but can be downloaded here)
- Make
If you are using an operating system other than Windows, many of these programs will already be installed. If you are struggling to install something on another OS, please come and speak to me during the first lecture.
This course schedule will be updated frequently, so please check before each lecture for the requisite readings and lecture notes. The notebooks and slides that I provide are brief summaries of the readings. If you want more information you should do the readings or look at the references below.
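If you want a quick way to see which of the programs in the list above are already installed, a minimal sketch along the following lines might help. It only uses the Python standard library. Note that the command names in `REQUIRED_TOOLS` are my assumptions (they are not taken from the installation pages) and can differ by operating system and installer, so a "not found" result does not necessarily mean the program is missing.

```python
import shutil

# Command names are assumptions: the executable for each tool can differ
# by operating system and installer (e.g. `python3` instead of `python`).
REQUIRED_TOOLS = {
    "Visual Studio Code": "code",
    "Git": "git",
    "Python": "python",
    "JupyterLab": "jupyter",
    "R": "Rscript",
    "Julia": "julia",
    "Make": "make",
}


def find_tools(tools):
    """Map each tool name to the path of its executable, or None if absent."""
    return {name: shutil.which(command) for name, command in tools.items()}


def report(found):
    """Build one status line per tool."""
    lines = []
    for name, path in found.items():
        status = path if path is not None else "not found on PATH"
        lines.append(f"{name}: {status}")
    return lines


if __name__ == "__main__":
    for line in report(find_tools(REQUIRED_TOOLS)):
        print(line)
```

Running the script prints one line per program. Anything reported as not found is either not installed or not on your `PATH`, which is worth sorting out before the first lecture.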
The R notebooks are mostly complete and form the basis of this course. However, I am currently working on porting over some of the code for the Julia and Python sections, so that students can see what the code would look like in other languages. This is still a work in progress though. The Julia port is happening first and then I will move to Python when I have the time.
The notes for R are written in R Markdown, while the notebooks for Julia and Python are in Jupyter notebook format. It is worthwhile learning how to use both R Markdown and Jupyter notebooks, since Jupyter notebooks are widely used in industry.
If students are interested in extra credit (marks) and some recognition, they can fork this repository and submit pull requests or open issues. I am especially interested in ports of the current code base, so those will count even more heavily in your favour.
Topic | R Notebooks | Julia Notebooks 🚧 | Python Notebooks 🚧 | Readings |
---|---|---|---|---|
Shell | [.rmd / .html] | | | Merely Useful Ch2-5 + EC 607 Slides |
Git, GitHub and Make | [.rmd / .html] | | | EC 607 Slides |
SQL basics | [.rmd / .html] | | | EC 607 notes + Software Carpentry notes |
Introduction to Julia | | [.ipynb] | [.ipynb] | Basics + QE notes + CTU notes |
Data basics with Julia | | [.ipynb] | [.ipynb] | DataFrames.jl + official documentation here |
Functions | [.rmd / .html] | [.ipynb] | [.ipynb] | EC 607 notes # 1 + EC607 notes #2 |
Parallel programming | [.rmd / .html] | [.ipynb] | [.ipynb] | EC 607 notes |
Fundamentals of ML | [.rmd / .html] | [.ipynb] | [.ipynb] | BB notes |
Shrinkage methods | [.rmd / .html] | [.ipynb] | [.ipynb] | BB notes |
Decision trees and bagging | [.rmd / .html] | [.ipynb] | [.ipynb] | BB notes |
Random forests and gradient boosting | [.rmd / .html] | [.ipynb] | [.ipynb] | BB notes |
Cloud computing (optional) | [.rmd / .html] | [.ipynb] | [.ipynb] | EC 607 notes # 1 + EC607 notes #2 |
Working with Big Data (optional) | [.rmd / .html] | [.ipynb] | [.ipynb] | EC 607 notes |
In this part of the course there is no exam, only a final project. We will discuss this project in more detail during the semester. However, the basic idea is that you apply machine learning techniques to any type of data that you find interesting. The topic should ideally be related to economics. Please consult with me or Nico about your topic, so that we can talk about the availability of data and whether the project is feasible. We prefer that you code in R, but if you want to code in Python or Julia, please come and have a chat.
An alternative to the shell notes that I linked would be the following
- [Shell] Software Carpentry. The Unix Shell. https://swcarpentry.github.io/shell-novice/
For the SQL component we will reference the following,
- [SQL] Software Carpentry. Databases and SQL. https://swcarpentry.github.io/sql-novice-survey/
We will work through the basics of Julia, but if you want a good introduction to Julia I would recommend,
- [Julia] QuantEcon. Getting Started with Julia. https://julia.quantecon.org/getting_started_julia/index.html
- [Julia] Paul Soderlind. Julia Tutorial. https://github.com/PaulSoderlind/JuliaTutorial
- [Julia] Ben Lauwens and Allen Downey (2018). Think Julia: How to Think Like a Computer Scientist. https://benlauwens.github.io/ThinkJulia.jl/latest/book.html
For data science components in R we will use,
- [R] Grant McDermott (2021). Data Science for Economists and Other Animals. https://grantmcdermott.com/ds4e/index.html
For data science in Julia a good textbook is,
- [Julia] Jose Storopoli, Rik Huijzer and Lazaro Alonso (2021). Julia Data Science. https://juliadatascience.io
For the machine learning in R we will be using material from the following book,
- [R] Bradley Boehmke and Brandon Greenwell (2021). Hands-on Machine Learning with R. https://bradleyboehmke.github.io/HOML/
If you want machine learning tutorials in Julia then you could look at,
- [Julia] Data Science Tutorials in Julia. https://juliaai.github.io/DataScienceTutorials.jl/
Other resources might also be used, so check the reading list for each of the lectures.