Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
Instructors:
- Dr. James Scott. Office hours on Fridays, 6:40 AM to 7:30 AM and 2:00 PM to 3:00 PM, via Zoom. (All times are US Central Time.)
- Dr. Jennifer Starling. Office hours on Mondays, 8:00 AM to 8:50 AM and 1:00 PM to 2:00 PM, US Central Time.
Students in both sections are welcome to attend either set of office hours!
The exercises are available here. These are due Monday, August 17th at 5 PM, U.S. Central Time. Pace yourself over the next few weeks, and start early on the first couple of problems!
Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- Introduction to RMarkdown
- RMarkdown tutorial
- Introduction to GitHub
- Getting started with GitHub Desktop
- Jeff Leek's guide to sharing data
Your assignment after the first class day:
- Create a GitHub account.
- Create your first GitHub repository.
- Inside that repository (on your local machine), create a toy RMarkdown file that does something, e.g., simulates some normal random variables and plots a histogram.
- Knit that RMarkdown file to a Markdown (.md) output.
- Push the changes to GitHub and view the final (knitted) .md file.
These instructions will make sense after you read the tutorials above!
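If you want a starting point for the toy file, here is a minimal sketch of the R code that could go inside a single code chunk. The seed, sample size, and plot labels are arbitrary choices for illustration, not requirements of the assignment.

```r
# Toy example: simulate normal random variables and plot a histogram.
# The seed and sample size below are arbitrary illustrative choices.
set.seed(1)
x <- rnorm(1000, mean = 0, sd = 1)

hist(x, breaks = 30,
     main = "1000 simulated standard normal draws",
     xlab = "x")
```

To get Markdown output when you knit, one option is to set `output: github_document` (or `output: md_document`) in the YAML header of the .Rmd file.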
Slides: Some fun topics in probability
Two short pieces that illustrate the "fallacy of mistaken compounding":
- How likely is it that birth control could let you down? from the New York Times
- An excerpt from Chapter 7 of AIQ: How People and Machines are Smarter Together, by Nick Polson and James Scott.
Optional reference: Chapter 1 of these course notes. There's a lot more technical stuff in here, but Chapter 1 really covers the basics of what every data scientist should know about probability.
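One way to see the compounding idea in the two pieces above is a quick calculation: a risk that looks small over one year can become large over many years of repeated exposure. The sketch below assumes a hypothetical 9% annual failure rate and assumes years are independent; both are simplifications chosen for illustration, not figures taken from the readings.

```r
# Hypothetical annual failure probability (an assumption for illustration)
p_annual <- 0.09
years <- 1:10

# Probability of at least one failure within n years,
# assuming independence across years: 1 - (1 - p)^n
p_at_least_one <- 1 - (1 - p_annual)^years

data.frame(years = years, prob_at_least_one = round(p_at_least_one, 3))
```

The mistake is to read the one-year number as if it described the long-run risk.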
Topics: data visualization and practice with R.
Slides: Introduction to Data Exploration
R scripts and data:
Inspiration and further reference:
- Excerpts from some course notes on data science. You'll find some example figures in Chapter 1.
- 50 ggplots
- A map of average ages in Swiss municipalities
- Low-income students in college
- The French presidential election
- LeBron James's playoff scoring record
The bootstrap; joint distributions; using the bootstrap to approximate value at risk (VaR).
Slides: Introduction to the bootstrap
Reference: ISL Section 5.2 for a basic overview of the bootstrap.
For the class exercises, you will need to refer to any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
R scripts and data:
Supplemental resources:
- R walkthrough on Monte Carlo simulation
- These notes on bootstrapping and the permutation test.
- Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
- This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
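As a rough illustration of how the bootstrap can approximate value at risk, here is a minimal sketch for a single asset. The returns are simulated stand-ins for real historical data, and the initial wealth, horizon, and number of bootstrap samples are all hypothetical choices. The class scripts may take a different approach, for example resampling several assets jointly.

```r
set.seed(380)

# Stand-in for historical daily returns (simulated here; use real data in practice)
returns <- rnorm(500, mean = 0.0005, sd = 0.01)

initial_wealth <- 10000   # hypothetical starting portfolio value
horizon <- 20             # trading days to simulate forward
n_boot <- 2000            # number of bootstrap trajectories

# Each trajectory resamples `horizon` daily returns with replacement
# and compounds them to get wealth at the end of the horizon.
final_wealth <- replicate(n_boot, {
  sampled <- sample(returns, size = horizon, replace = TRUE)
  initial_wealth * prod(1 + sampled)
})

# 5% value at risk: the 5th percentile of profit/loss, reported as a positive loss
profit <- final_wealth - initial_wealth
var_5 <- -unname(quantile(profit, probs = 0.05))
var_5
```

Resampling whole days with replacement treats the days as independent, which is a simplifying assumption.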
Basics of clustering; K-means clustering; hierarchical clustering.
Slides: Introduction to clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3 or Elements Chapter 14.3 (more advanced)
- K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
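To make the initialization idea concrete, here is a minimal base-R sketch of K-means++ seeding followed by K-means and hierarchical clustering. The helper `kmeanspp_init` is a hypothetical function written for this example, not part of any package, and the built-in `iris` measurements are used only as convenient toy data.

```r
set.seed(42)
X <- scale(iris[, 1:4])   # toy data: center and scale the four measurements

# Hypothetical helper: K-means++ seeding.
# First center is chosen uniformly at random; each later center is chosen
# with probability proportional to squared distance from the nearest
# center picked so far.
kmeanspp_init <- function(X, k) {
  n <- nrow(X)
  chosen <- integer(k)
  chosen[1] <- sample.int(n, 1)
  d2 <- rowSums(sweep(X, 2, X[chosen[1], ])^2)
  for (j in seq_len(k)[-1]) {
    chosen[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(X, 2, X[chosen[j], ])^2))
  }
  X[chosen, , drop = FALSE]
}

# K-means started from the K-means++ centers
init <- kmeanspp_init(X, k = 3)
km <- kmeans(X, centers = init)
table(km$cluster)

# Hierarchical clustering on Euclidean distances, cut into 3 clusters
hc <- hclust(dist(X), method = "average")
hc_clusters <- cutree(hc, k = 3)
table(hc_clusters)
```

If you stick with random initialization instead, the `nstart` argument of `kmeans` runs several random starts and keeps the best one.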
Principal component analysis (PCA).
Slides: Introduction to PCA
Scripts and data for class:
- pca_intro.R
- nbc.R, nbc_showdetails.csv, nbc_pilotsurvey.csv
- congress109.R, congress109.csv, and congress109members.csv
- ercot_PCA.R, ercot.zip
A few other examples we likely won't cover in class:
Readings:
- ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.
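For a quick feel for what PCA produces before opening the class scripts, here is a minimal sketch using base R's `prcomp` on the built-in `USArrests` data. The dataset is chosen only for convenience and is not one of the class examples.

```r
# Toy data: four numeric variables measured on 50 US states
X <- USArrests

# PCA on centered and scaled variables
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 3)

# Loadings: how each original variable contributes to each component
round(pc$rotation, 2)

# Scores: each state's coordinates on the first two components
head(pc$x[, 1:2])
```

Scaling matters here because the variables are on very different scales; without it, the variable with the largest variance would dominate the first component.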
Networks and association rule mining.
Slides: Intro to networks. Note: these slides refer to "lastfm.R" but this is the same thing as "playlists.R" below.
Some supplemental slides on association rule mining. These contain the details of the apriori algorithm. If there's time we might cover some of this in class, but mainly we'll focus on the shorter intro slides above, together with the example R scripts below.
Software you'll need:
- Gephi, a great piece of software for exploring graphs
- The Gephi quick-start tutorial
Scripts and data:
- medici.R and medici.txt
- playlists.R and playlists.csv
- microfi.R, microfi_households.csv, and microfi_edges.txt.
Supplemental resource: In-depth explanation of the Apriori algorithm
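The three quantities that association rule mining reports (support, confidence, and lift) can be computed by hand on a tiny example. The sketch below uses a made-up set of five playlists and a hypothetical helper `support`; real analyses would typically use a dedicated package rather than this hand-rolled version.

```r
# Made-up playlists (each element is one listener's set of artists)
baskets <- list(
  c("radiohead", "coldplay", "muse"),
  c("radiohead", "muse"),
  c("coldplay", "the killers"),
  c("radiohead", "coldplay"),
  c("muse", "the killers")
)

# Hypothetical helper: fraction of baskets containing all the given items
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Evaluate the rule {radiohead} -> {muse}
lhs <- "radiohead"
rhs <- "muse"

supp_rule <- support(c(lhs, rhs))        # P(lhs and rhs)
conf_rule <- supp_rule / support(lhs)    # P(rhs | lhs)
lift_rule <- conf_rule / support(rhs)    # P(rhs | lhs) / P(rhs)

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

A lift above 1 means the right-hand item shows up more often among baskets containing the left-hand item than it does overall.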
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Scripts and data:
If there's time in class, we'll cover the script below. If not, it's still a useful starting point for your homework:
Supplemental material:
- Great blog post about word vectors.
- Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
- Using the tm package for text mining in R.
- Some slides on Naive Bayes text classification.
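As a small illustration of TF-IDF weighting, here is a base-R sketch on three made-up documents. It uses one common variant (term frequency as a within-document proportion, and IDF as the log of the number of documents over the document frequency); other references, and packages such as tm, may use slightly different definitions.

```r
# Three made-up documents
docs <- c(
  doc1 = "the cat sat on the mat",
  doc2 = "the dog sat on the log",
  doc3 = "cats and dogs are pets"
)

# Tokenize on whitespace and build the vocabulary
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix of raw counts (rows = documents, columns = terms)
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Term frequency: share of each document's tokens taken up by each term
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log(N / number of documents containing the term)
doc_freq <- colSums(dtm > 0)
idf <- log(nrow(dtm) / doc_freq)

# TF-IDF: scale each term's column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, "*")
round(tfidf, 3)
```

Terms that appear in many documents, like "the", get a low IDF and are down-weighted relative to terms that appear in only one document.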