README.Rmd

---
title: "OkCupid Profile Data for Intro Stats and Data Science Courses"
author: "Albert Y. Kim and Adriana Escobedo-Land"
output: github_document
references:
- id: Kim2015
  title: OkCupid Profile Data for Introductory Statistics and Data Science Courses
  author:
  - family: Kim
    given: Albert Y. Kim
  - family: Escobedo-Land
    given: Adriana
  ISSN: 1069-1898
  volume: 23
  URL: 'http://www.amstat.org/publications/jse/v23n2/kim.pdf'
  publisher: American Statistical Association
  issue: 2
  type: article-journal
  issued:
    year: 2015
    month: 7
---

```{r, echo=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages(library(mosaic))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(knitr))
```

Data and code for [OkCupid Profile Data for Introductory Statistics and Data Science Courses](http://www.amstat.org/publications/jse/v23n2/kim.pdf) (Journal of Statistics Education July 2015, [Volume 23, Number 2](http://www.amstat.org/publications/jse/contents_2015.html)).

* `JSE.bib`:  bibliography file
* `JSE.pdf`:  PDF of document
* `JSE.Rnw`:  R Sweave document to recreate `JSE.pdf`. 
* `JSE.R`:  R code used in document
* `okcupid_codebook.txt`:  codebook for all variables
* `profiles.csv.zip`: CSV file of profile data (unzip this first)

Note:

* Permission to use this data set was explicitly granted by OkCupid.
* Usernames are not included.
* `JSE.Rnw` Sweave document was compiled using the `knitr` package. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.  

## Preview

### Distribution of Male and Female Heights

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=6}
profiles <- read.csv(file="profiles.csv", header=TRUE, stringsAsFactors=FALSE)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache}
profiles.subset <- 
  filter(profiles, height>=55 & height <=80) %>%
  mutate(
    sex = ifelse(sex == "m", "male", sex),
    sex = ifelse(sex == "f", "female", sex)
  )
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height in inches", data=profiles.subset)
```


### Joint Distribution of Sex and Sexual Orientation

A mosaicplot of the cross-classification of the `r nrow(profiles)` users' sex and sexual orientation:  

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=5}
sex.by.orientation <- tally(~sex + orientation, data=profiles)
mosaicplot(sex.by.orientation, main="Sex vs Orientation", las=1)
```


### Logistic Regression to Predict Gender

Linear regression (in red) and logistic regression (in blue) compared.  Note both the x-axis (height) and y-axis (is female: 1 if user is female, 0 if user is male) have random jitter added to better visualize the number of points involved for each (height x gender) pair.

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=3, fig.width=6}
profiles <- filter(profiles, height>=55 & height <=80)
set.seed(76)
profiles <- sample_n(profiles, 5995)
profiles <- mutate(profiles, is.female = ifelse(sex=="f", 1, 0))

# Linear Regression
linear.model <- lm(is.female ~ height, data=profiles)
b1 <- coef(linear.model)

# Logistic Regression
logistic.model <- glm(is.female ~ height, family=binomial, data=profiles)
b2 <- coefficients(logistic.model)

inverse.logit <- function(x, b){
  linear.equation <- b[1] + b[2]*x
  1/(1+exp(-linear.equation))
}
ggplot(data=profiles, aes(x=height, y=is.female)) +
  geom_jitter(position = position_jitter(width = .2, height=.17)) +
  scale_y_continuous(breaks=0:1) +
  theme(panel.grid.minor.y = element_blank()) +
  xlab("Height in inches") +
  ylab("Is female?") +
  geom_abline(intercept=b1[1], slope=b1[2], col="red", size=2) +
  stat_function(fun = inverse.logit, args=list(b=b2), color="blue", size=2)
```

Fitted probabilities p-hat of each user being female along witha decision threshold (in red) used to predict if user is female or not.  

```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=4}
profiles$p.hat <- fitted(logistic.model)
ggplot(data=profiles, aes(x=p.hat)) +
  geom_histogram(binwidth=0.1) +
  xlab(expression(hat(p))) +
  ylab("Frequency") +
  xlim(c(0,1)) +
  geom_vline(xintercept=0.5, col="red", size=1.2)
```