forked from rudeboybert/JSE_OkCupid
-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
118 lines (95 loc) · 4.24 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
---
title: "OkCupid Profile Data for Intro Stats and Data Science Courses"
author: "Albert Y. Kim and Adriana Escobedo-Land"
output: github_document
references:
- id: Kim2015
title: OkCupid Profile Data for Introductory Statistics and Data Science Courses
author:
- family: Kim
given: Albert Y. Kim
- family: Escobedo-Land
given: Adriana
ISSN: 1069-1898
volume: 23
URL: 'http://www.amstat.org/publications/jse/v23n2/kim.pdf'
publisher: American Statistical Association
issue: 2
type: article-journal
issued:
year: 2015
month: 7
---
```{r, echo=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages(library(mosaic))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(knitr))
```
Data and code for [OkCupid Profile Data for Introductory Statistics and Data Science Courses](http://www.amstat.org/publications/jse/v23n2/kim.pdf) (Journal of Statistics Education July 2015, [Volume 23, Number 2](http://www.amstat.org/publications/jse/contents_2015.html)).
* `JSE.bib`: bibliography file
* `JSE.pdf`: PDF of document
* `JSE.Rnw`: R Sweave document to recreate `JSE.pdf`.
* `JSE.R`: R code used in document
* `okcupid_codebook.txt`: codebook for all variables
* `profiles.csv.zip`: CSV file of profile data (unzip this first)
Note:
* Permission to use this data set was explicitly granted by OkCupid.
* Usernames are not included.
* `JSE.Rnw` Sweave document was compiled using the `knitr` package. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.
## Preview
### Distribution of Male and Female Heights
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=6}
profiles <- read.csv(file="profiles.csv", header=TRUE, stringsAsFactors=FALSE)
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache}
profiles.subset <-
filter(profiles, height>=55 & height <=80) %>%
mutate(
sex = ifelse(sex == "m", "male", sex),
sex = ifelse(sex == "f", "female", sex)
)
histogram(~height | sex, width=1, layout=c(1,2), xlab="Height in inches", data=profiles.subset)
```
### Joint Distribution of Sex and Sexual Orientation
A mosaicplot of the cross-classification of the `r nrow(profiles)` users' sex and sexual orientation:
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=5}
sex.by.orientation <- tally(~sex + orientation, data=profiles)
mosaicplot(sex.by.orientation, main="Sex vs Orientation", las=1)
```
### Logistic Regression to Predict Gender
Linear regression (in red) and logistic regression (in blue) compared. Note both the x-axis (height) and y-axis (is female: 1 if user is female, 0 if user is male) have random jitter added to better visualize the number of points involved for each (height x gender) pair.
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=3, fig.width=6}
profiles <- filter(profiles, height>=55 & height <=80)
set.seed(76)
profiles <- sample_n(profiles, 5995)
profiles <- mutate(profiles, is.female = ifelse(sex=="f", 1, 0))
# Linear Regression
linear.model <- lm(is.female ~ height, data=profiles)
b1 <- coef(linear.model)
# Logistic Regression
logistic.model <- glm(is.female ~ height, family=binomial, data=profiles)
b2 <- coefficients(logistic.model)
inverse.logit <- function(x, b){
linear.equation <- b[1] + b[2]*x
1/(1+exp(-linear.equation))
}
ggplot(data=profiles, aes(x=height, y=is.female)) +
geom_jitter(position = position_jitter(width = .2, height=.17)) +
scale_y_continuous(breaks=0:1) +
theme(panel.grid.minor.y = element_blank()) +
xlab("Height in inches") +
ylab("Is female?") +
geom_abline(intercept=b1[1], slope=b1[2], col="red", size=2) +
stat_function(fun = inverse.logit, args=list(b=b2), color="blue", size=2)
```
Fitted probabilities p-hat of each user being female along witha decision threshold (in red) used to predict if user is female or not.
```{r, echo=FALSE, message=FALSE, warning=FALSE, cache=TRUE, fig.height=4, fig.width=4}
profiles$p.hat <- fitted(logistic.model)
ggplot(data=profiles, aes(x=p.hat)) +
geom_histogram(binwidth=0.1) +
xlab(expression(hat(p))) +
ylab("Frequency") +
xlim(c(0,1)) +
geom_vline(xintercept=0.5, col="red", size=1.2)
```