-
Notifications
You must be signed in to change notification settings - Fork 24
/
Copy pathdiscontinuity.qmd
executable file
·190 lines (127 loc) · 9.44 KB
/
discontinuity.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
---
output: html_document
editor_options:
chunk_output_type: console
---
# Slicing data
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "ggplot2","tidyr"))
```
So far, we've assumed that the underlying process that relates the $x$ and $y$ variables is homogeneous across the full range of x-values. This assumption simplifies model fitting strategies. But sometimes, that assumption does not hold as we will see next.
## Introduction
Let's start off by downloading a dataset.
```{r}
df <- read.csv("http://mgimond.github.io/ES218/Data/Sample1.csv")
```
Next, we'll plot the data and fit a straight line.
```{r fig.width=3.5, fig.height=2.5, small.mar=TRUE}
library(ggplot2)
ggplot(df, aes(x = x,y = y)) + geom_point(alpha = 0.3) +
stat_smooth(method = "lm", se = FALSE)
```
The line seems to do a decent job in depicting the *overall* trend but the relationship does not appear perfectly linear. Let's check the residuals via a residual-dependence plot.
```{r fig.width=3.5, fig.height=2.5, small.mar=TRUE}
M1 <- lm( y ~ x, df)
df$residuals <- residuals(M1)
ggplot(df, aes(x = x, y = residuals)) + geom_point(alpha = 0.3) +
stat_smooth(method = "loess", se = FALSE, span = 0.2,
method.args = list(degree = 1))
```
There appears to be a dip in residual values between ~95 and ~107 followed by an increase beyond ~107. The *kinks* in the residuals seem to delineate three perfectly straight line segment suggesting that the raw data may be modeled using three distinct lines (i.e. with differing slopes and intercepts). Note the use of a small loess span in the above code to reveal the kinks in the pattern.
Sometimes, the data may represent outcomes from different processes given different ranges in independent (x) values. Since the residual plot seems to suggest that the kinks occur are `x=95` and `x=106`, we'll split the data into three groups: less than 95, between 95 and 106 and greater than 106. These groups will be labeled `1`, `2`, and `3` and will be assigned to a new column called `grp`. This new column will then be used to facet the scatter plots and associated loess curve. We'll use of the `case_when()` function to perform this task.
```{r fig.width=6, fig.height=2.5}
library(dplyr)
df2 <- df %>% mutate(grp = case_when(x < 95 ~ 1,
x < 106 ~ 2,
x >= 106 ~ 3))
ggplot(df2, aes(x = x,y = y)) + geom_point(size = 0.5, alpha = 0.3) +
stat_smooth(method = "loess", se = FALSE) + facet_grid(. ~ grp)
```
The segmented plots seem to confirm our earlier suspicion that the data followed three separate linear processes. We can extract the slope and intercept for each segment using the following chunk of code:
```{r}
library(tidyr)
df2 %>%
group_by(grp) %>%
do( M1 = (lm(y ~ x, . ) ) ) %>%
mutate(intercept = coef(M1)[1],
slope = coef(M1)[2]) %>%
select(-M1)
```
So, three separate models were needed to capture, what we believed to be, three different underlying processes.
Exploring the residuals using the loess fit was pivotal in identifying the breaks in the pattern. But, as we will learn in the next section, these breaks may not always be so obvious in a dataset.
## Case study: Trends in daily temperature extremes
> Disclaimer: the analysis presented here is only exploratory and does not mirror the complete analysis conducted by Vincent *et al.* nor the one conducted by Stone.
### Original analysis
The following data are pulled from the paper titled "*Observed Trends in Indices of Daily Temperature Extremes in South America 1960-2000*" (Vincent et al., 2005) and represent the percentage of nights with temperatures greater than or colder than the 90^th^ and 10^th^ percentiles respectively within each year. The percentiles were calculated for the 1961 to 2000 period.
```{r message=FALSE}
library(tidyr)
Year <- 1960:2000
PerC <- c(11.69,9.33,14.35,10.73,14.15,11.16,13,12.13,14.25,10.01,11.94,14.35,
10.83,9.38,11.5,10.44,12.66,7.55,9.77,9.81,8.9,8.51,7.02,6.83,9.67,
7.84,7.11,8.56,10.59,7.93,8.85,8.8,8.75,8.18,7.16,9.91,10.15,6.58,
6.44,9.43,8.03)
PerH <- c(8.62,10.1,6.67,11.13,5.71,9.48,7.63,8.12,7.2,9.64,8.42,5.71,11.72,
11.32,7.2,7.17,7.46,13.17,9.28,8.75,12.38,10,13.83,17.59,10.14,
9.84,11.23,14.39,9.44,8.26,12.15,12.45,13.14,13.67,15.22,11.79,11.16,
20.37,17.56,11.13,11.49)
df2 <- data.frame(Year, PerC, PerH)
df2.l <- pivot_longer(df2, names_to = "Temp", values_to = "Percent", -Year)
```
Let's plot the data and fit a straight line to the points.
```{r fig.width=4.5, fig.height=2.5}
ggplot(df2.l, aes(x = Year, y = Percent)) + geom_point() +
stat_smooth(method = "lm", se = FALSE) + facet_wrap(~ Temp, nrow = 1)
```
The plot on the left shows percent cold nights and the one on the right shows percent hot nights. At first glance, the trends seem real and monotonic.
Next we'll fit a loess to see if the trends are indeed monotonic. To minimize the undue influence of end values in the plot, we'll implement loess' bisquare estimation method via the `family=symmetric` option. We'll also use a small span to help identify any "kinks" in the patterns.
```{r fig.width=4.5, fig.height=2.5}
ggplot(df2.l, aes(x = Year, y = Percent)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, span = 0.5,
method.args = list(degree = 1, family = "symmetric")) +
facet_wrap(~ Temp, nrow = 1)
```
The patterns seem to be segmented around the 1975-1980 period for both plots suggesting that the observed trends may not be monotonic. In fact, there appears to be a prominent kink in the percent cold data around the mid to late 1970`s. A similar, but not as prominent kink can also be observed in the percent hot data at around the same time period.
### Changepoint
In a comment to *Vincent et al.*'s paper, R.J. Stone argues that the trend observed in the percent hot and cold dataset is not monotonic but segmented instead. In other words, there is an abrupt change in patterns for both datasets that make it *seem* as though a monotonic trend exists when in fact the data may follow relatively flat patterns for two different segments of time. He notes that the abrupt change (which he refers to as a **changepoint**) occurs around the 1976 and 1977 period. He suggests that this time period coincides with a change in the Pacific Decadal Oscillation (PDO) pattern. PDO refers to an ocean/atmospheric weather pattern that spans multiple decades and that is believed to impact global climate.
The following chunk of code loads the PDO data, then summarizes the data by year before plotting the resulting dataset.
```{r fig.width=5, fig.height=2.5}
df3 <- read.table("http://mgimond.github.io/ES218/Data/PDO.dat",
header = TRUE, na.strings = "-9999")
pdo <- df3 %>%
pivot_longer(names_to = "Month", values_to = "PDO", -YEAR) %>%
group_by(YEAR) %>%
summarise(PDO = median(PDO) )
ggplot(pdo, aes(x = YEAR, y = PDO)) + geom_line() +
stat_smooth(se = FALSE, span = 0.25) +
geom_vline(xintercept = c(1960, 1976, 2000), lty = 3)
```
The contrast in PDO indexes between the 1960-1976 period and the 1976-2000 period is obvious with the pre-1977 index values appearing to remain relatively flat over a 15 year period and with the post-1977 index appearing to show a gradual increase towards a peak around the early 1990's.
To see if distinct patterns emerge from the percent hot and cold data before and after 1976, we'll split the data into two segments using a cutoff year of 1976-1977. Values associated with a period prior to 1977 will be assigned a `seg` value of `Before` and those associated with a post-1977 period will be assigned a `seg` value of `After`.
```{r}
df2.l$seg <- ifelse(df2.l$Year < 1977, "Before", "After")
```
Next, we'll plot the data across four facets:
```{r fig.width=5, fig.height=4}
ggplot(df2.l, aes(x = Year, y = Percent)) + geom_point() +
stat_smooth(method = "lm", se = FALSE) + facet_grid(seg ~ Temp)
```
We can also choose to map `seg` to the color aesthetics which will split the points by color with the added benefit of fitting two separate models to each batch.
```{r fig.width=5, fig.height=2.5}
ggplot(df2.l, aes(x = Year, y = Percent, col = seg)) + geom_point() +
stat_smooth(method = "lm", se=FALSE) + facet_wrap( ~ Temp, nrow = 1)
```
To check for "straightness" in the fits, we'll fit a loess to the points.
```{r fig.width=5, fig.height=2.5}
ggplot(df2.l, aes(x = Year, y = Percent, col = seg)) + geom_point() +
stat_smooth(method = "loess", se = FALSE, method.args = list(degree = 1)) +
facet_wrap( ~ Temp, nrow = 1)
```
There is a clear "stair-step" pattern for the percent cold nights. However, there seems to be an upward trend in the percent of hot nights for the post-1977 period which could imply that in addition to the PDO effect, another process could be at play.
This little exercise highlights the ease in which an analysis can follow different (and seemingly sound) paths. It also serves as a reminder of the importance of domain knowledge when exploring data.
## Reference
**Original paper**: Vincent, L. A., et al., 2005. *Observed trends in indices of daily temperature extremes in South America 1960–2000.* J. Climate, 18, 5011–5023.
**Comment to the paper** Stone, R. J., 2011. *Comments on "Observed trends in indices of daily temperature extremes in South America 1960–2000."* J. Climate, 24, 2880–2883.
**The reply to the comment** Vincent, L. A., et al., 2011. *Reply*, J. Climate, 24, 2884-2887.