# Writing Code with AI
*Written by Emily Nordmann*
The best AI platform for writing code is arguably GitHub Copilot (or Super Clippy, as my programmer friends like to call it). However, unless coding is a main part of your job you're unlikely to have this subscription service, so we'll stick with the generic platforms.
If you have access to LinkedIn Learning (which you do if you are a UofG student or staff member), I'd also highly recommend the short course [Pair Programming with AI](https://www.linkedin.com/learning/pair-programming-with-ai/) by Morten Rand-Hendriksen. It only takes 1.5 hours to work through, he covers using both GitHub Copilot and ChatGPT, and he takes a nicely critical view of AI.
If you've worked through this entire book then hopefully what you've learned is that AI is very useful, but I also hope that you have a healthy mistrust of anything it produces, which is why this chapter is the last in the coding section.
The number 1 rule to keep in mind when using AI to write code is that the platforms we're using **are not intelligent**. They are not thinking. They are not reasoning. They are not telling the truth or lying to you. They don't have gaps in their knowledge because they have no knowledge. They are digital parrots on cocaine.
## Activity 1: Set-up
Again we'll use the `palmerpenguins` dataset for this exercise, so open up an Rmd and run the following:
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(palmerpenguins)
data("penguins")
```
We'll also do a little bit of set-up for the AI:
> Act like an expert programmer in R. I want you to help me write code. The code should be commented and should use the tidyverse where possible. Ask me questions about the code before you write it if necessary.
## Activity 2: Knowledge is power
When you input this starting prompt, there's a good chance you'll get something like the following:
```{r img-write1, echo=FALSE, fig.cap="More info needed"}
knitr::include_graphics("include/images/06-writing/write1.png")
```
Before you ask the AI to write code, it's helpful to give it as much information as you can about your dataset. You could write out a description manually, but there are a few options to automate this.
`summary()` is useful because it provides a list of all variables with some descriptive statistics so that the AI has a sense of the type and range of data:
```{r}
summary(penguins)
```
`str()` is also useful because it lists the variables, their data type, and the initial values for each variable. However, that means you are giving it at least some of the raw data, so you have to be very careful if you have sensitive or confidential data, and you must ensure that any use of AI is in line with your data management plan. Using Copilot Enterprise means the data won't be stored and used to train the AI further, so it's potentially the best option (which is not to say it's safe or problem-free; please be careful and critical!).
```{r}
str(penguins)
```
Finally, `ls()` provides a list of all the variables in a given object. It doesn't provide any info on the variable type or a sample of values, but that does mean it's the most secure option and, depending on the task, this might be all the info you really need to give the AI. I would suggest starting with `ls()` and only scaling up if necessary (and your data isn't sensitive):
```{r}
ls(penguins)
```
In your AI of choice, input the following:
> here are the variables in my data set
> ls(penguins)
[1] "bill_depth_mm" "bill_length_mm" "body_mass_g" "flipper_length_mm"
[5] "island" "sex" "species" "year"
Without being asked, Copilot has provided some examples of how you might analyse this data using R and the code all works as intended.
```{r}
# Load the tidyverse package
library(tidyverse)

# Assuming your data is in a dataframe called 'penguins'
# Let's start by getting a summary of each numerical variable
penguins %>%
  summarise(
    Average_Bill_Depth = mean(bill_depth_mm, na.rm = TRUE),
    Average_Bill_Length = mean(bill_length_mm, na.rm = TRUE),
    Average_Body_Mass = mean(body_mass_g, na.rm = TRUE),
    Average_Flipper_Length = mean(flipper_length_mm, na.rm = TRUE)
  )

# Now, let's see the distribution of species across different islands
penguins %>%
  count(island, species) %>%
  ggplot(aes(x = island, y = n, fill = species)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Distribution of Penguin Species Across Islands",
       x = "Island",
       y = "Count")

# We can also explore the relationship between flipper length and body mass
penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm") +
  labs(title = "Relationship Between Flipper Length and Body Mass",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)")

# If you want to perform more specific analyses or visualizations,
# please let me know what you're interested in!
```
I find this hugely problematic. Psychology has spent the last decade dealing with a replication and reproducibility crisis stemming in part from questionable research practices. As a researcher, you should be making informed decisions about how you analyse your data, and I fear that this type of output is going to increase fishing expeditions and atheoretical, exploratory-as-confirmatory nonsense. What happens when the example code the AI spits out without being asked turns out to be a significant regression model that you would never have predicted or run yourself? Are you going to delete it? Or convince yourself that you were going to run it anyway and there's a perfectly logical explanation?
Before I have a full-blown existential crisis, let's get back on track.
## Activity 3: Writing code
First, we'll aim to produce a table of descriptive statistics for `body_mass_g` and a histogram. Because it has the exact variable and object names, it should write code we can easily copy and paste:
> Create a table of descriptive statistics for body mass and a histogram of that variable
This is ChatGPT's output (Copilot was very similar):
```{r}
# Load the tidyverse package
library(tidyverse)

# Load the penguins dataset (assuming it's from the palmerpenguins package)
data("penguins")

# Check the structure of the dataset
str(penguins)

# Summary statistics for body mass
body_mass_summary <- penguins %>%
  summarise(
    Mean = mean(body_mass_g, na.rm = TRUE),
    Median = median(body_mass_g, na.rm = TRUE),
    SD = sd(body_mass_g, na.rm = TRUE),
    Min = min(body_mass_g, na.rm = TRUE),
    Max = max(body_mass_g, na.rm = TRUE)
  )

# Print the summary statistics
print(body_mass_summary)

# Create a histogram of body mass
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 20) +
  labs(
    title = "Distribution of Body Mass",
    x = "Body Mass (g)",
    y = "Frequency"
  ) +
  theme_minimal()
```
Some notes:
* I didn't have to use the exact variable name `body_mass_g`; it has figured out from my request which variable I want to use.
* It's also figured out that we're using the `palmerpenguins` dataset, although it hasn't given us the code to load that package, so if we copied and pasted this into a fresh script without having loaded the data already, it wouldn't work.
* It's used `na.rm = TRUE` to ignore missing values. This is a very common choice but remember **it is a choice**. You might not want this.
* Similarly, it's specified `bins = 20` for the histogram. Looking at the data this seems like a reasonable number of bins to use, but again, it is a decision the AI has imposed and you must be aware of what you're accepting (there's a sketch of making both of these choices explicit after this list).
* I am aware that I am using phrases like "choice" and "decision" that imbue the AI with consciousness. I know that's not true, I just can't figure out how to write it any other way.
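To make that concrete, here's a sketch of making those two decisions yourself rather than accepting the AI's defaults. The specific filtering approach and binwidth (250 g) are illustrative choices, not recommendations:

```{r}
# Handle missing values deliberately and set the bin width yourself
# (binwidth = 250 g is an arbitrary illustrative value)
penguins %>%
  filter(!is.na(body_mass_g)) %>%
  ggplot(aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Body Mass",
       x = "Body Mass (g)",
       y = "Frequency")
```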
ChatGPT and Copilot both worked well. Gemini, on the other hand... had a bit of a moment. First, it told me that it couldn't complete the request, so I amended the variable name, thinking that perhaps it needed the exact name. It then produced a table rather than giving me the code. The numbers in this table look like plausible values for this dataset (although they don't match exactly), so I suspect it has information about the `palmerpenguins` dataset.
```{r img-write2, echo=FALSE, fig.cap="Gemini missing the mark"}
knitr::include_graphics("include/images/06-writing/write2.png")
```
The issue is that I didn't tell it to write me code; I told it to create a table, which it did. Still, it's interesting that both Copilot and ChatGPT understood the intent of my prompt and Gemini did not.
I then edited my prompt:
> Write code to create a table of descriptive statistics for body_mass_g and a histogram of that variable
And it responded by giving me Python code. Again, not technically wrong given the prompt, but it does mean that it ignored the rest of the conversation and the initial set-up prompt. Following up to ask specifically for R code worked, but it's not a good start for Gemini.
```{r img-write3, echo=FALSE, fig.cap="Gemini going rogue"}
knitr::include_graphics("include/images/06-writing/write3.png")
```
## Activity 4: Giving the AI more freedom
My initial request to write code was relatively simple and relatively well-specified, but let's try an example where there's more room for the AI to make choices.
> Create a new variable that codes each penguin as chonky_boi if they're a large male, chonky_gurl if they're a large female, smol_boi if they're a small male, and smol_gurl if they're a small female.
This prompt gives a lot of leeway. I haven't specified what to call the new variable, what criteria should be used to determine large and small, or what to do with observations that are neither.
ChatGPT used percentiles to determine the category. However, this code didn't work as intended; it returned every observation as "average":
```{r}
# Create a new variable based on conditions
penguins <- penguins %>%
  mutate(chonkiness = case_when(
    sex == "Male" & body_mass_g > quantile(body_mass_g, 0.75, na.rm = TRUE) ~ "chonky_boi",
    sex == "Female" & body_mass_g > quantile(body_mass_g, 0.75, na.rm = TRUE) ~ "chonky_gurl",
    sex == "Male" & body_mass_g <= quantile(body_mass_g, 0.25, na.rm = TRUE) ~ "smol_boi",
    sex == "Female" & body_mass_g <= quantile(body_mass_g, 0.25, na.rm = TRUE) ~ "smol_gurl",
    TRUE ~ "average"
  ))
```
> In this code:
> We use the case_when() function to define the conditions based on sex and body mass to categorize the penguins.
> For males with a body mass greater than the 75th percentile, they are categorized as "chonky_boi." For females with a body mass greater than the 75th percentile, they are categorized as "chonky_gurl."
> For males with a body mass less than or equal to the 25th percentile, they are categorized as "smol_boi." For females with a body mass less than or equal to the 25th percentile, they are also categorized as "smol_gurl."
> All other penguins are categorized as "average."
> We then use mutate() to create the new variable "chonkiness" based on these conditions.
I will spare you the full adventure (my [chat history is here](https://chat.openai.com/share/769637ca-61bd-4ac5-9afb-d21381c5c99e) if you'd like to see it), but after quite a bit of back-and-forth I realised that in the dataset "male" and "female" should both be lowercase, and the code ChatGPT gave me had capitalised them both. **There is no substitute for knowing your data**.
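A quick way to catch this kind of thing yourself is to check how the values are actually coded before running (or trusting) the AI's code; a minimal sanity check using standard base and tidyverse functions might look like this:

```{r}
# Check how sex is actually coded in the data before filtering on it
penguins %>% count(sex)
levels(penguins$sex)
```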
Once I changed this, it worked.
```{r}
# Create a new variable based on conditions
penguins <- penguins %>%
  mutate(chonkiness = case_when(
    sex == "male" & body_mass_g > quantile(body_mass_g, 0.75, na.rm = TRUE) ~ "chonky_boi",
    sex == "female" & body_mass_g > quantile(body_mass_g, 0.75, na.rm = TRUE) ~ "chonky_gurl",
    sex == "male" & body_mass_g <= quantile(body_mass_g, 0.25, na.rm = TRUE) ~ "smol_boi",
    sex == "female" & body_mass_g <= quantile(body_mass_g, 0.25, na.rm = TRUE) ~ "smol_gurl",
    TRUE ~ "average"
  ))
```
Copilot used a median split and didn't try to create a third category, which is a questionable choice because the wording of the prompt really implies there needs to be something that the penguins are larger or smaller than. The code did work out of the box, but remember the Jurassic Park rule: just because you can, doesn't mean you should.
```{r}
# Assuming 'penguins' is your dataframe and 'body_mass_g' is the variable for body mass
# First, we need to define what we consider 'large' and 'small'.
# For this example, let's assume penguins heavier than the median are 'large',
# and those equal or lighter are 'small'.

# Calculate the median body mass
median_body_mass <- median(penguins$body_mass_g, na.rm = TRUE)

# Create the new variable 'size_category'
penguins <- penguins %>%
  mutate(size_category = case_when(
    body_mass_g > median_body_mass & sex == "male" ~ "chonky_boi",
    body_mass_g > median_body_mass & sex == "female" ~ "chonky_gurl",
    body_mass_g <= median_body_mass & sex == "male" ~ "smol_boi",
    body_mass_g <= median_body_mass & sex == "female" ~ "smol_gurl",
    TRUE ~ NA_character_ # for missing or other values
  ))
```
This code is a good example of why LLMs are inherently Luddites, even if they produce working code. When there's missing data, `case_when()` used to require you to specify the type of `NA`, in this case `NA_character_`. This is very unintuitive, particularly when you're trying to teach beginner coders ("how can missing data have a type?" is an entirely reasonable question), so I was delighted when the function was updated to remove this requirement. Additionally, the `TRUE ~` catch-all can now be written as a `.default` argument, which is a bit more readable.
However, these changes are not reflected in Copilot's code suggestion because the training data is older than the most recent update to `case_when()`. This example is relatively minor but it's important to recognise that if you only use these kinds of AI to write code, you will very quickly be writing outdated code and using deprecated functions.
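For reference, here's roughly what the same step looks like with the newer syntax (a sketch, assuming dplyr 1.1.0 or later):

```{r}
# Updated case_when() style: a plain NA is fine and .default replaces the TRUE ~ catch-all
penguins <- penguins %>%
  mutate(size_category = case_when(
    body_mass_g > median_body_mass & sex == "male" ~ "chonky_boi",
    body_mass_g > median_body_mass & sex == "female" ~ "chonky_gurl",
    body_mass_g <= median_body_mass & sex == "male" ~ "smol_boi",
    body_mass_g <= median_body_mass & sex == "female" ~ "smol_gurl",
    .default = NA
  ))
```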
Gemini once again told me it could not complete my request, and I have given up using it for the remainder of this chapter.
## Activity 5: Further analysis
To give one last example, I entered the following prompt:
> write me code to run a linear regression predicting body mass from sex. also include the code to perform necessary assumption tests using the easystats performance package.
```{r error = TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(easystats)

# Load the penguins dataset
data("penguins")

# Filter out rows with missing values in body mass or sex
penguins <- penguins %>%
  filter(!is.na(body_mass_g) & !is.na(sex))

# Convert sex to lowercase
penguins$sex <- tolower(penguins$sex)

# Run linear regression predicting body mass from sex
lm_model <- lm(body_mass_g ~ sex, data = penguins)

# Display summary of the linear regression model
summary(lm_model)

# Perform assumption tests using the performance package
check_assumptions(lm_model)
```
ChatGPT refuses to forget that earlier in my conversation history there was an issue with "male" and "female" needing to be lowercase, so it's now including code to convert to lowercase, even though it's completely unnecessary here.
It's also introduced an error. The function should be `check_model()` not `check_assumptions()`. I know this because I know the package. The further away you get from what you could do yourself, the harder it will be to fix any errors and check what it has gotten wrong.
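For completeness, the corrected final step is a one-word change:

```{r warning=FALSE, message=FALSE}
# check_model() from the performance package produces the diagnostic plots for the model
check_model(lm_model)
```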
## Conclusions and musings
A large part of coding is looking up examples of code you've used before and then tweaking them to your new use case. If you use AI like this, then it can really help speed you up by giving you access to boilerplate code you can adapt. I also find it very useful for those functions that just refuse to stick in my brain: every time I try to change the labels on a faceted `ggplot()` I get it wrong. These days I just go straight to ChatGPT.
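For what it's worth, here's the kind of thing I mean; a sketch of relabelling facets with `labeller()` (the label text is just illustrative):

```{r}
# Relabel the facets of a faceted histogram with labeller()
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  facet_wrap(~ species,
             labeller = labeller(species = c(
               "Adelie" = "Adelie penguins",
               "Chinstrap" = "Chinstrap penguins",
               "Gentoo" = "Gentoo penguins"
             )))
```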
Additionally, if what you're wanting to achieve is relatively simple - descriptive stats, basic plots, simple wrangling and cleaning - it can't go that far wrong, and it's usually easy enough to spot when something is off.
These use cases work well because it's your expertise that drives what the AI does, and reviewing and tweaking the output is built into these workflows. These are the AI skills that everyone is going to need.
But using code written by an AI that you don't understand is dangerous and irresponsible. Think of all the stupid errors it has made throughout this book, and then imagine someone using it to write code for their medical research. I dread to think of the impact that this abdication of responsibility and expertise is going to have (and will already be having).
This is a depressing way to end this book, so I gave all three AIs the prompt:
> Write me a fun piece of R code.
ChatGPT reassuring me comedy isn't going to be outsourced any time soon:
```{r}
# Function to generate a random joke
generate_joke <- function() {
  jokes <- c(
    "Why don't scientists trust atoms? Because they make up everything!",
    "I told my wife she was drawing her eyebrows too high. She looked surprised!",
    "Why did the scarecrow win an award? Because he was outstanding in his field!",
    "Why don't skeletons fight each other? They don't have the guts!",
    "I'm reading a book on anti-gravity. It's impossible to put down!"
  )
  return(sample(jokes, 1))
}

# Generate and print a random joke
cat("Here's a random joke for you:\n")
cat(generate_joke(), "\n")
```
Gemini reassuring me art isn't going to be outsourced any time soon:
```{r}
# Define a list of ASCII art characters
art_chars <- c("#", "%", "@", "&", "$", "*")

# Function to generate a random character
get_art_char <- function() {
  return(sample(art_chars, 1))
}

# Set picture dimensions (width and height)
picture_width <- 50
picture_height <- 20

# Create an empty matrix to store the picture
art_matrix <- matrix(nrow = picture_height, ncol = picture_width)

# Fill the matrix with random characters
for (i in 1:picture_height) {
  for (j in 1:picture_width) {
    art_matrix[i, j] <- get_art_char()
  }
}

# Convert the matrix to a string for printing
art_string <- apply(art_matrix, 1, FUN = paste, collapse = "")
art_string <- paste(art_string, collapse = "\n")

# Print the ASCII art picture
cat("Your random ASCII art:", art_string, sep = "\n")
```
Copilot on its way to Vegas (ok I actually quite like this one):
```{r}
# Simple Slot Machine in R
set.seed(Sys.time()) # Seed for randomness based on current time

# Function to spin the slot machine
spin_slot_machine <- function() {
  fruits <- c("🍒", "🍋", "🍊", "🍉", "🍇", "🍓")
  spin <- sample(fruits, size = 3, replace = TRUE)
  cat("Spinning... You got:", spin, "\n")
  if (length(unique(spin)) == 1) {
    cat("Congratulations! You won! 🎉\n")
  } else {
    cat("Try again! 😢\n")
  }
}

# Spin the slot machine
spin_slot_machine()
```
::: {.callout-caution}
This book was written in Spring 2024 and should be considered a **living document**. The functionality and capability of AI is changing rapidly, and what is described in this book may not reflect the most recent advances. Given the brave new world in which we now live, all constructive feedback and suggestions are welcome! Please provide them [via Forms](https://forms.office.com/Pages/ResponsePage.aspx?id=KVxybjp2UE-B8i4lTwEzyKAHhjrab3lLkx60RR1iKjNUM0VCMUUxUUFLMTdNM0JTS09PSDg2SFQ3US4u).
:::