
Commit 80c7315

blog: Start writing about anomaly

1 parent 4d9fffd

File tree

4 files changed: +98 −8 lines


src/content/blog/2025-12-31-data-collection.md

Lines changed: 0 additions & 8 deletions
This file was deleted.

src/content/blog/anomaly.md.shadow

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
title: "An Anomaly in Reddit's Data"
pubDate: "December 26, 2024"
author: Yash Thakur
description: "Looking into why many comments from around 2019 are inaccessible"
---

We get our data by randomly sampling comments from Reddit. To do this, we randomly generate and request comment IDs (e.g. `t1_d123abc`). Some of these IDs correspond to comments that are deleted or in private subreddits, so we get nothing back for them. Most IDs do give us back comments, but we saw a significant spike in the number of inaccessible comments around 2019/2020.
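
To make the ID format concrete, here's a small illustrative sketch (using the made-up `d123abc` value from the example above) of how a base-36 comment ID relates to the integer we sample and to the `t1_`-prefixed "fullname" that the API expects:

```python
import numpy as np

short_id = "d123abc"  # base-36 comment ID from the example above
as_int = int(short_id, 36)  # integer form, convenient for random sampling
fullname = f"t1_{np.base_repr(as_int, 36).lower()}"  # "t1_" is Reddit's prefix for comments

assert fullname == "t1_d123abc"
```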

We believe it was caused by the banning of a massive number of subreddits in 2020. This blog post will go over how we found the anomaly and what we will be doing to account for it.

## Finding the anomaly

When we started trying out Reddit's API, we simply generated comment IDs between two endpoints, one corresponding to a comment at the beginning of our time range and the other corresponding to a comment at the end of our time range. At the time, we had to find these boundary comments manually, though we now have a script to do it automatically (TODO reference other blog post here).
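
As a rough sketch of how such a script could work (illustrative only, not our actual script; `find_id_near` and its parameters are made up for this example), you can binary-search the ID space for a comment posted near a target time, probing a small batch of consecutive IDs at each step since any individual ID may be inaccessible. It assumes an authenticated PRAW client named `reddit`.

```python
from datetime import datetime, timezone

import numpy as np
# import praw
# reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)  # authenticated client assumed


def find_id_near(target: datetime, lo: int, hi: int, batch: int = 100) -> int:
    """Return an integer comment ID posted close to `target` (UTC).

    `lo` and `hi` are integer comment IDs known to fall before and after `target`.
    """
    while hi - lo > batch:
        mid = (lo + hi) // 2
        # Probe a window of consecutive IDs, since the midpoint itself may be inaccessible
        fullnames = [f"t1_{np.base_repr(i, 36).lower()}" for i in range(mid, mid + batch)]
        probes = list(reddit.info(fullnames=fullnames))
        if not probes:
            batch *= 2  # the whole window was inaccessible; widen it and try again
            continue
        probe = min(probes, key=lambda c: int(c.id, 36))
        posted = datetime.fromtimestamp(probe.created_utc, tz=timezone.utc)
        if posted < target:
            lo = int(probe.id, 36)
        else:
            hi = mid
    return lo
```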

The code below uses this approach to obtain a list of IDs for successfully requested comments and a corresponding list of their timestamps. It also collects the IDs of inaccessible comments.

```python
from datetime import datetime, timezone

import numpy as np
# `reddit` is an authenticated praw.Reddit instance, e.g.
# reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)

first_id = int("cnas8zz", 36)  # ID of comment from 2015-01-01 00:00:00+00:00
last_id = int("kfrowyo", 36)  # ID of comment from 2024-01-01 00:00:00+00:00

ids: list[int] = []  # IDs of comments we got back (hits)
times: list[datetime] = []  # their creation timestamps
misses: list[int] = []  # IDs that returned nothing

for _ in range(1000):
    # Sample 100 random IDs at a time (int64: these IDs don't fit in 32 bits)
    rand_ids = np.random.randint(first_id, last_id, size=100, dtype=np.int64)
    fullnames = [f"t1_{np.base_repr(rand_id, 36).lower()}" for rand_id in rand_ids]
    possible_misses = set(rand_ids)
    for comment in reddit.info(fullnames=fullnames):
        comment_id = int(comment.id, 36)
        ids.append(comment_id)
        times.append(datetime.fromtimestamp(comment.created_utc, tz=timezone.utc))
        possible_misses.remove(comment_id)
    # Anything Reddit didn't return is a miss
    misses.extend(possible_misses)
```

Let's see the distribution of the timestamps of these comments. We can aggregate in 3-month chunks and then see how many comments we got from each 3-month-long time period.

```python
import polars as pl
import seaborn as sns

df = pl.DataFrame({"id": ids, "Time": times})
df = df.sort(by="Time")
# Bucket the comments into 3-month windows
time_groups = df.group_by_dynamic("Time", every="3mo")
sns.lineplot(
    data=time_groups.agg(Count=pl.len()),
    x="Time",
    y="Count",
)
```

No surprises here. The number of comments is increasing over time, which is to be expected, given that Reddit's popularity has increased over time.

![Number of hits over time period](./img/anomaly-num-hits.png)

But we also wanted to know how many of our requests were failing, because at the time, we weren't sure whether the generate-an-ID-and-hope-it-exists approach was viable. So we calculated a miss rate for each of the 3-month periods created above: the number of misses falling between that period's first and last hit IDs, divided by the total number of comments requested in that period.

```python
with_bounds = time_groups.agg(
    start=pl.col("id").min(),  # smallest hit ID in this 3-month bucket
    end=pl.col("id").max(),  # largest hit ID in this 3-month bucket
    hits=pl.len(),
)

def miss_rate(row):
    _, start_id, end_id, num_hits = row
    num_misses = len([miss for miss in misses if start_id < miss < end_id])
    return num_misses / (num_hits + num_misses) if num_misses else 0.0

sns.lineplot(
    # map_rows produces a single column named "map" holding each row's miss rate
    data=with_bounds.with_columns(with_bounds.map_rows(miss_rate)).rename(
        {"map": "Miss rate"}
    ),
    x="Time",
    y="Miss rate",
)
```

This will slightly undercount the number of misses in each time period, since misses whose IDs fall outside the range spanned by a period's hits aren't attributed to any period. We now have a way to account for that, but as you can see below, the anomaly is apparent even with this imperfect calculation. There is a very noticeable spike in the miss rate around 2019/2020: it starts off well under 0.1, begins climbing at some point in 2018, shoots up to around 0.35, and settles back down around 2021.

![Miss rate over time](./img/anomaly-miss-rate.png)
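
For reference, here is one way such an undercount could be reduced. This is only an illustrative sketch, not necessarily the correction we actually use: since comment IDs are assigned roughly in time order, a missed ID's timestamp can be estimated by interpolating between the (ID, timestamp) pairs of the hits, and the misses can then be bucketed by time just like the hits.

```python
from datetime import datetime, timezone

import numpy as np
import polars as pl

# Comment IDs increase roughly monotonically with time, so estimate each missed
# ID's timestamp by interpolating between the surrounding hits' (ID, time) pairs.
hit_ids = np.array(ids)
hit_epochs = np.array([t.timestamp() for t in times])
order = np.argsort(hit_ids)  # np.interp requires increasing x-coordinates

miss_epochs = np.interp(np.array(misses), hit_ids[order], hit_epochs[order])
miss_df = pl.DataFrame(
    {
        "id": misses,
        "Time": [datetime.fromtimestamp(e, tz=timezone.utc) for e in miss_epochs],
    }
).sort("Time")

# Misses can now be counted per 3-month bucket by estimated time,
# rather than by the ID range of each bucket's hits.
miss_counts = miss_df.group_by_dynamic("Time", every="3mo").agg(Misses=pl.len())
```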

## The Great Ban

A former team member found that in 2020, many subreddits were banned. In fact, around 2,000 subreddits were banned in order to make Reddit a safer, more inclusive space, as discussed in [this paper](https://arxiv.org/abs/2401.11254v1). This decision came after years of investigation by Reddit, who discovered that subreddits dedicated to racism, sexism, anti-Semitism, transphobia, and so on were often filled with racism, sexism, anti-Semitism, transphobia, and so on.

Many of these subreddits were quite popular, such as r/The_Donald ([subreddit stats](https://subredditstats.com/r/the_donald)). When these subreddits were banned, their comments became inaccessible through Reddit's API, which explains why our miss rate was so much higher around 2019-2020. (TODO: find out whether Reddit continued heavier moderation after 2020. That would help explain why the miss rate is still high in early 2021.)

As for why the miss rate only spiked around 2019-2020 despite many of the banned subreddits existing well before then, this was probably because they were... (TODO: write this)

## Dealing with the anomaly

TODO: write this

(Two binary image files were also added in this commit: 27.5 KB and 22.1 KB.)
