Does your data collection rescan and overwrite? #39

@Zoher15

Hey @ArthurHeitmann,

Thank you so much for putting together and sharing all of this Reddit data. I really appreciate the effort that goes into maintaining it.

I’m currently working on a research project about Reddit moderation. Up until now, I was using the Pushshift dumps from 2005-06 through 2023-02, focusing on moderator comments and their corresponding parent comments (the moderated comments).

However, I ran into a significant issue. Because of rescanning and overwriting, many controversial comments ended up being replaced with “[removed]” or “[deleted].” It seems that even though Pushshift was designed to capture data quickly, often within seconds, the later rescans about 24 hours afterward would overwrite the original content and erase valuable moderated discussions.

Before I invest time into working with your data dumps, could you clarify whether they follow a similar rescanning or overwriting process? Knowing this would help me determine whether your data dumps avoid the same issue.
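For anyone who wants to check a sample for this issue, here is a minimal sketch of how the overwritten comments could be counted. It assumes the comments are available as newline-delimited JSON with a `body` field; the function name and the sample records are hypothetical and only for illustration, and real dumps would need to be decompressed first.

```python
import json
from collections import Counter

# Placeholder bodies that replace the original text after removal or deletion.
PLACEHOLDERS = {"[removed]", "[deleted]"}


def count_placeholder_bodies(lines):
    """Count comments whose body is a placeholder, given NDJSON lines.

    Assumes each line is one JSON comment object with a "body" field.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        body = comment.get("body", "")
        counts["total"] += 1
        if body in PLACEHOLDERS:
            counts[body] += 1
    return counts


# Hypothetical sample records, not taken from any real dump.
sample = [
    '{"id": "a1", "body": "original text"}',
    '{"id": "a2", "body": "[removed]"}',
    '{"id": "a3", "body": "[deleted]"}',
]
result = count_placeholder_bodies(sample)
print(dict(result))
```

Comparing these counts for the same months across two data sources would give a rough indication of whether one of them preserves more of the original comment text.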

Thanks again for your time and for making this resource available.
