RedditHarbor simplifies collecting Reddit data and saving it to a database. It removes the complexity of working with APIs, letting you easily build a "harbor" of Reddit data for analysis.
Introduction
Social media data from platforms like Reddit contains rich insights into human behaviour and beliefs. However, collecting and storing this data requires dealing with complex APIs.
RedditHarbor streamlines this entire process so you can focus on your research.
In plain language:
Comprehensive API Data Collection: Gather Reddit submissions, comments, and user profiles directly from the official data API.
Privacy-Preserving: Anonymise PII to protect user privacy and meet ethical/IRB standards.
Controlled Data Storage: Store collected data in your own secure database for accessibility and organisation.
Highly Scalable: Handle massive datasets with millions of rows through efficient pagination.
Configurable Collection: Tailor data gathering to your specific needs via adjustable parameters.
Analysis-Ready Exports: Export to CSV, JSON, JPEG for seamless integration with analysis tools.
Temporal Metric Tracking: Regularly update post metrics like scores, upvote ratios, awards over time - unlike static snapshot databases.
Smart Update Intervals: Automatically adjust update frequency based on dataset size for optimised API efficiency.
Minimal coding is required after the initial setup! The tool is designed specifically for researchers with limited coding backgrounds.
Prerequisites
For more detailed step-by-step instructions, see our documentation.
Reddit API: You need a Reddit account to access the Reddit API. Follow Reddit's API guide to register as a developer and create a script app. This will provide the credentials (PUBLIC_KEY and SECRET_KEY) needed to authenticate with Reddit.
Supabase API: Sign up for a Supabase account. Create a new project to get the database URL and SECRET_KEY. You will need these credentials to connect and store the Reddit data.
Getting Started
Installation
Install the RedditHarbor package using pip:
# requires Python 3.9 or higher
pip install redditharbor
pip install redditharbor[pii]
pip install redditharbor[pii] is required to enable anonymising any personally identifiable information (PII).
Setting Up Supabase Tables
We need to create three different types of tables in Supabase to store the user, submission and comments data from Reddit.
For testing purposes, we will name them "test_redditor", "test_submission", and "test_comment". Go to the Supabase Dashboard and open SQL Editor. Click "New Query" to start a new SQL query, and paste this table creation SQL:
This will create the three tables with the required columns and data types. Once created, you will see the new tables now available in the Supabase interface. In the future, you can duplicate these tables and modify the table names for your own production.
The RedditHarbor package depends on predefined column names for the user, submission and comment tables. For it to work properly, create the tables with all the columns specified in the documentation; missing columns may lead to errors or incomplete data retrieval.
Running the Code:
To use the package, first create an empty Python file in your IDE of choice, such as VS Code. Running the code directly in a Jupyter notebook is not recommended, as it may cause errors. To start collecting Reddit data, you first need to configure the authentication:
SUPABASE_URL = "<your-supabase-url>"
SUPABASE_KEY = "<your-supabase-api-key>"  # Remember to use "service_role/secret" key, not "anon/public" key
REDDIT_PUBLIC = "<your-reddit-public-key>"
REDDIT_SECRET = "<your-reddit-secret-key>"
REDDIT_USER_AGENT = "<your-reddit-user-agent>"  # format - <institution:project-name (u/reddit-username)>
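If you would rather not keep secrets in the script itself, one common alternative is to read them from environment variables. The following is a minimal sketch, not part of RedditHarbor: the helper name load_credentials is hypothetical, and it assumes the environment variables share the names of the constants used in this guide.

```python
import os

# Assumption: the environment variables use the same names as the
# constants in this guide. Adjust the tuple to match your own setup.
REQUIRED_VARS = (
    "SUPABASE_URL",
    "SUPABASE_KEY",
    "REDDIT_PUBLIC",
    "REDDIT_SECRET",
    "REDDIT_USER_AGENT",
)


def load_credentials(env=None):
    """Return the required credentials as a dict, or raise if any are missing."""
    # Hypothetical helper, not a RedditHarbor function.
    env = os.environ if env is None else env
    # Treat unset and empty values alike, and report them all at once.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling load_credentials() with no argument reads from the current process environment, and the returned dictionary can be unpacked into the constants above. Failing early with a list of every missing name tends to be easier to debug than an authentication error later on.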
Then define the database table names to store the data:
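As a minimal sketch, assuming the table names are passed as a plain dictionary with one entry per data type, the mapping for the test tables created earlier could look like this. The variable name DB_CONFIG and the keys "user", "submission" and "comment" are assumptions here; check the documentation for the exact names the package expects.

```python
# Assumed structure: one entry per data type, each pointing at the
# Supabase table that stores it. The variable name and the keys are
# illustrative; the table names are the test tables from the setup step.
DB_CONFIG = {
    "user": "test_redditor",
    "submission": "test_submission",
    "comment": "test_comment",
}
```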
Running the collection then gathers the 5 hottest and 5 top submissions from r/python and r/learnpython, along with the associated user data, and stores them in the configured database tables. If you would like to anonymise any PII data, set mask_pii to True.
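To give a rough idea of what masking PII means in practice, here is a standalone sketch that replaces email addresses and Reddit username mentions with placeholder tokens. It is not RedditHarbor's implementation, which this section does not show; the function name mask_text, the regular expressions and the placeholder tokens are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; real PII detection covers many more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
USERNAME_RE = re.compile(r"\bu/[A-Za-z0-9_-]+")  # Reddit-style username mention


def mask_text(text):
    """Replace email addresses and u/username mentions with placeholder tokens."""
    # Hypothetical helper, not a RedditHarbor function.
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = USERNAME_RE.sub("<USER>", text)
    return text
```

A simple pattern-based approach like this misses names, addresses and other free-text identifiers, which is one reason dedicated PII tooling is usually preferred for research data.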