
My third year project repository dedicated to analysing intra-subreddit dynamics using graph theory.


david-git-acc/3YPREPOS


What is this?

This is my third year project, also called a "dissertation", made for credit at the University of Warwick. The project and its associated code are designed to identify communities of discussion within a given subreddit. Using the Python Reddit API Wrapper (PRAW), it captures sets of posts, comments and their engagement data and compiles them into graph theory networks, on which community detection algorithms can then be run to separate clusters of communities. These communities are then analysed using data analytics methods to determine their subject matter, users, keywords, engagement metrics (custom-built), measures of spread and other key metrics. The goal is a novel and useful social analytics tool that can accurately capture the internal divisions of subject matter within a subreddit, a subject known as intra-subreddit dynamics (IRSD), as opposed to the more ubiquitous capture of relationships between subreddits, known as inter-subreddit dynamics (IESD).
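To make the graph-building step more concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the sample data, the rule for linking users (two users are connected if they commented on the same post) and the function names are all assumptions made for illustration. Connected components stand in for a real community detection algorithm, which would also split densely connected groups.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (post_id, author) pairs, standing in for
# comment records that would be collected through PRAW.
comments = [
    ("p1", "alice"), ("p1", "bob"), ("p1", "carol"),
    ("p2", "bob"), ("p2", "carol"),
    ("p3", "dave"), ("p3", "erin"),
]


def build_user_graph(rows):
    """Link two users whenever they commented on the same post.

    The edge weight counts how many posts the two users share.
    This linking rule is an assumption, not the project's definition.
    """
    authors_by_post = defaultdict(set)
    for post_id, author in rows:
        authors_by_post[post_id].add(author)

    graph = defaultdict(dict)
    for authors in authors_by_post.values():
        for a, b in combinations(sorted(authors), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph


def connected_components(graph):
    """Group users into connected components with a depth-first search.

    This is a crude stand-in for community detection: it only separates
    groups that share no edges at all.
    """
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node])
        groups.append(group)
    return groups


graph = build_user_graph(comments)
communities = connected_components(graph)
```

In this toy data, `communities` holds two groups: one for alice, bob and carol, and one for dave and erin. A modularity-based method would be a more realistic choice once the graph is large enough for clusters to overlap.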

_An example screenshot of one of the post graphs developed for this project._

Why Reddit?

I chose Reddit for a variety of reasons, the most prominent being my own substantial personal experience with the site and my interest in developing further modularity within the subreddit structure. The second most important reason is that, unlike most social media sites, which focus on communities around individuals, Reddit focusses on individuals around communities. This makes it much more amenable to a graph theory approach: there is a limited amount of information to gather around one person, but a practically infinite amount to gather around a subject matter or field; one is temporary, the other is permanent. In addition, Reddit has an API that is free for academic purposes and easy to use through PRAW, the Python Reddit API Wrapper. PRAW synergises well with Python, a language I am extensively familiar with and a key language for data science, which was one of the core aims of this project.

Why Python?

In addition to the above reasons, I wanted to get as much work completed as possible, both to satisfy my project supervisors and to leave more time for project documentation, as well as for my own personal fulfilment. I knew that to develop a good project I would need to actually enjoy the experience, even at the cost of computational efficiency. That is why I chose a language I was well versed in and enjoyed. While this caused memory overhead issues beyond about 67,000 posts, I do not regret the choice.

Why SQL?

I chose SQL because it is the only data storage language I have studied and implemented throughout my education in Computer Science, and because it is simple and easy to work with.

What tools were used to create this project?

The project is written exclusively in Python 3.9 and SQL (specifically SQLite3). SQL does not appear in the "Languages" tab because it is all embedded in the Python code in the form of prepared statements.

_A selection of some of the Python modules used to work on this project._

NumPy

Used for matrix computation and data manipulation (alongside Pandas).

Pandas

Used for storing program data that was retrieved from the SQLite3 databases.

SQLite3

I chose this because it is free, lightweight and easy to configure; I was looking for minimum overhead in the project to maximise development speed.
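As a rough illustration of SQL embedded in Python as parameterised statements, the sketch below uses the standard library `sqlite3` module with an in-memory database. The `posts` table, its columns and the `top_posts` helper are hypothetical and are not taken from the project's actual schema.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical table; the project's real schema is not shown here.
conn.execute(
    "CREATE TABLE posts (id TEXT PRIMARY KEY, subreddit TEXT, score INTEGER)"
)

sample_rows = [
    ("p1", "python", 120),
    ("p2", "python", 8),
    ("p3", "warwick", 45),
]

# The ? placeholders keep the SQL text fixed and pass values separately,
# which avoids building queries through string concatenation.
conn.executemany(
    "INSERT INTO posts (id, subreddit, score) VALUES (?, ?, ?)", sample_rows
)
conn.commit()


def top_posts(connection, subreddit, min_score):
    """Return (id, score) rows for one subreddit, highest score first."""
    cursor = connection.execute(
        "SELECT id, score FROM posts "
        "WHERE subreddit = ? AND score >= ? "
        "ORDER BY score DESC",
        (subreddit, min_score),
    )
    return cursor.fetchall()


result = top_posts(conn, "python", 10)
```

Because the SQL lives inside Python string literals like these, a repository written this way would contain no standalone `.sql` files for the "Languages" tab to count.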

File structure and organisation

The file structure of the project can be visualised as follows: image

I have also left the original test files in the repository in case of any future work, although this is unlikely.

The database diagram of the project is given by: image
