This is my third-year project, also called a "dissertation", completed for credit at the University of Warwick. The project and its associated code are designed to identify communities of discussion within a given subreddit. They capture sets of posts, comments and their engagement data through the Python Reddit API Wrapper (PRAW) and compile them into graph-theory networks, on which community detection algorithms can then be run to separate clusters of communities. These communities are then analysed using data analytics methods to determine their subject matter, users, keywords, engagement metrics (custom-built), measures of spread and other key metrics. The goal is to produce a novel and useful social analytics tool that can accurately capture the internal divisions of subject matter within a subreddit, a subject known as intra-subreddit dynamics (IRSD), as opposed to the more ubiquitous capture of relationships between subreddits, known as inter-subreddit dynamics (IESD).
_An example screenshot of one of the post graphs developed for this project._
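To make the pipeline described above concrete, here is a minimal sketch, not the project's actual code: PRAW fetches a sample of posts and comments, networkx assembles them into a graph, and a standard community detection algorithm separates the clusters. The credentials, subreddit name and edge rule (linking each commenter to the post's author) are all illustrative assumptions.

```python
# Minimal sketch of the general pipeline (illustrative, not the project's code).
import praw
import networkx as nx
from networkx.algorithms import community

# Hypothetical credentials; PRAW requires a registered Reddit application.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="irsd-demo by u/your_username",
)

G = nx.Graph()
for submission in reddit.subreddit("AskScience").hot(limit=25):  # subreddit is illustrative
    submission.comments.replace_more(limit=0)  # flatten "load more comments" stubs
    for comment in submission.comments.list():
        if comment.author and submission.author:
            # Edge rule is an assumption: connect each commenter to the poster.
            G.add_edge(str(comment.author), str(submission.author))

# One possible detection method: greedy modularity maximisation.
communities = community.greedy_modularity_communities(G)
for i, c in enumerate(communities):
    print(f"Community {i}: {len(c)} users")
```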
I chose Reddit for a variety of reasons, the most prominent being my own substantial personal experience with the site and my interest in exploring the modularity within the subreddit structure. The second most important reason is that, unlike most social media sites, which build communities around individuals, Reddit builds individuals around communities. This makes it much more amenable to a graph-theory approach: there is a limited amount of information to gather around one person, but a practically infinite amount to gather around a subject matter or field; one is temporary, the other permanent. In addition, Reddit has an easy-to-use API wrapper, PRAW, which is free for academic purposes and synergises well with Python, a language I am extensively familiar with and a key language for data science, which was one of the core aims of this project.
In addition to the above reasons, I wanted to complete as much work as possible, both to satisfy my project supervisors and to free up time for project documentation, as well as for my own personal fulfilment. I knew that to develop a good project I would need to actually enjoy the experience, even at the cost of computational efficiency, which is why I chose a language I was well versed in and enjoyed. While this resulted in memory overhead issues beyond roughly 67,000 posts, I do not regret the choice.
SQL was chosen because, almost universally across my Computer Science education, it is the only language I have studied for implementing data storage, and it is also simple and easy to work with.
The programming languages used were exclusively Python 3.9 and SQL (specifically SQLite3). SQL does not appear in the repository's "Languages" tab because it is all embedded within the Python code in the form of prepared statements.
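As an illustration of what this embedding looks like, the sketch below runs SQL from Python through sqlite3's parameterised (prepared) statements. The database filename, table and columns are hypothetical, not the project's actual schema.

```python
# Illustrative example of SQL embedded in Python as prepared statements.
import sqlite3

conn = sqlite3.connect("reddit_data.db")  # hypothetical database file
cur = conn.cursor()

cur.execute(
    "CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, title TEXT, score INTEGER)"
)
# The "?" placeholders are bound safely by the sqlite3 driver.
cur.execute(
    "INSERT OR REPLACE INTO posts (id, title, score) VALUES (?, ?, ?)",
    ("abc123", "Example post", 42),
)
conn.commit()

for row in cur.execute("SELECT title, score FROM posts WHERE score > ?", (10,)):
    print(row)
conn.close()
```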
_A selection of some of the Python modules used to work on this project._
Used for matrix computation and data manipulation (alongside Pandas).
Used for storing program data that was retrieved from the SQLite3 databases.
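A brief sketch of how this NumPy/Pandas pairing can work in practice, assuming a hypothetical posts table: rows are read from the SQLite3 database into a DataFrame, then converted to a NumPy matrix for computation.

```python
# Sketch: load stored rows with Pandas, then compute on them with NumPy.
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect("reddit_data.db")  # hypothetical database file
posts = pd.read_sql_query("SELECT id, score, num_comments FROM posts", conn)
conn.close()

# Convert the numeric columns to a matrix, e.g. to normalise engagement
# figures across all captured posts (column names are assumptions).
matrix = posts[["score", "num_comments"]].to_numpy(dtype=float)
normalised = matrix / np.max(matrix, axis=0)
print(normalised[:5])
```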
I chose this because it is free, lightweight and easy to configure; I was looking for minimal overhead in the project to maximise development speed.
The file structure of the project can be visualised as follows:

There are also original test files that I have decided to leave in the repository in case of any future work, although this is unlikely.
