A topic model is a type of statistical model for discovering abstract "topics" that occur in a collection of documents. It provides a general summary of the topics being discussed in the data and the associations between those topics.
I was familiar with traditional topic modeling techniques like Latent Dirichlet Allocation (LDA), but I wanted to explore BERTopic, a more recent and more sophisticated topic modeling algorithm. It addresses certain drawbacks of LDA, such as heavy pre-processing requirements, unstable results and high processing requirements, by leveraging techniques like UMAP, c-TF-IDF and word embeddings.
With the emergence of meme stocks and several online forums for stock trading, it has become necessary for financial institutions to factor in market sentiment from such sources when making investment decisions. Case in point: the GameStop short squeeze, which caused major financial consequences for certain hedge funds and large losses for short sellers.
WallStreetBets on Reddit is one of the many public forums where people discuss such recent market trends and express their sentiments about them. This project focuses on WallStreetBets (responsible for the GME short squeeze mentioned above) to identify popular topics and stocks being discussed in the comments and posts, and to recommend stocks to buy based on that analysis.
- BERTopic
- praw (Python Reddit API Wrapper)
- pmaw (Pushshift Multithread API Wrapper)
- sklearn
- joblib==1.1.0 (needed because of conflicts with BERTopic)
pip install --upgrade joblib==1.1.0
- Scraped ~500k comments & posts from the subreddit r/Wallstreetbets
- Date range: Sept 1, 2022 to Sept 30, 2022
- Scrape data to get top stock tickers from Sept 1, 2022 to Sept 30, 2022
- Scrape data for only those top tickers to perform sentiment analysis
- Scrape Yahoo Finance for actual monthly closing prices
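
As a rough illustration of the "top stock tickers" step, the sketch below counts ticker mentions in a list of comment texts. It is not the project's actual scraper: the `KNOWN_TICKERS` watchlist, the regular expression, and the function names are all assumptions made for this example, and counting each ticker once per comment is only one possible design choice.

```python
import re
from collections import Counter

# Hypothetical watchlist; a real run would load a full exchange listing.
KNOWN_TICKERS = {"GME", "AMC", "TSLA", "AAPL", "SPY"}

# 1-5 uppercase letters as a whole word, optionally prefixed with "$".
TICKER_PATTERN = re.compile(r"\$?\b([A-Z]{1,5})\b")


def count_tickers(texts, known_tickers=KNOWN_TICKERS):
    """Count how many texts mention each known ticker."""
    counts = Counter()
    for text in texts:
        # A set, so that a ticker repeated inside one comment counts once.
        mentioned = {m for m in TICKER_PATTERN.findall(text) if m in known_tickers}
        counts.update(mentioned)
    return counts


def top_tickers(texts, n=10, known_tickers=KNOWN_TICKERS):
    """Return the n most frequently mentioned tickers."""
    return [ticker for ticker, _ in count_tickers(texts, known_tickers).most_common(n)]
```

Filtering against a known list matters here because plain uppercase words such as "I" or "YOLO" would otherwise be counted as tickers.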
- Remove comments with fewer than 10 words so that only well-explained opinions remain
- Only keep content with more than 5 upvotes to weed out irrelevant content
- Remove outlier topics (handled by BERTopic)
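
The word-count and upvote filters above can be sketched as follows. This is a minimal example, assuming each scraped item is stored as a dict with `body` and `score` keys (these mirror Reddit's field names, but the storage format is an assumption, as are the function names).

```python
def keep_comment(body, score, min_words=10, min_upvotes=5):
    """Keep items with at least `min_words` words and more than `min_upvotes` upvotes."""
    return len(body.split()) >= min_words and score > min_upvotes


def filter_comments(comments):
    """Apply both filters to a list of {'body': str, 'score': int} dicts."""
    return [c for c in comments if keep_comment(c["body"], c["score"])]
```

Splitting on whitespace is a crude word count, but it is cheap and good enough for a threshold like this one.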
- Modify VADER's sentiment scores for the lexicon terms we got from topic modelling
- +2 for positive sentiment & -10 for negative sentiment
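
One way to express the lexicon adjustment is to build a dictionary of overrides first. The sketch below assumes that +2 and -10 are the valence values assigned to each term (the text does not say whether they are assigned or added), and the example terms are hypothetical placeholders rather than the terms actually found by the topic model.

```python
POSITIVE_SCORE = 2.0    # from the list above
NEGATIVE_SCORE = -10.0  # from the list above


def build_lexicon_overrides(positive_terms, negative_terms):
    """Map each term (lowercased) to its override valence."""
    overrides = {term.lower(): POSITIVE_SCORE for term in positive_terms}
    # Negative terms are applied last, so they win if a term appears in both lists.
    overrides.update({term.lower(): NEGATIVE_SCORE for term in negative_terms})
    return overrides


# Hypothetical example terms, for illustration only.
overrides = build_lexicon_overrides(["moon", "tendies"], ["bagholder", "rekt"])
```

With the `vaderSentiment` package installed, these overrides could then be applied with `SentimentIntensityAnalyzer().lexicon.update(overrides)`, since the analyzer keeps its lexicon in a plain dictionary keyed by lowercase term.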
- Perform sentiment analysis on the top 10 tickers in windows of 90, 60 and 30 days
- Get actual price data from Yahoo Finance for the same top 10 tickers to compare against our recommendations
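
The windowed analysis can be sketched as an average sentiment score per ticker over each look-back window. This assumes each comment has already been scored (for example with VADER's compound score) and tagged with a ticker and a date. Using the mean as the aggregate is an assumption, as is the function name.

```python
from datetime import date, timedelta
from statistics import mean


def windowed_sentiment(scored, end_date, windows=(90, 60, 30)):
    """Average score per ticker for each look-back window ending at `end_date`.

    `scored` is an iterable of (ticker, day, score) tuples, where `day` is a
    datetime.date. Returns {window_days: {ticker: mean_score}}.
    """
    scored = list(scored)  # allow generators, since the data is scanned once per window
    result = {}
    for days in windows:
        start = end_date - timedelta(days=days)
        by_ticker = {}
        for ticker, day, score in scored:
            if start < day <= end_date:
                by_ticker.setdefault(ticker, []).append(score)
        result[days] = {ticker: mean(scores) for ticker, scores in by_ticker.items()}
    return result
```

Comparing the 30-day average against the 60- and 90-day averages gives a rough sense of whether sentiment for a ticker is improving or fading.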
The Queen of England's death is closely associated with topics related to inflation (the cluster of topics at the center).
- Use an embedding-based sentiment analyzer (e.g., Flair) instead of a heuristic-based technique like VADER. We stopped at VADER since it was giving good results (8 out of 10 recommended stocks were in profit)
- Create a multi-processing scraper using joblib for faster scraping