myNextdoor is a program that sends users a daily email recommending posts from their Nextdoor.com feed according to their individual preferences. It learns these preferences by first scanning historical posts and noting whether the user has interacted with them, and then by using this as a training dataset for supervised learning models that predict whether a new post is of interest to the user based on text and quantitative factors.
- Parse both the daily and historical Nextdoor.com feeds using Selenium web automation
- Preprocess the retrieved text and quantitative data into a form suitable for our models
- Implement common supervised learning algorithms from scratch and train them on historical data to predict whether a new out-of-sample post is relevant to our user's interests
- Notify our user of new posts via automated SMTP emails and store these recommendations in a SQLite database (sketched below)
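As an illustration of that last step, here is a minimal Python sketch of the email-and-store flow. The SMTP host, credentials, and table schema below are placeholders, not the project's actual configuration:

```python
import smtplib
import sqlite3
from email.message import EmailMessage

def notify_and_store(recipient, recommendations):
    """Email recommended posts to the user and record them in SQLite.

    recommendations: list of (post_id, text) pairs.
    """
    # Compose the daily digest email.
    msg = EmailMessage()
    msg["Subject"] = "Your Nextdoor recommendations"
    msg["From"] = "bot@example.com"  # placeholder sender
    msg["To"] = recipient
    msg.set_content("\n\n".join(text for _, text in recommendations))

    # Send over SMTP (host, port, and credentials are placeholders).
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("bot@example.com", "app-password")
        server.send_message(msg)

    # Persist the recommendations (schema is illustrative).
    con = sqlite3.connect("recommendations.db")
    con.execute("CREATE TABLE IF NOT EXISTS recommendations"
                " (post_id TEXT PRIMARY KEY, text TEXT)")
    con.executemany("INSERT OR IGNORE INTO recommendations VALUES (?, ?)",
                    recommendations)
    con.commit()
    con.close()
```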
myNextdoor is a personal research project and is not intended for commercial or unethical use. The program can only access neighborhoods on Nextdoor.com that its user belongs to (valid credentials must be entered in `settings.config`), and it is the user's responsibility to treat all such restricted data with care and with respect for the privacy of their neighbors. No data has been published in this repository, and all references to findings in this README.md file have been modified to respect the privacy of those belonging to the original creator's neighborhood.
- Framework: Selenium WebDriver
- Programming Languages: Java, Python
- Email Delivery: SMTP
- Database: SQLite3
- Model Serialization: Pickle
- Process Communication: JSON, subprocess
To generate insights, we abstract every Nextdoor post into the following fields:
- Text body (e.g. "Isn't today a beautiful day!")
- Author (e.g. John Smith)
- Hometown of author (e.g. Dallas, Fort Worth)
- How long ago it was posted (e.g. "1 hour ago", "7 days ago")
- Number of reactions it has received
- Number of comments it has received
We wish to analyze these qualitative and quantitative factors to predict whether a post with a given set of fields is "important" or "non-important" to our user, according to their individual preferences.
To learn these preferences, we must first compile a training data set of posts, with each post labelled as either important or non-important. Ideally, these labels would come from the user themselves, but to approximate this, we implemented a parser to look back through historical posts and attach to each a boolean label capturing whether or not our user either 1) "reacted" to it (i.e. liked, supported, etc.) or 2) left a comment on it. In doing so, we assume that if our user interacts with a post in any of these ways, then they find the post "important", and that we should recommend similar posts to them in the future.
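In code terms, the labelling rule reduces to a simple predicate; a minimal Python sketch (the actual parser is written in Java):

```python
def label_post(reacted, commented):
    # A post counts as "important" if the user interacted with it in any way.
    return reacted or commented
```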
With our Java parser, we build a labelled training dataset of ids, posts, and their "observed" importances. We save it as a `.json` file, which might look something like this:
"285667624": {
"Interacted": false,
"Author": "Steven Smith",
"NumReactions": 1,
"Text": "Looking for beginner tennis lessons",
"Age": "1 day ago",
"NumComments": 0,
"Location": "Neverland"
},
"285642325": {
"Interacted": false,
"Author": "Gilbert Miranda",
"NumReactions": 33,
"Text": "Does anyone know why the road off the highway is under construction?",
"Age": "3 hours ago",
"NumComments": 126,
"Location": "Narnia"
},
"285658580": {
"Interacted": true,
"Author": "Madison Lunter",
"NumReactions": 1,
"Text": "How should I train my dog?",
"Age": "1 day ago",
"NumComments": 1,
"Location": "New York City"
},
With this file saved, we can then work on implementing some supervised learning algorithms to make predictions!
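When the training scripts run, they can read this file back into parallel feature and label structures. A minimal Python sketch (the file name `training_data.json` is a placeholder):

```python
import json

# Load the labelled training set produced by the parser.
with open("training_data.json") as f:
    posts = json.load(f)

texts, labels = [], []
for post_id, fields in posts.items():
    texts.append(fields["Text"])
    labels.append(fields["Interacted"])  # True means "important"
```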
At a high level, we must generate insights from both 1) our text data and 2) our non-text data, which may be either qualitative (e.g. author) or quantitative (e.g. number of likes). For analyzing text, I chose to implement a Naive Bayes model, while for non-text data, I implemented a Logistic Regression model after first converting all data points into quantitative measures.
A Naive Bayes Classifier relies on Bayes' formula to estimate the conditional probability of an outcome (i.e. whether a post is important) given particular observed features (i.e. a frequency vector counting the number of times keywords appear in that post). Mathematically, for a post with a vector of word frequencies $x = (x_1, \ldots, x_n)$, we estimate

$$P(y \mid x) \propto P(y) \prod_{i=1}^{n} P(w_i \mid y)^{x_i}$$

where $P(y)$ is the prior probability of output class $y$ and $P(w_i \mid y)$ is the likelihood of keyword $w_i$ appearing in a post of class $y$. Using our training data set, we can calculate the values of these prior probabilities for each output class (i.e. important vs. non-important), and the values of our feature likelihoods for each keyword ever observed. We assume each post's keyword frequencies follow a multinomial distribution.
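A from-scratch classifier along these lines might look like the following Python sketch. This is a simplified illustration with Laplace smoothing, not the project's actual `naive_bayes.py`:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over keyword frequencies (illustrative)."""

    def fit(self, texts, labels):
        # Assumes both classes (True/False) appear in the training labels.
        self.classes = (True, False)
        self.word_counts = {c: Counter() for c in self.classes}
        self.total_words = {c: 0 for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.total_words[label] += 1
                self.vocab.add(word)
        # Prior probabilities P(y), stored as logs for numerical stability.
        self.log_priors = {c: math.log(labels.count(c) / len(labels))
                           for c in self.classes}

    def predict(self, text):
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            score = self.log_priors[c]
            for word in text.lower().split():
                # Laplace-smoothed likelihood P(w | y).
                likelihood = (self.word_counts[c][word] + 1) / (
                    self.total_words[c] + len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```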
Meanwhile, we use a Logistic Regression model to generate similar boolean predictions for each post, this time looking at the non-text parameters. This is similar to a Linear Regression algorithm, which computes an output level given a set of numerical inputs. However, a key distinction (which makes it suitable for this classification problem) is that a logistic regression "squishes" our output range to be between 0 and 1 for any input vector $x$ by passing the linear combination of inputs through the sigmoid function:

$$\sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

The goal of training our model is to compute the coefficient vector $w$ and bias term $b$ that minimize our prediction error over the training data.
Two steps were required to achieve this. First, for the regression approach to make any sense, we had to devise a way to convert all of our unordered qualitative inputs into ordered quantitative inputs prior to training the model. We did this via one-hot encoding, which adds features to our input set by replacing categorical levels with linearly independent unit vectors. Second, to compute the coefficient vector and bias, we used a simple gradient descent implementation to optimize these parameters iteratively.
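A minimal sketch of those two steps in Python follows; it is simplified relative to the project's `logistic_regression.py`, and the learning rate and epoch count are arbitrary:

```python
import math

def one_hot(value, levels):
    # Replace a categorical level with a linearly independent unit vector.
    return [1.0 if value == level else 0.0 for level in levels]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=1000):
    """Fit weights w and bias b by gradient descent on cross-entropy loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = pred - (1.0 if y else 0.0)  # d(loss)/dz for cross-entropy
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b
```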
The scripts `naive_bayes.py` and `logistic_regression.py` train these models and serialize them as pickle files, which are then loaded by `main.py` when it runs daily; it takes the union of the two models' predictions (recommending a post if either model deems it important) to generate out-of-sample predictions.
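Concretely, the daily step might look like the following Python sketch; the pickle file names and method names here are illustrative, as is the union rule inferred from the description above:

```python
import pickle

# Load the serialized models produced by the two training scripts.
with open("naive_bayes.pkl", "rb") as f:
    nb_model = pickle.load(f)
with open("logistic_regression.pkl", "rb") as f:
    lr_model = pickle.load(f)

def is_important(post):
    # Union of the two models' predictions: recommend the post
    # if either model classifies it as important.
    return nb_model.predict(post["Text"]) or lr_model.predict(post)
```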