myNextdoor is a program that sends users a daily email recommending posts from their Nextdoor.com feed according to their individual preferences. It learns these preferences by first scanning historical posts and noting whether the user has interacted with them, and then by using this as a training dataset for supervised learning models that predict whether a new post is of interest to the user based on text and quantitative factors.
- Parse both the daily and historical Nextdoor.com feeds using Selenium web automation
- Preprocess the retrieved text and quantitative data into a form suitable for our models
- Implement common supervised learning algorithms from scratch and train them on historical data to predict whether a new out-of-sample post is relevant to our user's interests
- Notify our user of new posts via automated SMTP emails and store these recommendations in a SQLite database (sketched below)
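As an illustration of that last step, here is a minimal Python sketch of the email-and-store flow. The SMTP host, credentials, and table schema below are placeholders, not the project's actual configuration:

```python
import smtplib
import sqlite3
from email.message import EmailMessage

def notify_and_store(recipient, recommendations):
    """Email recommended posts to the user and record them in SQLite.

    recommendations: list of (post_id, text) pairs.
    """
    # Compose the daily digest email.
    msg = EmailMessage()
    msg["Subject"] = "Your Nextdoor recommendations"
    msg["From"] = "bot@example.com"  # placeholder sender
    msg["To"] = recipient
    msg.set_content("\n\n".join(text for _, text in recommendations))

    # Send over SMTP (host, port, and credentials are placeholders).
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("bot@example.com", "app-password")
        server.send_message(msg)

    # Persist the recommendations (schema is illustrative).
    con = sqlite3.connect("recommendations.db")
    con.execute("CREATE TABLE IF NOT EXISTS recommendations"
                " (post_id TEXT PRIMARY KEY, text TEXT)")
    con.executemany("INSERT OR IGNORE INTO recommendations VALUES (?, ?)",
                    recommendations)
    con.commit()
    con.close()
```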
myNextdoor is a personal research project and is not intended for commercial or unethical use. The program can only access neighborhoods on Nextdoor.com that its user belongs to (valid credentials must be entered in `settings.config`), and it is the user's responsibility to treat all such restricted data with care and with respect for the privacy of their neighbors. No data has been published in this repository, and all references to findings in this README.md file have been modified to respect the privacy of those belonging to the original creator's neighborhood.
- Framework: Selenium WebDriver
- Programming Languages: Java, Python
- Email Delivery: SMTP
- Database: SQLite3
- Model Serialization: Pickle
- Process Communication: JSON, subprocess
To generate insights, we abstract every Nextdoor post into the following fields:
- Text body (e.g. "Isn't today a beautiful day!")
- Author (e.g. John Smith)
- Hometown of author (e.g. Dallas, Fort Worth)
- How long ago it was posted (e.g. "1 hour ago", "7 days ago")
- Number of reactions it has received
- Number of comments it has received
We wish to analyze these qualitative and quantitative factors to predict whether a post with a given set of fields is "important" or "non-important" to our user, according to their individual preferences.
To learn these preferences, we must first compile a training data set of posts, with each post labelled as either important or non-important. Ideally, these labels would come from the user themselves, but to approximate this, we implemented a parser to look back through historical posts and attach to each a boolean label capturing whether or not our user either 1) "reacted" to it (i.e. liked, supported, etc.) or 2) left a comment on it. In doing so, we assume that if our user interacts with a post in any of these ways, then they find the post "important", and that we should recommend similar posts to them in the future.
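In code terms, the labelling rule reduces to a simple predicate; a minimal Python sketch (the actual parser is written in Java):

```python
def label_post(reacted, commented):
    # A post counts as "important" if the user interacted with it in any way.
    return reacted or commented
```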
With our Java parser, we build a labelled training dataset of ids, posts, and their "observed" importances. We save it as a `.json` file, which might look something like this:
"285667624": {
"Interacted": false,
"Author": "Steven Smith",
"NumReactions": 1,
"Text": "Looking for beginner tennis lessons",
"Age": "1 day ago",
"NumComments": 0,
"Location": "Neverland"
},
"285642325": {
"Interacted": false,
"Author": "Gilbert Miranda",
"NumReactions": 33,
"Text": "Does anyone know why the road off the highway is under construction?",
"Age": "3 hours ago",
"NumComments": 126,
"Location": "Narnia"
},
"285658580": {
"Interacted": true,
"Author": "Madison Lunter",
"NumReactions": 1,
"Text": "How should I train my dog?",
"Age": "1 day ago",
"NumComments": 1,
"Location": "New York City"
},
With this file saved, we can then work on implementing some supervised learning algorithms to make predictions!
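When the training scripts run, they can read this file back into parallel feature and label structures. A minimal Python sketch (the file name `training_data.json` is a placeholder):

```python
import json

# Load the labelled training set produced by the parser.
with open("training_data.json") as f:
    posts = json.load(f)

texts, labels = [], []
for post_id, fields in posts.items():
    texts.append(fields["Text"])
    labels.append(fields["Interacted"])  # True means "important"
```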
At a high level, we must generate insights from both 1) our text data and 2) our non-text data, which may be either qualitative (e.g. author) or quantitative (e.g. number of likes). For analyzing text, I chose to implement a Naive Bayes model, while for non-text data, I implemented a Logistic Regression model after first converting all data points into quantitative measures.
A Naive Bayes Classifier relies on Bayes' formula to estimate the conditional probability of an outcome (i.e. whether a post is important) given particular observed features (i.e. a frequency vector counting the number of times keywords appear in that post). Mathematically, for a post with a vector of word frequencies $x = (x_1, \ldots, x_n)$, we estimate

$$P(y \mid x) \propto P(y) \prod_{i=1}^{n} P(w_i \mid y)^{x_i}$$

where $P(y)$ is the prior probability of output class $y$ and $P(w_i \mid y)$ is the likelihood of keyword $w_i$ appearing in a post of class $y$. Using our training data set, we can calculate the values of these prior probabilities for each output class (i.e. important vs. non-important), and the values of our feature likelihoods for each keyword ever observed. We assume each post's keyword frequencies follow a multinomial distribution.
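A from-scratch classifier along these lines might look like the following Python sketch. This is a simplified illustration with Laplace smoothing, not the project's actual `naive_bayes.py`:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over keyword frequencies (illustrative)."""

    def fit(self, texts, labels):
        # Assumes both classes (True/False) appear in the training labels.
        self.classes = (True, False)
        self.word_counts = {c: Counter() for c in self.classes}
        self.total_words = {c: 0 for c in self.classes}
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.total_words[label] += 1
                self.vocab.add(word)
        # Prior probabilities P(y), stored as logs for numerical stability.
        self.log_priors = {c: math.log(labels.count(c) / len(labels))
                           for c in self.classes}

    def predict(self, text):
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            score = self.log_priors[c]
            for word in text.lower().split():
                # Laplace-smoothed likelihood P(w | y).
                likelihood = (self.word_counts[c][word] + 1) / (
                    self.total_words[c] + len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```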
Meanwhile, we use a Logistic Regression model to generate similar boolean predictions for each post, this time looking at the non-text parameters. This is similar to a Linear Regression algorithm, which computes an output level given a set of numerical inputs. However, a key distinction (which makes it suitable for this classification problem) is that a logistic regression "squishes" our output range to be between 0 and 1 for any input vector $x$ by passing the linear combination of inputs through the sigmoid function:

$$\sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

The goal of training our model is to compute the coefficient vector $w$ and bias term $b$ that minimize our prediction error over the training data.
Two steps were required to achieve this. First, for the regression approach to make any sense, we had to devise a way to convert all of our unordered qualitative inputs into ordered quantitative inputs prior to training the model. We did this via one-hot encoding, which adds features to our input set by replacing categorical levels with linearly independent unit vectors. Second, to compute the coefficient vector and bias, we used a simple gradient descent implementation to optimize these parameters iteratively.
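A minimal sketch of those two steps in Python follows; it is simplified relative to the project's `logistic_regression.py`, and the learning rate and epoch count are arbitrary:

```python
import math

def one_hot(value, levels):
    # Replace a categorical level with a linearly independent unit vector.
    return [1.0 if value == level else 0.0 for level in levels]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=1000):
    """Fit weights w and bias b by gradient descent on cross-entropy loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = pred - (1.0 if y else 0.0)  # d(loss)/dz for cross-entropy
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b
```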
The scripts `naive_bayes.py` and `logistic_regression.py` train these models and serialize them as pickle files, which are then loaded by `main.py` when it runs daily; it takes the union of the two models' predictions (recommending a post if either model deems it important) to generate out-of-sample predictions.
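Concretely, the daily step might look like the following Python sketch; the pickle file names and method names here are illustrative, as is the union rule inferred from the description above:

```python
import pickle

# Load the serialized models produced by the two training scripts.
with open("naive_bayes.pkl", "rb") as f:
    nb_model = pickle.load(f)
with open("logistic_regression.pkl", "rb") as f:
    lr_model = pickle.load(f)

def is_important(post):
    # Union of the two models' predictions: recommend the post
    # if either model classifies it as important.
    return nb_model.predict(post["Text"]) or lr_model.predict(post)
```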