A simple collaborative-filtering recommender for product similarity estimation.
Prerequisites
Built with Python 3.9, but should work with any Python version above 3.6.
pip install virtualenv
virtualenv <name>
source <name>/bin/activate
pip install -r requirements.txt
Please run the steps below in the order given.
DB Setup
DB: Recommendations
- Initial table: UserInstruments
- Other tables: UserItemSimilarity, ItemSimilarity
python model_db.py
Tech used: SQLAlchemy (SQLite)
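The full schema is created by model_db.py. Purely as an illustration, a minimal SQLAlchemy sketch of the initial UserInstruments table might look like the following; the column names are assumptions inferred from the /add/request payload shown later, not the project's actual definitions.

```python
# Hypothetical sketch of what model_db.py sets up; actual columns may differ.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class UserInstrument(Base):
    __tablename__ = "UserInstruments"
    id = Column(Integer, primary_key=True)
    userid = Column(String, index=True)    # e.g. "ABC"
    instrid = Column(Integer, index=True)  # instrument id, e.g. 4
    price = Column(Float)
    timestamp = Column(Integer)            # unix epoch seconds

# SQLite file name is an assumption based on the DB name "Recommendations".
engine = create_engine("sqlite:///Recommendations.db")
Base.metadata.create_all(engine)  # creates the tables if they do not exist
```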
Data Generation
Let the script below run for at least 5 minutes before stopping it, since the model needs 5 minutes of price history.
python tests/generate_data.py
Uses the provided instrument_stream.py to generate input data into the DB and a flat file.
Output: data/ directory
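The provided instrument_stream.py does the actual generation. Purely for illustration, a random-walk quote generator of the same general shape could look like this; the field names mirror the /add/request payload below, and everything else is an assumption.

```python
# Illustrative stand-in for instrument_stream.py, not the project's code:
# emits one random-walk quote per instrument per tick.
import random
import time

def stream_quotes(n_instruments=10, interval_s=1.0):
    prices = {i: random.uniform(10.0, 50.0) for i in range(n_instruments)}
    while True:
        for instrid in prices:
            prices[instrid] *= 1 + random.gauss(0, 0.01)  # ~1% volatility per step
            yield {
                "instrid": instrid,
                "price": round(prices[instrid], 3),
                "timestamp": int(time.time()),
            }
        time.sleep(interval_s)

stream = stream_quotes()
for _ in range(3):
    print(next(stream))  # the real script writes each row to the DB and a flat file
```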
Training
python src/train_similarity_stocks.py
Both approaches are implemented:
- Item-based collaborative filtering: the model is based on cosine distances between items
- User-based collaborative filtering: the model is based on item Euclidean distances
Reason: collaborative filtering with cosine similarity can be held in memory and is fast, since it only requires simple arithmetic operations. To avoid the downside of a sparse matrix, we can select only the required columns. Clustering (eg. KNN) may not be scalable for 100M users hitting the service 10 times daily.
Tech used: scikit-learn, pandas, etc.
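As a minimal sketch of the item-based approach, assuming implicit-feedback interactions loaded from the UserInstruments table (the actual logic lives in src/train_similarity_stocks.py):

```python
# Item-based CF sketch: cosine similarity over an item x user matrix.
# Column names are assumptions; see src/train_similarity_stocks.py.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# One row per (userid, instrid) interaction, e.g. read from the DB.
df = pd.DataFrame({
    "userid":  ["ABC", "ABC", "DEF", "DEF", "GHI"],
    "instrid": [4, 5, 4, 6, 5],
    "rating":  [1, 1, 1, 1, 1],  # implicit feedback: 1 = interacted
})

# Pivot to an item x user matrix (items as rows).
item_user = df.pivot_table(index="instrid", columns="userid",
                           values="rating", fill_value=0)

# Pairwise cosine similarity between items, computed fully in memory.
sim = pd.DataFrame(cosine_similarity(item_user),
                   index=item_user.index, columns=item_user.index)

# Top 10 most similar instruments to instrument 4 (excluding itself).
print(sim[4].drop(4).nlargest(10))
```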
python app.py
This starts the Flask server and prints an IP which you can pass to the endpoints below.
Endpoints:
/add/request: manually add a new user or instrument entry (instead of going through instrument_stream)
/instruments/<int:instrid>: retrieves the last 5 minutes of price history at minute-level resolution
/instruments/similar/<string:userid>: returns 10 recommended stocks for the given userid (user-based)
/instruments/similar/<int:instrid>: returns the 10 most similar stocks for the given instrument id (item-based)
http://<ip>:5000/instruments/similar/<instrid>
http://<ip>:5000/instruments/similar/<user_id>
or using curl
...
Examples:
- Add new user:
curl http://<ip>:5000/add/request --request POST --header "Content-Type: application/json" --data '{"userid": "ABC", "price": "10.273", "timestamp": "1629061628", "instrid": 4}'
Output:
{
"resp": {
"instrid": 4,
"price": "10.273",
"timestamp": "1629061628",
"userid": "ABC"
}
}
- Retrain the model (to get scores for a new user or instrument):
python src/train_similarity_stocks.py
- Price history from the last 5 minutes for an instrument:
curl http://<ip>:5000/instruments/4
Output:
{
"2021-08-16 23:32:00": {
"instrid": 4,
"price_close": 30.720722747311438,
"price_high": 30.95552569216846,
"price_low": 28.130188131072263,
"price_open": 28.130188131072263
},
"2021-08-16 23:33:00": {
"instrid": 4,
"price_close": 32.625519076821135,
"price_high": 33.75415083166951,
"price_low": 30.651110731017894,
"price_open": 30.651110731017894
},
"2021-08-16 23:34:00": {
"instrid": 4,
"price_close": 35.080199941493284,
"price_high": 35.518648971117905,
"price_low": 29.955314672154707,
"price_open": 29.955314672154707
},
"2021-08-16 23:35:00": {
"instrid": 4,
"price_close": 36.45158436457821,
"price_high": 37.229010165569015,
"price_low": 34.065298396096324,
"price_open": 34.60302781982425
},
"2021-08-16 23:36:00": {
"instrid": 4,
"price_close": 41.49435434250442,
"price_high": 41.49435434250442,
"price_low": 34.69979107925532,
"price_open": 34.69979107925532
}
}
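The bars above are standard open/high/low/close aggregates at minute resolution. As a sketch of how such bars can be derived with pandas (column names are assumptions; the endpoint's actual query may differ):

```python
# Sketch of the minute-level OHLC aggregation behind /instruments/<instrid>.
import pandas as pd

# Raw ticks with a unix timestamp and a price (illustrative values).
ticks = pd.DataFrame({
    "timestamp": [1629156720, 1629156745, 1629156790, 1629156815],
    "price":     [28.13, 30.95, 30.65, 33.75],
})
ticks.index = pd.to_datetime(ticks["timestamp"], unit="s")

# Resample raw ticks into 1-minute open/high/low/close bars,
# keeping only the most recent 5 minutes.
ohlc = ticks["price"].resample("1min").ohlc()
print(ohlc.tail(5))
```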
- Item-based similarity
curl http://<ip>:5000/instruments/similar/4
Output:
{
"predicted_items": [
"0",
"5",
"6",
"2",
"7",
]
}
- User-based similarity
curl http://<ip>:5000/instruments/similar/ABC
Output:
{
"predicted_items": [
"4",
"5",
"1",
"7",
"3",
"6",
"8",
"2",
"9",
"0"
]
}
Tech used: Flask
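For reference, a stripped-down sketch of how the item-similarity route could be wired in Flask; the in-memory lookup dict stands in for the ItemSimilarity table and is not the project's actual code:

```python
# Illustrative Flask route for /instruments/similar/<instrid>; the real
# app.py reads precomputed scores from the ItemSimilarity table instead.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the trained similarity lookup.
ITEM_SIMILARITY = {4: ["0", "5", "6", "2", "7"]}

@app.route("/instruments/similar/<int:instrid>")
def similar_instruments(instrid):
    return jsonify({"predicted_items": ITEM_SIMILARITY.get(instrid, [])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```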
How would you change the system to support 50k instruments, each streaming one or more quotes every second?
- Streaming service/Message Broker like Kafka (on-prem) or Kinesis (AWS/Cloud), along with an online Feature store/Cache like Redis, to deliver predictions at low latency
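As a sketch of the caching piece, assuming redis-py and precomputed similarity lists (the key scheme is an assumption):

```python
# Sketch: serve precomputed similarity lists from Redis so the endpoint
# avoids recomputing per request. Key naming here is an assumption.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_similar(instrid, items, ttl_s=300):
    # Write-through after (re)training; the TTL lets stale scores age out.
    r.set(f"similar:item:{instrid}", json.dumps(items), ex=ttl_s)

def get_similar(instrid):
    cached = r.get(f"similar:item:{instrid}")
    return json.loads(cached) if cached else None

cache_similar(4, ["0", "5", "6", "2", "7"])
print(get_similar(4))  # a cache miss would fall back to the DB/model (not shown)
```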
Which changes are needed so your code could support 100M daily active users hitting each endpoint 10 times/day?
- Tech: load balancer & HTTP servers, eg. Nginx. We can create multiple instances of the prediction service, use a load balancer to split traffic, and send each request to the appropriate copy of the service. In practice, there are two common methods:
- Deploying containerized applications/services on auto-scaling clusters such as Kubernetes (on-prem) or ECS (AWS/Cloud)
- Using a serverless option like AWS Lambda (fully managed, but meant for lightweight serving only, not heavy transformations)
What would you put in place to make sure the system and the model are working as expected?
- Test cases: Cross-Validation, A/B Testing, Canary deployment, Verifying labelling, regular training & evaluation sets, Hyper-parameter tuning
- Versioning (Source code & Data), eg. DVC
- Data Quality checks, eg. Great Expectations
- Logging & Monitoring, eg. CloudWatch (AWS), Grafana, Prometheus, ELK
- Deployment, pipelines & Infra: IaC (eg. Terraform), CI/CD (eg. GitHub workflows), Containers (eg. Docker)
How would you evaluate and improve the Similarity model?
- Coverage: the percentage of items in the training data that the model is able to recommend on a test set.
- Personalization: assesses whether the model recommends many of the same items to different users. It is the dissimilarity (1 − cosine similarity) between users' lists of recommendations; a high personalization score indicates that users' recommendations differ, meaning the model is offering a personalized experience to each user (see the sketch after this list).
- Intra-list similarity: the average cosine similarity of all items in a single list of recommendations, computed over features of the recommended items.
- For improving:
- Remove popular items from the training data to reduce noise
- Scale item ratings by the user's value, such as average transaction value; this can help the model learn to recommend items that lead to loyal or high-value customers.
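As a sketch of the personalization metric described above, computed over binarized recommendation lists (the user ids and lists are illustrative):

```python
# Personalization sketch: 1 - mean pairwise cosine similarity between
# users' recommendation lists, binarized over the item catalogue.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

recommendations = {
    "ABC": ["4", "5", "1"],
    "DEF": ["4", "6", "8"],
    "GHI": ["2", "9", "0"],
}

# One binary row per user over the union of all recommended items.
matrix = MultiLabelBinarizer().fit_transform(list(recommendations.values()))

sim = cosine_similarity(matrix)
upper = sim[np.triu_indices_from(sim, k=1)]  # pairwise scores, diagonal excluded
personalization = 1 - upper.mean()
print(f"personalization: {personalization:.3f}")  # higher = more personalized
```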