budhrajaankita/News-movie-Recommendation-Search-Engine

Information retrieval based search engine
(Ankita Budhraja)
 

Type of Project: Implementation

Project Overview

This project draws on class concepts and readings on information retrieval, web crawling, search engines, text analysis with
TF-IDF and BM25, cosine similarity, etc.

Intrigued by the ElasticSearch engine in the assignment 'Search Ranking and Evaluation', I developed a similar search engine in Python
for a news-article dataset and a movie dataset. Besides keyword search that ranks documents by their similarity to the query, I also developed
a recommendation system for news articles and movies.

The program can perform the following tasks:
a. Searching for news articles that mention given keywords, e.g. stock prices, kolkata india, october november.
b. Looking up movies similar to a given movie as recommendations, e.g. input the movie name “Jumanji” and get the top 10 similar movies
as output.
c. Looking up news articles similar to a given headline, e.g. take the headline “Edtech’s failure is Indian education
sector's curse to bear” as input and get the top 5 article recommendations.

I also wanted to learn more about web crawling and scraping, so I tested the recommendation system on 2 datasets:
one created by scraping news articles from moneycontrol.com, and an annotated “movies” dataset from Kaggle.
(https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/ [License CC0: Public Domain])
 
Technical Specifications

(Implementation on Google Colab)
https://colab.research.google.com/drive/1cngza8xW9_Ae1_9hGp3vbfEgHBKF0CN3?usp=sharing

Github Link
https://github.com/budhrajaankita/202/blob/main/INFO202.ipynb

The following libraries are used in the implementation:

Requests and BeautifulSoup for web scraping
Pandas for creating and manipulating datasets
Nltk, Sklearn, and BM25Okapi for text processing, similarity, and ranking
Google.colab as the implementation platform

Step: Web Scraping

Web scraping has two parts: a crawler and a scraper. The crawler is an algorithm that browses the web by following links to find
the required data. The scraper is a tool built to extract data from a specific website.
In this project, moneycontrol.com is crawled first to collect news-related links, and the articles at those links
(spread across multiple pages) are then scraped. The maximum number of pages per topic is currently set to 3 because of RAM limits.
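
The crawl step described above can be sketched roughly as follows. This is an illustration only, not the notebook's code: the project uses Requests and BeautifulSoup against the live site, whereas this sketch uses the standard-library HTML parser on hard-coded sample pages so it runs offline. The names LinkCollector, collect_article_links, and sample_pages, and the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

MAX_PAGES = 3  # page cap per topic, as described in the text


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href=...> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def collect_article_links(pages, base_url, max_pages=MAX_PAGES):
    """Crawl step: gather unique links from at most max_pages listing pages."""
    seen = []
    for html in pages[:max_pages]:
        parser = LinkCollector(base_url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.append(link)
    return seen


# Hypothetical listing pages standing in for downloaded HTML.
sample_pages = [
    '<a href="/news/a1.html">A1</a><a href="/news/a2.html">A2</a>',
    '<a href="/news/a2.html">A2</a><a href="/news/a3.html">A3</a>',
    '<a href="/news/a4.html">A4</a>',
    '<a href="/news/a5.html">A5</a>',  # ignored: beyond the page cap
]
links = collect_article_links(sample_pages, "https://example.com")
print(links)
```

The scrape step would then download each collected link and extract the article text from it.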

Step: Creating the Dataset, Transforming and Tokenizing Words
Articles from all collected links are read, up to a maximum of 3 pages per topic, and saved as a dataframe (also exported as mnc.csv).
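
A minimal sketch of this step, assuming a simple regex tokenizer and a tiny stopword list. The project itself uses Pandas and Nltk; this sketch uses only the standard library, and the column names (headline, text, tokens) and sample records are hypothetical.

```python
import csv
import io
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


# Hypothetical scraped records; the column names are assumptions.
articles = [
    {"headline": "Stock prices rise", "text": "Stock prices rose in the October rally."},
    {"headline": "Kolkata rains", "text": "Heavy rain is expected in Kolkata, India."},
]
for row in articles:
    row["tokens"] = " ".join(tokenize(row["text"]))

# Write to an in-memory buffer; a real run would open "mnc.csv" instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["headline", "text", "tokens"])
writer.writeheader()
writer.writerows(articles)
print(buffer.getvalue())
```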

Step: Calculating TF-IDF for all words/tokens
Term frequency-inverse document frequency (TF-IDF) is used in information retrieval and machine learning
to quantify the importance or relevance of a word in a document within a collection of documents (here, articles).
An inverted index matrix was also built at this step.
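
To make the weighting concrete, here is a small pure-Python sketch of an inverted index and TF-IDF vectors. It is not the project's code, which relies on sklearn. The smoothed IDF formula, ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, mirrors the default behaviour of sklearn's TfidfVectorizer; the sample documents and the function name tfidf_vector are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenized documents.
docs = [
    ["stock", "prices", "rise"],
    ["stock", "market", "falls"],
    ["kolkata", "rain", "forecast"],
]

# Inverted index: term -> set of ids of the documents containing it.
inverted_index = defaultdict(set)
for doc_id, tokens in enumerate(docs):
    for term in tokens:
        inverted_index[term].add(doc_id)


def tfidf_vector(tokens, index, n_docs):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        df = len(index.get(term, ()))
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


vectors = [tfidf_vector(d, inverted_index, len(docs)) for d in docs]
print(vectors[0])
```

In the first document, "prices" gets a higher weight than "stock" because "stock" also appears in the second document and is therefore less distinctive.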

Step: Cosine Similarity
Using the TF-IDF vectorizer, the cosine similarity between every pair of articles in the dataframe is calculated and the result is displayed.
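
As a rough illustration of the pairwise computation, the sketch below builds a full similarity matrix in pure Python. It is an assumption-laden stand-in for the sklearn call used in the project: raw term counts replace TF-IDF weights to keep the example short, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Raw term counts stand in for TF-IDF weights to keep the example short.
docs = ["stock prices rise", "stock prices fall", "rain in kolkata"]
vectors = [Counter(d.split()) for d in docs]

# similarity[i][j] is the cosine similarity between documents i and j.
similarity = [[cosine_similarity(a, b) for b in vectors] for a in vectors]
for row in similarity:
    print([round(s, 3) for s in row])
```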

Step: Linear Kernel
The linear kernel of each article against every other article is calculated. With about 1000 articles, the linear kernel took less time
than cosine similarity. However, because the TF-IDF functionality in sklearn.feature_extraction.text produces normalized vectors,
cosine_similarity is equivalent to linear_kernel in this case, and the cosine similarity calculation was not always slower.
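
The equivalence can be checked with a few lines of pure Python. This is only a sketch of the underlying arithmetic, not sklearn's implementation: once both vectors have unit length, the norms in the cosine denominator are 1, so the plain dot product gives the same value. The sample vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def linear_kernel(a, b):
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of both norms."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return linear_kernel(a, b) / (norm_a * norm_b)


a = l2_normalize([3.0, 1.0, 0.0, 2.0])
b = l2_normalize([1.0, 0.0, 4.0, 2.0])

# Both values agree up to floating-point rounding.
print(linear_kernel(a, b), cosine_similarity(a, b))
```

The linear kernel skips the two norm computations, which is one plausible reason it can run faster on normalized input.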

Step: News Article Recommendation based on headline as input
A recommendation system that takes an existing headline as input and outputs the 5 most similar articles based on the cosine similarity matrix.
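
One way such a lookup could work is sketched below, assuming the similarity matrix has already been computed. The function name recommend, the sample headlines, and the hand-written matrix are all hypothetical; the notebook's actual function may differ.

```python
def recommend(headline, headlines, similarity, top_n=5):
    """Return the top_n headlines most similar to an existing headline."""
    idx = headlines.index(headline)  # raises ValueError for unknown headlines
    ranked = sorted(
        (i for i in range(len(headlines)) if i != idx),  # skip the query itself
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [headlines[i] for i in ranked[:top_n]]


# Hypothetical headlines and a hand-written symmetric similarity matrix.
headlines = ["Edtech funding slows", "Edtech layoffs grow", "Gold prices dip", "Rupee weakens"]
similarity = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.5],
    [0.2, 0.1, 0.5, 1.0],
]
print(recommend("Edtech funding slows", headlines, similarity, top_n=2))
```

The query headline is excluded from its own results, since its similarity to itself is always the highest.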

BM25 Search: Getting the results with the highest BM25 scores.
- Create a new column (text_new) in the dataframe by tokenizing each article's text and applying the WordNet lemmatizer.
- Ask the user for a search query as input.
- Calculate BM25 document scores from the tokenized search query and the text_new column.
- Output only results with scores > 0, and exact matches.
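
The scoring and filtering steps above can be sketched as follows. This is a pure-Python illustration of Okapi BM25, not the BM25Okapi class the project uses: it assumes the common parameters k1 = 1.5 and b = 0.75 and a non-negative IDF variant, so exact scores may differ from the library's. Lemmatization is skipped, and the sample corpus and the function name bm25_scores are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in corpus against the query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in corpus for term in set(d))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


# Hypothetical tokenized articles (the text_new column in the project).
corpus = [
    "stock prices rose sharply".split(),
    "kolkata india weather update".split(),
    "stock market closed lower".split(),
]
query = "stock prices".split()
scores = bm25_scores(query, corpus)

# Keep only documents with a positive score, best first.
hits = sorted(((s, i) for i, s in enumerate(scores) if s > 0), reverse=True)
print(hits)
```

The document matching both query terms ranks first, the one matching only "stock" ranks second, and the unrelated document is filtered out by the score > 0 rule.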

get_top_n: Using get_top_n from the BM25 module with user input
The new column and the tokenized query are used to search for articles. This returned results similar to those of the previously implemented function.

Evaluation: The same model was tested on another annotated dataset from Kaggle to assess its efficiency.
Dataset taken from https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/ [License CC0: Public Domain]
Movie Dataset Recommendation

Results were as expected. 
