The Press Information Bureau (PIB) automated feedback system uses web crawlers to create a dataset of news articles, Optical Character Recognition (OCR) technology to extract content from e-papers, and a public Application Programming Interface (API) to analyze YouTube videos. It then utilizes advanced Natural Language Processing (NLP) techniques to classify news articles into relevant government departments and evaluate their sentiment.
The system's primary functions are to send timely notifications for negative articles and to provide a user-friendly dashboard for data visualization. Additionally, a separate Chrome extension performs real-time fake news detection.
Data Acquisition
- Asynchronous Web Scraping: Used the BeautifulSoup library together with the asynchronous libraries aiohttp and asyncio to efficiently scrape articles from various national and regional media websites.
- Text Extraction & Language Translation: Implemented Google's Tesseract OCR engine (via the Pytesseract wrapper) to extract text from scanned or image-based regional newspaper articles, and integrated the Google Translate API to render the extracted text in English, enabling cross-language analysis.
- Video Content Breakdown: Leveraged the OpenAI Whisper API to transcribe and analyze YouTube videos from selected news channels, extending media monitoring to video content.
🗂️ Processed data is automatically stored as JSON with well-defined key-value pairs, making it straightforward to consume from the frontend and from other applications.
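The scraping-and-storage flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler: the URLs, the `<h1>`/`<p>` selectors, and the `articles.json` output path are all hypothetical placeholders, and real news sites generally need per-site selectors.

```python
import asyncio
import json

from bs4 import BeautifulSoup


def extract_article(html: str) -> dict:
    # Pull the headline and body text out of a downloaded page.
    # The <h1>/<p> selectors are generic placeholders; real sites
    # usually need per-site selectors.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return {"title": title.get_text(strip=True) if title else "", "body": body}


async def fetch(session, url: str) -> dict:
    # Download one page and parse it; a single ClientSession is shared
    # across all requests for connection pooling.
    async with session.get(url) as resp:
        return {"url": url, **extract_article(await resp.text())}


async def crawl(urls: list[str]) -> list[dict]:
    # aiohttp is imported lazily so the offline parsing helper above
    # works even without the dependency installed.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        return list(await asyncio.gather(*(fetch(session, u) for u in urls)))


if __name__ == "__main__":
    # Live crawl (requires aiohttp and reachable sites):
    #   articles = asyncio.run(crawl(["https://example.com/news/1"]))
    # Offline demo of the parsing and JSON-storage steps:
    demo = extract_article("<h1>PM opens metro line</h1><p>Inaugurated today.</p>")
    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump([demo], f, ensure_ascii=False, indent=2)
```

Running many `fetch` coroutines under `asyncio.gather` lets slow sites overlap rather than serialize, which is the point of using aiohttp over plain `requests` here.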
Data Analysis
- Department Categorization: Developed a machine learning model using the Support Vector Machine (SVM) algorithm, combined with NLP techniques such as text lemmatization and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, to classify news articles into the relevant government departments. The model achieved a test accuracy of ~95%.
- Sentiment Analysis: Trained a Bidirectional Encoder Representations from Transformers (BERT) model in the PyTorch framework on a dataset of articles labeled positive, neutral, or negative. The model achieved a test accuracy of ~81%.
📊 The Matplotlib library is used to automatically generate graphs that visualize the relationship between government departments and the sentiment expressed in news articles, making it easier to spot trends, patterns, and areas of concern.
Data Presentation
- Cross-Platform User Interface: Designed a website using React and Bootstrap, integrating Python's smtplib and the Twilio API to send real-time notifications to government officials about negative articles, improving their ability to monitor and respond proactively.
- Chrome Extension: Built a separate Chrome extension that performs real-time fake news detection while the user browses.
📦 Hosted the website on GoDaddy, with the frontend sending POST requests via the Axios library and the Flask backend processing them securely using the Flask-CORS extension.
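The email half of the notification flow can be sketched with Python's standard-library smtplib, as below. The addresses, SMTP host, and article fields are placeholders for illustration; the Twilio SMS side follows the same pattern with Twilio's own client:

```python
import smtplib
from email.message import EmailMessage


def build_alert(article: dict) -> EmailMessage:
    # Compose a plain-text alert for a negative article.
    # Addresses here are illustrative placeholders.
    msg = EmailMessage()
    msg["Subject"] = f"Negative coverage: {article['title']}"
    msg["From"] = "alerts@example.com"
    msg["To"] = "official@example.gov"
    msg.set_content(
        f"Department: {article['department']}\n"
        f"Sentiment: {article['sentiment']}\n"
        f"Link: {article['url']}\n"
    )
    return msg


def send_alert(msg: EmailMessage, host: str = "smtp.example.com") -> None:
    # Actual delivery: requires a reachable SMTP server and, in
    # practice, STARTTLS plus authentication.
    with smtplib.SMTP(host) as server:
        server.send_message(msg)


if __name__ == "__main__":
    article = {
        "title": "Train derailment sparks criticism",
        "department": "Railways",
        "sentiment": "negative",
        "url": "https://example.com/news/123",
    }
    msg = build_alert(article)
    print(msg["Subject"])
    # send_alert(msg)  # uncomment with a real SMTP host configured
```

Separating message construction (`build_alert`) from delivery (`send_alert`) keeps the formatting testable without a live mail server.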
Follow these steps to set up and run the Insight Ink software on your local machine, or watch the demo video.
- Clone the repository to your local machine:
git clone https://github.com/areebahmeddd/Insight-Ink.git
- Navigate to the project directory:
cd Insight-Ink
- Create a virtual environment (optional but recommended):
python -m venv .venv
- Activate the virtual environment:
- Windows:
.venv\Scripts\activate
- macOS and Linux:
source .venv/bin/activate
- Install the project dependencies:
pip install -r requirements.txt
npm install
- Run the application and start the development server:
python app.py
npm start
- Access the application in your web browser by navigating to http://localhost:3000
This project is licensed under the Apache License 2.0
Areeb Ahmed, Shivansh Karan, Nandini Sharma, Ravikant Saraf, Mohit Nagaraj