
Inshorts-NLP

Scraping

Analysed the syntax and semantics of a corpus of text documents retrieved by web scraping news articles from Inshorts, following the standard NLP workflow of the CRISP-DM model.
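A minimal sketch of how such a scraper might look, built with requests and Beautiful Soup. The seed URLs follow the category pattern this project targets, but the itemprop selectors and column names are assumptions and may need adjusting to Inshorts' current markup (the actual scraping code lives in NLP_main.ipynb).

# Illustrative scraper sketch; the selectors below are assumptions,
# not necessarily the exact ones used in NLP_main.ipynb.
import requests
import pandas as pd
from bs4 import BeautifulSoup

seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world']

def build_dataset(urls):
    rows = []
    for url in urls:
        category = url.rstrip('/').split('/')[-1]
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, 'html.parser')
        # Hypothetical selectors: adjust to the attributes Inshorts
        # actually uses for headlines and article bodies.
        headlines = [t.get_text(strip=True)
                     for t in soup.find_all(attrs={'itemprop': 'headline'})]
        articles = [t.get_text(strip=True)
                    for t in soup.find_all(attrs={'itemprop': 'articleBody'})]
        for headline, article in zip(headlines, articles):
            rows.append({'news_headline': headline,
                         'news_article': article,
                         'news_category': category})
    return pd.DataFrame(rows)

news_df = build_dataset(seed_urls)
news_df.to_csv('news.csv', index=False)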

WorkFlow

Credits

Open Issues Forks Stars

Maintained Made with Python
Open Source Love
Built with Love

📒 Index

🔰 About

An NLP-based project which scrapes news articles from mainly three categories:

  • Technology
  • Sports
  • World

from Inshorts website URLs. Finally, after numerous preprocessing steps such as text wrangling, removing accented characters, removing HTML tags, lemmatization, and stemming, a text normalizer is built to create the dataset for sentiment analysis.
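A condensed sketch of those normalization steps, assuming NLTK and Beautiful Soup; the actual pipeline lives in NLP_main.ipynb and contractions.py, so the function names here are illustrative.

import re
import unicodedata
import nltk
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
stop_words = set(nltk.corpus.stopwords.words('english'))

def strip_html_tags(text):
    # Drop any leftover HTML markup from the scraped article text.
    return BeautifulSoup(text, 'html.parser').get_text()

def remove_accented_chars(text):
    # Map accented characters (e.g. 'é') to their closest ASCII equivalents.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def lemmatize_text(text, lemmatizer=WordNetLemmatizer()):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

def normalize_corpus(docs):
    normalized = []
    for doc in docs:
        doc = strip_html_tags(doc)
        doc = remove_accented_chars(doc)
        doc = remove_special_characters(doc.lower())
        doc = lemmatize_text(doc)
        doc = ' '.join(w for w in doc.split() if w not in stop_words)
        normalized.append(doc)
    return normalized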

Sentiment analysis is perhaps one of the most popular applications of NLP.

The key aspect of sentiment analysis is analysing a body of text to understand the opinion it expresses. Typically, this sentiment is quantified with a positive or negative value, called polarity.
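For instance, TextBlob (one of the lexicon-based tools used later in this project) returns a polarity roughly in the -1 to +1 range:

from textblob import TextBlob

print(TextBlob('The new phone has a brilliant camera').sentiment.polarity)      # positive (> 0)
print(TextBlob('The match was a dull, disappointing draw').sentiment.polarity)  # negative (< 0)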

This project can be used to build the following key features:

  • A text summarizer using RNNs and LSTMs
  • Extraction of only a particular sentiment, be it positive or negative
  • Emojifier: generating appropriate reaction emojis from the extracted sentiments
  • A tone detector like the one Grammarly (Beta) provides

I built this project to learn the nuances of handling text data in NLP.

🔌 Installation

📦 Commands

Packages which should be imported:

  • Pandas
  • Numpy
  • Seaborn
  • nltk
  • Afinn
  • TextBlob
  • Beautiful Soup
  • requests
  • Spacy Language Models

Note: spaCy may throw a lot of errors if it is not installed properly, so make sure the installation is complete. For more details, refer to requirements.txt.
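For example, a typical fix is to install spaCy together with an English language model; the exact model name depends on the spaCy version pinned in requirements.txt:

$ pip install spacy
$ python -m spacy download en_core_web_sm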

To run the project on your local machine, make sure you install all the packages mentioned in requirements.txt.

  • Clone the repository
$ git clone https://github.com/codekhal/Inshorts-NLP
  • Move into the project directory and install the dependencies
$ cd Inshorts-NLP
$ pip install -r requirements.txt
  • In your terminal, with the appropriate conda env active, run Jupyter (or any other preferred editor)
$ jupyter notebook

📂 File Structure

  • File structure with the basic details about files and directories.
Inshorts-NLP
├── contractions.py
├── img
│   ├── scraping.png
│   ├── Sentiment_Score_News_Category.png
│   ├── sentiments.png
│   ├── stemming.png
│   ├── Visualizing_Sentiments_Box_Plot.png
│   └── workflow.png
├── LICENSE
├── news.csv
├── NLP_main.ipynb
├── __pycache__
│   └── contractions.cpython-35.pyc
├── README.md
└── requirements.txt

2 directories, 13 files

- Brief Description

Built a web scraper that scraped news articles from Inshorts website URLs. The data was then cleaned for further processing using numerous text-preprocessing techniques. After that, sentiment analysis was performed on the data. Various popular lexicons are used for sentiment analysis, including the following:

  • AFINN lexicon
  • Bing Liu’s lexicon
  • MPQA subjectivity lexicon
  • SentiWordNet
  • VADER lexicon
  • TextBlob lexicon

The NLTK, AFINN, and TextBlob libraries were used, along with both data-visualization tools and pandas DataFrame techniques to present the results on the dataset.
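A rough sketch of how the scoring step might be wired up, assuming the scraped data sits in news.csv with news_article and news_category columns as in this repository:

import pandas as pd
from afinn import Afinn
from textblob import TextBlob

news_df = pd.read_csv('news.csv')
afinn = Afinn(emoticons=True)

# Lexicon-based scores: AFINN gives an unbounded signed score,
# TextBlob a polarity between -1 and +1.
news_df['afinn_score'] = news_df['news_article'].apply(afinn.score)
news_df['textblob_polarity'] = news_df['news_article'].apply(
    lambda text: TextBlob(text).sentiment.polarity)

# Coarse positive/neutral/negative labels used for the category-level plots.
news_df['sentiment'] = news_df['textblob_polarity'].apply(
    lambda p: 'positive' if p > 0 else ('negative' if p < 0 else 'neutral'))

print(news_df.groupby('news_category')['afinn_score'].describe())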

📷 Info Gallery

The sentiment scores across the different news categories are shown with the help of the following plots.

Box Plot

Lastly, the count of the three sentiments across the different news categories is depicted with the help of a factor (bar) plot.

Factor Plot
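The plotting calls behind figures of this kind might look roughly as follows; the column names are carried over from the scoring sketch above, and newer seaborn versions expose the factor plot as catplot:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from afinn import Afinn

# Rebuild a scored frame (see the sketch in the Brief Description section).
news_df = pd.read_csv('news.csv')
afinn = Afinn(emoticons=True)
news_df['afinn_score'] = news_df['news_article'].apply(afinn.score)
news_df['sentiment'] = news_df['afinn_score'].apply(
    lambda s: 'positive' if s > 0 else ('negative' if s < 0 else 'neutral'))

# Box plot of AFINN sentiment scores per news category.
sns.boxplot(x='news_category', y='afinn_score', data=news_df)
plt.show()

# Factor/bar plot of sentiment label counts per category.
sns.catplot(x='news_category', hue='sentiment', data=news_df, kind='count')
plt.show()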

📜 Guidelines

  • Contribution Guidelines

Future Work that could be done:

  • Flask App Deployment - Deploy the app so that it can be used efficiently.

  • Use of Deep Learning - One may try using deep learning to build a text summarizer and a tone detector.

Kindly follow the Contribution Guidelines before you create any pull requests or issues. That said, feel free to contribute in any form.
Open Source <3

📄 Resources

🌟 Present Contributors

Contributors

Want to share your ideas?

Feel free to reach out to me

Telegram

🔒 License

License
