Apologies, but I did not use the NLTK package for some tasks. Instead, I used:
- TextBlob for sentiment analysis
- spaCy for various text processing tasks
- Syllapy for counting syllables in words
🗂 Directories and Files
📝 Cleaned Articles
- cleaned_articles: Contains cleaned articles ready for analysis.
📂 Extracted Articles
- extracted_articles: Holds raw articles extracted for the project.
📚 Master Dictionary
- master_dictionary: Collection of files for sentiment analysis.
  - cleaned_negative_words.txt: List of cleaned negative words.
  - cleaned_positive_words.txt: List of cleaned positive words.
  - negative-words.txt: Raw negative words for sentiment analysis.
  - positive-words.txt: Raw positive words for sentiment analysis.
📑 Project Introduction
- project_introduction: Overview and objectives of the project.
🧪 Test Assessment
- test_assessment: Contains test assignments and notebooks.
  - dataextraction.ipynb: Jupyter Notebook for data extraction tasks.
  - testassessment.ipynb: Jupyter Notebook for additional test assessments.
💻 Code and Markdown
- testassignment: Code and markdown files related to assignments.
  - Code + Markdown/: Contains code snippets and explanations.
  - Run All/: Script to execute all code cells in notebooks.
🚫 Stop Words
- Stop Words: Directory with various stop words files for preprocessing.
📊 Text Analysis
- text_analysis: Files for performing text analysis.
  - textanalysis.ipynb: Jupyter Notebook for text analysis.
  - sentiment_analysis.log: Log file for sentiment analysis results.
  - textblob_sentiment_result.csv: CSV file with sentiment analysis results.
📈 Additional Files
- additional_files: Summary results and metrics.
  - analysis_results.csv: Various analysis results.
  - final_text_analysis_results.xlsx: Final compiled analysis results.
- Consulting Website: Blackcoffer | LSA Lead
- Web App Products: Netclan | Insights | Hire Kingdom | Workcroft
- Mobile App Products: Netclan | Bwstory
- Objective: Extract textual data from provided URLs and perform text analysis.
- Data Extraction:
  - Input from Input.xlsx
  - Tools: Python, BeautifulSoup, Selenium, Scrapy.
- Data Analysis:
  - Output in CSV or Excel format.
  - Variables include Positive Score, Negative Score, Polarity Score, etc.
- Timeline: Duration of 6 days.
- Submission: Via Google Form with required files.
- Sentiment Analysis: Clean text using stop words, create dictionaries of positive/negative words, and extract variables.
- Readability Analysis: Calculate average sentence length, percentage of complex words, and Fog Index.
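As a rough illustration of the dictionary-based sentiment step described above, the sketch below removes stop words and counts hits against positive and negative word sets. The function names, the tiny word sets in the usage example, and the polarity and subjectivity formulas are assumptions made for this example; the project's notebooks may compute these values differently.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def sentiment_scores(text, positive_words, negative_words, stop_words):
    """Score a text against positive and negative word dictionaries."""
    # Drop stop words before matching against the dictionaries.
    words = [w for w in tokenize(text) if w not in stop_words]
    positive = sum(1 for w in words if w in positive_words)
    negative = sum(1 for w in words if w in negative_words)
    # Assumed formulas; the small constant avoids division by zero.
    polarity = (positive - negative) / (positive + negative + 1e-6)
    subjectivity = (positive + negative) / (len(words) + 1e-6)
    return {
        "positive_score": positive,
        "negative_score": negative,
        "polarity_score": polarity,
        "subjectivity_score": subjectivity,
    }


# Hypothetical word sets; the real ones would be loaded from the
# master_dictionary and Stop Words directories.
scores = sentiment_scores(
    "The service was good and great but the delay was bad",
    positive_words={"good", "great"},
    negative_words={"bad"},
    stop_words={"the", "was", "and", "but"},
)
print(scores)
```

Keeping the word lists as sets makes each lookup constant-time, which matters once the full dictionaries are loaded.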
Objective:
The ProText-Analyzer project extracts article content from provided URLs and performs various text analysis tasks like sentiment scoring, readability measurement, and more. The results are structured in a clean and organized format, ready for review and further use.
The goal of ProText-Analyzer is to:
- Extract Textual Data: Fetch the article content from URLs provided in the Input.xlsx file.
- Perform Textual Analysis: Calculate the following metrics:
  - Sentiment scores (positive, negative, polarity, subjectivity)
  - Readability scores (Fog Index, Avg. Sentence Length)
  - Word count, syllable count, and other word statistics
- Python 🐍
- Libraries:
  - TextBlob for sentiment analysis
  - spaCy for text processing tasks (tokenization, POS tagging, etc.)
  - Syllapy for syllable counting
  - BeautifulSoup for HTML parsing during data extraction
  - Requests for handling HTTP requests
- Pandas for data management
- Excel/CSV for input/output handling
- Clone the repository to your local machine:
  git clone https://github.com/rubydamodar/ProText-Analyzer.git
  cd ProText-Analyzer
- Install the required Python libraries:
  pip install -r requirements.txt
ProText-Analyzer extracts the article title and body from each URL listed in the Input.xlsx file and stores the text for further analysis.
- Read Input File: Load the URLs and their associated IDs from Input.xlsx.
- Extract Article Content:
  - Fetch HTML content using requests.
  - Parse the HTML using BeautifulSoup to extract the article's title and body.
  - Save the extracted content into text files named after the URL_ID.
- Each article's content is saved in text files, facilitating a clean process for further analysis.
- Error handling ensures proper management of file I/O and network issues.
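The extraction steps above could look roughly like the following sketch. The function names, the output folder default, and the choice of the first h1 as the title and all p tags as the body are assumptions for illustration; the selectors the project actually uses depend on the target site's markup.

```python
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Return (title, body) extracted from an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the title is the first <h1>, the body is every <p>.
    heading = soup.find("h1")
    title = " ".join(heading.get_text().split()) if heading else ""
    # Collapse runs of whitespace inside each paragraph.
    paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    body = "\n".join(p for p in paragraphs if p)
    return title, body


def save_article(url_id, url, out_dir="extracted_articles", timeout=10):
    """Fetch one URL and write its text to <out_dir>/<url_id>.txt.

    Returns the written path, or None if the fetch or the write failed.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"{url_id}: network error: {exc}")
        return None
    title, body = parse_article(response.text)
    try:
        folder = Path(out_dir)
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{url_id}.txt"
        path.write_text(f"{title}\n\n{body}", encoding="utf-8")
    except OSError as exc:
        print(f"{url_id}: file error: {exc}")
        return None
    return path


# Parsing can be checked offline with a small HTML string.
sample_html = (
    "<html><body><h1> Hello World </h1>"
    "<p>First paragraph.</p><p></p><p>Second.</p></body></html>"
)
title, body = parse_article(sample_html)
print(title)
print(body)
```

Separating parsing from fetching keeps the network and file error handling in one place and lets the parser be tested without a connection.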
The extracted text undergoes several analysis steps to compute the following variables:
- Sentiment Analysis:
  - Implemented using TextBlob to compute Positive Score, Negative Score, Polarity Score, and Subjectivity Score.
  - Text is cleaned by removing stop words and irrelevant characters.
- Readability Analysis:
  - Based on the Gunning Fog Index.
  - Reported metrics: Average Sentence Length, Percentage of Complex Words, and Fog Index.
- Word-Level Metrics:
  - Word Count, Complex Word Count, Syllable Count per Word (via syllapy), Personal Pronouns Count (using regex), and Average Word Length.
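To make the readability and word-level metrics concrete, here is a minimal sketch using the standard Gunning Fog formula, 0.4 × (average sentence length + percentage of complex words), where a complex word has three or more syllables. The function names, the regex-based sentence and word splitting, and the vowel-group syllable counter are assumptions for this example. The counter is a crude stand-in so the sketch runs without extra packages; a library counter such as the one syllapy provides could be passed in as syllable_counter instead.

```python
import re


def count_syllables(word):
    """Rough syllable count: the number of vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_metrics(text, syllable_counter=count_syllables):
    """Compute readability and word-level metrics for a text."""
    # Naive splitting: sentences end at ., ! or ?; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None  # nothing to measure
    syllables = [syllable_counter(w) for w in words]
    # A complex word is one with three or more syllables.
    complex_count = sum(1 for n in syllables if n >= 3)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * complex_count / len(words)
    return {
        "word_count": len(words),
        "complex_word_count": complex_count,
        "syllable_count": sum(syllables),
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


metrics = readability_metrics("The cat sat. Education is important.")
print(metrics)
```

Some write-ups of this metric express the complex-word share as a fraction rather than a percentage, which changes the scale of the Fog Index, so it is worth checking which convention the output file expects.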
The results are saved in Excel/CSV format as per the structure outlined in Output Data Structure.xlsx. The following variables are included:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Complex Word Count
- Word Count
- Syllable Count
- Personal Pronouns Count
- Average Word Length
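A possible way to write one row per article with the variables listed above is sketched below. It uses Python's built-in csv module rather than Pandas so that it runs without extra packages. The URL_ID column and the column order are assumptions, since the authoritative layout is the one in Output Data Structure.xlsx.

```python
import csv
import io

# Assumed column order: an ID column followed by the variables listed above.
COLUMNS = [
    "URL_ID",
    "Positive Score",
    "Negative Score",
    "Polarity Score",
    "Subjectivity Score",
    "Average Sentence Length",
    "Complex Word Count",
    "Word Count",
    "Syllable Count",
    "Personal Pronouns Count",
    "Average Word Length",
]


def write_results(rows, stream):
    """Write result dictionaries as CSV; missing values stay blank."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Demonstration with an in-memory buffer and a hypothetical row.
buffer = io.StringIO()
write_results([{"URL_ID": "a1", "Word Count": 120}], buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

To write to disk instead, the stream would be a file opened with newline="" so the csv module controls line endings.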
- Data Extraction: Run the script to extract article data from the URLs:
  python data_extraction.py
- Text Analysis: Run the text analysis script to process the extracted articles:
  python text_analysis.py

The results will be saved in the output directory in .csv or .xlsx format.
- Error Handling: Implemented robust error handling to manage potential network and file-related issues.
- Text Processing: Used spaCy for tokenization and POS tagging, and syllapy for syllable counting.
- Personal Pronouns: A regex counts personal pronouns while excluding look-alikes such as "US" (the country abbreviation).
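The pronoun rule above might be implemented along these lines. The pronoun list (I, we, my, ours, us) and the rule that an all-caps "US" refers to the country are assumptions for this sketch, not confirmed details of the project's regex.

```python
import re

# Assumed pronoun list; \b keeps matches from firing inside longer words.
PRONOUN_PATTERN = re.compile(r"\b(?:I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count personal pronouns, skipping the country abbreviation 'US'."""
    count = 0
    for match in PRONOUN_PATTERN.finditer(text):
        # An all-caps "US" is assumed to mean the country, not the pronoun.
        if match.group() == "US":
            continue
        count += 1
    return count


sample = "We visited the US, and it gave us my favorite memory. I think ours was best."
pronoun_count = count_personal_pronouns(sample)
print(pronoun_count)
```

Matching case-insensitively and then filtering the exact string "US" keeps sentence-initial forms like "We" while dropping the abbreviation; text written entirely in capitals would need a different rule.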
We welcome contributions to enhance ProText-Analyzer! To contribute:
- Fork the repository.
- Create a new branch for your changes.
- Submit a pull request with a detailed description of your changes.
This project is licensed under the MIT License.
Ruby Poddar
Email: rubypoddarr@gmail.com