NLP Pipeline for Gender Bias Detection in Portuguese Literature


This project implements an NLP pipeline to detect gender bias in Portuguese literature. The pipeline consists of six steps, from preprocessing the text to gender bias analysis. The main script processes multiple text files, extracting entities, classifying gender, analyzing dependencies, and calculating gender skewness, outputting the results into CSV files.

Installation

  1. Clone the repository:
git clone https://github.com/marianaossilva/gender_pipeline.git
cd gender_pipeline
  2. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:
pip install -r requirements.txt
  4. Download the spaCy Portuguese model:
python -m spacy download pt_core_news_sm
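
To confirm that the installation steps above took effect, a quick check like the following can be run inside the activated virtual environment. It is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and it relies only on the fact that `spacy download` installs the model as an importable Python package.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # spaCy itself and the Portuguese model downloaded in step 4.
    missing = missing_packages(["spacy", "pt_core_news_sm"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("spaCy and the Portuguese model are installed.")
```

If anything is reported as missing, re-run the corresponding installation step with the virtual environment active.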

Usage

Running the Pipeline

The main script, main.py, orchestrates the entire pipeline. To run it, place your input text files in the data/raw directory, then execute:

python main.py
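
Before launching the pipeline, it can be useful to confirm which files it will find. The sketch below lists the input files in data/raw. The function name `list_input_files` is hypothetical, and the `*.txt` pattern is an assumption, since this README only says "text files" without naming an extension.

```python
from pathlib import Path


def list_input_files(raw_dir="data/raw", pattern="*.txt"):
    """Return the sorted input files in `raw_dir` matching `pattern`.

    The "*.txt" default is an assumption; adjust it to match your corpus.
    Returns an empty list if the directory does not exist.
    """
    raw = Path(raw_dir)
    if not raw.is_dir():
        return []
    return sorted(p for p in raw.glob(pattern) if p.is_file())


if __name__ == "__main__":
    files = list_input_files()
    if not files:
        print("No input files found in data/raw.")
    for path in files:
        print(path)
```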

Steps in the Pipeline

  1. Preprocessing and Sentencer:
     • Cleans the text, tokenizes, and segments it into sentences.
     • Output: Preprocessed text files in data/preprocessed.
  2. Entity Recognition:
     • Uses a BERT-CRF model to extract PERSON entities.
     • Output: JSON files with recognized entities in data/results/book_dicts.
  3. Excerpt Segmentation:
     • Segments text into excerpts around PERSON entities.
     • Output: Updated JSON files in data/results/book_dicts.
  4. Gender Classification:
     • Classifies the gender of each PERSON entity.
     • Output: Updated JSON files in data/results/book_dicts.
  5. Dependency Analysis:
     • Analyzes grammatical dependencies to understand how gendered terms are used.
     • Output: Updated JSON files in data/results/book_dicts.
  6. Gender Skewness:
     • Measures gender bias by calculating skewness in the text.
     • Output: JSON files with gender bias results in data/results/gender_bias.
  7. Plot Results (To be implemented):
     • Visualizes the analysis results.
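
Several of the steps above update the same per-book JSON file in data/results/book_dicts. One way to picture that pattern is a dictionary passed through a chain of stage functions and then written to disk. The sketch below illustrates the idea only and is not the repository's code: the stage functions are placeholders, and the keys `title`, `entities`, and `gender` are a hypothetical schema.

```python
import json
from pathlib import Path


def recognize_entities(book):
    """Placeholder for entity recognition: ensures an 'entities' list exists."""
    book.setdefault("entities", [])
    return book


def classify_gender(book):
    """Placeholder for gender classification: marks unlabeled entities as 'unknown'."""
    for entity in book.get("entities", []):
        entity.setdefault("gender", "unknown")
    return book


def run_stages(book, stages):
    """Apply each stage in order; every stage receives and returns the book dict."""
    for stage in stages:
        book = stage(book)
    return book


def save_book(book, out_dir="data/results/book_dicts"):
    """Write the book dict as UTF-8 JSON named after its (hypothetical) 'title' key."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{book['title']}.json"
    path.write_text(json.dumps(book, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    book = {"title": "example_book", "entities": [{"name": "Maria"}]}
    book = run_stages(book, [recognize_entities, classify_gender])
    print(save_book(book))
```

Writing with `ensure_ascii=False` keeps accented Portuguese characters readable in the output files.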

How to Cite

If you use this pipeline in your research or work, please cite it as follows:

@inproceedings{semish/Silva24,
 author = {Mariana Silva and Mirella Moro},
 title = {{NLP} Pipeline for Gender Bias Detection in Portuguese Literature},
 booktitle = {Anais do LI Seminário Integrado de Software e Hardware, {SEMISH}},
 location = {Brasília/DF},
 year = {2024},
 issn = {2595-6205},
 pages = {169--180},
 publisher = {SBC},
 doi = {10.5753/semish.2024.2914},
 url = {https://sol.sbc.org.br/index.php/semish/article/view/29365}
}
