NLP Pipeline for Gender Bias Detection in Portuguese Literature

This project implements an NLP pipeline for detecting gender bias in Portuguese literature. The pipeline runs six steps, from text preprocessing to gender bias analysis. The main script processes multiple text files: it extracts entities, classifies their gender, analyzes dependencies, calculates gender skewness, and writes the results to files under the data directory.

Installation

Clone the repository:

git clone https://github.com/marianaossilva/gender_pipeline.git
cd gender_pipeline

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

Install the required packages:

pip install -r requirements.txt

Download the spaCy Portuguese model:

python -m spacy download pt_core_news_sm

Usage

Running the Pipeline

The main script, main.py, orchestrates the entire pipeline. To run it, place your input text files in the data/raw directory, then execute:

python main.py
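
Before launching a long run, it can help to confirm that the input directory actually contains files. The following is a minimal sketch of such a check, not part of the repository; the function name `list_input_files` and the `*.txt` pattern are assumptions made for illustration.

```python
from pathlib import Path


def list_input_files(raw_dir="data/raw", pattern="*.txt"):
    """Return the files in raw_dir that match pattern, sorted by name.

    The "*.txt" pattern is an assumption; adjust it to match your inputs.
    A missing directory yields an empty list instead of an error.
    """
    directory = Path(raw_dir)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.glob(pattern) if p.is_file())
```

Calling `list_input_files()` from the repository root returns an empty list when data/raw is missing or holds no matching files, which signals that there is nothing for the pipeline to process yet.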

Steps in the Pipeline

Preprocessing and Sentencer:

Cleans the text, tokenizes it, and segments it into sentences.
Output: Preprocessed text files in data/preprocessed.
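
To make this step concrete, here is a rough standard-library sketch of cleaning, sentence splitting, and tokenizing. It is an illustration under simplifying assumptions, not the project's implementation: the function names are hypothetical, and the regex-based splitter will break on abbreviations such as "Sr.".

```python
import re

# Uppercase letters that may start a Portuguese sentence (illustrative set).
UPPER = "A-ZÁÀÂÃÉÊÍÓÔÕÚÇ"


def clean_text(text):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter follow."""
    return re.split(rf"(?<=[.!?])\s+(?=[{UPPER}])", text)


def tokenize(sentence):
    """Return word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)
```

A trained sentence segmenter would handle abbreviations and quoted dialogue more reliably than this regex.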

Entity Recognition:

Uses a BERT-CRF model to extract PERSON entities.
Output: JSON files with recognized entities in data/results/book_dicts.

Excerpt Segmentation:

Segments text into excerpts around PERSON entities.
Output: Updated JSON files in data/results/book_dicts.
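
One simple way to picture excerpt segmentation is a window of sentences around each mention of an entity. The sketch below assumes that approach; the function name, the `window` parameter, and the plain substring match are hypothetical choices made for brevity, and the project may segment excerpts differently.

```python
def extract_excerpts(sentences, entity, window=1):
    """Return one excerpt per sentence that mentions entity.

    Each excerpt joins the matching sentence with up to `window` sentences
    on either side. Overlapping excerpts are not merged.
    """
    excerpts = []
    for i, sentence in enumerate(sentences):
        if entity in sentence:  # naive substring match (assumption)
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            excerpts.append(" ".join(sentences[start:end]))
    return excerpts
```

With `window=0` each excerpt is just the sentence containing the mention; larger windows keep more surrounding context for the later analysis steps.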

Gender Classification:

Classifies the gender of each PERSON entity.
Output: Updated JSON files in data/results/book_dicts.

Dependency Analysis:

Analyzes grammatical dependencies to understand how gendered terms are used.
Output: Updated JSON files in data/results/book_dicts.

Gender Skewness:

Measures gender bias by calculating skewness in the text.
Output: JSON files with gender bias results in data/results/gender_bias.
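
This README does not state the formula behind the skewness value. As a purely illustrative stand-in, the sketch below computes a normalized difference between female and male counts; the function name and the formula are assumptions, not necessarily the measure this project uses.

```python
def gender_skew(female_count, male_count):
    """Return (female - male) / (female + male), a value in [-1, 1].

    Illustrative stand-in only. Positive values lean female, negative
    values lean male, and 0.0 is returned when both counts are zero.
    """
    total = female_count + male_count
    if total == 0:
        return 0.0
    return (female_count - male_count) / total
```

For example, 30 female-associated and 10 male-associated occurrences give (30 - 10) / 40 = 0.5 under this stand-in measure.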

Plot Results (To be implemented):

Visualizes the analysis results.

How to Cite

If you use this pipeline in your research or work, please cite it as follows:

@inproceedings{semish/Silva24,
 author = {Mariana Silva and Mirella Moro},
 title = {{NLP} Pipeline for Gender Bias Detection in Portuguese Literature},
 booktitle = {Anais do LI Seminário Integrado de Software e Hardware, {SEMISH}},
 location = {Brasília/DF},
 year = {2024},
 issn = {2595-6205},
 pages = {169--180},
 publisher = {SBC},
 doi = {10.5753/semish.2024.2914},
 url = {https://sol.sbc.org.br/index.php/semish/article/view/29365}
}
