
Repository that contains NLP work done on the Grand Débat National textual data


Make-the-Debat-Great-Again/grand_debat_nlp


NLP work on the "Grand Débat National" data

This repository contains the code used to:

  • Train and use a classification model that identifies answers associated with the "transport" theme
  • Extract occurrences of specific words from the text data, using lexical resources
  • Extract specific co-occurrence patterns from the "transport" answers

Requirements

!! Attention !!

Before using any script, start the Talismane server using the following command:

cd <talismane_directory>
java -Xmx6G -Dconfig.file=talismane-fr-5.2.0.conf -jar talismane-core-5.3.0.jar --analyse --sessionId=fr --mode=server --encoding=UTF-8
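Since every script depends on the server, it can help to check that it is reachable before launching a long run. The following is a minimal sketch, not part of the repository: the helper name `talismane_is_up` is hypothetical, and the host and port (`localhost`, 7272) are assumptions that should be replaced by the values from your Talismane configuration.

```python
import socket


def talismane_is_up(host="localhost", port=7272, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    The default host and port are assumptions, not values taken from
    this repository; adapt them to your Talismane configuration.
    """
    try:
        # create_connection raises OSError if nothing is listening
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only checks that the port is open; it does not verify that the process behind it is actually Talismane.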

Classification model

To identify answers related to "transportation", we trained a classification model using the SVM algorithm. The training data are crowdsourced annotations from the "Grande Annotation" initiative.

To train the model, run the train.py script using the following command:

python3 train.py data/LA_TRANSITION_ECOLOGIQUE.csv data/results.csv -o # WITH TALISMANE
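To illustrate the idea behind an SVM text classifier, here is a dependency-free sketch of a linear SVM trained with Pegasos-style subgradient descent on bag-of-words features. It is not the code of train.py: the function names, the toy sentences, the absence of a bias term and the plain word tokenization (instead of Talismane output) are all simplifying assumptions.

```python
import re
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lowercase word tokens (no lemmatization)."""
    return Counter(re.findall(r"\w+", text.lower()))


def train_linear_svm(texts, labels, epochs=50, lam=0.01):
    """Pegasos-style training of a linear SVM without bias term.

    labels must be +1 (transport) or -1 (other).
    """
    weights = {}
    step = 0
    samples = [featurize(t) for t in texts]
    for _ in range(epochs):
        for feats, y in zip(samples, labels):
            step += 1
            eta = 1.0 / (lam * step)  # decreasing learning rate
            score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
            # Regularization step: shrink every weight
            for f in weights:
                weights[f] *= 1.0 - eta * lam
            # Hinge-loss step: only samples inside the margin push the weights
            if y * score < 1.0:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + eta * y * v
    return weights


def is_transport(weights, text):
    """Classify a text as transport-related if its score is positive."""
    feats = featurize(text)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items()) > 0


# Toy training set (invented for illustration only)
texts = [
    "le train est en retard",
    "plus de bus et de vélo",
    "baisser les impôts",
    "réformer le sénat",
]
labels = [1, 1, -1, -1]
weights = train_linear_svm(texts, labels)
```

In practice a library implementation and properly preprocessed features would be preferable; the sketch only shows how the margin and the regularization interact during training.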

Once the model is trained, use the following command to classify a dataset from the Grand Débat National:

python3 predict.py <dataset> <dataset_code**>

** Each "Grand Débat National" dataset is associated with a code, as listed in the following table:

code dataset
1 transition_eco
2 democratie_et_citoy
3 fiscalite_et_depense_publique
4 organisation_de_etat_et_service_pub
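When scripting several runs, the table above can be kept as a small lookup. This is only a sketch: the names `DATASET_CODES` and `dataset_name` are hypothetical and do not come from the repository.

```python
# Code -> dataset name, copied from the association table
DATASET_CODES = {
    1: "transition_eco",
    2: "democratie_et_citoy",
    3: "fiscalite_et_depense_publique",
    4: "organisation_de_etat_et_service_pub",
}


def dataset_name(code):
    """Return the dataset name for a code, or raise on an unknown code."""
    try:
        return DATASET_CODES[int(code)]
    except (KeyError, ValueError):
        valid = ", ".join(str(c) for c in sorted(DATASET_CODES))
        raise ValueError(f"unknown dataset code {code!r}; expected one of {valid}")
```

Accepting strings as well as integers is convenient because command-line arguments arrive as strings.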

The output file contains the classification results. Each row corresponds to a contribution and each question column to a question of the dataset. The remaining columns hold the author's ID, the postal code and the reference ID of the contribution.

Each cell at the intersection of a row and a question column is a boolean that indicates whether the answer is associated with the transportation theme.

You can find an output sample in the example directory.
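As an illustration of how such an output could be consumed, the sketch below counts, for each question, how many answers were flagged as transport-related. The column names (`reference`, `authorId`, `authorZipCode`, `Q1`, `Q2`) and the `True`/`False` encoding are assumptions made for the example; check the sample in the example directory for the real layout.

```python
import csv
import io

# Hypothetical metadata column names; the real file may use others
META_COLUMNS = {"reference", "authorId", "authorZipCode"}


def count_transport_answers(csv_file, meta_columns=META_COLUMNS):
    """Return {question column: number of rows flagged as transport}."""
    reader = csv.DictReader(csv_file)
    questions = [c for c in reader.fieldnames if c not in meta_columns]
    counts = {q: 0 for q in questions}
    for row in reader:
        for q in questions:
            # Assumes booleans are written as "True"/"False" (or 1/0)
            if row[q].strip().lower() in {"true", "1"}:
                counts[q] += 1
    return counts


# Invented sample standing in for a real classification output
sample = io.StringIO(
    "reference,authorId,authorZipCode,Q1,Q2\n"
    "r1,a1,31000,True,False\n"
    "r2,a2,75001,False,False\n"
    "r3,a3,69000,True,True\n"
)
per_question = count_transport_answers(sample)
```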

Measure the number of occurrences of terms associated with transportation

Using the classification output of the predict.py script, we can count the occurrences of specific words. We provide three lexicons:

  • Transport lexicon
  • Transportation verbs
  • Alteration verbs

To count the occurrences of the terms from these lexicons (see the resources/lexiques directory), run the following command:

python3 count_terms.py <grand_debat_dataset> <classification_result>
  • <grand_debat_dataset> : CSV of a "Grand Débat National" dataset
  • <classification_result> : Output of the trained classification model on the same dataset
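The counting step can be pictured with the following sketch. It is not the code of count_terms.py: the function name, the toy lexicon and the plain token matching are assumptions, and surface forms are compared directly, so an inflected form such as "trains" does not match the entry "train".

```python
import re
from collections import Counter


def count_lexicon_terms(answers, lexicon):
    """Count how often each lexicon term appears in a list of answers."""
    lexicon = {term.lower() for term in lexicon}
    counts = Counter()
    for text in answers:
        for token in re.findall(r"\w+", text.lower()):
            if token in lexicon:
                counts[token] += 1
    return counts


# Invented answers and lexicon, for illustration only
answers = ["Plus de trains et de bus", "Le bus est trop cher"]
term_counts = count_lexicon_terms(answers, {"bus", "train", "vélo"})
```

Matching on lemmas, for example those produced by a morphosyntactic analyser, would avoid missing inflected forms.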

Identify specific patterns representing contributors' propositions

If counting occurrences

To measure the co-occurrence patterns based on the lexicons (check the resources/lexiques directory), use the following command:

python3 count_coocs.py <grand_debat_dataset> <classification_result> <dataset_code>
  • <grand_debat_dataset> : CSV of a "Grand Débat National" dataset
  • <classification_result> : Output of the trained classification model on the same dataset
  • <dataset_code> : Code of the dataset, as listed in the association table of the classification section
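One simple way to think about such patterns is to count, per answer, the pairs made of a transport term and a verb that appear together. The sketch below assumes this answer-level definition of co-occurrence, which may differ from what count_coocs.py actually computes; the function name and the toy data are hypothetical.

```python
import re
from collections import Counter
from itertools import product


def count_cooccurrences(answers, nouns, verbs):
    """Count (noun, verb) pairs that appear together in the same answer."""
    nouns, verbs = set(nouns), set(verbs)
    counts = Counter()
    for text in answers:
        tokens = set(re.findall(r"\w+", text.lower()))
        # Each pair is counted at most once per answer
        for pair in product(sorted(tokens & nouns), sorted(tokens & verbs)):
            counts[pair] += 1
    return counts


# Invented answers and lexicons, for illustration only
answers = [
    "il faut développer le train",
    "développer le bus et réduire la voiture",
    "le train est cher",
]
coocs = count_cooccurrences(
    answers,
    nouns={"train", "bus", "voiture"},
    verbs={"développer", "réduire"},
)
```

Note that answer-level pairing over-generates: in the second answer, "bus" is paired with "réduire" even though the verb governs "voiture". A syntactic analysis would be needed to link each verb to its actual object.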
