Fact check on claims from PUBHEALTH dataset

The aim of this model is to predict the correct label for each claim, based on the claim text and/or other columns.

Table of contents

1. Data exploration, preprocessing and feature analysis:

Exploration and Preprocessing Notebook.

1.1. Data exploration.

Loading data.
Preliminary column/feature selection and elimination.

1.2. Drop Missing Values.

1.3. Check outliers.

Find and eliminate outliers in labels.
Save clean data.
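The cleaning steps in 1.2 and 1.3 can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the column names `claim` and `label`, the helper `clean_rows`, and the label set are assumptions made for the example, and the rows are toy data.

```python
# Assumed label set and column names; adjust to the actual data files.
VALID_LABELS = {"true", "false", "mixture", "unproven"}


def clean_rows(rows, text_key="claim", label_key="label"):
    """Keep rows that have a non-empty claim and a recognised label."""
    cleaned = []
    for row in rows:
        text = (row.get(text_key) or "").strip()
        label = (row.get(label_key) or "").strip().lower()
        if not text:
            continue  # missing claim text
        if label not in VALID_LABELS:
            continue  # missing label or outlier label value
        cleaned.append({**row, text_key: text, label_key: label})
    return cleaned


# Toy rows, for illustration only.
rows = [
    {"claim": "Vitamin C cures colds.", "label": "false"},
    {"claim": "", "label": "true"},
    {"claim": "Masks reduce transmission.", "label": None},
    {"claim": "Some claim.", "label": "snopes"},
    {"claim": "Exercise lowers blood pressure.", "label": "True"},
]
cleaned = clean_rows(rows)
print(len(cleaned), [row["label"] for row in cleaned])
```

Normalising the label to lowercase before the membership test keeps rows whose label differs only in casing.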

1.4. Check data balance and label distribution.

1.5. Basic statistics on columns.

1.6. PCA and TF-IDF for features insights.

PCA Notebook.

Conducting PCA on each column using TF-IDF features.
The 'Subjects' column seems to expose a clear pattern.
More details can be found in the notebook.

For True claims, no PCA component stands out across subjects, but for the other labels, particularly False and Unproven claims, the tendency is clear.
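As a rough illustration of the TF-IDF plus PCA idea, the sketch below builds a small TF-IDF matrix by hand and projects it onto its top principal components with NumPy's SVD. It is not the notebook's implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names `tfidf_matrix` and `pca_project`, and the toy subject strings are all assumptions.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return (matrix, vocabulary) with one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    matrix = np.zeros((n_docs, len(vocab)))
    for row, toks in enumerate(tokenized):
        if not toks:
            continue
        for tok, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0  # smoothed IDF
            matrix[row, index[tok]] = tf * idf
    return matrix, vocab


def pca_project(matrix, n_components=2):
    """Center the columns and project rows onto the top principal components."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy 'Subjects' values, for illustration only.
subjects = ["health news", "health care policy", "politics fake news", "politics"]
matrix, vocab = tfidf_matrix(subjects)
coords = pca_project(matrix, n_components=2)
print(coords.shape)
```

Plotting `coords` coloured by label is one way to look for the kind of per-label pattern described above. The dense matrix here only works for small inputs, which is consistent with the computational limits mentioned for the longer columns.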

2. Models training and data iteration:

2.1. Base model BERT.

A pretrained BERT model was first used on the cleaned, imbalanced dataset.

2.2. Iterating on model hyperparameters and data features (enriching the data based on the results of 1.5 and 1.6).
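One simple way to enrich the input with the 'Subjects' column is plain string concatenation before tokenization. The sketch below is only an assumption about how this could look: the helper `enrich_claim` and the `" [SEP] "` separator are hypothetical, and the source only states that the columns were concatenated.

```python
def enrich_claim(claim, subjects, sep=" [SEP] "):
    """Append the subjects text to the claim; return the claim alone if subjects is empty."""
    claim = claim.strip()
    subjects = (subjects or "").strip()
    if not subjects:
        return claim
    return f"{claim}{sep}{subjects}"


# Toy values, for illustration only.
example = enrich_claim("Vitamin C cures colds.", "Health, Medical")
print(example)  # Vitamin C cures colds. [SEP] Health, Medical
```

Rows with a missing subject fall back to the bare claim, so the enriched column never contains a dangling separator.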

2.3. BERT on balanced/undersampled data.
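Random undersampling, one common way to balance classes, can be sketched as below. This is an assumption about the approach rather than the code in the balancing notebook: the helper `undersample`, the `label` key, and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key="label", seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    if not rows:
        return []
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data, for illustration only.
rows = (
    [{"claim": f"t{i}", "label": "true"} for i in range(6)]
    + [{"claim": f"f{i}", "label": "false"} for i in range(3)]
    + [{"claim": f"u{i}", "label": "unproven"} for i in range(2)]
)
balanced = undersample(rows)
print(Counter(row["label"] for row in balanced))
```

Undersampling discards majority-class examples, so the balanced set is smaller than the original; that trade-off is worth keeping in mind when comparing results across the two datasets.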

2.4. RoBERTa trial.

3. Obstacles and challenges:

Both of my Google Colab GPU trials were used up.
For that reason, grid search could not be considered.
Hyperparameter tuning was done manually (guided by previous attempts in papers). Some iterations also had to be done in place, which may make the different model variants difficult to follow.
Conducting PCA on the training set was computationally infeasible, so the test set was used instead (except for the Subjects column, which was also verified on the training set thanks to its short length).
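When a full grid search is out of reach, one option is to enumerate a small hand-picked set of combinations and cap the number of runs. The sketch below uses the learning rate, batch size, and epoch values listed in the results section; the helper `candidate_configs` and the `max_trials` cap are hypothetical and not part of the original code.

```python
from itertools import product

# Candidate values taken from the results section of this README.
search_space = {
    "lr": [1e-3, 1e-4],
    "batch_size": [128, 64],
    "epochs": [10, 12],
}


def candidate_configs(space, max_trials=None):
    """List every combination in the space, optionally truncated to max_trials."""
    keys = list(space)
    configs = [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]
    return configs if max_trials is None else configs[:max_trials]


configs = candidate_configs(search_space, max_trials=4)
for config in configs:
    print(config)
```

Logging each configuration next to its metrics is one way to keep manually tuned runs traceable, which helps with the in-place iteration issue noted above.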

4. Results and remarks:

BERT: accuracy < 0.61, loss: 1.3, best F1: 60, data: imbalanced. From combinations of the following: [{lr: 1e-3, batch_size: 128, epochs: 10}, {lr: 1e-4, batch_size: 64, epochs: 12}].
BERT: accuracy < 0.57, loss: 1.1, best F1: 62, data: balanced. From combinations of the following: [{lr: 1e-3, batch_size: 128, epochs: 10}, {lr: 1e-4, batch_size: 64, epochs: 12}].

Note:

Balancing decreased the precision, but because the different labels/classes then contribute more equally, the F1 score increased.
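The effect described in this note can be reproduced on toy predictions: a model that favours the majority class can score higher on accuracy yet lower on a class-averaged F1. The sketch below assumes a macro-averaged F1, which the source does not confirm; the helper functions and the data are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 "true" claims followed by 2 "false" claims.
y_true = ["true"] * 8 + ["false"] * 2
majority_pred = ["true"] * 10                    # always predicts the majority class
balanced_pred = ["true"] * 5 + ["false"] * 5     # catches every "false" claim

print(accuracy(y_true, majority_pred), macro_f1(y_true, majority_pred))
print(accuracy(y_true, balanced_pred), macro_f1(y_true, balanced_pred))
```

In this toy case accuracy drops from 0.8 to 0.7 while macro F1 rises from about 0.44 to about 0.67, because the minority class goes from never being predicted to being fully recovered.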

BERT: accuracy < 0.67, loss: 0.92, best F1: 60, data: enriched (concatenated) with column 'Subjects'. From combinations of the following: [{lr: 1e-3, batch_size: 64, epochs: 10}, {lr: 1e-4, batch_size: 8, epochs: 12}].
RoBERTa: accuracy < 0.64, loss: 0.83, best F1: 64, data: enriched (concatenated) with column 'Subjects'. From combinations of the following: [{lr: 1e-4, batch_size: 64, epochs: 10}, {lr: 1e-4, batch_size: 8, epochs: 12}].

Next in line: DeBERTa.

5. Conclusion:

RoBERTa seems to be the best-fitting model, but its results are still far from acceptable. Previous work using DeBERTa v3 also reported strong results (testing in progress).
