QumranNLP

Modern computational linguistics for the Dead Sea Scrolls.

Development

Create a new environment (however you like; here is one suggestion):

cd /path/to/cloned/repo
pyenv virtualenv 3.10.0 QumranNLP
pyenv activate QumranNLP
pip install -r requirements.txt

Reproducibility

Reproducibility in academic research is often poor. I've tried to make this whole repo as reproducible as I can. If you are reading this, you probably want to use the Qumran data, so the data pipeline (src/ETL/main_ETL.py) is designed to be easy to use. If you are here for the machine learning, algorithms, or visualizations, I've tried to make them as robust to the data as I can. In any case, all of the scripts can be run via the exp_runner.sh script.

Data

I'm using the ETCBC/dss package (built on text-fabric). This repo also contains the original transcriptions by Martin Abegg (data/texts/abegg).
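
As a quick sanity check, the corpus can be loaded through text-fabric's use API. This is a minimal sketch, not part of the repo's pipeline (the first call downloads the corpus data):

from tf.app import use  # pip install text-fabric

# load the ETCBC/dss corpus; hoist=globals() exposes the F, L, T handles
A = use("ETCBC/dss", hoist=globals())

# smoke test: count the word nodes in the corpus
print(len(F.otype.s("word")))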

To generate the data, run the src/ETL/main_ETL.py script. It iterates over all of the scrolls (biblical and non-biblical), generates the Starr features, and saves two dataframes:

  1. Full data (no filtering).
  2. Filtered data (you can specify which rules to apply via the filter_df_by_rules function; a hedged sketch of the rules follows this list). The current rules are:
    1. Books longer than 300 words.
    2. Hebrew books.
    3. Each book divided into 100-word chunks.
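
The actual logic lives in filter_df_by_rules; the sketch below only illustrates the three rules with pandas, and the column names (book, language, word) are assumptions, not the repo's real schema:

import pandas as pd

def filter_scrolls(df: pd.DataFrame, min_words: int = 300) -> pd.DataFrame:
    # rule 1: keep books longer than min_words words
    counts = df.groupby("book")["word"].transform("count")
    # rule 2: keep Hebrew books
    return df[(counts > min_words) & (df["language"] == "hebrew")]

def chunk_book(words: list[str], chunk_size: int = 100) -> list[list[str]]:
    # rule 3: divide each book into fixed-size word chunks
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]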

Repo structure

├── data
├── experiments
├── models
├── notebooks
├── reports
└── src

  1. data - contains the most up-to-date processed data (under processed_data); qumran_labels.csv holds the labels per scroll, under multiple labeling schemes (sectarian/composition/genre, etc.).
  2. experiments - contains the results of multiple experiments.
  3. models - contains some trained models (mainly GNNs; the fine-tuned models are stored on Hugging Face).
  4. notebooks - contains a lot of research notebooks.
  5. reports - contains the results of most of the experiments.
  6. src - contains the code for the main ETL, feature engineering, experiments, and model training.

Running Research

Topic modeling

After trying multiple methods for estimating the optimal number of topics (LDA with coherence and perplexity, NMF optimization following Gal Gilad's method, HDP), we concluded that the optimal number lies somewhere between 10 and 20. For now, we proceed without topic modeling.

(figure: NMF topic modeling)
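
For reference, a minimal sketch of the LDA-with-coherence scan using gensim (not the repo's exact code; the toy corpus stands in for the real text chunks):

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["scroll", "temple", "law"], ["priest", "temple", "covenant"]]  # toy stand-in
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# scan candidate topic counts and score each model by c_v coherence
for k in range(10, 21, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    print(k, score)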

Global tuning params

Two separate studies determine the optimal chunk_size and pre-processing scheme. To evaluate each parameter, we tested every scheme on supervised and unsupervised classification at the scroll and composition level. In practice, that means running src/ETL/main_ETL.py to generate the data, and then running make_baselines_results for each task (src/baselines/main.py).

Fine-tuning

Fine-tuning uses a masked-LM scheme with 15% random masking. The code was run on Colab (fine-tuning-bert-maskedLM.ipynb) for easy access to a GPU 😅.
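
A minimal sketch of that scheme with Hugging Face transformers; the model name (onlplab/alephbert-base) and the toy corpus are my assumptions, not necessarily the repo's exact setup:

from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

name = "onlplab/alephbert-base"  # assumption: a Hebrew BERT; swap in your base model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

chunks = ["...", "..."]  # your 100-word text chunks go here
ds = Dataset.from_dict({"text": chunks}).map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# the collator applies the 15% random masking on the fly
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-finetune", num_train_epochs=3),
    train_dataset=ds,
    data_collator=collator)
trainer.train()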

GNN

To implement different graph structures in the GNN, I created a framework that can combine different edge types (a heterogeneous graph didn't work well; I believe that structure is too complex for this data). Each node x is a chunk of text represented by a 768-dimensional vector (from different BERT models). Edges can be constructed in various ways; the general scheme is to define a feature space over the nodes, compute the cosine similarity between every pair of nodes, and keep only the most similar pairs as edges (in practice, zeroing out everything below the 0.99 quantile of the adjacency matrix).
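
That edge-construction scheme, as I understand it, looks roughly like this (a sketch; function and variable names are mine):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_adjacency(features: np.ndarray, q: float = 0.99) -> np.ndarray:
    """Cosine-similarity adjacency, keeping only entries above the q quantile."""
    sim = cosine_similarity(features)   # (n_chunks, n_chunks)
    np.fill_diagonal(sim, 0.0)          # drop self-loops
    threshold = np.quantile(sim, q)
    return np.where(sim >= threshold, sim, 0.0)

# e.g. 50 chunks embedded by a BERT model into 768 dimensions
adj = build_adjacency(np.random.rand(50, 768))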

Supervised classification

We can see that for the global tasks (scroll, composition, and sectarian classification) the GNN consistently outperforms the other methods.

(figure: global tasks comparison, supervised)

It is interesting to see which types of adjacency matrices perform best:

(figure: comparison of different adjacency matrix types)

Unsupervised classification

For the unsupervised setting, I used the GAE (graph auto-encoder) algorithm, a classic encoder-decoder framework adapted to graphs. I trained the model for each of our candidate embedders, with the graph built from tf-idf/Starr/trigram features. Clustering is done with agglomerative hierarchical clustering, setting the number of clusters to the number of unique labels per task. The following plot shows the differences between models, taking the top performer per metric:

(figure: global tasks comparison)
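
A bare-bones GAE in PyTorch Geometric, to make the setup concrete (a sketch under my own layer-size assumptions, not the repo's exact model):

import torch
from torch_geometric.nn import GAE, GCNConv

class Encoder(torch.nn.Module):
    def __init__(self, in_dim=768, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, latent_dim)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

model = GAE(Encoder())
# one training step: encode, then score reconstruction of the observed edges
#   z = model.encode(x, edge_index)
#   loss = model.recon_loss(z, edge_index)
# downstream: cluster z with sklearn's AgglomerativeClustering,
# n_clusters = number of unique labels for the task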

Bible validation

I validated the DSS results against the Hebrew Bible dataset (same chunking technique, embeddings, etc.). The Bible results line up with the DSS results, which is encouraging.

(figure: Bible validation results)


Tasks:

  • Posts:
    • Easy implementation of a GNN in supervised and unsupervised settings.
    • How to use a GNN for text classification with different adjacency matrices.
