This repository contains the code for pre-training a BERT-base model on a large, un-annotated text corpus using dynamic Masked Language Modeling (MLM) and dynamic Next Sentence Prediction (NSP). All settings in this repository are configured to replicate the pre-training procedure of InLegalBERT. For details, please refer to our paper "Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law".
data_helpers.py: Custom data loader and chunking operations.
model.py: Implementation of BertForPreTraining.
training.py: Custom trainer and metrics.
run.py: Main code for running the model.
sample.jsonl: Sample document for the dataset format.
There must be two files, "train.jsonl" and "test.jsonl", in the main directory. Each line in these files must contain a JSON dictionary, similar to "sample.jsonl". The keys of the dictionary are:
id: String // identifier
title: String // title of the case
source: String // court where the case was heard
text: List[String] // main document text kept as a list of sentences
This format expects the 'text' field to be a list of sentences. If your data is not pre-split and the text is a single string, wrap that string in a list (see the example below). You may use a different dataset schema by modifying run.py:39.
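For reference, here is a minimal sketch (in Python) of how one line of "train.jsonl" could be written. The field values are purely illustrative; only the keys follow the format above.

```python
import json

# Hypothetical record; only the keys match the required format,
# the values are made up for illustration.
record = {
    "id": "case_0001",
    "title": "Example Appellant vs. Example Respondent",
    "source": "Supreme Court of India",
    "text": [
        "First sentence of the judgment.",
        "Second sentence of the judgment.",
    ],
}

# If the document text is a single un-split string, wrap it in a list:
# record["text"] = [full_text_string]

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```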
The main settings are located at run.py:16:
SOURCE_PATH: Source model/tokenizer to start from. Can be a local folder containing the relevant files or a model repository on the HuggingFace Hub.
OUTPUT_PATH: All relevant outputs will be stored in "Output/<OUTPUT_PATH>"
CACHE_PATH: Path to the cache directory
FROM_SCRATCH (True/False): Whether to train from scratch or to start from the existing checkpoint at "<SOURCE_PATH>".
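As a rough illustration, the settings block near run.py:16 might look like the following; the values here are assumptions, not the repository's defaults.

```python
# Illustrative values only; the actual defaults near run.py:16 may differ.
SOURCE_PATH = "bert-base-uncased"        # local folder or HuggingFace repo (example value)
OUTPUT_PATH = "inlegalbert-pretraining"  # outputs go to Output/<OUTPUT_PATH>
CACHE_PATH = "cache/"                    # cache directory
FROM_SCRATCH = False                     # False: continue from the checkpoint at SOURCE_PATH
```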
Other hyperparameters are located at run.py:91.
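To give a sense of what these look like, below is a sketch of the kind of training hyperparameters typically configured through HuggingFace transformers; the actual parameter names and values at run.py:91 may differ.

```python
from transformers import TrainingArguments

# Hypothetical hyperparameter values, for illustration only.
training_args = TrainingArguments(
    output_dir="Output/inlegalbert-pretraining",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=1000,
    save_steps=10000,
    evaluation_strategy="steps",
    eval_steps=10000,
)
```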
Set up the parameters, hyperparameters, and all relevant files, then run:
python run.py
Libraries and versions used:
python=3.9.7
torch=1.10.2
transformers=4.17.0
datasets=2.4.0
pyarrow=7.0.0
sklearn=1.0.2
tqdm=4.64.0
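If you use pip, one possible way to install matching versions is shown below (note that "sklearn" above corresponds to the scikit-learn package on PyPI):
pip install torch==1.10.2 transformers==4.17.0 datasets==2.4.0 pyarrow==7.0.0 scikit-learn==1.0.2 tqdm==4.64.0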