BERT Error Detection for STPA (BEDS) is a machine-learning pipeline that assists system analysts in performing the first step of the System-Theoretic Process Analysis (STPA) hazard analysis technique. BEDS was trained using the BERT language model and specializes in detecting writing errors in sentences that do not follow the guidelines presented in the STPA Handbook.
The pipeline has four steps:
- (Optional) The first step takes an unlabeled sentence and classifies it as a Loss, Hazard, or Constraint;
- The second step classifies the sentence as correct or incorrect, based on the examples given in the STPA Handbook;
- The third step identifies the category of error in each sentence flagged as incorrect in the previous step;
- The fourth step uses a sentence similarity model to suggest corrections for the incorrect sentences, drawn from a list of verified sentences.
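The control flow of these four steps can be sketched as follows. This is a minimal illustration and not the code from the notebooks: `run_pipeline`, the classifier callables, and the `verified` dictionary are hypothetical names, the output strings `"correct"` and `"incorrect"` are assumed, and a simple token-overlap (Jaccard) score stands in for the sentence similarity model.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: each trained model is represented by a plain
# callable that maps a sentence to a predicted string.
Classifier = Callable[[str], str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap score, a stand-in for a real sentence similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def run_pipeline(
    sentence: str,
    label: Optional[str],
    label_clf: Classifier,
    validity_clf: Classifier,
    error_clf: Classifier,
    verified: Dict[str, List[str]],
    top_k: int = 3,
) -> dict:
    # Step 1 (optional): only unlabeled input needs a predicted class.
    if label is None:
        label = label_clf(sentence)
    result = {"sentence": sentence, "label": label}

    # Step 2: correct / incorrect decision (output strings are assumed).
    result["validation"] = validity_clf(sentence)
    if result["validation"] == "correct":
        return result

    # Step 3: category of error, only for sentences flagged as incorrect.
    result["error"] = error_clf(sentence)

    # Step 4: rank verified sentences of the same class by similarity.
    candidates = verified.get(label, [])
    ranked = sorted(candidates, key=lambda s: jaccard(sentence, s), reverse=True)
    result["suggestions"] = ranked[:top_k]
    return result


# Toy stand-ins so the sketch runs without any trained model. The rule
# "fewer than four words is incorrect" is arbitrary and purely illustrative.
toy_label_clf = lambda s: "Hazard"
toy_validity_clf = lambda s: "incorrect" if len(s.split()) < 4 else "correct"
toy_error_clf = lambda s: "error-category-placeholder"

# Illustrative verified sentences, not taken from the dataset.
verified = {
    "Hazard": [
        "Aircraft violate minimum separation standards in flight",
        "Aircraft airframe integrity is lost",
    ]
}

out = run_pipeline(
    "Aircraft separation violated", None,
    toy_label_clf, toy_validity_clf, toy_error_clf, verified,
)
print(out["label"], out["validation"], out["suggestions"][0])
```

Passing the models in as callables keeps the sketch independent of any particular library; in practice each callable would wrap one of the fine-tuned models.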
Two Python notebooks are available in this repository:
- BEDS_Pipeline_Fine_tuning_and_Evaluation is the code used to manipulate the dataset and train all the ML models of the pipeline;
- BEDS_Pipeline_Execution_example is the functional example of the pipeline.
To experiment with BEDS, you should use BEDS_Pipeline_Execution_example.ipynb:
- Uncomment and install the required libraries;
- Prepare your input based on the examples given in this repository ("input_example labeled.csv" or "input_example unlabeled.csv");
- Choose the input type: "labeled" or "unlabeled";
- Run all cells sequentially.
This dataset contains textual sentences generated and used during the first step of the System-Theoretic Process Analysis (STPA) hazard analysis technique, called "defining the purpose of the analysis". In this step, three security aspects of the system are defined:
- Losses involve something of value to stakeholders, such as human life, equipment, or mission, whose loss is unacceptable;
- System-Level Hazards are system states or conditions that, together with a set of worst-case environmental conditions, will lead to a loss;
- System-Level Constraints are the system conditions or behaviors that need to be satisfied to prevent hazards.
This dataset was created by extracting sentences found in presentations from the Annual MIT STAMP Workshop. The presentations are from 2012 to 2023.
The dataset is distributed as a ".csv" file. In Python, it can be loaded with the pandas library:

    import pandas as pd
    df = pd.read_csv(r'/[PATH]/stpa-dataset.csv')

This dataset contains 9 columns that organize the collected data.
- "sentence": The extracted sentence from the presentation;
- "label": The corresponding label of the sentence (Loss, Hazard, or Constraint);
- "validation": Indicates whether the sentence is correct or incorrect;
- "error": Indicates the type of error in incorrect sentences;
- "domain": Domain of the presentation;
- "year": Year of the presentation;
- "title": Title of the presentation;
- "url": URL of the presentation;
- "slide": The number of the slide from which the sentence was extracted.
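As a rough illustration of how these columns might be used, the sketch below selects sentences by class and validation status. The rows are made up for the example, the helper `select_sentences` is hypothetical, and the values stored in the "validation" and "error" columns (here "correct", "incorrect", and a placeholder error name) are assumptions that should be checked against the actual file.

```python
import pandas as pd

# Tiny illustrative frame reusing the dataset's column names. The rows and
# the "validation"/"error" values are invented for this sketch.
df = pd.DataFrame(
    {
        "sentence": [
            "Loss of human life or injury",
            "Aircraft too close",
            "Aircraft violate minimum separation standards in flight",
        ],
        "label": ["Loss", "Hazard", "Hazard"],
        "validation": ["correct", "incorrect", "correct"],
        "error": [None, "error-category-placeholder", None],
        "year": [2015, 2019, 2023],
    }
)


def select_sentences(frame: pd.DataFrame, label: str, validation: str) -> pd.DataFrame:
    """Return the rows of one class with a given validation status.

    Comparison is case-insensitive because the exact casing of the stored
    values is not known here.
    """
    mask = (frame["label"].str.lower() == label.lower()) & (
        frame["validation"].str.lower() == validation.lower()
    )
    return frame.loc[mask, ["sentence", "error", "year"]]


incorrect_hazards = select_sentences(df, "hazard", "incorrect")
print(incorrect_hazards)
```

The same helper can be applied to the frame returned by `pd.read_csv` once the stored values have been confirmed.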
Sentences were extracted only from slides that explicitly state the type of sentence (for example, a table listing the system losses and hazards), so the label for each sentence is taken directly from the slide. However, because the presentations contain different numbers of examples of each type, the dataset is unbalanced.
Class | Sentences |
---|---|
loss | 291 |
hazard | 424 |
constraint | 369 |
Total | 1084 |
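
To make the imbalance concrete, the sketch below computes each class's share of the dataset from the counts in the table, together with inverse-frequency class weights. The weighting formula (total / (number of classes × class count)) is a common general heuristic; whether BEDS applies any reweighting is not stated here, so treat it only as an illustration.

```python
# Sentence counts per class, copied from the table above.
counts = {"loss": 291, "hazard": 424, "constraint": 369}

total = sum(counts.values())
n_classes = len(counts)

# Fraction of the dataset that belongs to each class.
shares = {name: n / total for name, n in counts.items()}

# Inverse-frequency weights: rarer classes get a weight above 1,
# more frequent classes a weight below 1.
weights = {name: total / (n_classes * n) for name, n in counts.items()}

for name in counts:
    print(f"{name:<10} share={shares[name]:.3f} weight={weights[name]:.3f}")
```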
This repository was created by Andrey Toshiro Okamura, a graduate student in Computing and Communication Systems at the School of Technology of the State University of Campinas (UNICAMP).