FSTALIGN-37: Add flag in fstalign to allow for case-sensitive testing (
pique0822 authored Oct 19, 2023
1 parent 5c5a150 commit 49ce329
Showing 23 changed files with 405 additions and 152 deletions.
59 changes: 3 additions & 56 deletions README.md
@@ -9,10 +9,7 @@
* [Dependencies](#Dependencies)
* [Build](#Build)
* [Docker](#Docker)
- [Quickstart](#Quickstart)
* [WER Subcommand](#WER-Subcommand)
* [Align Subcommand](#Align-Subcommand)
- [Advanced Usage](#Advanced-Usage)
- [Documentation](#Documentation)

## Overview
`fstalign` is a tool for creating alignment between two sequences of tokens (hereafter referred to as “reference” and “hypothesis”). It has two key functions: computing word error rate (WER) and aligning [NLP-formatted](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md) references with CTM hypotheses.
@@ -75,55 +72,5 @@ For development you can also build the docker image locally using:
docker build . -t fstalign-dev
```

## Quickstart
```
Rev FST Align
Usage: ./fstalign [OPTIONS] [SUBCOMMAND]
Options:
-h,--help Print this help message and exit
--help-all Expand all help
--version Show fstalign version.
Subcommands:
wer Get the WER between a reference and an hypothesis.
align Produce an alignment between an NLP file and a CTM-like input.
```

### WER Subcommand

The wer subcommand is the most frequent usage of this tool. Required are two arguments traditional to WER calculation: a reference (`--ref <file_path>`) and a hypothesis (`--hyp <file_path>`) transcript. Currently the tool is configured to simply look at the file extension to determine the file format of the input transcripts and parse accordingly.

| File Extension | Reference Support | Hypothesis Support |
| ----------- | ----------- | ----------- |
| `.ctm` | :white_check_mark: | :white_check_mark: |
| `.nlp` | :white_check_mark: | :white_check_mark: |
| `.fst` | :white_check_mark: | :white_check_mark: |
| All other file extensions, assumed to be plain text | :white_check_mark: | :white_check_mark: |

Basic Example:
```
ref.txt
this is the best sentence
hyp.txt
this is a test sentence
./bin/fstalign wer --ref ref.txt --hyp hyp.txt
```

When run, fstalign will dump a log to STDOUT with summary WER information at the bottom. For the above example:
```
[+++] [20:37:10] [fstalign] done walking the graph
[+++] [20:37:10] [wer] best WER: 2/5 = 0.4000 (Total words in reference: 5)
[+++] [20:37:10] [wer] best WER: INS:0 DEL:0 SUB:2
[+++] [20:37:10] [wer] best WER: Precision:0.600000 Recall:0.600000
```

Note that in addition to general WER, the insertion/deletion/substitution breakdown is also printed. fstalign also has other useful outputs, including a JSON log for downstream machine parsing, and a side-by-side view of the alignment and errors generated. For more details, see the [Outputs](https://github.com/revdotcom/fstalign/blob/develop/docs/Advanced-Usage.md#outputs) section in the [Advanced Usage](https://github.com/revdotcom/fstalign/blob/develop/docs/Advanced-Usage.md) doc.

### Align Subcommand
Usage of the `align` subcommand is almost identical to the `wer` subcommand. The exception is that `align` can only be run if the provided reference is a NLP and the provided hypothesis is a CTM. This is because the core function of the subcommand is to align an NLP without timestamps to a CTM that has timestamps, producing an output of tokens from the reference with timings from the hypothesis.

## Advanced Usage
See [the advanced usage doc](https://github.com/revdotcom/fstalign/blob/develop/docs/Advanced-Usage.md) for more details.
## Documentation
For more information on how to use `fstalign`, see our [documentation](https://github.com/revdotcom/fstalign/blob/develop/docs/Usage.md).
132 changes: 121 additions & 11 deletions docs/Advanced-Usage.md → docs/Usage.md
@@ -1,17 +1,78 @@
# Documentation
## Table of Contents
* [Quickstart](#quickstart)
* [Subcommands](#subcommands)
* [`wer`](#wer)
* [`align`](#align)
* [Inputs](#inputs)
* [CTM](#ctm)
* [NLP](#nlp)
* [FST](#fst)
* [Synonyms](#synonyms)
* [Normalizations](#normalizations)
* [WER Sidecar](#wer-sidecar)
* [Text Transforms](#text-transforms)
* [use-punctuation](#use-punctuation)
* [use-case](#use-case)
* [Outputs](#outputs)
* [Text Log](#text-log)
* [Side-by-side](#sbs)
* [JSON Log](#json-log)
* [Aligned NLP](#nlp-1)
* [Advanced Usage](#advanced-usage)

In this document, we outline the functions of `fstalign` and the features that make this tool unique. Please feel free to start an issue if any of this documentation is lacking / needs further clarification.

## Quickstart
```
Rev FST Align
Usage: ./fstalign [OPTIONS] [SUBCOMMAND]
Options:
-h,--help Print this help message and exit
--help-all Expand all help
--version Show fstalign version.
Subcommands:
wer Get the WER between a reference and an hypothesis.
align Produce an alignment between an NLP file and a CTM-like input.
```
## Subcommands
### `wer`

The `wer` subcommand is the most common way to use this tool. It requires the two arguments traditional to WER calculation: a reference (`--ref <file_path>`) and a hypothesis (`--hyp <file_path>`) transcript. Currently, the tool determines the format of each input transcript from its file extension and parses it accordingly.

| File Extension | Reference Support | Hypothesis Support |
| ----------- | ----------- | ----------- |
| `.ctm` | :white_check_mark: | :white_check_mark: |
| `.nlp` | :white_check_mark: | :white_check_mark: |
| `.fst` | :white_check_mark: | :white_check_mark: |
| All other file extensions, assumed to be plain text | :white_check_mark: | :white_check_mark: |
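
The table above amounts to a dispatch on the file extension. A minimal sketch of that idea follows; the `InputFormat` enum and the `FileExtension` and `DetectFormat` helpers are hypothetical names for illustration, not fstalign's own API, and the exact-match (case-sensitive) comparison is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical enum for illustration; fstalign's internal names may differ.
enum class InputFormat { kCtm, kNlp, kFst, kPlainText };

// Returns the extension of `path` including the leading dot, or "" if the
// file name has no extension.
std::string FileExtension(const std::string& path) {
  assert(!path.empty());
  const std::size_t slash = path.find_last_of("/\\");
  const std::size_t dot = path.find_last_of('.');
  const bool no_dot = dot == std::string::npos;
  // A dot that appears before the last path separator belongs to a directory.
  const bool dot_in_directory = slash != std::string::npos && !no_dot && dot < slash;
  return (no_dot || dot_in_directory) ? std::string() : path.substr(dot);
}

// Maps a path to a format. Assumption: extensions are compared exactly.
InputFormat DetectFormat(const std::string& path) {
  const std::string ext = FileExtension(path);
  if (ext == ".ctm") return InputFormat::kCtm;
  if (ext == ".nlp") return InputFormat::kNlp;
  if (ext == ".fst") return InputFormat::kFst;
  return InputFormat::kPlainText;  // every other extension is treated as text
}
```

Under this sketch, `DetectFormat("ref.txt")` and a path with no extension both fall through to plain text, matching the last row of the table.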

Basic Example:
```
ref.txt
this is the best sentence
hyp.txt
this is a test sentence
./bin/fstalign wer --ref ref.txt --hyp hyp.txt
```

When run, fstalign will dump a log to STDOUT with summary WER information at the bottom. For the above example:
```
[+++] [20:37:10] [fstalign] done walking the graph
[+++] [20:37:10] [wer] best WER: 2/5 = 0.4000 (Total words in reference: 5)
[+++] [20:37:10] [wer] best WER: INS:0 DEL:0 SUB:2
[+++] [20:37:10] [wer] best WER: Precision:0.600000 Recall:0.600000
```

Note that in addition to general WER, the insertion/deletion/substitution breakdown is also printed. fstalign also has other useful outputs, including a JSON log for downstream machine parsing and a side-by-side view of the alignment and errors generated. For more details, see the [Outputs](#outputs) section below.
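
To make the numbers in the log concrete, here is a rough sketch of how a WER of 2/5 with an INS/DEL/SUB breakdown can be derived for the example sentences. It uses a plain Levenshtein dynamic program over whitespace-separated tokens, which is an assumption made for illustration: fstalign itself aligns via FSTs, and the `WerCounts`, `SplitOnWhitespace`, and `CountErrors` names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct WerCounts {
  int ins = 0;
  int del = 0;
  int sub = 0;
  int ref_words = 0;
  // WER = (insertions + deletions + substitutions) / reference length.
  double Wer() const {
    return ref_words == 0 ? 0.0 : static_cast<double>(ins + del + sub) / ref_words;
  }
};

std::vector<std::string> SplitOnWhitespace(const std::string& text) {
  std::istringstream in(text);
  std::vector<std::string> tokens;
  for (std::string tok; in >> tok;) tokens.push_back(tok);
  return tokens;
}

WerCounts CountErrors(const std::vector<std::string>& ref,
                      const std::vector<std::string>& hyp) {
  const std::size_t n = ref.size();
  const std::size_t m = hyp.size();
  // cost[i][j] = minimum edits turning ref[0..i) into hyp[0..j).
  std::vector<std::vector<int>> cost(n + 1, std::vector<int>(m + 1, 0));
  for (std::size_t i = 1; i <= n; ++i) cost[i][0] = static_cast<int>(i);
  for (std::size_t j = 1; j <= m; ++j) cost[0][j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= n; ++i) {
    for (std::size_t j = 1; j <= m; ++j) {
      const int diagonal = cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      cost[i][j] = std::min({diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1});
    }
  }
  // Walk back from the corner to attribute each edit to a category.
  WerCounts counts;
  counts.ref_words = static_cast<int>(n);
  std::size_t i = n;
  std::size_t j = m;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1)) {
      if (ref[i - 1] != hyp[j - 1]) ++counts.sub;
      --i;
      --j;
    } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
      ++counts.del;  // reference word with no hypothesis counterpart
      --i;
    } else {
      ++counts.ins;  // hypothesis word with no reference counterpart
      --j;
    }
  }
  assert(counts.ins + counts.del + counts.sub == cost[n][m]);
  return counts;
}
```

For `this is the best sentence` against `this is a test sentence`, the two mismatched positions (`the`/`a` and `best`/`test`) are counted as substitutions, giving 2 errors over 5 reference words.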

### `align`
Usage of the `align` subcommand is almost identical to the `wer` subcommand. The exception is that `align` can only be run if the provided reference is an NLP and the provided hypothesis is a CTM. This is because the core function of the subcommand is to align an NLP without timestamps to a CTM that has timestamps, producing an output of tokens from the reference with timings from the hypothesis.
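
As a rough illustration of the last step only (copying timings onto reference tokens once an alignment exists), consider the sketch below. It assumes the alignment is already available as one hypothesis index per reference token, and that each CTM word carries a start time and a duration; the `CtmWord`, `TimedToken`, and `TransferTimings` names are hypothetical and do not reflect fstalign's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of a CTM row: a word with start and duration.
struct CtmWord {
  std::string word;
  double start_sec;
  double duration_sec;
};

// Reference token annotated with timings borrowed from the hypothesis.
struct TimedToken {
  std::string token;
  bool has_timing;
  double start_sec;
  double end_sec;
};

// aligned_hyp_index[i] is the hypothesis word aligned to reference token i,
// or -1 when the reference token has no counterpart (a deletion).
std::vector<TimedToken> TransferTimings(const std::vector<std::string>& ref_tokens,
                                        const std::vector<CtmWord>& hyp_words,
                                        const std::vector<int>& aligned_hyp_index) {
  assert(aligned_hyp_index.size() == ref_tokens.size());
  std::vector<TimedToken> out;
  out.reserve(ref_tokens.size());
  for (std::size_t i = 0; i < ref_tokens.size(); ++i) {
    TimedToken timed{ref_tokens[i], false, 0.0, 0.0};
    const int h = aligned_hyp_index[i];
    if (h >= 0 && static_cast<std::size_t>(h) < hyp_words.size()) {
      const CtmWord& w = hyp_words[static_cast<std::size_t>(h)];
      timed.has_timing = true;
      timed.start_sec = w.start_sec;
      timed.end_sec = w.start_sec + w.duration_sec;
    }
    out.push_back(timed);
  }
  return out;
}
```

Reference tokens without an aligned hypothesis word keep `has_timing == false` in this sketch; how fstalign fills such gaps is not covered here.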


## Inputs
### CTM
@@ -46,7 +107,7 @@ must also be disabled with `--disable-approx-alignment`.
### Synonyms
Synonyms allow for reference words to be equivalent to similar forms (determined by the user) for error counting. They are accepted for any input formats and passed into the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.

In addition to allowing for custom synonyms to be passed in via CLI, fstalign also automatically generates synonyms based on the reference and hypothesis text. Currently, it does this for two cases: cutoff words (hello-) and compound hyphenated words (long-term). In both cases, a synonym is dynamically generated with the hyphen removed. Both of these synonym types can be disabled through the CLI by passing in `--disable-cutoffs` and `--disable-hyphen-ignore`, respectively.
In addition to allowing custom synonyms to be passed in via the CLI, fstalign also automatically generates synonyms based on the reference and hypothesis text. Currently, it does this for three cases: cutoff words (e.g. `hello-`), compound hyphenated words (e.g. `long-term`), and tags or codes that match the regular expression `<.*>` (e.g. `<laugh>`). In the first two cases, a synonym is dynamically generated with the hyphen removed; these two synonym types can be disabled through the CLI by passing in `--disable-cutoffs` and `--disable-hyphen-ignore`, respectively. For tags, `<unk>` is automatically accepted as a valid synonym; currently, this feature cannot be turned off.
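
The three automatic cases can be sketched as a small function that proposes alternative forms for a token. This is only an illustration under stated assumptions: `DynamicSynonyms` and its parameters are hypothetical names, and reading "hyphen removed" as deleting the character (so `long-term` becomes `longterm`) is a guess at the exact form fstalign generates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns automatically generated alternatives for `token`.
// The two flags mirror --disable-cutoffs and --disable-hyphen-ignore.
std::vector<std::string> DynamicSynonyms(const std::string& token,
                                         bool disable_cutoffs = false,
                                         bool disable_hyphen_ignore = false) {
  assert(!token.empty());
  std::vector<std::string> synonyms;

  // Case 3: tags such as <laugh> may always match <unk>.
  static const std::regex kTagPattern("<.*>");
  if (std::regex_match(token, kTagPattern)) {
    if (token != "<unk>") synonyms.push_back("<unk>");
    return synonyms;
  }

  const std::size_t first_hyphen = token.find('-');
  if (first_hyphen == std::string::npos) return synonyms;

  // Case 1: cutoff word, where the only hyphen is the trailing one ("hello-").
  const bool is_cutoff = token.size() > 1 && first_hyphen == token.size() - 1;
  // Case 2: compound word with a hyphen between characters ("long-term").
  const bool is_compound = first_hyphen > 0 && first_hyphen < token.size() - 1;

  if ((is_cutoff && !disable_cutoffs) || (is_compound && !disable_hyphen_ignore)) {
    std::string stripped = token;
    stripped.erase(std::remove(stripped.begin(), stripped.end(), '-'), stripped.end());
    synonyms.push_back(stripped);
  }
  return synonyms;
}
```

Note that the tag branch ignores both flags, which mirrors the statement that the `<unk>` behavior cannot be turned off.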

### Normalizations
Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in a JSON format, with the unique ID as keys. Example to illustrate the schema:
@@ -83,6 +144,40 @@ CLI flag: `--wer-sidecar`
Only usable for NLP format reference files. This passes a [WER sidecar](https://github.com/revdotcom/fstalign/blob/develop/docs//NLP-Format.md#wer-tag-sidecar) file to
add extra information to some outputs. Optional.

## Text Transforms
In this section, we outline transforms that can be applied to input files. These will modify the handling of the files by `fstalign`.
### `use-punctuation`
Adding the `--use-punctuation` flag will treat punctuation from NLP files as individual tokens for `fstalign`. For all other file formats, punctuation is expected to already be separated into its own tokens in the input.

The following files are equivalent with this flag set:

**example.nlp**
```
token|speaker|ts|endTs|punctuation|case|tags|wer_tags
Good|0||||UC|[]|[]
morning|0|||.|LC|['5:TIME']|['5']
Welcome|0|||!|LC|[]|[]
```

**example.txt**
```
good morning . welcome !
```

_Note that when this flag is set, WER measures errors in the words output by the ASR as well as in punctuation._
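
The equivalence between `example.nlp` and `example.txt` can be illustrated with a small sketch that reads the pipe-delimited rows and emits punctuation as separate tokens. It relies only on the column order shown in the header above (token first, punctuation fifth); the helper names are hypothetical, the header row is assumed to have been skipped by the caller, and ASCII lowercasing stands in for the Unicode-aware lowercasing the tool uses.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Splits one NLP row on '|'. Empty fields between pipes are preserved.
std::vector<std::string> SplitPipes(const std::string& line) {
  std::vector<std::string> fields;
  std::istringstream in(line);
  for (std::string field; std::getline(in, field, '|');) fields.push_back(field);
  return fields;
}

// ASCII-only stand-in for a Unicode-aware lowercase function.
std::string AsciiLowercase(std::string text) {
  for (char& c : text) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return text;
}

// Converts data rows (header excluded) into the token sequence to be aligned.
std::vector<std::string> NlpRowsToTokens(const std::vector<std::string>& rows,
                                         bool use_punctuation) {
  const std::size_t kTokenColumn = 0;
  const std::size_t kPunctuationColumn = 4;
  std::vector<std::string> tokens;
  for (const std::string& row : rows) {
    const std::vector<std::string> fields = SplitPipes(row);
    if (fields.empty() || fields[kTokenColumn].empty()) continue;
    tokens.push_back(AsciiLowercase(fields[kTokenColumn]));
    if (use_punctuation && fields.size() > kPunctuationColumn &&
        !fields[kPunctuationColumn].empty()) {
      tokens.push_back(fields[kPunctuationColumn]);  // punctuation as its own token
    }
  }
  return tokens;
}
```

With `use_punctuation` set, the three rows of `example.nlp` yield `good morning . welcome !`; without it, the punctuation column is ignored.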

### `use-case`
Adding the `--use-case` flag will take a word's letter case into consideration. In other words, the same word with different letters capitalized will now be considered a different word. For example, consider the following:

**Ref:** `Hi this is an example`

**Hyp:** `hi THIS iS An ExAmPlE`

Without this flag, `fstalign` considers these two strings equivalent and reports 0 errors. With `--use-case` set, none of these word pairs match, because each pair differs in letter case.

_Note that when this flag is set, WER measures errors in the words output by the ASR, taking letter casing into account._
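
The effect on the example above can be checked with a short sketch that compares the two sentences word by word, with and without case folding. The position-by-position comparison is a simplification that only works here because both sentences have the same length; it is not the alignment fstalign performs, and `CountMismatches` is a hypothetical helper using ASCII-only lowercasing.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only stand-in for a Unicode-aware lowercase function.
std::string AsciiLowercase(std::string text) {
  for (char& c : text) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return text;
}

// Counts positions where the words differ. When use_case is false, both
// sides are lowercased first, so case differences are not counted.
int CountMismatches(const std::vector<std::string>& ref,
                    const std::vector<std::string>& hyp,
                    bool use_case) {
  assert(ref.size() == hyp.size());
  int mismatches = 0;
  for (std::size_t i = 0; i < ref.size(); ++i) {
    const std::string r = use_case ? ref[i] : AsciiLowercase(ref[i]);
    const std::string h = use_case ? hyp[i] : AsciiLowercase(hyp[i]);
    if (r != h) ++mismatches;
  }
  return mismatches;
}
```

For `Hi this is an example` against `hi THIS iS An ExAmPlE`, this gives 0 mismatches without case sensitivity and 5 with it.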


## Outputs

### Text Log
@@ -170,3 +265,18 @@ The “bigrams” and “unigrams” fields are only populated with unigrams and
CLI flag: `--output-nlp`

Writes out the reference [NLP](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md), but with timings provided by a hypothesis CTM. Mostly relevant for the `align` subcommand.

## Advanced Usage
Many of the advanced features of fstalign come from providing [NLP file inputs](#NLP) as the reference. Some of these features include:
- Entity category WER and normalization: based on labels in the NLP file, entities are grouped into classes in the WER output
- For example: if the NLP has `2020|0||||CA|['0:YEAR']|` you will see
```s
[+++] [22:36:50] [approach1] class YEAR         WER: 0/8 = 0.0000
```

- Another useful feature here is normalization, which allows tokens with entity labels to have multiple normalizations accepted as correct by fstalign. This functionality is enabled when the tool is invoked with `--ref-json <path_to_norm_sidecar>` (passed in addition to the `--ref`). This enables something like `2020` to be treated as equivalent to `twenty twenty`. More details on the format of this file are given in the [Inputs](#Inputs) section above. Note that only reference-side normalization is currently supported.

- Speaker-wise WER: since the NLP file contains a speaker column, fstalign logs and output will provide a breakdown of WER by speaker ID if non-null

- Speaker-switch WER: similarly, fstalign will report the error rate of words around a speaker switch
- The window size for the context of a speaker switch can be adjusted with the `--speaker-switch-context <int>` flag. By default this is set to 5.
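
One way to picture the speaker-switch window is to mark every token that lies within a fixed number of tokens of a speaker change. The sketch below does that for a sequence of per-token speaker IDs. The exact window definition used by fstalign is not spelled out here, so treating `context` as the number of tokens taken on each side of the switch is an assumption, and `MarkSpeakerSwitchContext` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Marks tokens near a speaker change. A switch is detected at index i when
// speakers[i] differs from speakers[i - 1]; the `context` tokens before the
// switch and the `context` tokens starting at it are marked.
std::vector<bool> MarkSpeakerSwitchContext(const std::vector<int>& speakers,
                                           int context) {
  assert(context >= 0);
  const int n = static_cast<int>(speakers.size());
  std::vector<bool> in_context(speakers.size(), false);
  for (int i = 1; i < n; ++i) {
    if (speakers[i] == speakers[i - 1]) continue;
    const int begin = std::max(0, i - context);
    const int end = std::min(n, i + context);
    for (int k = begin; k < end; ++k) in_context[k] = true;
  }
  return in_context;
}
```

Errors that fall on marked tokens would then be tallied separately from the overall WER; with the default of 5, up to ten tokens around each switch are covered under this reading.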
18 changes: 12 additions & 6 deletions src/Ctm.cpp
@@ -17,12 +17,16 @@ using namespace fst;
/***************************************
CTM FST Loader Class Start
***************************************/
-CtmFstLoader::CtmFstLoader(vector<RawCtmRecord> &records) : FstLoader() {
+CtmFstLoader::CtmFstLoader(vector<RawCtmRecord> &records, bool use_case) : FstLoader() {
{
mCtmRows = records;
+mUseCase = use_case;
for (auto &row : mCtmRows) {
-std::string lower_cased = UnicodeLowercase(row.word);
-mToken.push_back(lower_cased);
+std::string token = std::string(row.word);
+if (!mUseCase) {
+token = UnicodeLowercase(row.word);
+}
+mToken.push_back(token);
}
}
}
@@ -51,13 +55,15 @@ StdVectorFst CtmFstLoader::convertToFst(const SymbolTable &symbol, std::vector<i
int map_sz = map.size();
for (TokenType::const_iterator i = mToken.begin(); i != mToken.end(); ++i) {
std::string token = *i;
-std::string lower_cased = UnicodeLowercase(token);
+if (!mUseCase) {
+token = UnicodeLowercase(token);
+}
transducer.AddState();

if (map_sz > wc && map[wc] > 0) {
-transducer.AddArc(prevState, StdArc(symbol.Find(lower_cased), symbol.Find(lower_cased), 1.0f, nextState));
+transducer.AddArc(prevState, StdArc(symbol.Find(token), symbol.Find(token), 1.0f, nextState));
} else {
-transducer.AddArc(prevState, StdArc(symbol.Find(lower_cased), symbol.Find(lower_cased), 0.0f, nextState));
+transducer.AddArc(prevState, StdArc(symbol.Find(token), symbol.Find(token), 0.0f, nextState));
}

prevState = nextState;
4 changes: 3 additions & 1 deletion src/Ctm.h
@@ -27,13 +27,15 @@ struct RawCtmRecord {

class CtmFstLoader : public FstLoader {
public:
-CtmFstLoader(std::vector<RawCtmRecord> &records);
+CtmFstLoader(std::vector<RawCtmRecord> &records, bool use_case = false);
~CtmFstLoader();
vector<RawCtmRecord> mCtmRows;
virtual void addToSymbolTable(fst::SymbolTable &symbol) const;
virtual fst::StdVectorFst convertToFst(const fst::SymbolTable &symbol, std::vector<int> map) const;
virtual std::vector<int> convertToIntVector(fst::SymbolTable &symbol) const;
virtual const std::string &getToken(int index) const { return mToken.at(index); }
+private:
+bool mUseCase;
};

class CtmReader {
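
The `use_case = false` default in the `CtmFstLoader` declaration means existing call sites that pass only the records keep their lowercasing behavior. A minimal standalone sketch of that pattern follows; `TokenLoader` is a hypothetical class written for illustration, and ASCII lowercasing stands in for `UnicodeLowercase`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only stand-in for the Unicode-aware lowercase used in the real code.
std::string AsciiLowercase(std::string text) {
  for (char& c : text) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return text;
}

// Hypothetical loader mirroring the constructor change in this commit:
// tokens are lowercased unless the caller opts in to case sensitivity.
class TokenLoader {
 public:
  explicit TokenLoader(const std::vector<std::string>& words, bool use_case = false)
      : use_case_(use_case) {
    for (const std::string& word : words) {
      tokens_.push_back(use_case_ ? word : AsciiLowercase(word));
    }
  }

  const std::string& GetToken(std::size_t index) const { return tokens_.at(index); }

 private:
  bool use_case_;
  std::vector<std::string> tokens_;
};
```

Constructing `TokenLoader` with one argument stores lowercased tokens, while passing `true` as the second argument preserves the original casing.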
2 changes: 2 additions & 0 deletions src/FstLoader.h
@@ -28,11 +28,13 @@ class FstLoader {
const std::string& wer_sidecar_filename,
const std::string& json_norm_filename,
bool use_punctuation,
+bool use_case,
bool symbols_file_included);

static std::unique_ptr<FstLoader> MakeHypothesisLoader(const std::string& hyp_filename,
const std::string& hyp_json_norm_filename,
bool use_punctuation,
+bool use_case,
bool symbols_file_included);

