talk-tag is a tool for automatic morphosyntactic error annotation in transcribed speech.
It adds inline CHAT-compatible error tags to utterances, helping researchers and annotators pre-annotate transcripts for review. The current system follows a subset of the CHAT word-level error coding scheme described in Tools for Analyzing Talk, Part 1: The CHAT Transcription Format (Chapter 18.1).
TalkTag currently annotates:
- morphological errors: `[* m:*]`
- substitution errors (subset of semantic errors in the manual): `[* s:r:*]` and `[* s:r:gc:*]`
It also inserts target reconstructions inline, following CHAT conventions:
- `[: target]` when the produced form is a non-word
- `[= target]` when the produced form is a real word but the intended target should still be recorded
For the current package behavior:
- non-word reconstructions such as `[: went]` are preserved
- real-word reconstructions are converted to `[= target]`
- `[= target]` output is hidden by default and included only when `--show-target` is set
This is intentional: according to the CHAT manual, [= target] is not required
for analysis in the way [: target] is, so TalkTag keeps it optional and
defaults to the cleaner output.
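As a rough illustration of the default described above, the following sketch hides `[= target]` annotations unless they are explicitly requested. The function name `hide_optional_targets` and the regular expression are assumptions made for this example; they are not taken from the TalkTag source.

```python
import re

# Matches an optional real-word target such as " [= walked]",
# including the whitespace in front of it. Hypothetical pattern.
OPTIONAL_TARGET = re.compile(r"\s*\[= [^\]]+\]")


def hide_optional_targets(utterance: str, show_target: bool = False) -> str:
    """Drop [= target] annotations unless show_target is True."""
    if show_target:
        return utterance
    return OPTIONAL_TARGET.sub("", utterance)


line = "Yesterday I walk [= walked] [* m:0ed] to school ."
print(hide_optional_targets(line))
print(hide_optional_targets(line, show_target=True))
```

Non-word reconstructions such as `[: went]` do not match the pattern, so they pass through unchanged in both modes.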
The underlying model was trained before the current manual standardized the
real-word target notation as [= target]. Because of that, raw generations may
still reflect the older [:: target] convention. TalkTag rewrites those cases
to [= target] in post-processing before saving output.
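The rewrite from the older notation can be pictured as a single substitution. This is a minimal sketch under the assumption that the legacy form always looks like `[:: target]`; `normalize_legacy_targets` is a hypothetical name, and the package's actual post-processing may differ.

```python
import re

# Legacy real-word target notation: "[:: target]". Hypothetical pattern.
LEGACY_TARGET = re.compile(r"\[:: ([^\]]+)\]")


def normalize_legacy_targets(utterance: str) -> str:
    """Rewrite legacy [:: target] annotations as [= target]."""
    return LEGACY_TARGET.sub(r"[= \1]", utterance)


raw = "Yesterday me [:: I] [* s:r:gc:pro] walked to school ."
print(normalize_legacy_targets(raw))
```

The pattern requires two colons, so non-word reconstructions written as `[: target]` are left alone.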
The CHAT manual distinguishes these because [: target] lets MOR use "the real
word target" for parsing, whereas the real-word replacement notation lets MOR
use "the actual word produced" while still preserving the target for other CLAN
analyses. See the CHAT manual and the CLAN manual.

    Yesterday I walk [= walked] [* m:0ed] to school .
    Yesterday I goed [: went] [* m:=ed] to school .
    Yesterday me [= I] [* s:r:gc:pro] walked to school .
    Yesterday I went in [= to] [* s:r:prep] school .
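For reviewing annotated output, it can help to pull each produced word, its target, and its error code out of a line like the examples above. The sketch below assumes the layout shown there (word, optional `[: target]` or `[= target]`, then `[* code]`); `Annotation` and `extract_annotations` are names invented for this example.

```python
import re
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    produced: str                # word as transcribed
    target_kind: Optional[str]   # ":" (non-word), "=" (real word), or None
    target: Optional[str]        # reconstructed target, if present
    error_code: str              # e.g. "m:0ed"


# word, optional target reconstruction, then the error tag
ANNOTATED_WORD = re.compile(
    r"(\S+) "
    r"(?:\[(:|=) ([^\]]+)\] )?"
    r"\[\* ([^\]]+)\]"
)


def extract_annotations(utterance: str) -> List[Annotation]:
    """Return every annotated word found in one utterance."""
    return [Annotation(*m.groups()) for m in ANNOTATED_WORD.finditer(utterance)]


for found in extract_annotations("Yesterday I goed [: went] [* m:=ed] to school ."):
    print(found)
```

When `[= target]` output is hidden, the target fields simply come back as `None`.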
See the CHAT Transcription Guidelines.
CHAT error tags are compositional: each part of a tag indicates, from general to fine-grained, the error and its underlying process.
For example, in [* m:0ed], m marks a morphosyntactic error, 0 marks a missing form, and ed marks the past-tense morpheme.
| Level 1 | Meaning |
|---|---|
| `* m:` | morphosyntactic error |

| Level 2 | Meaning |
|---|---|
| `0` | missing regular form |
| `=` | over-regularisation |
| `+` | superfluous marking |
| `++` | double marking |
| `base:` | base for irregular form |
| `irr:` | irregular for base form |
| `sub:` | past/perfective substitution |
| `allo` | allomorphic errors |
| `vsg:` | irregular verb 3SG |
| `vun:` | irregular verb unmarked |

| Level 3 | Meaning |
|---|---|
| `mor` | target morpheme |
| `a` | agreement error |
| `i` | irregular target |

Common level-3 morphemes include: `ed`, `en`, `3s`, `ing`, `s`, `'s`, `er`, and `est`.

In practice, common outputs include:
- `[* m:0ed]` for missing past tense
- `[* m:=ed]` for over-regularised past forms
- `[* m:03s:a]` for missing 3SG agreement marking
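To make the compositional structure concrete, here is a small sketch that splits a morphological tag into the three levels from the tables above. It is an illustration only: `parse_morph_tag` and `MorphTag` are hypothetical names, and treating `a` and `i` as trailing flags is an assumption based on the Level 3 table rather than on the CHAT manual's full grammar.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

LEVEL2 = {
    "0": "missing regular form",
    "=": "over-regularisation",
    "+": "superfluous marking",
    "++": "double marking",
    "base": "base for irregular form",
    "irr": "irregular for base form",
    "sub": "past/perfective substitution",
    "allo": "allomorphic errors",
    "vsg": "irregular verb 3SG",
    "vun": "irregular verb unmarked",
}
LEVEL3_FLAGS = {"a": "agreement error", "i": "irregular target"}

# Longest codes first so that "++" is tried before "+".
_codes = "|".join(re.escape(c) for c in sorted(LEVEL2, key=len, reverse=True))
MORPH_TAG = re.compile(r"^\[\* m:(" + _codes + r"):?([^\]]*)\]$")


@dataclass(frozen=True)
class MorphTag:
    process: str              # level-2 code, e.g. "0"
    meaning: str              # level-2 gloss from the table
    morpheme: Optional[str]   # level-3 target morpheme, e.g. "ed"
    flags: Tuple[str, ...]    # level-3 flags such as "a"


def parse_morph_tag(tag: str) -> Optional[MorphTag]:
    """Split a tag like "[* m:03s:a]" into its parts; None if it is not an m: tag."""
    match = MORPH_TAG.match(tag.strip())
    if match is None:
        return None
    process, rest = match.groups()
    parts = [p for p in rest.split(":") if p]
    morpheme = next((p for p in parts if p not in LEVEL3_FLAGS), None)
    flags = tuple(p for p in parts if p in LEVEL3_FLAGS)
    return MorphTag(process, LEVEL2[process], morpheme, flags)


print(parse_morph_tag("[* m:03s:a]"))
```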
| Level 1 | Meaning |
|---|---|
| `* s:` | substitution error |

| Level 2 | Meaning |
|---|---|
| `r:` | related lexical substitution |
| `r:gc:` | related grammatical substitution |

| Level 3 | Meaning |
|---|---|
| `POS` | target part of speech |

Supported part-of-speech (POS) categories in the paper include `pro` (pronoun), `det` (determiner), and `prep` (preposition).

In practice, common outputs include:
- `[* s:r:gc:pro]` for pronoun substitutions (e.g., possessive for nominative: *her/his/their* for *she/he/they*)
- `[* s:r:prep]` for preposition substitutions (e.g., *he is married **with** (instead of **to**) Maria*)
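Substitution tags can be decomposed in the same way. The sketch below covers only the two shapes listed above (`s:r:POS` and `s:r:gc:POS`); `parse_substitution_tag` is a hypothetical helper, not part of TalkTag.

```python
import re
from typing import Optional, Tuple

POS_LABELS = {"pro": "pronoun", "det": "determiner", "prep": "preposition"}

# "[* s:r:prep]" or "[* s:r:gc:pro]"
SUB_TAG = re.compile(r"^\[\* s:r:(gc:)?(\w+)\]$")


def parse_substitution_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Return (substitution kind, POS label), or None for other tags."""
    match = SUB_TAG.match(tag.strip())
    if match is None:
        return None
    is_grammatical, pos = match.groups()
    kind = "related grammatical" if is_grammatical else "related lexical"
    # Fall back to the raw code for POS values outside the supported set.
    return kind, POS_LABELS.get(pos, pos)


print(parse_substitution_tag("[* s:r:gc:pro]"))
print(parse_substitution_tag("[* s:r:prep]"))
```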
- The current runtime follows a narrow prototype scope and does not cover the full CHAT error inventory.
- The paper's model was developed on children's narrative data from the ENNI corpus under low-resource conditions.
- The most realistic use case is assisted annotation and review of plausible error candidates.
Python requirement: >=3.10.

    pip install "talk-tag[runtime]"

Runtime extras include `torch`, `transformers`, and `peft`.
The current fixed deployment is based on a bnb-4bit Hugging Face model. In
practice, this means:
- CUDA is the preferred accelerated runtime
- CPU is supported as a fallback
- Apple MPS is not supported for this deployment
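The device policy above amounts to "CUDA if available, otherwise CPU". The following sketch expresses that rule; `resolve_device` and `cuda_is_available` are hypothetical helpers written for this README, and TalkTag's own resolution logic may differ. `torch.cuda.is_available()` is the standard PyTorch check.

```python
def cuda_is_available() -> bool:
    """Probe for CUDA without requiring torch to be installed."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested device to the one that would be used.

    "auto" prefers CUDA and falls back to CPU; MPS is never selected.
    """
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested


# On a machine without CUDA (including Apple Silicon), "auto" resolves to CPU.
print(resolve_device("auto", cuda_available=False))
```

In real use, `resolve_device("auto", cuda_is_available())` would combine the two.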
- Check environment:

      talk-tag doctor

- Pull/warm model assets:

      talk-tag model pull --device auto

  On Apple Silicon, `--device auto` will fall back to CPU instead of MPS.

- Run annotation:

      talk-tag annotate \
        --input-dir ./input \
        --output-dir ./output \
        --target-speaker "*CHI" \
        --device auto

Single-file `.cha` example:

    talk-tag annotate \
      --input-path ./input/sample.cha \
      --output-dir ./output \
      --target-speaker "*CHI" \
      --device auto

Show optional real-word reconstructions in the output:

    talk-tag annotate \
      --input-path ./input/sample.cha \
      --output-dir ./output \
      --target-speaker "*CHI" \
      --show-target \
      --device auto

`--show-target` only affects optional real-word reconstructions such as
`[= goes]`. Non-word reconstructions such as `[: went]`, which are needed for
analysis, are preserved either way.
For quick debugging, you can also print only the target utterances that changed:

    talk-tag annotate \
      --input-path ./input/sample.cha \
      --output-dir ./output \
      --target-speaker "*CHI" \
      --limit 5 \
      --print-debug-lines \
      --device auto

This prints changed lines as original/annotated pairs during the run. It does not change the output file content.
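The same kind of original/annotated comparison can be reproduced offline when checking saved output. This sketch assumes the two lists are line-aligned (one annotated utterance per original utterance); `changed_pairs` is a hypothetical helper and not the CLI's implementation.

```python
from typing import Iterable, List, Tuple


def changed_pairs(
    original: Iterable[str], annotated: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (original, annotated) pairs for utterances that differ."""
    return [(before, after) for before, after in zip(original, annotated) if before != after]


original = ["Yesterday I goed to school .", "I like it ."]
annotated = ["Yesterday I goed [: went] [* m:=ed] to school .", "I like it ."]

for before, after in changed_pairs(original, annotated):
    print("original: ", before)
    print("annotated:", after)
```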
If needed, you can also cap inference for quick local checks:

    talk-tag annotate \
      --input-path ./input/sample.cha \
      --output-dir ./output \
      --target-speaker "*CHI" \
      --limit 20 \
      --device auto

When `--limit` is greater than 0, TalkTag still writes the output file. It
simply stops annotation after the first N target utterances and prints a
notice that the limit is active.
- `max_new_tokens = 128`
- `max_seq_length = 512`
- `max_context_chars = 1200`
- `limit = 0` (`0` means no cap; use it as a debug/testing limit on target utterances)
- greedy decoding (`do_sample = false`)
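For reference, the defaults above and the "0 means no cap" rule can be written down as follows. This is a sketch only: the `GenerationDefaults` class and `take_target_utterances` function are hypothetical and simply mirror the listed values, not the package's internal configuration objects.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator


@dataclass(frozen=True)
class GenerationDefaults:
    max_new_tokens: int = 128
    max_seq_length: int = 512
    max_context_chars: int = 1200
    limit: int = 0          # 0 means no cap
    do_sample: bool = False  # greedy decoding


def take_target_utterances(utterances: Iterable[str], limit: int) -> Iterator[str]:
    """Yield only the first `limit` target utterances; 0 yields all of them."""
    if limit > 0:
        return islice(utterances, limit)
    return iter(utterances)


defaults = GenerationDefaults()
sample = ["utt 1", "utt 2", "utt 3"]
print(list(take_target_utterances(sample, defaults.limit)))  # no cap
print(list(take_target_utterances(sample, 2)))               # capped at 2
```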
The CLI currently exposes:
- `--limit` to cap the number of target utterances processed in one run for testing/debugging; output files are still written
- `--print-debug-lines` to print only changed target utterances during a run for quick debugging
- `.cha`
- `.jsonl` (requires `--speaker-field` and `--text-field`)
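A `.jsonl` input holds one JSON object per line, with one field for the speaker and one for the utterance text. The sketch below writes such a file; the field names `speaker` and `text`, the speaker codes, and the file name are assumptions chosen for illustration, since the actual names are whatever you pass via `--speaker-field` and `--text-field`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def write_jsonl(path: Path, rows: Iterable[Dict[str, str]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


# Hypothetical field names and speaker codes.
rows = [
    {"speaker": "*CHI", "text": "Yesterday I walk to school ."},
    {"speaker": "*EXA", "text": "What happened next ?"},
]

out_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(out_path, rows)
print(out_path.read_text(encoding="utf-8"))
```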
The annotate command accepts either:
- `--input-dir` for folder annotation
- `--input-path` for a single `.cha` or `.jsonl` file
Other previously supported formats (.txt, .csv, .json, .xlsx) are rejected in adapter-only deployment mode.
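When scripting runs, the choice between `--input-dir` and `--input-path` can be made from the path itself. The helper below is a hypothetical sketch that only assembles the argument list from flags shown in this README; it does not call TalkTag, and `.jsonl` inputs would additionally need `--speaker-field` and `--text-field`.

```python
from pathlib import Path
from typing import List

SINGLE_FILE_SUFFIXES = {".cha", ".jsonl"}


def build_annotate_command(
    input_location: str,
    output_dir: str,
    target_speaker: str = "*CHI",
    device: str = "auto",
    show_target: bool = False,
) -> List[str]:
    """Assemble a `talk-tag annotate` argument list for subprocess.run."""
    path = Path(input_location)
    # A supported file suffix means single-file mode; anything else is treated as a folder.
    input_flag = "--input-path" if path.suffix in SINGLE_FILE_SUFFIXES else "--input-dir"
    command = [
        "talk-tag", "annotate",
        input_flag, str(path),
        "--output-dir", str(output_dir),
        "--target-speaker", target_speaker,
        "--device", device,
    ]
    if show_target:
        command.append("--show-target")
    return command


print(build_annotate_command("./input/sample.cha", "./output"))
print(build_annotate_command("./input", "./output", show_target=True))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the package is installed.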
See examples/colab_quickstart.ipynb for a minimal setup flow.