talk-tag

talk-tag is a tool for automatic morphosyntactic error annotation in transcribed speech.

It adds inline CHAT-compatible error tags to utterances, helping researchers and annotators pre-annotate transcripts for review. The current system follows a subset of the CHAT word-level error coding scheme described in Tools for Analyzing Talk, Part 1: The CHAT Transcription Format (Chapter 18.1).

What It Annotates

TalkTag currently annotates:

morphological errors: [* m:*]
substitution errors (subset of semantic errors in the manual): [* s:r:*] and [* s:r:gc:*]

It also inserts target reconstructions inline, following CHAT conventions:

[: target] when the produced form is a non-word
[= target] when the produced form is a real word but the intended target should still be recorded

For the current package behavior:

non-word reconstructions such as [: went] are preserved
real-word reconstructions are converted to [= target]
[= target] output is hidden by default and included only when --show-target is set

This is intentional: according to the CHAT manual, [= target] is not required for analysis in the way [: target] is, so TalkTag keeps it optional and defaults to the cleaner output.

The underlying model was trained before the current manual standardized the real-word target notation as [= target]. Because of that, raw generations may still reflect the older [:: target] convention. TalkTag rewrites those cases to [= target] in post-processing before saving output.

The CHAT manual distinguishes these because [: target] lets MOR use "the real word target" for parsing, whereas the real-word replacement notation lets MOR use "the actual word produced" while still preserving the target for other CLAN analyses. See the CHAT manual and the CLAN manual.

Quick Examples

Yesterday I walk [= walked] [* m:0ed] to school .
Yesterday I goed [: went] [* m:=ed] to school . 
Yesterday me [= I] [* s:r:gc:pro] walked to school .
Yesterday I went in [= to] [* s:r:prep] school .

See the CHAT Transcription Guidelines.

Annotation Scheme

Morphological Labels

CHAT error tags are compositional: each part of a tag indicates, from general to fine-grained the error and its underline process. For example, in [* m:0ed], m marks a morphosyntactic error, 0 marks a missing form, and ed marks past morpheme.

Level 1	Meaning
`* m:`	morphosyntactic error
Level 2	Meaning
`0`	missing regular form
`=`	over-regularisation
`+`	superfluous marking
`++`	double marking
`base:`	base for irregular form
`irr:`	irregular for base form
`sub:`	past/perfective substitution
`allo`	allomorphic errors
`vsg:`	irregular verb 3SG
`vun:`	irregular verb unmarked
Level 3	Meaning
`mor`	target morpheme
`a`	agreement error
`i`	irregular target

Common level-3 morphemes include:

ed, en, 3s, ing, s, 's, er, and est.

In practice, common outputs include:

[* m:0ed] for missing past tense
[* m:=ed] for over-regularised past forms
[* m:03s:a] for missing 3SG agreement marking

Substitution Labels

Level 1	Meaning
`* s:`	substitution error
Level 2	Meaning
`r:`	related lexical substitution
`r:gc:`	related grammatical substitution
Level 3	Meaning
`POS`	target part of speech

Supported part-of-speech (POS) in the paper include:

pro(pronoun), det (determiner), and prep (preposition).

In practice, common outputs include:

[* s:r:gc:pro] for pronoun substitutions: possessive for nominative: her/his/their for she/he/they)
[* s:r:prep] for preposition substitutions: e.g., *he is married with (instead of to) Maria

Scope Notes

The current runtime follows a narrow prototype scope and does not cover the full CHAT error inventory.
The paper's model was developed on children's narrative data from the ENNI corpus under low-resource conditions.
The most realistic use case is assisted annotation and review of plausible error candidates.

Install

Python requirement: >=3.10.

pip install "talk-tag[runtime]"

Runtime extras include torch, transformers, and peft.

Runtime support

The current fixed deployment is based on a bnb-4bit Hugging Face model. In practice, this means:

CUDA is the preferred accelerated runtime
CPU is supported as a fallback
Apple MPS is not supported for this deployment

First-run workflow

Check environment:

talk-tag doctor

Pull/warm model assets:

talk-tag model pull --device auto

On Apple Silicon, --device auto will fall back to CPU instead of MPS.

Run annotation:

talk-tag annotate \
  --input-dir ./input \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --device auto

Single-file .cha example:

talk-tag annotate \
  --input-path ./input/sample.cha \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --device auto

Show optional real-word reconstructions in the output:

talk-tag annotate \
  --input-path ./input/sample.cha \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --show-target \
  --device auto

--show-target only affects optional real-word reconstructions such as [= goes]. Non-word reconstructions such as [: went], which are needed for analysis, are preserved either way.

For quick debugging, you can also print only the target utterances that changed:

talk-tag annotate \
  --input-path ./input/sample.cha \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --limit 5 \
  --print-debug-lines \
  --device auto

This prints changed lines as original/annotated pairs during the run. It does not change the output file content.

If needed, you can also cap inference for quick local checks:

talk-tag annotate \
  --input-path ./input/sample.cha \
  --output-dir ./output \
  --target-speaker "*CHI" \
  --limit 20 \
  --device auto

When --limit is greater than 0, TalkTag still writes the output file. It simply stops annotation after the first N target utterances and prints a notice that the limit is active.

Inference defaults

max_new_tokens = 128
max_seq_length = 512
max_context_chars = 1200
limit = 0 (0 means no cap; use it as a debug/testing limit on target utterances)
greedy decoding (do_sample = false)

The CLI currently exposes:

--limit to cap the number of target utterances processed in one run for testing/debugging; output files are still written
--print-debug-lines to print only changed target utterances during a run for quick debugging

Supported runtime inputs

.cha
.jsonl (requires --speaker-field and --text-field)

The annotate command accepts either:

--input-dir for folder annotation
--input-path for a single .cha or .jsonl file

Other previously supported formats (.txt, .csv, .json, .xlsx) are rejected in adapter-only deployment mode.

Colab quickstart

See examples/colab_quickstart.ipynb for a minimal setup flow.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.github		.github
.idea		.idea
docs		docs
src/talk_tag		src/talk_tag
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

talk-tag

What It Annotates

Quick Examples

Annotation Scheme

Morphological Labels

Substitution Labels

Scope Notes

Install

Runtime support

First-run workflow

Inference defaults

Supported runtime inputs

Colab quickstart

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

talk-tag

What It Annotates

Quick Examples

Annotation Scheme

Morphological Labels

Substitution Labels

Scope Notes

Install

Runtime support

First-run workflow

Inference defaults

Supported runtime inputs

Colab quickstart

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages