Unsupervised Text Segmentation and Tokenization

Origin

Original task singnet/language-learning#255

References

Papers

An unsupervised machine learning approach to segmentation of clinician-entered free text, 2007
A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics, 2016
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation, 2020
Wine is not v i n. On the Compatibility of Tokenizations across Languages, 2021
Unsupervised Tokenization Learning, 2022, also https://arxiv.org/abs/2205.11443
Self-tuning hyper-parameters for unsupervised cross-lingual tokenization, 2023

Corpora

Multi-Lingual News from Common Crawl
English
- Brown Corpus
  - http://www.sls.hawaii.edu/bley-vroman/brown_nolines.txt
- Gutenberg Corpus
  - https://www.gutenberg.org/
Russian
- Inventory
  - https://nlpub.ru/%D0%A0%D0%B5%D1%81%D1%83%D1%80%D1%81%D1%8B
  - https://github.com/natasha/corus#usage
- RusAge - text books
  - https://www.kaggle.com/datasets/oldaandozerskaya/fiction-corpus-for-agebased-text-classification
- Twitter, need to extract from SQL
  - http://study.mokoron.com/
- Wiki, need to extract from XML
  - https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/
Chinese (Simplified/Mandarin)
- Lexicon
  - http://www.chineselexicaldatabase.com/download.php
- CLUE
  - https://github.com/brightmart/nlp_chinese_corpus

Links

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655800/
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=6983&context=etd
https://lena-voita.github.io/nlp_course/language_modeling.html
https://en.wikipedia.org/wiki/Perplexity
singnet/language-learning#255
https://medium.com/mlearning-ai/word-embeddings-wordpiece-and-language-agnostic-bert-labse-98c7626878c7
https://github.com/natasha/razdel - razdel tries to mimic segmentation of these 4 datasets: SynTagRus, OpenCorpora, GICRYA and RNC.
https://www.kaggle.com/c/text-normalization-challenge-english-language
https://www.kaggle.com/c/text-normalization-challenge-russian-language

Tasks

Subword segmentation aligned with morphology
- TODO, 20230622
  - conclude on morphology_lexicon_counted_en_ru
  - read https://arxiv.org/pdf/2005.06606.pdf (PROGRESS)
  - use tokenization to learn words -> wordbase
  - use word segmentation to learn subwords
    - for every word, build all possible splits based on known words (wordbase) and unmatched fragments
    - for every split, find the most probable split and add the new parts to the counted partbase
    - list counted parts
    - repeat from the top of the above, counting parts along with words, till no new parts can be found
    - have the wordbase+partbase as subword segmentation base
  - evaluate partbase against suffixes and prefixes
    - languages
      - en
      - ru
  - evaluate subword segmentation scheme against the reference
    - languages
      - en
      - ru
    - with different reference tokenization schemes, tested on
      - ping
      - pings
      - pinging
  - conclude
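The subword learning loop listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pygents implementation: the names `all_splits`, `best_split` and `learn_partbase`, the cap of three parts per word, the log-frequency scoring and the fixed penalty for unmatched fragments are all hypothetical choices.

```python
from collections import Counter
from math import log

def all_splits(word, max_parts=3):
    """Yield every way to cut `word` into 1..max_parts contiguous parts."""
    yield (word,)
    if max_parts == 1:
        return
    for i in range(1, len(word)):
        for rest in all_splits(word[i:], max_parts - 1):
            yield (word[:i],) + rest

def best_split(word, wordbase, base, total, unknown_logp=-20.0):
    """Pick the most probable split that contains at least one known word.

    Known parts are scored by log relative frequency in `base`;
    unmatched fragments get a fixed penalty (an assumed value).
    """
    def score(parts):
        return sum(log(base[p] / total) if p in base else unknown_logp
                   for p in parts)
    candidates = [s for s in all_splits(word)
                  if len(s) > 1 and any(p in wordbase for p in s)]
    return max(candidates, key=score, default=(word,))

def learn_partbase(wordbase, max_rounds=10):
    """Repeat splitting until no new parts appear; return counted parts."""
    partbase = Counter()
    for _ in range(max_rounds):
        base = wordbase + partbase          # count parts along with words
        total = sum(base.values())
        found = Counter()
        for word, count in wordbase.items():
            for part in best_split(word, wordbase, base, total):
                if part not in wordbase:    # keep only the new fragments
                    found[part] += count
        if set(found) <= set(partbase):     # nothing new was discovered
            break
        partbase = found
    return partbase
```

With a toy wordbase holding ping, pings, pinging, sing, sings and singing, this sketch collects `s` and `ing` as the counted parts. Enumerating all splits is exponential in word length, so a real run would likely need a tighter cap or dynamic programming.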
Self-tuning hyperparameters unsupervisedly! - TODO FIX STATUS
- metrics
  - Cross-split F1 (CSF1) on models from split corpora
  - Compression factor C%
  - Normalized Anti-Entropy ~S https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html
- English
  - test 100 (DONE)
  - test 1K (PROGRESS)
  - test 10K (TODO)
- Russian
  - test 100 (DONE)
  - test 1K (PROGRESS)
  - test 10K (TODO)
- Chinese
  - test 100 (TODO)
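Two of the metrics named above can be illustrated with a small sketch. The definitions here are assumptions made for illustration and may differ from the ones used in the papers: `token_f1` compares two tokenizations as multisets of tokens, and `normalized_anti_entropy` is taken as one minus the Shannon entropy of the token distribution divided by its maximum. The entropy term is the same quantity that `scipy.stats.entropy(counts, base=2)` computes; plain `math` is used here to keep the example dependency-free. Compression factor is left out because its exact definition is not given in these notes.

```python
from collections import Counter
from math import log2

def token_f1(predicted, reference):
    """F1 between two tokenizations, treating each as a multiset of tokens."""
    p, r = Counter(predicted), Counter(reference)
    overlap = sum((p & r).values())          # tokens found in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def normalized_anti_entropy(tokens):
    """Assumed definition: 1 - H / H_max over the token frequency distribution."""
    counts = Counter(tokens)
    n = sum(counts.values())
    if len(counts) < 2:
        return 1.0                           # a single token type has zero entropy
    h = -sum(c / n * log2(c / n) for c in counts.values())
    return 1.0 - h / log2(len(counts))
```

Under this reading, a cross-split F1 would be `token_f1` applied to the outputs of two tokenizers trained on different halves of a corpus and run on the same text.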
Check sources of errors (! and ? marks) for English, see if artificial generation of correcting set fixes that
- Add web names and numbers to test set, see how to deal with "contextual tokenization"
...
Beat unsupervised tokenizer (UT) SOTA with semi-supervised tokenizer (SST)
- implement semi-supervised tokenizer (SST), trained on word/character corpus
- pre-train freq model for Tokenization on corpus, including A) words B) individual delimiters, C) generated numbers, D)
  - tokenize based on true lexicon.txt ("curriculum learning" concept), count frequencies of non-present words, see what to do next
When counting smaller ngrams based on p+/p-, denominate them for being part of larger ngrams?
- "inhibit frequencies" (or rather "boost") from higher-order to lower-order?
- https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenMining.ipynb
Explore "surprizeness" measure to split as extension to "freedom"/"uncertainty"!?
Further token/ngram graph analysis and scenario mining for tokenization and morphology extending to sentence segmentation
- tokenize by clustering words in the sentence
  - by gram counts - using MUTUAL INFORMATION!!! (does not work? double-check)
  - merge tokens in a way minimizing "freedom"/"uncertainty" (maximally certain tree or MCT)
Model graph analysis with relationships
- prev/next (sequential)
- part/whole (morphology)
- intention/extension (class hierarchy - vowels and consonants, suffixes and prefixes)

Problems

how to split ending quotes and delimiters away from regular words, while keeping slashes, points and periods that belong to websites and numbers as parts of their tokens?
unsupervised decapitalization/capitalization?
how to decode special chars like '\u200b' from input corpus data (other than just ignoring like we do now)
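For the last problem, one possible direction is to handle such characters by Unicode category instead of one by one. The sketch below is an assumption about what could work, not what the project does: the helper name `strip_format_chars` is hypothetical, and it relies on the fact that '\u200b' (zero width space) belongs to the Unicode "Cf" (format) category.

```python
import unicodedata

def strip_format_chars(text, replacement=""):
    """Replace every Unicode format character (category Cf) in `text`.

    Pass replacement=" " to treat such characters as token boundaries
    instead of silently dropping them.
    """
    return "".join(
        replacement if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )
```

Category Cf also covers characters such as the soft hyphen and the zero width joiner, which carry meaning in some scripts and emoji sequences, so a blanket removal may be too aggressive for multilingual corpora.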

Results

2023 May - Self-tuning subword segmentation (morpho-parsing)

Paper

https://github.com/aigents/pygents/blob/main/docs/2023/evolution_comm_codes_kolonin_2023_2.pdf
https://arxiv.org/abs/2306.02383

Slides

https://github.com/aigents/pygents/blob/main/docs/2023/bottom-up-language-learning-2022.pdf
https://github.com/aigents/pygents/blob/main/docs/2023/inlp-2023.pdf

Conclusions

noticeable (yet not straight) correlation between morpho-parsing F1 score and recall of detection of morpho-units, see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_only.ipynb
word length up to 10 over 7 => saturation at 7, see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_only.ipynb
F1 on morpho-parsing does not exceed 0.42-0.44, see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_only.ipynb
- 0.42 if NOT accounting for word boundaries, N(top F1)=1
- 0.44 if accounting for word boundaries, N(top F1)=2 (adding _ before and after every word and counting them)
direct dependence of F1 on compression factor, see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_only.ipynb
inverse dependence of F1 on anti-entropy (why!?), see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_only.ipynb
limiting training set by word frequency helps to improve morpho-parsing F1 (0.0005-0.001 improves F1 from 0.24 to 0.28 on Table 4 in https://arxiv.org/pdf/2005.06606.pdf ), see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_test.ipynb
applying model compression threshold helps to improve morpho-parsing F1 on small corpus (0.1 improves F1 from 0.24 to 0.27 on Table 4 in https://arxiv.org/pdf/2005.06606.pdf ) but does not render noticeable impact on full lexicon, see https://github.com/aigents/pygents/blob/main/notebooks/nlp/morphology/morphology_lexicon_en_test.ipynb

2023 March - Self-tuning tokenization

Found linear correlation between tokenization F1 score and each of the anti-entropy, compression factor and cross-split F1 score - for English, Russian and Chinese
https://github.com/aigents/pygents/blob/main/docs/2022/clustering-segmentation-2022.pdf
https://arxiv.org/pdf/2303.02427.pdf

2022 May

Reached 0.71-1.0 F1 score across English, Russian and Chinese languages
https://github.com/aigents/pygents/blob/main/docs/2022/unsupervised_segmentation_learning_emnlp2022_582.pdf
https://aclanthology.org/2022.emnlp-main.239/
https://arxiv.org/abs/2205.11443

2022 April

Trained N-gram models with N=7 on different corpora
- https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/Tokenizer-Corpora.ipynb
- Brown (B) - 6M
- Gutenberg Children (GC) - 29M
- Gutenberg Adult (GA) - 140M
- Social Media (SM) - 65M
Explored frequencies on SM corpus
- https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/Tokenizer.ipynb
- All N-grams (n=[1..7])
  - top 1-gram ' ' - gets outstanding score, next are 't' and 'e' (from 'the')
  - top 2-gram 'in' - after ' t' and 'e ' (from 'the')
  - top 3-grams 'the' and 'ing' - along with ' th' and 'he ' (from 'the')
  - top 4-grams ' the', 'the ', 'ing ', ' to ', 'http'
  - top 5-grams ' the ', 'https'
  - top 6/7-grams 'https:', 'https:/'
- Token N-grams based on space-tokenizer
  - top 'the', 'to', 'and', 'a', 'of', ...
- Logarithmic distributions still appear Zipfian
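The kind of character N-gram frequency exploration described above can be reproduced in miniature with a sketch like the following. The function name `char_ngram_counts` is hypothetical and the code is not taken from the notebooks; it only shows the counting step for n=[1..7].

```python
from collections import Counter

def char_ngram_counts(text, max_n=7):
    """Count all character n-grams of length 1..max_n in `text`."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[n][text[i:i + n]] += 1
    return counts

# Example: inspect the most frequent n-grams per order on a tiny sample.
sample = "the cat and the dog and the bird"
for n, counter in char_ngram_counts(sample, max_n=3).items():
    print(n, counter.most_common(3))
```

Plotting rank against frequency on log-log axes for each order is one way to check the Zipfian shape mentioned above.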
Explored models based on different metrics according to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2655800/ using SM corpus only
- Conditional probabilities on N-to-N+1-gram transitions forward p+ and backward p-
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/Tokenizer.ipynb
  - appear correlated with spaces and morphology (both!)
  - also have sums (|) and products (&) across p+ and p- metrics with different N=[1..7] and directions +/-
- Transitional "freedom" (uncertainty) forward p+ and backward p- (on gram-to-char and gram-to-gram basis for different N-s)
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/Tokenizer.ipynb
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenizerTest.ipynb
  - appear more impressively connected with punctuation than p+ or p-
  - also have sums (|) and products (&) across f+ and f- metrics with different N=[1..7] and directions +/- - all appear more impressive than based on p+ and p-
  - also have deviations ddf+ and ddf- capped above zero - appear even more impressively connected with punctuation, so used in tokenization
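The transition metrics above can be sketched on a gram-to-char basis as follows. This is a simplified illustration with assumed definitions: p+ and p- are taken as conditional probabilities of the next or previous character given an N-gram, and "freedom" f+ and f- as the number of distinct characters seen after or before it. The function names are hypothetical, and the ddf+/ddf- deviations are not reproduced because their exact formula is not given in these notes.

```python
from collections import Counter, defaultdict

def transitions(text, n=1):
    """Collect forward and backward gram-to-char transition counts."""
    forward = defaultdict(Counter)    # gram -> counts of the next character
    backward = defaultdict(Counter)   # gram -> counts of the previous character
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if i + n < len(text):
            forward[gram][text[i + n]] += 1
        if i > 0:
            backward[gram][text[i - 1]] += 1
    return forward, backward

def probability(table, gram, char):
    """Conditional probability of `char` given `gram` (p+ or p-)."""
    seen = table.get(gram, {})
    total = sum(seen.values())
    return seen.get(char, 0) / total if total else 0.0

def freedom(table, gram):
    """Number of distinct transitions from `gram` (f+ or f-)."""
    return len(table.get(gram, {}))
```

Even on a text as short as "the cat the dog" the space has a forward freedom of 3 while 'h' has 1, which hints at why freedom peaks tend to line up with token boundaries.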
Explored MI using SM corpus, applied to bigram according to https://arxiv.org/pdf/cmp-lg/9805009.pdf (page 40)
- https://github.com/aigents/pygents/blob/main/notebooks/nlp/Tokenizer.ipynb (see "counts2mis")
- An attempt to cluster tokens based on pointwise mutual information (PMI) did not lead to any promising results
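For reference, PMI over adjacent token pairs can be computed as in the sketch below, using the standard definition log2(p(x,y) / (p(x) p(y))). The function name `bigram_pmi` is hypothetical and this is not the "counts2mis" code from the notebook; probabilities are plain relative frequencies without smoothing.

```python
from collections import Counter
from math import log2

def bigram_pmi(tokens):
    """Return PMI for every adjacent token pair observed in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): log2((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```

Unsmoothed PMI is known to overrate rare pairs, which may be one reason to double-check the negative result above on frequency-filtered counts.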
Tried extended "generated quoted words" lexicon to improve the situation with F1 (detach framing double quotes " from the words)
- Does not help to detach double quotes from words. The Brown corpus has enough quotes adjacent to word letters, but spaces are adjacent to many other punctuation marks and so have much higher f+/f- than double quotes; as a result, double quotes cannot be separated from word letters by a threshold.
Tried Brown (B), Gutenberg Children (GC) and Gutenberg Adult (GA) corpora to train models based on ddf+ or ddf- metrics (top F1 on tokens with no spaces), tested on B corpus https://github.com/aigents/pygents/blob/main/notebooks/nlp/TokenizerTest-Runs.ipynb
- B => F1=0.91 (n=[1,2], t=0.4) - the best (most errors are caused by the inability to detach framing double quotes " from the words)!!!
- GC, GA, GC+GA => F1=0.78 (n=[1], t=0.4-0.8)
- B+GC+GA => F1=0.91 (n=[1,2], t=0.4) - same as on B!
- SM => F1=0.78 (n=[1], t=0.2-0.8)
- B+GC+GA+SM => F1=0.78 (n=[1], t=0.2-0.8) - same as on SM!
Improved the "freedom" models removing the low-frequency "tails" for each of the corpora
- F1=0.99 on Brown (and Brown + Gutenberg Children+Adult) with Brown 10 lines test set
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenizerTest.ipynb
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenizerTest-Runs.ipynb
- F1=0.96 on Brown (and Brown + Gutenberg Children+Adult) with Brown 100 lines test set
  - https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenizerTest-Runs-100.ipynb
Explored freedom-based models with all possible combinations of grams 1-7 for better F1 with larger test set of 100 lines from B
- https://github.com/aigents/pygents/blob/main/notebooks/nlp/tokenization/TokenizerTest-Runs-100.ipynb
- a larger corpus does not give better results: the best is the smallest one (B), adding GC+GA to it does not improve F1 (0.96), and adding SM to it makes it a bit worse (F1=0.93)
