Unsupervised Text Segmentation and Tokenization

Origin

References

Papers

Corpora

Links

Tasks

  • Subword segmentation aligned with morphology

    • TODO, 20230622
      • conclude on morphology_lexicon_counted_en_ru
      • read https://arxiv.org/pdf/2005.06606.pdf (PROGRESS)
      • use tokenization to learn words -> wordbase
      • use word segmentation to learn subwords
        • for every word, build all possible splits based on known words (wordbase) and unmatched fragments
        • for every split, find the most probable split and add the new parts to the counted partbase
        • list counted parts
        • repeat from the top of the above steps, counting parts along with words, until no new parts can be found
        • have the wordbase+partbase as the subword segmentation base (a rough Python sketch of this loop appears after this task list)
      • evaluate partbase against suffixes and prefixes
        • languages
          • en
          • ru
      • evaluate subword segmentation scheme against the reference
        • languages
          • en
          • ru
        • with different reference tokenization schemes, tested on
          • ping
          • pings
          • pinging
      • conclude
  • Self-tuning hyperparameters without supervision! - TODO FIX STATUS

  • Check sources of errors (! and ? marks) for English; see if artificial generation of a correcting set fixes them

    • Add web names and numbers to test set, see how to deal with "contextual tokenization"
  • ...

  • Beat unsupervised tokenizer (UT) SOTA with a semi-supervised tokenizer (SST)

    • implement a semi-supervised tokenizer (SST), trained on a word/character corpus
    • pre-train a freq model for tokenization on a corpus, including A) words, B) individual delimiters, C) generated numbers, D)
      • tokenize based on the true lexicon.txt ("curriculum learning" concept), count frequencies of non-present words, see what to do next (see the lexicon-based sketch after this task list)
  • When counting smaller ngrams based on p+/p-, discount them for being part of larger ngrams?

  • Explore a "surprisal" measure for splitting, as an extension to "freedom"/"uncertainty"!?

  • Further token/ngram graph analysis and scenario mining for tokenization and morphology extending to sentence segmentation

    • tokenize by clustering words in the sentence
      • by gram counts - using MUTUAL INFORMATION!!! (does not work? double-check; see the PMI sketch after this task list)
      • merge tokens in a way minimizing "freedom"/"uncertainty" (maximally certain tree or MCT)
  • Model graph analysis with relationships

    • prev/next (sequential)
    • part/whole (morphology)
    • intension/extension (class hierarchy - vowels and consonants, suffixes and prefixes)
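
The following is a minimal Python sketch of the wordbase/partbase loop described in the subword segmentation task above. The scoring (log relative frequency with a flat penalty for unknown fragments) and all function names are assumptions, not the project's actual implementation.

```python
import math
from collections import Counter

UNKNOWN_LOGP = -10.0  # flat log-score for fragments not yet counted (assumed value)

def best_split(word, base, total):
    """Most probable split of `word` into fragments, scored by log relative
    frequency in `base`. The trivial single-fragment split is excluded so
    the loop below can discover sub-parts."""
    n = len(word)
    score, back = [-math.inf] * (n + 1), [0] * (n + 1)
    score[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            if i == 0 and j == n:
                continue  # skip the whole word as its own split
            frag = word[i:j]
            s = math.log(base[frag] / total) if frag in base else UNKNOWN_LOGP
            if score[i] + s > score[j]:
                score[j], back[j] = score[i] + s, i
    parts, j = [], n
    while j > 0:
        parts.append(word[back[j]:j])
        j = back[j]
    return parts[::-1]

def learn_segmentation_base(wordbase, max_iters=10):
    """Iterate: split every word with the current base, add the parts of the
    best split to the counted partbase, and stop once no new parts appear."""
    base = Counter(wordbase)
    for _ in range(max_iters):
        total = sum(base.values())
        new_parts = Counter()
        for word, count in wordbase.items():
            for part in best_split(word, base, total):
                new_parts[part] += count
        unseen = {p for p in new_parts if p not in base}
        base.update(new_parts)  # count parts along with words
        if not unseen:
            break
    return base  # wordbase + partbase = the subword segmentation base
```

On a toy count such as Counter({'ping': 50, 'pings': 10, 'pinging': 5}), this loop should surface 's' and 'ing' as counted parts, which is what the evaluation against suffixes and prefixes would then inspect.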
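
For the semi-supervised "curriculum learning" step, here is a rough sketch of tokenizing against a trusted lexicon and counting what falls outside it. The file name lexicon.txt comes from the task; the greedy longest-match strategy and the helper names are assumptions.

```python
from collections import Counter

def load_lexicon(path="lexicon.txt"):
    """One known word per line, lowercased for matching (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def curriculum_tokenize(text, lexicon, max_word_len=30):
    """Greedy longest-match tokenization against the lexicon; spans matching
    nothing are emitted as single-character tokens (delimiters, digits, ...)."""
    tokens, i, text_lc = [], 0, text.lower()
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text_lc[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # out-of-lexicon character
            i += 1
    return tokens

def count_out_of_lexicon(tokens, lexicon):
    """Frequencies of tokens absent from the lexicon - the 'what to do next' signal."""
    return Counter(t for t in tokens if t.lower() not in lexicon)
```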
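
And a toy illustration of the mutual-information idea in the sentence-clustering task: estimate PMI for adjacent token pairs and greedily merge the most strongly associated neighbours. The PMI estimate, the merge rule, and the threshold value are assumptions for illustration only.

```python
import math
from collections import Counter

def adjacent_pmi(sentences):
    """Pointwise mutual information of adjacent token pairs, estimated from
    unigram and bigram counts over tokenized sentences (lists of strings)."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        uni.update(sent)
        bi.update(zip(sent, sent[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {
        (a, b): math.log((c / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), c in bi.items()
    }

def merge_by_pmi(sentence, pmi, threshold=3.0):
    """One clustering pass: repeatedly merge the adjacent pair with the highest
    PMI above the threshold. Merged tokens have no PMI entry of their own, so
    chains stop growing after one merge - a deliberate simplification."""
    tokens = list(sentence)
    while len(tokens) > 1:
        scores = [pmi.get(pair, float("-inf")) for pair in zip(tokens, tokens[1:])]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] < threshold:
            break
        tokens[i:i + 2] = [tokens[i] + " " + tokens[i + 1]]
    return tokens
```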

Problems

  • how to split ending quotes and delimiters away from regular words, while keeping the slashes, dots, and periods that are parts of websites and numbers inside the tokens!? (see the regex sketch below this list)
  • unsupervised decapitalization/capitalization?
  • how to decode special chars like '\u200b' from input corpus data (other than just ignoring them, as we do now)
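
The first problem above (peeling quotes and delimiters off words while keeping URLs and numbers intact) can be approached with an ordered regex, sketched below. The pattern and the example are illustrative assumptions, not the project's actual tokenization rules.

```python
import re

# Alternatives are tried in order: web addresses and numbers are matched before
# plain words, so their internal slashes, dots, and colons stay inside one token;
# anything left over (quotes, commas, ?/!) becomes a single-character token.
TOKEN_RE = re.compile(r"""
      (?:https?://|www\.)[^\s"'<>()]+   # web addresses: keep slashes and dots
    | \d+(?:[.,:]\d+)*                  # numbers, dates, times with internal . , :
    | \w+(?:[-']\w+)*                   # regular words, allowing hyphens/apostrophes
    | \S                                # any other single character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

# Expected behaviour on a mixed example:
# tokenize('See "www.example.com/page", it scored 3.5 points!')
# -> ['See', '"', 'www.example.com/page', '"', ',', 'it', 'scored', '3.5', 'points', '!']
```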

Results

2023 May - Self-tuning subword segmentation (morpho-parsing)

Paper

Slides

Conclusions

2023 March - Self-tuning tokenization

2022 May

2022 April