Copilot AI commented Feb 5, 2026

What do these changes do

Replaces the Thai National Corpus (TNC) word frequencies in the spell checker with the Phupha dataset, filtered by the Royal Society of Thailand (ORST) word list.

What was wrong

The TNC word frequency dictionary reflects real-world language usage, including misspelled words, so a spell checker built on this data could suggest incorrect spellings.

How this fixes it

Added Phupha corpus module:

  • pythainlp/corpus/phupha.py - word frequency accessor functions
  • pythainlp/corpus/phupha_word_freqs.txt - 62,264 words from Common Crawl Corpus
  • Dataset: Phupha v1.0 (CC0 license)
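
As a rough illustration of what a word frequency accessor might do, the sketch below parses `word<TAB>count` lines into `(word, count)` pairs. The tab-separated layout and the helper name `parse_word_freqs` are assumptions made for this example; the actual format of `phupha_word_freqs.txt` and the code in `phupha.py` may differ.

```python
from typing import Iterable, List, Tuple


def parse_word_freqs(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Parse 'word<TAB>count' lines (an assumed format) into (word, count) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        if not count.isdigit():
            continue  # skip malformed lines instead of raising
        pairs.append((word, int(count)))
    return pairs


# Invented sample lines, not taken from the real dataset
sample = ["ปัญหา\t120", "", "no-count-here", "ปลา\t45"]
print(parse_word_freqs(sample))  # [('ปัญหา', 120), ('ปลา', 45)]
```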

Modified spell checker:

  • pythainlp/spell/pn.py - filters Phupha by thai_orst_words at initialization
  • Result: 38,160 Phupha words are in ORST; 35,840 remain after the min_freq/min_len filters
  • Full Phupha dataset remains available for other uses
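
The filtering step described above can be sketched as follows. This is a minimal illustration on invented toy data, not the code in `pythainlp/spell/pn.py`: the function name `filter_word_freqs`, the sample counts, and the threshold defaults are all assumptions.

```python
from typing import Dict, Iterable, Set, Tuple


def filter_word_freqs(
    word_freqs: Iterable[Tuple[str, int]],
    allowed_words: Set[str],
    min_freq: int = 2,  # illustrative default, not confirmed by the PR
    min_len: int = 2,   # illustrative default, not confirmed by the PR
) -> Dict[str, int]:
    """Keep only allowed words that meet the frequency and length thresholds."""
    return {
        word: freq
        for word, freq in word_freqs
        if word in allowed_words and freq >= min_freq and len(word) >= min_len
    }


# Toy stand-ins for the Phupha frequencies and the ORST word list
phupha_sample = [("ปัญหา", 120), ("ปัญชา", 3), ("ปลา", 45), ("ก", 900), ("อินทรี", 1)]
orst_sample = {"ปัญหา", "ปลา", "ก", "อินทรี"}

result = filter_word_freqs(phupha_sample, orst_sample)
print(result)  # {'ปัญหา': 120, 'ปลา': 45}
```

In this toy run, "ปัญชา" is dropped because it is not in the allowed set, "ก" because it is shorter than `min_len`, and "อินทรี" because its count is below `min_freq`.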

Example:

from pythainlp.spell import NorvigSpellChecker

checker = NorvigSpellChecker()  # Now uses Phupha filtered by ORST
checker.correct("ปัญชา")  # Returns: "ปัญหา" (correct spelling)

Tests:

  • test_phupha() - verifies corpus module functions
  • test_issue_680_orst_filtering() - validates ORST filtering behavior
  • Updated test_norvig_spell_checker() - checks dictionary size constraints

Limitation: Compound words that are not in ORST (e.g., "ปลาอินทรีย์") won't be suggested. This is expected when suggestions are restricted to an authoritative dictionary.
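
To see why a word outside the dictionary can never be suggested, consider this minimal sketch of the final ranking step in a Norvig-style checker. The helper `rank_known` and the toy dictionary with its counts are hypothetical; they only illustrate the principle and are not the library's implementation.

```python
from typing import Dict, Iterable, List


def rank_known(candidates: Iterable[str], dictionary: Dict[str, int]) -> List[str]:
    """Return the candidates found in the dictionary, most frequent first."""
    known = [c for c in candidates if c in dictionary]
    return sorted(known, key=lambda w: dictionary[w], reverse=True)


# Toy dictionary with invented counts; the compound word is deliberately absent
toy_dictionary = {"ปลา": 45, "อินทรี": 7}
candidates = ["ปลาอินทรีย์", "อินทรี", "ปลา"]

print(rank_known(candidates, toy_dictionary))  # ['ปลา', 'อินทรี']
```

Because candidates are intersected with the dictionary before ranking, any word missing from the filtered list is discarded regardless of how plausible it looks.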

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details the original issue you should resolve

<issue_title>Add Phupha: Thai Word Frequency Dataset</issue_title>
<issue_description>Add Phupha: Thai Word Frequency Dataset to corpus in PyThaiNLP and word frequency.

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 4 commits February 5, 2026 05:35
- Created filtered Phupha dataset with only thai_orst_words (38,160 words)
- Added pythainlp/corpus/phupha.py module with word_freqs() and unigram_word_freqs()
- Updated pythainlp/spell/pn.py to use Phupha instead of TNC
- Updated documentation to reference Phupha dataset

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Changed to use full Phupha dataset (62,264 words) in corpus file
- Added filtering logic in pythainlp/spell/pn.py to filter by thai_orst_words
- This allows the full Phupha dataset to be available for other uses
- Updated tests to verify filtering works correctly
- Spell checker now filters 38,160 ORST words from full Phupha dataset

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Added test_issue_680_orst_filtering test case
- Documents that compound words not in ORST won't be suggested
- Verifies spell checker only uses ORST words
- Tests specific case from issue #680: ปลาอินทรีย์ -> ปลาอินทรี
- Explains expected behavior when words aren't in ORST dictionary

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
- Changed 'is A Thai' to 'is a Thai' (lowercase 'a')
- Fixed in module docstring and both function docstrings
- Addressed code review feedback

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

sonarqubecloud bot commented Feb 5, 2026

Copilot AI changed the title from "[WIP] Add Phupha Thai Word Frequency Dataset to corpus" to "Replace TNC with Phupha word frequency dataset filtered by ORST words" Feb 5, 2026
Copilot AI requested a review from wannaphong February 5, 2026 05:45
wannaphong marked this pull request as ready for review February 5, 2026 08:15
wannaphong added this to the 5.3 milestone Feb 5, 2026
bact added the "enhancement" label Feb 6, 2026
Development

Successfully merging this pull request may close these issues.

  • Add Phupha: Thai Word Frequency Dataset
  • คำสะกดไม่ถูก (Misspelled words)

3 participants