-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cb changes 20250131 #11218
Cb changes 20250131 #11218
Conversation
WalkthroughThis pull request encompasses multiple changes across various LanguageTool language modules, focusing on refining grammatical and punctuation rules, as well as expanding language resources. The modifications include updates to spelling dictionaries, ignore files, disambiguation rules, and grammar configurations in both German and English modules. Key changes involve modifying bracket detection logic to focus solely on parentheses, removing specific test cases related to mismatched brackets, and adding numerous words, names, and technical terms to enhance language processing and spell-checking capabilities. Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)
Line range hint
2-11007
: Well-formed additions to the ignore list.The new entries maintain consistent formatting and include a good mix of proper nouns, technical terms, and compound words. The preservation of case sensitivity and diacritical marks (e.g., "Kantō") is appropriate.
Consider grouping related terms for better maintainability.
For variants like "Kanto" and "Kantō", consider adding a comment to indicate their relationship or grouping them together in the file. This would make future maintenance easier.
# Japanese region names and variants Kanto Kantōlanguagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
17185-17194
: Consider expanding the rule to handle more variations.The current rule might miss common variations of "I/O". Consider adding support for:
- Plural forms (e.g., "I/Os")
- Alternative separators (e.g., "I-O")
- Lowercase variations (e.g., "i/o")
Here's a suggested expansion of the rule:
<rule id="I_O_NO_PRP" name="I in I/O is no PRP"> <pattern> <marker> - <token case_sensitive="yes">I</token> + <token regexp="yes">(?i)I|i</token> </marker> - <token>/</token> + <token regexp="yes">[/-]</token> - <token>O</token> + <token regexp="yes">(?i)O|o</token> + <token regexp="yes">s?</token> </pattern> <disambig action="remove" postag="PRP.*"/> </rule>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java
(1 hunks)languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt
(1 hunks)languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java
(0 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(1 hunks)
💤 Files with no reviewable changes (1)
- languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (13)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)
106880-106883
: LGTM! Good addition of the "for...sake" antipattern.The new antipattern correctly handles common idiomatic expressions like "for heaven's sake", "for goodness sake", etc. The
skip="4"
attribute appropriately allows for variations in these phrases.languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java (1)
236-236
: Verify the impact of bracket handling changes.Let's analyze the usage of this rule across the codebase to understand the full impact of these changes.
Also applies to: 244-244
✅ Verification successful
Bracket handling changes are safely isolated to core implementation
The change to parenthesis handling in
isLeftBracket()
is inherited by all language-specific implementations without any conflicts or overrides. No language-specific customizations of bracket handling were found.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Analyze usage and impact of CommaWhitespaceRule changes echo "1. Finding direct usages of CommaWhitespaceRule:" rg -l "CommaWhitespaceRule" echo -e "\n2. Finding subclasses:" ast-grep --pattern 'class $_ extends CommaWhitespaceRule' echo -e "\n3. Finding test files containing bracket-related assertions:" rg -l "assertMatches.*[\[\{\(\)\}\]].*CommaWhitespaceRule" echo -e "\n4. Finding language-specific configurations:" fd -e xml | xargs rg "COMMA_PARENTHESIS_WHITESPACE"Length of output: 4800
Script:
#!/bin/bash # Check language-specific implementations and tests echo "1. Examining German implementation:" rg -A 5 -B 5 "class GermanCommaWhitespaceRule" languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanCommaWhitespaceRule.java echo -e "\n2. Examining Ukrainian implementation:" rg -A 5 -B 5 "class UkrainianCommaWhitespaceRule" languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/UkrainianCommaWhitespaceRule.java echo -e "\n3. Examining Arabic implementation:" rg -A 5 -B 5 "class ArabicCommaWhitespaceRule" languagetool-language-modules/ar/src/main/java/org/languagetool/rules/ar/ArabicCommaWhitespaceRule.java echo -e "\n4. Examining Persian implementation:" rg -A 5 -B 5 "class PersianCommaWhitespaceRule" languagetool-language-modules/fa/src/main/java/org/languagetool/rules/fa/PersianCommaWhitespaceRule.java echo -e "\n5. Looking for bracket-related tests:" rg -A 5 -B 5 "test.*[\(\)].*CommaWhitespaceRule" languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.javaLength of output: 2690
Script:
#!/bin/bash # Examine core test cases and implementation details echo "1. Looking for bracket-related test cases in CommaWhitespaceRuleTest:" rg -A 2 "[\(\)].*CommaWhitespaceRule" languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java echo -e "\n2. Check for isLeftBracket or similar method overrides:" ast-grep --pattern 'class $_ extends CommaWhitespaceRule { $$$ isLeftBracket($_) { $$$ } $$$ }' echo -e "\n3. Examining the core implementation context:" rg -A 5 -B 5 "isLeftBracket" languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.javaLength of output: 1484
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (4)
66006-66020
: LGTM! Well-structured podcast-related terminology.The compound words follow German word formation rules correctly, with proper gender pairs and inflection markers.
66005-66005
: LGTM! Well-formed adjectives with correct inflection markers.The adjectives follow German compound formation rules and are correctly marked for inflection.
Also applies to: 66021-66022, 66033-66033
66023-66034
: LGTM! Consistent marking of separable prefix verbs.The verbs use underscore consistently to mark separable prefixes, following German verb formation rules.
66035-66036
: LGTM! Complex compounds and adjectives are well-formed.The terms follow German word formation rules with appropriate inflection markers.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1)
696-710
: LGTM! The new words are properly formatted and follow the existing pattern.The additions include correctly formatted scientific and technical terms:
- Biology terms: "four-limbed", "strong-limbed", "limbed", "uroepithelium", "uroepithelia", "uroepithelial"
- Chemistry terms: "isocitrate", "isocitrates", "decarboxylation", "alphaketoglutarate", "alphaketoglutarates"
- Architecture terms: "stylobate", "stylobates"
- Historical terms: "Perizzite", "Perizzites"
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)
3215-3221
: LGTM! The new entries are properly formatted and follow the existing pattern.The additions include correctly formatted names of public figures with appropriate "/S" suffix:
- News presenters: "Susanne Daubner/S", "Constantin Schreiber/S", "Dunja Hayali/S", "Marietta Slomka/S", "Linda Zervakis", "Jan Hofer/S"
- Artists: "Katy Perry/S"
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)
Line range hint
1-1
: Clear and informative header comment.The header comment effectively communicates the file's purpose and auto-generated nature.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)
2587-2595
: LGTM! Rule correctly enforces capitalization of the political party name.The rule appropriately checks for proper capitalization of "Die Linke" (The Left), a German political party, ensuring consistent capitalization in text.
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)
27133-27133
:⚠️ Potential issueRemove the line number that got mixed into the entries.
The number "27133" appears to be a line number that accidentally got included in the spelling dictionary.
-27133
Likely invalid or redundant comment.
27134-27145
: Verify proper names and consider adding common variations.The entries look good, but consider adding these variations for consistency:
- "Boom" as a standalone entry (since "CoreWeave" is included as standalone)
- "Siena" as a standalone entry (since it's part of "Banca Monte dei Paschi di Siena")
Let's verify the accuracy of proper names:
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
17185-17194
: LGTM! The rule correctly prevents "I" in "I/O" from being tagged as a personal pronoun.The implementation uses appropriate case sensitivity and pattern matching to identify the "I/O" context.
@@ -233,15 +233,15 @@ private static boolean isLeftBracket(String str) { | |||
return false; | |||
} | |||
char c = str.charAt(0); | |||
return c == '(' || c == '[' || c == '{'; | |||
return c == '('; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking change: Bracket handling restricted to parentheses only.
The changes in isLeftBracket
and isRightBracket
methods now only handle parentheses ()
, removing support for square brackets []
and curly braces {}
. This is a breaking change that conflicts with:
- The class documentation which states: "A rule that matches periods, commas and closing parenthesis preceded by whitespace and opening parenthesis followed by whitespace."
- The existing test cases mentioned in the AI summary that check for various bracket types.
Consider one of these solutions:
- Revert the changes to maintain backward compatibility:
- return c == '(';
+ return c == '(' || c == '[' || c == '{';
- return c == ')';
+ return c == ')' || c == ']' || c == '}';
- If the intent is to restrict bracket handling:
- Update the class documentation to clarify the scope
- Update the test cases to remove bracket type checks
- Consider creating a new rule for handling other bracket types
- Add a migration guide for users
Also applies to: 244-244
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (9)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1)
9044-9044
: Compound Entry Addition: "Winnie-the-Pooh"*The new compound entry "Winnie-the-Pooh*" has been added correctly. It follows the expected format by including hyphens and appending the appropriate asterisk (*) indicator. Please verify that its placement aligns with the intended ordering (alphabetical or grouping) within the file.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)
75401-75407
: New Antipattern Rule EnhancementThe newly added
<antipattern>
element is well-structured. It introduces a rule that matches a sentence starting with aSENT_START
token, an optional punctuation token (matching quotation marks), then the literal "musste", followed by tokens for adverbs and articles. This rule should help refine the grammatical checking capabilities for specific constructs in German.Consider adding an inline comment to clarify the purpose of this rule for future maintainers.
75401-75407
: Optional Inline Documentation RecommendationTo further enhance maintainability, an inline comment describing the rationale behind this antipattern could be beneficial. For example, documenting that this rule detects cases where sentences begin with optional quotation marks followed by "musste" and specific grammatical tokens would provide useful context.
You might consider applying a diff similar to the following:
- <antipattern> + <!-- Antipattern: Detects sentences that optionally start with quotation marks followed by "musste", an adverb, and an article --> + <antipattern>languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)
2587-2595
: New Rule Addition for 'Die Linke'The new rule in the
AI_DE_GGEC_REPLACEMENT_ORTHOGRAPHY_UPPERCASE
rule group is structured correctly and follows the established XML format. It checks for the pattern where the token "Die" is marked and is immediately followed by the token "Linke". The example provided clearly illustrates the intended usage:<pattern case_sensitive="yes"> <marker> <token>Die</token> </marker> <token>Linke</token> </pattern> <example correction="">Ich glaube, <marker>Die</marker> Linke hat keine Chance bei der nächsten Wahl.</example>
Consideration:
Please verify whether this rule might inadvertently flag valid occurrences of the political party "Die Linke" (if that is not the intended target) versus instances where "die" should be lowercase. If the correction should not apply when referring to the party, an exception or additional context might be needed to avoid false positives.languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)
1-8108
: Consider organizing entries alphabetically for better maintainability.While the file contains a comprehensive list of multiword terms, the entries appear to be added in a somewhat random order. Consider reorganizing them alphabetically to make it easier to:
- Find existing entries
- Avoid duplicates
- Maintain the file over time
Here's a script to help sort the entries while preserving comments:
#!/bin/bash # Sort entries while preserving comments grep "^#" multiwords.txt > comments.tmp grep -v "^#" multiwords.txt | sort > entries.tmp cat comments.tmp entries.tmp > multiwords_sorted.txt rm comments.tmp entries.tmp
1-8108
: Consider adding POS tag validation.The file uses various POS tags (NNP, NNPS, NN, NNS, etc.) but there's no explicit validation of allowed tags. Consider adding a validation step to ensure consistency.
Here's a script to analyze current POS tag usage:
#!/bin/bash # Extract and count unique POS tags grep -v "^#" | awk '{print $NF}' | sort | uniq -clanguagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3)
27131-27150
: Evaluate new entries and maintain consistency.The additions in this hunk introduce several proper nouns and names (e.g., "Willem-Alexander", "Yomiuri Shimbun", "Asahi Shimbun", etc.). Please verify that these new entries adhere to the established formatting and capitalization conventions. It’s also worth confirming whether the order (typically alphabetical) is maintained or if reordering is required for consistency with other dictionary sections.
27157-27175
: Review naming variations and duplicate entries.This hunk adds multiple new entries such as "Kumbh Mela", "Boom Supersonic", and several personal and corporate names. Note the presence of closely related variants (e.g., "Arabella Rose Kushner", "Arabella Kushner", and "Arabella Rose")—please confirm that each variant is necessary and not an accidental duplicate. Additionally, ensure the naming conventions (spacing, hyphenation) remain consistent.
27186-27192
: Validate accuracy and relevance of additional entries.The entries introduced in this hunk (e.g., "Ramanujan", "Rick Deckard", "Vernon Lee", "Karoline Leavitt") expand the dictionary's scope. Please double-check that these names are accurate, relevant, and align with the overall goals of the spelling resource. Maintaining consistency in style with the rest of the file is crucial.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt
(3 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
(2 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml
(1 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml
(3 hunks)languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt
(1 hunks)languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt
[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...
(EN_COMPOUNDS_EXXON_MOBIL)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (10)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)
66005-66038
: Validation & Consistency Check for Added Podcast and "simsen" TermsThis hunk introduces a batch of new entries ranging from "drachenmäßig/A" through "Zustrombegrenzungsgesetze/SN". These include podcast-related terms (e.g., "Podcastwerbung", "Podcastmarketing/S", "Podcastranking/S") and various verbal forms using “simsen” (e.g., "rum_simsen", "weg_simsen", "herum_simsen", etc.), as well as adjectives such as "geschlechterparitätisch/A". Please verify that each entry adheres to the established Hunspell dictionary format and that the morphological flags (like /A, /S, /SN, /NS) are accurate and consistent with similar, existing entries.
It might be useful to run an automated script or manual check against the dictionary’s parsing logic to ensure none of these entries inadvertently break lookup expectations.
66046-66067
: 🛠️ Refactor suggestionReview of Additional New Technical and Loanword Entries
This section adds further vocabulary items starting with "Slusheis" and continuing through "Tanzim-Miliz". Notably, entries such as "Slusheis" versus "Slush-Eis" and the two variations for the ice machine ("Slush-Eis-Maschine/N" and "Slusheismaschine/N") suggest an intention to capture regional or orthographic variants. Additionally, terms like "Al-Aksa-Moschee", "al-Aqsa-Moschee", "al-Aqsa", and "al-Aksa" appear with subtle differences that should be cross-checked for consistency with any preexisting standards or guidelines for loanwords and hyphenation. The inline remark in "Stop-Loss #eng" is acceptable if similar notation is used elsewhere; otherwise, consider standardizing its format.
Please ensure that these entries do not introduce unintended duplicates or conflicts with existing dictionary terms.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (2)
853-854
: New Replacement Rules for “Bär” VariantsThe new entries
Puh der Bär=Pu der Bär
Poo der Bär=Pu der Bär
are correctly added to standardize these expressions in line with the 2024 German spelling reform. Please verify that the chosen replacement “Pu der Bär” aligns with overall project conventions.
855-869
: Review of Updated Replacement EntriesThe subsequent lines (855–869) update several replacement rules (e.g. “Fettarmekost”, “Gutelaune”, “Hugo Ekitike[s]”, “Bloc Quebecois”, etc.) with the correct diacritical marks and spellings per the recent reform. These modifications look consistent with the updated guidelines.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~859-~859: Do not mix variants of the same word (‘Québécois’ and ‘Quebecois’) within a single text.
Context: ...tikes=Hugo Ekitikés Bloc Quebecois=Bloc Québécois Le Meridien=Le Méridien Francois Bayrou...(EN_WORD_COHERENCY)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (4)
3215-3235
: New Multi-token Suggestion EntriesA large set of new entries has been appended to enhance multi-token suggestion coverage. The additions include several personal names and terms such as:
Susanne Daubner/S
,Constantin Schreiber/S
,Dunja Hayali/S
,Marietta Slomka/S
,Linda Zervakis
,Jan Hofer/S
, andKaty Perry/S
(lines 3215–3221).
These entries appear to be well-formed in terms of syntax and suffix usage. Please confirm that they meet the intended criteria and that the use of “/S” (or the absence thereof, as with “Linda Zervakis”) is deliberate.🧰 Tools
🪛 LanguageTool
[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...(EN_COMPOUNDS_EXXON_MOBIL)
3226-3227
: Spelling Check for “Exxon Mobil/S”The entry
Exxon Mobil/S
(line ~3227)
triggers a static analysis hint: it is normally written as one word (“ExxonMobil”). Please verify whether the space is intentional due to domain-specific usage or if it should be corrected for consistency.🧰 Tools
🪛 LanguageTool
[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...(EN_COMPOUNDS_EXXON_MOBIL)
3228-3233
: Validation of Additional New EntriesEntries such as
Masar-e Scharif/S
,
Tall Abyad/S #name
,
Tripoli/S #name
,
Umm Qasr/S
,
Umm Kasr/S
, and
Willem-Alexander/S #name
(from lines 3228–3233) have been added. They effectively expand the multi-token suggestion dataset. It is recommended to double-check that the usage of suffixes (e.g. “/S”) consistently applies to all names as intended.
3234-3234
: Consistency Check for “Pu der Bär” EntryThe new entry
Pu der Bär
(line 3234)
is added without any suffix. In the other resource file, similar terms use suffixes to indicate token modifications. Please verify if this omission is intentional or if a suffix (such as “/S”) should be appended to maintain consistency with the overall format.languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)
1-4
: LGTM! Clear and helpful header comments.The header comments effectively explain the purpose and usage rules for this multiwords file, including:
- Its use in EnglishHybridDisambiguator
- Criteria for adding terms (single POS, no disambiguation needed)
- Spelling handling behavior
- Case sensitivity rules
316-317
: Verify consistency with other geographic terms.The newly added entries "West African" (NNP) and "West Africans" (NNPS) follow the same pattern as other geographic terms in the file like "East Flanders" (NNP) and "West Flanders" (NNP). Let's verify there are no similar terms missing.
✅ Verification successful
Consistency Confirmed for Directional African Terms
The review comment is addressed as the grep results show that the directional geographic terms in our repository follow a consistent pattern. In the English multiwords file, similar entries like “East Flanders” and “West Flanders” are used, and there are no other “African” directional terms (e.g., North African, South African, or East African) expected or missing. The added “West African” (NNP) and “West Africans” (NNPS) entries align with this established convention.
- The grep output from other geographic terms confirms consistent directional naming.
- There is no evidence of missed similar directional terms regarding “African” in the English multiwords file.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for other directional geographic terms that may need similar treatment rg -i "^(north|south|east|west)\s+\w+" | grep -v "African"Length of output: 31487
Summary by CodeRabbit
New Features
Bug Fixes
Documentation