Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cb changes 20250131 #11218

Merged
merged 11 commits into from
Feb 4, 2025
Merged

Cb changes 20250131 #11218

merged 11 commits into from
Feb 4, 2025

Conversation

tiff
Copy link
Member

@tiff tiff commented Jan 31, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced German language support with new podcast-related terms and names.
    • Expanded vocabulary in English and German spelling dictionaries, including scientific and contemporary terms.
    • Improved language disambiguation for technical terms like "I/O".
    • Introduced new rules for capitalization related to political references in German and for unnecessary pronouns.
  • Bug Fixes

    • Refined comma whitespace rule to focus specifically on parentheses.
    • Updated grammar rules for more accurate language checking.
  • Documentation

    • Added multiple new names and terms to multi-token suggestion lists.

Copy link
Contributor

coderabbitai bot commented Jan 31, 2025

Walkthrough

This pull request encompasses multiple changes across various LanguageTool language modules, focusing on refining grammatical and punctuation rules, as well as expanding language resources. The modifications include updates to spelling dictionaries, ignore files, disambiguation rules, and grammar configurations in both German and English modules. Key changes involve modifying bracket detection logic to focus solely on parentheses, removing specific test cases related to mismatched brackets, and adding numerous words, names, and technical terms to enhance language processing and spell-checking capabilities.

Changes

File Path Change Summary
languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java Modified bracket detection logic to focus only on parentheses
languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java Removed test cases for mismatched bracket scenarios
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt Added numerous words, brands, names, and cultural references
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt Added "Nietzard/S", "Deepseek/S", "Leavitt/S", "Adschman/S", "Alijew/S", "Yonpyong/S" to the ignore list
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt Added podcast-related terms, new adjectives, and verb variations
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt Added names of public figures and personalities
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml Added new rule for "Die Linke" capitalization
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml Added rule for correctly interpreting "I/O"
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt Added technical terms, names, and specialized vocabulary
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt Added scientific and technical terminology
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml Updated SPEND_IT_FOR rule's antipattern structure
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml Added new antipattern for specific grammatical structures
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt Updated replacement rules according to the 2024 German spelling reform
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt Added new compound word "Winnie-the-Pooh"
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt Added entries for "West African" and "West Africans"

Possibly related PRs

  • Cb changes 20241203 #11081: The changes in the main PR, specifically in the CommaWhitespaceRule class, are related to the modifications in the spelling_global.txt file from the retrieved PR, as both involve adjustments to how specific types of characters (brackets and words) are processed.
  • Cb changes 20241121 #11051: The changes in the main PR, specifically in the CommaWhitespaceRule class, are related to the modifications in the CommaWhitespaceRuleTest class, which involves removing tests that previously checked for various bracket types, aligning with the narrowed focus on parentheses in the main PR.
  • Cb changes 20241217 #11120: The changes in the main PR, specifically in the CommaWhitespaceRule class, are related to the modifications in the spelling_global.txt file from the retrieved PR, as both involve adjustments to how specific terms (including punctuation and vocabulary) are processed.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)

Line range hint 2-11007: Well-formed additions to the ignore list.

The new entries maintain consistent formatting and include a good mix of proper nouns, technical terms, and compound words. The preservation of case sensitivity and diacritical marks (e.g., "Kantō") is appropriate.

Consider grouping related terms for better maintainability.

For variants like "Kanto" and "Kantō", consider adding a comment to indicate their relationship or grouping them together in the file. This would make future maintenance easier.

  # Japanese region names and variants
  Kanto
  Kantō
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)

17185-17194: Consider expanding the rule to handle more variations.

The current rule might miss common variations of "I/O". Consider adding support for:

  • Plural forms (e.g., "I/Os")
  • Alternative separators (e.g., "I-O")
  • Lowercase variations (e.g., "i/o")

Here's a suggested expansion of the rule:

     <rule id="I_O_NO_PRP" name="I in I/O is no PRP">
         <pattern>
             <marker>
-                <token case_sensitive="yes">I</token>
+                <token regexp="yes">(?i)I|i</token>
             </marker>
-            <token>/</token>
+            <token regexp="yes">[/-]</token>
-            <token>O</token>
+            <token regexp="yes">(?i)O|o</token>
+            <token regexp="yes">s?</token>
         </pattern>
         <disambig action="remove" postag="PRP.*"/>
     </rule>
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba526dc and d95cba7.

📒 Files selected for processing (11)
  • languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java (1 hunks)
  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1 hunks)
  • languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java (0 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1 hunks)
💤 Files with no reviewable changes (1)
  • languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java
✅ Files skipped from review due to trivial changes (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (13)
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)

106880-106883: LGTM! Good addition of the "for...sake" antipattern.

The new antipattern correctly handles common idiomatic expressions like "for heaven's sake", "for goodness sake", etc. The skip="4" attribute appropriately allows for variations in these phrases.

languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java (1)

236-236: Verify the impact of bracket handling changes.

Let's analyze the usage of this rule across the codebase to understand the full impact of these changes.

Also applies to: 244-244

✅ Verification successful

Bracket handling changes are safely isolated to core implementation

The change to parenthesis handling in isLeftBracket() is inherited by all language-specific implementations without any conflicts or overrides. No language-specific customizations of bracket handling were found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Analyze usage and impact of CommaWhitespaceRule changes

echo "1. Finding direct usages of CommaWhitespaceRule:"
rg -l "CommaWhitespaceRule"

echo -e "\n2. Finding subclasses:"
ast-grep --pattern 'class $_ extends CommaWhitespaceRule'

echo -e "\n3. Finding test files containing bracket-related assertions:"
rg -l "assertMatches.*[\[\{\(\)\}\]].*CommaWhitespaceRule"

echo -e "\n4. Finding language-specific configurations:"
fd -e xml | xargs rg "COMMA_PARENTHESIS_WHITESPACE"

Length of output: 4800


Script:

#!/bin/bash
# Check language-specific implementations and tests

echo "1. Examining German implementation:"
rg -A 5 -B 5 "class GermanCommaWhitespaceRule" languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanCommaWhitespaceRule.java

echo -e "\n2. Examining Ukrainian implementation:"
rg -A 5 -B 5 "class UkrainianCommaWhitespaceRule" languagetool-language-modules/uk/src/main/java/org/languagetool/rules/uk/UkrainianCommaWhitespaceRule.java

echo -e "\n3. Examining Arabic implementation:"
rg -A 5 -B 5 "class ArabicCommaWhitespaceRule" languagetool-language-modules/ar/src/main/java/org/languagetool/rules/ar/ArabicCommaWhitespaceRule.java

echo -e "\n4. Examining Persian implementation:"
rg -A 5 -B 5 "class PersianCommaWhitespaceRule" languagetool-language-modules/fa/src/main/java/org/languagetool/rules/fa/PersianCommaWhitespaceRule.java

echo -e "\n5. Looking for bracket-related tests:"
rg -A 5 -B 5 "test.*[\(\)].*CommaWhitespaceRule" languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java

Length of output: 2690


Script:

#!/bin/bash
# Examine core test cases and implementation details

echo "1. Looking for bracket-related test cases in CommaWhitespaceRuleTest:"
rg -A 2 "[\(\)].*CommaWhitespaceRule" languagetool-core/src/test/java/org/languagetool/rules/CommaWhitespaceRuleTest.java

echo -e "\n2. Check for isLeftBracket or similar method overrides:"
ast-grep --pattern 'class $_ extends CommaWhitespaceRule {
  $$$
  isLeftBracket($_) {
    $$$
  }
  $$$
}'

echo -e "\n3. Examining the core implementation context:"
rg -A 5 -B 5 "isLeftBracket" languagetool-core/src/main/java/org/languagetool/rules/CommaWhitespaceRule.java

Length of output: 1484

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (4)

66006-66020: LGTM! Well-structured podcast-related terminology.

The compound words follow German word formation rules correctly, with proper gender pairs and inflection markers.


66005-66005: LGTM! Well-formed adjectives with correct inflection markers.

The adjectives follow German compound formation rules and are correctly marked for inflection.

Also applies to: 66021-66022, 66033-66033


66023-66034: LGTM! Consistent marking of separable prefix verbs.

The verbs use underscore consistently to mark separable prefixes, following German verb formation rules.


66035-66036: LGTM! Complex compounds and adjectives are well-formed.

The terms follow German word formation rules with appropriate inflection markers.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1)

696-710: LGTM! The new words are properly formatted and follow the existing pattern.

The additions include correctly formatted scientific and technical terms:

  • Biology terms: "four-limbed", "strong-limbed", "limbed", "uroepithelium", "uroepithelia", "uroepithelial"
  • Chemistry terms: "isocitrate", "isocitrates", "decarboxylation", "alphaketoglutarate", "alphaketoglutarates"
  • Architecture terms: "stylobate", "stylobates"
  • Historical terms: "Perizzite", "Perizzites"
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

3215-3221: LGTM! The new entries are properly formatted and follow the existing pattern.

The additions include correctly formatted names of public figures with appropriate "/S" suffix:

  • News presenters: "Susanne Daubner/S", "Constantin Schreiber/S", "Dunja Hayali/S", "Marietta Slomka/S", "Linda Zervakis", "Jan Hofer/S"
  • Artists: "Katy Perry/S"
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)

Line range hint 1-1: Clear and informative header comment.

The header comment effectively communicates the file's purpose and auto-generated nature.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)

2587-2595: LGTM! Rule correctly enforces capitalization of the political party name.

The rule appropriately checks for proper capitalization of "Die Linke" (The Left), a German political party, ensuring consistent capitalization in text.

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (2)

27133-27133: ⚠️ Potential issue

Remove the line number that got mixed into the entries.

The number "27133" appears to be a line number that accidentally got included in the spelling dictionary.

-27133

Likely invalid or redundant comment.


27134-27145: Verify proper names and consider adding common variations.

The entries look good, but consider adding these variations for consistency:

  • "Boom" as a standalone entry (since "CoreWeave" is included as standalone)
  • "Siena" as a standalone entry (since it's part of "Banca Monte dei Paschi di Siena")

Let's verify the accuracy of proper names:

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)

17185-17194: LGTM! The rule correctly prevents "I" in "I/O" from being tagged as a personal pronoun.

The implementation uses appropriate case sensitivity and pattern matching to identify the "I/O" context.

@@ -233,15 +233,15 @@ private static boolean isLeftBracket(String str) {
return false;
}
char c = str.charAt(0);
return c == '(' || c == '[' || c == '{';
return c == '(';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Breaking change: Bracket handling restricted to parentheses only.

The changes in isLeftBracket and isRightBracket methods now only handle parentheses (), removing support for square brackets [] and curly braces {}. This is a breaking change that conflicts with:

  1. The class documentation which states: "A rule that matches periods, commas and closing parenthesis preceded by whitespace and opening parenthesis followed by whitespace."
  2. The existing test cases mentioned in the AI summary that check for various bracket types.

Consider one of these solutions:

  1. Revert the changes to maintain backward compatibility:
-    return c == '(';
+    return c == '(' || c == '[' || c == '{';
-    return c == ')';
+    return c == ')' || c == ']' || c == '}';
  1. If the intent is to restrict bracket handling:
    • Update the class documentation to clarify the scope
    • Update the test cases to remove bracket type checks
    • Consider creating a new rule for handling other bracket types
    • Add a migration guide for users

Also applies to: 244-244

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (9)
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1)

9044-9044: Compound Entry Addition: "Winnie-the-Pooh"*

The new compound entry "Winnie-the-Pooh*" has been added correctly. It follows the expected format by including hyphens and appending the appropriate asterisk (*) indicator. Please verify that its placement aligns with the intended ordering (alphabetical or grouping) within the file.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)

75401-75407: New Antipattern Rule Enhancement

The newly added <antipattern> element is well-structured. It introduces a rule that matches a sentence starting with a SENT_START token, an optional punctuation token (matching quotation marks), then the literal "musste", followed by tokens for adverbs and articles. This rule should help refine the grammatical checking capabilities for specific constructs in German.

Consider adding an inline comment to clarify the purpose of this rule for future maintainers.


75401-75407: Optional Inline Documentation Recommendation

To further enhance maintainability, an inline comment describing the rationale behind this antipattern could be beneficial. For example, documenting that this rule detects cases where sentences begin with optional quotation marks followed by "musste" and specific grammatical tokens would provide useful context.

You might consider applying a diff similar to the following:

-                <antipattern>
+                <!-- Antipattern: Detects sentences that optionally start with quotation marks followed by "musste", an adverb, and an article -->
+                <antipattern>
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)

2587-2595: New Rule Addition for 'Die Linke'

The new rule in the AI_DE_GGEC_REPLACEMENT_ORTHOGRAPHY_UPPERCASE rule group is structured correctly and follows the established XML format. It checks for the pattern where the token "Die" is marked and is immediately followed by the token "Linke". The example provided clearly illustrates the intended usage:

<pattern case_sensitive="yes">
  <marker>
    <token>Die</token>
  </marker>
  <token>Linke</token>
</pattern>
<example correction="">Ich glaube, <marker>Die</marker> Linke hat keine Chance bei der nächsten Wahl.</example>

Consideration:
Please verify whether this rule might inadvertently flag valid occurrences of the political party "Die Linke" (if that is not the intended target) versus instances where "die" should be lowercase. If the correction should not apply when referring to the party, an exception or additional context might be needed to avoid false positives.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)

1-8108: Consider organizing entries alphabetically for better maintainability.

While the file contains a comprehensive list of multiword terms, the entries appear to be added in a somewhat random order. Consider reorganizing them alphabetically to make it easier to:

  1. Find existing entries
  2. Avoid duplicates
  3. Maintain the file over time

Here's a script to help sort the entries while preserving comments:

#!/bin/bash
# Sort entries while preserving comments
grep "^#" multiwords.txt > comments.tmp
grep -v "^#" multiwords.txt | sort > entries.tmp
cat comments.tmp entries.tmp > multiwords_sorted.txt
rm comments.tmp entries.tmp

1-8108: Consider adding POS tag validation.

The file uses various POS tags (NNP, NNPS, NN, NNS, etc.) but there's no explicit validation of allowed tags. Consider adding a validation step to ensure consistency.

Here's a script to analyze current POS tag usage:

#!/bin/bash
# Extract and count unique POS tags
grep -v "^#" | awk '{print $NF}' | sort | uniq -c
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3)

27131-27150: Evaluate new entries and maintain consistency.

The additions in this hunk introduce several proper nouns and names (e.g., "Willem-Alexander", "Yomiuri Shimbun", "Asahi Shimbun", etc.). Please verify that these new entries adhere to the established formatting and capitalization conventions. It’s also worth confirming whether the order (typically alphabetical) is maintained or if reordering is required for consistency with other dictionary sections.


27157-27175: Review naming variations and duplicate entries.

This hunk adds multiple new entries such as "Kumbh Mela", "Boom Supersonic", and several personal and corporate names. Note the presence of closely related variants (e.g., "Arabella Rose Kushner", "Arabella Kushner", and "Arabella Rose")—please confirm that each variant is necessary and not an accidental duplicate. Additionally, ensure the naming conventions (spacing, hyphenation) remain consistent.


27186-27192: Validate accuracy and relevance of additional entries.

The entries introduced in this hunk (e.g., "Ramanujan", "Rick Deckard", "Vernon Lee", "Karoline Leavitt") expand the dictionary's scope. Please double-check that these names are accurate, relevant, and align with the overall goals of the spelling resource. Maintaining consistency in style with the rest of the file is crucial.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 009d66e and 2172943.

📒 Files selected for processing (13)
  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (3 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (2 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (3 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/compounds.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt
🧰 Additional context used
🪛 LanguageTool
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt

[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...

(EN_COMPOUNDS_EXXON_MOBIL)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (java-kotlin)
🔇 Additional comments (10)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

66005-66038: Validation & Consistency Check for Added Podcast and "simsen" Terms

This hunk introduces a batch of new entries ranging from "drachenmäßig/A" through "Zustrombegrenzungsgesetze/SN". These include podcast-related terms (e.g., "Podcastwerbung", "Podcastmarketing/S", "Podcastranking/S") and various verbal forms using “simsen” (e.g., "rum_simsen", "weg_simsen", "herum_simsen", etc.), as well as adjectives such as "geschlechterparitätisch/A". Please verify that each entry adheres to the established Hunspell dictionary format and that the morphological flags (like /A, /S, /SN, /NS) are accurate and consistent with similar, existing entries.

It might be useful to run an automated script or manual check against the dictionary’s parsing logic to ensure none of these entries inadvertently break lookup expectations.


66046-66067: 🛠️ Refactor suggestion

Review of Additional New Technical and Loanword Entries

This section adds further vocabulary items starting with "Slusheis" and continuing through "Tanzim-Miliz". Notably, entries such as "Slusheis" versus "Slush-Eis" and the two variations for the ice machine ("Slush-Eis-Maschine/N" and "Slusheismaschine/N") suggest an intention to capture regional or orthographic variants. Additionally, terms like "Al-Aksa-Moschee", "al-Aqsa-Moschee", "al-Aqsa", and "al-Aksa" appear with subtle differences that should be cross-checked for consistency with any preexisting standards or guidelines for loanwords and hyphenation. The inline remark in "Stop-Loss #eng" is acceptable if similar notation is used elsewhere; otherwise, consider standardizing its format.

Please ensure that these entries do not introduce unintended duplicates or conflicts with existing dictionary terms.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/replace.txt (2)

853-854: New Replacement Rules for “Bär” Variants

The new entries
  Puh der Bär=Pu der Bär
  Poo der Bär=Pu der Bär
are correctly added to standardize these expressions in line with the 2024 German spelling reform. Please verify that the chosen replacement “Pu der Bär” aligns with overall project conventions.


855-869: Review of Updated Replacement Entries

The subsequent lines (855–869) update several replacement rules (e.g. “Fettarmekost”, “Gutelaune”, “Hugo Ekitike[s]”, “Bloc Quebecois”, etc.) with the correct diacritical marks and spellings per the recent reform. These modifications look consistent with the updated guidelines.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~859-~859: Do not mix variants of the same word (‘Québécois’ and ‘Quebecois’) within a single text.
Context: ...tikes=Hugo Ekitikés Bloc Quebecois=Bloc Québécois Le Meridien=Le Méridien Francois Bayrou...

(EN_WORD_COHERENCY)

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (4)

3215-3235: New Multi-token Suggestion Entries

A large set of new entries has been appended to enhance multi-token suggestion coverage. The additions include several personal names and terms such as:
  Susanne Daubner/S, Constantin Schreiber/S, Dunja Hayali/S, Marietta Slomka/S, Linda Zervakis, Jan Hofer/S, and Katy Perry/S (lines 3215–3221).
These entries appear to be well-formed in terms of syntax and suffix usage. Please confirm that they meet the intended criteria and that the use of “/S” (or the absence thereof, as with “Linda Zervakis”) is deliberate.

🧰 Tools
🪛 LanguageTool

[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...

(EN_COMPOUNDS_EXXON_MOBIL)


3226-3227: Spelling Check for “Exxon Mobil/S”

The entry
  Exxon Mobil/S (line ~3227)
triggers a static analysis hint: it is normally written as one word (“ExxonMobil”). Please verify whether the space is intentional due to domain-specific usage or if it should be corrected for consistency.

🧰 Tools
🪛 LanguageTool

[misspelling] ~3226-~3226: This word is normally spelled as one.
Context: ...onkunas Peter Joppich/S Abd al-Rahman/S Exxon Mobil/S Masar-e Scharif/S Tall Abyad/S #name ...

(EN_COMPOUNDS_EXXON_MOBIL)


3228-3233: Validation of Additional New Entries

Entries such as
  Masar-e Scharif/S,
  Tall Abyad/S #name,
  Tripoli/S #name,
  Umm Qasr/S,
  Umm Kasr/S, and
  Willem-Alexander/S #name
(from lines 3228–3233) have been added. They effectively expand the multi-token suggestion dataset. It is recommended to double-check that the usage of suffixes (e.g. “/S”) consistently applies to all names as intended.


3234-3234: Consistency Check for “Pu der Bär” Entry

The new entry
  Pu der Bär (line 3234)
is added without any suffix. In the other resource file, similar terms use suffixes to indicate token modifications. Please verify if this omission is intentional or if a suffix (such as “/S”) should be appended to maintain consistency with the overall format.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)

1-4: LGTM! Clear and helpful header comments.

The header comments effectively explain the purpose and usage rules for this multiwords file, including:

  1. Its use in EnglishHybridDisambiguator
  2. Criteria for adding terms (single POS, no disambiguation needed)
  3. Spelling handling behavior
  4. Case sensitivity rules

316-317: Verify consistency with other geographic terms.

The newly added entries "West African" (NNP) and "West Africans" (NNPS) follow the same pattern as other geographic terms in the file like "East Flanders" (NNP) and "West Flanders" (NNP). Let's verify there are no similar terms missing.

✅ Verification successful

Consistency Confirmed for Directional African Terms

The review comment is addressed as the grep results show that the directional geographic terms in our repository follow a consistent pattern. In the English multiwords file, similar entries like “East Flanders” and “West Flanders” are used, and there are no other “African” directional terms (e.g., North African, South African, or East African) expected or missing. The added “West African” (NNP) and “West Africans” (NNPS) entries align with this established convention.

  • The grep output from other geographic terms confirms consistent directional naming.
  • There is no evidence of missed similar directional terms regarding “African” in the English multiwords file.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other directional geographic terms that may need similar treatment
rg -i "^(north|south|east|west)\s+\w+" | grep -v "African"

Length of output: 31487

@tiff tiff merged commit a8f2643 into master Feb 4, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant