
Conversation

Contributor

Copilot AI commented Feb 5, 2026

What do these changes do

Implements safe corpus management practices for ClusterFuzzLite fuzzing infrastructure.

What was wrong

The fuzzing setup lacked guidance on corpus handling, with no protection against committing large or duplicate corpus files, sensitive data, or crash-triggering inputs to version control.

How this fixes it

Documentation (fuzz/README.md)

  • Minimize/de-duplicate: libFuzzer -merge=1 commands and workflow
  • Data sanitization: Explicit do's and don'ts for test data, plus grep patterns for detecting sensitive content (a sketch follows this list)
  • Dedicated storage: Explains gh-pages branch usage, alternatives for private repos
  • Crash handling: Step-by-step procedure (reproduce → debug → fix → verify → add regression test → add to corpus)
  • Security considerations: Public corpus storage safety, artifact retention policies
  • Command reference: Common libFuzzer operations for corpus management
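
To make the data sanitization point concrete, here is a minimal sketch (not code from this PR) of a pre-commit scan over a local corpus directory; the fuzz/corpus/ path matches the .gitignore entries below, while the scan_corpus helper and the regex patterns are illustrative assumptions that would need tuning for real use.

import re
import sys
from pathlib import Path

# Illustrative patterns for content that should never land in a public corpus.
SENSITIVE_PATTERNS = [
    re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+"),                # email addresses
    re.compile(rb"AKIA[0-9A-Z]{16}"),                        # AWS-style access key IDs
    re.compile(rb"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # private key blocks
]


def scan_corpus(corpus_dir: str = "fuzz/corpus") -> int:
    """Return the number of corpus files that appear to contain sensitive data."""
    flagged = 0
    for path in Path(corpus_dir).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if any(pattern.search(data) for pattern in SENSITIVE_PATTERNS):
            print(f"possible sensitive content: {path}")
            flagged += 1
    return flagged


if __name__ == "__main__":
    sys.exit(1 if scan_corpus() else 0)

Running such a scan before committing locally collected seeds catches the most obvious leaks; it does not replace reviewing new seed files by hand.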

Artifact Exclusions (.gitignore)

fuzz/corpus/
fuzz/crashes/
fuzz/artifacts/
crash-*
leak-*
timeout-*
oom-*
*.profraw
*.profdata

Prevents accidental commits of fuzzing artifacts while maintaining ClusterFuzzLite's automatic gh-pages corpus persistence.

Your checklist for this pull request

  • Passed code style and structure checks
  • Passed code linting checks and unit tests
Original prompt

Setup ClusterFuzzLite for Fuzz Testing

Context

Issue #1070 requests setting up ClusterFuzzLite for continuous fuzz testing of PyThaiNLP. This will help discover edge cases, crashes, and potential security issues in text processing functions.

Objectives

  1. Add ClusterFuzzLite GitHub Actions workflow that integrates with existing CI/CD
  2. Create fuzzing harnesses for high-priority modules (tokenization and utilities)
  3. Add Docker configuration for ClusterFuzzLite build environment
  4. Provide documentation for maintaining and expanding fuzz tests

Requirements

1. GitHub Actions Workflow (.github/workflows/clusterfuzzlite.yml)

Create a new workflow that:

  • Runs on pull requests to dev branch (matching existing workflow patterns)
  • Runs on push to dev branch
  • Runs on schedule (daily at 06:00 UTC)
  • Uses AddressSanitizer for memory safety checks
  • Fuzzes for 5 minutes per harness (300 seconds)
  • Uses mode: 'code-change' to focus on PR changes
  • Follows the concurrency pattern used in other workflows (cancel-in-progress for same branch)
  • Uploads crash artifacts if failures occur
  • Has proper permissions (contents: read, issues: write)

2. Docker Configuration (Dockerfile)

Create a Dockerfile at the repository root that:

  • Extends gcr.io/oss-fuzz-base/base-builder-python
  • Installs Python development dependencies
  • Copies the repository to $SRC/pythainlp
  • Installs pythainlp in development mode with minimal dependencies

3. Fuzzing Harnesses (fuzz/ directory)

Create the following fuzz test files:

fuzz/fuzz_tokenize.py

  • Test pythainlp.tokenize.word_tokenize() with random Unicode input
  • Use atheris.FuzzedDataProvider to generate test data
  • Validate output is a list of strings
  • Catch expected exceptions (ValueError, TypeError, UnicodeDecodeError)
  • Re-raise unexpected exceptions for investigation (a sketch of this harness pattern follows this list)
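
A minimal sketch of what such a harness could look like, assuming the atheris API (instrument_imports, FuzzedDataProvider, Setup, Fuzz) and the behaviour described in the bullets above; the test_one_input name and the 256-character limit are illustrative choices, and the actual fuzz/fuzz_tokenize.py in this PR may differ in detail:

import sys

import atheris

# Instrument pythainlp at import time so coverage feedback guides the fuzzer.
with atheris.instrument_imports():
    from pythainlp.tokenize import word_tokenize


def test_one_input(data: bytes) -> None:
    """Feed fuzzer-generated Unicode text to word_tokenize() and validate the result."""
    fdp = atheris.FuzzedDataProvider(data)
    text = fdp.ConsumeUnicodeNoSurrogates(256)
    try:
        tokens = word_tokenize(text)
    except (ValueError, TypeError, UnicodeDecodeError):
        return  # expected for malformed or unsupported input
    # Any other exception propagates so the fuzzer records it for investigation.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise TypeError("word_tokenize() must return a list of strings")


def main() -> None:
    """Hand control to the fuzzing engine.

    :rtype: None
    """
    atheris.Setup(sys.argv, test_one_input)
    atheris.Fuzz()


if __name__ == "__main__":
    main()

The fuzz_util_normalize.py harness described next follows the same shape, swapping in pythainlp.util.normalize() and checking for a str return value.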

fuzz/fuzz_util_normalize.py

  • Test pythainlp.util.normalize() with random Unicode input
  • Use atheris.FuzzedDataProvider to generate test data
  • Validate output is a string
  • Catch expected exceptions
  • Re-raise unexpected exceptions

fuzz/build.sh

  • Install atheris for Python fuzzing
  • Compile each fuzzer (all fuzz/fuzz_*.py files)
  • Make fuzzers executable
  • Copy to $OUT/ directory
  • Print build success message

fuzz/README.md

  • Explain the fuzzing setup and directory structure
  • Provide instructions for local testing
  • Document how to add new fuzzers
  • List expansion plan for future fuzzing targets:
    • spell/ (spelling correction)
    • corpus/ (data loading)
    • soundex/ (phonetic encoding)
    • transliterate/ (romanization)
    • tag/ (part-of-speech tagging)

4. Code Style and Standards

  • Follow existing PyThaiNLP code style (check .github/workflows/lint.yml)
  • Add SPDX headers to new files where appropriate:
    # SPDX-FileCopyrightText: 2026 PyThaiNLP Project
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileType: SOURCE
  • Use proper error handling and logging
  • Include descriptive docstrings

5. Integration Testing

The solution should:

  • Not break existing CI/CD workflows
  • Use minimal dependencies (only atheris for fuzzing)
  • Be compatible with Python 3.9+ (matching requires-python in pyproject.toml)
  • Follow the path-ignore patterns used in other workflows

Success Criteria

  • GitHub Actions workflow runs successfully on PRs
  • Fuzzing harnesses execute without immediate crashes
  • Docker container builds successfully
  • Documentation is clear and complete
  • No conflicts with existing workflows
  • Follows PyThaiNLP's existing CI/CD patterns

Notes

  • Start with minimal fuzzing (2 harnesses) to validate the setup
  • Can expand to more modules after initial validation
  • Fuzzing time (300s) can be adjusted based on CI/CD resource constraints
  • The workflow uses path-ignore patterns similar to existing workflows to avoid unnecessary runs

This pull request was created from Copilot chat.



Copilot AI assigned Copilot and bact Feb 5, 2026
@bact bact linked an issue Feb 5, 2026 that may be closed by this pull request
@bact bact added the infrastructure (build process, configuration, GitHub Actions) and tests (unit test, code coverage, test case) labels Feb 5, 2026
@bact bact added this to PyThaiNLP Feb 5, 2026
@bact bact moved this to In progress in PyThaiNLP Feb 5, 2026
@bact bact added this to the Future milestone Feb 5, 2026
@github-actions

github-actions bot commented Feb 5, 2026

Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.

สวัสดี @Copilot ขอบคุณที่ส่ง PR เข้ามา เราจะตอบกลับให้เร็วที่สุดเท่าที่จะทำได้

Copilot AI and others added 2 commits February 5, 2026 00:48
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add ClusterFuzzLite setup for continuous fuzz testing" to "Add ClusterFuzzLite for continuous fuzz testing" Feb 5, 2026
Copilot AI requested a review from bact February 5, 2026 00:51
@bact
Member

bact commented Feb 5, 2026

@copilot people will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fuzz/ directory.

What is the best pattern that is commonly used?

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot people will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fu...

Moved the fuzzing Dockerfile to fuzz/Dockerfile and restored the original Dockerfile at root for the main application. This follows the common pattern of keeping fuzzing infrastructure in the fuzz/ directory. The workflow now uses dockerfile-path: fuzz/Dockerfile to point to the fuzzing-specific Docker image. (604792e)

Contributor

Copilot AI left a comment


Pull request overview

This pull request adds ClusterFuzzLite infrastructure to enable continuous fuzz testing of PyThaiNLP functions, helping discover edge cases, crashes, and security vulnerabilities in text processing operations.

Changes:

  • Added GitHub Actions workflow for automated fuzzing on PRs, pushes, and daily schedule
  • Created fuzzing Docker environment with OSS-Fuzz base builder
  • Implemented two initial fuzzing harnesses for word_tokenize() and normalize() functions
  • Added build script and comprehensive documentation for fuzzing infrastructure

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.

Summary per file:

  • .github/workflows/clusterfuzzlite.yml: GitHub Actions workflow orchestrating ClusterFuzzLite fuzzing with AddressSanitizer
  • fuzz/Dockerfile: Docker configuration extending the OSS-Fuzz base builder for Python projects
  • fuzz/build.sh: Build script for compiling fuzzing harnesses with atheris
  • fuzz/fuzz_tokenize.py: Fuzzing harness testing word_tokenize() with random Unicode input
  • fuzz/fuzz_util_normalize.py: Fuzzing harness testing normalize() with random Unicode input
  • fuzz/README.md: Documentation covering setup, local testing, CI/CD integration, and the expansion roadmap

@bact
Member

bact commented Feb 5, 2026

@copilot apply suggestions in this thread #1282 (review)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot apply suggestions in this thread #1282 (review)

Applied all suggestions from the code review (499966b):

Workflow changes:

  • Removed **.yml from paths-ignore to allow workflow updates to trigger
  • Changed permissions to contents: write for storing corpus/coverage data
  • Made fuzzing mode conditional: code-change for PRs, batch for scheduled/push events

Fuzzer files:

  • Added :rtype: None to docstrings for complete type annotations
  • Replaced assert statements with explicit TypeError raises
  • Removed unnecessary except Exception: blocks

Build script:

  • Fixed to copy fuzzer files directly (instrumentation happens at runtime)
  • Added version constraint: atheris>=2.3.0
  • Removed unused fuzzer_package variable

Dockerfile:

  • Reordered SPDX headers to match project convention
  • Added comment about build script location for OSS-Fuzz/ClusterFuzzLite

@bact bact requested a review from Copilot February 5, 2026 02:58
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Comment on lines +72 to +74
storage-repo: https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git
storage-repo-branch: gh-pages
storage-repo-branch-coverage: gh-pages

Copilot AI Feb 5, 2026


The workflow configuration specifies storage-repo-branch: gh-pages and storage-repo-branch-coverage: gh-pages for storing fuzzing corpus and coverage data. This assumes that a gh-pages branch exists in the repository. If this branch doesn't exist, the workflow may fail when trying to store corpus/coverage data. Consider either: 1) ensuring the gh-pages branch exists before this workflow runs, 2) documenting the need to create this branch in the PR description or setup documentation, or 3) using a different branch that already exists (like a dedicated fuzzing-data branch).

Member


@copilot from a security point of view, is storing the fuzzing corpus in gh-pages (publicly available) a concern?

Contributor Author


Storing fuzzing corpus in gh-pages (public) is generally acceptable for open-source projects. The corpus contains test inputs that triggered interesting code paths, not secrets or sensitive data. However, if crash-triggering inputs are concerning:

Options:

  1. Keep gh-pages public - Standard practice for OSS fuzzing (e.g., OSS-Fuzz projects)
  2. Use a private branch - Change storage-repo-branch to a non-public branch if this repo has one
  3. Disable corpus storage - Remove the storage-repo* parameters (fuzzing will still work, just won't persist corpus between runs)

The current setup follows ClusterFuzzLite best practices for public repositories. Crash artifacts (lines 76-83) are only uploaded as workflow artifacts (not gh-pages), which have limited retention and access.

@bact
Member

bact commented Feb 5, 2026

@copilot apply suggestions from this thread
#1282 (review)

- Fix SPDX header order (FileCopyrightText, FileType, License-Identifier)
- Add :rtype: None to main() docstrings for complete type annotations
- Replace assert with explicit TypeError in README example
- Add nullglob guard to build.sh for loop
- Pin atheris to exact version (==2.3.0) for security

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot apply suggestions from this thread
#1282 (review)

Applied all 9 suggestions from the second code review (7f258ce):

SPDX header order:

  • Fixed order in all fuzz files to: FileCopyrightText → FileType → License-Identifier

Type annotations:

  • Added :rtype: None to main() docstrings in both fuzzers

README example:

  • Replaced assert isinstance() with an explicit TypeError raise (see the short illustration after this list)

Build script security & robustness:

  • Pinned atheris to exact version ==2.3.0 (instead of >=2.3.0) to prevent supply-chain attacks
  • Added [[ -e "$fuzzer" ]] || continue guard to handle missing fuzz files gracefully
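
As a short illustration of that assert-to-raise change (the README's actual example is not reproduced here, and validate_tokens is a hypothetical helper name), the pattern is:

def validate_tokens(tokens: object) -> list:
    """Hypothetical helper showing the validation style after the review fix."""
    # Before the fix (stripped when Python runs with -O, so the check could vanish):
    #     assert isinstance(tokens, list)
    # After the fix (always enforced, with a clear error message):
    if not isinstance(tokens, list):
        raise TypeError(f"expected list, got {type(tokens).__name__}")
    return tokens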

- Add comprehensive "Corpus Storage Best Practices" section to fuzz/README.md
  - Minimize and de-duplicate guidelines with libFuzzer commands
  - Data sanitization guidelines (no sensitive/production data)
  - Dedicated storage explanation (gh-pages branch usage)
  - Crash monitoring and handling procedures
  - Security considerations for public/private repos
  - Corpus management command reference
- Update .gitignore to exclude fuzzing artifacts
  - Local corpus directories (fuzz/corpus/, fuzz/crashes/, fuzz/artifacts/)
  - Fuzzer output files (crash-*, leak-*, timeout-*, oom-*)
  - Profiling data (*.profraw, *.profdata)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title from "Add ClusterFuzzLite for continuous fuzz testing" to "Add fuzzing corpus storage best practices and artifact exclusions" Feb 5, 2026
@sonarqubecloud

sonarqubecloud bot commented Feb 5, 2026

Quality Gate failed

Failed conditions
4 Security Hotspots

See analysis details on SonarQube Cloud


Labels

infrastructure (build process, configuration, GitHub Actions), tests (unit test, code coverage, test case)

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Setup ClusterFuzzLite for fuzz testing

2 participants