
Conversation

Contributor

Copilot AI commented Feb 5, 2026

What do these changes do

Implements safe corpus management practices for ClusterFuzzLite fuzzing infrastructure.

What was wrong

The fuzzing setup lacked guidance on corpus handling, with no protection against committing large or duplicate corpus files, sensitive data, or crash-triggering inputs to version control.

How this fixes it

Documentation (fuzz/README.md)

  • Minimize/de-duplicate: libFuzzer -merge=1 commands and workflow
  • Data sanitization: Explicit do's and don'ts for test data, plus grep patterns for detecting sensitive content (a sketch follows this list)
  • Dedicated storage: Explains gh-pages branch usage, alternatives for private repos
  • Crash handling: Step-by-step procedure (reproduce → debug → fix → verify → add regression test → add to corpus)
  • Security considerations: Public corpus storage safety, artifact retention policies
  • Command reference: Common libFuzzer operations for corpus management
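
To make the data sanitization point concrete, here is a minimal sketch (not code from this PR) of a pre-commit scan over a local corpus directory; the fuzz/corpus/ path matches the .gitignore entries below, while the scan_corpus helper and the regex patterns are illustrative assumptions that would need tuning for real use.

import re
import sys
from pathlib import Path

# Illustrative patterns for content that should never land in a public corpus.
SENSITIVE_PATTERNS = [
    re.compile(rb"[\w.+-]+@[\w-]+\.[\w.-]+"),                # email addresses
    re.compile(rb"AKIA[0-9A-Z]{16}"),                        # AWS-style access key IDs
    re.compile(rb"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # private key blocks
]


def scan_corpus(corpus_dir: str = "fuzz/corpus") -> int:
    """Return the number of corpus files that appear to contain sensitive data."""
    flagged = 0
    for path in Path(corpus_dir).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if any(pattern.search(data) for pattern in SENSITIVE_PATTERNS):
            print(f"possible sensitive content: {path}")
            flagged += 1
    return flagged


if __name__ == "__main__":
    sys.exit(1 if scan_corpus() else 0)

Running such a scan before committing locally collected seeds catches the most obvious leaks; it does not replace reviewing new seed files by hand.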

Artifact Exclusions (.gitignore)

fuzz/corpus/
fuzz/crashes/
fuzz/artifacts/
crash-*
leak-*
timeout-*
oom-*
*.profraw
*.profdata

Prevents accidental commits of fuzzing artifacts while maintaining ClusterFuzzLite's automatic gh-pages corpus persistence.

Your checklist for this pull request

  • Passed code style and structure checks
  • Passed code linting checks and unit tests
Original prompt

Setup ClusterFuzzLite for Fuzz Testing

Context

Issue #1070 requests setting up ClusterFuzzLite for continuous fuzz testing of PyThaiNLP. This will help discover edge cases, crashes, and potential security issues in text processing functions.

Objectives

  1. Add ClusterFuzzLite GitHub Actions workflow that integrates with existing CI/CD
  2. Create fuzzing harnesses for high-priority modules (tokenization and utilities)
  3. Add Docker configuration for ClusterFuzzLite build environment
  4. Provide documentation for maintaining and expanding fuzz tests

Requirements

1. GitHub Actions Workflow (.github/workflows/clusterfuzzlite.yml)

Create a new workflow that:

  • Runs on pull requests to dev branch (matching existing workflow patterns)
  • Runs on push to dev branch
  • Runs on schedule (daily at 06:00 UTC)
  • Uses AddressSanitizer for memory safety checks
  • Fuzzes for 5 minutes per harness (300 seconds)
  • Uses mode: 'code-change' to focus on PR changes
  • Follows the concurrency pattern used in other workflows (cancel-in-progress for same branch)
  • Uploads crash artifacts if failures occur
  • Has proper permissions (contents: read, issues: write)

2. Docker Configuration (Dockerfile)

Create a Dockerfile at the repository root that:

  • Extends gcr.io/oss-fuzz-base/base-builder-python
  • Installs Python development dependencies
  • Copies the repository to $SRC/pythainlp
  • Installs pythainlp in development mode with minimal dependencies

3. Fuzzing Harnesses (fuzz/ directory)

Create the following fuzz test files:

fuzz/fuzz_tokenize.py

  • Test pythainlp.tokenize.word_tokenize() with random Unicode input
  • Use atheris.FuzzedDataProvider to generate test data
  • Validate output is a list of strings
  • Catch expected exceptions (ValueError, TypeError, UnicodeDecodeError)
  • Re-raise unexpected exceptions for investigation (a sketch of this harness pattern follows this list)
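
A minimal sketch of what such a harness could look like, assuming the atheris API (instrument_imports, FuzzedDataProvider, Setup, Fuzz) and the behaviour described in the bullets above; the test_one_input name and the 256-character limit are illustrative choices, and the actual fuzz/fuzz_tokenize.py in this PR may differ in detail:

import sys

import atheris

# Instrument pythainlp at import time so coverage feedback guides the fuzzer.
with atheris.instrument_imports():
    from pythainlp.tokenize import word_tokenize


def test_one_input(data: bytes) -> None:
    """Feed fuzzer-generated Unicode text to word_tokenize() and validate the result."""
    fdp = atheris.FuzzedDataProvider(data)
    text = fdp.ConsumeUnicodeNoSurrogates(256)
    try:
        tokens = word_tokenize(text)
    except (ValueError, TypeError, UnicodeDecodeError):
        return  # expected for malformed or unsupported input
    # Any other exception propagates so the fuzzer records it for investigation.
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        raise TypeError("word_tokenize() must return a list of strings")


def main() -> None:
    """Hand control to the fuzzing engine.

    :rtype: None
    """
    atheris.Setup(sys.argv, test_one_input)
    atheris.Fuzz()


if __name__ == "__main__":
    main()

The fuzz_util_normalize.py harness described next follows the same shape, swapping in pythainlp.util.normalize() and checking for a str return value.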

fuzz/fuzz_util_normalize.py

  • Test pythainlp.util.normalize() with random Unicode input
  • Use atheris.FuzzedDataProvider to generate test data
  • Validate output is a string
  • Catch expected exceptions
  • Re-raise unexpected exceptions

fuzz/build.sh

  • Install atheris for Python fuzzing
  • Compile each fuzzer (all fuzz/fuzz_*.py files)
  • Make fuzzers executable
  • Copy to $OUT/ directory
  • Print build success message

fuzz/README.md

  • Explain the fuzzing setup and directory structure
  • Provide instructions for local testing
  • Document how to add new fuzzers
  • List expansion plan for future fuzzing targets:
    • spell/ (spelling correction)
    • corpus/ (data loading)
    • soundex/ (phonetic encoding)
    • transliterate/ (romanization)
    • tag/ (part-of-speech tagging)

4. Code Style and Standards

  • Follow existing PyThaiNLP code style (check .github/workflows/lint.yml)
  • Add SPDX headers to new files where appropriate:
    # SPDX-FileCopyrightText: 2026 PyThaiNLP Project
    # SPDX-License-Identifier: Apache-2.0
    # SPDX-FileType: SOURCE
  • Use proper error handling and logging
  • Include descriptive docstrings

5. Integration Testing

The solution should:

  • Not break existing CI/CD workflows
  • Use minimal dependencies (only atheris for fuzzing)
  • Be compatible with Python 3.9+ (matching requires-python in pyproject.toml)
  • Follow the path-ignore patterns used in other workflows

Success Criteria

  • GitHub Actions workflow runs successfully on PRs
  • Fuzzing harnesses execute without immediate crashes
  • Docker container builds successfully
  • Documentation is clear and complete
  • No conflicts with existing workflows
  • Follows PyThaiNLP's existing CI/CD patterns

Notes

  • Start with minimal fuzzing (2 harnesses) to validate the setup
  • Can expand to more modules after initial validation
  • Fuzzing time (300s) can be adjusted based on CI/CD resource constraints
  • The workflow uses path-ignore patterns similar to existing workflows to avoid unnecessary runs

This pull request was created from Copilot chat.



Copilot AI assigned Copilot and bact Feb 5, 2026
@bact bact linked an issue Feb 5, 2026 that may be closed by this pull request
@bact bact added the infrastructure (build process, configuration, GitHub Actions) and tests (unit test, code coverage, test case) labels Feb 5, 2026
@bact bact added this to PyThaiNLP Feb 5, 2026
@bact bact moved this to In progress in PyThaiNLP Feb 5, 2026
@bact bact added this to the Future milestone Feb 5, 2026
@github-actions

github-actions bot commented Feb 5, 2026

Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.

สวัสดี @Copilot ขอบคุณที่ส่ง PR เข้ามา เราจะตอบกลับให้เร็วที่สุดเท่าที่จะทำได้

Copilot AI and others added 2 commits February 5, 2026 00:48
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add ClusterFuzzLite setup for continuous fuzz testing" to "Add ClusterFuzzLite for continuous fuzz testing" Feb 5, 2026
Copilot AI requested a review from bact February 5, 2026 00:51
@bact
Member

bact commented Feb 5, 2026

@copilot people will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fuzz/ directory.

What is the best pattern that is commonly used?

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot people will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fu...

Moved the fuzzing Dockerfile to fuzz/Dockerfile and restored the original Dockerfile at root for the main application. This follows the common pattern of keeping fuzzing infrastructure in the fuzz/ directory. The workflow now uses dockerfile-path: fuzz/Dockerfile to point to the fuzzing-specific Docker image. (604792e)

Contributor

Copilot AI left a comment


Pull request overview

This pull request adds ClusterFuzzLite infrastructure to enable continuous fuzz testing of PyThaiNLP functions, helping discover edge cases, crashes, and security vulnerabilities in text processing operations.

Changes:

  • Added GitHub Actions workflow for automated fuzzing on PRs, pushes, and daily schedule
  • Created fuzzing Docker environment with OSS-Fuzz base builder
  • Implemented two initial fuzzing harnesses for word_tokenize() and normalize() functions
  • Added build script and comprehensive documentation for fuzzing infrastructure

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.

Summary per file:

  • .github/workflows/clusterfuzzlite.yml: GitHub Actions workflow orchestrating ClusterFuzzLite fuzzing with AddressSanitizer
  • fuzz/Dockerfile: Docker configuration extending the OSS-Fuzz base builder for Python projects
  • fuzz/build.sh: Build script for compiling fuzzing harnesses with atheris
  • fuzz/fuzz_tokenize.py: Fuzzing harness testing word_tokenize() with random Unicode input
  • fuzz/fuzz_util_normalize.py: Fuzzing harness testing normalize() with random Unicode input
  • fuzz/README.md: Documentation covering setup, local testing, CI/CD integration, and the expansion roadmap

@bact
Member

bact commented Feb 5, 2026

@copilot apply suggestions in this thread #1282 (review)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot apply suggestions in this thread #1282 (review)

Applied all suggestions from the code review (499966b):

Workflow changes:

  • Removed **.yml from paths-ignore to allow workflow updates to trigger
  • Changed permissions to contents: write for storing corpus/coverage data
  • Made fuzzing mode conditional: code-change for PRs, batch for scheduled/push events

Fuzzer files:

  • Added :rtype: None to docstrings for complete type annotations
  • Replaced assert statements with explicit TypeError raises
  • Removed unnecessary except Exception: blocks

Build script:

  • Fixed to copy fuzzer files directly (instrumentation happens at runtime)
  • Added version constraint: atheris>=2.3.0
  • Removed unused fuzzer_package variable

Dockerfile:

  • Reordered SPDX headers to match project convention
  • Added comment about build script location for OSS-Fuzz/ClusterFuzzLite

@bact bact requested a review from Copilot February 5, 2026 02:58
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Comment on lines +72 to +74
storage-repo: https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git
storage-repo-branch: gh-pages
storage-repo-branch-coverage: gh-pages

Copilot AI Feb 5, 2026


The workflow configuration specifies storage-repo-branch: gh-pages and storage-repo-branch-coverage: gh-pages for storing fuzzing corpus and coverage data. This assumes that a gh-pages branch exists in the repository. If this branch doesn't exist, the workflow may fail when trying to store corpus/coverage data. Consider either: 1) ensuring the gh-pages branch exists before this workflow runs, 2) documenting the need to create this branch in the PR description or setup documentation, or 3) using a different branch that already exists (like a dedicated fuzzing-data branch).

Member


@copilot from a security point of view, is storing the fuzzing corpus in gh-pages (publicly available) a concern?

Contributor Author


Storing fuzzing corpus in gh-pages (public) is generally acceptable for open-source projects. The corpus contains test inputs that triggered interesting code paths, not secrets or sensitive data. However, if crash-triggering inputs are concerning:

Options:

  1. Keep gh-pages public - Standard practice for OSS fuzzing (e.g., OSS-Fuzz projects)
  2. Use a private branch - Change storage-repo-branch to a non-public branch if this repo has one
  3. Disable corpus storage - Remove the storage-repo* parameters (fuzzing will still work, just won't persist corpus between runs)

The current setup follows ClusterFuzzLite best practices for public repositories. Crash artifacts (lines 76-83) are only uploaded as workflow artifacts (not gh-pages), which have limited retention and access.

@bact
Member

bact commented Feb 5, 2026

@copilot apply suggestions from this thread
#1282 (review)

- Fix SPDX header order (FileCopyrightText, FileType, License-Identifier)
- Add :rtype: None to main() docstrings for complete type annotations
- Replace assert with explicit TypeError in README example
- Add nullglob guard to build.sh for loop
- Pin atheris to exact version (==2.3.0) for security

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 5, 2026

@copilot apply suggestions from this thread
#1282 (review)

Applied all 9 suggestions from the second code review (7f258ce):

SPDX header order:

  • Fixed order in all fuzz files to: FileCopyrightText → FileType → License-Identifier

Type annotations:

  • Added :rtype: None to main() docstrings in both fuzzers

README example:

  • Replaced assert isinstance() with an explicit TypeError raise (see the short illustration after this list)

Build script security & robustness:

  • Pinned atheris to exact version ==2.3.0 (instead of >=2.3.0) to prevent supply-chain attacks
  • Added [[ -e "$fuzzer" ]] || continue guard to handle missing fuzz files gracefully
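
As a short illustration of that assert-to-raise change (the README's actual example is not reproduced here, and validate_tokens is a hypothetical helper name), the pattern is:

def validate_tokens(tokens: object) -> list:
    """Hypothetical helper showing the validation style after the review fix."""
    # Before the fix (stripped when Python runs with -O, so the check could vanish):
    #     assert isinstance(tokens, list)
    # After the fix (always enforced, with a clear error message):
    if not isinstance(tokens, list):
        raise TypeError(f"expected list, got {type(tokens).__name__}")
    return tokens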

- Add comprehensive "Corpus Storage Best Practices" section to fuzz/README.md
  - Minimize and de-duplicate guidelines with libFuzzer commands
  - Data sanitization guidelines (no sensitive/production data)
  - Dedicated storage explanation (gh-pages branch usage)
  - Crash monitoring and handling procedures
  - Security considerations for public/private repos
  - Corpus management command reference
- Update .gitignore to exclude fuzzing artifacts
  - Local corpus directories (fuzz/corpus/, fuzz/crashes/, fuzz/artifacts/)
  - Fuzzer output files (crash-*, leak-*, timeout-*, oom-*)
  - Profiling data (*.profraw, *.profdata)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title from "Add ClusterFuzzLite for continuous fuzz testing" to "Add fuzzing corpus storage best practices and artifact exclusions" Feb 5, 2026
@sonarqubecloud

sonarqubecloud bot commented Feb 5, 2026

Quality Gate failed

Failed conditions
4 Security Hotspots

See analysis details on SonarQube Cloud


Labels

infrastructure (build process, configuration, GitHub Actions), tests (unit test, code coverage, test case)

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Setup ClusterFuzzLite for fuzz testing

2 participants