Add fuzzing corpus storage best practices and artifact exclusions #1282
base: dev
Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.
Co-authored-by: bact <128572+bact@users.noreply.github.com>
@copilot people will be confused by the Dockerfile* files. We should name them properly, indicating which one is for normal use and which one is for fuzzing. Alternatively, put the Dockerfile for fuzzing in the fuzz/ directory. What is the most commonly used pattern?
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Moved the fuzzing Dockerfile to |
Pull request overview
This pull request adds ClusterFuzzLite infrastructure to enable continuous fuzz testing of PyThaiNLP functions, helping discover edge cases, crashes, and security vulnerabilities in text processing operations.
Changes:
- Added GitHub Actions workflow for automated fuzzing on PRs, pushes, and daily schedule
- Created fuzzing Docker environment with OSS-Fuzz base builder
- Implemented two initial fuzzing harnesses for `word_tokenize()` and `normalize()` functions
- Added build script and comprehensive documentation for fuzzing infrastructure
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| .github/workflows/clusterfuzzlite.yml | GitHub Actions workflow orchestrating ClusterFuzzLite fuzzing with AddressSanitizer |
| fuzz/Dockerfile | Docker configuration extending OSS-Fuzz base builder for Python projects |
| fuzz/build.sh | Build script for compiling fuzzing harnesses with atheris |
| fuzz/fuzz_tokenize.py | Fuzzing harness testing word_tokenize() with random Unicode input |
| fuzz/fuzz_util_normalize.py | Fuzzing harness testing normalize() with random Unicode input |
| fuzz/README.md | Documentation covering setup, local testing, CI/CD integration, and expansion roadmap |
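
For orientation, a harness of the kind listed in the table could look roughly like the sketch below. It is not the code in `fuzz/fuzz_tokenize.py`: the helper name `make_harness` and the lenient `errors="ignore"` decoding are assumptions made for illustration, and `main()` assumes that `atheris` and `pythainlp` are installed in the fuzzing image.

```python
import sys
from typing import Callable


def make_harness(target: Callable[[str], object]) -> Callable[[bytes], None]:
    """Wrap a text function so it can be driven by raw fuzzer bytes.

    Hypothetical helper for illustration; not part of the PR.
    """

    def test_one_input(data: bytes) -> None:
        # Decode leniently so every byte string yields some Unicode text.
        text = data.decode("utf-8", errors="ignore")
        # Any exception raised by the target propagates and is reported
        # by the fuzzer as a crash.
        target(text)

    return test_one_input


def main() -> None:
    # Imported lazily: both packages are assumed to exist in the fuzzing image.
    import atheris

    # Instrument the library under test so the fuzzer gets coverage feedback.
    with atheris.instrument_imports():
        from pythainlp.tokenize import word_tokenize

    atheris.Setup(sys.argv, make_harness(word_tokenize))
    atheris.Fuzz()
```

Calling `main()` inside the fuzzing container starts the fuzzing loop; the wrapper itself can be exercised with any callable, which makes it easy to unit-test without the fuzzer.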
@copilot apply suggestions in this thread #1282 (review)
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Applied all suggestions from the code review (499966b): Workflow changes:
Fuzzer files:
Build script:
Dockerfile:
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
    storage-repo: https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git
    storage-repo-branch: gh-pages
    storage-repo-branch-coverage: gh-pages
Copilot AI, Feb 5, 2026
The workflow configuration specifies storage-repo-branch: gh-pages and storage-repo-branch-coverage: gh-pages for storing fuzzing corpus and coverage data. This assumes that a gh-pages branch exists in the repository. If this branch doesn't exist, the workflow may fail when trying to store corpus/coverage data. Consider either: 1) ensuring the gh-pages branch exists before this workflow runs, 2) documenting the need to create this branch in the PR description or setup documentation, or 3) using a different branch that already exists (like a dedicated fuzzing-data branch).
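
One way to act on the first option is to check for the branch before relying on it. The sketch below is an illustration only, not part of the PR: the function names are hypothetical, and `remote_has_branch` assumes `git` is on the path and the remote is reachable. It parses the output of `git ls-remote --heads`, which prints one `<sha>\t<ref>` line per branch.

```python
import subprocess


def branch_in_ls_remote(ls_remote_output: str, branch: str) -> bool:
    """Return True if `git ls-remote --heads` output lists the given branch."""
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        # Each line has the form "<sha>\t<ref>".
        parts = line.split("\t", 1)
        if len(parts) == 2 and parts[1].strip() == wanted:
            return True
    return False


def remote_has_branch(remote: str, branch: str) -> bool:
    """Query a remote for its branch heads (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", remote],
        capture_output=True,
        text=True,
        check=True,
    )
    return branch_in_ls_remote(result.stdout, branch)
```

For example, `remote_has_branch("origin", "gh-pages")` run inside a checkout would report whether the storage branch already exists on the remote.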
@copilot from a security point of view, is storing the fuzzing corpus in gh-pages (publicly available) a concern?
Storing fuzzing corpus in gh-pages (public) is generally acceptable for open-source projects. The corpus contains test inputs that triggered interesting code paths, not secrets or sensitive data. However, if crash-triggering inputs are concerning:
Options:
- Keep gh-pages public - Standard practice for OSS fuzzing (e.g., OSS-Fuzz projects)
- Use a private branch - Change `storage-repo-branch` to a non-public branch if this repo has one
- Disable corpus storage - Remove the `storage-repo*` parameters (fuzzing will still work, just won't persist corpus between runs)
The current setup follows ClusterFuzzLite best practices for public repositories. Crash artifacts (lines 76-83) are only uploaded as workflow artifacts (not to gh-pages), which have limited retention and access.
@copilot apply suggestions from this thread
- Fix SPDX header order (FileCopyrightText, FileType, License-Identifier)
- Add :rtype: None to main() docstrings for complete type annotations
- Replace assert with explicit TypeError in README example
- Add nullglob guard to build.sh for loop
- Pin atheris to exact version (==2.3.0) for security

Co-authored-by: bact <128572+bact@users.noreply.github.com>
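
The switch from `assert` to an explicit `TypeError` matters because Python strips `assert` statements when run with `-O`, so an assertion-based check can silently disappear. A minimal sketch of the explicit form follows; the function name `check_tokens` is hypothetical and the README example itself may differ.

```python
def check_tokens(result: object) -> None:
    """Raise TypeError unless result is a list of str.

    Hypothetical check for illustration; an explicit raise stays active
    even under `python -O`, unlike an assert statement.
    """
    if not isinstance(result, list):
        raise TypeError(f"expected list, got {type(result).__name__}")
    for item in result:
        if not isinstance(item, str):
            raise TypeError(f"expected str token, got {type(item).__name__}")
```

A harness would call this on the return value of the function under test, so that a wrong return type surfaces as a reported failure rather than passing unnoticed.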
Applied all 9 suggestions from the second code review (7f258ce): SPDX header order:
Type annotations:
README example:
Build script security & robustness:
- Add comprehensive "Corpus Storage Best Practices" section to fuzz/README.md
  - Minimize and de-duplicate guidelines with libFuzzer commands
  - Data sanitization guidelines (no sensitive/production data)
  - Dedicated storage explanation (gh-pages branch usage)
  - Crash monitoring and handling procedures
  - Security considerations for public/private repos
  - Corpus management command reference
- Update .gitignore to exclude fuzzing artifacts
  - Local corpus directories (fuzz/corpus/, fuzz/crashes/, fuzz/artifacts/)
  - Fuzzer output files (crash-*, leak-*, timeout-*, oom-*)
  - Profiling data (*.profraw, *.profdata)

Co-authored-by: bact <128572+bact@users.noreply.github.com>
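
To illustrate the de-duplication idea, the sketch below removes byte-identical files from a local corpus directory. This is a deliberately simpler stand-in, not what the README documents: libFuzzer's `-merge=1` minimizes by coverage, whereas this only compares content hashes. The function name `dedupe_corpus` is hypothetical.

```python
import hashlib
from pathlib import Path


def dedupe_corpus(corpus_dir: str) -> int:
    """Delete byte-identical files in corpus_dir; return how many were removed."""
    seen = set()
    removed = 0
    # Sort for a deterministic choice of which copy survives.
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()
            removed += 1
        else:
            seen.add(digest)
    return removed
```

Running this before a coverage-based merge would only shrink the input set; it does not replace the merge step, since two different inputs can still exercise exactly the same code paths.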
What do these changes do
Implements safe corpus management practices for ClusterFuzzLite fuzzing infrastructure.
What was wrong
Fuzzing setup lacked guidance on corpus handling. No protection against committing large/duplicate corpus files, sensitive data, or crash-triggering inputs to version control.
How this fixes it
Documentation (`fuzz/README.md`): `-merge=1` commands and workflow

Artifact Exclusions (`.gitignore`)

Prevents accidental commits of fuzzing artifacts while maintaining ClusterFuzzLite's automatic gh-pages corpus persistence.
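
As a rough illustration of what the exclusions cover, the sketch below classifies paths using the directory and file patterns named in the commit message. It only approximates gitignore matching (real `.gitignore` semantics are richer), and the constant and function names are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns taken from the commit message; the grouping is an assumption.
ARTIFACT_DIRS = ("fuzz/corpus", "fuzz/crashes", "fuzz/artifacts")
ARTIFACT_NAMES = (
    "crash-*", "leak-*", "timeout-*", "oom-*", "*.profraw", "*.profdata",
)


def is_fuzz_artifact(path: str) -> bool:
    """Return True if a repo-relative POSIX path looks like a fuzzing artifact."""
    p = PurePosixPath(path)
    # Anything stored under one of the local artifact directories.
    in_dir = any(str(p).startswith(d + "/") for d in ARTIFACT_DIRS)
    # Or any file whose name matches a fuzzer output or profiling pattern.
    by_name = any(fnmatchcase(p.name, pat) for pat in ARTIFACT_NAMES)
    return in_dir or by_name
```

A check like this could be used in a pre-commit hook to flag staged files that should have been ignored, for example `is_fuzz_artifact("fuzz/crashes/crash-abc")`.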
Your checklist for this pull request
Original prompt
This pull request was created from Copilot chat.