Add fuzzing corpus storage best practices and artifact exclusions #1282
Draft: Copilot wants to merge 7 commits into dev from copilot/setup-clusterfuzzlite-testing
+547
−0
7 commits
e31b06f Initial plan (Copilot)
5067ec8 Add ClusterFuzzLite fuzzing infrastructure (Copilot)
9f9dce4 Fix markdown line length in fuzz/README.md (Copilot)
604792e Move fuzzing Dockerfile to fuzz/ directory for clarity (Copilot)
499966b Apply code review suggestions from Copilot PR reviewer (Copilot)
7f258ce Apply second round of code review suggestions (Copilot)
c7cc356 Add corpus storage best practices and fuzzing artifact exclusions (Copilot)
@@ -0,0 +1,81 @@

```yaml
# SPDX-FileCopyrightText: 2026 PyThaiNLP Project
# SPDX-License-Identifier: Apache-2.0

name: ClusterFuzzLite

on:
  push:
    branches:
      - dev
    paths-ignore:
      - '**.cff'
      - '**.json'
      - '**.md'
      - '**.rst'
      - '**.txt'
      - 'docs/**'
  pull_request:
    branches:
      - dev
    paths-ignore:
      - '**.cff'
      - '**.json'
      - '**.md'
      - '**.rst'
      - '**.txt'
      - 'docs/**'
  schedule:
    - cron: '0 6 * * *'  # Daily at 06:00 UTC

# Avoid duplicate runs for the same source branch and repository.
# For pull_request events, uses the source repo name from
# github.event.pull_request.head.repo.full_name; otherwise uses github.repository.
# For push events, uses the branch name from github.ref_name.
# For pull_request events, uses the source branch name from github.head_ref.
# This ensures events for the same repo and branch share the same group,
# and avoids cross-fork collisions when branch names are reused.
concurrency:
  group: >-
    ${{ github.workflow }}-${{
    github.event.pull_request.head.repo.full_name || github.repository
    }}-${{ github.head_ref || github.ref_name }}
  cancel-in-progress: true

permissions:
  contents: write
  issues: write

jobs:
  fuzzing:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        sanitizer: [address]
    steps:
      - name: Build Fuzzers (${{ matrix.sanitizer }})
        id: build
        uses: google/clusterfuzzlite/actions/build_fuzzers@v1
        with:
          sanitizer: ${{ matrix.sanitizer }}
          language: python
          dockerfile-path: fuzz/Dockerfile

      - name: Run Fuzzers (${{ matrix.sanitizer }})
        id: run
        uses: google/clusterfuzzlite/actions/run_fuzzers@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          fuzz-seconds: 300
          mode: ${{ github.event_name == 'pull_request' && 'code-change' || 'batch' }}
          sanitizer: ${{ matrix.sanitizer }}
          storage-repo: https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}.git
          storage-repo-branch: gh-pages
          storage-repo-branch-coverage: gh-pages

      - name: Upload crash artifacts
        if: failure() && steps.run.outcome == 'failure'
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.sanitizer }}-artifacts
          path: ./out/artifacts
```
@@ -0,0 +1,29 @@

```dockerfile
# SPDX-FileCopyrightText: 2026 PyThaiNLP Project
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileType: SOURCE

# Dockerfile for ClusterFuzzLite fuzzing
# This extends the OSS-Fuzz base builder image for Python projects

FROM gcr.io/oss-fuzz-base/base-builder-python

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        libicu-dev \
        pkg-config && \
    rm -rf /var/lib/apt/lists/*

# Copy repository to $SRC/pythainlp
COPY . $SRC/pythainlp

# Set working directory
WORKDIR $SRC/pythainlp

# Install pythainlp in development mode with minimal dependencies
# This installs the package without heavy ML dependencies to speed up builds
RUN pip install --no-cache-dir -e .

# Copy build script to $SRC/build.sh as expected by OSS-Fuzz/ClusterFuzzLite
COPY fuzz/build.sh $SRC/
```
The workflow configuration specifies `storage-repo-branch: gh-pages` and `storage-repo-branch-coverage: gh-pages` for storing fuzzing corpus and coverage data. This assumes that a gh-pages branch exists in the repository. If this branch doesn't exist, the workflow may fail when trying to store corpus/coverage data. Consider one of the following: 1) ensuring the gh-pages branch exists before this workflow runs, 2) documenting the need to create this branch in the PR description or setup documentation, or 3) using a different branch that already exists (like a dedicated fuzzing-data branch).
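As an illustration of the first suggestion above, a guard step could confirm that the branch exists before the fuzzers run. This is only a sketch and not part of the PR: the step name and error message are hypothetical, and it assumes the repository is public so that an unauthenticated `git ls-remote` succeeds.

```yaml
# Hypothetical guard step, intended to sit before "Run Fuzzers".
# Assumes a public repository, so no credentials are passed to git.
- name: Check that the gh-pages storage branch exists
  run: |
    # --exit-code makes git return a non-zero status when no matching ref
    # is found; a network failure also fails this step.
    if ! git ls-remote --exit-code --heads \
        "https://github.com/${GITHUB_REPOSITORY}.git" refs/heads/gh-pages >/dev/null; then
      echo "::error::Could not find branch 'gh-pages'; create it before enabling corpus storage."
      exit 1
    fi
```

Failing early this way would surface a missing branch as a clear error rather than as a storage failure inside the fuzzing step.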
@copilot From a security point of view, is storing the fuzzing corpus in gh-pages (publicly available) a concern?
Storing the fuzzing corpus in gh-pages (public) is generally acceptable for open-source projects. The corpus contains test inputs that triggered interesting code paths, not secrets or sensitive data. However, if crash-triggering inputs are a concern:

Options:
1. Change `storage-repo-branch` to a non-public branch if this repo has one
2. Remove the `storage-repo*` parameters (fuzzing will still work, just won't persist corpus between runs)

The current setup follows ClusterFuzzLite best practices for public repositories. Crash artifacts (lines 76-83) are only uploaded as workflow artifacts (not to gh-pages), which have limited retention and access.
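For reference, the option of dropping the `storage-repo*` parameters might look like the sketch below. It copies the "Run Fuzzers" step from the workflow in this PR and omits only the three storage inputs. It is an illustration of that option, not a change proposed in the PR.

```yaml
# Sketch: the "Run Fuzzers" step with the storage-repo* inputs removed.
# All remaining inputs are copied unchanged from the workflow in this PR.
- name: Run Fuzzers (${{ matrix.sanitizer }})
  id: run
  uses: google/clusterfuzzlite/actions/run_fuzzers@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    fuzz-seconds: 300
    mode: ${{ github.event_name == 'pull_request' && 'code-change' || 'batch' }}
    sanitizer: ${{ matrix.sanitizer }}
    # storage-repo, storage-repo-branch and storage-repo-branch-coverage
    # are intentionally omitted, so nothing is pushed to gh-pages.
```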