GTF comparison script #151

Merged
merged 22 commits into from
Dec 18, 2024
133 changes: 133 additions & 0 deletions .github/workflows/gtf_tests.yml
@@ -0,0 +1,133 @@
name: GTF Comparison Tests

on:
  pull_request:
    branches: [ "develop", "master" ]
    paths:
      - '3rd-party-tools/build-indices/**'

jobs:
  test:
    runs-on: ubuntu-latest

    defaults:
      run:
        working-directory: 3rd-party-tools/build-indices

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pandas
      - name: Create output directories
        run: |
          mkdir -p test_output/comparison_files
          mkdir -p test_data/reference_outputs
      - name: Verify test data
        run: |
          if [ ! -f "test_data/test1.gtf" ]; then
            echo "Error: Required test file test_data/test1.gtf not found"
            ls -la test_data/
            exit 1
          fi
          if [ ! -f "Biotypes.tsv" ]; then
            echo "Error: Required Biotypes.tsv file not found"
            ls -la
            exit 1
          fi
          echo "Test files present:"
          ls -l test_data/test1.gtf Biotypes.tsv
      - name: Run GTF modification and comparison
        id: gtf_tests
        env:
          PYTHONPATH: ${{ github.workspace }}/3rd-party-tools/build-indices
        run: |
          # Run the unit tests
          python -m unittest test_gtf_comparison.py -v
        continue-on-error: true

      - name: Prepare artifacts
        if: always()
        run: |
          # Create directory for all artifacts
          mkdir -p artifact_output
          # Copy test input files
          cp test_data/test1.gtf artifact_output/
          cp Biotypes.tsv artifact_output/
          # Copy test outputs if they exist
          if [ -d "test_output" ]; then
            cp -r test_output/* artifact_output/
          fi
          # Copy reference outputs if they exist
          if [ -d "test_data/reference_outputs" ]; then
            mkdir -p artifact_output/reference_outputs
            cp -r test_data/reference_outputs/* artifact_output/reference_outputs/
          fi
          # Create manifest
          {
            echo "GTF Test Artifacts"
            echo "Generated: $(date)"
            echo ""
            echo "Test Input Files:"
            ls -l artifact_output/test1.gtf artifact_output/Biotypes.tsv
            echo ""
            echo "Test Outputs:"
            ls -R artifact_output/test_output/ 2>/dev/null || echo "No test outputs found"
            echo ""
            echo "Reference Outputs:"
            ls -R artifact_output/reference_outputs/ 2>/dev/null || echo "No reference outputs found"
          } > artifact_output/manifest.txt
      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gtf-test-results
          path: |
            3rd-party-tools/build-indices/artifact_output/**/*
            3rd-party-tools/build-indices/test_output/**/*
            3rd-party-tools/build-indices/test_data/reference_outputs/**/*
          compression-level: 9
          retention-days: 14

      - name: Check test results
        if: always()
        run: |
          echo "=== Test Results Summary ==="
          # Check unit test results
          if [ "${{ steps.gtf_tests.outcome }}" == "success" ]; then
            echo "✅ GTF tests passed"
          else
            echo "❌ GTF tests failed"
            # Display difference report if it exists
            if [ -f "test_output/comparison_files/difference_report.txt" ]; then
              echo ""
              echo "Differences found:"
              cat test_output/comparison_files/difference_report.txt
            fi
            # Display test summary if it exists
            if [ -f "test_output/comparison_files/test_summary.txt" ]; then
              echo ""
              echo "Test Summary:"
              cat test_output/comparison_files/test_summary.txt
            fi
            exit 1
          fi
129 changes: 124 additions & 5 deletions 3rd-party-tools/build-indices/README.md
@@ -2,10 +2,9 @@

## Quick reference

Copy and paste to pull this image:

#### `docker pull us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-1663605340`

- __What is this image:__ This image is a Debian-based custom image with STAR installed and pre-configured, along with Python scripts to build indices.
- __What is STAR:__ Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper with support for splice-junction and fusion read detection. STAR aligns reads by finding Maximal Mappable Prefix (MMP) hits between reads (or read pairs) and the genome, using a suffix array index. See the [STAR repository](https://github.com/alexdobin/STAR) for more information.
@@ -15,7 +14,7 @@ Copy and paste to pull this image

Build_indices uses the following convention for versioning:

#### `us.gcr.io/broad-gotc-prod/build-indices:<image-version>-<star-version>-<unix-timestamp>`

We keep track of all past versions in [docker_versions](docker_versions.tsv) with the last image listed being the currently used version in WARP.

@@ -28,12 +27,132 @@ $ docker inspect us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-166360534

## Usage

### Build_indices Docker Container

```bash
$ docker run --rm -it \
us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-1663605340 \
build-indices bash
```

Then you can exec into the container and use STAR or any of the scripts accordingly. Alternatively, you can run one-off commands by passing the command as a docker run parameter.

## GTF Comparison Tools

This repository includes tools for comparing and testing GTF (Gene Transfer Format) file modifications. These tools ensure consistency in GTF processing and provide detailed comparison reports.

### Components

#### Scripts
- `compare_gtfs.py` - Analyzes differences between two GTF files
- `test_gtf_comparison.py` - Unit tests for GTF comparison functionality
- `modify_gtf.py` - Script to modify GTF files

#### Required Files
- `test_data/test1.gtf` - Test GTF file
- `Biotypes.tsv` - File containing allowed biotypes

### Features

The comparison tool analyzes:
- Structural differences in GTF fields
- Attribute differences, including:
  - Reordered attributes
  - Extra or missing attributes
  - Different attribute values
- Gene-level differences
- Mitochondrial gene comparisons
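
The attribute categories above can be illustrated with a minimal sketch. It is not the code in `compare_gtfs.py`; the function names `parse_attributes` and `classify_attribute_diff` are hypothetical, and the sketch assumes each attribute key appears at most once per record (repeated keys such as GENCODE `tag` would need a list of values instead).

```python
import re

# Matches GTF column-9 pairs such as: gene_id "G1";  or  level 2;
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;?')


def parse_attributes(attr_field):
    """Return the attributes of a GTF column-9 string as ordered (key, value) pairs."""
    pairs = []
    for match in _ATTR_RE.finditer(attr_field):
        key, quoted, bare = match.groups()
        pairs.append((key, quoted if quoted is not None else bare))
    return pairs


def classify_attribute_diff(attrs_a, attrs_b):
    """Classify how the attribute string of record B differs from record A."""
    pairs_a, pairs_b = parse_attributes(attrs_a), parse_attributes(attrs_b)
    # Simplification (assumption): unique keys, so a dict loses nothing.
    dict_a, dict_b = dict(pairs_a), dict(pairs_b)
    missing = sorted(set(dict_a) - set(dict_b))  # present in A only
    extra = sorted(set(dict_b) - set(dict_a))    # present in B only
    changed = {
        key: (dict_a[key], dict_b[key])
        for key in sorted(dict_a.keys() & dict_b.keys())
        if dict_a[key] != dict_b[key]
    }
    # "Reordered" means identical content with a different key order.
    reordered = (
        not missing and not extra and not changed
        and [k for k, _ in pairs_a] != [k for k, _ in pairs_b]
    )
    return {"missing": missing, "extra": extra,
            "changed": changed, "reordered": reordered}


original = 'gene_id "G1"; gene_name "A"; level 2;'
shuffled = 'gene_name "A"; gene_id "G1"; level 2;'
modified = 'gene_id "G1"; gene_name "B"; gene_type "lncRNA";'

reorder_report = classify_attribute_diff(original, shuffled)
change_report = classify_attribute_diff(original, modified)
```

Keeping the pairs in order until the final comparison is what lets the sketch tell a pure reordering apart from a real content change.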

### Running GTF Comparison

```bash
python compare_gtfs.py <gtf1> <gtf2> --output-prefix <prefix>
```

Example:
```bash
python compare_gtfs.py test_data/test1.gtf modified_output.gtf --output-prefix comparison
```

### Testing

Run the test suite:
```bash
python -m unittest test_gtf_comparison.py -v
```

### GitHub Actions Integration

Automated testing is configured via GitHub Actions:
- Runs comparison tests
- Generates reports
- Uploads test artifacts

Configuration file: `.github/workflows/gtf_tests.yml`

### Output Reports

1. Structural Differences (`<prefix>_structural_diff.txt`):
   - Row counts
   - Field differences
   - Sample comparisons

2. Attribute Differences (`<prefix>_attribute_diff.txt`):
   - Attribute summaries
   - Detailed comparisons
   - Value differences

3. Gene Differences (`<prefix>_gene_diff.txt`):
   - Gene counts
   - Unique gene lists
   - MT gene analysis
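
As a rough illustration of what the gene-level report summarizes, the sketch below computes gene counts, genes unique to each file, and mitochondrial genes unique to each file. It uses only the standard library rather than pandas, and it is not the logic of `compare_gtfs.py`: the names `summarize_genes`, `gene_diff`, and `MITO_CONTIGS` are hypothetical, and treating a gene as mitochondrial when its contig is `chrM` or `MT` is an assumption.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
MITO_CONTIGS = {"chrM", "MT"}  # assumption: contig names used for mitochondrial genes


def summarize_genes(gtf_lines):
    """Return (all gene IDs, mitochondrial gene IDs) found in an iterable of GTF lines."""
    genes, mito = set(), set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record; ignored in this sketch
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        genes.add(gene_id)
        if fields[0] in MITO_CONTIGS:
            mito.add(gene_id)
    return genes, mito


def gene_diff(lines_a, lines_b):
    """Summarize gene-level differences between two GTF inputs."""
    genes_a, mito_a = summarize_genes(lines_a)
    genes_b, mito_b = summarize_genes(lines_b)
    return {
        "count_a": len(genes_a),
        "count_b": len(genes_b),
        "only_in_a": sorted(genes_a - genes_b),
        "only_in_b": sorted(genes_b - genes_a),
        "mito_only_in_a": sorted(mito_a - mito_b),
        "mito_only_in_b": sorted(mito_b - mito_a),
    }


# Made-up records for illustration only.
EXAMPLE_A = [
    "#description: example\n",
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "A";\n',
    'chrM\tENSEMBL\tgene\t10\t90\t.\t+\t.\tgene_id "G2"; gene_name "MT-ND1";\n',
]
EXAMPLE_B = [
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "A";\n',
    'chr2\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "G3"; gene_name "C";\n',
]

report = gene_diff(EXAMPLE_A, EXAMPLE_B)
```

For real files, the same functions accept open file handles, for example `gene_diff(open(path_a), open(path_b))`.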

### Requirements

- Python 3.x
- pandas
- Standard Python libraries

Install dependencies:
```bash
pip install pandas
```

### Directory Structure

```
build-indices/
├── test_data/
│   ├── test1.gtf
│   └── reference_outputs/
├── test_output/
│   └── comparison_files/
├── compare_gtfs.py
├── test_gtf_comparison.py
├── modify_gtf.py
└── Biotypes.tsv
```

### Error Handling

The tools include comprehensive error handling for:
- Missing files
- Malformed GTF content
- Directory issues
- Attribute parsing errors
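
The kinds of checks listed above might look like the following sketch. It is an illustration under stated assumptions, not the error handling shipped in these scripts: `GTFFormatError`, `parse_gtf_line`, and `read_gtf` are hypothetical names, and the validation rules (nine tab-separated fields, integer coordinates, a valid strand) come from the general GTF layout rather than from this repository.

```python
from pathlib import Path


class GTFFormatError(ValueError):
    """Raised for a malformed GTF record (hypothetical exception name)."""


def parse_gtf_line(line, line_number):
    """Validate one GTF record and return its key fields as a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise GTFFormatError(
            f"line {line_number}: expected 9 tab-separated fields, found {len(fields)}"
        )
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError as exc:
        raise GTFFormatError(f"line {line_number}: start and end must be integers") from exc
    if start > end:
        raise GTFFormatError(f"line {line_number}: start {start} is greater than end {end}")
    if fields[6] not in {"+", "-", "."}:
        raise GTFFormatError(f"line {line_number}: invalid strand {fields[6]!r}")
    return {"seqname": fields[0], "feature": fields[2], "start": start,
            "end": end, "strand": fields[6], "attributes": fields[8]}


def read_gtf(path):
    """Read a GTF file, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"GTF file not found: {path}")
    records = []
    with path.open() as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            records.append(parse_gtf_line(line, number))
    return records
```

Including the line number in each message makes a malformed record easy to locate in a large annotation file.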

### Contributing

When modifying these tools:
1. Ensure all tests pass
2. Update test cases for new features
3. Maintain Docker compatibility
4. Update documentation
5. Follow GitHub Actions workflow requirements

## Notes

- GTF comparison is sensitive to format variations
- Docker container provides consistent environment
- All scripts are accessible within the container
- Use reference files for reliable testing