GTF comparison script (#151)
* comparison script updated

* added tests

* added yaml for testing

* removed extra gtf tester

* updated when to run test

* upgraded actions/upload-artifact@v3

* fixed paths for pytests

* fixed python path for pytest

* fixed python path for pytest

* updated the script to run off Marmoset gtf file

* updated the unittest to run modify_gtf script and saved an example output from previous run

* updated yml to run all analysis

* updated yml to run all analysis

* updated yml to run all analysis

* bug fix

* bug fixes

* improved comparisons

* added readme

* added readme
khajoue2 authored Dec 18, 2024
1 parent 0936a47 commit 47693d9
Showing 7 changed files with 880 additions and 5 deletions.
133 changes: 133 additions & 0 deletions .github/workflows/gtf_tests.yml
@@ -0,0 +1,133 @@
name: GTF Comparison Tests

on:
  pull_request:
    branches: [ "develop", "master" ]
    paths:
      - '3rd-party-tools/build-indices/**'

jobs:
  test:
    runs-on: ubuntu-latest

    defaults:
      run:
        working-directory: 3rd-party-tools/build-indices

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pandas
      - name: Create output directories
        run: |
          mkdir -p test_output/comparison_files
          mkdir -p test_data/reference_outputs
      - name: Verify test data
        run: |
          if [ ! -f "test_data/test1.gtf" ]; then
            echo "Error: Required test file test_data/test1.gtf not found"
            ls -la test_data/
            exit 1
          fi
          if [ ! -f "Biotypes.tsv" ]; then
            echo "Error: Required Biotypes.tsv file not found"
            ls -la
            exit 1
          fi
          echo "Test files present:"
          ls -l test_data/test1.gtf Biotypes.tsv
      - name: Run GTF modification and comparison
        id: gtf_tests
        env:
          PYTHONPATH: ${{ github.workspace }}/3rd-party-tools/build-indices
        run: |
          # Run the unit tests
          python -m unittest test_gtf_comparison.py -v
        continue-on-error: true

      - name: Prepare artifacts
        if: always()
        run: |
          # Create directory for all artifacts
          mkdir -p artifact_output
          # Copy test input files
          cp test_data/test1.gtf artifact_output/
          cp Biotypes.tsv artifact_output/
          # Copy test outputs if they exist
          if [ -d "test_output" ]; then
            cp -r test_output/* artifact_output/
          fi
          # Copy reference outputs if they exist
          if [ -d "test_data/reference_outputs" ]; then
            mkdir -p artifact_output/reference_outputs
            cp -r test_data/reference_outputs/* artifact_output/reference_outputs/
          fi
          # Create manifest
          {
            echo "GTF Test Artifacts"
            echo "Generated: $(date)"
            echo ""
            echo "Test Input Files:"
            ls -l artifact_output/test1.gtf artifact_output/Biotypes.tsv
            echo ""
            echo "Test Outputs:"
            ls -R artifact_output/test_output/ 2>/dev/null || echo "No test outputs found"
            echo ""
            echo "Reference Outputs:"
            ls -R artifact_output/reference_outputs/ 2>/dev/null || echo "No reference outputs found"
          } > artifact_output/manifest.txt
      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: gtf-test-results
          path: |
            3rd-party-tools/build-indices/artifact_output/**/*
            3rd-party-tools/build-indices/test_output/**/*
            3rd-party-tools/build-indices/test_data/reference_outputs/**/*
          compression-level: 9
          retention-days: 14

      - name: Check test results
        if: always()
        run: |
          echo "=== Test Results Summary ==="
          # Check unit test results
          if [ "${{ steps.gtf_tests.outcome }}" == "success" ]; then
            echo "✅ GTF tests passed"
          else
            echo "❌ GTF tests failed"
            # Display difference report if it exists
            if [ -f "test_output/comparison_files/difference_report.txt" ]; then
              echo ""
              echo "Differences found:"
              cat test_output/comparison_files/difference_report.txt
            fi
            # Display test summary if it exists
            if [ -f "test_output/comparison_files/test_summary.txt" ]; then
              echo ""
              echo "Test Summary:"
              cat test_output/comparison_files/test_summary.txt
            fi
            exit 1
          fi
129 changes: 124 additions & 5 deletions 3rd-party-tools/build-indices/README.md
@@ -2,10 +2,9 @@

## Quick reference

-Copy and paste to pull this image
+Copy and paste to pull this image:

#### `docker pull us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-1663605340`
-`

- __What is this image:__ This image is a Debian-based custom image with STAR installed and pre-configured along with python scripts to build indices.
- __What is STAR:__ Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper, with support for splice-junction and fusion read detection. STAR aligns reads by finding the Maximal Mappable Prefix (MMP) hits between reads (or read pairs) and the genome, using a Suffix Array index, [more info here](https://github.com/alexdobin/STAR).
@@ -15,7 +14,7 @@ Copy and paste to pull this image

Build_indices uses the following convention for versioning:

-#### `us.gcr.io/broad-gotc-prod/build-indices:<image-version>-<star-version>-<unix-timestamp>`
+#### `us.gcr.io/broad-gotc-prod/build-indices:<image-version>-<star-version>-<unix-timestamp>`

We keep track of all past versions in [docker_versions](docker_versions.tsv) with the last image listed being the currently used version in WARP.

@@ -28,12 +27,132 @@ $ docker inspect us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-166360534

## Usage

-### Build_indices
+### Build_indices Docker Container

```bash
$ docker run --rm -it \
us.gcr.io/broad-gotc-prod/build-indices:1.0.0-2.7.10a-1663605340 \
build-indices bash
```

-Then you can exec into the container and use STAR or any of the scripts accordingly. Alternatively, you can run one-off commands by passing the command as a docker run parameter.
+Then you can exec into the container and use STAR or any of the scripts accordingly. Alternatively, you can run one-off commands by passing the command as a docker run parameter.

## GTF Comparison Tools

This repository includes tools for comparing and testing GTF (Gene Transfer Format) file modifications. These tools ensure consistency in GTF processing and provide detailed comparison reports.

### Components

#### Scripts
- `compare_gtfs.py` - Analyzes differences between two GTF files
- `test_gtf_comparison.py` - Unit tests for GTF comparison functionality
- `modify_gtf.py` - Script to modify GTF files

#### Required Files
- `test_data/test1.gtf` - Test GTF file
- `Biotypes.tsv` - File containing allowed biotypes

### Features

The comparison tool analyzes:
- Structural differences in GTF fields
- Attribute differences, including:
  - Reordered attributes
  - Extra or missing attributes
  - Different attribute values
- Gene-level differences
- Mitochondrial gene comparisons
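
The attribute checks listed above can be illustrated with a minimal sketch. The source of `compare_gtfs.py` is not shown here, so the helper names `parse_attributes` and `classify_attribute_diff` are hypothetical, and the sketch assumes attributes follow the usual `key "value";` layout with no semicolons inside quoted values.

```python
def parse_attributes(attr_field):
    """Parse a GTF attribute column into an ordered list of (key, value) pairs."""
    pairs = []
    for chunk in attr_field.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the trailing semicolon
        key, _, value = chunk.partition(" ")
        pairs.append((key, value.strip().strip('"')))
    return pairs


def classify_attribute_diff(attrs_a, attrs_b):
    """Classify how two attribute columns differ (hypothetical helper).

    Assumption: each key appears at most once per record; repeated keys
    (such as multiple `tag` entries) collapse to their last value here.
    """
    pairs_a, pairs_b = parse_attributes(attrs_a), parse_attributes(attrs_b)
    dict_a, dict_b = dict(pairs_a), dict(pairs_b)
    keys_a = [key for key, _ in pairs_a]
    keys_b = [key for key, _ in pairs_b]
    shared = set(dict_a) & set(dict_b)
    return {
        "missing": sorted(set(dict_a) - set(dict_b)),  # in A, absent from B
        "extra": sorted(set(dict_b) - set(dict_a)),    # in B, absent from A
        "changed": sorted(k for k in shared if dict_a[k] != dict_b[k]),
        "reordered": set(keys_a) == set(keys_b) and keys_a != keys_b,
    }
```

For example, `classify_attribute_diff('gene_id "G1"; gene_name "A";', 'gene_name "A"; gene_id "G1";')` reports the pair as reordered with no missing, extra, or changed attributes.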

### Running GTF Comparison

```bash
python compare_gtfs.py <gtf1> <gtf2> --output-prefix <prefix>
```

Example:
```bash
python compare_gtfs.py test_data/test1.gtf modified_output.gtf --output-prefix comparison
```

### Testing

Run the test suite:
```bash
python -m unittest test_gtf_comparison.py -v
```

### GitHub Actions Integration

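Automated testing is configured via GitHub Actions. The workflow runs on pull requests to `develop` or `master` that change files under `3rd-party-tools/build-indices/`, and it:
- Runs the unit tests in `test_gtf_comparison.py`
- Prints the difference report and test summary from `test_output/comparison_files/` when the tests fail
- Uploads the test inputs, outputs, and reference outputs as the `gtf-test-results` artifact (retained for 14 days)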

Configuration file: `.github/workflows/gtf_tests.yml`

### Output Reports

1. Structural Differences (`<prefix>_structural_diff.txt`):
   - Row counts
   - Field differences
   - Sample comparisons

2. Attribute Differences (`<prefix>_attribute_diff.txt`):
   - Attribute summaries
   - Detailed comparisons
   - Value differences

3. Gene Differences (`<prefix>_gene_diff.txt`):
   - Gene counts
   - Unique gene lists
   - MT gene analysis
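
As a rough illustration of the gene-level report, the sketch below collects gene IDs from two sets of GTF lines and summarizes counts, unique genes, and mitochondrial genes. It is not the code in `compare_gtfs.py`: the function names are hypothetical, and the set of mitochondrial sequence names (`chrM`, `MT`, and similar) is an assumption that may not match how the real script identifies MT genes.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
# Assumption: mitochondrial records are recognized by sequence name.
MITO_SEQNAMES = {"chrM", "chrMT", "MT", "M"}


def collect_gene_ids(gtf_lines):
    """Return (all gene IDs, mitochondrial gene IDs) found in GTF lines."""
    all_genes, mito_genes = set(), set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # this sketch silently ignores malformed rows
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        all_genes.add(match.group(1))
        if fields[0] in MITO_SEQNAMES:
            mito_genes.add(match.group(1))
    return all_genes, mito_genes


def gene_diff_summary(lines_a, lines_b):
    """Summarize gene-level differences between two GTF inputs."""
    genes_a, mito_a = collect_gene_ids(lines_a)
    genes_b, mito_b = collect_gene_ids(lines_b)
    return {
        "count_a": len(genes_a),
        "count_b": len(genes_b),
        "only_in_a": sorted(genes_a - genes_b),
        "only_in_b": sorted(genes_b - genes_a),
        "mt_only_in_a": sorted(mito_a - mito_b),
        "mt_only_in_b": sorted(mito_b - mito_a),
    }
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.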

### Requirements

- Python 3.x
- pandas
- Standard Python libraries

Install dependencies:
```bash
pip install pandas
```

### Directory Structure

```
build-indices/
├── test_data/
│ ├── test1.gtf
│ └── reference_outputs/
├── test_output/
│ └── comparison_files/
├── compare_gtfs.py
├── test_gtf_comparison.py
├── modify_gtf.py
└── Biotypes.tsv
```

### Error Handling

The tools include comprehensive error handling for:
- Missing files
- Malformed GTF content
- Directory issues
- Attribute parsing errors
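
The kinds of checks listed above might look like the following sketch. The exception class `GTFFormatError` and the functions `parse_gtf_line` and `read_gtf` are hypothetical names chosen for illustration; the actual scripts may report these conditions differently.

```python
from pathlib import Path

GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
)


class GTFFormatError(ValueError):
    """Raised when a GTF line does not match the nine-column layout."""


def parse_gtf_line(line, line_number):
    """Split one GTF line into a dict, validating the field count and coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise GTFFormatError(
            f"line {line_number}: expected {len(GTF_COLUMNS)} tab-separated "
            f"fields, found {len(fields)}"
        )
    record = dict(zip(GTF_COLUMNS, fields))
    try:
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
    except ValueError as exc:
        raise GTFFormatError(
            f"line {line_number}: start and end must be integers"
        ) from exc
    if record["start"] > record["end"]:
        raise GTFFormatError(f"line {line_number}: start is greater than end")
    return record


def read_gtf(path):
    """Read a GTF file into a list of records, failing early on a missing file."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"GTF file not found: {path}")
    records = []
    with path.open() as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            records.append(parse_gtf_line(line, number))
    return records
```

Including the line number in each message makes it easier to locate the offending record in a large annotation file.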

### Contributing

When modifying these tools:
1. Ensure all tests pass
2. Update test cases for new features
3. Maintain Docker compatibility
4. Update documentation
5. Follow GitHub Actions workflow requirements

## Notes

- GTF comparison is sensitive to format variations
- Docker container provides consistent environment
- All scripts are accessible within the container
- Use reference files for reliable testing