Making temporary code name decisions reproducible for fungal DNA barcoding.
OTUnify doesn't replace temporary codes—it makes them scientifically reproducible.
When you review MycoMap BLAST results and decide "this sequence is close enough to Amanita sp. 'IN01'," that expert judgment currently exists only in your head. OTUnify captures that decision as explicit, versioned methodology that others can apply consistently. Your expertise remains essential—OTUnify just ensures it can be shared and reproduced.
Every day, mycologists reviewing sequence data make critical decisions:
- "This 99.5% match is the same species as our temporary code"
- "That 99.2% match is actually different" (because this clade has low variation)
- "These sequences at 99.0-99.8% all belong together"
These decisions require deep expertise about fungal diversity patterns, but they're currently irreproducible. Different reviewers make different choices, and there's no record of why boundaries were drawn where they were.
OTUnify provides a two-layer system that transforms expert decisions into explicit methodology:
Layer 1: Molecular OTUs (mOTUs) - The reproducible backbone
- You provide: A set of sequences you know belong together
- OTUnify finds: The most representative "anchor" sequence and calculates how similar the others are to it
- Result: Cortinarius-HASH123-99.2-ADJ-0.1.fasta (a precise molecular boundary)
- Anyone can now test if new sequences fall within YOUR defined boundary
Layer 2: Operational Species Concepts (OSCs) - Your polyphasic interpretation
- Maps temporary codes like "Cortinarius sp. 'MI03'" to one or more mOTUs
- Handles species with multiple haplotypes (one code → multiple mOTUs)
- Handles regional variants (multiple codes → one mOTU)
- Preserves morphological, ecological, geographic context, and expert notes
- Currently implemented with basic functionality; full features are in development
# Today's workflow (manual, expert-dependent):
# 1. BLAST sequence against MycoMap
# 2. Expert reviews: "Close enough to Amanita sp. 'IN01'"
# 3. Hope next person makes same judgment
# OTUnify-enabled workflow (automated, reproducible):
# 1. Check if sequence matches existing species concepts
$ otunify-classify new_sequence.fasta species_concepts/
✓ Amanita sp. 'IN01': 99.7% match (included)
# 2. Or discover it needs a new code
$ otunify-classify novel_sequence.fasta species_concepts/
✗ No matches above threshold
# 3. Create new OTU with your expert criteria
$ otunify-create novel_sequences.fasta --vanity-prefix "Amanita"
✓ Created: Amanita-HASH789-99.5-ADJ-0.1.fasta
# 4. Document as new temporary code
$ otunify-describe amanita_oh99.yaml --provisional-name "Amanita sp. 'OH99'"
The North American fungal community has built an impressive ecosystem around temporary code names: provisional identifiers that bridge the centuries-long gap between discovery and formal taxonomy. With >70,000 DNA-barcoded observations on iNaturalist and ~100,000 sequences in MycoMap, we've proven this system works.
The challenge: Every temporary code assignment involves expert judgment that currently can't be reproduced. When you decide a sequence belongs with "Cortinarius sp. 'MI03'," that decision relies on your knowledge of variation patterns in that clade, and that knowledge is recorded nowhere but your own head.
The solution: OTUnify makes your expert decisions explicit and shareable. It's version control for your taxonomic judgment—preserving not just what you decided, but the key criteria you used to decide it, plus any notes and context you want to pass to other experts.
✅ Codify Your Expertise: Transform "this looks right" into "sequences ≥99.5% similar belong here"
✅ Share Your Criteria: Export your OTU definitions so others apply your exact standards
✅ Document Decisions: Record who expanded or narrowed a code boundary and when
✅ Track Code Evolution: See the complete history as understanding improves
✅ Automate Matching: Let computers apply your criteria to incoming sequences
- For Sequencing Labs: Consistent, reproducible OTU assignment across batches
- For Data Review: Apply the same expert criteria across MycoMap, iNaturalist, and other platforms
- For Publications: Cite exact, reproducible methods for sequence delimitation
- For Future Taxonomy: Clear audit trail from temporary code to formal description
- For Collaboration: Share your taxonomic expertise with other labs and projects
OTUnify is built on command-line tools and simple file formats to ensure maximum flexibility and integration potential. This design provides several key advantages:
No Centralized Dependencies: Works entirely offline with whatever data you have available. There's no requirement to connect to servers, wait for database updates, or depend on external systems. Your expertise and data remain fully under your control.
Low Barrier to Integration: Simple FASTA and YAML formats mean any tool can read and write OTUnify data. This enables an ecosystem of tools to evolve while keeping the core methodology stable and shared across the community.
Future GUI Integration: While OTUnify currently uses a command-line interface, we anticipate most users will eventually interact with it through seamless web interfaces on MycoMap.com or other platforms. These integrations will provide all the fundamental benefits—reproducible methodology, shared expertise, automated matching—without requiring command-line knowledge or GitHub familiarity.
The CLI foundation ensures the methodology remains transparent, reproducible, and accessible to computational workflows, while future GUI layers will make it accessible to everyone in the mycological community.
pip install git+https://github.com/joshuaowalker/otunify.git
For development setup, see the Development section below.
Let's say you have sequences you know belong to "Russula sp. 'brevipes-CA01'". Here's how to make that knowledge reproducible:
# 1. Create an OTU from your clustered sequences
$ otunify-create russula_brevipes_sequences.fasta --vanity-prefix "Russula"
✓ Created: Russula-A5B7C9D1-99.2-ADJ-0.1.fasta
Identity cutoff: 99.2% (based on your sequences)
Algorithm: MycoBLAST-adjusted
What just happened? OTUnify:
- Found the most representative "anchor" sequence from your set
- Calculated how similar all your sequences are to it
- Set the boundary at the furthest sequence you included (99.2%)
- Created a computer-readable definition file others can use
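The boundary rule in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not OTUnify's implementation: it assumes the identities to the anchor have already been computed, and the function name `identity_cutoff` and the sequence names are hypothetical.

```python
def identity_cutoff(identities_to_anchor: dict[str, float]) -> float:
    """Return the lowest percent identity to the anchor among the member sequences.

    Assumption: the boundary is simply the identity of the furthest member,
    with no rounding or safety margin applied.
    """
    if not identities_to_anchor:
        raise ValueError("at least one member sequence is required")
    return min(identities_to_anchor.values())


# Hypothetical identities (percent) of each member to the chosen anchor.
members = {"seq_a": 100.0, "seq_b": 99.6, "seq_c": 99.2}

cutoff = identity_cutoff(members)  # 99.2
# Every sequence the expert included passes its own boundary by construction.
all_included = all(identity >= cutoff for identity in members.values())
```

Because the cutoff comes from the sequences you supplied, adding a more distant sequence to the input set is what widens the boundary.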
Note: OTU filenames (like Russula-A5B7C9D1-99.2-ADJ-0.1.fasta) are designed for computers to process efficiently. For human-readable names, you'll create an OSC that maps your temporary code to this OTU.
Now when you get new sequences, you can check if they match your definition:
# 2. Check if a new sequence matches
$ otunify-match unknown_sequence.fasta Russula-A5B7C9D1-*.fasta
Query: unknown_seq_001
✓ INCLUDED in Russula-A5B7C9D1 (99.6% identity)
Query: unknown_seq_002
✗ EXCLUDED from Russula-A5B7C9D1 (98.8% identity)
→ Below 99.2% threshold - may need new temporary code
Document your temporary code with its full context:
# 3. Create an Operational Species Concept linking human-readable name to OTU
$ otunify-describe russula_brevipes_ca01.yaml \
--provisional-name "Russula sp. 'brevipes-CA01'" \
--otu "Russula-A5B7C9D1" \
--description "Large white Russula from California oak woodlands" \
--notes "Consistently found under Quercus agrifolia, spore print cream"
This OSC file captures your polyphasic understanding (morphology, ecology, geography) and links the human-readable temporary code to the computer-readable OTU.
# Check all new sequences against your species concepts
$ otunify-classify new_batch.fasta species_concepts/
# For unmatched sequences, create new OTUs
$ otunify-create unmatched_cluster1.fasta --vanity-prefix "Cortinarius"
$ otunify-create unmatched_cluster2.fasta --vanity-prefix "Inocybe"
# Then document as new temporary codes
$ otunify-describe cortinarius_or42.yaml --provisional-name "Cortinarius sp. 'OR42'"
# Remove obvious outliers (>95th percentile distance)
$ otunify-create messy_cluster.fasta --exclude-outliers 95
# Or just report what would be excluded
$ otunify-create messy_cluster.fasta --report-outliers 95
# For well-studied species with known variation
$ otunify-create amanita_muscaria.fasta --identity-cutoff-override 99.5
# For poorly known groups, let the data decide
$ otunify-create unknown_cortinarius.fasta --vanity-prefix "Cortinarius"
The "anchor" sequence is the reference point for your OTU. OTUnify automatically selects the best anchor by finding the medoid, the sequence that minimizes total distance to all others in your set. This ensures the most representative sequence becomes your reference point.
Best practice: Let OTUnify choose the anchor automatically. It will pick the sequence that best represents your entire cluster.
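Medoid selection as described above can be illustrated with a short Python sketch. It is a toy under stated assumptions: the sequences are already aligned and of equal length, distance is the plain percentage of mismatching positions, and ties are broken alphabetically. The helper names `percent_distance` and `select_medoid` are hypothetical, and OTUnify's own distance calculation (STD or ADJ) is more involved.

```python
def percent_distance(a: str, b: str) -> float:
    """Percent of mismatching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


def select_medoid(sequences: dict[str, str]) -> str:
    """Return the name of the sequence with the smallest total distance to all others."""
    def total_distance(name: str) -> float:
        return sum(percent_distance(sequences[name], other) for other in sequences.values())

    # sorted() makes the tie-break deterministic (alphabetical by name).
    return min(sorted(sequences), key=total_distance)


toy_cluster = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "ACGTACGTCA",
}
anchor = select_medoid(toy_cluster)  # "s2": it differs from each neighbour at one position
```

Here s2 sits between s1 and s3, so it has the smallest summed distance and becomes the anchor.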
# Check for overlapping OTUs that might be the same species
$ otunify-validate --check-overlaps simple my_otus/
# Use BLAST mode for large reference sets (>100 OTUs)
$ otunify-validate --check-overlaps blast --reference-set all_otus/ new_otus/
Replace expert "close enough" decisions when reviewing MycoMap BLAST results with explicit, reproducible methodology for assigning sequences to existing temporary codes vs. creating new ones.
Enable local DNA labs to generate consistent, comparable OTU definitions that integrate with the broader MycoMap ecosystem.
Support the >70,000 DNA-barcoded fungal observations on iNaturalist with systematic species delimitation that bridges citizen science and formal taxonomy.
Facilitate the North American goal of documenting macrofungal biodiversity through coordinated, reproducible methods across multiple community science projects.
This is an alpha release demonstrating the core concepts. Here's what's working:
- OTU Creation (otunify-create): Define reproducible sequence boundaries
- Validation (otunify-validate): Check OTU definitions and detect overlaps
- Matching (otunify-match): Compare sequences against OTU definitions
- BLAST Mode: 5-50× performance improvement for large datasets
- OSC Creation (otunify-describe): Document species concepts linking to mOTUs
- Classification (otunify-classify): Assign sequences to temporary codes via OSCs
- OSC Validation: Automatic checking of OSC-OTU relationships
Still in development or planned:
- Full polyphasic integration features still in development
- Convenience utilities (otunify-revise, otunify-merge) for managing OTU evolution
- Web services for DOI minting
- Integration with UNITE Species Hypotheses
- Automated GitHub Actions validation
- Phylogenetic borrowing for under-sampled groups
Creates OTU definitions from pre-clustered FASTA sequences.
otunify-create [OPTIONS] INPUT_FASTA [OUTPUT_FASTA]
Options:
- --algorithm [STD|ADJ]: Identity calculation algorithm (default: ADJ)
- --vanity-prefix TEXT: Human-readable OTU prefix (default: OTU)
- --exclude-outliers N: Remove sequences beyond Nth percentile distance
- --min-identity FLOAT: Remove sequences below identity threshold
- --identity-cutoff-override FLOAT: Manual cutoff override (expert use)
- --report-outliers N: Report sequences beyond Nth percentile distance
- --min-length INTEGER: Minimum sequence length for anchor selection
- --max-length INTEGER: Maximum sequence length for anchor selection
- --no-ambiguity: Exclude sequences with ambiguity codes
- --description TEXT: Optional description
- --output-dir PATH: Output directory for generated files
- --disable-reorient: Disable sequence reorientation
- --verbose, -v: Enable verbose output
Examples:
# Clean Amanita cluster with outliers
otunify-create amanita_cluster.fasta --exclude-outliers 95 --min-identity 95.0
# Quality control with outlier reporting
otunify-create sequences.fasta --report-outliers 95 --vanity-prefix "Cortinarius_sp"
# Conservative clustering for publication
otunify-create sequences.fasta --algorithm ADJ --min-identity 97.0 --no-ambiguity
# Batch processing to output directory
otunify-create sequences.fasta --output-dir otus/ --vanity-prefix "Amanita_muscaria"
Validates OTU definition files according to OTUnify specification.
otunify-validate [OPTIONS] PATHS...
Options:
- --verbose, -v: Show all validation messages (errors, warnings, info)
- --quiet, -q: Show only errors
- --show-warnings, -w: Show warnings and errors (but not info messages)
- --output, -o PATH: Write JSON report to file
- --format [text|json]: Output format (default: text)
- --fail-on-warnings: Exit with error code if warnings are found
- --check-overlaps [none|simple|blast]: Enable overlap detection (default: none)
- --overlap-factor FLOAT: Adjust overlap sensitivity (default: 1.0, >1.0 = more sensitive)
- --reference-set PATH: Reference set for overlap detection (default: use validation set as reference)
Examples:
# Validate single file
otunify-validate my_otu.fasta
# Validate all FASTA files in directory (ignores auxiliary files)
otunify-validate otus/
# Show warnings and errors
otunify-validate --show-warnings otus/
# Verbose validation with JSON report
otunify-validate --verbose --output report.json --format json otus/
# Validation for CI/CD (fail on warnings)
otunify-validate --fail-on-warnings otus/
# Check for overlapping OTUs
otunify-validate --check-overlaps simple otus/
# More sensitive overlap detection
otunify-validate --check-overlaps simple --overlap-factor 1.5 otus/
# Check validation set against separate reference set
otunify-validate --check-overlaps simple --reference-set reference_otus/ new_otus/
# High-performance BLAST mode for large datasets
otunify-validate --check-overlaps blast --reference-set large_reference_db/ new_otus/
# BLAST mode with custom sensitivity
otunify-validate --check-overlaps blast --overlap-factor 1.2 --reference-set reference_otus/ validation_set/
# Validate multiple files and directories
otunify-validate file1.fasta file2.fasta directory/
Validation Levels:
- File Format: FASTA parsing and basic structure
- OTU ID Format: Prefix-hash structure and character validation
- Metadata Validation: Required fields and format compliance
- Sequence Validation: IUPAC nucleotides and length checks
- Filename Convention: Consistency between filename and content
- Hash Consistency: Verify calculated hash matches declared hash
- Overlap Detection: Identify potentially overlapping OTU boundaries (optional)
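The hash consistency check in the list above amounts to recomputing an identifier from the sequence content and comparing it with the declared one. The sketch below shows the general idea only. The hashing scheme is an assumption (SHA-256 over the uppercased sequence, base32-encoded and truncated to 10 characters); the scheme OTUnify actually uses is defined in FORMAT_SPECIFICATION.md, and the function names here are hypothetical.

```python
import base64
import hashlib


def content_hash(sequence: str, length: int = 10) -> str:
    """Derive a short identifier from sequence content (illustrative scheme, not OTUnify's)."""
    # Normalize so that case and line wrapping do not change the identifier.
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha256(normalized.encode("ascii")).digest()
    return base64.b32encode(digest).decode("ascii")[:length]


def hash_is_consistent(declared_hash: str, sequence: str) -> bool:
    """True when the declared identifier matches the one recomputed from the sequence."""
    return declared_hash == content_hash(sequence, length=len(declared_hash))


anchor_sequence = "ATGCGTACGATC"
declared = content_hash(anchor_sequence)
ok = hash_is_consistent(declared, "atgcgt\nacgatc")      # same content, different layout
tampered = hash_is_consistent(declared, "ATGCGTACGATT")  # one base changed
```

The property that matters is that the identifier is a pure function of the content, so two people who define an OTU on the same anchor get the same identifier.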
Overlap Detection:
The overlap detection feature identifies OTUs that may have overlapping inclusion criteria using a 1D linear model. Each OTU defines a "footprint" extending from its anchor sequence, and overlaps are detected when these footprints intersect.
Detection Modes:
- Simple mode: Brute-force pairwise comparison of all OTU anchor sequences
- BLAST mode: High-performance candidate selection using NCBI BLAST+ with optimized thresholds
Core Features:
- Reference set support: Compare validation OTUs against separate reference set instead of self-comparison
- 1D linear analysis: Maps OTUs to line segments for intuitive geometric overlap calculation
- Coverage analysis: Shows what percentage of each OTU's footprint overlaps with the other
- Special case detection: Identifies identical anchors and complete containment scenarios
- Mixed algorithms: Warns when OTUs use different identity algorithms (STD vs ADJ)
- Overlap factor: Adjusts sensitivity (1.0 = mathematical threshold, >1.0 = more sensitive)
- Algorithm compatibility: Compares OTUs within same major algorithm version
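One plausible reading of the 1D linear model can be sketched as follows. It is an assumption-laden simplification: each OTU is treated as a segment centred on its anchor with radius `100 - cutoff`, the overlap factor is assumed to scale both radii, and the class and function names are hypothetical. It classifies the relationship only and does not attempt to reproduce the coverage percentages that otunify-validate reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    name: str
    cutoff: float  # identity cutoff in percent, e.g. 99.5

    @property
    def radius(self) -> float:
        """Maximum distance (percent) from the anchor that still counts as included."""
        return 100.0 - self.cutoff


def classify_overlap(a: Footprint, b: Footprint, anchor_distance: float,
                     overlap_factor: float = 1.0) -> str:
    """Classify how two footprints relate, given the distance between their anchors."""
    ra = a.radius * overlap_factor  # assumption: the factor widens each footprint
    rb = b.radius * overlap_factor
    if anchor_distance == 0:
        return "identical anchors"
    if anchor_distance + min(ra, rb) < max(ra, rb):
        return "complete containment"
    if anchor_distance < ra + rb:
        return "partial overlap"
    return "no overlap"


partial = classify_overlap(Footprint("OTU_A", 99.5), Footprint("OTU_B", 99.0), 0.5)
contained = classify_overlap(Footprint("OTU_C", 99.0), Footprint("OTU_D", 99.8), 0.2)
```

Raising `overlap_factor` above 1.0 widens both footprints in this sketch, which is why more pairs are flagged at higher sensitivity.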
BLAST Mode Optimizations:
- Batch processing: Single BLAST query for all validation OTUs (dramatically reduces overhead)
- Multi-threading: Automatic detection and use of available CPU cores
- Dynamic thresholds: Intelligent BLAST identity thresholds based on OTU cutoffs
- Algorithm-aware safety factors: 4x buffer for ADJ algorithm, 2x for STD algorithm
- Hash collision detection: Validates sequence identity during deduplication
- Database caching: Reuses BLAST databases across multiple validation runs
- Performance scaling: 5-50× fewer candidate comparisons in the benchmarks listed under Performance Comparison, with larger gains for larger reference sets
Reference Set Usage:
By default, overlap detection compares OTUs within the validation set (O(n²) comparisons). When --reference-set is specified, validation OTUs are compared against the reference set instead (O(m×n) comparisons), where:
- Validation set: OTUs being validated (from PATHS arguments)
- Reference set: Trusted OTUs to check against (from --reference-set)
- Use cases:
  - Check new OTUs against established database
  - Validate subset of large collection against full dataset
  - Quality control with known-good reference OTUs
Performance Comparison:
| Dataset Size | Simple Mode | BLAST Mode | Improvement |
|---|---|---|---|
| 30 validation × 30 reference | 900 comparisons | ~180 comparisons | 5× reduction |
| 609 validation × 1,103 reference | 671,727 comparisons | ~13,107 comparisons | 51× reduction |
| Execution time | O(m×n) | O(m×avg_candidates) | 5-50× faster |
BLAST mode provides significant performance benefits for large reference sets while maintaining identical overlap detection accuracy.
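The figures in the table follow from simple arithmetic, which the short Python check below reproduces. The BLAST-mode candidate counts are taken from the table as given; the function names are hypothetical.

```python
def simple_mode_comparisons(validation_otus: int, reference_otus: int) -> int:
    """Simple mode compares every validation OTU with every reference OTU."""
    return validation_otus * reference_otus


def reduction_factor(simple_comparisons: int, blast_comparisons: int) -> float:
    """How many times fewer comparisons BLAST mode performs."""
    return simple_comparisons / blast_comparisons


small = simple_mode_comparisons(30, 30)      # 900
large = simple_mode_comparisons(609, 1103)   # 671727

small_gain = reduction_factor(small, 180)    # 5.0
large_gain = reduction_factor(large, 13107)  # about 51.2
```

The gain grows with the reference set because BLAST narrows each query to a short candidate list instead of scanning every reference anchor.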
Example overlap detection output:
[WARNING] OTU:OTU_A-HASH123456 vs OTU_B-HASH789012: Partial overlap: anchor distance 0.5%
(thresholds: A≥99.5%, B≥99.0%) Coverage: A≈45%, B≈30%
[WARNING] OTU:OTU_C-HASH345678 vs OTU_D-HASH901234: Complete containment (D in C): anchor distance 0.2%
(thresholds: C≥99.0%, D≥99.8%) Coverage: C≈15%, D≈100%
Matches query sequences against OTU definitions to determine inclusion/exclusion.
otunify-match [OPTIONS] QUERY_FASTA REFERENCE_PATHS...
Key Options:
- --mode [simple|blast]: Performance mode (use blast for large reference sets)
- --nearby-distance FLOAT: Find matches within additional distance
- --output-format [table|tsv|json]: Output format
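The inclusion decision that otunify-match reports can be pictured as a threshold comparison. The Python sketch below is an illustration under assumptions: identity to the anchor is already known, `--nearby-distance` is read as widening the cutoff by that many percentage points, and the "NEARBY" label and the function name are hypothetical rather than actual tool output.

```python
def classify_match(identity: float, cutoff: float, nearby_distance: float = 0.0) -> str:
    """Compare a query's percent identity to an OTU anchor against the OTU's cutoff."""
    if identity >= cutoff:
        return "INCLUDED"
    # Assumed meaning of --nearby-distance: report near misses within this extra margin.
    if identity >= cutoff - nearby_distance:
        return "NEARBY"
    return "EXCLUDED"


# Values from the Russula example: a 99.2% cutoff with queries at 99.6% and 98.8%.
first = classify_match(99.6, 99.2)                        # "INCLUDED"
second = classify_match(98.8, 99.2)                       # "EXCLUDED"
near = classify_match(98.8, 99.2, nearby_distance=0.5)    # "NEARBY"
```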
Classifies sequences to Operational Species Concepts (temporary codes).
otunify-classify [OPTIONS] QUERIES PATHS...
Key Options:
- --max-classifications INTEGER: Show alternative classifications
- --show-otu-details: Include underlying OTU match details
- --mode [simple|blast]: Performance mode
Creates OSC definitions linking temporary codes to OTUs.
otunify-describe [OPTIONS] OUTPUT_FILE
Key Options:
- --provisional-name TEXT: The temporary code (e.g., "Russula sp. 'brevipes-CA01'")
- --otu TEXT: OTU reference(s) to include
- --description TEXT: Human-readable description
- --notes TEXT: Additional expert observations
- --interactive: Interactive mode for guided creation
OTU definition files are named automatically as: {prefix}-{hash}-{cutoff}-{algorithm}-{version}.fasta
Example: Galerina_marginata-KMQ537FKVP-99.1-ADJ-0.1.fasta
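Because the filename pattern is regular, downstream scripts can recover the fields without opening the file. The following Python sketch is one way to do it. It assumes the last four hyphen-separated fields never contain hyphens themselves; `parse_otu_filename` and `OtuFilename` are hypothetical names, not part of OTUnify.

```python
from typing import NamedTuple


class OtuFilename(NamedTuple):
    prefix: str
    otu_hash: str
    cutoff: float
    algorithm: str
    version: str


def parse_otu_filename(filename: str) -> OtuFilename:
    """Split '{prefix}-{hash}-{cutoff}-{algorithm}-{version}.fasta' into its fields."""
    suffix = ".fasta"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .fasta file: {filename}")
    stem = filename[: -len(suffix)]
    # Split from the right so a hyphen inside the prefix would not break parsing.
    parts = stem.rsplit("-", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected OTU filename: {filename}")
    prefix, otu_hash, cutoff, algorithm, version = parts
    return OtuFilename(prefix, otu_hash, float(cutoff), algorithm, version)


parsed = parse_otu_filename("Galerina_marginata-KMQ537FKVP-99.1-ADJ-0.1.fasta")
# parsed.prefix == "Galerina_marginata", parsed.cutoff == 99.1, parsed.algorithm == "ADJ"
```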
Excluded Sequences: When sequences are filtered by outlier removal, excluded sequences are written to:
excluded.{otu-filename}.fasta
Reoriented Sequences: When sequences are reverse-complemented during processing, the original orientations are saved to:
reoriented.{otu-filename}.fasta
These auxiliary files are automatically ignored during directory validation with otunify-validate but can still be validated directly if specified.
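The percentile filter that produces the excluded.* file can be illustrated with a nearest-rank percentile over distances to the anchor. This is a sketch under assumptions: OTUnify's exact percentile definition is not stated here, the distances are made up, and `split_outliers` is a hypothetical name.

```python
import math


def split_outliers(distances: dict[str, float], percentile: float) -> tuple[set[str], set[str]]:
    """Return (kept, excluded) names, excluding distances above the given percentile."""
    if not distances:
        return set(), set()
    ordered = sorted(distances.values())
    # Nearest-rank percentile: the smallest value with at least `percentile` percent at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    threshold = ordered[rank - 1]
    kept = {name for name, d in distances.items() if d <= threshold}
    excluded = set(distances) - kept
    return kept, excluded


# Hypothetical distances (percent) from each sequence to the anchor.
toy_distances = {"s1": 0.1, "s2": 0.2, "s3": 0.2, "s4": 0.3, "s5": 3.0}
kept, excluded = split_outliers(toy_distances, 80)  # s5 is the only outlier
```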
STD algorithm: Uses traditional percent identity calculation via the adjusted-identity library with no corrections.
ADJ algorithm (default): Applies MycoBLAST-style adjustments for:
- Homopolymer length normalization
- IUPAC ambiguity code handling
- Repeat motif adjustment
- End trimming for sequencing artifacts
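To see why homopolymer handling matters, consider a toy Python example in which two reads differ only in the length of a T run. This is not the adjusted-identity library's algorithm: collapsing every run to a single base is a much cruder stand-in for length normalization, used here only to show the effect. Both function names are hypothetical.

```python
from itertools import groupby


def collapse_homopolymers(sequence: str) -> str:
    """Reduce every run of identical bases to a single base (crude illustration only)."""
    return "".join(base for base, _run in groupby(sequence))


def percent_identity(a: str, b: str) -> float:
    """Plain percent identity between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


read_1 = "ACGTTTTAC"  # four Ts
read_2 = "ACGTTTAC"   # three Ts, a typical homopolymer length error

collapsed_1 = collapse_homopolymers(read_1)  # "ACGTAC"
collapsed_2 = collapse_homopolymers(read_2)  # "ACGTAC"
adjusted = percent_identity(collapsed_1, collapsed_2)  # 100.0
```

An unadjusted comparison would count the extra T as a difference, while the collapsed reads are identical, which is the kind of sequencing artifact the ADJ algorithm is meant to discount.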
OTU definitions use FASTA format with structured headers:
>OTU-KMQ537FKVP Example description identity_cutoff=99.10 identity_algorithm=ADJ-0.1 format=OTUnify-0.1
ATGCGTACGATC...
See FORMAT_SPECIFICATION.md for complete technical details.
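A header in this style can be read with a few lines of Python. The sketch below assumes that metadata always appears as whitespace-separated key=value tokens and that free-text description words contain no "=" character. FORMAT_SPECIFICATION.md remains the authority on the format, and `parse_otu_header` is a hypothetical helper.

```python
def parse_otu_header(line: str) -> tuple[str, str, dict[str, str]]:
    """Split an OTU FASTA header into (otu_id, description, metadata)."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("empty FASTA header")
    otu_id = tokens[0]
    metadata: dict[str, str] = {}
    description_words: list[str] = []
    for token in tokens[1:]:
        key, separator, value = token.partition("=")
        if separator and key and value:
            metadata[key] = value
        else:
            description_words.append(token)
    return otu_id, " ".join(description_words), metadata


header = (">OTU-KMQ537FKVP Example description identity_cutoff=99.10 "
          "identity_algorithm=ADJ-0.1 format=OTUnify-0.1")
otu_id, description, metadata = parse_otu_header(header)
cutoff = float(metadata["identity_cutoff"])  # 99.1
```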
North American citizen science fungal barcoding has evolved a sophisticated system using "temporary code names" (Russell, 2025) to handle the gap between DNA sequence data and formal taxonomy:
- Current process: MycoMap BLAST search → Human review → Expert decision about "close enough"
- Scale challenge: >10,000 putative species with temporary codes, >70,000 barcoded observations
- Consistency issues: Different reviewers make different decisions about sequence inclusion
- Documentation gap: No record of why specific identity thresholds were chosen
Current Infrastructure:
- MycoMap.com: Central database with ~100,000 validated sequences and temporary codes, includes BLAST search functionality
- iNaturalist: >70,000 DNA-barcoded fungal observations (1% of all US fungal observations)
- Community labs: Local DNA sequencing facilities supporting citizen science
- Temporary codes: Can be created instantly by labeling new sequences
Temporary Code System:
- Polyphasic delimitations: DNA + morphology + ecology + geography + phenology
- Dynamic clustering: Based on barcode gaps, not arbitrary thresholds
- Examples: Amanita sp. 'IN01', Hygrocybe sp. 'conica-MI03'
- Goal: Bridge data collection and formal taxonomy
OTUnify provides a two-layer system that transforms subjective temporary code decisions into explicit, reproducible methodology:
Layer 1: Molecular OTUs (mOTUs)
- Formal circumscription: Precise sequence-based boundaries using anchor sequences and identity thresholds
- Content-addressable: Hash-based identifiers ensure reproducible results
- Conservative defaults: Start with tight boundaries, require explicit decisions to expand
- Current implementation: Available now via otunify-create
Layer 2: Operational Species Concepts (OSCs) (basic functionality implemented; full features in development)
- Polyphasic integration: Map temporary codes/binomials to one or more mOTUs
- Many-to-many relationships: Handle multiple barcode regions, haplotype variants, cryptic species
- Structured metadata: Geographic distribution, synonyms, ecological data
- Relationship types:
  - discriminated-by: {mOTU} - Enables automatic temporary code assignment
  - indicated-by: {mOTU} - Suggests species concept as strong candidate match
This separation allows the molecular layer (formal circumscription) to remain stable while the species concept layer (empirical choices about useful taxonomic units) can evolve with new evidence and community consensus.
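The two relationship types can be modelled as a small data structure that sits on top of mOTU matches. The Python sketch below is hypothetical throughout: the class, the function, and the mOTU identifiers are invented for illustration, and it does not describe the OSC file format or otunify-classify's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class SpeciesConcept:
    provisional_name: str
    discriminated_by: set[str] = field(default_factory=set)  # mOTUs that assign the code outright
    indicated_by: set[str] = field(default_factory=set)      # mOTUs that only suggest the code


def classify(matched_otus: set[str], concepts: list[SpeciesConcept]) -> dict[str, list[str]]:
    """Turn the set of mOTUs a query falls into into assigned and candidate temporary codes."""
    assigned = [c.provisional_name for c in concepts if c.discriminated_by & matched_otus]
    candidates = [
        c.provisional_name
        for c in concepts
        if c.indicated_by & matched_otus and c.provisional_name not in assigned
    ]
    return {"assigned": assigned, "candidates": candidates}


concepts = [
    SpeciesConcept("Cortinarius sp. 'MI03'",
                   discriminated_by={"Cortinarius-HASH123"},
                   indicated_by={"Cortinarius-HASH456"}),
    SpeciesConcept("Cortinarius sp. 'OR42'",
                   discriminated_by={"Cortinarius-HASH456"}),
]
result = classify({"Cortinarius-HASH456"}, concepts)
```

Keeping the mapping separate from the mOTU definitions means a code can gain or lose mOTUs without touching the molecular boundaries themselves.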
While global databases like UNITE serve the academic community with Species Hypotheses for formal identification, OTUnify serves the North American citizen science community's immediate need for reproducible temporary code methodology. These systems address different scales, user communities, and operational requirements, with potential for future integration as the field matures.
git clone https://github.com/joshuaowalker/otunify.git
cd otunify
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -e ".[dev]"
For high-performance BLAST mode:
# macOS
brew install blast
# Ubuntu/Debian
sudo apt-get install ncbi-blast+
# Windows: Download from NCBI BLAST+ website
# Test basic functionality
otunify-create test_data/sample_sequences.fasta output.fasta --vanity-prefix "TestOTU"
# Test overlap detection
otunify-validate --check-overlaps simple test_data/
# Test BLAST mode
otunify-validate --check-overlaps blast --reference-set test_data/ validation/
We welcome contributions! This is an alpha release focused on validating the conceptual model. We especially welcome:
- Testing with your sequence data
- Feedback on the temporary code workflow
- Use case documentation
- Bug reports and feature requests
Please submit issues and pull requests via GitHub.
If you use OTUnify in your research, please cite:
Walker, J.O. (2024). OTUnify: A version-controlled framework for fungal DNA barcode OTU management.
GitHub: https://github.com/joshuaowalker/otunify
Please also cite the North American temporary code system and related work:
Russell, S.D., Birkebak, J., Burzynski, T., Canan, K., D'Elia, G., Geurin, Z., Hunt, B., Jacob, S.,
Mueller, G.M., Ospina, S., Ostuni, S., Peace, R., Quark, M., Reitan, A., Rockefeller, A., Singer, H.,
Walker, J., Williams, J. (2025). Approaching Full-Scale DNA Barcoding for North American Macrofungi:
Highlights from the MycoMap Network. Inoculum 76(3):17-22. Newsletter of the Mycological Society of America.
Russell, S. (2025). Using Temporary Code Names for Documenting Macrofungi.
Retrieved from https://mycotalab.substack.com/p/using-temporary-code-names-for-documenting
Russell, S. (2025). FAQ: Temporary Code Names for Macrofungi - How Temporary Code Names Help Us Map Fungal Biodiversity.
Retrieved from https://mycotalab.substack.com/p/faq-temporary-code-names-for-macrofungi
When appropriate, also cite global fungal databases:
Abarenkov, K., Nilsson, R.H., Larsson, K.H., Taylor, A.F.S., May, T.W., Frøslev, T.G., Pawlowska, J.,
Lindahl, B., Põldmaa, K., Truong, C., Vu, D., Hosoya, T., Niskanen, T., Piirmann, T., Ivanov, F.,
Zirk, A., Peterson, M., Cheeke, T.E., Ishigami, Y., Jansson, A.T., Jeppesen, T.S., Kristiansson, E.,
Mikryukov, V., Miller, J.T., Oono, R., Ossandon, F.J., Paupério, J., Saar, I., Schigel, D., Suija, A.,
Tedersoo, L., Kõljalg, U. (2024). The UNITE database for molecular identification and taxonomic
communication of fungi and other eukaryotes: sequences, taxa and classifications reconsidered.
Nucleic Acids Research, 52(D1), D791-D797.
- GitHub Issues: https://github.com/joshuaowalker/otunify/issues
- Discussions: https://github.com/joshuaowalker/otunify/discussions
BSD 2-Clause License. See LICENSE file for details.