Note: This is ALIGN 2.0, a complete revamp of the original ALIGN package. For information about the original version (still available on PyPI), see the original GitHub repository and the 2019 Psychological Methods paper by Duran, Paxton, & Fusaroli.
ALIGN 2.0 is a comprehensive Python package for measuring linguistic alignment in conversational data—the tendency of speakers to adopt similar linguistic patterns during interaction. This package provides flexible tools to quantify alignment at multiple linguistic levels:
- Semantic alignment: Using modern embedding models (BERT or FastText)
- Lexical alignment: Based on repeated words and phrases (n-grams)
- Syntactic alignment: Based on part-of-speech patterns
The package now includes enhanced preprocessing capabilities, support for multiple POS taggers, improved surrogate generation for baseline comparisons, and increased flexibility for diverse research needs.
New to ALIGN? The fastest way to get started is through our comprehensive Jupyter notebook tutorials:
- Tutorial 1: Preprocessing. Learn how to transform raw conversational transcripts into an analysis-ready format.
- Tutorial 2: Alignment Analysis. Discover how to measure linguistic alignment at multiple levels.
- Tutorial 3: Baseline/Surrogate Analysis. Generate surrogate conversation pairs to establish chance-level baseline alignment for statistical comparison with real conversations.
These tutorials provide hands-on, step-by-step guidance using real conversational data and are the recommended starting point for all users.
Important: ALIGN 2.0 is not yet on PyPI. The original ALIGN remains available there, but to use ALIGN 2.0, you must install from GitHub.
- Python 3.7+ (tested with Python 3.13)
- pip (Python package installer)
git clone https://github.com/nickduran/align2-linguistic-alignment.git
cd align2-linguistic-alignment
pip install -r requirements.txt

Installing in editable mode allows you to modify the code and see changes immediately:

pip install -e .

ALIGN 2.0 represents a complete modernization while maintaining compatibility with the original methodology (Duran, Paxton, & Fusaroli, 2019). Here's what's different:
✅ Modern language models: Native integration of BERT for contextualized semantic embeddings—no manual model downloads required
✅ Dramatically faster processing: spaCy POS tagging is 100-200x faster than Stanford tagger, with minimal accuracy trade-offs
✅ Streamlined setup: No external dependencies to manually download (Stanford tagger optional)—BERT and FastText models download automatically via Hugging Face and Gensim
✅ Enhanced surrogate generation: More flexible baseline creation with better control over pairing logic and turn order preservation
✅ Multiple POS tagger support: Choose between NLTK, spaCy, or Stanford taggers, or compare them side-by-side in the same analysis
✅ Robust preprocessing pipeline: Comprehensive data validation, spell-checking, and cleaning with detailed error messages when issues arise
✅ Progress visualization: Real-time progress bars (via tqdm) for all long-running operations—no more wondering if it's working
✅ Production-ready code: Full type hints, comprehensive docstrings, and modular architecture for easier extension and debugging
✅ Interactive tutorials: Step-by-step Jupyter notebooks with real conversational data to get you started quickly
The fundamental alignment calculations remain consistent with the original ALIGN:
- Cosine similarity for semantic and lexical alignment
- N-gram based lexical and syntactic analysis
- Surrogate generation for baseline comparisons
- Turn-by-turn directionality tracking
- Support for multiple lag values
ALIGN 2.0 uses a clear two-phase approach:
Transforms raw conversational data into analysis-ready format:
from align_test.prepare_transcripts import prepare_transcripts
# Basic usage: NLTK only (fastest, default)
prepare_transcripts(
input_files="path/to/raw/transcripts", # Directory containing .txt files
output_file_directory="path/to/preprocessed", # Where to save processed files
run_spell_check=True, # Enable spell-checking
minwords=2, # Minimum words per turn
use_filler_list=None # Use default filler removal
)
# With spaCy tagging (recommended for speed)
prepare_transcripts(
input_files="path/to/raw/transcripts",
output_file_directory="path/to/preprocessed",
run_spell_check=True,
minwords=2,
add_additional_tags=True, # Enable additional tagger
tagger_type='spacy', # Use spaCy for additional tags
spacy_model='en_core_web_sm' # spaCy model to use
)
# With Stanford tagging (slowest, most accurate)
prepare_transcripts(
input_files="path/to/raw/transcripts",
output_file_directory="path/to/preprocessed",
run_spell_check=True,
minwords=2,
add_additional_tags=True, # Enable additional tagger
tagger_type='stanford', # Use Stanford for additional tags
stanford_pos_path="/path/to/stanford-postagger-full-2020-11-17/",
stanford_language_path="models/english-left3words-distsim.tagger"
)

Important Notes:

- Input files must be tab-delimited with columns named `participant` and `content`
- NLTK tagging is always included (base columns: `tagged_token`, `tagged_lemma`)
- Additional tagger columns are added only if `add_additional_tags=True`:
  - spaCy: `tagged_spacy_token`, `tagged_spacy_lemma`
  - Stanford: `tagged_stanford_token`, `tagged_stanford_lemma`
- Text cleaning (non-ASCII removal, filler removal) happens automatically
Key Parameters:
- `input_files`: Directory path containing raw transcript files
- `output_file_directory`: Directory where preprocessed files will be saved
- `run_spell_check`: Enable/disable automatic spell-checking (default: True)
- `minwords`: Minimum number of words per turn (shorter turns are removed; default: 2)
- `add_additional_tags`: Add a second set of POS tags beyond NLTK (default: False)
- `tagger_type`: Which additional tagger to use, `'spacy'` or `'stanford'` (default: `'stanford'`)
- `use_filler_list`: Custom list of fillers to remove (None = use default regex)
Output Files: Each conversation produces a processed file with these columns:
- `participant`: Speaker IDs
- `content`: Cleaned utterance text
- `token`: Tokenized words (string representation of a list)
- `lemma`: Lemmatized tokens (string representation of a list)
- `tagged_token`: NLTK POS-tagged tokens (string representation of a list of tuples)
- `tagged_lemma`: NLTK POS-tagged lemmas (string representation of a list of tuples)
- `tagged_spacy_token` / `tagged_spacy_lemma`: spaCy POS-tagged tokens/lemmas (if `add_additional_tags=True` and `tagger_type='spacy'`)
- `tagged_stanford_token` / `tagged_stanford_lemma`: Stanford POS-tagged tokens/lemmas (if `add_additional_tags=True` and `tagger_type='stanford'`)
Calculates alignment metrics on preprocessed data:
from align_test.alignment import LinguisticAlignment
# Initialize analyzer
analyzer = LinguisticAlignment(
alignment_types=["bert", "fasttext", "lexsyn"]
)
# Analyze conversations
results = analyzer.analyze_folder(
folder_path="path/to/preprocessed/files",
output_directory="path/to/results",
lag=1,
# FastText parameters
high_sd_cutoff=3,
low_n_cutoff=2,
# Lexical/syntactic parameters
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True # Use additional POS tags if available
)

Key Parameters:

- `lag`: Number of turns between paired utterances (default: 1)
- `max_ngram`: Maximum n-gram size for lexical/syntactic analysis (default: 2)
- `ignore_duplicates`: Remove lexical overlap from syntactic patterns (default: True)
- `add_additional_tags`: Whether to use additional POS tags from preprocessing (default: False)
- `additional_tagger_type`: Which additional tagger columns to use, `'spacy'` or `'stanford'`
- `high_sd_cutoff`: FastText high-frequency word filter (default: 3)
- `low_n_cutoff`: FastText low-frequency word filter (default: 1)
Note: The `add_additional_tags` and `additional_tagger_type` parameters tell the analyzer which POS tag columns to use from your preprocessed files. These must match what you created during preprocessing.
Measures semantic similarity using contextualized embeddings:
analyzer = LinguisticAlignment(alignment_type="bert")
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
model_name="bert-base-uncased", # or other BERT variants
lag=1
)

Setup Required: See the Hugging Face Token Setup section below.

Output: `semantic_alignment_bert-base-uncased_lag1.csv`
Measures semantic similarity using static word embeddings:
analyzer = LinguisticAlignment(alignment_type="fasttext")
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
model_name="fasttext-wiki-news-300",
high_sd_cutoff=3, # Filter high-frequency words
low_n_cutoff=2, # Filter rare words
save_vocab=True, # Save vocabulary lists
lag=1
)

Output: `semantic_alignment_fasttext-wiki-news-300_lag1_sd3_n2.csv`
Measures word and phrase repetition (lexical) and grammatical structure reuse (syntactic):
analyzer = LinguisticAlignment(alignment_type="lexsyn")
# Using NLTK tags only
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
max_ngram=3, # Analyze uni-, bi-, and trigrams
ignore_duplicates=True, # Remove lexical overlap from syntax
add_additional_tags=False, # Use only NLTK tags
lag=1
)
# Using additional tags (spaCy)
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True, # Use additional tags
additional_tagger_type='spacy', # Specify which additional tagger
lag=1
)
# Using additional tags (Stanford)
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True,
additional_tagger_type='stanford',
lag=1
)

Output Filenames:

- With NLTK only: `lexsyn_alignment_ngram3_lag1_noDups_noAdd.csv`
- With spaCy: `lexsyn_alignment_ngram3_lag1_noDups_withSpacy.csv`
- With Stanford: `lexsyn_alignment_ngram3_lag1_noDups_withStan.csv`
Key Parameters:
- `max_ngram`: Maximum n-gram size (2 = bigrams, 3 = trigrams, etc.)
- `ignore_duplicates`: If `True`, removes syntactic patterns that share lexical content
- `add_additional_tags`: Whether to use additional POS tags (default: False)
- `additional_tagger_type`: Which additional tagger to use, `'spacy'` or `'stanford'`
Run all analyses together:
analyzer = LinguisticAlignment(
alignment_types=["bert", "fasttext", "lexsyn"]
)
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="results/",
lag=1,
# FastText-specific
high_sd_cutoff=3,
low_n_cutoff=2,
# Lexsyn-specific
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True,
additional_tagger_type='spacy'
)

Output: Individual CSV files for each analysis type, plus `merged_alignment_results_lag1.csv`
The lag parameter controls which utterances are paired for alignment calculation:
- `lag=1` (default): Each utterance is paired with the immediately following one
- `lag=2`: Each utterance is paired with the utterance 2 positions later
- `lag=3`: Each utterance is paired with the utterance 3 positions later
Example with lag=1:
Turn 1: "I love pizza" → paired with → Turn 2: "Me too"
Turn 2: "Me too" → paired with → Turn 3: "What's your favorite?"
Turn 3: "What's your favorite?" → paired with → Turn 4: "Pepperoni"
This allows analysis of alignment at different conversational distances.
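The pairing rule above can be sketched in a few lines. This is an illustrative sketch of the concept, not the package's internal implementation:

```python
# Illustrative sketch of lag-based pairing (not the package's internal code):
# each utterance is paired with the utterance `lag` positions later.
def pair_turns(turns, lag=1):
    return [(turns[i], turns[i + lag]) for i in range(len(turns) - lag)]

turns = ["I love pizza", "Me too", "What's your favorite?", "Pepperoni"]
for lead, follow in pair_turns(turns, lag=1):
    print(f"{lead!r} -> {follow!r}")
```

With `lag=2`, the same list yields only two pairs, since the last two turns have no partner two positions later.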
Surrogate analysis creates artificial conversation pairs to establish chance-level baseline alignment. This is crucial for determining whether observed alignment exceeds what would occur randomly.
- Takes participants from different real conversations
- Pairs them to create artificial dyads
- Calculates alignment metrics for these fabricated pairs
- Provides baseline for statistical comparison
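The core idea can be illustrated with a toy sketch (hypothetical dyad IDs; this is not the package's internal generator):

```python
import itertools

# Toy sketch of surrogate pairing: take speaker A from one real dyad and
# speaker B from a *different* real dyad to form an artificial pair.
def surrogate_pairs(dyads):
    """dyads maps dyad_id -> (participant_A, participant_B)."""
    return [(a1, b2)
            for (_, (a1, _)), (_, (_, b2)) in itertools.combinations(dyads.items(), 2)]

real_dyads = {"T104": ("T104-A", "T104-B"),
              "T107": ("T107-A", "T107-B"),
              "T112": ("T112-A", "T112-B")}
for pair in surrogate_pairs(real_dyads):
    print(pair)
```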
The surrogate generator needs to parse your filenames to identify participants and experimental conditions. Configure these parameters to match your naming scheme:
from align_test.alignment import LinguisticAlignment
analyzer = LinguisticAlignment(alignment_types=["bert", "lexsyn"])
# Analyze baseline alignment
baseline_results = analyzer.analyze_baseline(
input_files="preprocessed_data/",
output_directory="results/baseline/",
# Surrogate configuration
id_separator="_", # Character separating filename parts
dyad_label="ASU-", # Prefix identifying participant/dyad IDs
condition_label="ExpBlock", # Prefix identifying experimental conditions
all_surrogates=False, # Generate subset (True = all combinations)
keep_original_turn_order=True, # Maintain sequential turn order (recommended)
# Analysis parameters (must match your real data analysis)
lag=1,
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True,
additional_tagger_type='spacy'
)

Filenames: `ASU-T104_ExpBlock2-TrunkSlide.txt`
surrogate_params = {
"id_separator": "_",
"dyad_label": "ASU-",
"condition_label": "ExpBlock"
}

Parsing logic:

- `ASU-T104` → Dyad ID (T104)
- `ExpBlock2` → Condition (2)
- Surrogates only pair participants from the same condition
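The parsing these parameters imply can be sketched as follows (a hedged approximation; the package's actual parser may differ):

```python
# Sketch of the implied parsing: split the filename on id_separator, then
# strip the dyad_label and condition_label prefixes from the two parts.
def parse_filename(filename, id_separator="_", dyad_label="ASU-",
                   condition_label="ExpBlock"):
    stem = filename.rsplit(".", 1)[0]
    dyad_part, cond_part = stem.split(id_separator)[:2]
    dyad = dyad_part[len(dyad_label):]
    # The condition field may carry a suffix, e.g. "ExpBlock2-TrunkSlide"
    condition = cond_part[len(condition_label):].split("-")[0]
    return dyad, condition

print(parse_filename("ASU-T104_ExpBlock2-TrunkSlide.txt"))  # ('T104', '2')
```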
Filenames: `dyad23_condition1.txt`
surrogate_params = {
"id_separator": "_",
"dyad_label": "dyad",
"condition_label": "condition"
}

Filenames: `time191-cond1.txt`
surrogate_params = {
"id_separator": "-",
"dyad_label": "time",
"condition_label": "cond"
}

- `all_surrogates`:
  - `False`: Generate a representative subset (faster, usually sufficient)
  - `True`: Generate every possible pairing (computationally expensive)
- `keep_original_turn_order`:
  - `True`: Preserve sequential turn order (recommended; maintains temporal structure)
  - `False`: Randomly shuffle turns (less conservative baseline)
If you've already generated surrogate files:
baseline_results = analyzer.analyze_baseline(
input_files="preprocessed_data/",
output_directory="results/baseline/",
use_existing_surrogates="path/to/surrogate/files/",
lag=1,
# Include other analysis parameters as needed
max_ngram=3,
add_additional_tags=True,
additional_tagger_type='spacy'
)

Baseline analysis generates files with a `baseline_` prefix:

- `baseline_alignment_bert-base-uncased_lag1.csv`
- `baseline_alignment_fasttext_lag1_sd3_n2.csv`
- `baseline_alignment_lexsyn_ngram3_lag1_noDups_withSpacy.csv`
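For downstream statistics, real and baseline results can be stacked into one frame with a label column. A short sketch using pandas and assumed column names; in practice the frames would come from `pd.read_csv` on the output files:

```python
import pandas as pd

def combine_for_modeling(real, baseline):
    """Label and stack real vs. baseline results for statistical comparison."""
    return pd.concat([real.assign(data_type="real"),
                      baseline.assign(data_type="baseline")],
                     ignore_index=True)

# Tiny in-memory frames standing in for the real and baseline CSVs:
real = pd.DataFrame({"file": ["dyad1.txt"], "semantic_similarity": [0.62]})
base = pd.DataFrame({"file": ["sur1.txt"], "semantic_similarity": [0.41]})
combined = combine_for_modeling(real, base)
print(combined)
```

The resulting `data_type` column maps directly onto the `data_type` predictor in the mixed-effects example shown later.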
ALIGN 2.0 uses descriptive filenames that encode analysis parameters:
Format: `[prefix]_alignment_[model]_lag[N]_[params].csv`
Examples:
- `semantic_alignment_bert-base-uncased_lag1.csv`
- `semantic_alignment_fasttext-wiki-news-300_lag1_sd3_n2.csv`
- `lexsyn_alignment_ngram3_lag1_noDups_withSpacy.csv`
- `baseline_alignment_bert-base-uncased_lag1.csv`
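If you need to recover analysis parameters from a batch of result files, the naming convention is regular enough to parse. An illustrative regex, not an official parser shipped with the package:

```python
import re

# Illustrative pattern for "[prefix]_alignment_[model]_lag[N]_[params].csv"
PATTERN = re.compile(
    r"^(?P<prefix>[a-z]+)_alignment_(?P<model>.+)_lag(?P<lag>\d+)(?P<params>(?:_.+)?)\.csv$"
)

m = PATTERN.match("semantic_alignment_fasttext-wiki-news-300_lag1_sd3_n2.csv")
print(m.group("prefix"), m.group("model"), m.group("lag"), m.group("params"))
```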
All alignment result files share these core columns:
| Column | Description |
|---|---|
| `file` | Source conversation filename |
| `order` | Turn sequence number (0-indexed) |
| `direction` | Who follows whom (e.g., "PA→PB" or "PB→PA") |
| `participant_lead` | ID of the leading speaker |
| `participant_follow` | ID of the following speaker |
| `turn_lead` | Utterance from the leading speaker |
| `turn_follow` | Utterance from the following speaker |
Alignment scores (additional columns depend on analysis type):
- BERT/FastText: `semantic_similarity` (range: -1 to 1, typically 0.3-0.9)
- Lexical: `lexical_sim_ngram[N]` for each n-gram size (range: 0-1)
- Syntactic: `syntactic_sim_ngram[N]` for each n-gram size (range: 0-1)
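A typical first step with these files is aggregating turn-level scores to the conversation level. A hedged pandas example using the column names documented above; the inline DataFrame stands in for a loaded results CSV:

```python
import pandas as pd

# Stand-in for pd.read_csv("results/semantic_alignment_bert-base-uncased_lag1.csv")
df = pd.DataFrame({
    "file": ["dyad1.txt", "dyad1.txt", "dyad2.txt"],
    "direction": ["PA→PB", "PB→PA", "PA→PB"],
    "semantic_similarity": [0.62, 0.55, 0.71],
})

# Mean alignment per conversation file
per_conversation = df.groupby("file")["semantic_similarity"].mean()
print(per_conversation)
```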
When running multiple analysis types together, ALIGN generates:
merged_alignment_results_lag[N].csv
This file combines all alignment metrics in a single row per turn pair, making it easy to:
- Compare alignment across linguistic levels
- Perform multi-level statistical modeling
- Visualize relationships between alignment types
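Because the merged file carries every metric on the same row, cross-level comparisons are one-liners. A sketch with assumed column names (the inline DataFrame stands in for `pd.read_csv` on the merged file):

```python
import pandas as pd

# Stand-in for pd.read_csv("results/merged_alignment_results_lag1.csv")
merged = pd.DataFrame({
    "semantic_similarity": [0.62, 0.55, 0.71, 0.40],
    "lexical_sim_ngram1": [0.30, 0.25, 0.45, 0.10],
})

# How strongly do semantic and lexical alignment covary across turn pairs?
corr = merged["semantic_similarity"].corr(merged["lexical_sim_ngram1"])
print(f"semantic vs. lexical alignment correlation: {corr:.2f}")
```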
BERT-based semantic alignment requires a (free) Hugging Face account and access token.
- Go to huggingface.co and sign up
- Log in to your account
- Visit Settings → Access Tokens
- Click "New token"
- Name it (e.g., "ALIGN_ACCESS")
- Select "read" permission
- Generate and copy the token
Choose one method:
# Linux/Mac
export HUGGINGFACE_TOKEN="your_token_here"
# Windows Command Prompt
set HUGGINGFACE_TOKEN=your_token_here
# Windows PowerShell
$env:HUGGINGFACE_TOKEN="your_token_here"

Create `~/.config/my_package/config.json`:
{
"huggingface_token": "your_token_here"
}

Or pass the token directly when initializing:

analyzer = LinguisticAlignment(
alignment_type="bert",
token="your_token_here"
)

from align_test.alignment import LinguisticAlignment
# Test BERT with minimal data
analyzer = LinguisticAlignment(alignment_type="bert")
# Use any preprocessed conversation files
results = analyzer.analyze_folder(
folder_path="preprocessed_data/",
output_directory="test_output/",
lag=1
)
print("✓ BERT analyzer working correctly!")

If you see "401 Client Error: Unauthorized", double-check your token configuration.
from align_test.alignment import LinguisticAlignment
# Simplest possible analysis
# Note: Requires HUGGINGFACE_TOKEN environment variable for BERT
analyzer = LinguisticAlignment(alignment_type="bert")
results = analyzer.analyze_folder(
folder_path="preprocessed_conversations/", # Already preprocessed files
output_directory="results/",
lag=1 # Pair each utterance with the next one
)
print(f"Analyzed {len(results)} conversation files")
# Output: results/semantic_alignment_bert-base-uncased_lag1.csv

from align_test.alignment import LinguisticAlignment
# Initialize with all analysis types
analyzer = LinguisticAlignment(
alignment_types=["bert", "fasttext", "lexsyn"],
cache_dir="cache/" # Store downloaded models here
)
# Configure parameters (can reuse for both real and baseline)
common_params = {
"lag": 1
}
fasttext_params = {
"high_sd_cutoff": 3, # Filter high-frequency words
"low_n_cutoff": 2, # Filter rare words
"save_vocab": True # Save vocabulary to output
}
lexsyn_params = {
"max_ngram": 3, # Analyze up to 3-word phrases
"ignore_duplicates": True, # Remove lexical overlap from syntax
"add_additional_tags": True # Use spaCy/Stanford tags (must exist in preprocessed files)
}
# Analyze real conversations
real_results = analyzer.analyze_folder(
folder_path="preprocessed_conversations/",
output_directory="results/real/",
**common_params,
**fasttext_params,
**lexsyn_params
)
# Configure surrogate generation
surrogate_params = {
"id_separator": "_", # e.g., "dyad5_condition1.txt"
"dyad_label": "dyad", # Filename prefix for dyad ID
"condition_label": "condition", # Filename prefix for condition
"all_surrogates": False, # Sample ~50% of possible pairs
"keep_original_turn_order": True # Maintain temporal structure
}
# Generate baseline with surrogates
baseline_results = analyzer.analyze_baseline(
input_files="preprocessed_conversations/",
output_directory="results/baseline/",
**common_params,
**fasttext_params,
**lexsyn_params,
**surrogate_params
)
print("Analysis complete!")
print(f"Real conversations: {len(real_results)} files")
print(f"Baseline surrogates: {len(baseline_results)} files")

from align_test.alignment import LinguisticAlignment
analyzer = LinguisticAlignment(alignment_type="lexsyn")
# Analyze at different conversational distances
# Note: Each lag value requires a separate analysis run
# lag=1: consecutive turns; lag=2: skip 1 turn; lag=3: skip 2 turns
for lag_value in [1, 2, 3]:
    print(f"\nAnalyzing with lag={lag_value}...")
    results = analyzer.analyze_folder(
        folder_path="preprocessed_conversations/",
        output_directory=f"results/lag{lag_value}/",
        lag=lag_value,
        max_ngram=2,
        ignore_duplicates=True,
        add_additional_tags=True  # Use additional POS tags if available
    )
    print(f"  ✓ Results saved to: results/lag{lag_value}/")
    print(f"  Files analyzed: {len(results)}")

print("\n✓ All lag analyses complete!")
print("Compare results across lag values to see how alignment changes with distance.")

from align_test.prepare_transcripts import prepare_transcripts
from align_test.alignment import LinguisticAlignment
# ============================================================
# PHASE 1: Preprocess raw transcripts
# ============================================================
print("PHASE 1: Preprocessing raw conversations...")
print(" Input: Tab-separated files with 'participant' and 'content' columns")
print(" Output: Preprocessed files with tokens, lemmas, and POS tags\n")
preprocessed_results = prepare_transcripts(
input_files="raw_transcripts/",
output_file_directory="preprocessed/",
run_spell_check=True,
minwords=2,
add_additional_tags=True, # Add spaCy POS tags
tagger_type='spacy', # Use spaCy (fast and accurate)
spacy_model='en_core_web_sm'
)
print(f"✓ Preprocessed {len(preprocessed_results)} utterances")
print(f" Saved to: preprocessed/\n")
# ============================================================
# PHASE 2: Analyze alignment
# ============================================================
print("PHASE 2: Analyzing linguistic alignment...")
print(" Analyzers: BERT (semantic) + Lexical/Syntactic\n")
analyzer = LinguisticAlignment(alignment_types=["bert", "lexsyn"])
results = analyzer.analyze_folder(
folder_path="preprocessed/",
output_directory="results/",
lag=1,
max_ngram=3,
ignore_duplicates=True,
add_additional_tags=True # Use spaCy columns from preprocessing phase
)
print(f"\n✓ Analysis Complete! Analyzed {len(results)} files\n")
print("Output files:")
print(" - results/bert/semantic_alignment_bert-base-uncased_lag1.csv")
print(" - results/lexsyn/lexsyn_alignment_ngram3_lag1_noDups_withSpacy.csv")
print(" - results/merged_alignment_results_lag1.csv")

align2-linguistic-alignment/
├── src/ # Source code
│ └── align_test/ # Core package files
│ ├── __init__.py
│ ├── prepare_transcripts.py # Phase 1: Preprocessing
│ ├── alignment.py # Phase 2: Alignment analysis
│ ├── alignment_bert.py
│ ├── alignment_fasttext.py
│ ├── alignment_lexsyn.py
│ ├── bert_model.py
│ ├── fasttext_model.py
│ ├── config.py
│ ├── surrogates.py
│ └── data/ # Sample data
│ ├── gutenberg.txt # Spell-check corpus
│ ├── CHILDES/ # 20 properly formatted input files for preprocessing
├── tutorials/ # Example usage scripts
│ ├── TUTORIAL_README.md
│ ├── tutorial_1_preprocessing.ipynb
│ ├── tutorial_2_alignment.ipynb
│ ├── tutorial_3_baseline.ipynb
├── README.md
├── setup.py
├── requirements.txt
├── MANIFEST.in
├── LICENSE
ALIGN 2.0 uses cosine similarity to measure alignment across all linguistic levels:
- Semantic (BERT/FastText): Cosine similarity between utterance embeddings
- Lexical: Cosine similarity between n-gram frequency vectors
- Syntactic: Cosine similarity between POS n-gram frequency vectors
Why cosine similarity?
- Interpretable: Values from -1 to 1 (or 0 to 1 for n-grams)
- Length-normalized: Controls for utterance length differences
- Established: Widely used in NLP and information retrieval
- Consistent: Same metric across all linguistic levels
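The metric itself is simple enough to show in full. A minimal, self-contained illustration (the package computes this over embeddings or n-gram frequency vectors):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# With non-negative n-gram counts the score falls in [0, 1]:
lead = [2, 1, 0, 1]    # hypothetical n-gram counts for the leading turn
follow = [1, 1, 0, 0]  # hypothetical n-gram counts for the following turn
print(round(cosine_similarity(lead, follow), 3))
```

Identical vectors score 1, orthogonal vectors score 0, and the normalization by vector length is what controls for utterance length differences.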
ALIGN tracks alignment directionality separately:
- PA→PB: How much does PB align with PA?
- PB→PA: How much does PA align with PB?
This allows research on:
- Leader-follower dynamics
- Power relationships
- Conversational roles
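Directional comparisons fall out of a simple groupby on the documented `direction` column. A hedged sketch; the inline DataFrame stands in for a loaded results file:

```python
import pandas as pd

# Stand-in for a loaded alignment results CSV
df = pd.DataFrame({
    "direction": ["PA→PB", "PB→PA", "PA→PB", "PB→PA"],
    "semantic_similarity": [0.70, 0.50, 0.60, 0.40],
})

# Does PB align with PA more than PA aligns with PB?
by_direction = df.groupby("direction")["semantic_similarity"].mean()
print(by_direction)
```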
When comparing real vs. baseline alignment:
- Match parameters: Use identical settings for real and surrogate analyses
- Aggregate appropriately: Consider conversation-level or turn-level aggregation
- Account for non-independence: Turns within conversations are related
- Use mixed-effects models: Account for random effects (e.g., dyad, individual)
Example statistical approach:
# In R with lme4
library(lme4)
model <- lmer(
alignment ~ data_type * condition + (1|dyad) + (1|turn_order),
data = combined_data
)

If you use ALIGN 2.0 in your research, please cite the original methodology paper:
@article{duran2019align,
title={ALIGN: Analyzing Linguistic Interactions with Generalizable techNiques—A Python Library},
author={Duran, Nicholas D and Paxton, Alexandra and Fusaroli, Riccardo},
journal={Psychological Methods},
year={2019},
publisher={American Psychological Association},
doi={10.1037/met0000206}
}

Note: A paper describing ALIGN 2.0 enhancements is in preparation. Check this repository for updates.
Issue: "No module named 'align_test'"
- Solution: Ensure you installed in editable mode:
pip install -e .
Issue: "401 Client Error: Unauthorized" (BERT)
- Solution: Check your Hugging Face token configuration (see setup section)
Issue: "FileNotFoundError: [Errno 2] No such file or directory"
- Solution: Verify file paths are correct and files exist
Issue: "ValueError: No valid conversation files found"
- Solution: Check that preprocessed files have the required columns (`participant`, `content`, `token`, etc.)
Issue: "KeyError: 'tagged_spacy_lemma'"
- Solution: Ensure preprocessing included `add_additional_tags=True`, or specify the correct tagger parameter
Issue: Slow BERT analysis
- Solution: BERT is compute-intensive. Consider:
- Using FastText for initial exploration
- Running on GPU-enabled machine
- Processing smaller batches
Issue: Surrogate generation fails
- Solution: Verify that `id_separator`, `dyad_label`, and `condition_label` match your filename structure exactly
- Check the tutorials for complete working examples
- Review this README carefully
- Open an issue on GitHub
- Include error messages, code snippets, and Python version
We welcome contributions! This package is under active development. To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Areas for contribution:
- Additional preprocessing options
- New alignment metrics
- Performance optimizations
- Documentation improvements
- Bug fixes
MIT License - see LICENSE file for details.
Copyright (c) 2025 Nicholas D. Duran
ALIGN 2.0 builds upon the original ALIGN methodology developed by Nicholas D. Duran, Alexandra Paxton, and Riccardo Fusaroli (2019).
The package leverages several excellent open-source projects:
- Hugging Face Transformers for BERT models
- Gensim for FastText embeddings
- NLTK for NLP utilities
- spaCy for linguistic processing
- Stanford CoreNLP for POS tagging
For questions about ALIGN 2.0:
- GitHub Issues: Open an issue
- Email: nicholas.duran@utexas.edu
For questions about the original ALIGN methodology, please refer to the 2019 paper.
Ready to start? Head to the Quick Start section or jump directly to the Phase 1 tutorial!