A specialized toolkit for detecting and fixing typos in Japanese railway station name romanizations, developed for the TrainLCD project to ensure data transparency and quality. This tool analyzes CSV datasets from the TrainLCD StationAPI repository and uses AI to identify potential spelling errors in English station names, providing a human-reviewed correction pipeline.
- Railway-specific typo detection using OpenAI GPT models prompted with Japanese station name patterns
- Context-aware analysis leveraging railway company, prefecture, and line information
- Romanization-aware filtering that ignores legitimate macron variations (ō vs ou)
- Human-supervised workflow ensuring no corrections are applied without review
- Batch processing with intelligent caching to minimize API costs
- TrainLCD integration specifically designed for railway app data management
- Python 3.13.6 (pinned via `.python-version`)
- pyenv installed
- pipenv installed
- OpenAI API key
- TrainLCD StationAPI CSV datasets (from https://github.com/TrainLCD/StationAPI/tree/dev/data):
  - `stations.csv` – main station data with romanized names
  - `lines.csv` – railway line information
  - `companies.csv` – railway company data
  - `prefectures.csv` – prefecture reference data
Note: This tool is exclusively designed for the CSV schema used in the TrainLCD StationAPI repository. Other CSV formats will not work and may cause errors.
- Install the Python version defined in `.python-version`:

      pyenv install 3.13.6

- Install dependencies via pipenv:

      pipenv install
Set your OpenAI API key before running any scripts:

    export OPENAI_API_KEY="sk-…"
Analyzes Japanese railway station name romanizations in your CSV data using OpenAI's GPT models. The tool considers context from railway companies, prefectures, and line information to identify potential spelling errors in English station names.
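Conceptually, the detector joins each station row with its line, company, and prefecture before asking the model about it. The sketch below illustrates that join with the standard library; the key columns `line_cd`, `company_cd`, and `pref_cd` are assumptions about the StationAPI schema, not guaranteed names.

```python
import csv

def load_index(path: str, key: str) -> dict[str, dict]:
    """Read a CSV into a dict keyed by the given code column."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def station_context(station: dict, lines: dict, companies: dict, prefs: dict) -> dict:
    """Attach line, company, and prefecture names to one station row.
    Missing references degrade to empty strings instead of raising."""
    line = lines.get(station.get("line_cd", ""), {})
    company = companies.get(line.get("company_cd", ""), {})
    pref = prefs.get(station.get("pref_cd", ""), {})
    return {
        "station_name_r": station.get("station_name_r", ""),
        "line": line.get("line_name", ""),
        "company": company.get("company_name", ""),
        "prefecture": pref.get("pref_name", ""),
    }
```

The enriched dict is what gives the model enough context to judge, for example, whether a romanization follows that operator's house style.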
Basic detection (most common):

    pipenv run python detect_typos.py \
      --input data/stations.csv \
      --lines data/lines.csv \
      --companies data/companies.csv \
      --prefectures data/prefectures.csv \
      --column station_name_r \
      --json-output data/typos_raw.json \
      --batch-size 80 \
      --throttle-ms 200 \
      --retry-attempts 8 \
      --retry-base-wait 0.8 \
      --retry-max-wait 20 \
      --cache .typos_cache.json \
      --log detect.log
With additional options:

    pipenv run python detect_typos.py \
      --input data/stations.csv \
      --lines data/lines.csv \
      --companies data/companies.csv \
      --prefectures data/prefectures.csv \
      --column station_name_r \
      --json-output data/typos_raw.json \
      --model gpt-4o \
      --output data/summary.csv \
      --verbose \
      --dry-run
Direct overwrite with backup:

    pipenv run python detect_typos.py \
      --input data/stations.csv \
      --lines data/lines.csv \
      --companies data/companies.csv \
      --prefectures data/prefectures.csv \
      --column station_name_r \
      --overwrite \
      --backup \
      --log detect.log
Key features:
- Railway-context analysis: Uses company, prefecture, and line data for accurate detection
- Romanization expertise: Understands Japanese station naming conventions
- Batch processing: Efficiently processes large station databases
- Smart caching: Avoids redundant API calls for previously analyzed stations
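The cache layout is internal to `detect_typos.py`, but one plausible scheme (illustrative only, not the tool's actual format) keys cached results on the model plus the full station context, so a station is re-queried only when its context or the model changes:

```python
import hashlib
import json

def cache_key(model: str, context: dict) -> str:
    """Stable key: an identical model + station context always hashes the
    same, so unchanged stations are skipped on the next run."""
    payload = json.dumps({"model": model, "context": context},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting the JSON keys makes the hash independent of dict insertion order, which is what makes incremental re-runs cheap.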
Key options:
- `--input` – path to the main CSV file
- `--lines`, `--companies`, `--prefectures` – supplemental CSVs for context
- `--column` – column name to check for typos
- `--json-output` – file to save raw detection results
- `--cache` – cache file to avoid repeated API calls
- `--log` – log file (timestamps included)
- `--model` – LLM model name (default: gpt-4o)
- `--output` – if set without `--overwrite`, write a CSV of (station_cd, original, suggestion)
- `--overwrite` – overwrite the `--column` values in the input CSV with suggestions
- `--backup` – with `--overwrite`, create a `.bak` backup of `--input`
- `--dry-run` – do not modify the CSV even if `--overwrite` is set
- `--verbose` – verbose logs to stdout
- `--batch-size` – batch size for LLM requests (default: 80)
- `--throttle-ms` – sleep in milliseconds between LLM requests (default: 200)
- `--retry-attempts` – max retry attempts on rate limits (default: 8)
- `--retry-base-wait` – base wait in seconds for backoff (default: 0.8)
- `--retry-max-wait` – max wait in seconds for backoff (default: 20.0)
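The retry flags map onto a standard capped exponential backoff. A sketch of the wait computation consistent with the defaults (whether the tool also adds jitter is not documented, so none is shown here):

```python
def backoff_wait(attempt: int, base: float = 0.8, cap: float = 20.0) -> float:
    """Wait before retry `attempt` (0-based): base * 2**attempt, capped at
    `cap` so repeated rate limits never stall a batch indefinitely."""
    return min(cap, base * (2 ** attempt))
```

With the defaults, successive waits are 0.8, 1.6, 3.2, 6.4, 12.8 seconds, then capped at 20 seconds for the remaining attempts.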
Applies intelligent filtering to remove false positives common in Japanese romanization, such as legitimate macron variations. The tool enriches detection results with full railway context to help reviewers make informed decisions.
Full context review:

    pipenv run python review_typos.py \
      --input data/typos_raw.json \
      --output data/typos_reviewed.json \
      --export-csv data/typos_reviewed.csv \
      --stations data/stations.csv \
      --lines data/lines.csv \
      --companies data/companies.csv \
      --prefectures data/prefectures.csv \
      --column station_name_r \
      --drop-macron-only \
      --drop-allcaps-only \
      --drop-case-only \
      --log review.log
Minimal review (without context files):

    pipenv run python review_typos.py \
      --input data/typos_raw.json \
      --output data/typos_reviewed.json \
      --no-drop-macron-only \
      --log review.log
Main features:
- Romanization-aware filtering: Automatically excludes legitimate macron differences (ō vs ou)
- Railway context enrichment: Adds company, prefecture, and line information to each suggestion
- Company-specific rules: Applies different standards based on railway operator conventions
- Human review support: Exports both machine-readable JSON and human-friendly CSV formats
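The macron filter can be pictured as normalizing both spellings to a long-vowel-expanded, lowercase form and dropping the suggestion when the forms agree. This sketch is an approximation of the tool's actual rules: the expansion table mirrors common Hepburn practice, and it folds case into the comparison even though the tool exposes case-only filtering as a separate flag.

```python
# Assumed macron-to-digraph expansions (Hepburn-style long vowels).
MACRONS = {"ō": "ou", "ū": "uu", "ā": "aa", "ī": "ii", "ē": "ee",
           "Ō": "ou", "Ū": "uu", "Ā": "aa", "Ī": "ii", "Ē": "ee"}

def normalize(name: str) -> str:
    """Lowercase and expand macron vowels so that ō == ou, ū == uu, etc."""
    return "".join(MACRONS.get(ch, ch.lower()) for ch in name)

def macron_only_difference(original: str, suggestion: str) -> bool:
    """True when the two spellings differ only by macrons and/or case."""
    return normalize(original) == normalize(suggestion)
```

A suggestion like "Tōkyō" → "Toukyou" is filtered out as a legitimate variation, while a genuine misspelling such as "Sinjuku" → "Shinjuku" survives for human review.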
Options:
- `--input` – path to raw suggestions JSON (from `detect_typos.py`)
- `--output` – path to save reviewed JSON
- `--export-csv` – optional path to export a human-friendly CSV list with context
- `--stations`, `--lines`, `--companies`, `--prefectures` – context files for attaching railway information
- `--column` – column name that will be fixed (used for context preview)
- `--drop-macron-only` – drop suggestions that differ only by macrons (default: true)
- `--no-drop-macron-only` – disable macron-only filtering
- `--drop-allcaps-only` – drop suggestions that differ only by ALL CAPS (default: true)
- `--no-drop-allcaps-only` – disable ALL-CAPS-only filtering
- `--drop-case-only` – drop suggestions that differ only by letter case (default: true)
- `--no-drop-case-only` – disable case-only filtering
- `--log` – log file path
Generates the final corrected railway station CSV by applying reviewed corrections to the original dataset. Ensures data integrity while updating station name romanizations.
Save to a new file:

    pipenv run python fix_typos.py \
      --input data/stations.csv \
      --typos data/typos_reviewed.json \
      --column station_name_r \
      --output data/stations_fixed.csv \
      --log fix.log

Overwrite in place:

    pipenv run python fix_typos.py \
      --input data/stations.csv \
      --typos data/typos_reviewed.json \
      --column station_name_r \
      --overwrite \
      --log fix.log
Key features:
- Safe station data updates: Validates column structure and preserves all non-target fields
- Railway data integrity: Maintains relationships between stations, lines, and companies
- Flexible output options: Create corrected copy or update original dataset in place
Options:
- `--input` – path to the original CSV file
- `--typos` – path to reviewed typos JSON
- `--column` – name of the column to update
- `--output` – path to save the fixed CSV (ignored if `--overwrite`)
- `--overwrite` – overwrite the input CSV in place
- `--log` – optional log file path
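Internally the fix step amounts to a keyed lookup-and-replace over a single column. A minimal sketch, where the record fields `station_cd`, `original`, and `suggestion` are assumptions based on the CSV layout that `detect_typos.py --output` produces:

```python
def apply_fixes(rows: list[dict], fixes: list[dict], column: str) -> int:
    """Overwrite `column` for each row whose station_cd has an approved fix.
    A fix is applied only while the current value still matches `original`,
    so stale suggestions never clobber newer data. Returns the count applied."""
    by_cd = {f["station_cd"]: f for f in fixes}
    applied = 0
    for row in rows:
        fix = by_cd.get(row.get("station_cd"))
        if fix and row.get(column) == fix["original"]:
            row[column] = fix["suggestion"]
            applied += 1
    return applied
```

Because only the target column is touched, every other field in `stations.csv` passes through unchanged, which is what preserves the station/line/company relationships.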
- `detect_typos.py` – analyzes railway station romanizations and outputs potential typo candidates
- `review_typos.py` – filters false positives and prepares human-reviewed corrections
- `fix_typos.py` – applies approved corrections to generate the final corrected station dataset
    TrainLCD StationAPI CSV data (stations.csv + lines.csv + companies.csv + prefectures.csv)
        ↓
    detect_typos.py → typos_raw.json (potential station name typos)
        ↓
    review_typos.py → typos_reviewed.json (filtered & approved corrections)
        ↓
    fix_typos.py → stations_fixed.csv (corrected TrainLCD station dataset)
Data Source: All input CSV files must follow the exact schema defined in the TrainLCD StationAPI repository.
This tool is exclusively designed for TrainLCD project transparency and only supports the specific CSV data format used in the TrainLCD StationAPI repository:
- Supported data source: https://github.com/TrainLCD/StationAPI/tree/dev/data
- Schema dependency: Hardcoded to work with TrainLCD's specific column names, data types, and relationships
- No generic CSV support: Other railway datasets or custom CSV formats will not work
- Purpose: Ensuring data quality and transparency in the TrainLCD mobile application
If you need to process other railway datasets, this tool will require significant modifications to accommodate different schemas and data structures.
The AI detection system is specifically designed for Japanese railway station romanizations:
- Input: Station name (romanized), railway company, prefecture, and line information
- Task: Identify potential spelling errors in station name romanizations using railway context
- Output: JSON array of suggested corrections, empty if no issues found
- Context: Leverages Japanese romanization conventions and railway naming patterns
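Since the raw JSON is consumed downstream by `review_typos.py`, each record must at least identify the station and carry the before/after spelling. A hypothetical guard that skips malformed or no-op entries (the field names are assumptions, matching the `--output` CSV description above):

```python
REQUIRED = ("station_cd", "original", "suggestion")

def valid_suggestions(records: list[dict]) -> list[dict]:
    """Keep only records that carry every required field and actually
    propose a change; an empty result means no issues were found."""
    return [r for r in records
            if all(r.get(k) for k in REQUIRED)
            and r["original"] != r["suggestion"]]
```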
The tool understands Japanese romanization conventions and automatically excludes common variations:
- Macron variations (ō vs ou, ū vs uu) are treated as legitimate alternatives, not typos
- Company-specific conventions are respected (e.g., JR vs private railway standards)
- Historical romanizations are preserved when contextually appropriate
- Regional variations in romanization styles are accommodated
- Context-aware analysis using railway company, prefecture, and line data
- Romanization expertise built for Japanese station naming conventions
- Batch processing optimized for large railway databases
- Smart caching to reduce API costs on incremental updates
- Romanization filtering that understands legitimate macron variations
- Company-specific rules for different railway operator standards
- Context enrichment with full railway metadata for informed decisions
- Export flexibility supporting both automated processing and manual review
- Railway data integrity preservation during correction process
- Backup functionality for safe dataset updates
- Progress tracking with detailed correction statistics
- Flexible deployment supporting both new file creation and in-place updates
- TrainLCD-specific: This tool is designed exclusively for the TrainLCD project and its data transparency requirements
- Schema dependency: Only works with CSV files from https://github.com/TrainLCD/StationAPI/tree/dev/data
- No generic support: Other railway datasets or custom CSV schemas are not supported
- Modification required: Using this tool with different data sources requires significant code changes
This project is licensed under the MIT License - see the LICENSE file for details.