CSV Auto Fixer

A specialized toolkit for detecting and fixing typos in Japanese railway station name romanizations, developed for the TrainLCD project to ensure data transparency and quality. This tool analyzes CSV datasets from the TrainLCD StationAPI repository and uses AI to identify potential spelling errors in English station names, providing a human-reviewed correction pipeline.

⚠️ Important Notice: This tool is specifically designed for and only supports the CSV data format used in the TrainLCD StationAPI repository. Compatibility with other CSV formats or data structures is neither guaranteed nor supported.

Features

  • Railway-specific typo detection using OpenAI GPT models prompted with Japanese station name patterns
  • Context-aware analysis leveraging railway company, prefecture, and line information
  • Romanization-aware filtering that ignores legitimate macron variations (ō vs ou)
  • Human-supervised workflow ensuring no corrections are applied without review
  • Batch processing with intelligent caching to minimize API costs
  • TrainLCD integration specifically designed for railway app data management

Requirements

  • Python version: 3.13.6 (pinned via .python-version)
  • pyenv installed
  • pipenv installed
  • OpenAI API key
  • TrainLCD StationAPI CSV datasets (from https://github.com/TrainLCD/StationAPI/tree/dev/data):
    • stations.csv — Main station data with romanized names
    • lines.csv — Railway line information
    • companies.csv — Railway company data
    • prefectures.csv — Prefecture reference data

Note: This tool is exclusively designed for the CSV schema used in the TrainLCD StationAPI repository. Other CSV formats will not work and may cause errors.
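
If pandas is available, an optional sanity check like the following can confirm that the downloaded files at least contain the columns this toolkit references (station_cd and station_name_r in stations.csv); the authoritative schema is the one in the StationAPI repository:

import pandas as pd

# Illustrative pre-flight check, not part of the toolkit itself.
stations = pd.read_csv("data/stations.csv")
for col in ("station_cd", "station_name_r"):  # columns the scripts below rely on
    if col not in stations.columns:
        raise SystemExit(f"stations.csv is missing required column: {col}")

# The supplemental files only need to load cleanly; they provide context.
for path in ("data/lines.csv", "data/companies.csv", "data/prefectures.csv"):
    pd.read_csv(path)

print("CSV files look usable")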

Installation

  1. Install the Python version defined in .python-version:

    pyenv install 3.13.6
  2. Install dependencies via pipenv:

    pipenv install

Environment Variables

Set your OpenAI API key before running any scripts:

export OPENAI_API_KEY="sk-…"
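
The scripts are expected to pick this up through the official OpenAI Python SDK, which reads OPENAI_API_KEY from the environment automatically. A minimal, illustrative check that the key is visible before launching a long run:

import os

# Fail fast if the key was not exported in the current shell.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the scripts")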

Usage

1. Detect typos

Analyzes Japanese railway station name romanizations in your CSV data using OpenAI's GPT models. The tool considers context from railway companies, prefectures, and line information to identify potential spelling errors in English station names.

Basic detection (most common):

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --json-output data/typos_raw.json \
  --batch-size 80 \
  --throttle-ms 200 \
  --retry-attempts 8 \
  --retry-base-wait 0.8 \
  --retry-max-wait 20 \
  --cache .typos_cache.json \
  --log detect.log

With additional options:

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --json-output data/typos_raw.json \
  --model gpt-4o \
  --output data/summary.csv \
  --verbose \
  --dry-run

Direct overwrite with backup:

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --overwrite \
  --backup \
  --log detect.log

Key features:

  • Railway-context analysis: Uses company, prefecture, and line data for accurate detection
  • Romanization expertise: Understands Japanese station naming conventions
  • Batch processing: Efficiently processes large station databases
  • Smart caching: Avoids redundant API calls for previously analyzed stations
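
The internal cache format is not documented here, but a reasonable mental model (and a common pattern) is a JSON file keyed by the analyzed value, so names that were already checked are skipped on the next run. A sketch under that assumption:

import json, os

CACHE_PATH = ".typos_cache.json"  # matches the --cache argument used above

def load_cache(path=CACHE_PATH):
    # Assumed shape: {original_name: suggestion_or_null}; returns {} on first run.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)

# Only names missing from the cache would be batched into new API requests.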

Key options:

  • --input — Path to the main CSV file
  • --lines, --companies, --prefectures — Supplemental CSVs for context
  • --column — Column name to check for typos
  • --json-output — File to save raw detection results
  • --cache — Cache file to avoid repeated API calls
  • --log — Log file (timestamps included)
  • --model — LLM model name (default: gpt-4o)
  • --output — If set without --overwrite, write a CSV of (station_cd, original, suggestion)
  • --overwrite — Overwrite the --column in the input CSV with suggestions
  • --backup — When --overwrite, create .bak backup of --input
  • --dry-run — Do not modify CSV even if --overwrite is set
  • --verbose — Verbose logs to stdout
  • --batch-size — Batch size for LLM requests (default: 80)
  • --throttle-ms — Sleep milliseconds between LLM requests (default: 200)
  • --retry-attempts — Max retry attempts on rate limit (default: 8)
  • --retry-base-wait — Base wait seconds for backoff (default: 0.8)
  • --retry-max-wait — Max wait seconds for backoff (default: 20.0)
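
The three retry options suggest a standard exponential-backoff loop: after each rate-limit error, wait roughly retry-base-wait * 2^attempt seconds, capped at retry-max-wait, for at most retry-attempts tries. A sketch of that pattern (the function below is illustrative, not the script's actual internals), with defaults matching the values listed above:

import time

def call_with_backoff(request, attempts=8, base_wait=0.8, max_wait=20.0):
    # Retry `request` (a zero-argument callable) with exponential backoff.
    for attempt in range(attempts):
        try:
            return request()
        except Exception:  # in practice, the API's rate-limit error
            if attempt == attempts - 1:
                raise
            time.sleep(min(base_wait * (2 ** attempt), max_wait))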

2. Review detected typos

Applies intelligent filtering to remove false positives common in Japanese romanization, such as legitimate macron variations. The tool enriches detection results with full railway context to help reviewers make informed decisions.

Full context review:

pipenv run python review_typos.py \
  --input data/typos_raw.json \
  --output data/typos_reviewed.json \
  --export-csv data/typos_reviewed.csv \
  --stations data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --drop-macron-only \
  --drop-allcaps-only \
  --drop-case-only \
  --log review.log

Minimal review (without context files):

pipenv run python review_typos.py \
  --input data/typos_raw.json \
  --output data/typos_reviewed.json \
  --no-drop-macron-only \
  --log review.log

Main features:

  • Romanization-aware filtering: Automatically excludes legitimate macron differences (ō vs ou)
  • Railway context enrichment: Adds company, prefecture, and line information to each suggestion
  • Company-specific rules: Applies different standards based on railway operator conventions
  • Human review support: Exports both machine-readable JSON and human-friendly CSV formats

Options:

  • --input — Path to raw suggestions JSON (from detect_typos.py)
  • --output — Path to save reviewed JSON
  • --export-csv — Optional path to export a human-friendly CSV list with context
  • --stations, --lines, --companies, --prefectures — Context files used to attach railway information to each suggestion
  • --column — Column name that will be fixed (used for context preview)
  • --drop-macron-only — Drop suggestions that differ only by macrons (default: true)
  • --no-drop-macron-only — Disable macron-only filtering
  • --drop-allcaps-only — Drop suggestions that differ only by ALL CAPS (default: true)
  • --no-drop-allcaps-only — Disable ALL CAPS-only filtering
  • --drop-case-only — Drop suggestions that differ only by letter case (default: true)
  • --no-drop-case-only — Disable case-only filtering
  • --log — Log file path
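
As a rough illustration of what the macron-only and case-only filters check (the actual rules in review_typos.py may differ, e.g. by also mapping ō to ou):

import unicodedata

def strip_macrons(text):
    # "Ōsaka" -> "Osaka": decompose, then drop combining macrons (U+0304).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch != "\u0304")

def differs_only_by_macrons(original, suggestion):
    return original != suggestion and strip_macrons(original) == strip_macrons(suggestion)

def differs_only_by_case(original, suggestion):
    return original != suggestion and original.lower() == suggestion.lower()

# Suggestions matching either predicate would be dropped when the
# corresponding --drop-*-only flag is enabled.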

3. Apply fixes

Generates the final corrected railway station CSV by applying reviewed corrections to the original dataset. Ensures data integrity while updating station name romanizations.

# Save to a new file
pipenv run python fix_typos.py \
  --input data/stations.csv \
  --typos data/typos_reviewed.json \
  --column station_name_r \
  --output data/stations_fixed.csv \
  --log fix.log

# Overwrite in place
pipenv run python fix_typos.py \
  --input data/stations.csv \
  --typos data/typos_reviewed.json \
  --column station_name_r \
  --overwrite \
  --log fix.log

Key features:

  • Safe station data updates: Validates column structure and preserves all non-target fields
  • Railway data integrity: Maintains relationships between stations, lines, and companies
  • Flexible output options: Create corrected copy or update original dataset in place

Options:

  • --input — Path to the original CSV file
  • --typos — Path to reviewed typos JSON
  • --column — Name of the column to update
  • --output — Path to save fixed CSV (ignored if --overwrite)
  • --overwrite — Overwrite the input CSV in place
  • --log — Optional log file path
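
As a mental model of the apply step, the sketch below rewrites one column keyed by station_cd using Python's csv module. It assumes each reviewed entry carries station_cd and suggestion fields, an assumption based on the summary-CSV format described earlier rather than a documented JSON schema:

import csv, json

def apply_fixes(input_csv, typos_json, column, output_csv):
    with open(typos_json, encoding="utf-8") as f:
        # Assumed shape: [{"station_cd": ..., "suggestion": ...}, ...]
        fixes = {str(t["station_cd"]): t["suggestion"] for t in json.load(f)}

    with open(input_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        if row["station_cd"] in fixes:
            row[column] = fixes[row["station_cd"]]  # only the target column changes

    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)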

Workflow

  1. detect_typos.py → Analyzes railway station romanizations and outputs potential typo candidates
  2. review_typos.py → Filters false positives and prepares human-reviewed corrections
  3. fix_typos.py → Applies approved corrections to generate the final corrected station dataset

Data Flow

TrainLCD StationAPI CSV data (stations.csv + lines.csv + companies.csv + prefectures.csv)
                                        ↓
              detect_typos.py → typos_raw.json (potential station name typos)
                                        ↓
              review_typos.py → typos_reviewed.json (filtered & approved corrections)
                                        ↓
              fix_typos.py → stations_fixed.csv (corrected TrainLCD station dataset)

Data Source: All input CSV files must follow the exact schema defined in the TrainLCD StationAPI repository.

Data Compatibility

This tool is exclusively designed for TrainLCD project transparency and only supports the specific CSV data format used in the TrainLCD StationAPI repository:

  • Supported data source: https://github.com/TrainLCD/StationAPI/tree/dev/data
  • Schema dependency: Hardcoded to work with TrainLCD's specific column names, data types, and relationships
  • No generic CSV support: Other railway datasets or custom CSV formats will not work
  • Purpose: Ensuring data quality and transparency in the TrainLCD mobile application

If you need to process other railway datasets, this tool will require significant modifications to accommodate different schemas and data structures.

Prompt Specification

The AI detection system is specifically designed for Japanese railway station romanizations:

  • Input: Station name (romanized), railway company, prefecture, and line information
  • Task: Identify potential spelling errors in station name romanizations using railway context
  • Output: JSON array of suggested corrections, empty if no issues found
  • Context: Leverages Japanese romanization conventions and railway naming patterns
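
The exact prompt lives in detect_typos.py; purely as an illustration of the request/response shape described above, a call through the OpenAI Python SDK that asks for a JSON array and parses it could look like this (the wording and helper name are placeholders, not the script's actual prompt):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_station(name, company, prefecture, line, model="gpt-4o"):
    prompt = (
        "You review romanized Japanese railway station names for spelling errors.\n"
        f"Station: {name}\nCompany: {company}\nPrefecture: {prefecture}\nLine: {line}\n"
        "Reply with a JSON array of suggested corrections, or [] if the name looks correct."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; real code should handle parse errors.
    return json.loads(response.choices[0].message.content)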

Japanese Romanization Logic

The tool understands Japanese romanization conventions and automatically excludes common variations:

  • Macron variations (ō vs ou, ū vs uu) are treated as legitimate alternatives, not typos
  • Company-specific conventions are respected (e.g., JR vs private railway standards)
  • Historical romanizations are preserved when contextually appropriate
  • Regional variations in romanization styles are accommodated

Additional Features

Railway-Specific Detection

  • Context-aware analysis using railway company, prefecture, and line data
  • Romanization expertise built for Japanese station naming conventions
  • Batch processing optimized for large railway databases
  • Smart caching to reduce API costs on incremental updates

Intelligent Review System

  • Romanization filtering that understands legitimate macron variations
  • Company-specific rules for different railway operator standards
  • Context enrichment with full railway metadata for informed decisions
  • Export flexibility supporting both automated processing and manual review

Safe Data Management

  • Railway data integrity preservation during correction process
  • Backup functionality for safe dataset updates
  • Progress tracking with detailed correction statistics
  • Flexible deployment supporting both new file creation and in-place updates

Important Limitations

  • TrainLCD-specific: This tool is designed exclusively for the TrainLCD project and its data transparency requirements
  • Schema dependency: Only works with CSV files from https://github.com/TrainLCD/StationAPI/tree/dev/data
  • No generic support: Other railway datasets or custom CSV schemas are not supported
  • Modification required: Using this tool with different data sources requires significant code changes

License

This project is licensed under the MIT License - see the LICENSE file for details.
