CSV Auto Fixer

A specialized toolkit for detecting and fixing typos in Japanese railway station name romanizations, developed for the TrainLCD project to ensure data transparency and quality. This tool analyzes CSV datasets from the TrainLCD StationAPI repository and uses AI to identify potential spelling errors in English station names, providing a human-reviewed correction pipeline.

⚠️ Important Notice: This tool is specifically designed for and only supports the CSV data format used in the TrainLCD StationAPI repository. Compatibility with other CSV formats or data structures is neither guaranteed nor supported.

Features

  • Railway-specific typo detection using OpenAI GPT models prompted with Japanese station name patterns
  • Context-aware analysis leveraging railway company, prefecture, and line information
  • Romanization-aware filtering that ignores legitimate macron variations (ō vs ou)
  • Human-supervised workflow ensuring no corrections are applied without review
  • Batch processing with intelligent caching to minimize API costs
  • TrainLCD integration specifically designed for railway app data management

Requirements

  • Python version: 3.13.6 (pinned via .python-version)
  • pyenv installed
  • pipenv installed
  • OpenAI API key
  • TrainLCD StationAPI CSV datasets (from https://github.com/TrainLCD/StationAPI/tree/dev/data):
    • stations.csv — Main station data with romanized names
    • lines.csv — Railway line information
    • companies.csv — Railway company data
    • prefectures.csv — Prefecture reference data

Note: This tool is exclusively designed for the CSV schema used in the TrainLCD StationAPI repository. Other CSV formats will not work and may cause errors.
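
If pandas is available, an optional sanity check like the following can confirm that the downloaded files at least contain the columns this toolkit references (station_cd and station_name_r in stations.csv); the authoritative schema is the one in the StationAPI repository:

import pandas as pd

# Illustrative pre-flight check, not part of the toolkit itself.
stations = pd.read_csv("data/stations.csv")
for col in ("station_cd", "station_name_r"):  # columns the scripts below rely on
    if col not in stations.columns:
        raise SystemExit(f"stations.csv is missing required column: {col}")

# The supplemental files only need to load cleanly; they provide context.
for path in ("data/lines.csv", "data/companies.csv", "data/prefectures.csv"):
    pd.read_csv(path)

print("CSV files look usable")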

Installation

  1. Install the Python version defined in .python-version:

    pyenv install 3.13.6
  2. Install dependencies via pipenv:

    pipenv install

Environment Variables

Set your OpenAI API key before running any scripts:

export OPENAI_API_KEY="sk-…"
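
The scripts are expected to pick this up through the official OpenAI Python SDK, which reads OPENAI_API_KEY from the environment automatically. A minimal, illustrative check that the key is visible before launching a long run:

import os

# Fail fast if the key was not exported in the current shell.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the scripts")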

Usage

1. Detect typos

Analyzes Japanese railway station name romanizations in your CSV data using OpenAI's GPT models. The tool considers context from railway companies, prefectures, and line information to identify potential spelling errors in English station names.

Basic detection (most common):

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --json-output data/typos_raw.json \
  --batch-size 80 \
  --throttle-ms 200 \
  --retry-attempts 8 \
  --retry-base-wait 0.8 \
  --retry-max-wait 20 \
  --cache .typos_cache.json \
  --log detect.log

With additional options:

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --json-output data/typos_raw.json \
  --model gpt-4o \
  --output data/summary.csv \
  --verbose \
  --dry-run

Direct overwrite with backup:

pipenv run python detect_typos.py \
  --input data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --overwrite \
  --backup \
  --log detect.log

Key features:

  • Railway-context analysis: Uses company, prefecture, and line data for accurate detection
  • Romanization expertise: Understands Japanese station naming conventions
  • Batch processing: Efficiently processes large station databases
  • Smart caching: Avoids redundant API calls for previously analyzed stations
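
The internal cache format is not documented here, but a reasonable mental model (and a common pattern) is a JSON file keyed by the analyzed value, so names that were already checked are skipped on the next run. A sketch under that assumption:

import json, os

CACHE_PATH = ".typos_cache.json"  # matches the --cache argument used above

def load_cache(path=CACHE_PATH):
    # Assumed shape: {original_name: suggestion_or_null}; returns {} on first run.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)

# Only names missing from the cache would be batched into new API requests.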

Key options:

  • --input — Path to the main CSV file
  • --lines, --companies, --prefectures — Supplemental CSVs for context
  • --column — Column name to check for typos
  • --json-output — File to save raw detection results
  • --cache — Cache file to avoid repeated API calls
  • --log — Log file (timestamps included)
  • --model — LLM model name (default: gpt-4o)
  • --output — If set without --overwrite, write a CSV of (station_cd, original, suggestion)
  • --overwrite — Overwrite the --column in the input CSV with suggestions
  • --backup — When --overwrite, create .bak backup of --input
  • --dry-run — Do not modify CSV even if --overwrite is set
  • --verbose — Verbose logs to stdout
  • --batch-size — Batch size for LLM requests (default: 80)
  • --throttle-ms — Sleep milliseconds between LLM requests (default: 200)
  • --retry-attempts — Max retry attempts on rate limit (default: 8)
  • --retry-base-wait — Base wait seconds for backoff (default: 0.8)
  • --retry-max-wait — Max wait seconds for backoff (default: 20.0)
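
The three retry options suggest a standard exponential-backoff loop: after each rate-limit error, wait roughly retry-base-wait * 2^attempt seconds, capped at retry-max-wait, for at most retry-attempts tries. A sketch of that pattern (the function below is illustrative, not the script's actual internals), with defaults matching the values listed above:

import time

def call_with_backoff(request, attempts=8, base_wait=0.8, max_wait=20.0):
    # Retry `request` (a zero-argument callable) with exponential backoff.
    for attempt in range(attempts):
        try:
            return request()
        except Exception:  # in practice, the API's rate-limit error
            if attempt == attempts - 1:
                raise
            time.sleep(min(base_wait * (2 ** attempt), max_wait))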

2. Review detected typos

Applies intelligent filtering to remove false positives common in Japanese romanization, such as legitimate macron variations. The tool enriches detection results with full railway context to help reviewers make informed decisions.

Full context review:

pipenv run python review_typos.py \
  --input data/typos_raw.json \
  --output data/typos_reviewed.json \
  --export-csv data/typos_reviewed.csv \
  --stations data/stations.csv \
  --lines data/lines.csv \
  --companies data/companies.csv \
  --prefectures data/prefectures.csv \
  --column station_name_r \
  --drop-macron-only \
  --drop-allcaps-only \
  --drop-case-only \
  --log review.log

Minimal review (without context files):

pipenv run python review_typos.py \
  --input data/typos_raw.json \
  --output data/typos_reviewed.json \
  --no-drop-macron-only \
  --log review.log

Main features:

  • Romanization-aware filtering: Automatically excludes legitimate macron differences (ō vs ou)
  • Railway context enrichment: Adds company, prefecture, and line information to each suggestion
  • Company-specific rules: Applies different standards based on railway operator conventions
  • Human review support: Exports both machine-readable JSON and human-friendly CSV formats

Options:

  • --input — Path to raw suggestions JSON (from detect_typos.py)
  • --output — Path to save reviewed JSON
  • --export-csv — Optional path to export a human-friendly CSV list with context
  • --stations, --lines, --companies, --prefectures — Context files used to attach railway information to each suggestion
  • --column — Column name that will be fixed (used for context preview)
  • --drop-macron-only — Drop suggestions that differ only by macrons (default: true)
  • --no-drop-macron-only — Disable macron-only filtering
  • --drop-allcaps-only — Drop suggestions that differ only by ALL CAPS (default: true)
  • --no-drop-allcaps-only — Disable ALL CAPS-only filtering
  • --drop-case-only — Drop suggestions that differ only by letter case (default: true)
  • --no-drop-case-only — Disable case-only filtering
  • --log — Log file path
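
As a rough illustration of what the macron-only and case-only filters check (the actual rules in review_typos.py may differ, e.g. by also mapping ō to ou):

import unicodedata

def strip_macrons(text):
    # "Ōsaka" -> "Osaka": decompose, then drop combining macrons (U+0304).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch != "\u0304")

def differs_only_by_macrons(original, suggestion):
    return original != suggestion and strip_macrons(original) == strip_macrons(suggestion)

def differs_only_by_case(original, suggestion):
    return original != suggestion and original.lower() == suggestion.lower()

# Suggestions matching either predicate would be dropped when the
# corresponding --drop-*-only flag is enabled.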

3. Apply fixes

Generates the final corrected railway station CSV by applying reviewed corrections to the original dataset. Ensures data integrity while updating station name romanizations.

# Save to a new file
pipenv run python fix_typos.py \
  --input data/stations.csv \
  --typos data/typos_reviewed.json \
  --column station_name_r \
  --output data/stations_fixed.csv \
  --log fix.log

# Overwrite in place
pipenv run python fix_typos.py \
  --input data/stations.csv \
  --typos data/typos_reviewed.json \
  --column station_name_r \
  --overwrite \
  --log fix.log

Key features:

  • Safe station data updates: Validates column structure and preserves all non-target fields
  • Railway data integrity: Maintains relationships between stations, lines, and companies
  • Flexible output options: Create corrected copy or update original dataset in place

Options:

  • --input — Path to the original CSV file
  • --typos — Path to reviewed typos JSON
  • --column — Name of the column to update
  • --output — Path to save fixed CSV (ignored if --overwrite)
  • --overwrite — Overwrite the input CSV in place
  • --log — Optional log file path
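
As a mental model of the apply step, the sketch below rewrites one column keyed by station_cd using Python's csv module. It assumes each reviewed entry carries station_cd and suggestion fields, an assumption based on the summary-CSV format described earlier rather than a documented JSON schema:

import csv, json

def apply_fixes(input_csv, typos_json, column, output_csv):
    with open(typos_json, encoding="utf-8") as f:
        # Assumed shape: [{"station_cd": ..., "suggestion": ...}, ...]
        fixes = {str(t["station_cd"]): t["suggestion"] for t in json.load(f)}

    with open(input_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        if row["station_cd"] in fixes:
            row[column] = fixes[row["station_cd"]]  # only the target column changes

    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)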

Workflow

  1. detect_typos.py → Analyzes railway station romanizations and outputs potential typo candidates
  2. review_typos.py → Filters false positives and prepares human-reviewed corrections
  3. fix_typos.py → Applies approved corrections to generate the final corrected station dataset

Data Flow

TrainLCD StationAPI CSV data (stations.csv + lines.csv + companies.csv + prefectures.csv)
                                        ↓
              detect_typos.py → typos_raw.json (potential station name typos)
                                        ↓
              review_typos.py → typos_reviewed.json (filtered & approved corrections)
                                        ↓
              fix_typos.py → stations_fixed.csv (corrected TrainLCD station dataset)

Data Source: All input CSV files must follow the exact schema defined in the TrainLCD StationAPI repository.

Data Compatibility

This tool is exclusively designed for TrainLCD project transparency and only supports the specific CSV data format used in the TrainLCD StationAPI repository:

  • Supported data source: https://github.com/TrainLCD/StationAPI/tree/dev/data
  • Schema dependency: Hardcoded to work with TrainLCD's specific column names, data types, and relationships
  • No generic CSV support: Other railway datasets or custom CSV formats will not work
  • Purpose: Ensuring data quality and transparency in the TrainLCD mobile application

If you need to process other railway datasets, this tool will require significant modifications to accommodate different schemas and data structures.

Prompt Specification

The AI detection system is specifically designed for Japanese railway station romanizations:

  • Input: Station name (romanized), railway company, prefecture, and line information
  • Task: Identify potential spelling errors in station name romanizations using railway context
  • Output: JSON array of suggested corrections, empty if no issues found
  • Context: Leverages Japanese romanization conventions and railway naming patterns
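
The exact prompt lives in detect_typos.py; purely as an illustration of the request/response shape described above, a call through the OpenAI Python SDK that asks for a JSON array and parses it could look like this (the wording and helper name are placeholders, not the script's actual prompt):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_station(name, company, prefecture, line, model="gpt-4o"):
    prompt = (
        "You review romanized Japanese railway station names for spelling errors.\n"
        f"Station: {name}\nCompany: {company}\nPrefecture: {prefecture}\nLine: {line}\n"
        "Reply with a JSON array of suggested corrections, or [] if the name looks correct."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; real code should handle parse errors.
    return json.loads(response.choices[0].message.content)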

Japanese Romanization Logic

The tool understands Japanese romanization conventions and automatically excludes common variations:

  • Macron variations (ō vs ou, ū vs uu) are treated as legitimate alternatives, not typos
  • Company-specific conventions are respected (e.g., JR vs private railway standards)
  • Historical romanizations are preserved when contextually appropriate
  • Regional variations in romanization styles are accommodated

Additional Features

Railway-Specific Detection

  • Context-aware analysis using railway company, prefecture, and line data
  • Romanization expertise built for Japanese station naming conventions
  • Batch processing optimized for large railway databases
  • Smart caching to reduce API costs on incremental updates

Intelligent Review System

  • Romanization filtering that understands legitimate macron variations
  • Company-specific rules for different railway operator standards
  • Context enrichment with full railway metadata for informed decisions
  • Export flexibility supporting both automated processing and manual review

Safe Data Management

  • Railway data integrity preservation during correction process
  • Backup functionality for safe dataset updates
  • Progress tracking with detailed correction statistics
  • Flexible deployment supporting both new file creation and in-place updates

Important Limitations

  • TrainLCD-specific: This tool is designed exclusively for the TrainLCD project and its data transparency requirements
  • Schema dependency: Only works with CSV files from https://github.com/TrainLCD/StationAPI/tree/dev/data
  • No generic support: Other railway datasets or custom CSV schemas are not supported
  • Modification required: Using this tool with different data sources requires significant code changes

License

This project is licensed under the MIT License - see the LICENSE file for details.
