Word Frequency Analysis - Language in Genetics

A research project analyzing racial/ethnic terminology usage in genetics journals. The system processes CrossRef article metadata and uses OpenAI's API to identify terminology patterns in academic publications.

Architecture

  • PostgreSQL Database: Stores CrossRef article metadata and analysis results
  • Go Tools (cmd/pgjsontool): Bulk import tool for CrossRef dump files
  • Python Pipeline (extractor/): OpenAI batch processing for content analysis
  • Automated Dashboard: Generates static HTML dashboards showing analysis results

Quick Start

Prerequisites

  • PostgreSQL with the CrossRef database imported
  • Python 3.x with virtual environment
  • OpenAI API key
  • Go 1.21+ (only if importing new CrossRef data)

Setup

# Set database environment variable
export PGDATABASE=crossref

# Set up Python environment
cd extractor/
python -m venv .venv
source .venv/bin/activate
pip install -r ../requirements.txt

Computational Analysis Methodology

The `extractor/bulkquery.py` script automates the analysis of genetics articles for specific racial terminology. It reads metadata.json files containing article titles and abstracts, then submits them to OpenAI's API for analysis.

The core of the analysis uses a carefully constructed prompt:

"Does this article use any terms like "Caucasian" or "white" or "European ancestry" in a way that refers to race, ancestry, ethnicity or population?\n\n" "TITLE: {title}\n" "ABSTRACT: {abstract}\n"


The prompt is deliberately framed in neutral terms to avoid biasing the language model's analysis: it asks whether terms related to European ancestry appear, without suggesting a preference for any particular terminology.
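
A minimal sketch of how this prompt might be assembled from a metadata.json file; the "title" and "abstract" key names are assumptions about the file layout, not confirmed from the repository:

import json

PROMPT = (
    'Does this article use any terms like "Caucasian" or "white" or '
    '"European ancestry" in a way that refers to race, ancestry, '
    "ethnicity or population?\n\n"
    "TITLE: {title}\n"
    "ABSTRACT: {abstract}\n"
)

def build_prompt(metadata_path: str) -> str:
    # Load one article's metadata and substitute it into the prompt.
    # The "title"/"abstract" keys are assumed, not taken from the repo.
    with open(metadata_path) as f:
        meta = json.load(f)
    return PROMPT.format(title=meta["title"], abstract=meta["abstract"])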

The analysis is structured through OpenAI's function-calling API, which constrains the model to return standardized responses for every article. The analysis function includes parameters for detecting:

  • "caucasian" terminology
  • "white" racial descriptors
  • "European ancestry" phrasing
  • other phrases describing European populations

When phrases are detected, the system also captures the exact terminology used, enabling detailed analysis of language variations across the literature.
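
As an illustration, the function definition handed to the API could look something like the sketch below; the function name and field names are illustrative, not the repository's actual definition:

# Hypothetical function-calling schema; names are illustrative and may
# differ from what bulkquery.py actually defines.
ANALYSIS_FUNCTION = {
    "name": "report_terminology",
    "description": "Report racial/ethnic terminology found in the article.",
    "parameters": {
        "type": "object",
        "properties": {
            "uses_caucasian": {"type": "boolean"},
            "uses_white": {"type": "boolean"},
            "uses_european_ancestry": {"type": "boolean"},
            "other_european_phrases": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Exact phrases describing European populations",
            },
        },
        "required": ["uses_caucasian", "uses_white", "uses_european_ancestry"],
    },
}

Requiring the model to call such a function (for example via the API's tool_choice setting) guarantees a machine-parseable answer for every article instead of free text.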

Batch processing makes large-scale analysis feasible within reasonable time and cost constraints: thousands of articles can be processed per run, with error handling and progress tracking throughout.
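
For reference, a bare-bones batch submission with the OpenAI Python SDK looks roughly like this; the file name and the surrounding logic are simplifications of what bulkquery.py does:

# Simplified sketch of OpenAI batch submission; bulkquery.py adds error
# handling, progress tracking, and database bookkeeping around this.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of prepared /v1/chat/completions requests.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Create the batch job; results are polled and fetched later
# (cf. batchcheck.py / batchfetch.py).
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"submitted batch {batch.id} ({batch.status})")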


Running Analysis

cd extractor/

# Process articles from enabled journals
./bulkquery.py --limit 1000

# Check batch status
./batchcheck.py

# Fetch completed results
./batchfetch.py

# Generate dashboard
./generate_dashboard.py --output-dir ../dashboard

Automation

The cronscript.sh script automates the entire workflow:

# Run manually
./cronscript.sh

# Schedule with cron (every 6 hours)
crontab -e
# Add: 0 */6 * * * /home/languageingenetics/Word-Frequency-Analysis-/cronscript.sh

Database Schema

public.raw_text_data: CrossRef article metadata (JSON in filesrc column)

languageingenetics schema:

  • journals: Manages which journals to process
  • files: Analysis results per article
  • batches: OpenAI batch job tracking
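
To illustrate how the metadata is stored, here is a minimal sketch of reading the filesrc JSON column with psycopg2; the JSON key names are assumptions about the CrossRef dump format, not confirmed from the schema:

# Minimal sketch of reading CrossRef metadata from public.raw_text_data;
# the JSON keys "title" and "abstract" are assumed, not verified.
import os
import psycopg2

conn = psycopg2.connect(dbname=os.environ.get("PGDATABASE", "crossref"))
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT filesrc->>'title', filesrc->>'abstract' "
        "FROM public.raw_text_data LIMIT 5"
    )
    for title, abstract in cur.fetchall():
        print(title)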

Managing Journals

-- List journals and their status
SELECT name, enabled FROM languageingenetics.journals ORDER BY name;

-- Disable a journal
UPDATE languageingenetics.journals SET enabled = false WHERE name = 'Heredity';

-- Add a new journal
INSERT INTO languageingenetics.journals (name) VALUES ('New Journal Name');

Development

# Build Go tools (only needed for importing new data)
make all

# Run tests
make test

# Run linter
make lint

Documentation

See CLAUDE.md for detailed project documentation, including:

  • Complete architecture details
  • Database setup and permissions
  • Initial import procedures
  • Python workflow details
  • Configuration options

About

How has the use of race-based language changed over time?
