PHP-IR

PHP-IR is a modern, research-oriented Information Retrieval (IR) and Vector Space Modeling library for PHP, focused on correctness, transparency, and theoretical grounding.

It provides low-level, composable primitives for text representation, weighting, similarity, clustering, and evaluation, designed for engineers who need full control and explainability, not opaque ML abstractions.

Why PHP-IR exists

The PHP ecosystem has historically lacked serious IR tooling beyond thin wrappers around search engines. PHP-IR fills that gap by offering:

Explicit vector space modeling
Reproducible term weighting pipelines
Deterministic clustering algorithms
Quantitative cluster quality metrics
APIs aligned with Information Retrieval literature

The goal is not convenience-first APIs, but scientifically correct and inspectable IR workflows.

Core capabilities

Text processing

Tokenization (regex, whitespace)
Text normalization (lowercasing, accent folding, composition)
Stop-word filtering with language support (English, Spanish)
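
To make the three steps above concrete, here is a minimal sketch in plain PHP. It does not use the PHP-IR API: the function name `tokenizeText`, the tiny accent table, and the stop-word list passed by the caller are all assumptions chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP-IR): normalizes, tokenizes, and filters a text.
 *
 * @param string   $text      raw UTF-8 input
 * @param string[] $stopWords already-normalized stop words to drop
 * @return string[] tokens in document order
 */
function tokenizeText(string $text, array $stopWords = []): array
{
    // Lowercasing with multibyte awareness (requires ext-mbstring).
    $text = mb_strtolower($text, 'UTF-8');

    // Accent folding with a deliberately small table; real tooling covers far more characters.
    $text = strtr($text, [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ü' => 'u', 'ñ' => 'n',
    ]);

    // Regex tokenization: each run of Unicode letters or digits is one token.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);

    // Stop-word filtering; array_values reindexes the surviving tokens from 0.
    return array_values(array_filter(
        $matches[0],
        static fn (string $token): bool => !in_array($token, $stopWords, true)
    ));
}
```

Normalizing before filtering means the stop-word list only needs lowercase, unaccented entries.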

Vocabulary & statistics

Vocabulary construction
Document frequency tracking
IDF computation (per-term and vectorized)
Corpus-level statistics exposed through dedicated façades, kept out of the core classes
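
As an illustration of document frequency and IDF, the sketch below uses the textbook definition idf(t) = log(N / df(t)). The function names are hypothetical, and the exact IDF variant PHP-IR implements may differ (smoothed forms are also common).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: number of documents in which each term occurs.
 *
 * @param array<int, string[]> $documents tokenized documents
 * @return array<string, int>  term => document frequency
 */
function documentFrequencies(array $documents): array
{
    $df = [];
    foreach ($documents as $tokens) {
        // array_unique makes a term count once per document, however often it repeats.
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    return $df;
}

/**
 * Hypothetical helper: unsmoothed IDF, log(N / df), for every term (an assumed variant).
 *
 * @param array<string, int> $df            term => document frequency (all values >= 1)
 * @param int                $documentCount N, the number of documents in the corpus
 * @return array<string, float>
 */
function inverseDocumentFrequencies(array $df, int $documentCount): array
{
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($documentCount / $count);
    }
    return $idf;
}
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to later TF-IDF weights.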

Vectorization

Sparse and dense vector representations
Term Frequency (TF)
TF-IDF weighting
Spherical (L2-normalized) vector spaces
Explicit densification for algorithms that require fixed dimensions
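
The relationship between TF, TF-IDF, L2 normalization, and densification can be sketched as follows. This is plain PHP under stated assumptions, not the library's API: `tfidfUnitVector` and `densify` are hypothetical names, raw counts are used for TF, and the IDF table is supplied by the caller.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: sparse TF-IDF vector (term => weight) scaled to unit L2 norm.
 *
 * @param string[]             $tokens tokens of one document
 * @param array<string, float> $idf    precomputed IDF per term; unknown terms weigh 0
 * @return array<string, float>
 */
function tfidfUnitVector(array $tokens, array $idf): array
{
    $vector = [];
    // array_count_values yields the raw term frequency of each token.
    foreach (array_count_values($tokens) as $term => $tf) {
        $weight = $tf * ($idf[$term] ?? 0.0);
        if ($weight != 0.0) {
            $vector[$term] = $weight; // store non-zero entries only (sparse form)
        }
    }

    // L2 norm of the weighted vector.
    $norm = sqrt(array_sum(array_map(static fn ($w) => $w * $w, $vector)));
    if ($norm == 0.0) {
        return $vector; // empty or all-zero document: nothing to scale
    }

    // array_map with a single array preserves the term keys.
    return array_map(static fn ($w) => $w / $norm, $vector);
}

/**
 * Hypothetical helper: expands a sparse vector into a fixed-length list
 * whose positions follow the order of $vocabulary.
 *
 * @param array<string, float> $sparse
 * @param string[]             $vocabulary
 * @return float[]
 */
function densify(array $sparse, array $vocabulary): array
{
    return array_map(static fn ($term) => $sparse[$term] ?? 0.0, $vocabulary);
}
```

Keeping densification as a separate, explicit step means the sparse form stays the default and memory is only spent on fixed dimensions when an algorithm needs them.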

Similarity

Cosine similarity
Pluggable similarity interfaces
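
One way to picture a pluggable similarity is an interface with cosine similarity as one implementation. The names `SimilarityMeasure` and `CosineMeasure` below are hypothetical and are not taken from PHP-IR; the sketch assumes sparse vectors stored as term => weight arrays.

```php
<?php
declare(strict_types=1);

/** Hypothetical contract: any similarity takes two sparse vectors and returns a score. */
interface SimilarityMeasure
{
    /**
     * @param array<string, float> $a
     * @param array<string, float> $b
     */
    public function between(array $a, array $b): float;
}

/** Cosine similarity: dot(a, b) / (|a| * |b|), defined here as 0.0 when either vector is zero. */
final class CosineMeasure implements SimilarityMeasure
{
    public function between(array $a, array $b): float
    {
        // Only terms present in both vectors contribute to the dot product.
        $dot = 0.0;
        foreach ($a as $term => $weight) {
            if (isset($b[$term])) {
                $dot += $weight * $b[$term];
            }
        }

        $normA = sqrt(array_sum(array_map(static fn ($w) => $w * $w, $a)));
        $normB = sqrt(array_sum(array_map(static fn ($w) => $w * $w, $b)));
        if ($normA == 0.0 || $normB == 0.0) {
            return 0.0;
        }

        return $dot / ($normA * $normB);
    }
}
```

Code that depends only on the interface can swap in another measure without changing the rest of the pipeline.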

Clustering

Spherical K-Means
Spherical K-Medians (robust to outliers)
Deterministic centroid update strategies
Explicit iteration control
Centroid initialization and update policies
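
For readers who want to see the shape of the algorithm, here is a compact spherical k-means sketch over dense vectors. It is an illustration under assumptions, not PHP-IR's implementation: all names are hypothetical, the first k vectors are taken as initial centroids to keep the run deterministic, and a fixed iteration count is the only stopping rule. The median-based variant is not shown.

```php
<?php
declare(strict_types=1);

/** Scales a dense vector to unit L2 norm; a zero vector is returned unchanged. */
function unitVector(array $v): array
{
    $norm = sqrt(array_sum(array_map(static fn ($x) => $x * $x, $v)));
    return $norm > 0.0 ? array_map(static fn ($x) => $x / $norm, $v) : $v;
}

/** Dot product of two equal-length dense vectors. */
function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

/**
 * Hypothetical spherical k-means sketch.
 *
 * @param array<int, float[]> $vectors       dense vectors of equal length, at least $k of them
 * @param int                 $k             number of clusters
 * @param int                 $maxIterations fixed number of assignment/update rounds
 * @return array{0: int[], 1: array<int, float[]>} [assignments, centroids]
 */
function sphericalKMeans(array $vectors, int $k, int $maxIterations = 10): array
{
    $vectors = array_map('unitVector', $vectors);
    // Deterministic initialization (an assumption): the first k vectors seed the centroids.
    $centroids = array_slice($vectors, 0, $k);
    $assignments = [];

    for ($iteration = 0; $iteration < $maxIterations; $iteration++) {
        // Assignment step: on unit vectors the dot product equals cosine similarity.
        foreach ($vectors as $i => $vector) {
            $best = 0;
            $bestScore = -INF;
            foreach ($centroids as $c => $centroid) {
                $score = dotProduct($vector, $centroid);
                if ($score > $bestScore) { // strict ">" keeps the lowest index on ties
                    $bestScore = $score;
                    $best = $c;
                }
            }
            $assignments[$i] = $best;
        }

        // Update step: sum the members of each cluster and project back onto the unit sphere.
        foreach ($centroids as $c => $centroid) {
            $sum = array_fill(0, count($centroid), 0.0);
            $members = 0;
            foreach ($vectors as $i => $vector) {
                if ($assignments[$i] !== $c) {
                    continue;
                }
                foreach ($vector as $d => $x) {
                    $sum[$d] += $x;
                }
                $members++;
            }
            if ($members > 0) {
                $centroids[$c] = unitVector($sum); // empty clusters keep their old centroid
            }
        }
    }

    return [$assignments, $centroids];
}
```

Because initialization and tie-breaking are fixed, the same input always produces the same clusters, which is what makes runs comparable.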

Cluster evaluation

Intra-cluster cohesion
Inter-cluster separation
Global quality score aligned with IR theory
Metrics designed for algorithm comparison, not just reporting
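
The README does not define these metrics, so the sketch below picks one plausible formulation purely as an assumption: cohesion as the mean cosine similarity between each unit vector and its assigned centroid, and separation as the mean of (1 - cosine) over all centroid pairs. The function names are hypothetical, and no global score is attempted.

```php
<?php
declare(strict_types=1);

/** Dot product of two equal-length dense vectors (cosine similarity when both are unit length). */
function unitDot(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

/**
 * Assumed cohesion: average similarity of each vector to its own centroid.
 *
 * @param array<int, float[]> $vectors     unit-length dense vectors
 * @param int[]               $assignments cluster index per vector
 * @param array<int, float[]> $centroids   unit-length centroids
 */
function intraClusterCohesion(array $vectors, array $assignments, array $centroids): float
{
    if ($vectors === []) {
        return 0.0;
    }
    $total = 0.0;
    foreach ($vectors as $i => $vector) {
        $total += unitDot($vector, $centroids[$assignments[$i]]);
    }
    return $total / count($vectors);
}

/**
 * Assumed separation: average cosine distance (1 - similarity) over distinct centroid pairs.
 *
 * @param array<int, float[]> $centroids unit-length centroids indexed from 0
 */
function interClusterSeparation(array $centroids): float
{
    $total = 0.0;
    $pairs = 0;
    $k = count($centroids);
    for ($i = 0; $i < $k; $i++) {
        for ($j = $i + 1; $j < $k; $j++) {
            $total += 1.0 - unitDot($centroids[$i], $centroids[$j]);
            $pairs++;
        }
    }
    return $pairs > 0 ? $total / $pairs : 0.0;
}
```

Under this formulation, higher cohesion and higher separation both indicate a tighter, better-separated clustering, which gives two numbers that can be compared across algorithms on the same corpus.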

Design philosophy

PHP-IR is intentionally not:

A search engine
A machine learning framework
A black-box clustering toolkit

Instead, it provides clear, inspectable building blocks that let you:

Reason about every step of the IR pipeline
Swap strategies without side effects
Validate theoretical assumptions with executable code
Compare algorithms using quantitative invariants

If you are familiar with TF-IDF, cosine similarity, and clustering theory, PHP-IR should feel predictable and rigorous.

Theoretical foundation

The library is grounded in classical and modern IR research, including:

Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
Spherical k-means clustering - I. S. Dhillon and D. S. Modha
Spherical K-Medians - Rafael E. Espinosa Santiesteban

Current status

Actively developed
API stabilized through real-world usage
Strong test coverage with invariant-based tests
English and Spanish corpora used for validation
Designed to evolve without breaking theoretical guarantees

Detailed documentation, examples, and usage guides will be added incrementally.

Roadmap (high level)

Advanced convergence criteria beyond fixed iteration limits
Additional robustness heuristics for clustering
Optional serialization of evaluation artifacts
Extended language tooling and corpora support

License

MIT License.
Use it, extend it, and build on it responsibly.
