GitHub - itchio/facetembeddings: Tool that generates facet embeddings to assist in generating recommendations

This project contains a Go command-line tool for generating facet vector embeddings from the set of indexed content on itch.io. It uses a Singular Value Decomposition (SVD) based approach on a tag co-occurrence matrix.

On itch.io, every indexed piece of content is given a series of facet tags that represent how the creator has classified the content. As an example, the browse category URL games/input-mouse/made-with-unity/tag-retro maps to the array of tags: [c.1, in.2, tl.3, tg.retro].

By looking at the co-occurrence of tags on project pages in aggregate, a vector space can be computed that represents the semantic relationship between tags. (e.g., the tags christmas and santa might be used together frequently on single project pages, so the computed vector space will place those two tags near each other.)

Any group of content can be summarized as a set of tags, and that set can be reduced to a single vector, which can then be searched with a nearest-neighbor algorithm to find related content. These vectors are known as embeddings.
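As a rough illustration of the idea above, the following Go sketch sums the embeddings of a set of tags into one unit vector and picks the most similar candidate by cosine similarity. The function names (sumAndNormalize, dot, nearest) and the toy 2-dimensional vectors are hypothetical and made up for this example; they are not taken from the tool, and a brute-force scan stands in for whatever nearest-neighbor index is used downstream.

```go
package main

import (
	"fmt"
	"math"
)

// sumAndNormalize adds the embeddings of the given tags and scales the
// result to unit length. Tags without an embedding are skipped.
func sumAndNormalize(embeddings map[string][]float64, tags []string, dim int) []float64 {
	out := make([]float64, dim)
	for _, t := range tags {
		v, ok := embeddings[t]
		if !ok {
			continue
		}
		for i := range out {
			out[i] += v[i]
		}
	}
	var sumSq float64
	for _, x := range out {
		sumSq += x * x
	}
	norm := math.Sqrt(sumSq)
	if norm == 0 {
		return out // nothing to normalize
	}
	for i := range out {
		out[i] /= norm
	}
	return out
}

// dot returns the dot product, which equals cosine similarity when both
// inputs are unit vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// nearest scans all candidates and returns the key with the highest score.
func nearest(query []float64, candidates map[string][]float64) string {
	best, bestScore := "", math.Inf(-1)
	for name, v := range candidates {
		if s := dot(query, v); s > bestScore {
			best, bestScore = name, s
		}
	}
	return best
}

func main() {
	// Toy embeddings, invented for illustration only.
	embeddings := map[string][]float64{
		"tg.christmas": {1, 0},
		"tg.santa":     {0.8, 0.6},
		"tg.horror":    {0, 1},
	}
	query := sumAndNormalize(embeddings, []string{"tg.christmas", "tg.santa"}, 2)
	candidates := map[string][]float64{
		"holiday": {1, 0},
		"scary":   {0, 1},
	}
	fmt.Println(nearest(query, candidates)) // prints "holiday"
}
```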

The Algorithm

Build a Co-occurrence Matrix: The algorithm first constructs a large, symmetric matrix where rows and columns represent the unique tags from the vocabulary. A cell at (row_i, col_j) stores the number of times tag_i and tag_j appeared together on the same game. This matrix captures the raw co-occurrence relationship between all pairs of tags.
Apply SVD for Dimensionality Reduction: The co-occurrence matrix is often very large and noisy. To distill the most significant patterns, Singular Value Decomposition (SVD) is used. SVD factorizes the matrix into three separate matrices, capturing its underlying structure. This step effectively reduces the dimensionality of the data, filtering out noise and retaining the strongest signals.
Extract Embeddings: The final embedding for each tag is a dense vector derived from the SVD output. By taking the top N dimensions (e.g., 32 or 64), we get a low-dimensional representation that captures the essence of the tag's relationship with all other tags. Each tag is now represented by a point in an N-dimensional space.
Normalize and Store: The vectors are normalized to unit length and saved to a database table for downstream use.
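The first and last steps above can be sketched in plain Go. This is a minimal illustration, not the tool's implementation: the helper names (buildVocab, cooccurrence, normalize) are hypothetical, the matrix is dense for readability, the diagonal is left at zero by assumption, and each game's tag list is assumed to contain no duplicates. The SVD step is omitted here and would normally be delegated to a linear algebra library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// buildVocab assigns a matrix index to every unique tag, in sorted order
// so that the result is deterministic.
func buildVocab(games [][]string) map[string]int {
	seen := map[string]bool{}
	for _, tags := range games {
		for _, t := range tags {
			seen[t] = true
		}
	}
	names := make([]string, 0, len(seen))
	for t := range seen {
		names = append(names, t)
	}
	sort.Strings(names)
	vocab := make(map[string]int, len(names))
	for i, t := range names {
		vocab[t] = i
	}
	return vocab
}

// cooccurrence counts, for every pair of distinct tags, how many games
// carry both. The result is symmetric.
func cooccurrence(games [][]string, vocab map[string]int) [][]float64 {
	n := len(vocab)
	m := make([][]float64, n)
	for i := range m {
		m[i] = make([]float64, n)
	}
	for _, tags := range games {
		for a := 0; a < len(tags); a++ {
			for b := a + 1; b < len(tags); b++ {
				i, j := vocab[tags[a]], vocab[tags[b]]
				if i == j {
					continue // ignore a tag paired with itself
				}
				m[i][j]++
				m[j][i]++
			}
		}
	}
	return m
}

// normalize returns a copy of v scaled to unit length (or zeros if v is zero).
func normalize(v []float64) []float64 {
	var sumSq float64
	for _, x := range v {
		sumSq += x * x
	}
	out := make([]float64, len(v))
	norm := math.Sqrt(sumSq)
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

func main() {
	games := [][]string{
		{"tg.christmas", "tg.santa", "c.1"},
		{"tg.christmas", "tg.santa"},
		{"tg.horror", "c.1"},
	}
	vocab := buildVocab(games)
	m := cooccurrence(games, vocab)
	fmt.Println(m[vocab["tg.christmas"]][vocab["tg.santa"]]) // prints 2
	fmt.Println(normalize([]float64{3, 4}))
}
```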

Schema

The algorithm will read from a table that looks like:

CREATE TABLE games_search (
  game_id integer NOT NULL,
  facets tsvector NOT NULL,
  -- ... other columns ignored
);

See games_search.md for a detailed reference of the facets stored in games_search.

The tool reads the game_id and facets columns in batches, where facets contains tags as a PostgreSQL tsvector (e.g., 'c.1' 'in.2' 'tg.horror' 'tg.puzzle'). Tags are extracted from the tsvector using PostgreSQL's tsvector_to_array() function.

And will generate a table that looks like:

CREATE TABLE facet_embeddings (
  facet text PRIMARY KEY,
  dim int NOT NULL,
  vector double precision[] NOT NULL,
  last_trained_at timestamp without time zone NOT NULL DEFAULT now()
);

This output table stores one row per unique tag, where:

facet - The facet name (e.g., "c.1", "tg.puzzle")
dim - The dimensionality of the embedding vector, fixed at generation time
vector - The normalized embedding vector as a PostgreSQL array
last_trained_at - Timestamp when the embedding was computed

The table is created automatically if it doesn't exist and is truncated on each run before inserting new embeddings.
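To make the vector column concrete, here is a small sketch of how a normalized embedding could be rendered as a PostgreSQL double precision[] array literal. The helper name pgFloatArray is hypothetical, and the tool itself may well hand the slice to its database driver instead of building a string; this only shows what a stored value looks like.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pgFloatArray renders a vector as a PostgreSQL array literal such as
// {0.6,0.8}. It assumes finite values, which holds for unit vectors.
func pgFloatArray(v []float64) string {
	parts := make([]string, len(v))
	for i, x := range v {
		// 'g' with precision -1 uses the shortest form that round-trips.
		parts[i] = strconv.FormatFloat(x, 'g', -1, 64)
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	vec := []float64{0.6, 0.8}
	// Mirrors the facet, dim and vector columns of the output table.
	fmt.Printf("facet=%s dim=%d vector=%s\n", "tg.puzzle", len(vec), pgFloatArray(vec))
}
```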

Build

go build .

Usage

The tool is configured via command-line flags.

./facetembeddings [flags]

Configuration Flags

Flag	Description	Default
`-table`	Database table to write embeddings into.	`facet_embeddings`
`-output-file`	If set, write embeddings to this CSV file instead of the database (columns: facet, dim, vector, last_trained_at).	empty
`-embedding-dim`	The dimensionality of the output vectors.	`32`
`-min-tag-frequency`	Minimum number of times a tag must appear across all games to be included in the vocabulary.	`5`
`-max-tags`	Maximum number of unique tags to generate embeddings for, sorted by frequency. `0` means unlimited.	`20000`
`-min-cooccurrence`	Minimum co-occurrence count required to keep an entry in the matrix. Helps prune noise.	`1`
`-matrix-type`	Matrix type: `cooc` (raw co-occurrence) or `ppmi` (positive PMI).	`ppmi`
`-factorization`	Factorization method: `svd` or `als`.	`svd`
`-als-iterations`	Max ALS iterations (used when `-factorization=als`).	`15`
`-als-lambda`	ALS regularization parameter (λ).	`0.1`
`-als-convergence`	ALS early-stop threshold for relative loss change.	`1e-4`
`-batch-size`	Number of game rows to fetch from the database in a single batch.	`20000`
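For readers unfamiliar with the ppmi matrix type, the sketch below applies the textbook definition of positive pointwise mutual information to a symmetric co-occurrence matrix: PPMI(i, j) = max(0, log(c_ij * total / (row_i * row_j))). The function name ppmi and the dense representation are assumptions for illustration; the exact variant used by the tool (smoothing, log base, sparse storage) is not described here.

```go
package main

import (
	"fmt"
	"math"
)

// ppmi converts a symmetric matrix of co-occurrence counts into positive
// PMI values. Cells with a zero count or a negative PMI become 0.
func ppmi(counts [][]float64) [][]float64 {
	n := len(counts)
	rowSum := make([]float64, n)
	var total float64
	for i, row := range counts {
		for _, c := range row {
			rowSum[i] += c
			total += c
		}
	}
	out := make([][]float64, n)
	for i := range out {
		out[i] = make([]float64, n)
		for j := range out[i] {
			c := counts[i][j]
			if c == 0 {
				continue // log(0) is undefined; treat as no association
			}
			// Because the matrix is symmetric, column sums equal row sums.
			pmi := math.Log(c * total / (rowSum[i] * rowSum[j]))
			if pmi > 0 {
				out[i][j] = pmi
			}
		}
	}
	return out
}

func main() {
	// Toy counts for four tags, invented for illustration.
	counts := [][]float64{
		{0, 1, 10, 0},
		{1, 0, 0, 10},
		{10, 0, 0, 0},
		{0, 10, 0, 0},
	}
	for _, row := range ppmi(counts) {
		fmt.Printf("%.3f\n", row)
	}
}
```

In this toy matrix the pair (0, 1) co-occurs less often than chance would predict, so its PMI is negative and is clipped to zero, while the pair (0, 2) keeps a positive score.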

Example

To generate 128-dimensional embeddings for all tags that appear at least 10 times and save them to the tag_vectors table:

PGDATABASE=itchio_development ./facetembeddings \
  -table="tag_vectors" \
  -embedding-dim=128 \
  -min-tag-frequency=10
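The effect of -min-tag-frequency and -max-tags on the vocabulary can be pictured with the following sketch. The names countTags and selectVocab are hypothetical, and the alphabetical tie-break is an assumption added only to keep the output deterministic; the tool's own ordering of equally frequent tags is not documented here.

```go
package main

import (
	"fmt"
	"sort"
)

// countTags counts how many games each tag appears on.
func countTags(games [][]string) map[string]int {
	freq := map[string]int{}
	for _, tags := range games {
		for _, t := range tags {
			freq[t]++
		}
	}
	return freq
}

// selectVocab keeps tags seen at least minFreq times, ordered by
// descending frequency, and capped at maxTags (0 means unlimited).
func selectVocab(freq map[string]int, minFreq, maxTags int) []string {
	tags := make([]string, 0, len(freq))
	for t, n := range freq {
		if n >= minFreq {
			tags = append(tags, t)
		}
	}
	sort.Slice(tags, func(i, j int) bool {
		if freq[tags[i]] != freq[tags[j]] {
			return freq[tags[i]] > freq[tags[j]]
		}
		return tags[i] < tags[j] // assumed tie-break for determinism
	})
	if maxTags > 0 && len(tags) > maxTags {
		tags = tags[:maxTags]
	}
	return tags
}

func main() {
	games := [][]string{
		{"c.1", "tg.retro", "tg.puzzle"},
		{"c.1", "tg.retro"},
		{"c.1", "tg.horror"},
	}
	freq := countTags(games)
	fmt.Println(selectVocab(freq, 2, 0)) // prints [c.1 tg.retro]
}
```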
