
Autoencoder to compress distance matrices of pretrained embedding files.


langfield/embedding-encoder


embedding-encoder

Autoencoder for word2vec word embedding files.

train.sh

Main training script. Runs ae.py.

dist_only.sh

Script to compute and save distance vectors for a given vocab set.

ae.py

Main script. Calls preprocessing.py and next_batch.py. Given an embedding in .txt or .bin format, it preprocesses the vectors and generates distance vectors in batches. A single-hidden-layer autoencoder compresses the distance vectors down to the dimensionality of the source embedding. The script saves the model and writes the embedding vectors to a text file. If the script is run with a model name which already exists, it saves the embedding vectors instead of retraining.
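The compression step can be sketched with a minimal NumPy autoencoder (dimensions, learning rate, and variable names here are assumptions for illustration; ae.py itself is built on TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)

num_inputs = 50   # hypothetical vocab size: one distance per vocab word
hidden_dim = 10   # hypothetical target embedding dimensionality

# Toy "distance vectors": row i holds word i's distance to every vocab word.
X = rng.random((num_inputs, num_inputs))

# Single-hidden-layer autoencoder: encode to hidden_dim, decode back.
W_enc = rng.normal(0.0, 0.1, (num_inputs, hidden_dim))
W_dec = rng.normal(0.0, 0.1, (hidden_dim, num_inputs))

lr = 0.05
losses = []
for step in range(500):
    H = X @ W_enc          # compressed codes -- these become the new embedding
    X_hat = H @ W_dec      # reconstructed distance vectors
    err = X_hat - X
    losses.append((err ** 2).mean())
    # Mean-squared-error gradients (constant factor folded into lr).
    g_dec = (H.T @ err) / err.size
    g_enc = (X.T @ (err @ W_dec.T)) / err.size
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, the rows of `H` play the role of the compressed embedding vectors that ae.py saves to disk.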

next_batch.py

Function which creates a new batch of size batch_size, randomly chosen from our dataset. For batch_size = 1, we take one 100-dimensional vector and compute its distance from every other vector in the dataset, giving a num_inputs-dimensional vector which represents the distance of every vector from our "batch" vector. If we choose batch_size = k, then we get k num_inputs-dimensional vectors.
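A minimal sketch of that batching logic (the function signature and the use of Euclidean distance are assumptions about next_batch.py):

```python
import numpy as np

def next_batch(embeddings, batch_size, rng):
    """Pick batch_size random vectors; return their distances to every vector.

    embeddings: (num_inputs, dim) array.
    Returns a (batch_size, num_inputs) array of Euclidean distances.
    """
    idx = rng.choice(len(embeddings), size=batch_size, replace=False)
    diff = embeddings[idx][:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
vectors = rng.random((200, 100))     # toy stand-in for a 100-dim embedding
dists = next_batch(vectors, 1, rng)  # one 200-dimensional distance vector
```

Each chosen vector's distance to itself is zero, so every row of the result contains at least one exact zero.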

rand_vecs.py

Script to generate an embedding with random, normalized vectors for each token from a source vocab file.
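A sketch of that generation step (the function name and seeding scheme are assumptions):

```python
import numpy as np

def random_embedding(vocab, dim, seed=0):
    """Map each token to a random vector normalized to unit length."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(vocab), dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit L2 norm per row
    return dict(zip(vocab, vecs))

emb = random_embedding(["the", "cat", "sat"], dim=100)
```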

preprocessing.py

Preprocessing for ae.py

convert_embedding.py

Quick and dirty script to convert embeddings from text to binary or vice-versa.

emb_modulus.py

Compute and print the average vector norm given the location of a pretrained embedding.
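For a word2vec-style text file ("word v1 v2 ..." per line), that computation amounts to the following sketch (the header handling and function name are assumptions):

```python
import numpy as np

def average_norm(path):
    """Mean L2 norm of the vectors in a word2vec-style text embedding file."""
    norms = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:   # skip the optional "vocab_size dim" header
                continue
            norms.append(np.linalg.norm(np.array(parts[1:], dtype=float)))
    return float(np.mean(norms))
```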

save_first_n.py

Saves the first n most frequent vectors given a pretrained embedding file.
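Since word2vec text files conventionally list words in descending frequency order, this reduces to copying the first n vector lines (a sketch; the header handling and function name are assumptions):

```python
def save_first_n(src, dst, n):
    """Copy the first n vector lines (plus any header) from src to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        kept = 0
        for line in fin:
            if len(line.split()) < 3:  # pass the optional header through
                fout.write(line)
                continue
            if kept >= n:
                break
            fout.write(line)
            kept += 1
```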

nn.py

Unfinished script to compute nearest neighbors.

notes.txt

Development notes.

The idea is that we pick one embedding (or a few; this is "batch_size"), compute the distance from this embedding to all others, and train on this at each step. A placeholder is a stand-in for our dataset: we assign data to it later. Data is "fed" into the persistent TensorFlow network graph through these placeholders.
