
Autoencoder to compress distance matrices of pretrained embedding files.


langfield/embedding-encoder


embedding-encoder

Autoencoder for word2vec word embedding files.

train.sh

Main training script. Runs ae.py.

dist_only.sh

Script to compute and save distance vectors for a given vocab set.

ae.py

Main script. Calls preprocessing.py and next_batch.py. Given an embedding in .txt or .bin format, it preprocesses the vectors and generates distance vectors in batches. A single-hidden-layer autoencoder compresses the distance vectors down to the dimensionality of the source embedding. The script saves the model and writes the embedding vectors to a text file. If the script is run with a model name which already exists, it saves the embedding vectors instead of retraining.
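The compression step can be sketched with a minimal NumPy autoencoder (dimensions, learning rate, and variable names here are assumptions for illustration; ae.py itself is built on TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)

num_inputs = 50   # hypothetical vocab size: one distance per vocab word
hidden_dim = 10   # hypothetical target embedding dimensionality

# Toy "distance vectors": row i holds word i's distance to every vocab word.
X = rng.random((num_inputs, num_inputs))

# Single-hidden-layer autoencoder: encode to hidden_dim, decode back.
W_enc = rng.normal(0.0, 0.1, (num_inputs, hidden_dim))
W_dec = rng.normal(0.0, 0.1, (hidden_dim, num_inputs))

lr = 0.05
losses = []
for step in range(500):
    H = X @ W_enc          # compressed codes -- these become the new embedding
    X_hat = H @ W_dec      # reconstructed distance vectors
    err = X_hat - X
    losses.append((err ** 2).mean())
    # Mean-squared-error gradients (constant factor folded into lr).
    g_dec = (H.T @ err) / err.size
    g_enc = (X.T @ (err @ W_dec.T)) / err.size
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, the rows of `H` play the role of the compressed embedding vectors that ae.py saves to disk.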

next_batch.py

Function which creates a new batch of size batch_size, randomly chosen from our dataset. For batch_size = 1, we take one 100-dimensional vector and compute its distance from every other vector in the dataset, giving a num_inputs-dimensional vector which represents the distance of every vector from our "batch" vector. If we choose batch_size = k, then we get k num_inputs-dimensional vectors.
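A minimal sketch of that batching logic (the function signature and the use of Euclidean distance are assumptions about next_batch.py):

```python
import numpy as np

def next_batch(embeddings, batch_size, rng):
    """Pick batch_size random vectors; return their distances to every vector.

    embeddings: (num_inputs, dim) array.
    Returns a (batch_size, num_inputs) array of Euclidean distances.
    """
    idx = rng.choice(len(embeddings), size=batch_size, replace=False)
    diff = embeddings[idx][:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
vectors = rng.random((200, 100))     # toy stand-in for a 100-dim embedding
dists = next_batch(vectors, 1, rng)  # one 200-dimensional distance vector
```

Each chosen vector's distance to itself is zero, so every row of the result contains at least one exact zero.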

rand_vecs.py

Script to generate an embedding with random, normalized vectors for each token from a source vocab file.
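A sketch of that generation step (the function name and seeding scheme are assumptions):

```python
import numpy as np

def random_embedding(vocab, dim, seed=0):
    """Map each token to a random vector normalized to unit length."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(vocab), dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit L2 norm per row
    return dict(zip(vocab, vecs))

emb = random_embedding(["the", "cat", "sat"], dim=100)
```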

preprocessing.py

Preprocessing for ae.py

convert_embedding.py

Quick and dirty script to convert embeddings from text to binary or vice-versa.

emb_modulus.py

Compute and print the average vector norm given the location of a pretrained embedding.
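For a word2vec-style text file ("word v1 v2 ..." per line), that computation amounts to the following sketch (the header handling and function name are assumptions):

```python
import numpy as np

def average_norm(path):
    """Mean L2 norm of the vectors in a word2vec-style text embedding file."""
    norms = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:   # skip the optional "vocab_size dim" header
                continue
            norms.append(np.linalg.norm(np.array(parts[1:], dtype=float)))
    return float(np.mean(norms))
```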

save_first_n.py

Saves the first n most frequent vectors given a pretrained embedding file.
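Since word2vec text files conventionally list words in descending frequency order, this reduces to copying the first n vector lines (a sketch; the header handling and function name are assumptions):

```python
def save_first_n(src, dst, n):
    """Copy the first n vector lines (plus any header) from src to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        kept = 0
        for line in fin:
            if len(line.split()) < 3:  # pass the optional header through
                fout.write(line)
                continue
            if kept >= n:
                break
            fout.write(line)
            kept += 1
```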

nn.py

Unfinished script to compute nearest neighbors.

notes.txt

Development notes.

The idea is that we pick one embedding (or a few; this is "batch_size"), compute the distance from this embedding to all others, and train on this at each step. A placeholder is a stand-in for our dataset: we assign data to it later. Data is "fed" into the persistent TensorFlow network graph through these placeholders.
