This repo uses OpenAI embeddings and the Pinecone vector database to generate and query embeddings from a VTT transcript file.
The purpose of this repo is to implement semantic search as an extra resource for understanding Andrej Karpathy's latest video, "Let's build GPT"; however, it is general enough to use with any transcript.
Shoutout to Miguel's yt-whisper library for helping with the YouTube transcription. The data/ in this repo was generated using the small model.
pip install -r requirements.txt
cp .env.sample .env
You'll need API keys from OpenAI and Pinecone:
OPENAI_KEY=***
PINECONE_KEY=***
PINECONE_ENVIRONMENT=***
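For reference, here is a minimal sketch of how these values might be read at runtime. The helper name `load_config` is hypothetical (not part of this repo); a library like python-dotenv can load `.env` into the process environment first.

```python
import os

# Hypothetical helper (not from this repo): reads the keys defined in .env.
# With python-dotenv, calling dotenv.load_dotenv() beforehand would copy
# the .env file's entries into os.environ.
def load_config() -> dict:
    return {
        "openai_key": os.environ.get("OPENAI_KEY", ""),
        "pinecone_key": os.environ.get("PINECONE_KEY", ""),
        "pinecone_environment": os.environ.get("PINECONE_ENVIRONMENT", ""),
    }
```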
(Optional)
Head over to Pinecone and create an index with dimension 1536, which matches the dimension of OpenAI's text-embedding-ada-002 embeddings.
The data in this repo was generated from Let's build GPT using yt-whisper:
- /data/karpathy.vtt: contains the raw VTT file
- /data/karpathy_embeddings.csv: contains the dataframe with the embeddings. You can use this file to directly seed your Pinecone index.
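The exact column layout of karpathy_embeddings.csv isn't documented here, but loading a CSV whose embedding column was written as a stringified list typically looks like the sketch below. The column names are assumptions for illustration, not the repo's actual schema.

```python
import ast
import csv
import io

# Assumed columns for illustration; the real CSV's headers may differ.
sample = io.StringIO(
    "text,start,end,embedding\n"
    '"So trill here is this matrix",00:54:52.240,00:55:01.440,"[0.1, 0.2, 0.3]"\n'
)
rows = list(csv.DictReader(sample))

# Embedding vectors round-tripped through CSV come back as strings;
# ast.literal_eval parses them back into Python lists safely.
for row in rows:
    row["embedding"] = ast.literal_eval(row["embedding"])
```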
This will save an embeddings CSV file as {file_name}_embeddings.csv:
python embed_vtt.py generate --vtt-file="data/karpathy.vtt"
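Under the hood, the generate step has to split the VTT file into timestamped cues before embedding them. Here is a minimal stdlib sketch of that parsing step (not the repo's actual implementation):

```python
import io

# A tiny in-memory WebVTT sample: a timestamp line followed by cue text.
vtt = io.StringIO(
    "WEBVTT\n\n"
    "00:54:52.240 --> 00:55:01.440\n"
    "So trill here is this matrix, lower triangular ones.\n"
)

cues = []
lines = [line.rstrip("\n") for line in vtt]
i = 0
while i < len(lines):
    if "-->" in lines[i]:
        # Timestamp line: "start --> end"
        start, _, end = lines[i].partition(" --> ")
        text_lines = []
        i += 1
        # Cue text continues until a blank line or end of file.
        while i < len(lines) and lines[i]:
            text_lines.append(lines[i])
            i += 1
        cues.append({"start": start, "end": end, "text": " ".join(text_lines)})
    else:
        i += 1
```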
python embed_vtt.py upload --csv-embedding-file="data/karpathy_embeddings.csv"
python embed_vtt.py query --text="the usefulness of trill tensors"
Sample output:
0.81: But let me talk through it. It uses softmax. So trill here is this matrix, lower triangular ones. 00:54:52.240-00:55:01.440
0.81: but torches this function called trill, which is short for a triangular, something like that. 00:48:48.960-00:48:55.920
0.80: which is a very thin wrapper around basically a tensor of shape vocab size by vocab size. 00:23:17.920-00:23:23.280
0.79: I'm creating this trill variable. Trill is not a parameter of the module. So in sort of pytorch 01:19:36.880-01:19:42.160
0.79: does that. And I'm going to start to use the PyTorch library, and specifically the Torch.tensor 00:12:54.320-00:12:59.200
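The leading number on each result is a similarity score returned by Pinecone. Assuming the index was created with the cosine metric, it is the cosine similarity between the query embedding and the segment embedding, which can be computed like this:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Scores near 1.0 mean the segment's embedding points in almost the same direction as the query's, so the ~0.8 scores above indicate strong but not exact matches.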
This script is open-source and licensed under the MIT License.