CodeVerb aims to revolutionize development by generating Python code from plain English text.
There are three repositories, each with its own purpose:
- transformer-pytorch: Contains the code for the Transformer-based model, developed using PyTorch.
- web-portal: Contains the code for the frontend portal, built with ReactJS and TailwindCSS.
- model-api: Contains the code for the portal's backend server, built with Flask.
The dataset was scraped from public GitHub repositories, StackExchange, and GeeksforGeeks:
- Hundreds of millions of lines of Python code
- Approximately 7.2 million Python files scraped
- Assuming about 5 files can be scraped per second (≈432,000 files per day), sequential scraping would take on the order of 15 days
- Parallel processing let us scrape the dataset in roughly 7 days
We scraped our dataset from StackExchange (StackOverflow, CodeReview, etc.) and GitHub, writing custom scrapers from scratch for both platforms. Because the GitHub dataset was massive, we ran the scrapers on a cluster with multithreading so that the work executed in parallel, making the run faster and more efficient. We stored the dataset as many separate files, which makes it easier both to transfer over the network and to load while training on or inspecting the data.
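For illustration, a minimal sketch of this kind of threaded scraping is shown below. The `FILE_URLS` list, output directory, and worker count are hypothetical placeholders; the actual CodeVerb scrapers are custom-built and more involved (authentication, rate limiting, retries, etc.).

```python
# Minimal sketch of parallel scraping (FILE_URLS and OUT_DIR are hypothetical
# placeholders; the real CodeVerb scrapers are custom-built and more involved).
import os
from concurrent.futures import ThreadPoolExecutor

import requests

FILE_URLS = [
    # e.g. "https://raw.githubusercontent.com/<user>/<repo>/main/example.py",
]
OUT_DIR = "scraped_python_files"
os.makedirs(OUT_DIR, exist_ok=True)

def scrape_file(index_and_url):
    """Download one Python file and store it as a separate file on disk."""
    index, url = index_and_url
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    path = os.path.join(OUT_DIR, f"file_{index:07d}.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path

# Threads let many downloads overlap, since each request mostly waits on the network.
with ThreadPoolExecutor(max_workers=32) as pool:
    for saved_path in pool.map(scrape_file, enumerate(FILE_URLS)):
        pass  # progress logging could go here
```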
CodeVerb uses a state-of-the-art deep learning model to generate code from natural-language input. In 2017, the paper "Attention Is All You Need" was published, paving the way for large language models and their breakthroughs in Natural Language Processing (NLP). Our system is designed around the idea introduced in that paper.
CodeVerb uses a Transformer-based model to achieve its goal. The encoder-decoder architecture fits this use case well, since an English description must be translated into a sequence of Python tokens.
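As a rough illustration, the sketch below wires up an encoder-decoder Transformer using PyTorch's built-in `nn.Transformer`. The vocabulary sizes, model dimensions, and layer counts are illustrative assumptions rather than CodeVerb's actual hyperparameters, and positional encodings are omitted for brevity.

```python
# Minimal encoder-decoder Transformer sketch (hyperparameters are illustrative
# assumptions, not CodeVerb's configuration; positional encodings omitted).
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab=30_000, tgt_vocab=30_000, d_model=512,
                 nhead=8, num_layers=6, dim_ff=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)  # English tokens
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)  # Python tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, tgt_vocab)      # next-token logits

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(
            self.src_embed(src_ids), self.tgt_embed(tgt_ids), tgt_mask=tgt_mask
        )
        return self.out_proj(hidden)

# Example: a batch of 2 English prompts (length 16) and partial code (length 24).
model = Seq2SeqTransformer()
logits = model(torch.randint(0, 30_000, (2, 16)), torch.randint(0, 30_000, (2, 24)))
print(logits.shape)  # torch.Size([2, 24, 30000])
```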
Set up Distributed Training Environment
Neural Network Based Large Language Model Training [Currently Ongoing]
- Implemented a PyTorch DistributedDataParallel (DDP) training pipeline (a minimal sketch follows this list)
- Used the NVIDIA Collective Communications Library (NCCL) backend for communication across the distributed machines
- Total machines: 3
- Total GPUs: 3 (one NVIDIA RTX 3060 per machine)
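The sketch below shows how such a DDP setup with the NCCL backend can be initialized. It assumes the process-group environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are set by a launcher such as `torchrun`, and the model and dataset are placeholders rather than the actual CodeVerb training code.

```python
# Minimal DDP + NCCL sketch (assumes torchrun sets RANK/LOCAL_RANK/WORLD_SIZE;
# model and dataset are placeholders, not the CodeVerb training script).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # NCCL handles GPU-to-GPU communication across the machines.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 512))       # placeholder data
    sampler = DistributedSampler(dataset)                  # shards data per rank
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    for (batch,) in loader:
        batch = batch.cuda(local_rank)
        loss = ddp_model(batch).pow(2).mean()              # dummy loss
        optimizer.zero_grad()
        loss.backward()        # gradients are all-reduced across the GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With one GPU per machine, each node would be launched with something like `torchrun --nnodes=3 --nproc_per_node=1 --node_rank=<0|1|2> --master_addr=<head-node-ip> --master_port=29500 train.py` (values shown for illustration).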
- Epochs: 50,000
- Single-epoch training time: ~0.7 minutes (≈42 seconds)
- Total training time: 0.7 × 50,000 = 35,000 minutes ≈ 24 days
- Current model epochs: ~5,000
- Training time so far: 0.7 × 5,000 = 3,500 minutes ≈ 3 days
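As a quick back-of-the-envelope check, these estimates can be reproduced with a few lines of Python using the figures above:

```python
# Back-of-the-envelope training-time estimate from the figures above.
EPOCH_MINUTES = 0.7      # per-epoch training time (~42 seconds)
TOTAL_EPOCHS = 50_000
CURRENT_EPOCHS = 5_000

def days(epochs: int) -> float:
    return epochs * EPOCH_MINUTES / (60 * 24)   # minutes -> days

print(f"Full run: {days(TOTAL_EPOCHS):.1f} days")     # ~24.3 days
print(f"So far:   {days(CURRENT_EPOCHS):.1f} days")   # ~2.4 days
```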