CodeVerb aims to revolutionize development by generating Python code from plain English text.
There are three repositories, each with its own purpose:
- transformer-pytorch: Contains the code for the Transformer-based model, developed using PyTorch.
- web-portal: Contains the code for the frontend portal, built with ReactJS and TailwindCSS.
- model-api: Contains the code for the portal's backend server, built with Flask.
The dataset was scraped from public GitHub repositories, StackExchange, and GeeksforGeeks:
- Hundreds of millions of lines of Python code
- Approximately 7.2 million Python files scraped
- Assuming about 5 files can be scraped per second (≈432,000 files per day), sequential scraping would take on the order of 15 days
- Parallel processing let us scrape the dataset in roughly 7 days
We scraped our dataset from StackExchange (StackOverflow, CodeReview, etc.) and GitHub, writing custom scrapers from scratch for both platforms. Because the GitHub dataset was massive, we ran the scrapers on a cluster with multithreading so that the work executed in parallel, making the run faster and more efficient. We stored the dataset as many separate files, which makes it easier both to transfer over the network and to load while training on or inspecting the data.
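For illustration, a minimal sketch of this kind of threaded scraping is shown below. The `FILE_URLS` list, output directory, and worker count are hypothetical placeholders; the actual CodeVerb scrapers are custom-built and more involved (authentication, rate limiting, retries, etc.).

```python
# Minimal sketch of parallel scraping (FILE_URLS and OUT_DIR are hypothetical
# placeholders; the real CodeVerb scrapers are custom-built and more involved).
import os
from concurrent.futures import ThreadPoolExecutor

import requests

FILE_URLS = [
    # e.g. "https://raw.githubusercontent.com/<user>/<repo>/main/example.py",
]
OUT_DIR = "scraped_python_files"
os.makedirs(OUT_DIR, exist_ok=True)

def scrape_file(index_and_url):
    """Download one Python file and store it as a separate file on disk."""
    index, url = index_and_url
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    path = os.path.join(OUT_DIR, f"file_{index:07d}.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path

# Threads let many downloads overlap, since each request mostly waits on the network.
with ThreadPoolExecutor(max_workers=32) as pool:
    for saved_path in pool.map(scrape_file, enumerate(FILE_URLS)):
        pass  # progress logging could go here
```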
CodeVerb uses a state-of-the-art deep learning model to generate code from natural-language input. In 2017, the paper "Attention Is All You Need" was published, paving the way for large language models and their breakthroughs in Natural Language Processing (NLP). Our system is designed around the idea introduced in that paper.
CodeVerb uses a Transformer-based model to achieve its goal. The encoder-decoder architecture fits this use case well, since an English description must be translated into a sequence of Python tokens.
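As a rough illustration, the sketch below wires up an encoder-decoder Transformer using PyTorch's built-in `nn.Transformer`. The vocabulary sizes, model dimensions, and layer counts are illustrative assumptions rather than CodeVerb's actual hyperparameters, and positional encodings are omitted for brevity.

```python
# Minimal encoder-decoder Transformer sketch (hyperparameters are illustrative
# assumptions, not CodeVerb's configuration; positional encodings omitted).
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, src_vocab=30_000, tgt_vocab=30_000, d_model=512,
                 nhead=8, num_layers=6, dim_ff=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)  # English tokens
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)  # Python tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, tgt_vocab)      # next-token logits

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(
            self.src_embed(src_ids), self.tgt_embed(tgt_ids), tgt_mask=tgt_mask
        )
        return self.out_proj(hidden)

# Example: a batch of 2 English prompts (length 16) and partial code (length 24).
model = Seq2SeqTransformer()
logits = model(torch.randint(0, 30_000, (2, 16)), torch.randint(0, 30_000, (2, 24)))
print(logits.shape)  # torch.Size([2, 24, 30000])
```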
Set up Distributed Training Environment
Neural Network Based Large Language Model Training [Currently Ongoing]
- Implemented a PyTorch DistributedDataParallel (DDP) training pipeline (a minimal sketch follows this list)
- Used the NVIDIA Collective Communications Library (NCCL) backend for communication across the distributed machines
- Total machines: 3
- Total GPUs: 3 (one NVIDIA RTX 3060 per machine)
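The sketch below shows how such a DDP setup with the NCCL backend can be initialized. It assumes the process-group environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are set by a launcher such as `torchrun`, and the model and dataset are placeholders rather than the actual CodeVerb training code.

```python
# Minimal DDP + NCCL sketch (assumes torchrun sets RANK/LOCAL_RANK/WORLD_SIZE;
# model and dataset are placeholders, not the CodeVerb training script).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # NCCL handles GPU-to-GPU communication across the machines.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 512))       # placeholder data
    sampler = DistributedSampler(dataset)                  # shards data per rank
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    for (batch,) in loader:
        batch = batch.cuda(local_rank)
        loss = ddp_model(batch).pow(2).mean()              # dummy loss
        optimizer.zero_grad()
        loss.backward()        # gradients are all-reduced across the GPUs
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With one GPU per machine, each node would be launched with something like `torchrun --nnodes=3 --nproc_per_node=1 --node_rank=<0|1|2> --master_addr=<head-node-ip> --master_port=29500 train.py` (values shown for illustration).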
- Epochs: 50,000
- Single-epoch training time: ~0.7 minutes (≈42 seconds)
- Total training time: 0.7 × 50,000 = 35,000 minutes ≈ 24 days
- Current model epochs: ~5,000
- Training time so far: 0.7 × 5,000 = 3,500 minutes ≈ 3 days
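As a quick back-of-the-envelope check, these estimates can be reproduced with a few lines of Python using the figures above:

```python
# Back-of-the-envelope training-time estimate from the figures above.
EPOCH_MINUTES = 0.7      # per-epoch training time (~42 seconds)
TOTAL_EPOCHS = 50_000
CURRENT_EPOCHS = 5_000

def days(epochs: int) -> float:
    return epochs * EPOCH_MINUTES / (60 * 24)   # minutes -> days

print(f"Full run: {days(TOTAL_EPOCHS):.1f} days")     # ~24.3 days
print(f"So far:   {days(CURRENT_EPOCHS):.1f} days")   # ~2.4 days
```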