LanguageModel-HTML-CodeGen

This is the fine-tuned model, hosted on the Hugging Face Hub.

Files and their uses

Challenges faced and their solutions are described in the PDF file.

Data_loading_and_Fine_tuning.ipynb

This notebook covers HTML dataset preparation and fine-tuning of the sharded, 4-bit quantized CodeLlama-7B model with a native PyTorch training loop, followed by saving the model. A condensed sketch of the full pipeline follows the step list below.
  1. Import the necessary libraries
  2. Dataset Loading: Uses load_dataset from the datasets library to load the "jawerty/html_dataset" dataset from the Hugging Face Hub and splits it into training (80%) and testing (20%) sets.
  3. Model and Tokenizer Setup: model_id: Identifier for a pre-trained model, "TinyPixel/CodeLlama-7B-Instruct-bf16-sharded". quantization_config: Configures the model to use 4-bit quantization for reduced memory usage and potentially faster computation. tokenizer: Loads a tokenizer corresponding to the model. The tokenizer converts text into a format the model can understand. model: Loads the actual model with the specified quantization configuration.
  4. Model Preparation: The model is prepared for training or inference, with adjustments for using 4-bit quantization. The tokenizer's padding token is set to its end-of-string token, if available.
  5. Tokenize Function: A function named tokenize_function is defined to tokenize the dataset. It tokenizes the inputs (assumed to be in the 'label' field of the dataset) and the targets (assumed to be in the 'html' field) using the previously loaded tokenizer. The tokenized inputs and labels are truncated or padded to a maximum length of 512 tokens. The labels (targets) are then added to the inputs under the key 'labels'.
  6. Applying Tokenization: The tokenize_function is applied to both the training and testing datasets using the map function, which processes the datasets in batches for efficiency.
  7. Removing Unneeded Columns: Columns 'html' and 'label' are removed from both tokenized datasets as they are no longer needed after tokenization.
  8. Setting Dataset Format: The datasets are set to the "torch" format, preparing them to be used with PyTorch for model training or evaluation.
  9. DataLoaders: It creates DataLoader instances for both training and evaluation datasets, which are used to efficiently load data in batches during training and evaluation.
  10. Optimizer: An AdamW optimizer is initialized for the model's parameters with a specified learning rate.
  11. Learning Rate Scheduler: A linear learning rate scheduler is set up, which adjusts the learning rate over training steps.
  12. Device Setup: The script checks for GPU availability and sets the model to run on GPU if available, otherwise on CPU. This is for efficient training and evaluation.
  13. Gradient Checkpointing and PEFT Preparation: Enables gradient checkpointing in the model to save memory during training. Prepares the model for k-bit training using prepare_model_for_kbit_training for efficient training of large models.
  14. Print Trainable Parameters Function: Defines a function to print the number of trainable parameters in the model, giving an insight into the model's size and complexity.
  15. LoRA Configuration and Model Adaptation: Sets up a configuration for Low-Rank Adaptation (LoRA) to enhance the model's ability to fine-tune efficiently. Applies LoRA to the model using get_peft_model, targeting specific modules within the model.
  16. Accelerate with FSDP (Fully Sharded Data Parallel) Plugin: Initializes the Fully Sharded Data Parallel plugin from accelerate library, which shards the model's parameters across multiple GPUs for efficient parallel training. Offloads state dictionary to CPU and configures it for all ranks. Prepares the model with Accelerator for efficient parallel training using the FSDP strategy.
  17. Training with Native PyTorch: Each batch of tensors is moved to the GPU with .to(device) to match the model's device. The loss is backpropagated and the optimizer updates the weights by stepping along the negative gradient.
  18. Model Evaluation: Iterates over the evaluation dataset without computing gradients (to save memory and compute resources).
  19. Saving the Fine-tuned Model: All model files are saved to a folder and zipped so they can be downloaded and uploaded to the Hugging Face Hub.
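
The steps above correspond roughly to the condensed sketch below. It is not a verbatim copy of the notebook: the batch size, learning rate, LoRA rank, target modules, and epoch count are placeholder values, the custom parameter-counting helper of step 14 is replaced by PEFT's built-in print_trainable_parameters(), and the Accelerate/FSDP wrapping of step 16 is omitted for brevity.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, get_scheduler)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "TinyPixel/CodeLlama-7B-Instruct-bf16-sharded"

# Dataset: 80/20 train/test split of the HTML dataset
dataset = load_dataset("jawerty/html_dataset")
splits = dataset["train"].train_test_split(test_size=0.2)

# 4-bit quantization config, tokenizer, and base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # pad with the end-of-string token
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Tokenization: 'label' holds the prompt, 'html' holds the target code
def tokenize_function(examples):
    inputs = tokenizer(examples["label"], truncation=True,
                       padding="max_length", max_length=512)
    targets = tokenizer(examples["html"], truncation=True,
                        padding="max_length", max_length=512)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = splits.map(tokenize_function, batched=True,
                       remove_columns=["html", "label"])
tokenized.set_format("torch")
train_dataloader = DataLoader(tokenized["train"], batch_size=2, shuffle=True)
eval_dataloader = DataLoader(tokenized["test"], batch_size=2)

# Memory savings plus LoRA adapters on the attention projections
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Optimizer, linear learning-rate schedule, and native PyTorch training loop
num_epochs = 1
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                             num_warmup_steps=0,
                             num_training_steps=num_training_steps)
device = "cuda" if torch.cuda.is_available() else "cpu"

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to the model's device
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()            # backpropagate
        optimizer.step()           # step along the negative gradient
        lr_scheduler.step()
        optimizer.zero_grad()

# Evaluation without gradients, then save the adapter weights for upload
model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        eval_loss = model(**batch).loss  # aggregate or average as needed

model.save_pretrained("finetuned_model")   # zip this folder to upload to the Hub
tokenizer.save_pretrained("finetuned_model")
```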

Testing_finetuned_Model.ipynb

This notebook runs inference with the fine-tuned weights loaded as an adapter on top of the base model. A sample prompt is given and the generated output is analyzed. A minimal inference sketch follows the step list below.
  1. Model and Adapter Configuration: Identifiers for a base model (TinyPixel/CodeLlama-7B-Instruct-bf16-sharded) and an adapter (MG650/CodeLlama_HTML_FineTuned) are specified.
  2. Quantization Configuration (bnb_config): A configuration for 4-bit quantization is created to optimize the model for reduced memory usage and faster computation.
  3. Loading the Language Model (llm): The base language model is loaded with the specified quantization configuration for efficient processing.
  4. Loading the Tokenizer: A tokenizer corresponding to the base model is loaded to convert text to a format the model can process.
  5. Loading the Adapter: An adapter, which is a modular add-on to the base model for a specific task (here, HTML generation), is loaded into the model.
  6. Function generate_html_code: Takes a prompt and generates HTML code using the previously loaded language model (llm) and tokenizer. The function tokenizes the prompt, uses the model to generate a response, and then decodes the response back into text.
  7. Generating HTML: A prompt for generating HTML code for a simple school project website is defined. The generate_html_code function is called with this prompt to generate HTML.
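
The flow above can be sketched as follows. This is a minimal, illustrative version: the adapter is attached with PEFT's PeftModel.from_pretrained, and the generation settings (max_new_tokens, greedy decoding) are placeholder choices rather than the notebook's exact parameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "TinyPixel/CodeLlama-7B-Instruct-bf16-sharded"
adapter_id = "MG650/CodeLlama_HTML_FineTuned"

# 4-bit quantization keeps the 7B base model small enough for a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

llm = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
llm = PeftModel.from_pretrained(llm, adapter_id)  # attach the fine-tuned HTML adapter

def generate_html_code(prompt, max_new_tokens=512):
    # Tokenize the prompt, generate a continuation, and decode it back to text
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        output_ids = llm.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_html_code("Generate the HTML code for a simple school project website."))
```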

User_Interface.ipynb

This notebook uses shell commands and Python code to run a web application with Chainlit and expose it publicly through ngrok, effectively turning it into a public API. A sketch of the cells follows the step list below.
  1. Starting Chainlit App: !chainlit run /content/app.py: This command starts a Chainlit application located at /content/app.py. Chainlit is a framework for building chat-based Python apps, similar in spirit to Streamlit. The process is run in the background (&) and any output is redirected to /content/logs.txt.
  2. Ngrok Setup: !ngrok config add-authtoken : This command configures ngrok with an authentication token (generate your own token at https://dashboard.ngrok.com/get-started/your-authtoken). Ngrok is a tool that creates secure tunnels to localhost, allowing you to expose your local server to the internet.
  3. Creating Ngrok Tunnel: ngrok.connect(8000): Establishes an ngrok tunnel to the local server running on port 8000. This makes the locally running Chainlit app accessible over the internet. The public URL of the ngrok tunnel is printed, allowing access to the Chainlit app from anywhere.
  4. Getting Current Tunnels: ngrok.get_tunnels(): Retrieves information about the active ngrok tunnels.
  5. Closing Ngrok: ngrok.kill(): Terminates all active ngrok tunnels.
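
Put together, the notebook cells look roughly like the sketch below, run inside Colab/Jupyter (lines starting with ! are shell commands). The log path, port, and authtoken placeholder are illustrative, not the notebook's literal values.

```python
# Start the Chainlit app in the background and capture its output in a log file.
!chainlit run /content/app.py > /content/logs.txt 2>&1 &

# Register your own ngrok authtoken (from https://dashboard.ngrok.com/get-started/your-authtoken).
!ngrok config add-authtoken YOUR_AUTHTOKEN

from pyngrok import ngrok

public_url = ngrok.connect(8000)   # tunnel to the local Chainlit server on port 8000
print(public_url)                  # open this URL to reach the chat app from anywhere

print(ngrok.get_tunnels())         # inspect the active tunnels

ngrok.kill()                       # terminate all tunnels when done
```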

app.py

This file sets up an interactive chat application using Chainlit (a chat-based interface for Python applications) to generate HTML code with a pre-trained language model. A minimal sketch of the app follows the step list below.
  1. Initialization: Loads a pre-trained language model (TinyPixel/CodeLlama-7B-Instruct-bf16-sharded) and an adapter (MG650/CodeLlama_HTML_FineTuned) with specific quantization settings for efficiency. The model and tokenizer are stored in the user session for later use.
  2. Chat Interface Setup: Sets up a chat title and icon for the Chainlit interface.
  3. HTML Code Generation: Defines a function that activates on receiving a message. The function uses the stored model and tokenizer to generate HTML code based on the user's input. The generated HTML code is sent back to the user in the chat interface.
  4. Execution: The script is designed to run as a main program and provide an interactive HTML code generation service through a chat interface.
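
A minimal version of such an app.py might look like the sketch below. It assumes the adapter is attached with PEFT, uses Chainlit's on_chat_start / on_message hooks and user session as described above, and leaves out the chat title/icon configuration of step 2; the on_message handler signature shown is for recent Chainlit versions and may differ in older ones, and max_new_tokens is a placeholder.

```python
import torch
import chainlit as cl
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_MODEL = "TinyPixel/CodeLlama-7B-Instruct-bf16-sharded"
ADAPTER = "MG650/CodeLlama_HTML_FineTuned"

@cl.on_chat_start
async def load_model():
    # Load the 4-bit quantized base model and attach the fine-tuned adapter once per chat.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, quantization_config=bnb_config, device_map="auto"
    )
    model = PeftModel.from_pretrained(model, ADAPTER)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    # Keep the heavy objects in the user session so later messages reuse them.
    cl.user_session.set("model", model)
    cl.user_session.set("tokenizer", tokenizer)

@cl.on_message
async def generate(message: cl.Message):
    model = cl.user_session.get("model")
    tokenizer = cl.user_session.get("tokenizer")
    inputs = tokenizer(message.content, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)
    html = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Send the generated HTML back to the chat interface.
    await cl.Message(content=html).send()
```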
