This repository contains a large dataset of news articles scraped from various Pakistani news websites. The dataset covers diverse categories including:
- Politics
- Sports
- Fashion & Style
- International News
- Domestic Affairs
- Science & Technology
We evaluated several large language models (LLMs) for generating question-answer pairs from the scraped news articles:
- Llama2: Generates high-quality question-answer pairs but is relatively slow.
- T5-small: Fast but less accurate, often producing duplicate question-answer pairs.
- GPT-3.5 Turbo and GPT-4: Effective for generating high-quality question-answer pairs efficiently.
Our case study revealed that while Llama2 offers the best quality, it is slower than the GPT models. T5-small, though fast, is less accurate and prone to producing duplicate pairs. Consequently, we used GPT-3.5 Turbo and GPT-4 to generate a more substantial dataset.
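Duplicate question-answer pairs, such as those noted for T5-small, can be filtered out after generation. The following is a minimal sketch, not the repository's own code: it assumes each pair is a dict with hypothetical `question` and `answer` keys, and it treats two pairs as duplicates when they match after lowercasing and stripping punctuation. The sample pairs are made up for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate_qa(pairs):
    """Keep the first occurrence of each normalized (question, answer) pair."""
    seen = set()
    unique = []
    for pair in pairs:
        # Assumed keys: "question" and "answer" (hypothetical schema).
        key = (normalize(pair["question"]), normalize(pair["answer"]))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Illustrative toy data, not taken from the dataset.
sample = [
    {"question": "Who won the match?", "answer": "Pakistan"},
    {"question": "who won the match", "answer": "Pakistan."},
    {"question": "Where was the match played?", "answer": "Lahore"},
]
unique_pairs = deduplicate_qa(sample)
print(len(unique_pairs))  # 2
```

Exact matching after normalization only catches near-identical pairs; paraphrased duplicates would need a similarity measure instead.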
This dataset is open-source and can be used for:
- Fine-tuning LLMs
- Evaluating model performance
Additionally, we have fine-tuned Tiny Llama on this dataset.
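For fine-tuning, Alpaca-format records are usually rendered into a single prompt string. The sketch below uses the standard Stanford Alpaca prompt template with `instruction`, `input`, and `output` fields; whether the Tiny Llama fine-tuning here used exactly this template is an assumption, and `format_example` is a hypothetical helper name.

```python
# Standard Alpaca prompt templates (with and without an "input" field).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def format_example(record: dict) -> str:
    """Render one Alpaca-format record as a training prompt (hypothetical helper)."""
    # Use the shorter template when the record has no (or an empty) input.
    template = ALPACA_WITH_INPUT if record.get("input") else ALPACA_NO_INPUT
    return template.format(
        instruction=record["instruction"],
        input=record.get("input", ""),
        output=record["output"],
    )


# Illustrative record, not taken from the dataset.
example = {"instruction": "Summarize the headline.", "input": "", "output": "A short summary."}
print(format_example(example))
```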
Fig. Panels: LLaMA2, T5-small, GPT-3.5-Turbo, GPT-4. GPT-3.5-Turbo and GPT-4 generate the desired response.
Fig. Gradio demo using T5-small
git clone https://github.com/faizan1234567/QALLM.git
cd QALLM
Create a virtual environment using Python venv:
python3 -m venv qa_llm
source qa_llm/bin/activate
Alternatively, you can use the Anaconda package manager:
conda create -n qa_llm python=3.8.10 -y
conda activate qa_llm
Now install all the required dependencies:
pip install --upgrade pip
pip install -r requirements.txt
QA generation: make sure to read and understand the configs and replace values as required.
python create_alpaca_format_dataset.py --chunk_size 5000 --dataset <path>
Then run QA generation:
python qa_generator.py --model T5-small --cfg cfg/qa_generator.yaml
There is also a run_qa_llm_repo.ipynb notebook under the notebooks directory to install and run QA generation on Google Colab, Kaggle, Gradient, or a local machine with a GPU.
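To illustrate what the dataset-creation step might do with `--chunk_size`, here is a minimal sketch that splits an article into fixed-size chunks and wraps a question-answer pair as an Alpaca-style record. It is not the code in create_alpaca_format_dataset.py: treating the chunk size as a character count, and the helper names `chunk_text` and `to_alpaca_record`, are assumptions made for this example.

```python
import json


def chunk_text(text: str, chunk_size: int = 5000):
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: chunk_size counts characters; the actual script may count
    tokens or words instead.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def to_alpaca_record(question: str, answer: str, context: str = "") -> dict:
    """Wrap a QA pair in the instruction/input/output layout of the Alpaca format."""
    return {"instruction": question, "input": context, "output": answer}


# Placeholder article text, used only to show the chunk sizes.
article = "a" * 12000
chunks = chunk_text(article, chunk_size=5000)
print([len(c) for c in chunks])  # [5000, 5000, 2000]

record = to_alpaca_record("What is the article about?", "Placeholder answer.")
print(json.dumps(record, indent=2))
```

Splitting on a fixed character count can cut sentences in half; splitting on paragraph or sentence boundaries is a common refinement.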
If you find the dataset useful for fine-tuning, research, or development, please star and cite the repo:
Muhammad Faizan and Sana Zafar
@misc{QALLM,
title={NewsQA: News Dataset for QA Generation},
author={Muhammad Faizan and Sana Zafar},
howpublished = {\url{https://github.com/faizan1234567/QALLM}},
year={2024}
}
- QA dataset generation using Llama2 and T5-small
- QA dataset generation using GPT-3.5 Turbo and GPT4
- Scraping news articles from Pakistan-based news channels
- Creating a large fine-tuning dataset in Alpaca format
- Add installation / virtual environment instructions
- Fine-tuning Tiny Llama, Mistral, and Llama3 on the generated dataset
- Evaluation
- Complete ChatBot for QA generation