Vietnamese RAG Project

Introduction

This project aims to develop a simple Retrieval-Augmented Generation (RAG) model for generating Vietnamese text. By leveraging state-of-the-art machine learning techniques and comprehensive datasets, this project seeks to enhance the quality and relevance of generated text, assisting in various natural language processing tasks.

Tech Stack

Programming Language: Python
Libraries: PyTorch, Transformers, Scikit-learn, Pandas, NumPy, Matplotlib
Search Engine: Elastic Search
Embeddings: Vector embeddings
Containerization: Docker
Web Framework: Streamlit
Database: PostgreSQL
Monitoring: Grafana

Dataset

As a Vietnamese speaker, I thought of creating a simple Vietnamese RAG project for my final project. I found Vietnamese RAG datasets on Huggingface.

This is the link to my datasets: My datasets

This dataset has four subsets, but I only use three of them: General Text, Legal, and Expert.

The General subset contains data scraped from various internet sources, like newspapers, and includes text related to news.
The Legal subset is about Vietnamese law.
The Expert subset covers various fields, such as law, political science, biology, and physics.

Project Structure

.
├── README.md
├── UI_and_online_evaluation
├── data
│   ├── README.md
│   └── vietnamese_rag
│       ├── answer_vector_pickle
│       ├── context_vector_pickle
│       ├── documents-with-ids.json
│       ├── documents-with-ids1.json
│       ├── documents-with-ids2.json
│       ├── documents-with-ids3.json
│       ├── documents-with-ids4.json
│       ├── documents-with-ids5.json
│       ├── documents.json
│       ├── evaluations_aqa
│       ├── evaluations_qa
│       ├── ground_truth_data
│       ├── llm_answer
│       ├── llm_answer_cosine.csv
│       ├── question_context_answer_vector_pickle
│       └── question_vector_pickle
├── data_ingestion_and_preparation
│   ├── README.md
│   ├── create_vector_embeddings.ipynb
│   └── load_datasets_and_create_documents.ipynb
├── evaluation
│   ├── README.md
│   ├── llm_answer_evaluation.ipynb
│   ├── llm_as_a_judge_aqa_evaluation.ipynb
│   ├── llm_as_a_judge_qa_evaluation.ipynb
│   └── retrieval_evaluation.ipynb
├── grafana
├── images
├── n.txt
└── requirements.txt

Setup

I recommend you create a github codespace to run this repository

At the root directory, run this command to install dependencies:

pip install -r requirements.txt

Optional

Before creating this repository, I created another repository to clean the datasets I chose from Huggingface. The data was too large, and I faced rate limits when calling the API, so I had to split it into many chunks and run them separately. This took a lot of time. If you want to understand more about how I did this, you can check this repository

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vietnamese RAG Project

Introduction

Table of Contents

Tech Stack

Dataset

Project Structure

Setup

Optional

About

Releases

Packages

Contributors 5

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
UI_and_online_evaluation		UI_and_online_evaluation
data		data
data_ingestion_and_preparation		data_ingestion_and_preparation
evaluation		evaluation
grafana		grafana
images		images
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

vucongtuanduong/vietnamese-rag-project

Folders and files

Latest commit

History

Repository files navigation

Vietnamese RAG Project

Introduction

Table of Contents

Tech Stack

Dataset

Project Structure

Setup

Optional

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 5

Languages

Packages