Data Centric NLP project

This project classifies Korean news headlines into seven topics.

Project Overview

Overview

The goal is to improve performance by improving the data alone, without modifying the model.

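Dataset

Data type	Count	Description
Label-error data	1,000	Data with incorrectly assigned labels
Noisy data	1,600	Data with added noise
Clean data	200	Data with neither label errors nor noise

Label-error data contains no noise, and noisy data has correct labels.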

Main Approaches

Noise detection and restoration
Label-error detection and correction
Data augmentation

Getting Started

Requirement: Python 3.10

1. Clone the repository

$ git clone git@github.com:boostcampaitech7/level2-nlp-datacentric-nlp-14.git
$ cd level2-nlp-datacentric-nlp-14

2. Create Virtual Environment with Pipenv

Install the required packages with the pipenv install command.

$ pip install pipenv
$ pipenv install

Enter the pipenv virtual environment.

$ pipenv shell
(level2-nlp-datacentric-nlp-14)$

3. Set Up Data

Place the data files in the data/ folder.

sample_submission.csv
test.csv
train.csv

Predictions for test.csv are saved as output.csv inside data/.
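
Before running the project, it can help to confirm that the three files listed above are in place. The following is a minimal sketch, not part of the repository: the helper name missing_data_files and the default "data" path are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("sample_submission.csv", "test.csv", "train.csv")


def missing_data_files(data_dir: str = "data") -> list[str]:
    """Return the required file names that are not present in data_dir.

    Hypothetical helper; the default path assumes the data/ folder sits
    next to main.py.
    """
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files in data/:", ", ".join(missing))
    else:
        print("All required data files are present.")
```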

4. Run the Project

Run the project with the following command.

$ python main.py

Project Structure

level2-nlp-datacentric-nlp-14
├── augment
│   └── back_translate.py         # Data augmentation via back-translation
├── configs
│   └── config.py                 # Project configuration file
├── denoise
│   ├── noise_data_filter.py      # Noise filter based on a morphological analyzer
│   └── restore_noise_data.py     # Restoration of noisy sentences using an LLM
├── relabel
│   ├── relabel_with_embedding.py       # Re-labeling using embeddings
│   ├── relabel_with_llm.py             # Re-labeling using an LLM
│   └── train_contrastive_embedding.py  # Embedding training via contrastive learning
├── utils
│   └── util.py
└── main.py                       # Main entry point
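
The tree lists relabel_with_embedding.py as re-labeling using embeddings. How that script works is not described here, so the following is only a generic sketch of one common idea, nearest-centroid re-labeling: average the embeddings of each class, then assign every sample the label of the most similar centroid. The function names and the plain-list vector representation are assumptions for illustration, not the repository's implementation.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relabel_by_centroid(embeddings, labels):
    """Assign each vector the label of its most similar class centroid.

    Hypothetical helper: centroids are computed from the (possibly wrong)
    current labels, so this only works when most labels are correct.
    """
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    # Per-class mean vector, computed column by column.
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }
    return [
        max(centroids, key=lambda label: cosine(vec, centroids[label]))
        for vec in embeddings
    ]
```

Because the centroids are built from the existing labels, a sketch like this is only meaningful when label errors are a minority within each class.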

Workflow

How main.py runs

1. Load Data

When main.py runs, it first loads train.csv.

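import os

import pandas as pd

# Load Data
# DATA_DIR is defined elsewhere in the project (not shown in this excerpt).
print("Loading data...")
data = pd.read_csv(os.path.join(DATA_DIR, "train.csv"))
print(f"Data loaded. Shape: {data.shape}\n")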

2. Clean Data

The loaded train.csv then goes through noise labeling, noise restoration, and re-labeling.

# Clean Data
print("Labeling noise in data...")
noise_labeled_data = noise_labeling(data)
print("Restoring noise in data...")
restored_data = restore_noise(noise_labeled_data)
print("Relabeling data...")
relabeled_data = relabel_data(restored_data)
cleaned_data = pd.DataFrame(
    {
        "ID": relabeled_data["ID"],
        "text": relabeled_data["restored"],
        "target": relabeled_data["new_target"],
    }
)
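
The project tree describes the noise filter as based on a morphological analyzer; its details are not shown here. To illustrate what a noise-labeling step does in general, the sketch below swaps in a much simpler heuristic: the share of characters that are neither Hangul syllables nor digits. The function names and the 0.3 threshold are assumptions, and legitimate headlines containing English or Hanja would also raise the ratio.

```python
def noise_ratio(text: str) -> float:
    """Share of non-space characters that are neither Hangul syllables nor digits."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # "가".."힣" covers the precomposed Hangul syllable block.
    unusual = sum(1 for ch in chars if not ("가" <= ch <= "힣" or ch.isdigit()))
    return unusual / len(chars)


def is_noisy(text: str, threshold: float = 0.3) -> bool:
    """Flag a headline as noisy when its unusual-character share exceeds threshold.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return noise_ratio(text) > threshold
```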

Sentences containing a line break (\n) are removed.

filtered_data = cleaned_data[~cleaned_data["text"].str.contains("\n")]
print(f"Data cleaned. Shape: {filtered_data.shape}\n")

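The noise restoration step sometimes inserts line breaks, so the affected rows are dropped wholesale to protect data quality. This is a small adjustment, but it can have a large effect on performance.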
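
As a small self-contained illustration of this filter, the sketch below applies the same idea to a toy DataFrame. The helper name drop_multiline_rows and the sample rows are made up for the example. Two optional arguments of pandas' str.contains are used as a hedge: regex=False matches the newline literally, and na=False keeps the mask boolean if a restored text happens to be missing.

```python
import pandas as pd


def drop_multiline_rows(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    """Return only the rows whose `column` value contains no line break."""
    # regex=False matches the literal newline character;
    # na=False treats missing values as "no newline", so those rows are kept.
    has_newline = df[column].str.contains("\n", regex=False, na=False)
    return df[~has_newline].reset_index(drop=True)


# Toy data, invented for illustration only.
sample = pd.DataFrame(
    {
        "ID": ["a", "b", "c"],
        "text": ["clean headline", "broken\nheadline", "another headline"],
        "target": [0, 1, 2],
    }
)
print(drop_multiline_rows(sample)["ID"].tolist())
```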

3. Augment Data

Data augmented through back-translation is added.

# Augment Data
print("Back translating data for augmentation...")
back_translated_data = back_translate(filtered_data)
# Concatenate with filtered_data so rows containing line breaks are not reintroduced
augmented_data = pd.concat([filtered_data, back_translated_data], ignore_index=True)
print(f"Data augmented. Shape: {augmented_data.shape}\n")
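
Back-translation itself can be sketched independently of any particular translation model: translate each headline into a pivot language and back, then keep the result as a new sample with the original label. In the sketch below, the translate callable, the "en" pivot, and the function name back_translate_rows are assumptions; the repository's back_translate works on a DataFrame and may differ.

```python
from typing import Callable, Iterable

# (text, source_lang, target_lang) -> translated text; hypothetical interface.
Translator = Callable[[str, str, str], str]


def back_translate_rows(
    rows: Iterable[tuple[str, int]],
    translate: Translator,
    pivot: str = "en",
) -> list[tuple[str, int]]:
    """Return paraphrased (text, target) pairs produced by ko -> pivot -> ko."""
    augmented = []
    for text, target in rows:
        pivot_text = translate(text, "ko", pivot)
        restored = translate(pivot_text, pivot, "ko")
        # Keep only non-empty results that actually differ from the source,
        # since an identical round trip adds no new information.
        if restored and restored != text:
            augmented.append((restored, target))
    return augmented
```

Passing the translator in as a callable keeps the sketch testable with a stub and leaves the choice of translation model open.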

4. Train and Predict

The final dataset is passed to the main function for training and prediction.

print("Start training and predicting...")
main(augmented_data, do_predict=not args.train_only)
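
The call above reads args.train_only, but the argument parsing is not shown in this excerpt. A minimal sketch of how such a flag could be defined with argparse follows; the flag spelling --train-only and the function name parse_args are assumptions, and only the attribute name train_only comes from the code above.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse command-line options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Clean, augment, train, and predict.")
    # argparse maps "--train-only" to the attribute name "train_only".
    parser.add_argument(
        "--train-only",
        action="store_true",
        help="train the model but skip prediction on test.csv",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"do_predict = {not args.train_only}")
```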

The model used for training and prediction is fixed to klue/bert-base.

def main(data: pd.DataFrame, do_predict: bool = True):

    model_name = "klue/bert-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=7).to(DEVICE)

    train(data, model, tokenizer)
    if do_predict:
        predict(model, tokenizer)

Collaborators

NLP Team 14: Word Maestro(s)

김현서	단이열	안혜준	이재룡	장요한
