ParaKQC

Parallel dataset of Korean Questions and Commands

Description

Dataset generation

paraKQC_v1 in the data folder contains 10,000 utterances, organized as 1,000 sets of 10 similar sentences. The corpus generation process is shown in mkdata.py, which builds the whole corpus of 545,000 utterances from the following subcorpora:

Sentence similarity corpus (494,500)
Paraphrase corpus (45,000)
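
As a rough illustration of how a set-structured corpus can be expanded, the sketch below enumerates every unordered pair of sentences within each set. The figure of 45,000 equals the number of such in-set pairs for 1,000 sets of 10 sentences (1,000 x 45), but whether mkdata.py builds the paraphrase corpus this way is an assumption. The function name in_set_pairs and the placeholder sentence IDs are hypothetical and do not come from the repository.

```python
from itertools import combinations


def in_set_pairs(sets):
    """Yield every unordered pair of sentences drawn from the same set."""
    for group in sets:
        for pair in combinations(group, 2):
            yield pair


# Placeholder data: 1,000 sets of 10 dummy sentence IDs (not the real corpus).
toy_sets = [["s{}_{}".format(i, j) for j in range(10)] for i in range(1000)]

# Count the pairs without materializing them all in memory.
n_pairs = sum(1 for _ in in_set_pairs(toy_sets))
print(n_pairs)  # 45000
```

Pairing only within a set keeps every generated pair a candidate paraphrase, since the sentences in a set are similar by construction.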

Training process

The training process is also described in mkdata.py. It uses four models: BiLSTM, self-attentive BiLSTM (BiLSTM-SA), parallel BiLSTM, and BiLSTM with cross-attention. The demo file adopts BiLSTM-SA.

Demo

Requirements

The model was trained with Python 3.5, and the demo requires the same version.
Run git clone https://github.com/warnikchow/paraKQC, then pip install -r Requirements.txt from inside the cloned folder.

Usage

In a Python console:

# The function that finds the most similar sentence among the candidates
>>> from pred import fast_docu

# Document of candidates
>>> t1 = '너 몇 살이냐'
>>> t2 = '거기 가는데 얼마나 걸려'
>>> t3 = '내일 다섯 시까지 옥상으로 와'
>>> t4 = '굳이 그렇게까지 해야돼'
>>> t5 = '동작 그만 밑장빼기냐'
>>> cand = [t1,t2,t3,t4,t5]

# Not the best, but relatively accurate answer
>>> fast_docu('하던 거 멈춰',cand)
'굳이 그렇게까지 해야돼'
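
The selection step in the demo, picking the candidate that is most similar to a query, can be pictured as an argmax over a pairwise score. The sketch below is not the implementation of fast_docu in pred.py: it swaps the trained BiLSTM-SA model for a simple character-bigram Jaccard overlap so that it runs without any model files. The names char_bigrams, toy_score, and most_similar, as well as the query sentence, are hypothetical.

```python
def char_bigrams(text):
    """Return the set of character bigrams of a sentence, ignoring spaces."""
    s = text.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def toy_score(a, b):
    """Jaccard overlap of character bigrams; a stand-in for a trained model's score."""
    x, y = char_bigrams(a), char_bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def most_similar(query, candidates, score=toy_score):
    """Return the candidate with the highest score against the query."""
    return max(candidates, key=lambda c: score(query, c))


cand = ['너 몇 살이냐', '거기 가는데 얼마나 걸려', '내일 다섯 시까지 옥상으로 와',
        '굳이 그렇게까지 해야돼', '동작 그만 밑장빼기냐']

# Hypothetical query that shares most of its characters with the first candidate.
print(most_similar('너 몇 살이야', cand))  # 너 몇 살이냐
```

A surface-overlap score like this only rewards shared characters, so it cannot match sentences that mean the same thing in different words. That gap is what a trained similarity model is meant to close.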

Reference

If you find this dataset useful, please cite the following:

@inproceedings{cho2020discourse,
  title={Discourse Component to Sentence (DC2S): An Efficient Human-Aided Construction of Paraphrase and Sentence Similarity Dataset},
  author={Cho, Won Ik and Kim, Jong In and Moon, Young Ki and Kim, Nam Soo},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={6819--6826},
  year={2020}
}
