NLP on Korean news articles. Automatic topic extraction through dynamic clustering.

inhorha5/Korean-NLP-Project

Korean-NLP-Project

Last Edit: Sept 12, 2017

Overview

A data science project that applies NLP techniques to Korean news articles. It aims for quick, unsupervised, automatic, and dynamic topic clustering of the news articles retrieved for a keyword. The articles come from a local Elasticsearch database that is indexed with Korean news articles and can be updated in real time.

Key technologies include: Elasticsearch, KoNLPy, Word2Vec, and HDBSCAN.

Package requirements

Basic process:

  1. Make sure Elasticsearch is running and the database is up to date.
  2. Input a keyword, which retrieves up to 1,000 relevant articles (the search accounts for how recent each article is).
  3. A pre-trained Doc2Vec model is loaded and used to infer a vector for each retrieved article.
  4. The vectors are assigned cluster labels using HDBSCAN (density-based clustering).
  5. Optional visualization of the clusters
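
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the field names `text` and `date`, and the Gaussian date decay are all assumptions (the README only says that the search accounts for time relevancy). The KoNLPy tagger and Doc2Vec model are passed in as objects, and HDBSCAN is imported only when clustering is actually requested.

```python
from collections import defaultdict


def build_query(keyword, size=1000, text_field="text", date_field="date", scale="30d"):
    """Build an Elasticsearch request body: keyword match, down-weighted by age.

    Field names and the decay function are assumptions made for illustration.
    """
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"match": {text_field: keyword}},
                "functions": [
                    {"gauss": {date_field: {"origin": "now", "scale": scale}}}
                ],
                "boost_mode": "multiply",
            }
        },
    }


def tokenize(tagger, text):
    """Split Korean text into morphemes with a KoNLPy tagger (e.g. konlpy.tag.Okt())."""
    return tagger.morphs(text)


def infer_vectors(model, token_lists):
    """Infer one vector per tokenized article with a trained gensim Doc2Vec model."""
    return [model.infer_vector(tokens) for tokens in token_lists]


def cluster_vectors(vectors, min_cluster_size=5):
    """Label vectors with HDBSCAN; the label -1 marks noise (no cluster)."""
    import numpy as np
    import hdbscan

    data = np.asarray(vectors, dtype=np.float64)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(data).tolist()


def group_by_label(labels):
    """Map each cluster label to the indices of the articles assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return dict(groups)


# Hypothetical wiring (index name, field name, and model path are placeholders;
# the exact search call depends on the Elasticsearch client version):
#   es = Elasticsearch(["localhost:9200"])
#   hits = es.search(index="news", body=build_query("keyword"))["hits"]["hits"]
#   tokens = [tokenize(Okt(), hit["_source"]["text"]) for hit in hits]
#   vectors = infer_vectors(Doc2Vec.load("doc2vec.model"), tokens)
#   clusters = group_by_label(cluster_vectors(vectors))
```

Because HDBSCAN does not need the number of clusters in advance, the same code can handle keywords that return one dominant topic as well as keywords that return many.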

Model Flowchart

Model diagram

A more in-depth explanation of the model can be found here.

Data source

  • The news articles are from Naver's news hub: http://news.naver.com/main/officeList.nhn
  • The selected news outlets for this project are as follows:
    • Outlet Name: Source ID
    • 국민일보: 005
    • 동아일보: 020
    • 문화일보: 021
    • 세계일보: 022
    • 조선일보: 023
    • 중앙일보: 025
    • 한겨레: 028
    • 경향신문: 032
    • 서울신문: 081
    • 한국일보: 469
  • To bootstrap the database and train the model, I scraped about a year's worth of articles from these sources (Aug 2016 to Aug 2017), except for 조선일보 (023), for which I scraped only six months' worth of data.
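
As an illustration of how the outlet list and scraping window could be represented, here is a small sketch. The names `SOURCE_IDS`, `scrape_days`, and `scrape_jobs` are hypothetical, and the exact start and end days are assumptions, since the README gives only months. The shorter 조선일보 window is not encoded because the README does not say which six months were covered.

```python
from datetime import date, timedelta

# Naver source IDs of the ten selected outlets, copied from the list above.
SOURCE_IDS = ("005", "020", "021", "022", "023", "025", "028", "032", "081", "469")


def scrape_days(start, end):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def scrape_jobs(start, end, source_ids=SOURCE_IDS):
    """Pair every outlet with every day in the window: one (source_id, day) job each."""
    return [(sid, day) for sid in source_ids for day in scrape_days(start, end)]


# Assumed window: the first day of Aug 2016 through the last day of Aug 2017.
jobs = scrape_jobs(date(2016, 8, 1), date(2017, 8, 31))
```

Enumerating jobs up front makes it easy to record which (outlet, day) pairs have already been scraped and to resume after an interruption.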

Future Work

  • Allow the Elasticsearch address to be passed in as a parameter (currently only localhost:9200 is supported)
  • Add internet disconnect detection for scraper
  • Improve search algorithm
  • Make use of the untapped sentiment data for analysis
  • Add automatic and continuous Doc2Vec training
  • Implement the Phraser model (model exists but isn't implemented due to uncertainty of its performance)
  • Improve text processing
    • Improve author name extraction algorithm
    • Improve noise and template filtering algorithm (e.g., delete all '동아일보' and '포토' articles)
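
Two of the items above, the configurable Elasticsearch address and scraper resilience to a dropped connection, could look something like the sketch below. None of this exists in the project yet. The flag names `--es-host` and `--es-port` and the helpers `parse_args`, `es_address`, and `with_retries` are hypothetical, and retrying with exponential backoff is only one possible way to handle a disconnect.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the keyword and an optional Elasticsearch address from the command line."""
    parser = argparse.ArgumentParser(description="Cluster Korean news articles by keyword.")
    parser.add_argument("keyword")
    parser.add_argument("--es-host", default="localhost")
    parser.add_argument("--es-port", type=int, default=9200)
    return parser.parse_args(argv)


def es_address(args):
    """Format the parsed host and port as 'host:port'."""
    return f"{args.es_host}:{args.es_port}"


def with_retries(fetch, attempts=5, base_delay=1.0):
    """Call fetch(), retrying on network errors with exponential backoff.

    OSError covers the connection errors raised by urllib and requests.
    The last failure is re-raised once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

With the defaults left in place, the behavior matches the current setup, which only supports localhost:9200.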

Contact

If you have any questions regarding this project, feel free to email me.

Email: inhorha5+github@gmail.com
