A data science project that applies NLP techniques to Korean news articles. It attempts quick, unsupervised, automatic, and dynamic topic clustering of news articles retrieved by keyword. The articles come from a local Elasticsearch database that is indexed with Korean news articles and can be updated in real time.
Key technologies include: Elasticsearch, KoNLPy, Doc2Vec (Gensim), and HDBSCAN.
- Python 3
- Doc2Vec models: Download the models and place the files inside `/models`
- Basic: numpy, sklearn, pandas, beautifulsoup4, matplotlib
- Elasticsearch: `pip install elasticsearch`
- Gensim: `conda install gensim`
- HDBSCAN: `pip install hdbscan` or `conda install -c conda-forge hdbscan`
- KoNLPy: Instructions here
- with Mecab-ko: Mecab for Windows (requires some environment-variable tweaking; a sanity-check snippet follows this list)
- Networkx: `pip install networkx`
- PyTagCloud: Instructions
- Add a Korean font to python3/site-packages/pytagcloud and edit the .json font registry (a sketch of this edit follows this list)
- Googletrans: Instructions (optional)
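A minimal sanity check that KoNLPy can see the Mecab-ko installation. The `dicpath` below is an assumed Windows location; adjust it to wherever mecab-ko-dic actually lives on your machine:

```python
from konlpy.tag import Mecab

# On Windows, pass the dictionary path explicitly if the environment
# variables are not picked up; "C:/mecab/mecab-ko-dic" is an assumed path.
mecab = Mecab(dicpath="C:/mecab/mecab-ko-dic")
print(mecab.morphs("한국어 형태소 분석이 잘 되는지 확인합니다."))
```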
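For the PyTagCloud font step, here is one hedged sketch of the edit. It assumes PyTagCloud keeps its font registry at `pytagcloud/fonts/fonts.json` as a list of `{"name", "ttf"}` entries, and uses NanumGothic as an example font; copy the `.ttf` into that directory before running:

```python
import json
import os

import pytagcloud

# Locate PyTagCloud's bundled font registry inside site-packages.
fonts_dir = os.path.join(os.path.dirname(pytagcloud.__file__), "fonts")
registry = os.path.join(fonts_dir, "fonts.json")

with open(registry, encoding="utf-8") as f:
    fonts = json.load(f)

# Register a Korean font (example values; the .ttf file must already
# have been copied into fonts_dir).
fonts.append({"name": "NanumGothic", "ttf": "NanumGothic.ttf"})

with open(registry, "w", encoding="utf-8") as f:
    json.dump(fonts, f, ensure_ascii=False, indent=2)
```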
- Make sure Elasticsearch is running and the database is up to date
- Input a keyword, which retrieves up to 1,000 relevant articles (the query accounts for time relevance)
- A pre-trained Doc2Vec model is loaded and used to infer vectors for the retrieved articles
- The vectors are grouped into clusters using HDBSCAN (density-based clustering)
- Optional visualization of the clusters (a sketch of the pipeline follows this list)
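A minimal sketch of that pipeline, assuming an index named `news` with `body` and `date` fields and a model file at `models/doc2vec.model`. All of these names are assumptions; likewise, whitespace tokenization stands in for the KoNLPy tokenization, which should match whatever the model was trained on:

```python
from elasticsearch import Elasticsearch
from gensim.models.doc2vec import Doc2Vec
import hdbscan

es = Elasticsearch("http://localhost:9200")
keyword = "올림픽"  # example keyword

# Retrieve up to 1,000 matching articles; a gauss decay on the publish
# date is one way to express the time-relevance weighting.
query = {
    "size": 1000,
    "query": {
        "function_score": {
            "query": {"match": {"body": keyword}},
            "functions": [
                {"gauss": {"date": {"origin": "now", "scale": "30d", "decay": 0.5}}}
            ],
        }
    },
}
hits = es.search(index="news", body=query)["hits"]["hits"]
texts = [hit["_source"]["body"] for hit in hits]

# Infer a vector per article with the pre-trained Doc2Vec model.
model = Doc2Vec.load("models/doc2vec.model")
vectors = [model.infer_vector(text.split()) for text in texts]

# Density-based clustering; label -1 marks articles treated as noise.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(vectors)
print(f"{labels.max() + 1} clusters, {(labels == -1).sum()} noise articles")
```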
A more in-depth explanation of the model can be found here.
- The news articles are from Naver's news hub: http://news.naver.com/main/officeList.nhn
- The selected news outlets for this project are as follows:

| Outlet Name | Source ID |
| --- | --- |
| 국민일보 (Kookmin Ilbo) | 005 |
| 동아일보 (Dong-A Ilbo) | 020 |
| 문화일보 (Munhwa Ilbo) | 021 |
| 세계일보 (Segye Ilbo) | 022 |
| 조선일보 (Chosun Ilbo) | 023 |
| 중앙일보 (JoongAng Ilbo) | 025 |
| 한겨레 (Hankyoreh) | 028 |
| 경향신문 (Kyunghyang Shinmun) | 032 |
| 서울신문 (Seoul Shinmun) | 081 |
| 한국일보 (Hankook Ilbo) | 469 |
- For the initial bootstrapping of my database and for model training, I scraped about a year's worth of articles from these sources (Aug 2016 to Aug 2017), except for 조선일보 (023), where I only scraped six months' worth of data.
- Allow the Elasticsearch address to be supplied as a parameter (currently hardcoded to localhost:9200)
- Add internet-disconnect detection to the scraper
- Improve the search algorithm
- Use the currently untapped sentiment data for analysis
- Add automatic, continuous Doc2Vec training
- Implement the Phraser model (the model exists but isn't integrated due to uncertainty about its performance)
- Improve text processing
- Improve the author-name extraction algorithm
- Improve the noise and template filtering algorithm (e.g., delete all '동아일보' and '포토' template articles)
If you have any questions regarding this project, feel free to email me.
Email: inhorha5+github@gmail.com