Wiki_Extractor

Fetches Chinese Wikipedia data, converts Simplified Chinese to Traditional Chinese, and extracts the text content into JSON files.

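File descriptions

The opencc folder contains the package used to convert Simplified Chinese to Traditional Chinese.

Wiki_Extractor.py extracts the article text from Wikipedia (using the code provided by https://github.com/attardi/wikiextractor).

Wiki_Cleaning.py converts the data to JSON format.

Wiki_Tokenize.py performs word segmentation on the article text.

Wiki_to_Word2vec_Data.py converts the data into the Word2vec training data format.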

Setup

git clone https://github.com/NCHU-NLU-Lab/Wiki_Extractor.git

Alternatively, download the repository from GitHub as an archive (after unpacking, the folder is named Wiki_Extractor-master).

Install the required packages

pip3 install -r requirements.txt

Download the Wikipedia data

Download location: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

On Linux, the file can be fetched directly with:

wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
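Before running the extraction, it can be useful to confirm that the download is intact. The following is a minimal sketch using only the Python standard library; `peek_dump` is a hypothetical helper and not part of this repository. It reads the first few lines of the compressed dump without unpacking it to disk.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a bz2-compressed text file as a list of strings."""
    # bz2.open in text mode decompresses on the fly, so the large dump
    # never has to be unpacked to disk just to inspect it.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]
```

Calling `peek_dump("zhwiki-latest-pages-articles.xml.bz2")` should return the opening XML lines of the dump; a decompression error at this point usually indicates a truncated download.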

Extract the Wikipedia content

python3 Wiki_Extractor.py -b 1024M -o extracted zhwiki-latest-pages-articles.xml.bz2

The extracted data is written to ./extracted/AA/

Convert the article content from Simplified to Traditional Chinese and organize it as JSON
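To inspect the extracted files, the sketch below assumes they follow the default wikiextractor layout, in which each article is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block. `parse_docs` is a hypothetical helper for illustration, and the layout should be checked against the actual files in ./extracted/AA/.

```python
import re

# Assumed layout of one article in the extracted files (wikiextractor's default):
#   <doc id="13" url="..." title="...">
#   ...plain text...
#   </doc>
DOC_PATTERN = re.compile(
    r'<doc id="(?P<id>\d+)"[^>]*?title="(?P<title>[^"]*)"[^>]*>\n(?P<body>.*?)</doc>',
    re.DOTALL,
)


def parse_docs(text):
    """Yield (id, title, body) tuples for every <doc> block found in text."""
    for match in DOC_PATTERN.finditer(text):
        yield (
            int(match.group("id")),
            match.group("title"),
            match.group("body").strip(),
        )
```

Reading one extracted file with `open(path, encoding="utf-8").read()` and passing the result to `parse_docs` gives a quick view of the article IDs and titles it contains.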

python3 Wiki_Cleaning.py --file_path ./extracted/AA/

Output data format

[
  {
    "id" : (int) article ID,
    "title" : (str) article title,
    "articles" : (str) article content
  },
...
]
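As an illustration of how records in this layout could be produced and saved, here is a minimal sketch. `build_records`, `dump_records`, and the `convert` hook are hypothetical names, not the code in Wiki_Cleaning.py; in practice `convert` would be a Simplified-to-Traditional converter such as one backed by OpenCC, and the default here leaves the text unchanged.

```python
import json


def build_records(docs, convert=lambda text: text):
    """Turn (id, title, body) tuples into id/title/articles records.

    convert is a hypothetical hook for Simplified-to-Traditional conversion;
    the default leaves the text unchanged.
    """
    return [
        {"id": doc_id, "title": convert(title), "articles": convert(body)}
        for doc_id, title, body in docs
    ]


def dump_records(records, path):
    """Write the records to path as a UTF-8 JSON array."""
    # ensure_ascii=False keeps Chinese characters readable in the file
    # instead of writing them as \uXXXX escapes.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Writing with `ensure_ascii=False` is a design choice worth keeping for Chinese text: the file stays human-readable and is noticeably smaller than the escaped form.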

Segment each sentence of the articles into words

python3 Wiki_Tokenize.py --file_path wiki.json

Output data format

[
  {
    "id" : (int) article ID,
    "title" : (str) article title,
    "tokens" : (list) segmented words of each sentence
  },
...
]

Convert the Wikipedia content into the Word2vec training data format

python3 Wiki_to_Word2vec_Data.py --file_path wiki_tokenize.json
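A minimal sketch of what such a conversion might look like is given below. It assumes the target is the plain-text layout commonly used for Word2vec training (one sentence per line, tokens separated by single spaces, as read for example by gensim's `LineSentence`) and that each record's "tokens" field is a list of per-sentence token lists. The function names are hypothetical, and the actual output of Wiki_to_Word2vec_Data.py may differ.

```python
def to_word2vec_lines(records):
    """Yield one line per non-empty sentence, tokens joined by single spaces."""
    for record in records:
        for sentence_tokens in record["tokens"]:
            if sentence_tokens:  # skip empty sentences
                yield " ".join(sentence_tokens)


def write_corpus(records, path):
    """Write the corpus to path, one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_word2vec_lines(records):
            handle.write(line + "\n")
```

Because `to_word2vec_lines` is a generator, the corpus is written line by line and the full set of sentences never needs to be held as strings in memory at once.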

Converted data:

Download the data

The link below contains the wiki data we have already prepared.

https://drive.google.com/drive/folders/1BvVVbRLD-W_954UchTi2KJTYPjqD-LJX?usp=sharing
