
Chinese-Word-Segmentation

Implement a Chinese word segmentation method based on the paper A Realistic and Robust Model for Chinese Word Segmentation: extract features and labels from the training data, train a classification model with logistic regression, then predict on the testing set and measure the accuracy of the segmentation method.

In this project, the method reaches 89.542% segmentation accuracy on the test set, covering 99.79% of the testing data.

Introduction

The intuition behind this technique is to look at 4 consecutive (non-space) characters, or 4-grams, and use a learned model to guess whether or not there should be a word separation between the middle two characters.

Suppose we have a 4-gram of Chinese characters that we’ll represent by the letters ABCD. Using this sequence, we can define the feature vector x and label y as follows:

  • x = (AB, B, BC, C, CD); y = 1 if B and C can be separated, y = 0 if not.
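
As a minimal sketch (the helper name is mine, not the repository's), the feature tuple for one window can be written as:

def features(A, B, C, D):
    # Feature tuple (AB, B, BC, C, CD) for the window ABCD, as defined above.
    return (A + B, B, B + C, C, C + D)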

Each 1-gram and 2-gram is one dimension of the feature vector, so each feature vector's dimensionality equals the size of the corpus I build. I therefore represent the feature vectors with a sparse matrix, because only 5 dimensions are non-zero.

The way to generate a sparse matrix is shown below:

import numpy as np
from scipy import sparse

# One row per sample, one column per corpus entry; each (row, col)
# pair marks a non-zero feature.
row = np.array([0, 2, 3, 1, 0])
col = np.array([0, 10, 20, 30, 40])
data = np.array([1, 1, 1, 1, 1])
mtx = sparse.csr_matrix((data, (row, col)), shape=(4, 50))

In this way, the training feature matrix X_train is an (n_samples, corpus_size) sparse matrix, while the labels Y_train form an array of length n_samples.
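
As an illustration, here is a sketch of how such a matrix could be assembled; it assumes a hypothetical dict corpus that maps each 1-gram/2-gram to a column index (the repository's actual code may differ):

import numpy as np
from scipy import sparse

def build_X(samples, corpus):
    # samples: list of 5-tuples (AB, B, BC, C, CD); corpus: {ngram: column}
    rows, cols = [], []
    for i, feats in enumerate(samples):
        for f in feats:
            rows.append(i)          # row = sample index
            cols.append(corpus[f])  # col = corpus index of the n-gram
    data = np.ones(len(rows))       # csr_matrix sums duplicate (row, col) pairs
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(samples), len(corpus)))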

Here I use logistic regression (sklearn's LogisticRegression) to train the model.
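
A minimal training sketch, assuming X_train and Y_train are built as above:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()   # default settings; the repository may tune these
clf.fit(X_train, Y_train)    # fit accepts the sparse (n_samples, corpus_size) matrix directly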

Process

The process can be listed as follows:

  • Generate the corpus (about 10 seconds)
    • Read the training file, convert it to utf-8, and keep a list of every convertible line (ignoring any UnicodeDecodeError).
    • Use the training data to generate a corpus of all 1-gram and 2-gram words, with spaces removed (9.5 seconds).
    • To handle the first 2 and last 2 characters of each line, I use a trick: keep the '\n' at the end and add a '\t' at the beginning of each line.
  • Get the training features and labels (about 70 seconds)
    • Remove all the separator marks (spaces) and build a list of the indexes that should have a separator mark after them (see the sketch after this list).
    • Use the index list to set the labels.
    • Generate the rows (the index within n_samples) and columns (the index within the corpus). To speed up feature generation, I collect the rows and columns in Python lists.
    • Use the method introduced above to build a sparse matrix, giving the features X_train; also collect the labels Y_train.
    • Save the features and their corresponding labels to files, and save the corpus to a file as well.
  • Train a LogisticRegression model (more than 1000 seconds)
    • Load the features and labels from files and fit a LogisticRegression from sklearn.linear_model.
    • Save the model to a file.
  • Predict on the testing set (less than 2 seconds)
    • Load the model and the corpus from files.
    • Read the testing file, convert it to utf-8, and keep a list of every convertible line.
    • Use the same method as in the training stage to generate the testing features and the ground-truth labels.
      • A KeyError is raised when the test file contains 1-grams or 2-grams that are not in the corpus. Most of the errors come from 2-grams, and I handle them by using a 1-gram in place of the unseen 2-gram (see the back-off sketch after this list). However, I can't handle a 1-gram that is missing from the corpus, so the program simply skips those lines. (Fortunately, missing 1-grams are very rare and can be handled manually.)
    • Use the trained model to get the predicted labels, compare them with the ground truth, and compute the accuracy.
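
For the boundary-index step above, a minimal sketch (my own reconstruction, not the repository's exact code) of turning one segmented line into windows and labels:

def boundaries(segmented_line):
    # Return the unsegmented text and the set of indexes i such that a
    # word boundary falls right after the character at index i.
    chars, cuts = [], set()
    for ch in segmented_line:
        if ch == ' ':
            if chars:
                cuts.add(len(chars) - 1)
        else:
            chars.append(ch)
    return ''.join(chars), cuts

line = '\t' + '中文 分詞' + '\n'   # illustrative line, padded as described above
text, cuts = boundaries(line)
samples, labels = [], []
for i in range(len(text) - 3):
    A, B, C, D = text[i:i+4]
    samples.append((A + B, B, B + C, C, C + D))
    labels.append(1 if (i + 1) in cuts else 0)   # is there a boundary between B and C?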
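
For the 2-gram KeyError back-off, a hedged sketch; which character the unseen 2-gram falls back to is my assumption, and the repository may choose differently:

def feature_column(ngram, corpus):
    # Resolve an n-gram to its corpus column, backing an unseen 2-gram
    # off to one of its characters; an unseen 1-gram stays unhandled.
    if ngram in corpus:
        return corpus[ngram]
    if len(ngram) == 2 and ngram[0] in corpus:
        return corpus[ngram[0]]   # assumed back-off: the first character
    raise KeyError(ngram)         # unseen 1-gram: the caller skips this line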

All the times listed above were measured on my MacBook Pro (Retina, 13-inch, Late 2013) with a 2.4 GHz Intel Core i5 and 8 GB of 1600 MHz DDR3 memory.

Structure

In 'main/main.py', 3 functions carry out all the processes listed above:

  • get_feature: generate the corpus and the training features/labels, and save them to files.
  • get_model: train the LR model and save it.
  • get_prediction: compute the accuracy and print it to the console.

The training and testing data files are in the 'data/' directory. Both the training and test files are encoded in big5hkscs.
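
A minimal reading sketch matching the "convert to utf-8, ignore UnicodeDecodeError" step from the Process section (the path is illustrative):

def read_lines(path):
    # Decode a big5hkscs-encoded file line by line, skipping lines that
    # fail to decode (the UnicodeDecodeError cases mentioned above).
    lines = []
    with open(path, 'rb') as f:
        for raw in f:
            try:
                lines.append(raw.decode('big5hkscs'))
            except UnicodeDecodeError:
                continue
    return lines

train_lines = read_lines('data/training.txt')   # illustrative path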

The saved training features and labels, the corpus, and the model go in the 'output/' directory. (You should create this directory before running.)

Result

Of the 1398 lines in the testing file, there are 3 lines we can't handle, because they contain the characters '.' and '洴', which never appear in the training data set.

After ignoring the 1-gram errors, the segmentation accuracy is 89.542% (15446/17250).

If I ignore both the 1-gram and 2-gram errors, the accuracy rises to 91.266% (7795/8541). However, the coverage then drops below 50%, so ignoring the 2-gram errors is not a good idea.

The way to handle the 2-gram errors is shown in the Process section.

Environment

Python 3.6.2

Python packages:

  • numpy (1.13.3)
  • scipy (1.0.0)
  • scikit-learn (0.19.1)
