Skip to content

This project build a classification model for topics of news. With the target is automatically recognize suitable topic (class) to a random article. There are two architectures implemented which are LSTM and Hybrid models

Notifications You must be signed in to change notification settings

Narius2030/Vietnamese-Text-Classification-and-Clustering

Repository files navigation

Vietnamese-Text-Classification-and-Clustering

The model aims to automatically classify news data based on genre or topic. Data was collected from VnExpress by BeatifulSoup combining Airflow to schedule the collection.

Introduction

Text classification

  • In Text Classification task, I apply natural language processing techniques first for normalizing text and then I design a neural network using LSTM for learning the features of each paragraph
  • I choose Softmax activation function for classification in the final layer, the number of the unit in this layer are equal to the number of class I want to predict

Text clustering

  • In the Text Clustering task, I implement Word2Vec (skip-gram) for finding the relationships among words.
  • Then, I embed word for the whole paragraphs and calculate the mean vectors which represent for embedded paragraphs
  • Finally, I calculate the Cosine similarity to estimate the similar among paragraphs so that I can cluster similar paragraphs into groups

Scrape data from website

The goal of this data pipeline will be to help automate the ETL process, extract - transform - store data from the source to the target data warehouse. The data flow in this project is timed at 17:55 every day, automatically executing the declared ETL process. In the data flow there will be 3 main components:

  • extract_data: extract raw data from the above sources and save to the stage area (text files)
  • transform_load: retrieve data from text files saved from the previous task, transform and clean the data to suit the purpose of the problem, export data to text files according to each topic folder and csv files for all
  • print_date: output the completion time

DAG source code

def print_date():
    print('Today is {}'.format(datetime.today().date()))
    
dag = DAG(
    'ETL-VNExpress',
    default_args={'start_date': days_ago(1)},
    schedule_interval='55 17 * * *',
    catchup=False
)

extract_data = PythonOperator(
    task_id='extract_data',
    python_callable=scrape_news,
    dag=dag
)

transform_load = PythonOperator(
    task_id='transform_load',
    python_callable=transform_load,
    dag=dag
)

print_date_task = PythonOperator(
    task_id='print_date',
    python_callable=print_date,
    dag=dag
)


# Set the dependencies between the tasks
extract_data >> transform_load >> print_date_task
# transform_load >> print_date_task

DAG Diagram

image

Build Model

Text classification

1. LSTM

___________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 128)         4699392   
                                                                 
 batch_normalization (Batch  (None, None, 128)         512       
 Normalization)                                                  
                                                                 
 lstm (LSTM)                 (None, None, 64)          49408     
                                                                 
 lstm_1 (LSTM)               (None, 64)                33024     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 9)                 585       
                                                                 
 dense_1 (Dense)             (None, 9)                 90        
                                                                 
 dense_2 (Dense)             (None, 9)                 90        
                                                                 
 dense_3 (Dense)             (None, 9)                 90        
                                                                 
 dense_4 (Dense)             (None, 7)                 70        
                                                                 
=================================================================
Total params: 4783261 (18.25 MB)
Trainable params: 4783005 (18.25 MB)
Non-trainable params: 256 (1.00 KB)
_________________________________________________________________

2. Hybrid (CNN and LSTM)

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_14 (Embedding)    (None, None, 128)         4955904   
                                                                 
 batch_normalization_26 (Ba  (None, None, 128)         512       
 tchNormalization)                                               
                                                                 
 conv1d_13 (Conv1D)          (None, None, 128)         49280     
                                                                 
 max_pooling1d_12 (MaxPooli  (None, None, 128)         0         
 ng1D)                                                           
                                                                 
 conv1d_14 (Conv1D)          (None, None, 128)         49280     
                                                                 
 max_pooling1d_13 (MaxPooli  (None, None, 128)         0         
 ng1D)                                                           
                                                                 
 batch_normalization_27 (Ba  (None, None, 128)         512       
 tchNormalization)                                               
                                                                 
 dropout_26 (Dropout)        (None, None, 128)         0         
                                                                 
 lstm_42 (LSTM)              (None, None, 128)         131584    
                                                                 
 lstm_43 (LSTM)              (None, None, 128)         131584    
                                                                 
 lstm_44 (LSTM)              (None, 128)               131584    
                                                                 
 dropout_27 (Dropout)        (None, 128)               0         
                                                                 
 dense_69 (Dense)            (None, 128)               16512     
                                                                 
 dense_70 (Dense)            (None, 64)                8256      
                                                                 
 dense_71 (Dense)            (None, 32)                2080      
                                                                 
 dense_72 (Dense)            (None, 7)                 231       
                                                                 
=================================================================
Total params: 5477319 (20.89 MB)
Trainable params: 5476807 (20.89 MB)
Non-trainable params: 512 (2.00 KB)
_________________________________________________________________

Text clustering

# Calculate sentence embeddings by averaging word embeddings
mean_sentence_embedding = mean_vector_embedding(question_embeddings)
mean_post_embedding = mean_embedded_posts(post_embeddings)
# Calculate similarity (cosine similarity)
similarity_score = cosine_similarity([mean_sentence_embedding], mean_post_embedding)
  • Example with a short sentence
question = '''Với CLB Hà Lan, tiền đạo cánh người Brazil đạt tỷ lệ ghi bàn và kiến tạo kỳ vọng là 0,58, chỉ xếp thứ 14 nếu đặt ở Ngoại hạng Anh. 
Ngoài ra, Antony cũng được "thổi phồng" nhờ chơi cho CLB vượt trội về tài chính và lực lượng so với phần còn lại của giải vô địch Hà Lan.'''

image

  • Visualize the relationship among words

image

  • Visualize the relationship among paragraphs

image

Visualize on webpage

I will use classification model and Cosine similarity technique to erect a simple webpage for sporting magazine searching

image

About

This project build a classification model for topics of news. With the target is automatically recognize suitable topic (class) to a random article. There are two architectures implemented which are LSTM and Hybrid models

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published