The model aims to automatically classify news articles by genre or topic. The data was collected from VnExpress with BeautifulSoup, and Airflow schedules the daily collection.
- In the Text Classification task, I first apply natural language processing techniques to normalize the text (a minimal normalization sketch follows this list), then design a neural network that uses LSTM layers to learn the features of each paragraph.
- I choose the Softmax activation function for classification in the final layer; the number of units in this layer equals the number of classes to predict.
- In the Text Clustering task, I implement Word2Vec (skip-gram) to discover the relationships among words (a training sketch also follows this list).
- Then I embed the words of every paragraph and calculate the mean vector that represents each embedded paragraph.
- Finally, I calculate the Cosine similarity to estimate how similar paragraphs are, so that similar paragraphs can be clustered into groups.
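As a rough illustration of the normalization step, here is a minimal sketch. The helper name `normalize` and the regex-based cleaning are my assumptions; the actual pipeline may instead use a dedicated Vietnamese word segmenter such as pyvi or underthesea.

```python
import re

def normalize(text: str) -> str:
    # Hypothetical helper: lowercase, strip punctuation, collapse whitespace.
    # A Vietnamese word segmenter (pyvi/underthesea) would be a more
    # faithful normalizer for this corpus.
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)      # drop punctuation, keep diacritics
    return re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
```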
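For the clustering side, this is a minimal gensim sketch of skip-gram training. The `tokenized_paragraphs` variable and the window/min_count values are assumptions; `vector_size=128` matches the 128-dimensional embeddings used elsewhere in this project.

```python
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=tokenized_paragraphs,  # assumed variable: one token list per paragraph
    vector_size=128,                 # matches the 128-dim embeddings in this project
    sg=1,                            # sg=1 selects the skip-gram architecture
    window=5,                        # assumed hyperparameters
    min_count=2,
    workers=4,
)
vector = w2v_model.wv['bóng_đá']     # look up a word vector (token is illustrative)
```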
The goal of this data pipeline is to automate the ETL process: extract, transform, and store data from the source into the target data warehouse. The flow runs at 17:55 every day, automatically executing the declared ETL steps. It has three main components:
- extract_data: extracts raw data from the sources above and saves it to the staging area (text files)
- transform_load: reads the text files saved by the previous task, transforms and cleans the data to fit the problem, then exports text files into one folder per topic and CSV files for the full dataset
- print_date: prints the completion time
DAG source code
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def print_date():
    print('Today is {}'.format(datetime.today().date()))

dag = DAG(
    'ETL-VNExpress',
    default_args={'start_date': days_ago(1)},
    schedule_interval='55 17 * * *',   # run daily at 17:55
    catchup=False,
)

extract_data_task = PythonOperator(
    task_id='extract_data',
    python_callable=scrape_news,       # defined elsewhere in the project
    dag=dag,
)

transform_load_task = PythonOperator(
    task_id='transform_load',
    python_callable=transform_load,    # defined elsewhere; the task variable is
    dag=dag,                           # renamed to avoid shadowing the callable
)

print_date_task = PythonOperator(
    task_id='print_date',
    python_callable=print_date,
    dag=dag,
)

# Set the dependencies between the tasks
extract_data_task >> transform_load_task >> print_date_task
```
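The `scrape_news` and `transform_load` callables live elsewhere in the project. Below is a hedged sketch of what the extraction step might look like with requests and BeautifulSoup; the URL, CSS selector, and output path are illustrative assumptions, not the repository's actual values.

```python
import requests
from bs4 import BeautifulSoup

def scrape_news():
    # Hypothetical sketch: fetch a VnExpress listing page and save raw
    # titles/links to the staging area. Selector and path are assumed.
    resp = requests.get('https://vnexpress.net/the-thao', timeout=30)
    soup = BeautifulSoup(resp.text, 'html.parser')
    with open('/tmp/staging/raw_news.txt', 'w', encoding='utf-8') as f:
        for a in soup.select('h3.title-news a'):   # assumed selector
            f.write('{}\t{}\n'.format(a.get_text(strip=True), a.get('href')))
```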
DAG Diagram
```
_________________________________________________________________
 Layer (type)                               Output Shape         Param #
=================================================================
 embedding (Embedding)                      (None, None, 128)    4699392
 batch_normalization (BatchNormalization)   (None, None, 128)    512
 lstm (LSTM)                                (None, None, 64)     49408
 lstm_1 (LSTM)                              (None, 64)           33024
 dropout (Dropout)                          (None, 64)           0
 dense (Dense)                              (None, 9)            585
 dense_1 (Dense)                            (None, 9)            90
 dense_2 (Dense)                            (None, 9)            90
 dense_3 (Dense)                            (None, 9)            90
 dense_4 (Dense)                            (None, 7)            70
=================================================================
Total params: 4783261 (18.25 MB)
Trainable params: 4783005 (18.25 MB)
Non-trainable params: 256 (1.00 KB)
_________________________________________________________________
```
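The summary above is the LSTM classifier, and it is consistent with a stack like the following sketch. The hidden-layer activations and the dropout rate are assumptions (the summary does not record them), and the vocabulary size 36714 is inferred from the embedding parameter count (4699392 / 128):

```python
from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Embedding(36714, 128),            # 36714 * 128 = 4,699,392 params
    layers.BatchNormalization(),
    layers.LSTM(64, return_sequences=True),  # feeds sequences into the next LSTM
    layers.LSTM(64),
    layers.Dropout(0.5),                     # rate assumed; not visible in the summary
    layers.Dense(9, activation='relu'),      # hidden activations assumed
    layers.Dense(9, activation='relu'),
    layers.Dense(9, activation='relu'),
    layers.Dense(9, activation='relu'),
    layers.Dense(7, activation='softmax'),   # one unit per topic class
])
```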
```
_________________________________________________________________
 Layer (type)                                  Output Shape         Param #
=================================================================
 embedding_14 (Embedding)                      (None, None, 128)    4955904
 batch_normalization_26 (BatchNormalization)   (None, None, 128)    512
 conv1d_13 (Conv1D)                            (None, None, 128)    49280
 max_pooling1d_12 (MaxPooling1D)               (None, None, 128)    0
 conv1d_14 (Conv1D)                            (None, None, 128)    49280
 max_pooling1d_13 (MaxPooling1D)               (None, None, 128)    0
 batch_normalization_27 (BatchNormalization)   (None, None, 128)    512
 dropout_26 (Dropout)                          (None, None, 128)    0
 lstm_42 (LSTM)                                (None, None, 128)    131584
 lstm_43 (LSTM)                                (None, None, 128)    131584
 lstm_44 (LSTM)                                (None, 128)          131584
 dropout_27 (Dropout)                          (None, 128)          0
 dense_69 (Dense)                              (None, 128)          16512
 dense_70 (Dense)                              (None, 64)           8256
 dense_71 (Dense)                              (None, 32)           2080
 dense_72 (Dense)                              (None, 7)            231
=================================================================
Total params: 5477319 (20.89 MB)
Trainable params: 5476807 (20.89 MB)
Non-trainable params: 512 (2.00 KB)
_________________________________________________________________
```
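This second summary describes a Conv1D + LSTM variant. A sketch consistent with the parameter counts is below: kernel size 3 follows from the Conv1D counts (3 * 128 * 128 + 128 = 49280), while the activations, pooling size, and dropout rates are assumptions.

```python
from tensorflow.keras import Sequential, layers

cnn_lstm = Sequential([
    layers.Embedding(38718, 128),              # 38718 * 128 = 4,955,904 params
    layers.BatchNormalization(),
    layers.Conv1D(128, 3, padding='same', activation='relu'),
    layers.MaxPooling1D(),                     # pool size assumed (default 2)
    layers.Conv1D(128, 3, padding='same', activation='relu'),
    layers.MaxPooling1D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                       # rates assumed
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128),
    layers.Dropout(0.5),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(7, activation='softmax'),
])
```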
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Calculate sentence/post embeddings by averaging their word embeddings
mean_sentence_embedding = np.mean(question_embeddings, axis=0)
mean_post_embeddings = np.array([np.mean(p, axis=0) for p in post_embeddings])
# Calculate similarity (cosine similarity) between the query and every post
similarity_score = cosine_similarity([mean_sentence_embedding], mean_post_embeddings)
```
- Example with a short query sentence (Vietnamese; in English, roughly: "At his Dutch club, the Brazilian winger posted an expected goals-and-assists rate of 0.58, which would rank only 14th in the Premier League. Moreover, Antony was also 'inflated' by playing for a club far superior in finances and squad depth to the rest of the Dutch league."):

```python
question = '''Với CLB Hà Lan, tiền đạo cánh người Brazil đạt tỷ lệ ghi bàn và kiến tạo kỳ vọng là 0,58, chỉ xếp thứ 14 nếu đặt ở Ngoại hạng Anh.
Ngoài ra, Antony cũng được "thổi phồng" nhờ chơi cho CLB vượt trội về tài chính và lực lượng so với phần còn lại của giải vô địch Hà Lan.'''
```
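To connect this example to the similarity code above, here is a sketch of how `question_embeddings` could be produced. It reuses `normalize` and `w2v_model` from the earlier sketches, and the whitespace tokenization is a simplification of whatever tokenizer the project actually uses.

```python
# Tokenize the question and look up vectors for in-vocabulary words
tokens = [w for w in normalize(question).split() if w in w2v_model.wv]
question_embeddings = np.array([w2v_model.wv[w] for w in tokens])
```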
- Visualize the relationships among words
- Visualize the relationships among paragraphs
I will use the classification model together with the cosine-similarity technique to build a simple web page for searching a sports magazine; a hedged sketch of that idea follows.
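As an illustration of that plan (not an existing part of the project), a minimal Flask endpoint could rank posts by cosine similarity to a query. Every name here is hypothetical; `normalize`, `w2v_model`, and `mean_post_embeddings` come from the sketches above, and `post_titles` is an assumed list aligned with the post vectors.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)

@app.route('/search')
def search():
    # Hypothetical endpoint: embed the query with the Word2Vec model,
    # rank the stored posts by cosine similarity, and return the top 5.
    query = request.args.get('q', '')
    tokens = [w for w in normalize(query).split() if w in w2v_model.wv]
    query_vec = np.mean([w2v_model.wv[w] for w in tokens], axis=0)
    scores = cosine_similarity([query_vec], mean_post_embeddings)[0]
    top = np.argsort(scores)[::-1][:5]
    return jsonify([{'post': post_titles[i], 'score': float(scores[i])} for i in top])
```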