---
title: "China electricity industry reform: Insights with text analysis"
author: "Xiyu Zhang & Jinli Wu"
date: "5/5/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

# Part 1

## Preface

China's electricity system reform is an important part of the country's broader economic reform. In 2015, a new round of power system reform established an institutional framework known as "controlling the middle and liberalizing the two ends". This means limiting the monopoly power of the grid companies that occupy the "middle", and liberalizing the user side and the generation side, the "two ends", which were previously regulated through generation and consumption plans made by the government. In the power industry, China is gradually transitioning from a planned economy to a market economy.

This process has been driven by a series of detailed policies. In recent years, China's electric power reform has made some achievements, but many policies have fallen far short of success. For example, the pilot reform of incremental power distribution was the theme of one of the six supporting documents of "the No.9 Document", yet by the end of 2020 only one fifth of the pilots had survived to obtain operational qualification. In this context, our core concern is: to what extent do the policy concepts introduced in 2015 still seem feasible, or even successful? And to what extent, on the contrary, have these concepts been abandoned and replaced by new ones?

This part of the study inspects a series of national-level policy documents issued by the government from 2015 to 2020, using a set of text mining techniques.

### 1. Set Up Packages

```{r setup packages, warning=FALSE, message=FALSE}
library(dplyr)
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(jiebaR) # the Chinese tokenizer package
library(purrr)
library(ggplot2)
library(wordcloud2)
library(RColorBrewer)
library(topicmodels)
library(tidytext)
library(quanteda)
library(SnowballC)
library(sentimentr)
library(reshape2)
library(wordcloud)
library(textdata)
```

### 2. Build Up Corpus

The document issued by the Central Committee of the Communist Party of China and the State Council in March 2015, "Several Opinions on Further Deepening the Reform of the Electric Power System" (also called "the No.9 Document"), and the six supporting documents issued six months later marked the beginning of this round of electricity industry reform. The No.9 Document itself clarified the main direction and spirit of the reform, whilst the six supporting documents laid out more detailed implementation plans. Together, these documents are considered to have set the basic direction and framework of the reform. That is why we start from the six supporting documents below.

In all, we inspected 79 national-level documents on electricity reform from 2015 to 2020. All the original materials come from the official website of the Chinese government, as well as the Electricity Reform Policy Document Catalog compiled by Polaris Power Grid.

We built the corresponding corpus manually. We annotated the year, title and body content of each document as features, removed all formatting, including spaces and line breaks, with an online tool (http://www.esjson.com/delSpace.html), and excluded irrelevant documents. Limited by our capacity, we did not include documents issued in 2021 and 2022.

We used the R package [jiebaR](https://github.com/qinwf/jiebaR) for Chinese text mining. Its tokenizer constructor, `worker()`, has the following defaults: the segmentation engine is "MixSegment" (`type = "mix"`), which combines "MPSegment" (a maximum probability segmentation procedure) and "HMMSegment" (a hidden Markov model segmentation procedure); the lexicon is the default Chinese dictionary (`dict = "inst/dict/jieba.dict.utf8"`), which can be supplemented with a user-defined lexicon through the `user =` argument; and the input is segmented as a single string, so to segment each row of a data frame column separately, `bylines = TRUE` must be set.

We customized a Chinese word segmentation lexicon and a Chinese stop word list. For the lexicon, we downloaded the electricity industry cell lexicons from the official website of Sogou, a Chinese input method (https://pinyin.sogou.com/dict/search/search_list/%B5%E7%C1%A6/normal/), converted them from .scel to .txt format with the open-source converter ShenLan (https://github.com/studyzy/imewlconverter), and manually added dozens of electricity industry reform terms. For the stop words, we downloaded a list from GitHub (https://github.com/YueYongDev/stopwords) and manually added about a hundred stop words based on the output we observed during the project.
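
As a rough sketch of how a downloaded stop word list and manual additions could be merged into a single one-word-per-line file, the snippet below uses only base R. The example words, the helper `merge_stop_words()` and the output location are hypothetical placeholders, not the project's actual lists or workflow.

```r
# Hypothetical inputs: a downloaded stop word list and manual additions.
downloaded <- c("的", "了", "和", "的")       # may contain duplicates
manual     <- c("以下简称", "有关", "和")     # project-specific additions

# Merge the lists, trim whitespace, and drop empty entries and duplicates.
merge_stop_words <- function(...) {
  words <- trimws(c(...))
  unique(words[nzchar(words)])
}

stop_words <- merge_stop_words(downloaded, manual)

# Write one word per line. useBytes = TRUE writes the strings' bytes unchanged,
# which assumes they are already UTF-8 encoded.
out_file <- file.path(tempdir(), "customed_chinese_stop_word.txt")
writeLines(stop_words, out_file, useBytes = TRUE)
```

The resulting path could then be passed to the `stop_word` argument of `worker()`.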

Lastly, we defined a helper function and applied it to every document in order to build a tidy-format text tibble, because the `segment()` function in jiebaR only operates on vectors.

```{r tokenize Chinese Policy Documents}
policy_doc <- read_csv("data/policydocument.csv") %>%
  select(number, title, year, content) %>%
  filter(!is.na(content)) %>%
  filter(!is.na(year))

# Customize a tokenizer
tokenizer <- worker(
  # Custom user vocabulary list
  user = "data/electricity_word.txt",
  # Stop words based on a Chinese R online forum list, manually edited using industry knowledge
  stop_word = "data/customed_chinese_stop_word.txt",
  # Tokenize each document (row) separately
  bylines = TRUE
)

# Tokenize the content of every policy document; the result holds one token vector per document
doc_token <- segment(policy_doc$content, tokenizer)

# jiebaR does not return a tidy table, so build one: for document x,
# pair each token with the document's index, year and title
extract_token <- function(x) {
  tibble(token = doc_token[[x]], number = x, year = policy_doc$year[x], title = policy_doc$title[x])
}
# Iterate over all documents instead of hard-coding the corpus size
policy_token <- bind_rows(lapply(seq_along(doc_token), extract_token))
```
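
The reshaping step can also be written without an explicit helper function. The sketch below, on toy data, repeats each document's metadata once per token with `rep()` and `lengths()`. The `toy_doc` and `toy_tokens` objects are made-up stand-ins for `policy_doc` and `doc_token`, and the approach is only one possible alternative.

```r
# Toy stand-ins (assumptions): two documents and their pre-segmented tokens,
# shaped like a list with one token vector per document.
toy_doc <- data.frame(
  number = 1:2,
  year   = c(2015, 2016),
  title  = c("doc A", "doc B"),
  stringsAsFactors = FALSE
)
toy_tokens <- list(
  c("电力", "市场", "改革"),
  c("输配电价", "改革")
)

# Row index of the source document for every token: 1, 1, 1, 2, 2
idx <- rep(seq_along(toy_tokens), lengths(toy_tokens))

# One row per token, with the document metadata repeated alongside
toy_tidy <- data.frame(
  token  = unlist(toy_tokens),
  number = toy_doc$number[idx],
  year   = toy_doc$year[idx],
  title  = toy_doc$title[idx],
  stringsAsFactors = FALSE
)

toy_tidy
```

Because the index is derived from the token list itself, the code does not need to know the number of documents in advance.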


### 3. Exploratory Data Analysis: six supporting policy documents

In our corpus, the first file is "Several Opinions on Further Deepening the Reform of the Electric Power System" ("the No.9 Document"), and the six supporting policy documents are numbered from 2 to 7. We conduct the exploratory data analysis on these six texts, given their representative status in regulating the electricity industry reform.

#### Key Words

We extract key words from the documents through tf-idf. Creating a jiebaR `worker()` of type "keywords" lets us retrieve them. Note that both the input and the output of `keywords()` are vectors, so `lapply()` is used again to loop over the documents. We added two columns to the result tibble: the English translation of each document's title and the English translation of each key word. The relevance of the key words to the corresponding titles is easy to see.

```{r keywords}
idf <- worker(
  "keywords",
  user = "data/electricity_word.txt",
  stop_word = "data/customed_chinese_stop_word.txt",
  # Select the top 5 key words for each supporting document
  topn = 5,
  bylines = TRUE
)

extract_keywords <- function(x) {
  tibble(keyword = keywords(policy_doc$content[x], idf), number = x, year = policy_doc$year[x], title = policy_doc$title[x])
}

# English titles of the six supporting documents (documents 2 to 7), in corpus order
supporting_titles_Eng <- c(
  "Implementation Opinions on Promoting the Reform of Transmission and Distribution Electricity Prices",
  "Implementation Opinions on Promoting the Construction of Electricity Market",
  "Implementation Opinions on the Formation and Standardized Operation of Electricity Transaction Institutions",
  "Implementation Opinions on the Planned Liberation of Electricity Generation and Consumption Plans",
  "Implementation Opinions on Promoting the Reform of the Electricity Sales Side",
  "Guiding Opinions on Strengthening and Standardizing the Supervision and Management of Coal-fired Self-Contained Power Plants"
)

policy_keywords <- bind_rows(lapply(2:7, extract_keywords)) %>%
  filter(!is.na(keyword)) %>%
  # English translations of the key words; entries marked "." are placeholders dropped in the last step
  add_column(keyword_Eng = c(
    ".", "transmission and distribution elec_price", "pilot", "reform", "power grid company",
    ".", "electricity market", "market", "transaction", "market player",
    ".", "transaction institution", "transaction", "electricity transaction", "market",
    ".", "electricity volume", "electricity generation", "priority", "direct transaction",
    ".", "market player", "electricity sales company", "electricity sales", "power supply",
    ".", "self-contained", "power plant", "coal power", "generator set"
  )) %>%
  # Each document contributes 5 key words, so each title is repeated 5 times
  add_column(title_Eng = rep(supporting_titles_Eng, each = 5)) %>%
  filter(keyword_Eng != ".")
  
policy_keywords
```

The table lists the six supporting documents and their key words in Chinese and English.
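
To make the tf-idf idea concrete, here is a minimal worked sketch on toy counts in base R. It derives the inverse document frequency from the toy corpus itself, which is an assumption made for illustration; jiebaR's keyword worker instead reads IDF values from a dictionary file, so its scores will differ. All terms and numbers are made up.

```r
# Toy term counts (made-up numbers): one row per document-term pair.
toy_counts <- data.frame(
  doc  = c("A", "A", "A", "B", "B", "C", "C"),
  term = c("reform", "price", "grid", "reform", "market", "reform", "market"),
  n    = c(4, 3, 1, 5, 4, 2, 6),
  stringsAsFactors = FALSE
)

n_docs <- length(unique(toy_counts$doc))

# Term frequency: share of a document's tokens taken up by the term.
doc_total <- tapply(toy_counts$n, toy_counts$doc, sum)
toy_counts$tf <- toy_counts$n / as.numeric(doc_total[toy_counts$doc])

# Inverse document frequency: log(number of documents / documents containing the term).
doc_freq <- table(toy_counts$term)
toy_counts$idf <- log(n_docs / as.numeric(doc_freq[toy_counts$term]))

# tf-idf is the product of the two.
toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf

# Rank terms within each document, highest score first.
toy_ranked <- toy_counts[order(toy_counts$doc, -toy_counts$tf_idf), ]
toy_ranked
```

In this toy corpus "reform" appears in every document, so its idf is `log(3 / 3) = 0` and it never ranks as a key word, which is the behavior tf-idf is designed to produce for ubiquitous terms.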

#### Key Word: Word Cloud

We generate six word clouds for a more intuitive presentation of the main key words of the six documents. We referred to several notes (https://cosx.org/2016/08/wordcloud2), (https://blog.csdn.net/qq_38865429/article/details/89407493) and (https://www.seedhk.org/2019/03/03/r-for-wordcloud/). A seed is set before generating each word cloud to make sure the result is reproducible. The RColorBrewer package was installed to provide more color schemes, but we could not get it to work with the word clouds.

```{r wordcloud}
subfile_token <- policy_token %>%
  filter(number >= 2 & number <= 7) %>%
  count(title, token, sort = TRUE) %>%
  filter(n >= 2)

# First word cloud: rendered directly as an HTML widget
set.seed(20220506)
p1 <- subfile_token %>%
  filter(title == "关于推进输配电价改革的实施意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC", shape = 'diamond')
p1

# Remaining word clouds: each widget is built, and a saved PNG from data/ is displayed
set.seed(20220606)
p2 <- subfile_token %>%
  filter(title == "关于推进电力市场建设的实施意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC", shape = 'diamond')
knitr::include_graphics("data/p2.png")

set.seed(20220706)
p3 <- subfile_token %>%
  filter(title == "关于电力交易机构组建和规范运行的实施意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC", shape = 'diamond')
knitr::include_graphics("data/p3.png")

set.seed(20220806)
p4 <- subfile_token %>%
  filter(title == "关于有序放开发用电计划的实施意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC")
knitr::include_graphics("data/p4.png")

set.seed(20220906)
p5 <- subfile_token %>%
  filter(title == "关于推进售电侧改革的实施意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC")
knitr::include_graphics("data/p5.png")

set.seed(20221006)
p6 <- subfile_token %>%
  filter(title == "关于加强和规范燃煤自备电厂监督管理的指导意见") %>%
  select(token, n) %>%
  wordcloud2(size = 1, color = "#003366", backgroundColor = "#FFFFCC", shape = 'diamond')
knitr::include_graphics("data/p6.png")
```

### 4. Policy Trend Discussion

First, we generate a variable called "freq": the number of times a token appears in a given year divided by the total number of token occurrences in that year. This normalization allows comparison between years, because the total number of tokens differs from year to year.

```{r Policy Trend Discussion}
# Count each token per document
policy_token_tidy <- policy_token %>%
  count(title, year, token) %>%
  rename(count = n)

# Total number of token occurrences in each year
total <- policy_token_tidy %>%
  group_by(year) %>%
  summarise(total_freq = sum(count))

year_term_count <- policy_token_tidy %>%
  # Sum the per-document counts (wt = count), so that the yearly count reflects
  # token occurrences rather than the number of documents containing the token
  count(year, token, wt = count) %>%
  rename(count = n) %>%
  inner_join(total, by = "year") %>%
  # Share of the year's tokens accounted for by each token
  mutate(freq = count / total_freq) %>%
  arrange(year, desc(freq), token)

# Choose the top 10 most frequent tokens in each year
year_term_count %>%
  group_by(year) %>%
  slice_max(freq, n = 10)
```


The list is also helpful for extending the stop word list.
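
The relative frequency defined above can be checked by hand on toy data. The sketch below uses made-up counts and English placeholder tokens, and is only meant to show the arithmetic: per-document counts are summed per year through the `wt` argument of `count()` (which sums a column instead of counting rows) and then divided by that year's total.

```r
library(dplyr)

# Made-up per-document token counts, shaped like policy_token_tidy.
toy <- tibble(
  title = c("d1", "d1", "d2", "d2", "d3"),
  year  = c(2015, 2015, 2015, 2015, 2016),
  token = c("market", "reform", "market", "price", "market"),
  count = c(3, 1, 5, 1, 2)
)

# Total token occurrences per year: 10 in 2015, 2 in 2016.
yearly_total <- toy %>%
  group_by(year) %>%
  summarise(total_freq = sum(count))

toy_freq <- toy %>%
  # wt = count sums the occurrences; without it, count() would return the
  # number of documents in which the token appears.
  count(year, token, wt = count) %>%
  rename(count = n) %>%
  inner_join(yearly_total, by = "year") %>%
  mutate(freq = count / total_freq)

toy_freq
```

Here "market" occurs 3 + 5 = 8 times among the 10 tokens of 2015, giving a frequency of 0.8, and the frequencies within each year sum to 1.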

With the key words generated in the previous section, we can count how often they occur each year across all document tokens, to depict the policy trend over these years.

For each of the six supporting documents, we selected one representative key word to indicate the topic of that document, such as "transmission and distribution elec_price" for the "Implementation Opinions on Promoting the Reform of Transmission and Distribution Electricity Prices" document, "electricity market" for the "Implementation Opinions on Promoting the Construction of Electricity Market" document, "transaction institution" for the "Implementation Opinions on the Formation and Standardized Operation of Electricity Transaction Institutions" document, and so on. These words were selected because they are both among the frequent words of these years and among the key words shown in the previous word clouds, and are therefore considered representative of a policy topic raised at the beginning of the electricity reform in 2015.

Draw the trend of six key words with ggplot2:

```{r trend of six key words}

p_year_term_counts <- year_term_count %>%
  filter(token %in% c("输配电价", "电力市场", "交易机构", "直接交易", "售电公司", "燃煤")) %>%
  # Translate those tokens into English for data visualization
  mutate(token = recode(
    token,
    "输配电价" = "transmission and distribution elec_price",
    "电力市场" = "electricity market",
    "交易机构" = "transaction institution",
    "直接交易" = "direct transaction",
    "售电公司" = "electricity sales company",
    "燃煤" = "coal power"
  ))

# Conduct data visualization
p_dataviz <- p_year_term_counts %>%
  ggplot(aes(year, freq)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~token, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "% frequency of word in yearly policy documents") +
  theme_minimal()
p_dataviz
```


The facet titles would ideally be the Chinese key words selected previously, but these could not be rendered in the plot, so we translated them into English for reference.

#### Conclusions

(1) The data fluctuate greatly, possibly because of the relatively long policy cycle in China's policy making. That is, the national-level documents that make up our corpus do not need to be issued every year for the same topic. The plots for "electricity sales company" and "transmission and distribution elec_price" are consistent with this explanation.

(2) We can observe a decreasing trend in the two middle plots, which correspond to the terms "direct transaction" and "transaction institution". "Direct transaction" refers to a transaction made directly between a power plant and a user, in which the two sides negotiate the price instead of following government pricing. "Transaction institution" refers to a third-party platform supporting this kind of direct transaction, which is needed because electricity trading is complicated: it covers many categories (including medium- and long-term transactions and spot transactions) and requires cooperation among multiple parties (including power plants, power grids and users). Other qualitative interviews suggest that spot transactions became less popular over time, and the decreasing trend observed here may lend some support to that finding.

(3) Another possible explanation is that we used only one key word as the indicator of each topic, rather than a set of key words, which may contribute to the large fluctuations. We therefore turn to topic modeling for further analysis.
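
As a small illustration of point (3), the sketch below aggregates yearly frequencies over a set of key words per topic rather than a single key word. The frequencies, the tokens and the `topic_map` lookup table are all invented for the example and do not come from the corpus.

```r
library(dplyr)

# Made-up yearly token frequencies, shaped like year_term_count.
toy_year_freq <- tibble(
  year  = rep(c(2015, 2016), each = 3),
  token = rep(c("spot", "direct transaction", "subsidy"), times = 2),
  freq  = c(0.010, 0.020, 0.005, 0.015, 0.005, 0.004)
)

# Hypothetical mapping from individual key words to a broader topic label.
topic_map <- tibble(
  token = c("spot", "direct transaction"),
  topic = "market transactions"
)

# Keep only mapped tokens and sum their frequencies per topic and year.
topic_trend <- toy_year_freq %>%
  inner_join(topic_map, by = "token") %>%
  group_by(topic, year) %>%
  summarise(freq = sum(freq), .groups = "drop")

topic_trend
```

Summing over several related key words smooths out years in which one particular term happens to be absent from the documents issued.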

### 5. Topic Modeling

We are interested in grouping the documents into several clusters through topic modeling, and in comparing the resulting topics with the six supporting documents to see whether there are any deviations from the initial topics.

LDA is applied with reference to the topicmodels documentation (https://rdrr.io/cran/topicmodels/man/lda.html) and the book Text Mining with R (https://www.tidytextmining.com/topicmodeling.html#topicmodeling).

```{r topic-modeling}
# Generate a document-feature matrix for topic analysis
policy_token_dfm <- policy_token_tidy %>%
  cast_dfm(title, token, count)

# Fit LDA with 7 topics, assuming the 6 initial topics plus one unrelated topic
# Set a seed so that the result is reproducible
policy_lda <- LDA(policy_token_dfm, k = 7, control = list(seed = 1509))

# matrix = "beta" extracts the per-topic-per-word probabilities from the model
policy_topics <- tidy(policy_lda, matrix = "beta")

# Find the 10 terms that are most common within each topic
policy_topic_terms <- policy_topics %>%
  group_by(topic) %>%
  # with_ties = FALSE guarantees exactly 10 rows per topic, matching the 70 translations below
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  # Provide an English translation for each term
  add_column(term_Eng = c(
    "electricity grid", "provincial", "permission", "regional", "supervision", "cost", "price", "transmission and distribution elec_price", "ratified", "period",
    "electricity generation", "electricity", "priority", "users", "transaction", "market", "electricity volume", "safeguard", "participate", "electricity purchase",
    "energy", "renewable energy", "electricity", "electricity generation", "distributed", "utilization efficiency", "regional", "pilot", "project", "power grid company",
    "market", "information", "electricity", "transaction", "spot goods", "institution", "disclosure", "spot goods transaction", "transregional", "region",
    "market", "transaction", "electricity", "institution", "electricity transaction", "market players", "electricity volume", "transaction institution", "information", "electricity market",
    "power distribution", "project", "power distribution grid", "increment", "supervision", "energy", "pilot", "electricity grid", "power supply", "planning",
    "energy", "project", "renewable energy", "coal power", "electricity price", "reform", "electricity generation", "price", "subsidy", "funding"
  ))

policy_topic_terms
```


```{r fig.height=3}
# Draw plots for these topics
policy_topic_terms %>%
  mutate(term_Eng = reorder_within(term_Eng, beta, topic)) %>%
  ggplot(aes(beta, term_Eng, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, ncol = 2, scales = "free_y") + 
  # to show the top terms for each topics, respectively
  scale_y_reordered() +
  labs(title = "Top 10 terms for 7 topics") +
  theme_minimal()
```
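
The plots above show the per-topic word probabilities (beta). To group documents, one would typically also look at the per-document topic proportions (gamma), which `tidy()` returns for an LDA model with `matrix = "gamma"` as a table with `document`, `topic` and `gamma` columns. The sketch below uses a small hand-made table in that shape, an assumption standing in for real model output, and assigns each document to its highest-probability topic.

```r
library(dplyr)

# Hand-made document-topic proportions (each document's gammas sum to 1).
toy_gamma <- tibble(
  document = rep(c("doc A", "doc B", "doc C"), each = 3),
  topic    = rep(1:3, times = 3),
  gamma    = c(0.7, 0.2, 0.1,
               0.1, 0.3, 0.6,
               0.2, 0.5, 0.3)
)

# Keep, for every document, the single topic with the largest gamma.
doc_topic <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Number of documents assigned to each topic.
topic_sizes <- doc_topic %>% count(topic)

doc_topic
topic_sizes
```

Assigning documents by their dominant topic is a simplification, since LDA treats each document as a mixture of topics; the gamma values themselves show how strong each assignment is.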


#### Result of Topic Modeling

The result of topic modeling is quite close to the key words we initially extracted from the six supporting documents of the 2015 electricity reform document, which marked the beginning of the new round of electricity reform in China. However, there are also some differences.

Compare the plots here with the "policy_keywords" table earlier in the report.

```{r comparison}
policy_keywords %>%
  select(number, keyword_Eng, title_Eng) %>%
  mutate(Num = as.double(number) - 1) %>%
  select(Num, keyword_Eng, title_Eng)
```

The first chart seems close to the supporting document "Implementation Opinions on Promoting the Reform of Transmission and Distribution Electricity Prices"; both the second chart (on the right) and the fourth chart (row 2, on the right) seem related to "Implementation Opinions on Promoting the Construction of Electricity Market", with the fourth being more about spot goods; the third seems to be about renewable energy, which is not covered by the six supporting documents but has become an increasingly important topic since President Xi announced the "carbon neutrality" target in 2020. The fifth chart seems close to "Implementation Opinions on the Formation and Standardized Operation of Electricity Transaction Institutions". The sixth topic relates to "Implementation Opinions on Promoting the Reform of the Electricity Sales Side", because the most important element of the sales-side reform is the incremental pilot of the distribution power grid. The seventh is about coal power, which is consistent with the sixth supporting document, "Guiding Opinions on Strengthening and Standardizing the Supervision and Management of Coal-fired Self-Contained Power Plants".
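
The visual comparison above could be complemented by a simple overlap score. The sketch below computes the Jaccard similarity (size of the intersection divided by size of the union) between each topic's top terms and each document's key words. The term sets are invented English placeholders, not the actual model output or the jiebaR key words, and `jaccard()` is a helper defined only for this example.

```r
# Invented term sets standing in for topic top terms and document key words.
topic_terms <- list(
  topic_1 = c("price", "cost", "grid", "supervision"),
  topic_2 = c("market", "transaction", "users", "priority")
)
doc_keywords <- list(
  doc_price  = c("price", "pilot", "reform", "grid"),
  doc_market = c("market", "transaction", "market player")
)

# Jaccard similarity between two sets of terms.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Matrix of similarities: rows are topics, columns are documents.
overlap <- sapply(doc_keywords, function(kw) {
  sapply(topic_terms, function(tt) jaccard(tt, kw))
})

round(overlap, 2)

# Best-matching document for each topic.
best_match <- colnames(overlap)[apply(overlap, 1, which.max)]
best_match
```

A topic whose best score stays near zero for every document would be a candidate for a theme that was not part of the original six supporting documents.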

## Summary

Of the seven topics, five overlapped with the six supporting documents, while the remaining two appeared to consist of random key words. Identifying a shift in policy discourse will require more analysis in the future.

Due to time constraints, policy documents for the six main sub-topic areas beyond 2020 were not included. Some of them relate to the epidemic and the resumption of work and production, some to heating production, and more to the construction of legislation and regulatory systems in the energy industry. We hope to have the opportunity to present this part in the future.