“A good newspaper is a nation talking to itself.”
Arthur Miller
In newspaper and magazine articles we cover more than events: we share what we are doing, what we are interested in, and what matters to us as a society. Our aim in this project is to see how news content changes across countries over time and whether there is a connection between country profiles and the news that is published. Our motivation is to understand the underlying profile of published news content: the topics of the news, its tone, the most commonly used words, and whether this information changes between countries. Is the news becoming more global, or does it preserve some trends from the country in which it is published? We aim to see whether country-specific information and published news are somehow correlated, or whether a good newspaper/article is now a world talking to itself. We use the News on the Web (NOW) corpus data and gather the country profiles from The World Factbook [1].
- What are the main topics of the published news? (tech, politics, sports, etc.)
- What are the distributions of articles over country and time?
- What are the distributions of these topics over country and time?
- What are the most commonly used words within each country's topics?
- Are there trends or patterns connecting news articles and country profiles?
- Do countries with attribute/fact X publish more news on topic Y?
The NOW Corpus contains online magazine and newspaper text collected daily from 20 different English-speaking countries. The dataset we have on the cluster covers the years 2010-2016. It contains lexicon, source, text, and wlp (word, lemma, PoS tag) data.
The data is in txt format, but its size is very large (around 5.9 billion words), so in terms of computation time we might need to limit ourselves to part of the dataset. We can do this either by choosing specific countries or specific time frames. We might limit the time by doing the analysis for a single year, for selected months of each year, or for selected days of each month.
We can enrich the dataset as follows:
- Using the source data, we can group the news by country and time.
- Using the title and URL of each article, we can infer its topic (in many cases the URL contains words such as sports, healthcare, or tech).
- Using the wlp data, we can find the topics of the articles.
The World Factbook contains profile data about each country, collected by the CIA. The data is open source and can be obtained from this link in JSON format. We are only interested in the 20 countries for which we have news articles available. Also, since the country profiles are in-depth, we choose key information such as population, age profile, sex ratio, and other socio-economic information.
Our data analysis consists of three main parts: analysis of the Factbook data, analysis of the NOW corpus data, and correlation of the two. In the first part, we determined and extracted the relevant facts for each country. In the second part, we determined topics for the news and obtained the topic distribution of each country's news articles. In the last part, we checked whether there are correlations between the facts and the topic distributions.
The World Factbook data set provides information on the main topics of geography, history, people, government, economy, communications, transportation, military, and transnational issues for 267 world entities. Among them, we selected the 20 countries whose internet media coverage exists in the NOW corpus data. In our initial analysis of the Factbook dataset, we observed that the country features mostly belong to the years 2015 to 2017. Since our aim is to see the correlation between the news and the Factbook data, we decided to use the news from the last few years.
Since this dataset is rather small compared to the news data, we downloaded it as JSON to our local computer. We filtered the data to keep only the 20 countries that exist in the News on the Web dataset (NOW corpus). These countries are: United States, Ireland, Australia, United Kingdom, Canada, India, New Zealand, South Africa, Sri Lanka, Singapore, Philippines, Ghana, Nigeria, Kenya, Hong Kong, Jamaica, Pakistan, Bangladesh, Malaysia, and Tanzania.
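A minimal sketch of this filtering step, assuming the downloaded JSON is a single file keyed by country name (the actual structure of the Factbook export may differ):

```python
import json

# The 20 countries covered by the NOW corpus
NOW_COUNTRIES = {
    "United States", "Ireland", "Australia", "United Kingdom", "Canada",
    "India", "New Zealand", "South Africa", "Sri Lanka", "Singapore",
    "Philippines", "Ghana", "Nigeria", "Kenya", "Hong Kong", "Jamaica",
    "Pakistan", "Bangladesh", "Malaysia", "Tanzania",
}

with open("Data/Factbook.json") as f:
    factbook = json.load(f)

# Keep only the country profiles we need (assumes top-level keys are country names)
factbook_20 = {name: profile for name, profile in factbook.items()
               if name in NOW_COUNTRIES}
```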
For each country, the Factbook provides more than 100 facts under the main topics listed above. First we read all the facts given under each topic on the website (https://www.cia.gov/library/publications/the-world-factbook/). We then decided on the features for which we might plausibly find a correlation with the news topics extracted from the news data.
The following facts were selected for comparison with the NOW corpus data:
1) People and Society: Population, Age structure, Median age, Population growth rate, Birth rate, Death rate, Net migration rate, Sex ratio, Life expectancy at birth, Religions, Ethnic groups
2) Economy: GDP - per capita (PPP), Unemployment rate, Inflation rate (consumer prices), Population below poverty line
3) Energy: Electricity - from other renewable sources, Carbon dioxide emissions from consumption of energy
4) Communications: Internet users
5) Government: Country name
6) Geography: Geographic coordinates, Natural hazards, Environment - current issues
We extracted the selected features for our 20 countries and did some preprocessing to clean the data. This preprocessing included cleaning the text data, splitting text, and extracting only the useful information needed. We also checked which facts contain up-to-date data and which contain clear, comparable, useful information. During this process, we decided not to use the Natural hazards feature, since the effect, frequency, and size of hazards are not comparable between countries; moreover, the distribution of the different hazards is so varied that a correlation over 20 countries does not seem feasible. Similarly, the Environment - current issues data does not indicate the degree of each problem and raises the same analysis issues. We also decided to exclude the Religions, Population below poverty line, and Ethnic groups data, since the latest values span the years 2000-2016 and the year differs per country. For the remaining facts, we selected the latest available values, mostly from 2016. Some facts include values per population, per gender, or per age group; for these we generally selected the values for the overall population.
Then we created a dataframe with one row per country and one column per fact. All rates, percentages, and numbers embedded in text were converted to float. We plotted all the facts for each country in /Notebooks/WorldFact.ipynb.
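A minimal sketch of how such a dataframe can be built with pandas; the fact names and raw value strings below are illustrative, not the exact Factbook values:

```python
import re
import pandas as pd

def to_float(text):
    """Pull the first number out of a Factbook text value, e.g. '1.3% (2016 est.)' -> 1.3."""
    match = re.search(r"-?\d+(?:\.\d+)?", str(text).replace(",", ""))
    return float(match.group()) if match else None

# Illustrative raw values; the real ones come from the filtered Factbook profiles
raw_facts = {
    "Canada": {"Population": "35,000,000 (July 2016 est.)", "Birth rate": "10.3 births/1,000 population"},
    "India":  {"Population": "1,266,000,000 (July 2016 est.)", "Birth rate": "19.3 births/1,000 population"},
}

facts_df = pd.DataFrame(raw_facts).T                       # one row per country
facts_df = facts_df.apply(lambda col: col.map(to_float))   # one float column per fact
print(facts_df)
```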
The NOW Corpus dataset has five different data file types: Database, WordLemPoS, Text, Sources, and Lexicon. We first downloaded samples of each of these files and examined which ones are relevant for our research questions. We decided to use the Sources data file, which includes each article's source website, source link, word count, etc.
In the Sources data we use all seven attributes: textId, #words, date, country, website, url, and title. Since the two source data files together have a manageable size, we processed them locally with the Source_Data_Exploration notebook. We answered some crucial questions about the data, such as how the news articles are distributed over countries, websites, etc. We also checked whether we could infer article topics from the URL, i.e., whether the URLs already contain the topic of the article. Unfortunately, the percentage of topics found this way was quite low, around 30%, so we decided to go with a different approach.
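The URL check was essentially keyword matching; a simplified, hypothetical sketch of the idea (the keyword list is illustrative):

```python
# Hypothetical helper: infer a coarse topic from keywords appearing in the URL.
URL_TOPIC_KEYWORDS = {
    "sport": "SPORTS",
    "politic": "POLITICS",
    "tech": "TECHNOLOGY",
    "health": "HEALTH",
    "business": "BUSINESS",
    "entertainment": "ENTERTAINMENT",
}

def topic_from_url(url):
    url = url.lower()
    for keyword, topic in URL_TOPIC_KEYWORDS.items():
        if keyword in url:
            return topic
    return None  # roughly 70% of URLs ended up here, which is why we moved to LDA

# Example
print(topic_from_url("http://www.example.com/sports/cricket/india-vs-australia"))  # SPORTS
```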
In our main approach we decided to use the WordLemPos data, since it includes each article's word list, and to run LDA on it to find the news topics for each country. We selected LDA because it is one of the state-of-the-art algorithms for topic modelling on text data. From the WordLemPos data we selected the textId and lemma columns for the analysis instead of using the raw text of each article; the lemma column is already lemmatized and stemmed.
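A minimal sketch of how the per-article lemma lists can be assembled from the WordLemPos data with PySpark (the file path and column layout are assumptions; the real WLP files may differ):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("now-wlp").getOrCreate()

# Illustrative path and schema; the real WLP files are tab-separated text files
wlp = (spark.read.option("sep", "\t")
       .csv("now_corpus/wlp/*.txt")
       .toDF("textId", "word", "lemma", "pos"))

# One document per article: the list of lemmas, ready for cleaning and LDA
docs = wlp.groupBy("textId").agg(F.collect_list("lemma").alias("lemmas"))
```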
Our further NOW corpus analysis is composed of two parts:
- Source data analysis to see source distributions and word counts
- Topic finding with LDA
In the source data analysis we first analysed the full data of all years, and then repeated the analysis for the articles and sources used in the selected year interval. Details of these results can be seen on the project website. In general, we checked from how many unique website sources the news articles are collected and how many articles are provided per country. We observed that the number of articles is not equally distributed: the US has the most articles compared to the other countries, while Tanzania has the fewest with 15,848 articles. On average, each country has around 306,608 articles. We also explored the total number of words in the articles per country. The US again has the highest word count, which makes sense since more of its articles were collected. Overall, across all years we have at least 8 million words collected per country. We also explored the article counts per website to understand whether the articles were collected evenly. However, some websites such as Times of India and Telegraph.co.uk have more articles collected than other sources, so one needs to keep this in mind while interpreting the results.
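These per-country figures come from simple aggregations over the Sources data; a minimal sketch in PySpark (assuming the Sources files are already loaded into a dataframe `sources` with the seven attributes listed above):

```python
from pyspark.sql import functions as F

# Articles and total word counts per country
per_country = (sources.withColumnRenamed("#words", "words")
               .groupBy("country")
               .agg(F.count("textId").alias("n_articles"),
                    F.sum("words").alias("n_words"))
               .orderBy(F.desc("n_articles")))

# Article counts per website, to spot over-represented sources such as Times of India
per_site = sources.groupBy("website").count().orderBy(F.desc("count"))
```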
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model (see https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation for the full description).
In order to use this methodology in our project, we first had to produce a clean word list per article through data preparation and noise elimination. Then we constructed the LDA model, determining the right parameters for a sensible clustering of the data that results in meaningful news topics. During this process, several parameters were tested and tuned iteratively by running LDA on the cluster with different parameter combinations. In addition, we had to discover which LDA implementation in Spark is useful and matches our needs, and with which optimizer we get the best results. Each of these steps is explained in detail in the following subsections.
The data preparation step includes time interval selection, sampling, and noise elimination for the news data.
In the time interval selection we had to take into account how up to date the country facts from the Factbook data are, the computational time of LDA, and the size of the NOW corpus data.
After analysing the Factbook data we observed that most of its values belong to 2016, so in order to correlate the facts with the news we decided to use the most recent news data for further analysis. We first tried three and then two years of news data, but the LDA computation time was very high (around 40 hours). We then tried one year of data. The last year of the NOW corpus is 2016 but does not include the last two months; in order to eliminate month bias we used the last 12 available months, i.e., the data published between November 2015 and October 2016. Even this data took too long to process within the time window available to us on the cluster (24 hours per run). We therefore sampled the data by selecting random articles across the selected year interval, which decreased the runtime to 8 hours, and we parallelized the process to decrease the runtime further.
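A minimal sketch of the time filtering and sampling step, reusing the `docs` and `sources` dataframes from the earlier sketches (the date format is an assumption; 25% is the fraction we eventually settled on, as discussed later):

```python
from pyspark.sql import functions as F

# Keep only the articles published between November 2015 and October 2016
# (assumes a 'date' column in YYYY-MM-DD format joined onto the documents),
# then take a random sample with a fixed seed for reproducibility.
recent = (docs.join(sources.select("textId", "date"), on="textId")
          .filter(F.col("date").between("2015-11-01", "2016-10-31")))

sampled = recent.sample(withReplacement=False, fraction=0.25, seed=42)
```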
We cleaned the data by removing numbers and special characters, and then deleted stopwords using two different existing lists. However, the results still contained several words that do not contribute to determining a specific topic per cluster. We therefore applied the following steps to make the word lists cleaner; some of them were applied over several LDA iterations while checking the most frequent word lists of each cluster.
- Elimination of stopwords using the default stopword list of the Spark ML library.
- Iterative elimination of unrelated words that do not contribute to topic selection (such as: also, share, many, like). We ran several LDA clustering iterations, checked the resulting most frequent/common words for each topic, and iteratively eliminated unrelated words.
- Elimination of digits.
- Elimination of words shorter than 3 letters; this step is used in several previous studies on natural language processing.
- Adding back a list of important 2-3 letter words: we were concerned about eliminating useful words that potentially contribute to the clustering, such as art, man, gun, war, eu, us, win, car. Therefore, we created a 2-3 letter word list to put back into the data.
- Iteratively selecting a sigma value for tail cutting to remove the most and least frequent words: in previous NLP studies, the most and least frequent words are deleted from the data set in order to eliminate noise. Since we did not want to eliminate some arbitrary X percent of the data, we used a more principled methodology. Assuming the word-frequency distribution is Gaussian, we cut its tails using a sigma threshold: we kept the data within 1 sigma, 1.5 sigma, or 2 sigma of the mean on each side, corresponding to 68.3%, 86.6%, and 95.4% of the data respectively (for a visual explanation see https://thecuriousastronomer.wordpress.com/2014/06/26/what-does-a-1-sigma-3-sigma-or-5-sigma-detection-mean/). We then ran LDA with combinations of these three sigma values and different cluster numbers (k), and in the end obtained the best results with 2 sigma. A sketch of the cleaning steps is shown below.
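A minimal sketch of these cleaning steps in PySpark, reusing the `spark` session and the `sampled` dataframe from the earlier sketches (the extra stopword lists and the 2-3 letter whitelist are omitted; variable and column names are ours):

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import StopWordsRemover

# 1) Stopword removal with the Spark ML default list (our extra lists of
#    unrelated words can be appended to the same stopword list).
remover = StopWordsRemover(inputCol="lemmas", outputCol="filtered")
docs_clean = remover.transform(sampled)

# 2) Word counts over the whole sampled corpus
counts = (docs_clean.select(F.explode("filtered").alias("word"))
          .groupBy("word").count().toPandas())

# 3) Treat the word-count distribution as Gaussian and keep the words within
#    mean +/- 2 sigma (~95.4% of the data); 1 and 1.5 sigma were also tried.
mu, sigma = counts["count"].mean(), counts["count"].std()
keep = set(counts.loc[counts["count"].between(mu - 2 * sigma, mu + 2 * sigma), "word"])

# 4) Drop cut words and words shorter than 3 letters from every article
#    (the whitelist of important 2-3 letter words is omitted here).
keep_bc = spark.sparkContext.broadcast(keep)
strip = F.udf(lambda ws: [w for w in ws if w in keep_bc.value and len(w) >= 3],
              "array<string>")
docs_final = docs_clean.withColumn("tokens", strip("filtered"))
```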
In order to get relevant most-frequent words for each cluster and obtain meaningful topics, we had to decide 1) how many clusters the model should have, 2) which sample size best represents the data, and 3) which sigma value should be used for selecting the most/least frequent words during cleaning.
To determine these three values for our final model, we ran LDA with combinations of different values for each. We then examined the cluster results and the possible topics that could be assigned to each cluster, and selected the values resulting in the most relevant, meaningful, and distinguishable clusters of frequent words.
We ran LDA for each country separately, for the following reason: each country talks about a topic differently even when the concept is the same. For instance, if the topic is sports, Canada talks about hockey while India talks about cricket, and the names of sports celebrities change accordingly. Similarly, if the topic is politics, in India we may see religion-related words, while in the US we see Trump; yet if the word Trump appears in another country, the article likely belongs to an International topic. Therefore, running the model on mixed-country data may result either in indistinguishable clusters of very general words or in wrong topic assignments.
We first tried k = 3, 4, 5 in order to have similar topics for each country, but the most frequent words of the clusters were too similar. We then tried k = 7, 10, 13 and selected 7, since 10 and 13 were not distinguishable while k = 7 gives sensible news topics.
As discussed in the Time Interval Selection section, we decided to sample the data. To determine which percentage of the data to sample, we iteratively sampled and tested 10%, 20%, and 25% of the data. The best result was obtained with 25% sampling.
To find the most sensible topics and a better clustering by cutting the most and least frequent words, we tried different sigma values for cutting the tails of the word distribution of each country. The best result was obtained with k = 7, 25% sampling, and sigma = 2, which corresponds to 95.4% of our sampled word data.
We tried two different LDA libraries: mllib.clustering LDA and ml.clustering LDA. We used mllib.clustering LDA first and realized that, after clustering, it does not provide a function for assigning a cluster to each article; it was also problematic to parallelize. ml.clustering LDA improved our computational time through parallelization and provides easier topic assignment after clustering via its transform function. The second model was selected based on its lower perplexity score and meaningful topic distribution. Topic distribution is the term we use for the distribution of a country's articles among the clusters: given k clusters with meaningful topics, after assigning each article to one of these clusters, do we obtain a proper distribution of articles among them? An example of an improper distribution: with 7 clusters 0-6 and 5670, 300, 0, 30, 20, 5, 4 articles respectively, one cluster clearly dominates and the distribution is not good. We selected the model giving a meaningful distribution of articles among the clusters; log likelihoods were also taken into account during model selection.
We tried two optimizers for the LDA model: the EM optimizer and the Online optimizer. Each optimizer produces a different list of most frequent words, and we selected the one giving the most meaningful most-frequent-word list for each cluster and the better perplexity score, which was the EM optimizer.
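A minimal sketch of the final setup with `pyspark.ml.clustering.LDA` and the EM optimizer, reusing the cleaned `docs_final` dataframe from the cleaning sketch (the vectorization via CountVectorizer and values such as maxIter are assumptions, not necessarily our exact settings):

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# Turn each article's cleaned token list into a term-count vector
cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(docs_final)
vectorized = cv_model.transform(docs_final)

# k = 7 topics with the EM optimizer (chosen over the online optimizer)
lda_model = LDA(k=7, maxIter=50, optimizer="em", featuresCol="features").fit(vectorized)

# Diagnostics used during model selection
print("log likelihood:", lda_model.logLikelihood(vectorized))
print("log perplexity:", lda_model.logPerplexity(vectorized))

# Most frequent words per topic, used to assign human-readable topic names
topics = lda_model.describeTopics(maxTermsPerTopic=15)
vocab = cv_model.vocabulary

# Topic assignment per article: argmax over the per-article topic distribution
argmax = F.udf(lambda v: int(max(range(len(v)), key=lambda i: v[i])), "int")
assigned = (lda_model.transform(vectorized)
            .withColumn("topic", argmax("topicDistribution")))

# Check how the articles spread across clusters (the "topic distribution" above)
assigned.groupBy("topic").count().orderBy("topic").show()
```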
After selecting the best model, we assigned topic names to the clusters and, for each country, created the topic distribution by counting the number of articles belonging to each topic. These distributions can be seen on our website as interactive pie charts showing the percentage of each topic per country. We ended up with the following news topics: ENVIRONMENT/ENERGY, INTERNATIONAL, POLITICS, SPORTS, TECHNOLOGY/SCIENCE/SOCIAL MEDIA, SOCIAL_LIFE/DAILY, ENTERTAINMENT/ART/MAGAZINE, COMPANY/BUSINESS, ECONOMY, POLICE/ACCIDENT/VIOLENCE, LEGAL/LAW and HEALTH/MEDICAL.
After obtaining the distributions of topics and the different facts for each country, we checked their correlation. We used both Pearson and Spearman correlations. Since our research questions focus on a monotonic relationship between the topic distribution of media coverage and the facts, we decided to publish the results with the Spearman correlation.
We created a heatmap of the correlations and considered only relations with a correlation coefficient higher than 0.25 or lower than -0.25. We tried to interpret the results and the meaning of each correlation. Since we had several results and correlation does not imply causality, we kept only the meaningful results with a p-value smaller than 0.05, interpreted them, and published them on our website. The results should still be interpreted carefully, since the project has limitations.
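A minimal sketch of this correlation step with SciPy and seaborn, assuming a `facts_df` (countries × facts) and a `topics_df` (countries × topic percentages) aligned on the same country index:

```python
import pandas as pd
import seaborn as sns
from scipy.stats import spearmanr

# Spearman correlation between every (fact, topic) pair, keeping the p-values
corr = pd.DataFrame(index=facts_df.columns, columns=topics_df.columns, dtype=float)
pvals = corr.copy()
for fact in facts_df.columns:
    for topic in topics_df.columns:
        rho, p = spearmanr(facts_df[fact], topics_df[topic])
        corr.loc[fact, topic], pvals.loc[fact, topic] = rho, p

# Heatmap, masking weak (|rho| <= 0.25) and non-significant (p >= 0.05) cells
mask = (corr.abs() <= 0.25) | (pvals >= 0.05)
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", center=0)
```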
We also checked the results by grouping the facts into groups of countries, such as countries with high, medium, and low carbon dioxide emission rates. This ranking approach did not change our significant correlation results.
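A sketch of this grouping check, with hypothetical column names:

```python
import pandas as pd
from scipy.stats import spearmanr

# Bin a fact into low / medium / high groups (coded 0/1/2) and correlate again;
# the fact and topic column names below are illustrative.
co2_group = pd.qcut(facts_df["Carbon dioxide emissions"], q=3, labels=[0, 1, 2]).astype(int)
rho, p = spearmanr(co2_group, topics_df["ENVIRONMENT/ENERGY"])
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```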
Gorkem Camli: choice of datasets, creating the plan for each milestone, exploratory data analysis and attribute description on NOW corpus data, generating interactive graphs, generating interactive maps, analysis of final results, creating a website that also serves as a platform for the data story, development of project topic, commenting the code, writing the explanations in the notebooks, writing the data story, LDA model construction, LDA code implementation and iterative runs on the cluster, topic selection for each country, correlation analysis.
Arzu Guneysu Ozgur: creating the plan for each milestone, exploratory data analysis and attribute description on Factbook data, aggregating data and plotting, analysis of final results, data preparation, sigma value use in data cleaning, generating interactive maps, creating content for the website, commenting the code, writing the explanations in the notebooks, writing the data story, LDA model construction, development of project topic, topic selection for each country, correlation analysis.
Ezgi Yuceturk: choice of datasets, creating the plan for each milestone, exploratory data analysis with Spark, LDA model construction and selection, LDA code implementation and iterative runs on the cluster, code implementation for managing the big data on Spark, analysis of final results, developing the host website for the final presentation, development of project topic, commenting the code, writing the explanations in the notebooks, writing the data story, topic selection for each country, topic assignment for LDA results, correlation analysis, most frequent word extraction.
- Understand how to manage the data.
- Decide on how to filter the NOW corpus to have a manageable data size.
- Decide on which attributes will be taken from Factbook data.
- Collect and filter both datasets.
- Clean the datasets.
- Do descriptive statistics and exploration of the data.
- Find the most frequent X words in each article (excluding stop words).
- Start doing preliminary analysis of news content.
- Start doing preliminary analysis of country profiles vs news content.
- Finalize preliminary analyses.
- Revise and comment the code.
- Decide the next steps.
- Refine the topics for each country: run LDA for each country separately (otherwise the selected words might be biased towards countries with more words or articles, such as the US).
- For each article do the topic assignment.
- For each country find topic percentages.
- Find the country-wise correlations between Now Corpus and Factbook.
- Gather analysis and correlation results.
- Do data visualization.
- Finalize the project.
- Clean, comment and revise the notebooks.
- Create data story.
This project's GitHub repository folder structure is explained below:
We have 3 folders:
- Data:
In this folder, we included the data we extracted from the Factbook dataset for the 20 countries and the parquet data we created for the preliminary analysis.
- Country: folder containing news-topic information for the sampled news and their assigned topic for each country
- Factbook.json : Factbook data from CIA for 20 countries
- data.parquet folder: Query results saved from pyspark_script.py
- Scripts:
- pyspark_script.py: Source data filtered on cluster with pyspark.
- LDA_script_params.py: Runs the LDA script with parameters (k, sigma, etc.)
- LDA_script_sample.py: Improved version of the LDA script (includes sampling)
- create_LDA_Model.py: For model selection
- yet_another_LDA.py: Final version
- Notebooks:
- WorldFact.ipynb : Data Exploration and Analysis for FactBook data
- Source_Data_Exploration.ipynb : Now Corpus data: Source data: Data Exploration and Analysis
- WordLemPos_Topic_Modelling.ipynb : Now Corpus data: WordLemPos data: Data Exploration, Analysis and LDA Topic Modelling
- Spark_Notebook.ipynb : Initial steps to understand managing big data in spark and cluster
- Data_Analysis.ipynb : Correlations of the two datasets, answers to the remaining research questions
- Topic_Assignment.ipynb : Human understandable topic assignment to documents clustered by LDA Model
- Plots: Interactive plots created for website
- LDAResult:
- all-pre-results: results found during model selection and hyperparameter tuning
- for all countries: the LDA model, topic distributions, document assignments, topics, and words can be found, along with perplexity scores and log-likelihood results.
The order of the notebooks is the same as above. You can start with the WorldFact notebook for the Factbook data analysis, then proceed with Source_Data_Exploration for the NOW corpus source analysis, and then continue with WordLemPos_Topic_Modelling. After those, see Topic_Assignment and finally the Data_Analysis notebook to see how we discovered the correlations. Spark_Notebook contains our initial steps to understand how to run and manage the big data on the cluster with PySpark; we initially tried this with the Source data of the NOW corpus, and later this notebook was turned into scripts.