This article analyzes the questions on askubuntu.com to find the most commonly asked-about topics, in order to better understand which areas of Ubuntu may need more attention for bug fixing and which features might be worth adding in future releases. To do this, I analyzed public data from askubuntu.com using Spark on Azure HDInsight. Tags were the most useful signal. Counting words in the titles and body text was less useful. Future research might use a natural language processing library such as NLTK to better identify the topics asked about and the types of questions asked for each topic.
Big Data consists of large amounts of unstructured or semi-structured data that can be analyzed to derive new insights that cannot easily be found by manually searching the data. For example, one could parse gigabytes of server log files to find common causes of errors or slowdowns on a cluster of servers. Another example would be analyzing tweets from Twitter to determine public sentiment about a product. A third example would be analyzing customer buying behavior to better deliver targeted advertising.
This article focuses on the questions asked on askubuntu.com. The goal is to find the most commonly asked-about topics, in order to better understand which areas of Ubuntu may need more attention for bug fixing and which features might be worth adding in future releases.
I'm in no way affiliated with Ubuntu itself. This analysis is for demonstration purposes only.
The data was obtained from the Stack Exchange Data Dump on archive.org. It was extracted from its 7z archive, and the XML files inside were uploaded to HDFS storage on Microsoft Azure.
After the files were uploaded to Azure, two Spark 2.0 scripts written in Python (find_top_tags_for_askubuntu.com.py and find_top_words_for_askubuntu.com.py) were run on an HDInsight cluster, following this Azure HDInsight/Spark guide.
These scripts can be run from a Spark 2.0 cluster using the following commands:
spark-submit find_top_tags_for_askubuntu.com.py
spark-submit find_top_words_for_askubuntu.com.py
The scripts are saved in the scripts section of this repository.
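As a rough illustration of the tag-counting idea (not the repository's actual Spark script), the sketch below counts tags in plain Python. It assumes each question is a `row` element in Posts.xml whose `Tags` attribute holds tags in the form `<boot><grub2>`. The sample XML, `TAG_PATTERN`, and `count_tags` are hypothetical names made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample shaped like rows of Posts.xml (assumed format);
# the real file is far larger and has many more attributes.
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="Cannot boot after upgrade" Tags="&lt;boot&gt;&lt;grub2&gt;" />
  <row Id="2" PostTypeId="2" Body="Try reinstalling grub." />
  <row Id="3" PostTypeId="1" Title="Dual boot with Windows" Tags="&lt;dual-boot&gt;&lt;boot&gt;" />
</posts>"""

# Matches each tag name between angle brackets, e.g. "<boot><grub2>".
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def count_tags(xml_text):
    """Return a Counter mapping tag name -> number of rows using it."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        tags_attr = row.get("Tags")
        if not tags_attr:
            # Answers (and some other post types) carry no Tags attribute.
            continue
        counts.update(TAG_PATTERN.findall(tags_attr))
    return counts


# Print the top tags in the same "Rank | Tag | Count |" layout as the results.
for rank, (tag, count) in enumerate(count_tags(SAMPLE_POSTS_XML).most_common(3), start=1):
    print(f"{rank} | {tag} | {count} |")
```

On a cluster, the same extraction step would typically feed a distributed count rather than an in-memory `Counter`. For a multi-gigabyte file on a single machine, a streaming parser such as `xml.etree.ElementTree.iterparse` would be a safer choice than `fromstring`.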
The results from these scripts are saved in the results section of this repository. For find_top_words_for_askubuntu.com.py, only the top 1,000 results were saved due to size limitations.
Rank | Tag | Count |
---|---|---|
1 | 14.04 | 21148 |
2 | 12.04 | 17412 |
3 | boot | 13098 |
4 | command-line | 12294 |
5 | networking | 12101 |
6 | 16.04 | 11278 |
7 | dual-boot | 10458 |
8 | drivers | 9723 |
9 | unity | 9122 |
10 | wireless | 9018 |
11 | server | 8852 |
12 | apt | 8589 |
13 | grub2 | 7755 |
14 | partitioning | 7474 |
15 | installation | 7221 |
16 | nvidia | 6498 |
17 | gnome | 5818 |
18 | system-installation | 5651 |
19 | upgrade | 5507 |
20 | bash | 5470 |
21 | usb | 5404 |
22 | package-management | 5356 |
23 | 11.10 | 5125 |
24 | software-installation | 5054 |
25 | sound | 4961 |
Top 25 words in question titles, after filtering out generic words such as prepositions and conjunctions:
Rank | Word | Count |
---|---|---|
1 | ubuntu | 71478 |
2 | install | 19539 |
3 | 14.04 | 12902 |
4 | windows | 12491 |
5 | boot | 11096 |
6 | error | 9763 |
7 | 16.04 | 9562 |
8 | file | 9426 |
9 | cant | 9348 |
10 | 12.04 | 9318 |
11 | installing | 8392 |
12 | - | 8332 |
13 | working | 8046 |
14 | using | 7860 |
15 | screen | 7003 |
16 | server | 6999 |
17 | files | 6851 |
18 | usb | 6647 |
19 | work | 5557 |
20 | installation | 5438 |
21 | system | 5340 |
22 | command | 5055 |
23 | update | 5006 |
24 | drive | 4921 |
25 | upgrade | 4898 |
Top 25 words in question body text, after filtering out generic words such as prepositions and conjunctions:
Rank | Word | Count |
---|---|---|
1 | ubuntu | 385199 |
2 | install | 264532 |
3 | file | 210150 |
4 | using | 176274 |
5 | all | 151640 |
6 | 0 | 143552 |
7 | - | 141588 |
8 | windows | 141341 |
9 | installed | 140034 |
10 | apt-get | 131660 |
11 | like | 130961 |
12 | sudo | 127988 |
13 | boot | 120906 |
14 | system | 117750 |
15 | its | 114235 |
16 | some | 113739 |
17 | need | 111594 |
18 | run | 111025 |
19 | up | 110354 |
20 | one | 108843 |
21 | command | 105521 |
22 | error | 102161 |
23 | 1 | 101823 |
24 | only | 100010 |
25 | files | 97827 |
Tags were the most useful. The most common questions appear to be about Ubuntu LTS releases: all three recent LTS releases (14.04, 12.04, 16.04) are among the top 6 tags, possibly because LTS releases are the most widely used. Many questions relate to booting (boot: 3rd place, dual-boot: 7th place, grub2: 13th place). This might be due to the wide variety of hardware Ubuntu runs on, but I cannot say for sure. Networking-related questions were also common (networking: 5th place, wireless: 10th place). Additionally, many people seem to be interested in running Ubuntu as a server, judging by the server tag coming in at 11th place. Other notably high tags were related to drivers, graphics, partitioning, installation, and sound (drivers: 8th place, nvidia: 16th place, partitioning: 14th place, installation: 15th place, sound: 25th place), again possibly due to the wide variety of hardware on which Ubuntu can run.
Word counting did not provide much useful information compared to tag counting. Many of the words were pronouns, prepositions, conjunctions, or other words that carry no meaningful information. I tried to filter such words out, but this was difficult because there are so many of them. The results still include entries such as "all", "some", "its", and "-".
The information collected here may be useful for understanding which common problems Ubuntu users face and which features they are most interested in. However, more investigation is needed before it can be turned into actionable insights.
Future research might use a natural language processing library such as NLTK to better identify the topics asked about and the types of questions asked for each topic.
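To make the filtering problem concrete, here is a minimal sketch of word counting with a stop-word list, written in plain Python rather than Spark. It is not the repository's script: `STOP_WORDS`, `tokenize`, `count_words`, and the sample titles are hypothetical, and the tokenization rules (lowercasing, dropping apostrophes, splitting on whitespace) are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a realistic one needs
# hundreds of entries, which is what makes manual filtering tedious.
STOP_WORDS = {"a", "an", "the", "i", "to", "on", "in", "and", "or",
              "how", "do", "is", "my", "after", "with"}

# Punctuation stripped from the ends of tokens only, so "14.04" stays intact.
EDGE_PUNCTUATION = ".,?!:;()\""


def tokenize(text):
    """Lowercase, drop apostrophes ("can't" -> "cant"), split on whitespace."""
    cleaned = text.lower().replace("'", "")
    tokens = (word.strip(EDGE_PUNCTUATION) for word in cleaned.split())
    return [word for word in tokens if word]


def count_words(texts, stop_words=STOP_WORDS):
    """Count tokens across all texts, skipping anything in stop_words."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in tokenize(text) if w not in stop_words)
    return counts


SAMPLE_TITLES = [
    "How do I install Ubuntu on a USB drive?",
    "Can't install Ubuntu 14.04 - boot error",
    "Ubuntu 14.04 won't boot after upgrade",
]

for rank, (word, count) in enumerate(count_words(SAMPLE_TITLES).most_common(5), start=1):
    print(f"{rank} | {word} | {count} |")
```

Under these assumed rules, a standalone "-" survives as a token and "can't" becomes "cant", so noise of this kind reaches the counts unless the stop-word list or the tokenizer is extended to handle it.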