---
title: "Analyzing Kaggle Data Science Survey Data-2017:PART-3"
author: "Anish Singh Walia"
output:
  html_document:
    df_print: paged
---
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
require(data.table)
require(highcharter)
require(ggplot2)
require(tidyverse)
SurveyDf<-fread("../Datasets/kagglesurvey2017/multipleChoiceResponses.csv") #for faster data reading
attach(SurveyDf)
```
## AIM
This is Part 3 of the project, a data analytics project for mining, analyzing, and visualizing the data collected by the Kaggle Data Science Survey conducted in 2017.
> Part 3 - This section analyzes the professional lives of the participants: their major degree, the time spent studying data science topics, the job titles they hold, which ML methods they actually use in industry, which blogs they prefer for studying data science, and so on.
### Let's get started.
1) Let's start with the most preferred blog sites for learning data science. This is a multiple-answer field. Let's find the 15 most frequent answers.
```{r message=FALSE, warning=FALSE}
blogs<-SurveyDf %>% group_by(BlogsPodcastsNewslettersSelect) %>%
summarise(count=n()) %>%
top_n(15) %>%
arrange(desc(count))
#removing NA value
blogs[1,1]<-NA
colnames(blogs)<-c("Blogname","Count")
#let's plot them
hchart(na.omit(blogs),hcaes(x=Blogname,y=Count),type="column",color="#062D67") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of most preferred blogs for learning",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
Hence, two of the most popular and preferred blog sites are __R-bloggers and KDnuggets__.
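Because this is a multiple-answer field, grouping on the raw column counts each combination of answers as its own category. The following is a minimal sketch of counting individual choices instead. It assumes the answers within one response are separated by commas, and the toy values below are illustrative only, not taken from the survey file.

```r
# Toy stand-in for SurveyDf$BlogsPodcastsNewslettersSelect (illustrative values only).
responses <- c(
  "KDnuggets,R Bloggers",
  "R Bloggers",
  "",
  "KDnuggets,Other blog",
  NA
)

# Drop unanswered rows, then split each response on commas (assumed delimiter).
answered <- responses[!is.na(responses) & nzchar(responses)]
choices <- trimws(unlist(strsplit(answered, ",", fixed = TRUE)))

# Count each individual choice and sort from most to least frequent.
choice_counts <- sort(table(choices), decreasing = TRUE)
head(choice_counts, 15)
```

On the real data, `responses` would be replaced by the survey column, and the resulting table could be passed to the same bar chart as above.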
----------------------------
2) Let's now study how long participants have been learning data science.
```{r message=FALSE, warning=FALSE}
table(LearningDataScienceTime)
hchart(SurveyDf$LearningDataScienceTime,type="pie",name="count")
```
So most of the participants started learning data science less than a year ago.
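A pie chart shows the shares visually. A quick numeric check can be sketched as follows, assuming unanswered responses are read as empty strings. The category labels in the toy vector are made up for illustration.

```r
# Toy stand-in for SurveyDf$LearningDataScienceTime (labels are illustrative).
learning_time <- c("< 1 year", "< 1 year", "1-2 years", "", "3-5 years", "< 1 year")

# Keep only answered rows, then convert counts to percentages.
answered <- learning_time[nzchar(learning_time)]
shares <- round(100 * prop.table(table(answered)), 1)
shares
```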
Let's check the age distribution of the participants against how long they have been learning data science.
```{r}
hcboxplot(x=SurveyDf$Age,var=SurveyDf$LearningDataScienceTime,outliers = F,color="#09870D",name="Age Distribution") %>%
hc_chart(type="column") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Boxplot of ages and the learning time of participants",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
The above plot was quite predictable: participants who have spent less time learning data science tend to be younger.
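The same comparison can be made numerically by taking the median age within each learning-time group. This is a small sketch on toy data. The ages and category labels are invented for illustration, and the real columns would be `SurveyDf$Age` and `SurveyDf$LearningDataScienceTime`.

```r
# Toy data standing in for the Age and LearningDataScienceTime columns.
toy <- data.frame(
  Age = c(21, 24, 23, 30, 35, 41, NA, 28),
  LearningDataScienceTime = c("< 1 year", "< 1 year", "< 1 year",
                              "1-2 years", "3-5 years", "3-5 years",
                              "1-2 years", "1-2 years")
)

# Median age within each learning-time group, ignoring missing ages.
median_age <- tapply(toy$Age, toy$LearningDataScienceTime, median, na.rm = TRUE)
median_age
```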
---------------------------
### 3) Let's now study which skills participants feel are important for becoming a data scientist
Let's do some data wrangling and transformations.
We make a separate data frame for each variable, for easier understanding.
```{r message=FALSE, warning=FALSE}
#helper function to avoid repeating the same grouping code
#takes a data frame and the (unquoted) categorical variable to group by and count
aggr<-function(df,var)
{
require(dplyr)
var <- enquo(var) #quoting
dfname<-df %>%
group_by_at(vars(!!var)) %>% ## Group by variables selected by name:
summarise(count=n()) %>%
arrange(desc(count))
dfname#function returns a summarized dataframe
}
RSkill<-aggr(SurveyDf,JobSkillImportanceR)
RSkill[1,]<-NA
SqlSkill<-aggr(SurveyDf,JobSkillImportanceSQL)
SqlSkill[1,]<-NA
PythonSkill<-aggr(SurveyDf,JobSkillImportancePython)
PythonSkill[1,]<-NA
BigDataSkill<-aggr(SurveyDf,JobSkillImportanceBigData)
BigDataSkill[1,]<-NA
StatsSkill<-aggr(SurveyDf,JobSkillImportanceStats)
StatsSkill[1,]<-NA
DegreeSkill<-aggr(SurveyDf,JobSkillImportanceDegree)
DegreeSkill[1,]<-NA
EnterToolsSkill<-aggr(SurveyDf,JobSkillImportanceEnterpriseTools)
EnterToolsSkill[1,]<-NA
MOOCSkill<-aggr(SurveyDf,JobSkillImportanceMOOC)
MOOCSkill[1,]<-NA
DataVisSkill<-aggr(SurveyDf,JobSkillImportanceVisualizations)
DataVisSkill[1,]<-NA
KaggleRankSkill<-aggr(SurveyDf,JobSkillImportanceKaggleRanking)
KaggleRankSkill[1,]<-NA
hchart(na.omit(RSkill),hcaes(x=JobSkillImportanceR,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of R skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(PythonSkill),hcaes(x=JobSkillImportancePython,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Python skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(SqlSkill),hcaes(x=JobSkillImportanceSQL,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of SQL skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(BigDataSkill),hcaes(x=JobSkillImportanceBigData,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Big Data skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(StatsSkill),hcaes(x=JobSkillImportanceStats,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Statistics skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(DataVisSkill),hcaes(x=JobSkillImportanceVisualizations,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Data Viz skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(DegreeSkill),hcaes(x=JobSkillImportanceDegree,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Degree",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(EnterToolsSkill),hcaes(x=JobSkillImportanceEnterpriseTools,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Enterprise Tools skill",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(MOOCSkill),hcaes(x=JobSkillImportanceMOOC,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of MOOCs",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(na.omit(KaggleRankSkill),hcaes(x=JobSkillImportanceKaggleRanking,y=count),type="pie",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Piechart of importance of Kaggle Rankings",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
1. We can see from the above plots that the skills rated least necessary are knowledge of Enterprise tools, a Degree, Kaggle Rankings, and MOOCs. These have the highest counts of "Unnecessary" answers from the participants.
2. In contrast, knowledge of Statistics, Python, R, and Big Data is most often rated "Necessary" or "Nice to have" by the survey participants.
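To compare the skills side by side rather than across ten separate pie charts, the importance columns can be stacked into one long table and counted together. This is a sketch under assumptions: the toy data frame below stands in for the `JobSkillImportance*` columns, unanswered responses are assumed to be empty strings, and `skill_counts` is a name introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for three of the JobSkillImportance* columns of SurveyDf.
toy <- data.frame(
  JobSkillImportanceR      = c("Necessary", "Nice to have", "", "Necessary"),
  JobSkillImportancePython = c("Necessary", "Necessary", "", "Nice to have"),
  JobSkillImportanceMOOC   = c("Unnecessary", "Nice to have", "", "Unnecessary"),
  stringsAsFactors = FALSE
)

skill_counts <- toy %>%
  # One row per (respondent, skill) pair; strip the common column prefix.
  pivot_longer(starts_with("JobSkillImportance"),
               names_to = "skill", values_to = "importance",
               names_prefix = "JobSkillImportance") %>%
  # Drop unanswered responses (assumed to be empty strings).
  filter(importance != "") %>%
  count(skill, importance, name = "count") %>%
  arrange(skill, desc(count))

skill_counts
```

The resulting table has one row per skill and rating, which makes it straightforward to see which skills collect the most "Unnecessary" or "Necessary" answers.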
-----------------------
### What proves that you have good Data science knowledge?
```{r}
knowlegdeDf<-SurveyDf %>% group_by(ProveKnowledgeSelect) %>%
summarise(count=n()) %>%
arrange(desc(count))
knowlegdeDf[1,]<-NA
hchart(na.omit(knowlegdeDf),hcaes(x=ProveKnowledgeSelect,y=count),type="column",color="#049382",name="count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of what proves you have Datascience knowledge",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
Let's now check the formal education of participants:
```{r}
table(FormalEducation)
```
------------------------
Let's check the machine learning techniques in which participants most often consider themselves competent.
```{r}
Mltechique<-SurveyDf %>% group_by(MLTechniquesSelect) %>%
summarise(count=n()) %>%
arrange(desc(count)) %>%
top_n(20)
Mltechique[1,]<-NA
hchart(na.omit(Mltechique),hcaes(x=MLTechniquesSelect,y=count),type="column",color="purple",name="count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of competent ML techniques of participants",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
So we can see that *Logistic regression, Decision trees, and Random forests* are the top techniques in which participants consider themselves competent, that is, the ones they can implement successfully and efficiently.
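The chunks above drop unanswered responses by blanking out the first row of the sorted summary, which relies on the blank category always being the most frequent one. Filtering by value is less fragile. The sketch below uses toy data and assumes unanswered responses appear as empty strings or `NA`.

```r
library(dplyr)

# Toy stand-in for SurveyDf$MLTechniquesSelect, including unanswered rows.
toy <- data.frame(
  MLTechniquesSelect = c("", "Logistic Regression", NA, "Decision Trees",
                         "Logistic Regression", ""),
  stringsAsFactors = FALSE
)

technique_counts <- toy %>%
  # Remove unanswered responses explicitly instead of by row position.
  filter(!is.na(MLTechniquesSelect), MLTechniquesSelect != "") %>%
  # Count each answer and sort from most to least frequent.
  count(MLTechniquesSelect, name = "count", sort = TRUE)

technique_counts
```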
-----------------------
### Let's check which learning algorithms participants use at work
Now we will check which machine learning algorithms are most used by the participants at work.
```{r}
MLalgoWork<-SurveyDf %>% group_by(WorkAlgorithmsSelect) %>%
summarise(count=n()) %>%
arrange(desc(count)) %>%
top_n(20)
MLalgoWork[c(1,3),]<-NA
hchart(na.omit(MLalgoWork),hcaes(x=WorkAlgorithmsSelect,y=count),type="column",color="green",name="count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of Most used ML algorithms at Work",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
> Again, as we can see from the above plot, *Regression, Logistic regression, and Decision trees* lead the pack as the learning algorithms most used at work by participants.
----------------------
### Now let's check which tools are used most at work
This field answers the question: for work, which data science/analytics tools, technologies, and languages have the participants used in the past year?
We are going to find the top 20 tools.
```{r}
ToolatWork<-SurveyDf %>% group_by(WorkToolsSelect) %>%
summarise(count=n()) %>%
arrange(desc(count)) %>%
top_n(20)
ToolatWork[c(1),]<-NA
hchart(na.omit(ToolatWork),hcaes(x=WorkToolsSelect,y=count),type="column",color="#7C0E3E",name="count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of Most used data science tools used at Work",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
> From the above plot we can see that *Python and R together* are used the most by data scientists, as entered by the survey participants. Hence Python and R still top the list of most used tools at work according to the survey.
----------------------
### Most used ML method at work?
```{r}
MethodatWork<-SurveyDf %>% group_by(WorkMethodsSelect) %>%
summarise(count=n()) %>%
arrange(desc(count)) %>%
top_n(20)
MethodatWork[c(1,3),]<-NA
hchart(na.omit(MethodatWork),hcaes(x=WorkMethodsSelect,y=count),type="column",color="#F14B5B",name="count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of Most used ML and DS methods used at Work",align="center") %>%
hc_add_theme(hc_theme_elementary())
```
--------------------------------
### Let's check how long the employer has been using ML
Let's first check the different types of employers and the industries entered by the participants.
This can give us some details about the employers' involvement in data science and ML.
```{r}
table(EmployerIndustry)
Employerdf<-as.data.frame(table(EmployerIndustry)) %>% arrange(desc(Freq))
Employerdf[1,]<-NA
hchart(na.omit(Employerdf),hcaes(x=EmployerIndustry,y=Freq),type="column",color="lightgreen")
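#countryCount is not defined in this file; it is assumed to be a country frequency table built in an earlier part
employerCountrydf<-SurveyDf %>% group_by(EmployerIndustry,Country) %>%
filter(EmployerIndustry %in% Employerdf$EmployerIndustry, Country %in% countryCount$Var1) %>%
summarise(count=n()) %>%
arrange(desc(count))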
USEmployerdf<-employerCountrydf %>% filter(Country=="United States")
IndiaEmployerdf<-employerCountrydf %>% filter(Country=="India")
hchart(USEmployerdf,hcaes(x=EmployerIndustry,y=count),type="column",color="#00008b",name="Count") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of of Employer industry of United states survey participants",align="center") %>%
hc_add_theme(hc_theme_elementary())
hchart(IndiaEmployerdf,name="Count",hcaes(x=EmployerIndustry,y=count),type="column",color="#33ccff") %>%
hc_exporting(enabled = TRUE) %>%
hc_title(text="Barplot of Employer industry of Indian survey participants ",align="center") %>%
hc_add_theme(hc_theme_elementary())
table(EmployerMLTime)
hchart(SurveyDf$EmployerMLTime,type="column",color="lightgreen")
```
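The chunk above filters on `countryCount$Var1`, which is not created in this file. The sketch below shows one way such a table might be built and used. It is a hypothetical reconstruction: the `Var1` column name is what `as.data.frame(table(...))` produces for an unnamed vector, and keeping the two most frequent countries is an assumption made only for this toy example.

```r
library(dplyr)

# Toy stand-in for the Country and EmployerIndustry columns of SurveyDf.
survey <- data.frame(
  Country = c("United States", "India", "India", "United States", "Brazil", "India"),
  EmployerIndustry = c("Technology", "Technology", "Academic", "Financial",
                       "Technology", "Technology"),
  stringsAsFactors = FALSE
)

# Hypothetical countryCount: frequency table of countries, top 2 kept (assumption).
countryCount <- as.data.frame(table(survey$Country), stringsAsFactors = FALSE) %>%
  arrange(desc(Freq)) %>%
  head(2)

# Count participants per industry within the retained countries.
employer_by_country <- survey %>%
  filter(Country %in% countryCount$Var1) %>%
  count(EmployerIndustry, Country, name = "count") %>%
  arrange(desc(count))

employer_by_country
```

Per-country data frames such as `USEmployerdf` can then be taken from this table with a plain `filter(Country == "United States")`.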