---
output:
html_document: default
pdf_document: default
---
# Data Futures
## Andrew: On Visualisations
### True Value and Purpose of Visualisations
To consider trends in visualisation, let's begin by identifying the core value of data visualisation - what does it actually do? In general terms we think about things using language, but we understand things using imagery. Our memories work better when we link items to images (even unrelated images), as evidenced by many mnemonic tools.
Data visualisation is the process of presenting information using means other than language as the principal conduit for the transfer of meaning/understanding/knowledge. If we think about our senses, three of the five (touch, smell, taste) are principally involved with our situational awareness. Hearing and sight, whilst obviously having key situational awareness roles, are our principal sources of knowledge awareness/learning at a higher cognitive level. Higher-level information that we receive aurally is principally delivered via language - yes, we could be learning about sounds themselves, in which case there is a combination of non-verbal and verbal, but when attending lectures or listening to the boss at work, the language is what matters (doubtless we are also processing a ton of non-verbal messages at work too). At uni and at work our non-situational visual inputs are often computer screens or paper - what Edward Tufte refers to as "flatland".
So now consider what we look at in flatland - a lot of words and numbers which require processing via our brain's language centres (Wernicke's area, Angular Gyrus, Insular Cortex, etc.) before they can be understood. That understanding is very often visual - do you **see** what I mean? Data visualisations cut out the language middle-man and, because they are not constrained by the bandwidth of our language-processing systems, provide a high-speed information channel capable of carrying lots of data very quickly into the "understanding" part of the brain. Language is an incredibly powerful brain function - it is arguably what lifts humans above other species, more even than opposable thumbs - but it is slow to process compared with image processing: we have been finding meaning in what we see for a lot longer than we have been translating meaning into and out of language.
All this is by way of suggesting that the purpose of data visualisation is to provide a means to convey understanding/knowledge without the use of language. In practice, of course, language is often used to augment/enhance visualisations (scales, labels, titles, explanatory notes), but in many cases the data is just way too dense to be conveyed in any manner other than a visualisation (think: picture, thousand words), as evidenced by the most common, oldest and most data-dense visualisations we have - maps.
```{r echo=FALSE, fig.cap="School map of the Canton of Zurich 1:150 000, Eduard Imhof and collaborator"}
knitr::include_graphics("Images/DataFutures/Andrew/Imhof.jpg", dpi = NA)
```
### The Effective Use of Visualisations
This chapter is heavily influenced (as many discussions of visualisation are) by the works of Edward Tufte, the Yoda of data visualisation. Yoda is an appropriate term because his approach is to use many examples of good and bad data visualisation practice and his objective appears to be to guide and advise by providing clarity as to why certain visualisations are easier to understand. He advocates concepts such as:
- minimise use of non-data ink
- remove chartjunk
- avoid harsh palettes
- it's okay to have high data density
Perhaps because of the clarity and almost pervasive uptake of Tufte's guidelines/advice, we are beginning to see more appropriate (gentler) palettes being offered as default colour schemes and more awareness of perception issues (e.g. the Moiré effect) in business intelligence tools such as Tableau and Sisense. A minimal sketch of these principles in practice follows the figure below.
```{r echo=FALSE, fig.cap="Moiré Effect - Vibrations in the Image", out.width="60%"}
knitr::include_graphics("Images/DataFutures/Andrew/moire.png", dpi = NA)
```
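By way of that sketch, the chunk below draws a deliberately plain ggplot2 chart using the built-in `mtcars` dataset (chosen purely for illustration): almost all of the ink is data ink, the palette is muted and there is no decorative background or chartjunk. It is a minimal sketch rather than a prescription, and the chunk is not evaluated when this chapter is knitted.
```{r tufte-principles-sketch, eval=FALSE}
library(ggplot2)

# A deliberately plain scatter plot: the ink is almost all data ink,
# the colour is muted and there is no decorative background.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "grey30", size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per US gallon",
       title = "Fuel economy versus weight") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
```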
Before diving into trends and fads though, pause to consider and remember that the objective of data visualisations should be to transfer knowledge and understanding via images rather than words. If we understand things ourselves in a visual way and want to transfer that understanding to our reader then we can most simply do that by presenting that information visually.
Are you beginning to understand why all those styles of the 90s, with their Moiré effects, were so bad?
The effective use of a visualisation to transfer knowledge or information, however, assumes there is knowledge/understanding on the part of the author in the first place. As we turn to tools, technologies, trends and fads consider the risks associated with any mechanism that makes it easier to deliver "information" about data (especially big data) even if the "author" doesn't understand the data themselves.
### Tools, Technologies, Trends and Fads
The first, and possibly most significant, trend in data visualisation is the growing need for it. Big data is presenting a challenge: a lot of information is being gathered, and supporting a conclusion or recommendation based on big data often requires some form of encapsulation/presentation of that information. Overly simplifying large volumes of data risks losing the message behind the detail, so visualisation is becoming a more important tool. The volume and depth of data, the need to present multivariate analyses, the complexity of the messages - all can be addressed by powerful, well-structured data visualisations.
#### How Big is Yours?
In 1990 Tufte mused on:
> the essential dilemma of a computer display: at every screen there are two powerful information-processing capabilities, human and computer. Yet all communication between the two must pass the low-resolution, narrow-band video display terminal, which chokes off fast, precise, and complex communication. (p.89, Envisioning Information, 1990, Edward Tufte)
Clearly, screen resolutions have progressed very significantly since 1990, such that HD displays (e.g. Retina) provide resolution and colour/contrast ranges much closer to the perceptual range of the human eye. To this we are now adding virtual reality, data walls and data rooms. All of these are permutations of "bigger display spaces" to display more data. It's cool and sexy and great for university PR, but does it work? (Doubtless this risks being very unpopular with certain groups at UTS.) Does the human mind have the capacity to work in the round, holding context from one part of the room to the next? We have trouble maintaining data context flipping a page, so I'll let you decide how easily we maintain that context turning around. So what does that mean for the super-sexy, very expensive displays? Use them wisely. They are not without merit or purpose, but just throwing stuff up there to show off risks falling into a world that Tufte might describe as "mega chartjunk land".
```{r echo=FALSE, fig.cap="Mine's Bigger...", out.width="60%"}
knitr::include_graphics("Images/DataFutures/Andrew/minesbigger.jpg", dpi = NA)
```
There is an important philosophical point to be made here. If you really need a 360° view of your data to explain it - do you actually understand it that well? It goes back to the underlying requirement that the author understand the message before trying to convey it.
With that said, these new display technologies are understandably being played with; people are learning what can be done and how to do it - hopefully, as part of that, they will also learn if/when they actually need 360° views (VR, data rooms). Don't let the medium get bigger than the message.
#### Intelligent Analytics - the Good, the Bad and the Less Ugly
Looking at the array of smart business intelligence tools being brought to market (e.g. Tableau, Sisense, Periscope), it is clear that the integration of data access with data analytics is progressing apace, with a big component of these visualisation tools being their libraries of data access layers and data wrangling tools. This work is directed at removing the peripheral effort from analysing and presenting information about data. Visualisation tools are being built into, or interfaced with, stream processing solutions/environments like AWS Kinesis to provide real-time data visualisations more simply. Open source tools like Kibana and Elasticsearch are providing accessible functionality, while R and Python extensions like Shiny apps, ipywidgets and Bokeh plots provide similar (if less readily accessible) functionality in the machine learning community.
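As a rough illustration of what those R extensions offer, the sketch below is a minimal Shiny app in which a slider filters the built-in `mtcars` dataset and the plot redraws in response. The dataset and variables are placeholders rather than anything from a real deployment, and the chunk is not evaluated here.
```{r shiny-interactive-sketch, eval=FALSE}
library(shiny)
library(ggplot2)

# Minimal interactive visualisation: a slider filters the data and
# the scatter plot redraws whenever the input changes.
ui <- fluidPage(
  sliderInput("min_cyl", "Minimum number of cylinders", min = 4, max = 8, value = 4),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    ggplot(subset(mtcars, cyl >= input$min_cyl), aes(x = wt, y = mpg)) +
      geom_point(colour = "grey30") +
      theme_minimal()
  })
}

shinyApp(ui, server)
```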
The power of the commercial tools to provide tips, advice and suggestions is one of their big commercial selling points and if they are being used for data investigation then this is not a bad thing. It doubtless streamlines the process of investigating and understanding data. If used for this purpose this sort of guided analytics is useful but care needs to be taken that the guidance doesn't limit the analyst's ability/willingness to do the hard work to understand all the possible interpretations of the data they are viewing. Worse still, the easier it is to produce good looking visualisations, the greater the temptation/capacity to present data that is not truly understood by the "author" of the visualisation.
```{r echo=FALSE, fig.cap="Simple Really"}
knitr::include_graphics("Images/DataFutures/Andrew/justpushthis.png", dpi = NA)
```
A scan of the various (newer) data visualisation tools shows an increasing awareness of how to present data (the less ugly). Being a commercial marketplace, there remains (at least on the demo side) a preponderance of gaudy, flashy visualisations, but that is to be expected. The move to pastel palettes, fainter lines and less cluttered, less confronting visuals is an improvement on the clunky, flashy graphics of the 90s and 2000s, but the balance between understanding the data (the message) and the best way to present it so that it is easily consumed by its observer (the medium) remains a challenge. This is not least because there are a lot of skills required for both of those tasks.
Thus, data analytics is moving forward in:
- improving access to source data
- improving exposure of data/information (interactive visualisations, live visualisations)
- tentative steps to use new display technologies
- better physical (perceptual) properties of visualisations
#### Recommendations
So what are the best (or least worst) tool options available today? There really is no single recommendation that can be made, but the questions to be considered might guide any selection process:
- What is your objective?
- Where is your data?
- How much access and wrangling is required?
- Is your data real-time?
- What are your circumstances (corporate budget, personal or research)?
- What are your priorities (analysis, presentation)?
- What are your legacy technical and political constraints?
A wily consultant would read this as an opportunity in and of itself.
The tools are less the issue than the desired outcome. Visualisations make data more accessible - large volumes of data can be represented in a small amount of space, and data can be presented in an approachable manner to colleagues, customers and the public. The adoption of data visualisation as an integral part of how information is presented, rather than as a fad or gimmick, is progressing apace, so if you read some of Tufte's work, or that of his predecessors and more recent data visualisation advocates, and bear the advice in mind when building data visualisations, then that is the best start that can be made.
### The Future of Visualisations
So, more data, from more sources is being amalgamated more effectively. Businesses and researchers are looking for ways to better transfer understanding of the information buried in those large, complex data sets. What might happen next?
The gimmicky/fad nature of some of the display formats (VR, data wall, etc.) will diminish with the combination of a better understanding of their applied values and improvements in tools that utilise the capacities of these technologies. This probably isn't the next big step in data visualisations, however.
The big steps happen when components that already exist are integrated more effectively. There is a slow move away from traditional reports-based business operations to dashboards and real-time awareness of the state of a business, market, campaign, etc. The nirvana of business intelligence is to amalgamate all the disparate data sources available to a business in a manner that exposes the underlying forces/reasons behind business trends. The data access tools are facilitating the move towards this nirvana. Once the data is exposed then the next two steps will be:
1. Use data visualisations as real-time views (or historical real-time views) of the state of a business (move away from tables and reports).
2. Provide visualisation interfaces (now we get to Minority Report territory) linked to business models allowing business planners to see the impacts of different business decisions and strategies.
As to the next area of advanced visualisation research, perhaps image processing neural networks can be used as a starting point to reverse engineer the principal visual vectors of understanding.
### Bibliography
Imhof and collaborator 1969, *Schulkarte des Kantons Zürich 1:150 000* Orell Füssli AG, Zürich.
Tufte, Edward 2001, *The Visual Display of Quantitative Information*, Graphics Press
Tufte, Edward 1990, *Envisioning Information*, Graphics Press
## Corinna: Some thoughts on Self-service Analytics and Data Democratisation
The ability to enable evidence-based, data-driven decision making in the workplace has been evolving in the data science industry through the introduction of self-service data analytics tools. Forecasters such as Gartner predict that by 2020, self-service business intelligence applications will make up 80% of all enterprise reporting (Dykes 2016). Self-service analytics seeks to provide novice analysts and non-technical business users with the capability to access and use or manipulate data, in the hope that these activities will lead to new knowledge (Rouse 2014). Vendors including Tableau, Power BI and Hyper Anna, amongst others, are investing in this trend, combining artificial intelligence solutions and user interfaces to meet non-technical users' thirst for data. Opening up analytics to wider audiences promises to develop new insights, knowledge and innovation by crowdsourcing minds (Kitchin 2014). Self-service analytical tools are therefore enablers of data democratisation, breaking down data silos and providing access to data when and where it is needed at any given moment (Kitchin 2014).
The push for evidence-based thinking in the workplace is justified by a legacy of successful outcomes of this approach in many industries, especially medicine. The medical field provides the gold standard of an area where evidence-based decision making is clearly valuable. Historically, medicine has relied on funded randomised controlled trials and other forms of formal research to develop standards for decision making, favouring treatments that had been proven to be most effective in practice (Lumen 2017). By relying on data-driven evidence, much of the uncertainty about treatment practices has been removed, further improving the quality of services. Primarily, evidence-based practices aim to improve the quality of decision making by justifying actions and applying knowledge derived from data. The flow of knowledge development can be represented by the knowledge pyramid below, where data are first abstracted from the world before being processed and repackaged into usable artefacts (Kitchin 2014).
```{r echo=FALSE, fig.cap="A Knowledge Pyramid (taken from Kitchin 2014 and adapted from Adler 1986 and McCandless 2010)"}
knitr::include_graphics("Images/DataFutures/Corinna/corinna_36111_a1_knowledge_pyramid.png", dpi = NA)
```
Traditionally, due to the costs and constraints of generating data, the practice of generating new knowledge with data has been confined to larger entities that could afford the funding and personnel required (Kitchin 2014, Bowker 2000). Smaller amounts of data were formally collected in studies designed with established methodologies and modes of analysis, as well as rules of conduct (Kitchin 2014, Bowker 2000). There is a long-standing record of producing answers to tailored, specific research questions and of iterating on the scientific process. As phenomena become easier to monitor with the help of the digitisation of data, the data, technologies and techniques available become more accessible at a lower cost in time and effort (Brynjolfsson & McAfee 2014). Concerns have arisen over the risks of misinterpretation and misuse of the information generated from data by untrained users (Marr 2017, Dykes 2016, Harris 2012).
### On Bias
Untrained users may be unaware of, and unable to assert control over, personal biases. Multiple varieties of bias have been identified through iterations of scientific practice, including observer bias and other experimental effects that occur when researchers' expectations influence study and data outcomes (Holman et al. 2015; Young 2009). Biases may be influenced by the following:
- Researchers expect or assume specific occurrences
- Research design encourages human subjects or researchers to preferentially detect, focus on and recall outcomes that affirm beliefs
- Analysis or data recording that requires subjective judgements
- Incentives and agendas or conflicts of interest
The effects of bias pervade multiple stages of formal studies and primary data, and so this bias can also affect informal studies, secondary and tertiary data of all sizes (Young 2009). Some of the traditional approaches to control bias and improve credibility include the use of blinding, randomised sampling and peer review. Peer review can be considered the bedrock of credibility for formal studies (Wheeler 2011). Since peer review relies on willing participation between academics to critically assess studies, there are limitations to its power. Hence, peer reviews can also be subject to some biases and conflicts of interest.
Many organisations recognise that data in the hands of a few data experts can be powerful, and are hopeful that data at the fingertips of many more domain experts and other staff members will be truly revolutionary, improving knowledge output, efficiency, flexibility or quality of work (Kitchin 2014). The management and interpretation of data through a community of users has the potential to crowdsource insights in a new dynamic that can be likened to peer review. For this reason, self-serve analytics, when competently governed and supported, may prove more efficient and enriching for the development of industry knowledge than previous infrastructures have allowed.
Self-service data analytics models provide a new means to conduct collaboration and peer review. Especially if the functionality to collaborate on understanding data is integrated into the user interface, the review of insights may occur as they are developed. With current dashboard solutions and reporting mechanisms, the contesting of information is more formalised and structured. A collaborative environment produced by a well-implemented self-service analytics strategy has the potential to create collaborative support from peers and mentors that both empowers users and facilitates user learning experiences, improving ways of justifying decisions and developing unique results in a domain (Chesler et al. 2013).
### Meta-data and Data Context
The data does not speak for itself; people are a large component of the production of data-driven knowledge. The perception that data is objective has pervaded industry because much of the scientific work has been conducted with "small data" (Kitchin 2011, Michener 2009). However, imperfections in "small data" research have historically been identified as "artifacts": errors in results or man-made imperfections that distort the properties of the subjects (Schmidt & Hunter 2015).
"Data do not exist independently of ideas, techniques, systems, people and contexts, regardless of them often being presented in this manner." (Kitchin 2014)
Although data may have been widely thought of as benign "raw" elements, abstracted from the world neutrally and objectively, there are many claims to suggest otherwise (Kitchin 2014, Michener 2009). Data are produced through established normative, political and ethical processes, where decisions about generalisations, assumptions and representations, as well as what remains visible and invisible, have consequences for the subsequent analysis and conclusions (Kitchin 2014).
If data is so socially constructed and ideologically loaded from its conception, then ignoring these contextual aspects of the data risks misinterpretation and misjudgement. Not only this, but the storage and sharing of the data becomes problematic if these artefacts of the data are not also passed on (Zimmerman 2008, Bowker 2000). Unfortunately, the tidy formats in which data are transferred and stored (such as within databases) may fail to maintain the important metadata and information regarding the original agendas of the data (Kitchin 2014). Furthermore, much repurposed information may not have been maintained to a standard that ensures data artefacts are shared (Zimmerman 2008, Michener 2009). The data can thus become uncoupled from its original political and social contexts, leaving only what the organisational rules, philosophies and practices determine to be important (Kitchin 2014).
The use of new data formats and advances in database and storage options continue to allow more and more unstructured and unprocessed forms of data to be stored (Song & Zhu 2016, Kitchin 2014, Service 2017). For instance, unlike traditional data repositories, a data lake is a store of unformatted data where pathways and processes are required to explore the data, since most organisations contain multiple applications with variable, non-combined formats (McKennar 2016). Yet throwing all data formats into a 'data lake' may not be the best, nor the gentlest, approach to meeting the thirst for data. Whilst the data stored in data lakes may comply with currently accepted data standards, there is often still a lack of documentation and commonality in standardisation, especially when data is retained from research that has previously been informally stored (McKennar 2016, Kitchin 2014). Much of the data used to develop knowledge in the past has been lost in favour of aggregations or when personnel move on, with only the most valuable datasets of cultural and political significance retained in data archives (Michener 2009). Again, the uncoupling of data from its context can occur when data is not accountably curated and archived. Tidy data also has quality control, productivity and sense-making advantages, all vital components of efficiently yielding knowledge from data, while untidy data is far more difficult for new, unfamiliar user groups to manipulate and interpret.
The handover of data artefacts therefore does not have a generic solution and will depend on the capabilities and judgement of governance bodies as well as the availability of documentation. In some cases, the quality of data may be compromised such that it becomes stale and unusable due to poor maintenance or the nature of business operations. For instance, application data may not enter the database offered to users for analysis until several months after entering the business pipeline, because that data is not yet digital, regardless of how often the database is updated. The data may eventually join the database, but the time taken to process it may skew ongoing trends expressed within visualisations. To prevent stale data, these systems require constant maintenance with fresh and relevant input (Marr 2017). Combining large data segments can also produce Simpson's Paradox, where overall patterns do not reflect the true trends of the separate groups (Huber 2011); a small sketch of this follows below. It is important that such issues are not overlooked, to allow for smooth and correct interactions with data.
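Simpson's Paradox is easy to demonstrate with a small synthetic example. In the sketch below (entirely made-up data), x and y rise together within each of two groups, yet pooling the groups reverses the apparent trend. The chunk is illustrative only and is not evaluated here.
```{r simpsons-paradox-sketch, eval=FALSE}
library(ggplot2)
set.seed(1)

# Two synthetic groups: within each group x and y rise together,
# but group B sits at higher x and lower y, so the pooled trend flips.
group_a <- data.frame(group = "A", x = rnorm(50, mean = 2))
group_a$y <- 5 + 0.8 * group_a$x + rnorm(50, sd = 0.3)
group_b <- data.frame(group = "B", x = rnorm(50, mean = 6))
group_b$y <- -1 + 0.8 * group_b$x + rnorm(50, sd = 0.3)
both <- rbind(group_a, group_b)

coef(lm(y ~ x, data = group_a))["x"]  # positive slope within group A
coef(lm(y ~ x, data = group_b))["x"]  # positive slope within group B
coef(lm(y ~ x, data = both))["x"]     # negative slope once the groups are pooled

ggplot(both, aes(x, y, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +                       # per-group trends
  geom_smooth(aes(group = 1), method = "lm", se = FALSE,
              colour = "black", linetype = "dashed")              # pooled trend
```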
### Non-technical User Training
Simply supplying new user groups with access to data and technology will not guarantee success. Just as a person who is illiterate cannot gain from a rich library of books and written information, so a lack of data literacy and experience with interpretation can prevent unprepared user groups from using the data and extracting value from it (Harris 2012, Dykes 2017). Acknowledging gaps in human-centred processes and building the confidence and skills to master the self-service systems will ultimately require the bestowal of knowledge and mentorship from those who have a history with the data and with using data tools. Brynjolfsson and McAfee in 'The Second Machine Age' predict that the exponential gains expected from combinatorial innovation are intended outcomes of serving data to a wider audience (Brynjolfsson & McAfee 2016). The catch for self-service analytics lies in the scalability of supplying the needed mentorship and training to an ever-extending user group.
Even if data could be justified as neutral, the use of data in analysis to develop knowledge, insights and innovation can also become twisted by political and ideological agendas (Young 2009). When data is used to produce knowledge, meaning is derived from complex cognitive processes to form the basis of understanding, explaining and actioning insights (Young 2009). This data analysis stage is human-centred and subjective, with each data consumer framing data from personal knowledge, understanding and experience.
For half a century data analysis has been framed to emphasise the application of judgement rather than simply applying mathematical and statistical tools (Tukey 1962). Tukey's influential paper elaborates that this judgement is constituted by:
- Subject matter experience
- Broad experience of analytical tools and techniques applied to various situations
- Judgement of the obtained abstract results.
For today's consumers of data, business user and analyst alike, the strengths of these components vary greatly. Self-serve analytics and data democratisation shift the problem of overly technical and potentially irrelevant reporting from smaller, more technically experienced teams to broader groups of people with potentially greater perception of data-driven business needs but less experience in utilising data and its insights. Although the potential for discovery and the productivity of data-driven knowledge acquisition may have been amplified in the new analytical climate, there is little evidence to suggest that the value can be attained without the proper preparation of new user groups.
### Support and Resources
Data analysis has traditionally been one of the most demanding applications of interactive computing, since it covers a wide range of tasks and outputs from research to business intelligence reporting (Huber 2011). Languages and tools for analysis have aimed to be both interactive and programmable, to ensure evidence derived from data is repeatable but also customisable as data requirements change (Huber 2011). As a result, data analytics practices have been inaccessible to the technically untrained. However, the widespread use of computing, and the introduction of programming languages closer to natural language, have opened up opportunities for people to become familiar with data manipulation techniques at a lower overhead.
Huber suggests data analysis for novices should be offered in canned form for routine investigations, with more flexible methods and customisations available for deeper research (Huber 2011). Dashboards have been the bread-and-butter approach for providing users with canned visualisations of data for daily use. Where a dashboard has weaknesses, such as in explaining the visualisations, updates in the form of reports have been used as a supplement. New vendors such as IBM Watson, Hyper Anna and Data Robot are attempting to hybridise the two approaches so that more customised and complicated analysis can be facilitated by a search sequence. Accessing customised analytics via a search sequence removes the user's requirement to know and understand code, and opens the data up to new audiences. This new approach introduces new concerns regarding the unknown levels and types of user support required to ensure automated complex modelling is accountably used and understood.
Masking complex data processes behind more user-friendly interfaces is a necessary evolution of these self-serve systems. The consequence is again a lower overhead for training and usability. However, for the user groups that have no ability to investigate the artificial mechanisms and data pipelines behind the scenes, there is a gap in their capacity for data discernment. Without catered and mindful support and mediation of users with the interfaces of these technologies, the quality of interactions is undermined.
### Missing Practicalities for Training
"Few academics and organizations willingly scrutinize the processes on which we stake so many of our goods and values. Transparency, confidentiality, gatekeeping, resource allocation, institutional reputations for excellence-all inform our vision of ourselves as fairminded, sound, disinterested critics and inhibit self- reflection." (Wheeler 2011)
As data handling is extrapolated to new audiences previously unfamiliar with the methods of analysis, the requirement for training will increase with each user, yet comprehensive and scalable training is currently lacking. At present, general tutorials are available from vendors and third parties; however, this is not enough to ensure responsible data management. It cannot be assumed that new audiences will have the necessary time, resources, information structures or motivation to conduct comprehensive self-directed study to understand the data. Since organisations often operate in private and closed data ecosystems, the support for the use of data may need to be facilitated internally (Floridi 2006).
Providing support to new users on a large scale is most likely to be solved with virtual solutions. Virtual support, with structures like a massive open online course (MOOC) or preferably a massively adaptive complex online simulation (MACROSIM), may provide the information and mentorship infrastructures required (Virtual Internships). Such virtual programs have successfully incorporated a collaborative environment based on learning theories, and encouraged motivation and reflection on action (Chesler et al. 2013, Virtual Internships).
Improvements for user support in this new frontier will demand input from users. This will likely occur both anecdotally and through activity feeds where the analytics tools may even be used to process data from the processor (Floridi 2006). Data analytics of this cyclical kind will ultimately change or mutate the entire practice (Kitchin 2014). With the bridge between truly self-directed use and guided exploration through mentorship still open, it is my opinion that data experts will still play a role as data gatekeepers in the near future. The influence of such gatekeepers is yet to be fully explored (Leahey 2008).
### References
Bowker, G.C. (2000) Biodiversity Datadiversity, Social Studies of Science, SAGE Publications Ltd; 30(5) 643-683
Brynjolfsson, E., & McAfee A., (2016) The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, New York W. W. Norton & Company, London
Chesler, N.C., Arastoopour, G., D'Angelo, C.M., Bagley, E.A. and Williamson Shaffer, D. (2013) Design of a Professional Practice Simulator for Educating and Motivating First-Year Engineering Students, Advances in Engineering Education, American Society for Engineering Education, Madison
Dykes, B. (2016) Self-Service Analytics and the Illusion of Self-Sufficiency, last viewed 6 Nov 2017, https://www.forbes.com/sites/brentdykes/2016/11/15/self-service-analytics-and-the-illusion-of-self-sufficiency/#70349a54219a
Dykes, B. (2017) Why Companies Must Close The Data Literacy Divide, Forbes, last viewed 6 Nov 2017, https://www.forbes.com/sites/brentdykes/2017/03/09/why-companies-must-close-the-data-literacy-divide/#6c580d8a369d
Harris, J. (2012) Data Is Useless Without the Skills to Analyze It, Harvard Business Review, last viewed 6 Nov 2017, https://hbr.org/2012/09/data-is-useless-without-the-skills
Holman, L., Head, M. L., Lanfear, R., and Jennions, M.D. (2015) Evidence of Experimental Bias in the Life Sciences: Why We Need Blind Data Recording, PLoS Biology 13(7)
Huber, P. J. (2011) Data analysis: what can be learned from the past 50 years, Wiley series in probability and statistics. Wiley, Hoboken, N.J.
Kitchin, R. (2014) The Data Revolution: big data, open data, data infrastructures & their consequences, Sage Publications Ltd, London
Leahey, E. (2008) Overseeing Research Practice: The Case of Data Editing, Science, Technology & Human Values, SAGE Publications Inc., vol. 33, no. 5, pp. 620
Lumen Evidence Based Decision Making, last viewed 6 Nov 2017, https://courses.lumenlearning.com/wm-principlesofmanagement/chapter/evidence-based-decision-making/
Marr, B. (2017) What is data democratisation, a super simple explanation and the key pros and cons, last viewed 6 Nov 2017, https://www.forbes.com/sites/bernardmarr/2017/07/24/what-is-data-democratization-a-super-simple-explanation-and-the-key-pros-and-cons/#79ae1ce06013
Michener, William K., and Brunt, James W., eds. (2009) Ecological Data: Design, Management and Processing, Hoboken, GB: Wiley-Blackwell, pp. 92-100
McKennar, B. (2016) Data democratization in the age of big data: why data lakes won't work, last viewed 6 Nov 2017, http://www.computerweekly.com/blog/Data-Matters/Data-democratization-in-the-age-of-big-data-why-data-lakes-wont-work
Rouse, M. (2014) Self-service analytics, Tech Target, last viewed 6 Nov 2017, http://searchbusinessanalytics.techtarget.com/definition/self-service-analytics
Service, R. (2017) DNA could store all of the world's data in one room, last viewed 10 Nov 2017, http://www.sciencemag.org/news/2017/03/dna-could-store-all-worlds-data-one-room
Schmidt, F. & Hunter, J. (2015) Methods of meta-analysis, Availability bias, source bias, and publication bias in meta-analysis, SAGE Publications Ltd., London, pp. 513-551
Tukey, J. W. (1962) The Future of Data Analysis, The Annals of Mathematical Statistics, vol. 33, pp. 1-67
Virtual Internships, About, University of Wisconsin-Madison, last viewed 13 Nov 2017, http://virtualinterns.org/about/
Wheeler, B. (2011) The Ontology of the Scholarly Journal and the Place of Peer Review, Journal of Scholarly Publishing, vol. 42, no. 3, pp. 307-322
Young, S. N. (2009) Bias in the research literature and conflict of interest: an issue for publishers, editors, reviewers and authors, and it is not just about the money, Journal of Psychiatry and Neuroscience; vol. 34, no. 6 pp. 412-417
Zimmerman, A. S. (2008) New Knowledge from Old Data: The Role of Standards in the Sharing and Reuse of Ecological Data, Science, Technology, & Human Values, SAGE Publications Inc., vol. 33, no. 5, pp. 631-652
### Bibliography
The following articles were not required for writing this post but were influential and complementary readings; I recommend exploring these should you have the interest.
Tableau talks about Natural Language Processing and User interfaces for non-technical users as a top 10 trend in 2018: https://www.tableau.com/reports/business-intelligence-trends?utm_campaign=Whitepaper%20-%20BI%20Trends%20-%20Prospect%20-%20APAC%20en-SG%20-%202017-11-16&utm_medium=Email&utm_source=Eloqua&domain=gmail.com&eid=CTBLS000010712266&elqTrackId=df07a602fd9948c0944bf2daa142366d&elq=dcbbe1d12fb5472bbb2a4d246930ef2b&elqaid=26647&elqat=1&elqCampaignId=28130#nlp
Dudley, I. (2016) Uncommon Sense: The Democratization of Data Analysis, Nielsen Insights, last viewed 13 Nov 2017, http://www.nielsen.com/au/en/insights/news/2016/uncommon-sense-the-democratization-of-data-analysis.html
Mallows, C. (2006) Tukey's Paper After 40 Years, Technometrics, American Statistical Association and the American Society for Quality, vol. 48, no. 3
Marr, B. (2017) Why Data Democratization Is Such a Game-Changer In Our Big Data World, last viewed 6 Nov 2017, http://data-informed.com/why-data-democratization-is-such-a-game-changer-in-our-big-data-world/
Moats, D. (2015) Review of Rob Kitchin's The Data Revolution, last viewed 6 Nov 2017, https://www.theoryculturesociety.org/review-of-rob-kitchins-the-data-revolution/
Dykes, B. (2015) The Age Of Data Democratization: How To Effectively Share Data Across Your Business, last viewed 6 Nov 2017, https://www.forbes.com/sites/brentdykes/2015/09/09/the-age-of-data-democratization-how-to-effectively-share-data-across-your-business/#261201ac6c50
Strom, D & Baker, P. (2017) The Best Self-Service Business Intelligence (BI) Tools of 2017, last viewed 6 Nov 2017, http://au.pcmag.com/cloud-services/41015/guide/the-best-self-service-business-intelligence-bi-tools-of-2017
Kitchin, R. (2014) Rob Kitchin talks about big data, open data and the 'data revolution', Sage Publications Inc., last viewed 6 Nov 2017, https://www.youtube.com/watch?v=QpDfLoUHqE4
Shah, H.M. & Chung, K.C. (2009) Archie Cochrane and his vision for evidence-based medicine, Plastic and Reconstructive Surgery, 124(3), pp. 982-988, doi:10.1097/PRS.0b013e3181b03928, last viewed 6 Nov 2017, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2746659/
On decision making cultures see: https://hbr.org/2012/10/big-data-the-management-revolution
## Herry
## Passiona
## Rory: On Working with Sensitive Data
### Whatever Happened to CRISP-DM version 2.0?
<a data-flickr-embed="true" data-header="false" href="https://www.flickr.com/photos/nsaunders/22748694318/in/photolist-AEe1gy-6qHavF-6UCcar-ERXU3y-XcYgt9-daARkU-zACgeY-duvH1C-daAREo-zQTuUL-99sbWJ-o5ZEYo-Xj4Z1f-CY4sHk-e82Bk9-XAHwhw-Xj4YJy-jcshhv-vD3AZL-K4LoDs-79ZxJe-6PVPwj-4uWBUW-dAxzQi-4dnTmP-do5wWm-nZPPgy-ohLRJP-fgcHY4-jNn5BB-ikwKN6-iqCokx-hZ7qHj-bpQPP-S9gjcA-pGJSdS-hYcuBE-o2GYvX-pN8qoz-nHr6oG-nHs1Ra-nXSYRJ-nZPNqf-9CDXA7-nU37jU-29rSrd-otBhBv-fSijND-hjiAzf-9giLe6" title="Male satin bowerbird in his bower"><img src="https://farm6.staticflickr.com/5569/22748694318_0d88a88b24.jpg" width="800" height="571" alt="Male satin bowerbird in his bower"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
Image: Saunders (2016)
I have searched high and low to find out whether there has been an update to the “Cross Industry Standard Process for Data Mining” that we have been taught is the way to approach a data mining problem. The model was first released back in 1999 and, given that is 18 years ago, I wanted to see what the latest version contains. According to Piatetsky (2014), a [survey undertaken on KD Nuggets](https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html) places CRISP-DM still at the top of the methodology pile by a long way (43%). A caveat: there were only 200 responses to the poll, so there is some margin for error. However, a poll is a poll, and given its dominance - the next closest result, the ‘My Own’ methodology, got only 27.5% of the vote - I was sure there must have been an update.
Unfortunately, it does not seem to have happened. There was an attempt to start working on version two, but that has faltered, and all activity has stopped. So much so that even the website has closed and there is [nothing there anymore](http://www.crisp-dm.org/). I **LOVE** this post from [Keith McCormick (2007)](http://keithmccormick.com/crisp-dm-20/) in which he talks about having just returned from a meeting about version 2.0 and how it is slated for release that summer! (The summer of 2007.) I wonder what happened to it. For me, it speaks of energy and impetus and the fact that there was this big new dawn that everyone was excited about, but in the end, the pressures of real life and competing interests won. It is a salutary lesson in Open Standards and working with them. That said, CRISP-DM is a fundamental standard, but it is showing its age, particularly around the ethical use of data and of the models created from it. The moral aspect is particularly relevant in the current era, with people concerned about how all this data and its use is impacting them personally.
I also think that CRISP-DM could do with including more detail around the whole lifecycle, including implementation and, once the project completes, the final destruction of the data. A more holistic understanding of the elements is captured nicely by Chisholm (2015) in his post [Seven phases of a data lifecycle](https://www.bloomberg.com/professional/blog/7-phases-of-a-data-life-cycle/). I would also propose that it should refer to, if not include, the points in [The Nine Laws of Data Mining](http://www.socdm.org/index.php/methodologies/21-methodologies-9-laws) (Khabaza 2010) suggested by The Society of Data Miners.
The only attempt I have found to suggest an [update to the CRISP-DM methodology](https://www.researchgate.net/publication/277775478_CRISP_Data_Mining_Methodology_Extension_for_Medical_Domain) is the one proposed by Niaksu (2015). In it, Niaksu (2015, p.108) focuses on the problems specific to the use of data mining in the medical domain and introduces 38 tasks specifically to address the following areas:
> 1. Mining non-static datasets: multi-relational, temporal and spatial data
> 2. Clinical information system interoperability
> 3. Semantic data interoperability
> 4. Ethical, social and personal data privacy constraints
> 5. Active engagement of clinicians in knowledge discovery process
Point four is of special interest in the context of the work that I am focused on, as it is the first attempt to look at the ethical side of data mining within the CRISP-DM framework. However, it is limited to only the “evaluation of legal requirements and limitations in data usage” (Niaksu 2015, p.104). It does not go far enough in looking at how working with the data could be done ethically, considering the social and privacy environment.
#### Bibliography
Chisholm, M. 2015, '7 phases of a data life cycle', Bloomberg for Enterprise, viewed 23 March 2017, <https://www.bloomberg.com/enterprise/blog/7-phases-of-a-data-life-cycle/>.
Khabaza, T. 2010, Nine Laws of Data Mining, Methodologies, The Society of Data Miners, viewed 17 November 2017, <http://www.socdm.org/index.php/methodologies/21-methodologies-9-laws>.
McCormick, K. 2007, CRISP-DM 2.0, viewed 17 November 2017, <http://keithmccormick.com/crisp-dm-20/>.
Niaksu, O. 2015, 'CRISP Data Mining Methodology Extension for Medical Domain', Baltic Journal of Modern Computing, vol. 3, pp. 92–109.
Piatetsky, G. 2014, 'CRISP-DM, still the top methodology for analytics, data mining, or data science projects', KD Nuggets, viewed 17 November 2017, <https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html>.
Saunders, N. 2016, Male satin bowerbird in his bower, viewed 18 November 2017, <https://www.flickr.com/photos/nsaunders/22748694318/>.
### Ethics of Data and its Use
<a data-flickr-embed="true" href="https://www.flickr.com/photos/krysiab/6126773690/in/photolist-akpjn9-d2k7au-bsNjuC-WUQ9E5-iqCokx-hZ7qHj-bpQPP-7LamEj-S9gjcA-pGJSdS-hYcuBE-3upVT-o2GYvX-pN8qoz-nHr6oG-nHs1Ra-nXSYRJ-nZPNqf-nrN9Ci-nZVvVF-8DcczG-nHr6o1-nXSYUj-89hx6u-jYGNkj-bSPS8F-ZfAf7v-6m9v1N-8D95Ya-fQimPG-fQ1LoX-7NZJPM-fQ1Kox-o6xd3s-iQjKgp-nHs1G2-fvuET-nrMQsm-dnSqg7-96Vsjs-e98xBA-aePHni-yWn86z-bZ2Re-9F4eG2-7MYr8M-bvhsUp-nXSYtu-nHr9hA-iSbfTh" title="For me?"><img src="https://farm7.staticflickr.com/6189/6126773690_fab83a54f5_z.jpg" width="800" height="523.75" alt="For me?"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
Image: (Christine 2011)
I wanted to look at the existing frameworks that focus on working with data. I am particularly interested in the different approaches people have proposed for working ethically with sensitive data. Sensitive data contains information about the individual that most people would not be comfortable sharing. Examples that come to mind are health records, police records and the ones I am specifically interested in, the records of children who were removed from their parents by Family and Community Services (FACS). The last example is particularly problematic because children, by definition, could never give consent to the release of this information.
I would like to postulate that there is another element that makes data sensitive, and that is the impact the information could have on the wellbeing of the people tasked to work with it. Often analysts do not have a background in the areas in which they are working, nor have they had any training in coping with the stories the data contains. This lack of support is unsurprising, given that the bulk of the data currently being analysed will never tell a story that has a negative emotional impact. However, for those doing this vital work, thought should be given to how they are protected from the individual accounts it contains. Not doing so is failing in the duty of care for both the [individual](http://managedhealthcareexecutive.modernmedicine.com/managed-healthcare-executive/news/clinical/clinical-pharmacology/health-affects-work-and-work-affect) (Lynch 2001) and the [workplace](https://www.tinypulse.com/blog/negative-attitudes-affect-organizational-culture) (Reynolds 2016).
Given that I could find no research that considered mental illness arising from working with sensitive data, it is not surprising that I also failed to find best practices for working with that data. Everything I discovered is focused at a much higher level: the ethics of gathering, storing, using and sharing data. Much of the work comes from the medical domain, with a particular focus on sharing and the best ways to do it (Wilson et al. 2017; Riso et al. 2017; Kirilova & Karcher 2017). The key takeaway is that sharing is essential as it increases the utility of the research, but conversely, it also increases the risk of re-identification due to the accumulation of more and more data.
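A toy sketch of why that accumulation matters: two separately "de-identified" tables that each look harmless can be joined on shared quasi-identifiers to re-attach names to sensitive attributes. All of the values below are invented, and the chunk is illustrative only rather than drawn from any real dataset.
```{r reidentification-sketch, eval=FALSE}
# Two separately "de-identified" toy datasets that share quasi-identifiers.
health <- data.frame(
  postcode   = c("2000", "2000", "2150"),
  birth_year = c(1970, 1985, 1970),
  sex        = c("F", "M", "F"),
  diagnosis  = c("asthma", "diabetes", "hypertension")
)

electoral <- data.frame(
  postcode   = c("2000", "2150"),
  birth_year = c(1985, 1970),
  sex        = c("M", "F"),
  name       = c("Person X", "Person Y")
)

# Joining on the quasi-identifiers re-attaches a name to a diagnosis
# wherever the combination is unique in both tables.
merge(health, electoral, by = c("postcode", "birth_year", "sex"))
```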
Lyons, Van & Lynn (2016) convincingly argue that embedding the values of good data practice, based on the principles of justice, at the heart of an organisation is key to reducing the risk of using information. I postulate that this creates an ongoing dialogue about ethics, and that its implementation establishes a culture of safety first. In the case of the Red Cross and their web developers, Precedent, who by placing a single file on an unsecured web server [exposed the sexual history of 550,000 people](http://www.smh.com.au/federal-politics/political-news/red-cross-data-leak-personal-data-of-550000-blood-donors-made-public-20161028-gscwms.html) (McIlroy, Hunter & Spooner 2016), it is clear that neither company understood or cared deeply about the privacy of their clients. The [report from the Australian Information and Privacy Commissioner](https://www.oaic.gov.au/media-and-speeches/statements/australian-red-cross-blood-service-data-breach) (Pilgrim 2017) lays out how the culture of both organisations caused multiple lapses that led to the leak. Fortunately, in this case, the data was not released onto the internet because the person who found it reported it to the authorities.
Developing an ethical lens, and being able to articulate what is meant by ethics, is becoming more crucial for all of us. This ability to coherently bring a moral framework to an undertaking is an essential skill that data scientists must develop. Undertaking further study by enrolling in a [Data Science Ethics course](https://www.edx.org/course/data-science-ethics-michiganx-ds101x-1) (Jagadish 2016) is an excellent way to understand the fundamental concepts quickly. However, this view is not shared by many data scientists, which may explain the slew of stories about biased algorithms, such as [Google tagging black people as Gorillas](https://www.theguardian.com/technology/2015/jul/01/google-sorry-racist-auto-tag-photo-app) (Kasperkevic 2015) or [police using biased machine learning to predict criminal activity](https://theconversation.com/why-big-data-analysis-of-police-activity-is-inherently-biased-72640) (Dixon & Isaac 2017). It seems that we can't take the [human biases out of the machine](https://www.theguardian.com/technology/2017/apr/13/ai-programs-exhibit-racist-and-sexist-biases-research-reveals).
#### Bibliography
Christine 2011, For me?, viewed 19 November 2017, <https://www.flickr.com/photos/krysiab/6126773690/>.
Dixon, A. & Isaac, W. 2017, 'Why big-data analysis of police activity is inherently biased', The Conversation, viewed 18 November 2017, <http://theconversation.com/why-big-data-analysis-of-police-activity-is-inherently-biased-72640>.
Jagadish, H.V. 2016, Data Science Ethics, MOOC, MichiganX, viewed 17 November 2017, <https://www.edx.org/course/data-science-ethics-michiganx-ds101x-1>.
Kasperkevic, J. 2015, 'Google says sorry for racist auto-tag in photo app', The Guardian, 1 July, viewed 18 November 2017, <http://www.theguardian.com/technology/2015/jul/01/google-sorry-racist-auto-tag-photo-app>.
Kirilova, D. & Karcher, S. 2017, 'Rethinking Data Sharing and Human Participant Protection in Social Science Research: Applications from the Qualitative Realm', Data Science Journal, vol. 16, no. 0, viewed 17 November 2017, <http://datascience.codata.org/articles/10.5334/dsj-2017-043/>.
Lynch, W.D. 2001, 'Health affects work, and work affects health.', Business and health, vol. 19, no. 10, pp. 31–4, 37.
Lyons, V., Van, D.W. & Lynn, T. 2016, 'Ethics as pacemaker: Regulating the heart of the privacy-trust relationship. A proposed conceptual model', ICIS 2016.
McIlroy, T., Hunter, F. & Spooner, R. 2016, 'Red Cross data leak: personal data of 550,000 blood donors made public', The Sydney Morning Herald, 28 October, viewed 18 November 2017, <http://www.smh.com.au/federal-politics/political-news/red-cross-data-leak-personal-data-of-550000-blood-donors-made-public-20161028-gscwms.html>.
Pilgrim, T. 2017, Australian Red Cross Blood Service data breach, Office of the Australian Information Commissioner (OAIC), viewed 18 November 2017, <https://www.oaic.gov.au/media-and-speeches/statements/australian-red-cross-blood-service-data-breach>.
Reynolds, J. 2016, 'How One Person’s Negative Attitude Affects the Whole Work Culture', Employee Engagement & Company Culture Blog, viewed 18 November 2017, <https://www.tinypulse.com/blog/negative-attitudes-affect-organizational-culture>.
Riso, B., Tupasela, A., Vears, D.F., Felzmann, H., Cockbain, J., Loi, M., Kongsholm, N.C.H., Zullo, S. & Rakic, V. 2017, 'Ethical sharing of health data in online platforms – which values should be considered?', Life Sciences, Society and Policy, vol. 13, no. 1, p. 12.
Wilson, R., Butters, O., Avraam, D., Baker, J., Tedds, J., Turner, A., Murtagh, M. & Burton, P. 2017, 'DataSHIELD – New Directions and Dimensions', Data Science Journal, vol. 16, no. 0, viewed 17 November 2017, <http://datascience.codata.org/articles/10.5334/dsj-2017-021/>.
### Navigating the Unknown
<a data-flickr-embed="true" href="https://www.flickr.com/photos/richardwc/2109138285/in/photolist-cwVV7-QVeZxE-Ks1ebA-Xo1vDT-NEs6KQ-K4LpGj-GA9gab-NutFFd-NX5y2K-MYYkjr-MYYjvn-NLJMhE-NLJJtU-NLJHcW-NLJFV7-MZes5b-NX5eNR-NX5dnV-NX5aMe-NX59gP-NX56y4-NX5bNH-NLJr9U-j8ycKg-NPgDVa-NLJq23-NLJnDh-NLJjZE-NX53Y4-NutskL-NLJmsj-NX53fa-NutEiy-MYYiGt-NLJKRy-MZepyj-xYnTxx-xYh7Zy-uuQSHy-sDjiCq-oB1ZW8-as56Ph-9PX2ch-8VyMGY-8Q4gek-7GNb95-4dnTkc-NX5wBv-NLJoJy-Nutr1S" title="Western Bower Bird"><img src="https://farm3.staticflickr.com/2348/2109138285_9310264707.jpg" width="800" height="534" alt="Western Bower Bird"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
Image: (Crook 2005)
The lack of any research or discussion about the area I am interested in makes it a little challenging to learn from others. There are not many times in life you get a chance to be the first to think about something. There is also the possibility that these thoughts are uninteresting, which is why they have never gone anywhere. Taking the optimistic view, I will assume the former until it becomes evident that the latter is true. With a lack of anything specific in my area of interest, I cast the net a bit wider and looked at more general articles that I can use to inform my thinking.
I found [this article](https://dupress.deloitte.com/dup-us-en/deloitte-review/issue-15/behavioral-data-driven-decision-making.html) from Guszcza & Richardson (2014) fascinating for two points they make. The first is that they debunk the myth of needing big data to get big insights, instead showing how working analytically and in a disciplined manner with currently available, ordinary data leads to results. Having just completed my work at the NSW Data Analytics Centre, I observed that almost none of their work involves *Big Data* or anything close to it. However, the work they do has real impact and delivers actual outcomes that change people's lives.
The other point Guszcza & Richardson (2014) make is that social and behavioural data is highly predictive of many outcomes that are not obvious at first glance, such as religious beliefs or political leanings. A device that continually measures features such as "people's tones of voice, body language, and communication patterns" (Guszcza & Richardson 2014, p.172) can be used to predict personal traits such as dating behaviour, the outcomes of salary negotiations and other workplace outcomes. I wonder what would happen if the government started to want us all to wear these devices so that it could intervene early to help protect us from ourselves or others.
Sivarajah et al. (2017) conducted a literature review of the papers relating to [Big Data challenges and analytical methods](http://www.sciencedirect.com/science/article/pii/S014829631630488X) published between 1996 and 2015. Their findings are interesting, particularly the total number of papers published, which reflects the sharp rise in interest in big data since 2013.
```{r echo=FALSE, fig.cap="Sivarajah et al. (2017) - Total number of papers published (from 1996 to 2015)"}
knitr::include_graphics("Images/DataFutures/Rory/rory_36111_i1.png", dpi = NA)
```
The top four countries produced 486 papers, and China accounted for 49% of those, with Australia providing 10%. Given the plethora of work coming from China and the US (79% between them), it is important to note how well Australia stands out relative to the size of its population. Australia's population is currently 24 million, and even though the UK's population is roughly 275% of Australia's, it produced an equivalent number of papers. The figures are even more impressive when considering the populations of the US (1,345% of Australia's) and China (5,742%) (World Bank 2016). A rough papers-per-capita calculation follows the figure below.
```{r echo=FALSE, fig.cap="Sivarajah et al. (2017) - Frequency of researchers from different geographical locations (from 1996 to 2015)"}
knitr::include_graphics("Images/DataFutures/Rory/rory_36111_i2.png", dpi = NA)
```
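As a rough check on that comparison, the sketch below turns the shares quoted above into approximate papers per million people. The per-country paper counts are back-calculated from the percentages in the text (so treat them as approximations), and the populations are rounded 2016 World Bank figures; the chunk is not evaluated here.
```{r papers-per-capita-sketch, eval=FALSE}
# Approximate paper counts back-calculated from the shares quoted above:
# top four countries = 486 papers; China 49%, US ~30%, UK and Australia ~10% each.
papers <- c(China = 0.49, US = 0.30, UK = 0.10, Australia = 0.10) * 486

# Approximate 2016 populations in millions (rounded World Bank figures).
population_m <- c(China = 1379, US = 323, UK = 66, Australia = 24)

round(papers / population_m, 2)  # papers per million people
```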
The fact that 74% of the papers were either conceptual/descriptive/theoretical or analytical demonstrates that there is not much thinking occurring in the space of applying techniques and considering their impacts. Developing my understanding of this area is what I am interested in exploring further.
```{r echo=FALSE, fig.cap="Sivarajah et al. (2017) - Classification of research methods (from 1996 to 2015)"}
knitr::include_graphics("Images/DataFutures/Rory/rory_36111_i3.png", dpi = NA)
```
As an extension to the outcomes of working with highly sensitive data, Pentland (2014) has started to postulate what a future based on social data may look like. In his book, he theorises that humans are as [predictable as the trajectory of billiard balls](https://www.technologyreview.com/s/526561/the-limits-of-social-engineering/) (Carr 2014) if there is enough data, and of the correct kind, known about them. It is an interesting theory that aims to reduce human interactions and actions to a mathematical formula that can then be used to produce the maximum outcome for society by tweaking behaviour through social messaging and small payments.
#### Bibliography
Carr, N. 2014, 'The Limits of Big Data: A Review of Social Physics by Alex Pentland', MIT Technology Review, viewed 20 November 2017, <https://www.technologyreview.com/s/526561/the-limits-of-social-engineering/>.
Crook, R. 2005, Western Bower Bird, viewed 18 November 2017, <https://www.flickr.com/photos/richardwc/2109138285/>.
Guszcza, J. & Richardson, B. 2014, 'Two dogmas of big data: Understanding the power of analytics for predicting human behavior', Deloitte Review, no. 15, 28 July, viewed 16 November 2017, <https://dupress.deloitte.com/dup-us-en/deloitte-review/issue-15/behavioral-data-driven-decision-making.html>.
Pentland, A. 2014, Social Physics: How Good Ideas Spread - The Lessons From A New Science, 1st edn, Scribe Publications.
Sivarajah, U., Kamal, M.M., Irani, Z. & Weerakkody, V. 2017, 'Critical analysis of Big Data challenges and analytical methods', Journal of Business Research, vol. 70, no. Supplement C, pp. 263–86.
World Bank 2016, 'Total Population', The World Bank, viewed 19 November 2017, <https://data.worldbank.org/indicator/SP.POP.TOTL>.
## Tracy