-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathDataScienceOnBlockchainWithR-Slides-HEC-DEC2022.qmd
347 lines (213 loc) · 13.1 KB
/
DataScienceOnBlockchainWithR-Slides-HEC-DEC2022.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "Data Science on Blockchain, how to get started?"
subtitle: "Two analysis examples on NFT and Helium"
author: "Thomas de Marchin"
date: "14DEC2022"
format:
revealjs:
theme: white
slide-number: true
embed-resources: true
auto-stretch: false
smaller: true
toc: true
css: styles.css
execute:
eval: false
echo: true
---
# Introduction
- Associate Director, Statistics & Data Science at Pharmalex: support pharmaceutical companies in drug development
- Gravitated towards blockchain technology and the place of data science within it
- Lots of data available on the blockchain (~ 1x10^6 transaction per day on ethereum), how to access it?
- Obtaining cryptocurrencies price data is straightforward, more sophisticated data manipulations require access to the “source” data
- Github: https://github.com/tdemarchin
- Blog: https://tdemarchin.medium.com
## Blockchain is hard to read
1. Several Tb of data. Ethereum ~ 6x10^9 transactions.
2. Data are stored sequentially, requires developing specific tools to follow a transaction.
3. The structure of a transaction is difficult to read
4. Fragmentation of blockchain technologies
::: {layout="[[20,40, 40]]" layout-valign="center"}
![A huge hard drive](figures/ramac.jpg)
![Image from ig.com](figures/blocks.png)
![Structure of a transaction](figures/transactionStructure.png)
:::
## How to get the data ?
1. Set-up an ETL: Extract, Transform and Load. Most flexible but need to set up a server with huge and fast hard-drives.
2. Use data providers:
- Dashboards: Dune analytics, Nansen, GraphSense, icy.tools,...
- APIs (software intermediary that allows two applications to talk to each other): OpenSea, Etherscan, Infura, The Graph,...
Two examples implemented in R:
1. How to track and visualize NFT transactions on Ethereum
2. Blockchain IoT data visualization and the rise of the Helium network
# NFTs
![https://www.larvalabs.com/cryptopunks](figures/cryptopunks.jpg){height=200px fig-align="center"}
- Non-Fungible Tokens: represent ownership of unique items (art, collectibles, patents, real estate,...)
- Smart contracts: decentralized programs stored on a blockchain that run when predetermined conditions are met
- What can we do with NFT-related data ?
## NFTs price analysis with OpenSea API
OpenSea: big NFT market place
```{r}
resOpenSea <- GET("https://api.opensea.io/api/v1/events",
query = list(limit = 300,
event_type = "successful",
only_opensea = "true"))
```
:::: {layout="[[60, 40]]" layout-valign="center"}
![](figures/glimpse.png)
![](figures/pieChartOpenSea.png){height=200px}
::::
- 300 transactions max, limited to OpenSea transactions
- Pre-processed by OpenSea and not raw blockchain transactions
- Mainly price analysis
- Analysis done in June 2021: [Link](https://medium.com/towards-data-science/data-science-on-blockchain-with-r-afaf09f7578c)
## NFTs tracking with EtherScan API: the data
::: {.panel-tabset}
### The Weird Whales
![https://weirdwhalesnft.com/](figures/patchwork.png){height=300px fig-align="center"}
- Weird Whales: collection of 3350 whales programmatically generated, each with their unique characteristics and traits. Created by a 12-year-old programmer named Benyamin Ahmed who made the buzz.
- EtherScan: block explorer to view information about transactions, verify contract code... $\rightarrow$ access to *more raw* data
- Analysis done in October 2021 and updated in December 2022: [Link](https://tdemarchin.github.io/DataScienceOnBlockchainWithR-PartII//DataScienceOnBlockchainWithR-PartII.html)
### Events: Reverse engineering
:::: {.columns}
::: {.column width="60%"}
- Weird Whales are managed by a specific smart contract on the Ethereum blockchain. You can find it on [https://etherscan.io/](https://etherscan.io/).
- To make it easier to extract information from the blockchain, we can read the events: dispatched signals (easy to read) the smart contracts can fire.
![](figures/transferEvent.png)
:::
::: {.column width="40%"}
![](figures/smartContractWeirdWhales.png)
:::
::::
### Transfer and sales price
```{r}
resEventTransfer <- GET("https://api.etherscan.io/api",
query = list(module = "logs",
action = "getLogs",
fromBlock = fromBlock,
toBlock = "latest",
address = "0x96ed81c7f4406eff359e27bff6325dc3c9e042bd",
topic0 = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
apikey = EtherScanAPIToken))
```
![](figures/glimpseWeirdWhales.png){height="150px" fig-align="center"}
Where is the sales price? On OpenSea, sales are managed by the main contract and if approved, the second contract is called (here Weird Whales), which then triggers the transfer $\rightarrow$ need to download all the transactions from the OpenSea main smart contract address and then filter for the ones related to Weird Whales (~ 10000 API calls, can take several hours...).
### ETH/USD conversion
![](figures/ETHUSD.png){height="400px" fig-align="center"}
- Ethereum / USD rate is highly volatile, need to obtain the historical ETH market price if we want to convert ETH to USD.
- A spline is fitted on data from the CoinGecko API
:::
## NFTs tracking with EtherScan API: analysis and visualisation
::: {.panel-tabset}
### Descriptive statistics
![](figures/descriptiveStatWeirdWhales.png){fig-align="center"}
### Sales price evolution
![](figures/salesPriceEvolution.png){height=350px fig-align="center"}
- High variability in prices at the begining (buzz), followed by a quieter period
- Some NFTs are exchanged outside the OpenSea Market (price = 0) $\rightarrow$ difficult to trace....
- Starting March 2022, all sales price felt to 0 $\rightarrow$ OpenSea was hacked in March, they probably updated their smart contract...
### Network
- Perfect data to be plotted as a network
- Networks are described by:
- Vertices (or nodes): the wallet addresses
- Edges (or links): the transactions
- *network* package:
```{r}
# create the network using network function
network <- network(edges,
vertex.attr = vertices)
```
### Static network
:::: {layout="[[50, 50]]" layout-valign="center"}
![](figures/staticNetwork.png){height=400px fig-align="center"}
- Each color represents a unique token ID (we restricted to tokens involved in at least 10 transactions to improve visibility)
- All transactions of all the tokens originate from the single minting address (1)
- Some addresses are involved in multiple transactions (curved edges)
- Some tokens were transferred to an address only to be sent back to the sender...?
- Made with the *ggraph* packages
::::
### Longitudinal dimension
![](figures/timeline.png){height=400px fig-align="center"}
About 2/3 of the transactions happened very shortly after the NFT’s creation
### Dynamic network
![](figures/dynamicNetwork.mp4){height=400px fig-align="center"}
- Made with the *network* and *networkDynamic* packages
:::
# Helium IoT network
:::: {layout="[[-5, 40, -5, 20, -5, 20, -5]]" layout-valign="center"}
![Helium, the people’s network. Photo from Nima Mot.](figures/peopleNetwork.jpg)
![A miner](figures/miner.png)
![Rewards](figures/rewardsHelium.png)
::::
- Helium: decentralized wireless infrastructure for IoT devices (environmental sensors, localisation sensors to track bike fleets,...), relies on a blockchain
- People are incentivized to install hotspots and become a part of the network by earning Helium tokens
- Questions:
- How big is the Helium network infrastructure, where are located the hotspots?
- Are they actively utilized, i.e. are they used to transfer data with connected devices?
- Need to download the full blockchain $\rightarrow$ Dewi ETL project is a dedicated ETL server $\rightarrow$ CSV extracts in 10k/50k-block increments on a web server: centralized, bad practice!
- Analysis done in March 2022: [Link](https://tdemarchin.github.io/DataScienceOnBlockchainWithR-PartIII/DataScienceOnBlockchainWithR-PartIII.html)
## Helium IoT network: infrastructure
::: {.panel-tabset}
### Geospatial indexing
:::: {layout="[[50, 50]]" layout-valign="center"}
![Liege](figures/helium2.png)
![](figures/glimpseHelium1.png)
![](figures/descriptiveStatHelium1.png)
::::
- What is the size of the infrastructure? $\rightarrow$ hotspot data
- H3: geospatial indexing system using a hierarchical hexagonal grid. At resolution 8: earth is covered by 691,776,122 hexagons $\rightarrow$ we convert to lat/lng
### Growth and distribution
:::: {.columns aligm-items="center"}
::: {.column width="60%"}
![](figures/hotspotDistribution.png){}
- Some owners own more than 50 hotspots!
- Three phases: (1) a slow linear increase, (2) an exponential increase in the middle of 2021 followed by (3) a fast linear increase (did it continue?)
:::
::: {.column width="40%"}
![Cumulative growth of the network infrastructure in terms of number of hotspots added](figures/growthHotspotHelium.png){}
:::
::::
### Location
![](figures/hotspotLocalizationHelium.png){height=400px fig-align="center"}
- We start by creating an empty world map on which we overlay the hotspot data.
- Plotting all the individual hotspots on a map would be too much (> 500k hotspots) $\rightarrow$ we cluster the hotspots into hexagons (*hexbin* and *geom_hex*)
- Most hotspots are located in North America, Europe and Asia, mostly in big cities. Practically no hotspots in Africa, Russia and very few in South America. A few hotspots in the middle of the ocean... Data issue or cheating to increase rewards?
:::
## Helium IoT network: network usage
::: {.panel-tabset}
### Data transactions
- We understood that hotspots are distributed on the planet and provide good coverage in big cities. Are they being actively used by connected devices and how often? $\rightarrow$ Transaction data
- We can look at network usage from two perspectives: (1) check the volume of data exchanged and (2) check how often the hotspots have been involved in data transfer with connected devices
![](figures/glimpseHelium2.png){fig-align="center"}
### Growth
:::: {layout="[[60, 40]]" layout-valign="center"}
![Total volume of data exchanged](figures/descriptiveStatHelium2.png)
![Cumulative sum of the number of transactions](figures/growthTransactionsHelium.png)
::::
- 40.48% hotspots did not participate in any transaction so far
- Total volume of data exchanged is probably not a good metric $\rightarrow$ network is intented to transfer small data across long distance
- Slow linear increase followed by an exponential increase, which is finally followed by a fast linear increase
- Glitch in November 2021: major outage of the blockchain
- Despite having about 15% of the hotspots, Asia don’t seem to be so active in terms of data transfer
### Dynamic plot
![Evolution of the number of transactions across the globe](figures/dynamicGrowthHelium.gif){height=400px fig-align="center"}
- Transactions mainly occur in North America before mid 2020, then followed by a strong wave in Europe and Asia. Barely no transaction have occurred in South America and Africa.
- Made with the *gganimate* package
### Interactive plot
:::: {layout="[[-12, 50, -12, 50, -12]]" layout-valign="center"}
![Distribution of the transactions in the US](figures/distributionTransactionsUS.webp)
![Distribution of the transactions in Belgium](figures/distributionTransactionsBE.webp)
::::
- US: Transactions are homogeneously distributed across the country although the peaks of activity (note that the legend is logarithmic!) are located around big cities (New York, Los Angeles, San Francisco, Miami)
- BE: Transactions are not homogeneously distributed across the country $\rightarrow$ most transactions happen in the upper part of the country, which is consistent with the lower part of the country being scarcely populated (Ardennes)
- Made with the *rayshader* package, interactive version available in the [original article](https://tdemarchin.github.io/DataScienceOnBlockchainWithR-PartIII/DataScienceOnBlockchainWithR-PartIII.html)
:::
# Conclusion
- Approaches for reading the blockchains and visualizing blockchain transactions
- NFT: how to download and plot a network of transactions associated to a NFT collection $\rightarrow$ interesting patterns in the evolution of NFT prices, which can be linked to social media attention
- Helium: techniques on how to summarise and visualise the network growth in term of infrastructure and data usage $\rightarrow$ the infrastructure is well developed, the network has a lot of spare capacity
$\rightarrow$ obtaining data in an appropriate format requires a deep understanding of the blockchain
$\rightarrow$ visualisation of blockchain data is still challenging because of the huge amount of data, need data scientists to develop blockchain specific approaches
Next step: analyse the networks using the graph theory
Follow me on [Medium](https://tdemarchin.medium.com/), [Linkedin](https://www.linkedin.com/in/tdemarchin/) and/or [Twitter](https://twitter.com/tdemarchin)