<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: https://www.facebook.com/2008/fbml">
<head>
<title>Microservices and Terabits - pywren</title>
<!-- Using the latest rendering mode for IE -->
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="canonical" href="/pywren_s3.html">
<meta name="author" content="Eric Jonas" />
<meta name="keywords" content="python" />
<meta name="description" content="Using Pywren to benchmark S3, we achieve over 80 GB/sec of read performance and 60 GB/sec of write performance using Amazon S3." />
<meta property="og:site_name" content="pywren" />
<meta property="og:type" content="article"/>
<meta property="og:title" content="Microservices and Terabits"/>
<meta property="og:url" content="/pywren_s3.html"/>
<meta property="og:description" content="Using Pywren to benchmark S3, we achieve over 80 GB/sec of read performance and 60 GB/sec of write performance using Amazon S3."/>
<meta property="article:published_time" content="2016-10-27" />
<meta property="article:section" content="benchmarks, s3" />
<meta property="article:tag" content="python" />
<meta property="article:author" content="Eric Jonas" />
<!-- Bootstrap -->
<link rel="stylesheet" href="/theme/css/bootstrap.spacelab.min.css" type="text/css"/>
<link href="/theme/css/font-awesome.min.css" rel="stylesheet">
<link href="/theme/css/pygments/friendly.css" rel="stylesheet">
<link rel="stylesheet" href="/theme/css/style.css" type="text/css"/>
<link href="/static/custom.css" rel="stylesheet">
</head>
<body>
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container ">
<div class="row">
<div class="container col-lg-8 col-lg-offset-2">
<div class="navbar-header ">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-ex1-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="/" class="navbar-brand">
pywren </a>
</div>
<div class="collapse navbar-collapse navbar-ex1-collapse">
<ul class="nav navbar-nav">
<li><a href="/blog.html">blog</a></li>
<li><a href="/pages/gettingstarted.html">getting started</a></li>
<li><a href="http://pywren.io/docs">documentation</a></li>
<li><a href="https://github.com/pywren/examples/">examples</a></li>
<li><a href="http://github.com/pywren/pywren">code</a></li>
<li><a href="https://github.com/pywren/pywren/issues">bugs</a></li>
</ul>
<ul class="nav navbar-nav navbar-right">
</ul>
</div>
<!-- /.navbar-collapse -->
</div> </div>
</div>
</div> <!-- /.navbar -->
<!-- Banner -->
<!-- End Banner -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 ">
<section id="content">
<article>
<header class="page-header">
<h1>
<a href="/pywren_s3.html"
rel="bookmark"
title="Permalink to Microservices and Terabits">
Microservices and Terabits
</a>
</h1>
</header>
<div class="entry-content">
<div class="panel">
<div class="panel-body">
<footer class="post-info">
<span class="label label-default">Date</span>
<span class="published">
<i class="fa fa-calendar"></i><time datetime="2016-10-27T00:00:00-07:00"> Thu 27 October 2016</time>
</span>
<span class="label label-default">Tags</span>
<a href="/tag/python.html">python</a>
</footer><!-- /.post-info --> </div>
</div>
<h2>Making S3 (almost) as fast as local memory</h2>
<p><a href="https://aws.amazon.com/lambda/">AWS Lambda</a> is amazing,
and
<a href="/pywren.html">as I talked about last time, can be used for some pretty serious compute</a>. But
many of our potential use cases involve data preprocessing and data
manipulation -- the so-called extract, transform, and load (ETL) part
of data science. Often people code up Hadoop or Spark jobs to do
this. I think for many scientists, #thecloudistoodamnhard -- learning
to write <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://spark.apache.org/">Spark</a> jobs requires both thinking about cluster
allocation and learning a Java/Scala stack that many are
unfamiliar with. What if I just want to resize some images or extract
some simple features? Can we use AWS Lambda for high-throughput ETL
workloads? I wanted to see how
fast <a href="https://github.com/ericmjonas/pywren">PyWren</a> could get the job
done. This necessitated benchmarking S3 read and write from within
Lambda.</p>
<p>I wrote an example
script,
<a href="https://github.com/ericmjonas/pywren/blob/master/examples/s3_benchmark.py"><code>s3_benchmark.py</code></a>, to
first write a bunch of objects with pseudorandom data to S3, and then
read those objects back. The data is pseudorandom to make sure our
results aren't confounded by compression, and everything is
done with streaming file-like objects in Python. To write 1800 2 GB objects
to the S3 bucket <code>jonas-pywren-benchmark</code> I can use the following
command:</p>
<div class="codehilite"><pre><span></span>$ python s3_benchmark.py write --bucket_name<span class="o">=</span>jonas-pywren-benchmark <span class="se">\</span>
--mb_per_file<span class="o">=</span><span class="m">2000</span> --number<span class="o">=</span><span class="m">1800</span> --key_file<span class="o">=</span>big_keys.txt
</pre></div>
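<p>To make the idea of a streaming pseudorandom source concrete, here is a minimal sketch of a read-only file-like object that generates its bytes on demand. This is not the code in <code>s3_benchmark.py</code>; the class name <code>RandomDataStream</code>, its <code>seed</code> parameter, and the <code>compression_ratio</code> helper are hypothetical names chosen for illustration.</p>

```python
import io
import random
import zlib


class RandomDataStream(io.RawIOBase):
    """Read-only file-like object that yields total_bytes of pseudorandom data.

    Hypothetical illustration; not taken from s3_benchmark.py.
    """

    def __init__(self, total_bytes, seed=0):
        super().__init__()
        self._remaining = total_bytes
        self._rng = random.Random(seed)

    def readable(self):
        return True

    def readinto(self, buf):
        # Fill the caller's buffer without ever materializing the whole object.
        n = min(len(buf), self._remaining)
        if n == 0:
            return 0  # signals end of stream
        buf[:n] = self._rng.getrandbits(8 * n).to_bytes(n, "little")
        self._remaining -= n
        return n


def compression_ratio(data):
    """Compressed size divided by original size (about 1.0 means incompressible)."""
    return len(zlib.compress(data)) / len(data)


stream = RandomDataStream(total_bytes=1_000_000, seed=42)
chunk = stream.read(64 * 1024)     # pull one 64 KiB chunk from the stream
ratio = compression_ratio(chunk)   # stays close to 1.0 for pseudorandom bytes
```

<p>A stream like this can be handed to anything that accepts a readable file object. With boto3, for example, <code>upload_fileobj</code> takes such an object, though whether the benchmark script uploads this way is an assumption on my part.</p>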
<p>Each object is placed at a random key, and the keys used are written
to <code>big_keys.txt</code>, one per line. The script also generates a Python pickle
of <code>(start time, stop time, transfer rate)</code> for each job. We can then
read these generated S3 objects with:</p>
<div class="codehilite"><pre><span></span>$ python s3_benchmark.py <span class="nb">read</span> --bucket_name<span class="o">=</span>jonas-pywren-benchmark <span class="se">\</span>
--number<span class="o">=</span><span class="m">1800</span> --key_file<span class="o">=</span>big_keys.txt
</pre></div>
<p>We can look at the distribution of per-job throughput, the job runtimes
for the read and write benchmarks, and the total aggregate throughput
(colors are consistent -- green is write, blue is read):</p>
<p><a href="/images/pywren.s3.png"><img src="/images/pywren.s3.png" style="max-width:100%"></a></p>
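<p>The aggregate curve can be derived from the per-job records alone. The sketch below assumes each record is a <code>(start time, stop time, transfer rate)</code> tuple with times in seconds and rate in MB/sec, and that a job transfers at its average rate for its whole lifetime. The function names and sample values are made up for illustration, and nothing is assumed about how the pickle file itself is laid out.</p>

```python
def aggregate_throughput(jobs, t):
    """Sum the rates of all jobs that are running at time t."""
    return sum(rate for start, stop, rate in jobs if start <= t < stop)


def throughput_curve(jobs, step=1.0):
    """Sample the aggregate throughput every `step` seconds over the whole run."""
    t0 = min(start for start, _, _ in jobs)
    t1 = max(stop for _, stop, _ in jobs)
    n_samples = int((t1 - t0) / step) + 1
    times = [t0 + i * step for i in range(n_samples)]
    return [(t, aggregate_throughput(jobs, t)) for t in times]


# Made-up sample records: (start_sec, stop_sec, rate_mb_per_sec)
sample_jobs = [
    (0.0, 10.0, 30.0),
    (5.0, 15.0, 40.0),
]

curve = throughput_curve(sample_jobs)
peak = max(rate for _, rate in curve)  # highest sampled aggregate rate
```

<p>With the two sample jobs, the aggregate is highest while both are running (between 5 and 10 seconds), which is the kind of peak the aggregate plot reports.</p>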
<p>Note that at the peak, we have <strong>over 80 GB/sec read and 60 GB/sec
write</strong> to S3 -- that's nearly half a terabit a second of IO! For comparison, high-end Intel Haswell Xeons get about 100 GB/sec
<em>to RAM</em>. On average, we see per-object write speeds of ~30 MB/sec and read
speeds of ~40 MB/sec. The amazing part is that this is nearly linear
scaling of read and write throughput to S3.</p>
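<p>The unit conversions behind these figures are easy to check. The sketch below uses the 80 GB/sec read and 60 GB/sec write peaks from this page's description, together with the ~30 and ~40 MB/sec per-object averages, in decimal units. The second helper is only a rough cross-check: it assumes all 1800 jobs overlap in time, which the post does not state, and both function names are hypothetical.</p>

```python
def gb_per_sec_to_tbit_per_sec(gb_per_sec):
    """Convert gigabytes/sec to terabits/sec (8 bits per byte, 1000 Gbit per Tbit)."""
    return gb_per_sec * 8 / 1000


def ideal_aggregate_gb_per_sec(num_jobs, mb_per_sec_per_job):
    """Aggregate rate if every job ran concurrently at the per-job average rate."""
    return num_jobs * mb_per_sec_per_job / 1000


write_tbit = gb_per_sec_to_tbit_per_sec(60)   # roughly half a terabit/sec
read_tbit = gb_per_sec_to_tbit_per_sec(80)

# Assumption: all 1800 jobs run at once at their average per-object rate.
ideal_write = ideal_aggregate_gb_per_sec(1800, 30)
ideal_read = ideal_aggregate_gb_per_sec(1800, 40)
```

<p>Under that assumption the average rates alone would give about 54 GB/sec write and 72 GB/sec read, the same order as the observed peaks.</p>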
<p>I struggle to comprehend this level of scaling. We're working
on some applications now where this really changes our ability to quickly
and easily try out new machine learning and data analysis pipelines. <a href="/pywren.html">As I mentioned when talking about compute throughput</a>, this is peak performance -- it's likely real workloads will be a bit slower. Still, these rates are amazing! </p>
</div>
<!-- /.entry-content -->
</article>
</section>
</div>
</div>
</div>
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2"> <hr>
© 2017 University of California, Berkeley
· Powered by <a href="https://github.com/getpelican/pelican-themes/tree/master/pelican-bootstrap3" target="_blank">pelican-bootstrap3</a>,
<a href="http://docs.getpelican.com/" target="_blank">Pelican</a>,
<a href="http://getbootstrap.com" target="_blank">Bootstrap</a> </div>
</div>
</div>
</footer>
<script src="/theme/js/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="/theme/js/bootstrap.min.js"></script>
<!-- Enable responsive features in IE8 with Respond.js (https://github.com/scottjehl/Respond) -->
<script src="/theme/js/respond.min.js"></script>
<script src="/theme/js/bodypadding.js"></script>
</body>
</html>