<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Norvig Web Data Science Award</title>
<meta name="description" content="">
<meta name="author" content="">
<!-- Le styles -->
<link href="assets/css/bootstrap.css" rel="stylesheet">
<link href="assets/css/nbwsa-2014.css" rel="stylesheet">
<!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-36109664-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body>
<!-- Navbar
================================================== -->
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<button type="button" class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="brand" href="./index.html">Norvig Award</a>
<div class="nav-collapse collapse">
<ul class="nav">
<li class="">
<a href="index.html">Home</a>
</li>
<li class="">
<a href="learnmore.html">Learn more</a>
</li>
<li class="">
<a href="apply.html">Apply</a>
</li>
<li class="active">
<a href="gettingstarted.html">Getting started</a>
</li>
<li class="">
<a href="submittingresults.html">Submit results</a>
</li>
<li class="">
<a href="faq.html">FAQ</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="jumbotron masthead">
<div class="container">
<h1>Norvig Web Data Science Award</h1>
<p>show what you can do with 3 billion web pages<small><br/>by <a href="http://www.surfsara.nl"><img alt="SURFsara" src="assets/images/sara.logo.png"></a> and <a href="http://www.commoncrawl.org"><img src="assets/images/commoncrawl-small.png" alt="CommonCrawl"></a></small></p>
</div>
</div>
<div class="container">
<div class="span9">
<div class="page-header">
<h1>Start hacking on your own ideas!</h1>
</div>
<p>To start experimenting you need to know the <em>location</em> and
the <em>format</em> of the dataset. This page points you to both and gives
the information you need to run your Hadoop programs on the VM and
on the cluster.</p>
<h2>Fair use policy</h2>
<p>First of all: make sure you are familiar with the <a
href="faq.html#fairuse">fair-use policy</a>.</p>
<h2>The datasets</h2>
<p>We have created two subsets of the Common Crawl set hosted at SURFsara: a
single file available for download for use on the VM, and a single segment on the Hadoop
cluster. See the FAQ for the <a href="faq.html#datasets">location and size</a> of the test
sets.</p>
<p>The dataset contains four different types of files: SEQ, WARC, WET and WAT files. You can find <a
href="examples.html#data">a
description of the file formats</a> on the examples page.</p>
<h2>Using the Hadoop cluster</h2>
<p>Once your program runs correctly on your local machine it is time to
move to a bigger dataset and a bigger machine: SURFsara's Hadoop cluster
(called Hathi). There are just a few things that you need to pay
attention to:</p>
<ul>
<li>Change the input path to the <strong>TEST</strong> set on the
cluster (see above).</li>
<li>Authenticate before submitting your job. You can do this by opening
a terminal and running <code>kinit USERNAME</code>, where USERNAME is the username
you received by email after applying. You only need to do this
once per session.</li>
</ul>
<h3>Submitting a MapReduce job</h3>
<p>As with the examples shown earlier, build a jar from your source and run it with <code>yarn jar</code>.</p>
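<p>A minimal sketch of such a submission, assuming a jar named
<code>myjob.jar</code> whose main class <code>org.example.MyJob</code> takes an
input path and an output path as arguments. The jar name, class name and both
paths are placeholders for illustration and are not part of the provided
examples:</p>
<pre><code>$ kinit USERNAME
Password for USERNAME@CUA.SURFSARA.NL:
$ yarn jar myjob.jar org.example.MyJob INPUT_PATH OUTPUT_PATH</code></pre>
<p>Replace <code>INPUT_PATH</code> with the location of the test set on the
cluster and <code>OUTPUT_PATH</code> with a directory you are allowed to write
to.</p>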
<h3>Submitting a pig job</h3>
<p>You can run a Pig job on the cluster by removing <code>-x local</code> from
the command line:</p>
<pre><code>$ kinit USERNAME
Password for USERNAME@CUA.SURFSARA.NL:
$ pig myjob.pig</code></pre>
</div>
<div class="span2" id="sponsors">
<ul class="thumbnails">
<li>
<a href="http://www.surfsara.nl" class="thumbnail">
<img src="assets/images/sara.logo.png" alt="SURFsara">
</a>
</li>
<li>
<a href="http://www.commoncrawl.org" class="thumbnail">
<img src="assets/images/cc.logo.png" alt="Common Crawl">
</a>
</li>
<li>
<a href="http://www.github.com" class="thumbnail">
<img src="assets/images/github.logo.png" alt="Github">
</a>
</li>
</ul>
</div>
</div>
<!-- Footer
================================================== -->
<footer class="footer">
<div class="container">
<p class="pull-right"><a href="#">Back to top</a></p>
<p>Design adapted from <a href="http://www.twitter.com">Twitter</a>’s <a href="http://twitter.github.com/bootstrap">Bootstrap</a> page</p>
</div>
</footer>
<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
</body>
</html>