forked from tgteacher/big-data-analytics-course
outline.html
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>Big Data Analytics</title>
<link rel="stylesheet" href="stylesheets/styles.css">
<link rel="stylesheet" href="stylesheets/github-light.css">
<script src="javascripts/scale.fix.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body>
<div class="wrapper">
<header>
<h1 class="header">Outline</h1>
<p>
<a href="#instructors">Instructors</a><br/>
<a href="#lectures">Lectures &amp; Labs</a><br/>
<a href="#objectives">Objectives</a><br/>
<a href="#schedule">Schedule</a><br/>
<a href="#books">Books</a><br/>
<a href="#evaluation">Evaluation</a><br/>
<a href="#integrity">Academic Integrity</a><br/>
</p>
</header>
<section>
<h1>Course Outline, Winter 2018<br/>
Big Data Analytics <br/> SOEN 691-UU / SOEN 498-UU
</h1>
<h2>
<a id="instructors" class="anchor"
href="#instructors" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Instructors
</h2>
<p><strong>Coordinator:</strong> Dr. Tristan Glatard<br/>
Office: EV 6.225<br/>
e-mail: <a href="mailto:tristan.glatard@concordia.ca">tristan.glatard@concordia.ca</a><br/>
Office hours: Monday 3pm - 5pm or by appointment.
</p>
<p>
<strong>Teaching Assistants:</strong>
</p>
<ul>
<li>Zhen Du<br/>
e-mail: <a href="mailto:jenkin.du@gmail.com">jenkin.du@gmail.com</a>
</li>
<li>Huaqiang Kang<br/>
e-mail: <a href="mailto:HU_KAN@encs.concordia.ca">HU_KAN@encs.concordia.ca</a>
</li>
</ul>
<h2>
<a id="lectures" class="anchor"
href="#lectures" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Lectures &amp; Labs
</h2>
<ul>
<li>Lectures: Wednesday 5:45PM - 8:15PM at MB 3.430 SGW.</li>
<li>Labs:
<ul>
<li>UI (Kang): Wednesday 3:45PM - 5:35PM at H 917 SGW</li>
<li>UJ (Du): Tuesday 1:15PM - 3:05PM at H 903 SGW</li>
</ul>
</li>
</ul>
<p><a href="https://moodle.concordia.ca/moodle/course/view.php?id=104000" target="_blank" >Moodle page</a>.</p>
<h2>
<a id="objectives" class="anchor"
href="#objectives" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Objectives
</h2>
<p>Big Data analytics has been transforming industry and
science in various domains for the past few years, making
possible the processing of terabytes of data on a daily
basis. This was enabled by the joint evolution of
programming models, data-analysis algorithms and computing
infrastructures.</p>
<p>This course introduces the concepts and some of the main
algorithms used for Big Data analytics. It presents the
principles of the Hadoop ecosystem and Apache Spark, and it
details the main algorithms for the analysis of large
datasets, related to similarity search, mining of frequent
itemsets, graph analysis, clustering, stream mining,
recommender systems and advertising.</p>
<p>By the end of this course, students will be able to write
and deploy efficient parallel algorithms to analyze Big Data
sources for various applications.</p>
<h2>
<a id="schedule" class="anchor"
href="#schedule" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Schedule
</h2>
<table>
<tr><th>Date </th> <th>Lecture </th> <th>Lab </th> <th> Assignments </th></tr>
<tr><td>Jan 10</td> <td>Introduction </td> <td> <font color="white">None</font> </td> <td> <font color="white">None</font> </td></tr>
<tr><td>Jan 17</td> <td>MapReduce, Hadoop, Spark (<a href="#mmdsbook">MMDS</a> Ch 2) </td> <td>Getting started with Python, Spark, Git, GitHub</td><td>Reading: <ul><li>Dean J, Ghemawat S. <a href="http://delivery.acm.org/10.1145/1330000/1327492/p107-dean.pdf?ip=132.205.230.2&id=1327492&acc=ACTIVE%20SERVICE&key=FD0067F557510FFB%2EB3808E783D2F97DD%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=1025377651&CFTOKEN=77711060&__acm__=1515606413_2eda3f66ac8c8a1e63cea40be09359f2">MapReduce: simplified data processing on large clusters</a>. Communications of the ACM. 2008 Jan 1;51(1):107-13.</li><li>Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A. <a href="http://homepages.cs.ncl.ac.uk/paolo.missier/doc/p56-zaharia.pdf">Apache Spark: A unified engine for big data processing</a>. Communications of the ACM. 2016 Oct 28;59(11):56-65.</li></ul>
</td></tr>
<tr><td>Jan 24</td> <td>Recommender systems (<a href="#mmdsbook">MMDS</a> Ch 9) </td> <td>Spark RDDs and DataFrames<br/><a href="https://github.com/glatard/bigdata-LA1">Instructions (ask your TA for access)</a></td><td><strong>LA1: "Spark RDD and DataFrame APIs"</strong><br/>Due date: <strong>Jan 26, 11:55pm</strong></td> </tr>
<tr><td>Jan 31</td> <td>Clustering (<a href="#mmdsbook">MMDS</a> Ch 7) </td> <td>Recommender systems</td><td></td></tr>
<tr><td>Feb 7 </td> <td>Frequent itemsets (<a href="#mmdsbook">MMDS</a> Ch 6) </td> <td>Clustering</td><td><strong>Project proposal</strong><br/> Due date: <strong>Feb 9, 11:55pm</strong>. </td></tr>
<tr><td>Feb 14</td> <td>Data streams (<a href="#mmdsbook">MMDS</a> Ch 4) </td> <td>Frequent itemsets</td><td></td></tr>
<tr><td colspan="4"><center><font color="white">Spring break</font></center></td></tr>
<tr><td>Feb 28</td> <td>Similarity search (<a href="#mmdsbook">MMDS</a> Ch 3)</td><td>Data streams</td><td><strong>LA2: "Recommender systems, clustering and frequent itemsets in Spark"</strong><br/>Due date: <strong>Mar 2, 11:55pm</strong></td></tr>
<tr><td>Mar 7</td> <td>Graph analysis (<a href="#mmdsbook">MMDS</a> Ch 5 &amp; 10)</td><td>Similarity search</td><td></td></tr>
<tr><td>Mar 14</td> <td>Advertising on the Web (<a href="#mmdsbook">MMDS</a> Ch 8) </td> <td>Graph analysis </td><td></td></tr>
<tr><td>Mar 21</td> <td>Introduction to Machine Learning (<a href="#mmdsbook">MMDS</a> Ch 12)</td> <td>Advertising</td><td><strong>LA3: "Stream mining, similarity search and graph analysis in Spark"</strong><br/>Due date: <strong>Mar 23, 11:55pm</strong></td></tr>
<tr><td>Mar 28</td> <td><strong>Exam</strong> </td> <td><font color="white">None</font> </td><td><font color="white">None</font></td></tr>
<tr><td>Apr 4 </td> <td><strong>Project presentations</strong> </td> <td>Machine learning </td><td></td></tr>
<tr><td>Apr 11 </td><td><strong>Project presentations</strong></td> <td>TBA</td><td><strong>LA4: "Advertising and Machine Learning in Spark"</strong><br/>Due date: <strong>Apr 13, 11:55pm</strong><br/><br/><strong>Project report</strong><br/>Due date: <strong>Apr 13, 11:55pm</strong></td></tr>
</table>
<p>
<strong>Please note</strong>: In the event of extraordinary circumstances beyond the University's control, the content
and/or evaluation scheme in this course is subject to change.
</p>
<h2>
<a id="books" class="anchor"
href="#books" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Books
</h2>
<ul>
<li>MMDS<a id="mmdsbook"></a> (<strong>Required</strong>):
Mining of Massive Datasets, <i>Jure Leskovec, Anand Rajaraman, Jeff Ullman</i>, beta version of the 3rd edition.
<a href="http://i.stanford.edu/~ullman/mmds/book0n.pdf">Available online</a>.</li>
<li>MapReduce (<strong>Optional</strong>): Data-Intensive Text Processing with MapReduce, <i>Jimmy Lin and Chris Dyer</i>, 1st edition, 2010.
<a href="http://lintool.github.io/MapReduceAlgorithms/index.html">Available online</a>.</li>
</ul>
<p>A significant portion of the slides presented from session 3 onward will be taken
from <a href="http://www.mmds.org">http://www.mmds.org</a>. This
website also has useful videos explaining the slides.</p>
<h2>
<a id="evaluation" class="anchor"
href="#evaluation" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Course Evaluation
</h2>
<p>
<strong>Lab assignments (30%)</strong>: You will be required
to develop data analysis programs in Python using Apache
Spark. There will be a total of four assignments. You must
work on these assignments <strong>individually</strong>. The
lab assignments are all due on a Friday evening, 11:55pm
(see exact dates on the <a href="#schedule">schedule</a>
table). A grace period of 48 hours will be automatically
granted (assignments will be accepted until Sunday night,
11:55pm), but <strong>no further extension will be
granted</strong>. Assignments must be submitted through
GitHub; ask your TA for detailed instructions.
</p>
<p>
<strong>Exam (40%)</strong>: The exam is a
closed-book exam and will be conducted on the date
indicated on the <a href="#schedule">schedule</a> table,
during the lecture. In general, you will need to bring
your own ENCS calculator. There will be no substitution
for a missed exam. Passing the term exam is necessary
for passing the course.
</p>
<p>
<strong>Project (30%)</strong>: The project should fall into one of the following three categories:
<ul>
<li><u>Dataset analysis</u>: select a dataset (for instance
from your research) and apply at least two techniques seen
in the course using Apache Spark. You are not
required to re-implement these techniques.</li>
<li><u>Technology evaluation</u>: perform a comparative
study of at least two open-source technologies related to
Big Data Analysis, for instance from
the <a href="https://hadoop.apache.org">Hadoop
project</a>.</li>
<li><u>Algorithm implementation</u>: (re-)implement at least two
algorithms seen in the course or related to the themes
seen in the course.</li>
</ul>
<p>
No project template will be provided: you are expected to
define your own project based on the instructions
above. Other types of (relevant) projects are welcome and
can be discussed with the instructor during office hours or
on Slack. You can work on the project <strong>individually
or in a team of two</strong>; larger teams will not be
accepted. The project will have the following
milestones. Deadlines are indicated on
the <a href="#schedule">schedule</a>. No deadline extension
will be granted.
<ol>
<li> The <strong>project proposal (5%)</strong> will be a
document of 3 pages or less with the following
structure:<ul>
<li>Abstract: a few sentences summarizing the document.</li>
<li>I. Introduction: context, objectives, presentation of the problem to solve, related work.</li>
<li>II. Materials and Methods: the dataset(s), technologies and algorithms that will be used.</li>
</ul>
Even though the project proposal is only worth 5%, you are
strongly encouraged to take it seriously as it is your
chance to get formal feedback on the project before the
final deliverables. Besides, the grade obtained for the project
proposal is a good predictor of the grades obtained for
the project report and presentation.
</li>
<li> The <strong>project report (15%)</strong> will be a document of 6 pages or less with the following structure:
<ul>
<li>Abstract: as in the project proposal.</li>
<li>I. Introduction: as in the proposal.</li>
<li>II. Materials and Methods: as in the proposal.</li>
<li>III. Results: a description of the results of the
study (dataset analysis, technology comparison or
implementation), with quantitative data obtained by
the project team (graphs, tables, metrics, etc.).</li>
<li>IV. Discussion: a discussion of the relevance of
the solution(s), of the limitations and of possible
future work.</li>
</ul>
Project proposals and reports will be submitted
through <a href="https://moodle.concordia.ca/moodle/course/view.php?id=93782"
target="_blank" >Moodle</a>. They will be evaluated using the following
criteria:
<ul>
<li>Clarity (writing, organization, formatting) (20%)</li>
<li>Relevance to the course topics (20%)</li>
<li>Technical quality (60%)</li>
</ul>
All criteria will be assessed on a 4-level scale: unacceptable, average, good, exceptional.</li>
<li>The <strong>project presentation (10%)</strong> will be a 5-minute presentation of the project. It will be evaluated using the following
criteria:
<ul>
<li>Clarity (slides and speech) (2%)</li>
<li>Relevance (2%)</li>
<li>Technical quality (6%)</li>
</ul>
Expect 1 or 2 questions after your presentation. All criteria will be assessed on a 4-level scale: unacceptable, average, good, exceptional. </li>
</ol>
</p>
<p><strong>Grading Scheme</strong>: A passing mark on each of the three
deliverables (lab assignments, exam and
project) is required to get a passing grade for the course. The
course grade will be computed from the relative
percentages assigned to the lab assignments, the exam and the project. There is no
standard rule for translating percentage grades into letter grades.</p>
<h2>
<a id="integrity" class="anchor"
href="#integrity" aria-hidden="true">
<span aria-hidden="true" class="octicon octicon-link"></span>
</a>
Academic Integrity
</h2>
Any violation of the Academic Code of Conduct will be
dealt with severely. This includes copying (even with modifications)
of program segments. You must demonstrate independent thought through
your submitted work. Click on the following link for more
information: <a href="http://www.concordia.ca/students/academic-integrity.html"
target="_blank">http://www.concordia.ca/students/academic-integrity.html</a>.
</section>
</body>
</html>