<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>Big Data Analytics</title>

    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="stylesheets/github-light.css">
    <script src="javascripts/scale.fix.js"></script>
    <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
    <!--[if lt IE 9]>
        <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  </head>
  <body>
    <div class="wrapper">
      <header>
        <h1 class="header">Outline</h1>
        <p>
	  <a href="#instructors">Instructors</a><br/>
	  <a href="#lectures">Lectures &amp; Labs</a><br/>
	  <a href="#objectives">Objectives</a><br/>
          <a href="#schedule">Schedule</a><br/>
	  <a href="#books">Books</a><br/>
          <a href="#evaluation">Evaluation</a><br/>
	  <a href="#integrity">Academic Integrity</a><br/>
        </p>
      </header>
      <section>
	<h1>Course Outline, Winter 2018<br/>
	  Big Data Analytics <br/> SOEN 691-UU / SOEN 498-UU
	</h1>
	<h2>
          <a id="instructors" class="anchor"
             href="#instructors" aria-hidden="true">
            <span aria-hidden="true" class="octicon octicon-link"/>
          </a>
	  Instructors
        </h2>
	<p><strong>Coordinator:</strong> Dr. Tristan Glatard<br/>
	  Office: EV 6.225<br/>
	  e-mail: <a href="mailto:tristan.glatard@concordia.ca">tristan.glatard@concordia.ca</a><br/>
	  Office hours: Monday 3pm - 5pm or by appointment.
	</p>
	<p>
	  <strong>Teaching Assistants:</strong>
	  <ul>
	    <li>Zhen Du<br/>
	      e-mail: <a href="mailto:jenkin.du@gmail.com">jenkin.du@gmail.com</a>
	    </li>
	    <li>Huaqiang Kang<br/>
	      e-mail: <a href="mailto:HU_KAN@encs.concordia.ca">HU_KAN@encs.concordia.ca</a>
	    </li>
	  </ul>
	</p>
	<h2>
          <a id="lectures" class="anchor"
             href="#lectures" aria-hidden="true">
            <span aria-hidden="true" class="octicon octicon-link"/>
          </a>
	  Lectures &amp; Labs
        </h2>
	<ul>
	  <li>Lectures: Wednesday 5:45PM - 8:15PM at MB 3.430 SGW.</li>
	  <li>Labs: 
	    <ul>
	      <li>UI (Kang): Wednesday 3:45PM - 5:35PM at H 917 SGW</li>
	      <li>UJ (Du): Tuesday 1:15PM - 3:05PM at H 903 SGW</li>
	    </ul>
	  </li>
	</ul>
	<p><a href="https://moodle.concordia.ca/moodle/course/view.php?id=104000" target="_blank" >Moodle page</a>.</p>
	<h2>
          <a id="objectives" class="anchor"
             href="#objectives" aria-hidden="true">
            <span aria-hidden="true" class="octicon octicon-link"/>
          </a>
	  Objectives
        </h2>
	<p>Over the past few years, Big Data analytics has been
	  transforming industry and science across many domains, making
	  it possible to process terabytes of data on a daily
	  basis. This was enabled by the joint evolution of
	  programming models, data-analysis algorithms and computing
	  infrastructures.</p>
	
	<p>This course introduces the concepts and some of the main
	  algorithms used for Big Data analytics. It presents the
	  principles of the Hadoop ecosystem and Apache Spark, and
	  details the main algorithms for the analysis of large
	  datasets: similarity search, mining of frequent
	  itemsets, graph analysis, clustering, stream mining,
	  recommender systems and advertising.</p>

	<p>By the end of this course, students will be able to write
	  and deploy efficient parallel algorithms to analyze Big Data
	  sources for various applications.</p>

	<h2>
          <a id="schedule" class="anchor"
             href="#schedule" aria-hidden="true">
            <span aria-hidden="true" class="octicon octicon-link"/>
          </a>
	  Schedule
        </h2>
	<table>
	  <tr><th>Date  </th> <th>Lecture                   </th> <th>Lab                                            </th>  <th> Assignments    </th></tr>
	  <tr><td>Jan 10</td> <td>Introduction              </td> <td> <font color="white">None</font>              </td>  <td> <font color="white">None</font> </td></tr>
	  <tr><td>Jan 17</td> <td>MapReduce, Hadoop, Spark (<a href="#mmdsbook">MMDS</a> Ch 2) </td> <td>Getting started with Python, Spark, Git, GitHub</td><td>Reading: <ul><li>Dean J, Ghemawat S. <a href="http://delivery.acm.org/10.1145/1330000/1327492/p107-dean.pdf?ip=132.205.230.2&id=1327492&acc=ACTIVE%20SERVICE&key=FD0067F557510FFB%2EB3808E783D2F97DD%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=1025377651&CFTOKEN=77711060&__acm__=1515606413_2eda3f66ac8c8a1e63cea40be09359f2">MapReduce: simplified data processing on large clusters</a>. Communications of the ACM. 2008 Jan 1;51(1):107-13.</li><li>Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A. <a href="http://homepages.cs.ncl.ac.uk/paolo.missier/doc/p56-zaharia.pdf">Apache Spark: A unified engine for big data processing</a>. Communications of the ACM. 2016 Oct 28;59(11):56-65.</li></ul>
</td></tr>
	  <tr><td>Jan 24</td> <td>Recommender systems (<a href="#mmdsbook">MMDS</a> Ch 9)    </td> <td>Spark RDDs and DataFrames<br/><a href="https://github.com/glatard/bigdata-LA1">Instructions (ask your TA for access)</a></td><td><strong>LA1: "Spark RDD and DataFrame APIs"</strong><br/>Due date: <strong>Jan 26, 11:55pm</strong></td> </tr>
	  <tr><td>Jan 31</td> <td>Clustering (<a href="#mmdsbook">MMDS</a> Ch 7)               </td> <td>Recommender systems</td><td/></tr>
	  <tr><td>Feb 7 </td> <td>Frequent itemsets (<a href="#mmdsbook">MMDS</a> Ch 6)        </td> <td>Clustering</td><td><strong>Project proposal</strong><br/> Due date: <strong>Feb 9, 11:55pm</strong>.  </td></tr>
	  <tr><td>Feb 14</td> <td>Data streams (<a href="#mmdsbook">MMDS</a> Ch 4) </td> <td>Frequent itemsets</td><td/></tr>
          <tr><td colspan="4"><center><font color="white">Spring break</font></center></td></tr>
          <tr><td>Feb 28</td> <td>Similarity search (<a href="#mmdsbook">MMDS</a> Ch 3)</td><td>Data streams</td><td><strong>LA2: "Recommender systems, clustering and frequent itemsets in Spark"</strong><br/>Due date: <strong>Mar 2, 11:55pm</strong></td></tr>
          <tr><td>Mar 7</td>  <td>Graph analysis  (<a href="#mmdsbook">MMDS</a> Ch 5 &amp; 10)</td><td>Similarity search</td><td></td></tr>
          <tr><td>Mar 14</td> <td>Advertising on the Web (<a href="#mmdsbook">MMDS</a> Ch 8) </td> <td>Graph analysis </td><td/></tr>
	  <tr><td>Mar 21</td> <td>Introduction to Machine Learning (<a href="#mmdsbook">MMDS</a> Ch 12)</td> <td>Advertising</td><td><strong>LA3: "Stream mining, similarity search and graph analysis in Spark"</strong><br/>Due date: <strong>Mar 23, 11:55pm</strong></td></tr>
          <tr><td>Mar 28</td> <td><strong>Exam</strong>                  </td> <td><font color="white">None</font> </td><td><font color="white">None</font></td></tr>
          <tr><td>Apr 4 </td> <td><strong>Project presentations</strong>                   </td> <td>Machine learning </td><td/></tr>
          <tr><td>Apr 11 </td><td><strong>Project presentations</strong></td> <td>TBA</td><td><strong>LA4: "Advertising and Machine Learning in Spark"</strong><br/>Due date: <strong>Apr 13, 11:55pm</strong><br/><br/><strong>Project report</strong><br/>Due date: <strong>Apr 13, 11:55pm</strong></td></tr>
</table>
<p>
  <strong>Please note</strong>: In the event of extraordinary circumstances beyond the University's control, the content
  and/or evaluation scheme in this course is subject to change.
</p>
<h2>
  <a id="books" class="anchor"
     href="#books" aria-hidden="true">
    <span aria-hidden="true" class="octicon octicon-link"/>
  </a>
  Books
</h2>
<ul>
  <li>MMDS<a id="mmdsbook"></a> (<strong>Required</strong>):
    Mining of Massive Datasets, <i>Jure Leskovec, Anand Rajaraman, Jeff Ullman</i>, beta version of the 3rd edition.
    <a href="http://i.stanford.edu/~ullman/mmds/book0n.pdf">Available online</a>.</li>
  <li>MapReduce (<strong>Optional</strong>): Data-Intensive Text Processing with MapReduce, <i>Jimmy Lin and Chris Dyer</i>, 1st edition, 2010.
    <a href="http://lintool.github.io/MapReduceAlgorithms/index.html">Available online</a>.</li>
</ul>
<p>A significant portion of the slides presented from session 3 onward will be taken
from <a href="http://www.mmds.org">http://www.mmds.org</a>. This
website also has useful videos explaining the slides.</p>
<h2>
  <a id="evaluation" class="anchor"
     href="#evaluation" aria-hidden="true">
    <span aria-hidden="true" class="octicon octicon-link"/>
  </a>
  Course Evaluation
</h2>
	<p>
	  <strong>Lab assignments (30%)</strong>: You will be required
	  to develop data analysis programs in Python using Apache
	  Spark. There will be a total of four assignments. You must
	  work on these assignments <strong>individually</strong>. The
	  lab assignments are all due on a Friday evening, 11:55pm
	  (see exact dates on the <a href="#schedule">schedule</a>
	  table). A grace period of 48 hours will be automatically
	  granted (assignments will be accepted until Sunday night,
	  11:55pm), but <strong>no further extension will be
	    granted</strong>. Assignments must be submitted through
	  GitHub; ask your TA for detailed instructions.
	</p>
	<p>
	  <strong>Exam (40%)</strong>: The exam is
	  closed-book and will be held on the date
	  indicated on the <a href="#schedule">schedule</a> table,
	  during the lecture. In general, you will need to bring
	  your own ENCS calculator. There will be no substitution
	  for a missed exam. Passing the exam is required
	  to pass the course.
	</p>
	<p>
	  <strong>Project (30%)</strong>: The project should fall into one of the following three categories:
	  <ul>
	    <li><u>Dataset analysis</u>: select a dataset (for instance
	    from your research) and apply at least two techniques seen
	    in the course using Apache Spark. You are not
	    required to re-implement these techniques.</li>
	    <li><u>Technology evaluation</u>: perform a comparative
	    study of at least two open-source technologies related to
	    Big Data Analysis, for instance from
	    the <a href="https://hadoop.apache.org">Hadoop
		project</a>.</li>
	    <li><u>Algorithm implementation</u>: (re-)implement at least two
	    algorithms seen in the course or related to the themes
	    seen in the course.</li>
	  </ul>
	<p>
	  No project template will be provided: you are expected to
	  define your own project based on the instructions
	  above. Other types of (relevant) projects are welcome and
	  can be discussed with the instructor during office hours or
	  on Slack.  You can work on the project <strong>individually
	    or in a team of two</strong>; larger teams will not be
	  accepted.  The project will have the following
	  milestones. Deadlines are indicated on
	  the <a href="#schedule">schedule</a>. No deadline extension
	  will be granted.
	  <ol>
	    <li> The <strong>project proposal (5%)</strong> will be a
	      document of 3 pages or less with the following
	      structure:<ul>
		<li>Abstract: a few sentences summarizing the document.</li>
		<li>I. Introduction: context, objectives, presentation of the problem to solve, related work.</li>
		<li>II. Materials and Methods: the dataset(s), technologies and algorithms that will be used.</li>
	      </ul>
	    Even though the project proposal is only worth 5%, you are
	    strongly encouraged to take it seriously, as it is your
	    chance to get formal feedback on the project before the
	    final deliverables. Besides, the grade obtained for the project
	    proposal is a good predictor of the grades obtained for
	      the project report and presentation.
	    </li>
	    <li> The <strong>project report (15%)</strong> will be a document of 6 pages or less with the following structure:
	      <ul>
		<li>Abstract: as in the project proposal.</li>
		<li>I. Introduction: as in the proposal.</li>
		<li>II. Materials and Methods: as in the proposal.</li>
		<li>III. Results: a description of the result of the
		study (dataset analysis, technology comparison or
		implementation), with quantitative data obtained by
		the project team (graphs, tables, metrics, etc).</li>
		<li>IV. Discussion: a discussion of the relevance of
		the solution(s), of the limitations and of possible
		future work.</li>
	      </ul>
	      Project proposals and reports will be submitted
	      through <a href="https://moodle.concordia.ca/moodle/course/view.php?id=93782"
			 target="_blank" >Moodle</a>.  They will be evaluated using the following
	      criteria:
	      <ul>
		<li>Clarity (writing, organization, formatting) (20%)</li>
		<li>Relevance to the course topics (20%)</li>
		<li>Technical quality (60%)</li>
	      </ul>
	      All criteria will be assessed on a 4-level scale: unacceptable, average, good, exceptional.</li>
	    <li>The <strong>project presentation (10%)</strong> will be a 5-minute presentation of the project. It will be evaluated using the following
	      criteria:
	      <ul>
		<li>Clarity (slides and speech) (2%)</li>
		<li>Relevance (2%)</li>
		<li>Technical quality (6%)</li>
	      </ul>
	      Expect 1 or 2 questions after your presentation.  All criteria will be assessed on a 4-level scale: unacceptable, average, good, exceptional. </li>
	  </ol>
	</p>
	<p><strong>Grading Scheme</strong>: A passing mark on each of the 3
	  deliverables (lab assignments, exam and
	  project) is required to get a passing grade for the course.
	  The course grade will be computed from the relative
	  percentages assigned to the assignments, the project and the exam. There is no
	  standard rule for translating numerical grades into letter grades.</p>
	  <h2>
	    <a id="integrity" class="anchor"
	       href="#integrity" aria-hidden="true">
	      <span aria-hidden="true" class="octicon octicon-link"/>
	    </a>
	    Academic Integrity
	  </h2>
	  
	  <p>Violation of the Academic Code of Conduct in any form will be
	  dealt with severely. This includes copying (even with modifications)
	  of program segments. You must demonstrate independent thought through
	  your submitted work.  See the following link for more
	  information: <a href="http://www.concordia.ca/students/academic-integrity.html"
			  target="_blank">http://www.concordia.ca/students/academic-integrity.html</a>.</p>
      </section>
    </div>
  </body>
</html>