Over the rest of the semester, you will work with the Reddit Archive data from January 2021 through the end of August 2022: about 8 TB of uncompressed JSON text, which has been converted to two parquet files (about 1 TB). For a sneak peek at the data, you can download sample files of the submissions and comments:
This is a very rich dataset and it is up to you to decide how you are going to use it. You will select one or more topics of interest to explore and use the Reddit data and external data to perform a meaningful analysis.
You will use Spark on Azure Databricks and Python to analyze, transform, and process your data and create one or more analytical datasets. These are the only tools permitted.
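As a minimal sketch (the parquet paths below are placeholders, not the actual course locations), a first look at the data in a Databricks notebook might start like this:

```python
# A minimal first look at the data with PySpark on Databricks.
# The parquet paths are placeholders -- substitute the real course locations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Databricks already provides `spark`

submissions = spark.read.parquet("dbfs:/path/to/submissions")  # placeholder path
comments = spark.read.parquet("dbfs:/path/to/comments")        # placeholder path

# Inspect the schema and a few rows before planning any transformations
submissions.printSchema()
submissions.show(5, truncate=50)
```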
In your work as data scientists, in addition to doing modeling and machine learning work, you will be responsible (either individually or as part of a team) for providing the following as part of a project:
- Findings: what does the data say?
- Conclusions: what is your interpretation of the data?
- Recommendations: what can be done to address the question/problem at hand?
Your analysis will focus on the first two. It should allow the audience to understand the topic you are analyzing, presenting, and discussing. The objective is to find a topic of interest, work with the data, and present it, using a data-driven approach, to an audience that may not know much about the subject.
The project will be executed over several milestones, each with a specific set of requirements and instructions located in the `instructions/` directory.
There are four major milestones (click on each one for the appropriate instructions and description):
- Milestone 1: Define the questions and Exploratory Data Analysis (You have already done this step)
- Milestone 2: NLP and external data overlay
- Milestone 3: Machine Learning (not available yet)
- Milestone 4: Final delivery (not available yet)
All of your work (with the exception of the first milestone) will be done within this team GitHub repository, and each milestone will be tagged with a specific release tag by the due date.
You will work within an organized repository and apply coding and development best practices. The repository has the following structure:
├── LICENSE
├── README.md
├── code/
├── data/
├── img/
├── instructions/
├── website/
└── website-source/
- The `code/` directory is where you will write all of your scripts. You will have a combination of PySpark and Python notebooks, with one sub-directory per major task area. You may add additional sub-directories as needed to modularize your development.
- The `data/` directory contains any data files used by the project (keep them small; big data files must not be committed, see the deductions below).
- The `img/` directory contains screenshots for the instructions. You will not edit this directory.
- The `instructions/` directory contains all project instructions. The best way to see the instructions is to start from the milestones. You will not edit this directory.
- The `website/` directory is where the final website will be built. Any website asset (image, HTML, etc.) must be added to this directory. You do not need to serve the website, although you can do so if you wish; note, however, that some hosting setups may not support this directory name.
- The `website-source/` directory is where you will develop the website using your preferred method. It must render into `website/` (see the check sketched after this list).
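Whichever tool you use in `website-source/`, a quick sanity check like this hypothetical snippet (run from the repository root) can confirm that the render landed in `website/` with `index.html` at its root:

```python
# Hypothetical sanity check, run from the repository root: verify that the
# rendered site ended up in website/ with index.html as its entry point.
from pathlib import Path

site = Path("website")
assert (site / "index.html").exists(), "website/index.html is missing"
print("website/ contains:", sorted(p.name for p in site.iterdir()))
```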
NOTE: only one cluster is allowed per group, and it must be shared by all members of the group.
Follow these instructions to set up your Databricks environment.
- Your code files must be well organized
- Do not work in a messy repository and then try to clean it up
- In notebooks, use Markdown cells to explain what you are doing and the decisions you are making
- Do not write monolithic notebooks or scripts
- Modularize your code (a script should do a single task)
- Use code comments
- Use functions to promote code reuse (see the sketch after this list)
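For example, a single-purpose helper like the hypothetical one below can be defined once and reused across notebooks; it continues from the earlier sketch and assumes the datasets have a `subreddit` column:

```python
# Hypothetical helper illustrating modular, reusable, commented code.
# Assumes (as the Reddit schema suggests) a `subreddit` column in each dataset.
from typing import List

from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def filter_subreddits(df: DataFrame, subreddits: List[str]) -> DataFrame:
    """Keep only rows that belong to the given subreddits."""
    return df.filter(F.col("subreddit").isin(subreddits))


# One function, reused for both datasets instead of duplicating filter logic
sports_submissions = filter_subreddits(submissions, ["nba", "soccer"])
sports_comments = filter_subreddits(comments, ["nba", "soccer"])
```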
The output of the project will be delivered through a self-contained website in the `website/` subdirectory, with `index.html` as the starting point. You will build the website incrementally over the milestones.
Read the website requirements.
The project will be evaluated using the following high-level criteria:
- Level of analytical rigor at the graduate student level
- Level of technical approach
- Quality and clarity of your writing and overall presentation
- If a deliverable exceeds the requirements and expectations, that is considered A level work.
- If a deliverable just meets the requirements and expectations, that is considered A-/B+ level work.
- If a deliverable does not meet the requirements, that is considered B or lesser level work.
Deductions will be made for any of the following reasons:
- There is a lack of analytical rigor:
- Analytical decisions are not justified
- Analysis is too simplistic
- Big data files are included in the repository
- Instructions are not followed
- There are missing sections of the deliverable
- The overall presentation and/or writing is sloppy
- There are no comments in your code
- There are absolute file paths in your code
- The repository structure is sloppy
- Files are named incorrectly (wrong extensions, wrong case, etc.)