© Explore Data Science Academy
This predict contains multiple notebooks that you will be required to update in order to arrive at your solution. To successfully complete this predict you will need to:
- log in to your GitHub account;
- fork this repo to create your own remote copy; and
- clone the forked repo on your account so you can work with the files locally on your machine.
You are part of a small, dynamic team of Data Scientists trying to create the latest and greatest automated stock market trading robots. You assume the role of a Data Engineer and are responsible for ensuring that the Data Science team has access to high-quality training datasets.
You will need to perform data profiling, quality analysis, and testing on the ingested data to ensure that the dataset is of sufficient quality. Data quality analysis and testing have to be done in such a way as to be generic and abstracted to the level where the notebook(s) can be run for any point in time, on a regular schedule, or as part of an ingestion pipeline.
The data that you will be working with is a historical snapshot of market data taken from the Nasdaq electronic market. This dataset contains historical daily prices for all tickers currently trading on Nasdaq. The up-to-date list can be found on their website.
The dataset for the predict can be found in an AWS S3 bucket called processing-big-data-predict-stocks-data. The bucket contains two items:
- Stock data: stocks.zip
- Metadata: symbols_valid_meta.csv
You need to download the data stored in the S3 bucket and store it on your local machine.
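As a minimal sketch of this step, the data could be pulled down with boto3 and unzipped locally. The object keys below simply mirror the item names listed above, and the anonymous (unsigned) access configuration is an assumption; adjust both if the bucket requires credentials or uses different keys.

```python
import zipfile

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "processing-big-data-predict-stocks-data"

# Assumes the bucket allows anonymous (unsigned) reads; use real credentials otherwise.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Object keys assumed to match the item names listed above.
s3.download_file(BUCKET, "stocks.zip", "stocks.zip")
s3.download_file(BUCKET, "symbols_valid_meta.csv", "symbols_valid_meta.csv")

# Unpack the stock data into a local working directory.
with zipfile.ZipFile("stocks.zip") as zf:
    zf.extractall("data/stocks")
```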
This first step in the data engineering process serves to ingest the data into a working environment, making it available for wider use by the rest of the Data Science team.
Typically, you will use a specific subset of the dataset to develop your ingestion, data quality and testing process. This ensures that you first create a robust process before starting on the productionisation of the process and code.
In this task, you will be using the first data subset (1962) in the dataset as a testing set.
Some of the reasons for choosing this subset are:
- it is the first year within the dataset;
- it contains a small portion of the dataset and will be very fast to develop on, while not requiring a lot of computation; and
- since it is the oldest in the dataset, it likely contains the most errors.
At the end of this task, your notebook should be able to produce parquet files for a specific year.
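A minimal PySpark sketch of this step is shown below. The local directory layout (data/stocks/1962/*.csv) and the output path are assumptions based on the unzipped archive; adapt them to wherever you stored the extracted files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stocks-ingestion-1962").getOrCreate()

# Assumed layout: the extracted CSV files for the chosen year sit under data/stocks/1962/.
year = "1962"
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(f"data/stocks/{year}/*.csv"))

# Write the ingested subset out as parquet for downstream use.
df.write.mode("overwrite").parquet(f"output/stocks_{year}.parquet")
```

The same notebook can then be re-run for any other year simply by changing the `year` parameter, which is what makes the process generic enough to schedule or slot into an ingestion pipeline.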
This section is mainly concerned with the setup and ingestion of the dataset; you will be assessed using multiple-choice questions.
The following actions need to be taken to complete the assessments in this section:
- Submit Notebook
- Complete Data Ingestion MCQ
The below resource should be used to complete the work found in this section:
After the data has been ingested and basic sense checks have been performed on the dataset, the next step is to ensure that we have a full view of the dataset. This does not refer to exploratory data analysis (EDA) that data scientists are typically familiar with, but rather an exploration of the dataset considering the six (or more) dimensions of data quality.
This ensures that we are completely aware of the data landscape, any possible flaws in the dataset, and characteristics that will have to be considered when building models or performing analytics.
In this section, you will be required to:
- derive summary statistics on the dataset across the six dimensions of data quality (see the sketch after this list); and
- identify possible concerns in the data quality, and correct any issues identified.
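As a starting point, a few of these dimensions (completeness, uniqueness, validity) can be summarised directly with PySpark. The parquet path and the column names (stock, date, close, volume) below are assumptions carried over from the ingestion sketch; substitute your own schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stocks-profiling").getOrCreate()

# Assumed path from the ingestion step; column names below are also assumptions.
df = spark.read.parquet("output/stocks_1962.parquet")

# Completeness: fraction of non-null values per column.
total = df.count()
df.select([(F.count(c) / F.lit(total)).alias(c) for c in df.columns]).show()

# Uniqueness: a ticker should not have duplicate trading dates.
(df.groupBy("stock", "date")
   .count()
   .filter(F.col("count") > 1)
   .show())

# Validity: flag implausible values such as negative prices or non-positive volumes.
invalid = df.filter((F.col("close") < 0) | (F.col("volume") <= 0))
print("Rows failing validity checks:", invalid.count())
```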
At the end of this task, you will be required to produce a csv file of all the data transformations you have made during your data quality checks.
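One simple way to keep track of those transformations is to log each change as you apply it and write the log out as a csv at the end. The column layout and example rows below are purely illustrative; the auto-grader will expect the format specified in the provided notebook, so align with that.

```python
import csv

# Illustrative log of transformations; the required column layout may differ.
transformations = [
    {"column": "volume", "issue": "non-positive values", "action": "replaced with null"},
    {"column": "date", "issue": "inconsistent format", "action": "cast to DateType"},
]

with open("transformations_1962.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["column", "issue", "action"])
    writer.writeheader()
    writer.writerows(transformations)
```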
This section mainly focuses on the transformations required to address the data quality issues found in the dataset. The csv file produced will be used to auto-grade your work. You will also be required to complete an MCQ.
The following actions need to be taken to complete the assessments in this section:
- Submit Notebook
- Submit csv output file
- Complete Data Profiling MCQ
The below resource should be used to complete the work found in this section:
Manually inspecting the data for characteristics and flaws is very powerful but not practical when working with production-grade datasets.
Production systems have several challenges, the largest being:
- Size: Incoming data can be so large that manual inspection becomes unfeasible; and
- Frequency: Data can arrive at any interval, sometimes in near real-time, at which point manual inspection also becomes impossible.
This can be mitigated by having an automated process that monitors data quality and performs tests every time the data is ingested.
In this section, you will be required to set up a generic process on which you can perform continuous data quality monitoring and testing.
This section will mainly focus on your ability to make use of an automated data testing tool. You will revisit the six dimensions of data quality using Deequ.
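A minimal sketch of such a check with PyDeequ (the Python wrapper for Deequ) is shown below. The column names and input path are assumptions carried over from the earlier sketches, and depending on the PyDeequ version, the SPARK_VERSION environment variable may also need to be set before the import.

```python
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# The SparkSession must be created with the Deequ dependency on its classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Assumed path and column names from the earlier steps; adjust to the actual schema.
df = spark.read.parquet("output/stocks_1962.parquet")

check = (Check(spark, CheckLevel.Error, "stocks 1962 data quality")
         .isComplete("date")
         .isComplete("close")
         .isNonNegative("volume"))

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# Inspect which constraints passed or failed.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```

Because the checks are declared in code, the same suite can be re-run unchanged on every new batch of ingested data, which is exactly the kind of continuous monitoring this section asks for.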
The following actions need to be taken to complete the assessments in this section:
- Submit Notebook
- Complete Data Testing MCQ
The below resource should be used to complete the work found in this section:
This section of the repo will be periodically updated to address common questions that may arise around its use. If you detect any problems/bugs, please create an issue and we will do our best to resolve it as quickly as possible.
We wish you all the best in your learning experience 🚀!