Skip to content

rajrohan/Data-Management

Repository files navigation

Data-Management

Data Management I (Data Profiling and Quality Dimensions)
Exam Guidelines and Requirements Prof. Dr. Ajinkya Prabhune Prof. Dr. Barbara Sprick

  • The submission for this module has two parts; detailed are explained as Goal-A and Goal-B below. - Goal-A àEach student will be given a dataset and it is expected that you profile and clean the dataset considering the different quality dimensions. - The exam format for Goal-A will be a report submission – the report should not be more than 15-20 pages, following a single column format. o Each student has to submit his/her completed report until 15th March 14:00 in Moodle. o Source code, if any must be zipped and uploaded with the report in Moodle. o The report should not contain any subjective description (general information) but should be specific with objective statements on the steps taken in cleaning the data. - Goal-B à Each student has to also prepare an unclean dataset. - The exam submission for Goal-B will be a short document, recipe/source-code if any and the unclean dataset in either CSV, XL, or a MySQL database dump.

Goal - A a) Data Profiling - Using Talend Data Quality Management tool, you have to document the data profiling steps performed on the dataset - For clear understanding of the profiling steps performed, you can paste the screenshot from the respective tool.

b) Data Cleaning - Mention all the relevant data quality dimension necessary for cleaning the dataset. - Document in detail the steps performed for handling a given quality dimension. - OpenRefine, Talend Data Preparation, or Trifacta can be used for performing the data cleaning process • Bonus point for those who will be showing a comparison of data cleaning features between the 3 above mentioned tools (this task is not mandatory). - The datasets are in CSV format, exporting them in a DB system for performing specific/advance queries is up to each individual. However, that information should be part of the report. - Following questions have to be answered for the task • What insights did you gain from dataset analysis? • Which all data quality dimension does your analysis fits into? - Explain the dimension with a use case (the use case can be hypothetical one created by you) - For example, in the classroom tutorial on the Patent dataset, the use case was finding the different countries filing patents; *If you wish to use a different dataset than the one provided by us, then please inform us as soon as you possible.

  • To answer this use case, the dimensions conformity + consistency + accuracy was considered for the column “PublicationNumber”. • Why did you choose a given technique and the associated actions steps for a task? - Is there an alternative technique that could have been used?

GOAL – B - Creating a Raw dataset refereeing to any online source like kaggle.com. - Conditions on the Raw dataset o Minimum dimensions to be considered while making the dataset Raw are § Accuracy § Completeness § Conformity § Validity § Consistency § Uniqueness § Currency o You are free to include any other dimension from the Literature. - The dataset must be made at least 25% Raw/Unclean based on the above-stated dimensions. - Every record/set of records that you make unclean should be documented highlighting the quality dimension that is considered. - The recipe/source-code should be also submitted for verification o You can use any or a combination of OpenRefine, Data Preparation, or Trifacta.

About

Data Profiling and Data Cleaning by mentioning all the relevant data quality dimension

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published