A walkthrough of the data clean-up and summarization done to meet the project submission guidelines.
- Load the required 'dplyr' library
- Load the 'features' file to find the column/variable names for the observations in the data sets.
    1. Step #2 of the instructions reads "Extracts only the measurements on the mean and standard deviation for each measurement", so only the feature/variable names containing the keyword "mean" or "std" are retained for analysis.
    2. Create a subset of the 'features' table holding the feature names that contain "mean" or "std", using grep().
    3. Since $featureName later replaces the column names of the data sets, remove the "()" from the names using sub() with the regular expression `"\\(\\)"` (the backslashes are doubled because it is written as an R string literal).
- Load the 'activities' file to find the activity names for the observations in data sets
- Load TEST subjects from the "test/subject_test.txt" file
- Load TEST activities from the "test/y_test.txt" file
- Load TEST data set of observations from the "test/X_test.txt" file
    1. Retain only the columns/variables holding the "mean" or "std" by select() on the filtered features list built in the 'features' step above.
    2. Rename the columns to the cleaned feature names (with "()" removed) from the same 'features' step.
- Add the TEST subjects and activities as two new columns to the TEST data set
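- Performed the same TEST steps (loading subjects, activities and observations, then adding the subject and activity columns) on the TRAIN data set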
- Merged the TEST and TRAIN data sets using the rbind() function, since both data frames have the same columns and need to be stacked by rows rather than joined by columns.
- Converted the activity codes in the merged data frame to a factor labelled with the activity names.
- Sorted the merged data frame by Subjects and then Activity.
- Used the group_by() and summarise_each() functions to compute the mean of each measurement for every Subject and Activity combination.
- Wrote the tidy and summarized data set to the file as required.
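
The steps above can be sketched end to end on toy data. This is a minimal illustration, not the actual script: the small in-memory data frames stand in for the real files (which would be read with read.table()), the column names `featureId`, `activityId` and `activityName`, the helper `build_set()` and the output file name `tidy_data.txt` are all hypothetical, and it assumes dplyr 1.0 or later, where `across()` takes the place of the now-deprecated `summarise_each()`.

```r
library(dplyr)

# Toy stand-ins for the 'features' and 'activities' files (assumed layout:
# an id column followed by a name column).
features <- data.frame(
  featureId   = 1:4,
  featureName = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                  "tBodyAcc-max()-X", "tBodyGyro-mean()-X"),
  stringsAsFactors = FALSE
)
activities <- data.frame(
  activityId   = 1:2,
  activityName = c("WALKING", "SITTING"),
  stringsAsFactors = FALSE
)

# Keep only the features whose names contain "mean" or "std", then drop "()".
wanted <- features[grep("mean|std", features$featureName), ]
wanted$featureName <- sub("\\(\\)", "", wanted$featureName)

# Hypothetical helper: select the wanted columns by position, rename them,
# and add the subject and activity columns in front.
build_set <- function(x, subject, activity) {
  x <- select(x, all_of(wanted$featureId))
  names(x) <- wanted$featureName
  cbind(Subject = subject, Activity = activity, x)
}

# Toy observations: one column per feature, one row per observation.
test_x  <- data.frame(V1 = c(1, 3), V2 = c(0.1, 0.3), V3 = c(9, 9), V4 = c(10, 30))
train_x <- data.frame(V1 = c(2, 4), V2 = c(0.2, 0.4), V3 = c(9, 9), V4 = c(20, 40))

test  <- build_set(test_x,  subject = c(2, 2), activity = c(1, 1))
train <- build_set(train_x, subject = c(1, 1), activity = c(2, 1))

# Stack the two sets by rows and replace activity codes with their names.
merged <- rbind(test, train)
merged$Activity <- factor(merged$Activity,
                          levels = activities$activityId,
                          labels = activities$activityName)

# Sort, then average every measurement per Subject and Activity.
tidy <- merged %>%
  arrange(Subject, Activity) %>%
  group_by(Subject, Activity) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Write the summary out (to a temporary directory in this sketch).
write.table(tidy, file.path(tempdir(), "tidy_data.txt"), row.names = FALSE)
```

With this toy input, `tidy` holds one row per Subject and Activity combination, and the discarded "max" feature does not appear among its columns. Selecting by position with `all_of()` relies on the feature ids matching the column order of the observation file, which is how the 'features' list is used in the steps above.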