Report v1.0

Peter Du edited this page Jun 27, 2020 · 1 revision

This report briefly describes the full analysis process for version 1.0.

  1. Exploring the data and Feature Engineering

    • Basic exploratory techniques were used to get an initial overview of the data and to check for missing values.
    • Each feature was analysed to identify unusual data points and to transform the feature into an appropriate form.
    • Some existing features were replaced with new features that support further analysis.
  2. Finding the definition of popularity

    • Analysed features: price, number of subscribers, number of reviews, and number of lectures.
    • Features examined in depth: price, number of subscribers, number of reviews.
    • Cross-tabulation was applied to each pair of analysed features. Each pair was split into four segments (low-low, low-high, high-low, and high-high), where low indicates values below the mean and high indicates values greater than or equal to the mean.
    • Cross-tabulation was then applied to the features examined in depth to find a definition of popularity in the dataset. The results are shown below.
      • Popular courses tend to have a high number of subscribers (on average, more than 12,700).
      • Popular courses tend to have a high number of reviews.
      • Popular courses tend to cost less than the average price.
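
The low/high segmentation described in this section can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the analysis; the function names `segment` and `cross_tab` and the sample values are assumptions made up for the example.

```python
from collections import Counter
from statistics import mean


def segment(values):
    """Label each value 'low' (< mean) or 'high' (>= mean)."""
    m = mean(values)
    return ["high" if v >= m else "low" for v in values]


def cross_tab(xs, ys):
    """Count how many rows fall into each (x segment, y segment) cell."""
    return Counter(zip(segment(xs), segment(ys)))


# Made-up illustrative values, not taken from the dataset.
prices = [20, 200, 50, 150]
subscribers = [30000, 500, 20000, 1000]

table = cross_tab(prices, subscribers)
for cell in [("low", "low"), ("low", "high"), ("high", "low"), ("high", "high")]:
    print(cell, table[cell])
```

In this toy data every low-price course falls in the high-subscriber segment, which is the kind of pattern the cross-tabulation is meant to expose.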
  3. Finding popular topics

    • Focused feature: course title
    • NLP techniques were applied to the course title to find topics for each course. A topic is defined as a noun phrase or a phrase that contains consecutive nouns. The pipeline for the process is shown below.
      [Figure: Text processing pipeline]
    • Once topics were associated with each course, popular topics were generated across all courses, based on the number of subscribers and the number of reviews. Popular topics were also generated for each subject, using the same two measures. Note that the plotted results focus on the top 10 two-word topics and the top 10 three-word topics.
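
The topic step can be approximated with the sketch below. It is not the pipeline from the report, and it makes several assumptions: titles arrive already tokenised and part-of-speech tagged with Penn Treebank tags (the tagger itself is out of scope here), only runs of consecutive nouns are treated as topics, and topics are ranked by summed subscriber counts. The names `noun_runs` and `rank_topics` and the sample courses are hypothetical.

```python
from collections import defaultdict

# Penn Treebank noun tags (assumed tag set; the report does not name one).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_runs(tagged, min_len=2):
    """Return lower-cased phrases formed by runs of consecutive nouns."""
    topics, run = [], []
    for word, tag in list(tagged) + [("", "")]:  # sentinel flushes the last run
        if tag in NOUN_TAGS:
            run.append(word.lower())
        else:
            if len(run) >= min_len:
                topics.append(" ".join(run))
            run = []
    return topics


def rank_topics(courses, n_words, top=10):
    """Rank n-word topics by total subscribers of the courses mentioning them."""
    totals = defaultdict(int)
    for tagged_title, subscribers in courses:
        for topic in set(noun_runs(tagged_title)):
            if len(topic.split()) == n_words:
                totals[topic] += subscribers
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Made-up tagged titles and subscriber counts, for illustration only.
courses = [
    ([("Learn", "VB"), ("Web", "NN"), ("Development", "NN"), ("Fast", "RB")], 1200),
    ([("Web", "NN"), ("Development", "NN"), ("for", "IN"), ("Beginners", "NNS")], 800),
    ([("Complete", "JJ"), ("Stock", "NN"), ("Market", "NN"), ("Course", "NN")], 500),
]

print(rank_topics(courses, n_words=2))
print(rank_topics(courses, n_words=3))
```

The same ranking could be repeated with review counts in place of subscriber counts, or restricted to the courses of a single subject, to mirror the per-subject results.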