2.1.1: What is Unsupervised Learning?
-
Comparison with Supervised Learning:
- Two primary methods of machine learning are unsupervised and supervised learning.
- Unlike supervised learning that utilizes labeled data, unsupervised learning works with unlabeled data.
-
Process & Use:
- Unsupervised learning identifies trends, relationships, and patterns (clusters) within data.
- Its applications:
- Cluster/group data to discern patterns rather than predicting a class.
- Transform data for intuitive analysis or supervised/deep learning use.
-
Practical Applications:
- Recommending related products based on customer views.
- Targeting specific customer groups in marketing.
- Use cases include customer segmentation, fraud detection, outlier identification, spam detection, healthcare diagnoses, and sentiment analysis.
-
Challenges:
- No definitive means to verify the correctness of the output.
- Understanding the clusters is challenging, as the algorithm self-generates categories.
-
Interplay with Other Learning Methods:
- Unsupervised learning can be used alongside supervised learning, deep learning, and NLP.
- Techniques like PCA (Principal Component Analysis) can optimize predictions or classifications by reducing input variables.
- It can be a pre-training step before applying deep learning to labeled data.
- In NLP, unsupervised learning can group texts/documents based on similarity.
-
Learning Outcomes:
- By the module's end, learners will be able to:
- Understand unsupervised learning's role in AI.
- Define clustering in the context of machine learning.
- Utilize the K-means algorithm for clustering.
- Determine the optimal cluster number using the elbow method.
- Convert categorical variables to numerical representations.
- Understand and apply PCA for dimensionality reduction.
- Utilize K-means post-PCA for improved data analysis.
2.1.2: Recap
-
Purpose:
- Unsupervised learning is employed to cluster unlabeled datasets.
- It can also provide an alternative data representation for subsequent analysis and machine learning processes.
- Unsupervised learning algorithms build models from the input data itself, identifying relationships among data points.
-
Challenges:
- Because the input dataset lacks labels, there is no certain way to verify the accuracy of the output.
- The algorithm self-generates data categories, requiring an expert to assess the relevance and significance of these categories.
-
Value:
- Despite its challenges, unsupervised learning proves beneficial in a diverse range of applications.
2.1.3: Getting Started
-
Files:
- Before starting the module, download the required files: Module 2 Demo and Activity Files.
-
Installations:
- The primary tools for this module are JupyterLab and the scikit-learn Python library.
- Scikit-learn is typically pre-installed with Anaconda.
-
Confirmation Steps for scikit-learn Installation:
- Activate the Conda development environment using the command: conda activate dev
- To verify the installation of scikit-learn in the active environment, use: conda list scikit-learn
- If installed correctly, the terminal will show scikit-learn as listed.
-
Installation Instructions (if scikit-learn is missing):
- Install scikit-learn using the command: pip install -U scikit-learn
- After installation, repeat the confirmation steps to ensure the library has been installed successfully.
2.1.4: Clustering
-
Definition:
- Clustering is the process of grouping data points according to their similarity.
- Unsupervised learning models often use clustering algorithms to group objects.
-
Application Example:
- Cable services can use a clustering algorithm to group customers based on their viewing habits.
-
Practical Implementation:
- An example is provided in the Jupyter notebook file demos/01-Clustering/Unsolved/clusters.ipynb in the module-2-files folder.
- A synthetic dataset is generated using make_blobs from scikit-learn.
- It creates two features (X) and labels (y).
- random_state=1 ensures consistent random data generation.
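The dataset generation above can be sketched as follows. The number of centers is an assumption for illustration, and the column names mirror those in the walkthrough; the plotting call is commented out since it requires matplotlib:

```python
# Sketch of the synthetic dataset described above (two centers assumed).
import pandas as pd
from sklearn.datasets import make_blobs

# Two features (X) and labels (y); random_state=1 makes the data reproducible.
X, y = make_blobs(n_samples=100, centers=2, random_state=1)

df = pd.DataFrame(X, columns=["Feature 1", "Feature 2"])
df["Target"] = y  # labels added as a single column

print(X.shape)  # (100, 2): 100 rows, two columns
# df.plot.scatter(x="Feature 1", y="Feature 2", c="Target")  # needs matplotlib
```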
-
Exploring the Data:
- The synthetic dataset created provides an array of X values, with a (100, 2) shape indicating 100 rows and two columns.
- The y values are transformed to fit into a single column.
- Data can be visualized using Pandas plot methods.
- The X values (features) are named "Feature 1" and "Feature 2".
- y values (labels) are added as "Target".
-
Visualization:
- Data can be visualized in a scatter plot, showcasing the different clusters.
- Clusters indicate similar data points.
- Each cluster forms around a center; determining these centers ("centering") is how classes/groups are identified in advanced analytics.
- Centering also improves logistic regression models by ensuring the data points share the same starting mean value.
2.1.5: Recap and Knowledge Check
-
Definition:
- Unsupervised learning models often use clustering algorithms.
- The goal is to group similar objects/data points into clusters.
-
Cluster Assignment:
- In the given examples, the clusters are pre-defined (via the centers parameter of make_blobs).
- However, not all datasets come with predefined clusters.
-
Role of Unsupervised Learning:
- A primary role of unsupervised learning is identifying the number of distinct clusters in a dataset.
- This task is achieved using the K-means algorithm, which will be discussed next.
2.1.6: Segmenting Data: The K-Means Algorithm
-
Challenges with Clustering:
- Identifying the correct number of clusters in data is often challenging.
- Features distinguishing different groups might not always be evident.
-
K-Means Algorithm:
- It's an unsupervised learning algorithm used for clustering.
- Simplifies the process of objectively and automatically grouping data.
- K (uppercase) indicates the actual number of clusters.
- k (lowercase) generally refers to a cluster.
-
How K-Means Works:
- The algorithm first assigns points to the nearest cluster center.
- It adjusts the cluster center to the mean of data points in that cluster.
- This process is iterative, refining cluster assignments with each iteration.
- K-Means undergoes the following steps:
- Randomly selecting K initial cluster centers (centroids).
- Assigning each data point to its nearest centroid.
- Updating each centroid to the mean of the points assigned to it.
- Reassigning data points based on the updated centroids, repeating until assignments stabilize.
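The iterative steps above can be sketched in plain NumPy. This is an illustrative toy, not scikit-learn's implementation; the function name and its defaults are assumptions for illustration:

```python
import numpy as np

def kmeans(X, K, n_iters=10, seed=1):
    """Toy K-means: assign points to the nearest centroid, then update means."""
    rng = np.random.default_rng(seed)
    # Randomly select K data points as the initial cluster centers.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```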
-
Applications & Benefits:
- One significant application is customer segmentation in markets.
- Customer segmentation is crucial in financial services for better market understanding.
- Notable Example: Netflix:
- Uses segmentation to recommend movies based on customer preferences.
- This has improved product utility and reduced user cancellation rates.
-
Using K-Means with scikit-learn:
- K-Means can be implemented in Python using the scikit-learn library.
- Scikit-learn is an open-source Python library offering various supervised and unsupervised learning algorithms.
- Esteemed organizations like J.P. Morgan, Booking.com, Spotify, and Change.org use scikit-learn for their machine learning efforts.
- It is the primary tool recommended for machine learning in the provided content.
2.1.7: Using the K-Means Algorithm for Customer Segmentation
- The example discusses clustering average ratings for customer service, focusing on ratings from customers who evaluated the mobile application and in-person banking services.
- Users can follow along using the Jupyter notebook file demos/02-Kmeans/Unsolved/services_clustering.ipynb.
- The data is loaded using Python's Pandas library from a CSV file named service-ratings.csv.
- Initial data review shows two columns: "mobile_app_rating" and "personal_banker_rating".
- A scatter plot is created to visualize the spread of these ratings.
- Data visualization reveals that clear clusters aren't immediately apparent; however, there's a noticeable congregation of points around specific values.
- The K-means clustering algorithm from the scikit-learn library will be used to identify clusters in this data.
- The K-means model is initialized targeting two clusters, with random_state=1 for consistent outcomes and n_init='auto' for automatic determination of starting centroids.
- The model is trained (fit) using the service ratings data.
- Predictions for cluster assignments are made for each data point.
- The predictions are added as a new column (“customer_rating”) to a copy of the original DataFrame.
- A new scatter plot is created, color-coded by the predicted customer ratings.
- The updated scatter plot reveals two distinct clusters, indicating preferences: one group leaning towards mobile banking and the other towards in-person banking.
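A hedged sketch of the walkthrough above. Since service-ratings.csv is not reproduced here, randomly generated ratings stand in for it, and n_init=10 is used in place of n_init='auto' for compatibility with older scikit-learn versions:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for service-ratings.csv: random ratings on a 1-5 scale.
rng = np.random.default_rng(1)
df_ratings = pd.DataFrame({
    "mobile_app_rating": rng.uniform(1, 5, 100),
    "personal_banker_rating": rng.uniform(1, 5, 100),
})

# Initialize a two-cluster model, fit it, and predict a cluster per point.
model = KMeans(n_clusters=2, random_state=1, n_init=10)
model.fit(df_ratings)

df_plot = df_ratings.copy()
df_plot["customer_rating"] = model.predict(df_ratings)
# df_plot.plot.scatter(x="mobile_app_rating", y="personal_banker_rating",
#                      c="customer_rating", colormap="winter")  # needs matplotlib
```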
2.1.8: Activity: Spending Beyond Your K-Means
2.1.9: Recap and Knowledge Check
2.2.1: Finding Optimal Clusters: The Elbow Method
- Cluster analysis requires trial and error to find optimal clusters.
- Different features are tested to achieve the desired number of clusters.
- The K-means algorithm has been used to define customer segments with limited features such as "mobile_app_rating" and "personal_banker_rating".
- Identifying the optimal number of clusters (k) in a new, unlabeled dataset with many features is challenging.
- The elbow method is a well-known technique for determining the optimal value of k in the K-means algorithm.
- The elbow method is a heuristic for efficiently solving the problem of determining the number of clusters.
- In K-means, uppercase K represents the number of clusters, while lowercase k refers to a cluster in general.
- The elbow method involves running the K-means algorithm for various values of k and plotting an elbow curve.
- The elbow curve plots the number of clusters (k) on the x-axis and the measure of inertia on the y-axis.
- Inertia measures how spread out the data points are within a cluster (the sum of squared distances from each point to its cluster center).
- Low inertia indicates data points are close together within a cluster, implying a small standard deviation relative to the cluster mean.
- High inertia suggests data points are more spread out within a cluster, indicating a high standard deviation relative to the cluster mean.
- The optimal value for k is found at the elbow of the curve, where the inertia shows minimal change with each additional cluster added.
2.2.2: Apply the Elbow Method
- The Jupyter notebook file 'elbow_method.ipynb' in the 'demos/03-Elbow_Method/Unsolved' directory applies the elbow method.
- Python and the K-means algorithm are used for customer segmentation analysis with customer service ratings data.
- Dependencies are imported, and the dataset is loaded into a Pandas DataFrame.
- An empty list is created to store inertia values, and a range of k-values (1 to 10) is used to test.
- A loop computes inertia for each k-value, which is then appended to the inertia list.
- A DataFrame is created to hold k-values and corresponding inertia values.
- The DataFrame is plotted as an elbow curve to observe the relationship between the number of clusters and inertia.
- The rate of decrease in inertia is analyzed to determine the elbow point, which is identified at k=4.
- The percentage decrease in inertia between each k-value is calculated to confirm the elbow point.
- K-means algorithm and plotting are rerun using four clusters.
- The elbow curve is emphasized as a guide, not a definitive answer, for determining the right number of clusters.
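The loop described above can be sketched like this. Synthetic blob data stands in for the ratings CSV, and n_init=10 replaces n_init='auto' for broader scikit-learn compatibility:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four underlying groups.
X, _ = make_blobs(n_samples=200, centers=4, random_state=1)
df = pd.DataFrame(X, columns=["feature_1", "feature_2"])

inertia = []
k_values = list(range(1, 11))  # test k from 1 to 10
for k in k_values:
    model = KMeans(n_clusters=k, random_state=1, n_init=10)
    model.fit(df)
    inertia.append(model.inertia_)  # sum of squared distances to centroids

df_elbow = pd.DataFrame({"k": k_values, "inertia": inertia})
# df_elbow.plot(x="k", y="inertia", title="Elbow Curve")  # needs matplotlib
```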
2.2.3: Activity: Finding the Best k
- Background: This activity involves using the elbow method to determine the optimal number of clusters for segmenting stock pricing information.
Subtasks:
-
Establish an elbow curve.
-
Evaluate the two most likely values for k using K-means.
-
Generate scatter plots for each value of k.
-
Files: Start with the files in activities/02-Finding_the_Best_k/Unsolved/.
-
Instructions:
- Read the option-trades.csv file into a DataFrame, using the "date" column as the DatetimeIndex; include the parse_dates and infer_datetime_format parameters.
- Create two lists: one for lowercase k-values (1 to 11) and another for inertia scores.
- For each k-value, define and fit a K-means model, then append the model's inertia to the inertia list.
- Store the k-values and inertia in a DataFrame called df_elbow_data.
- Plot the df_elbow_data DataFrame using Pandas to visualize the elbow curve, ensuring the plot is styled and formatted.
-
Solution:
- Compare your work with the solution in activities/02-Finding_the_Best_k/Solved/finding_the_best_k_solution.ipynb in the module-2-files folder.
- Reflect on differences between your approach and the solution, and identify confusing aspects.
2.2.4: Recap and Knowledge Check
- Focus: Using the elbow method to find the optimal number of clusters (k) for data.
Subtasks:
- Running the K-means algorithm for a range of k-values.
- Plotting the results and the corresponding level of inertia for each k-value.
-
Inertia: Measures the distribution of data points within a cluster.
- Optimal k-value: Identified at the elbow of the curve where the rate of decrease in inertia slows down.
-
Next Steps: Introduction to data optimization through data scaling.
- Importance: Data scaling is crucial for applying PCA (Principal Component Analysis).
- PCA combines learnings from unsupervised learning to enhance the algorithm and simplify data interpretation.
- Further exploration of PCA will follow the scaling and transforming data activity.
2.2.5: Scaling and Transforming for Optimization
- Overview: Applying PCA to enhance machine learning algorithms after optimizing data clustering with K-means.
Subtasks:
- Apply standard scaling to data features before using PCA.
- Combine PCA with K-means for better processing of large datasets.
-
Preparing Data:
- Manual data preparation, especially normalization or transformation, is time-consuming for multiple columns.
- K-means requires numeric values in a DataFrame to be on the same scale to avoid bias towards any single variable.
-
Standard Scaling:
- Common method for data scaling, centering values around the mean.
- Involves transforming data to eliminate measurement units and bring numeric values to a similar scale.
- Use of functions from Pandas and scikit-learn simplifies data preparation.
- Standard scaling centers bell curves around the same mean value, making them comparable.
- Example: Scaling "Annual Income" using StandardScaler from scikit-learn.
- StandardScaler calculates the column mean and scales data to a standard deviation of 1.
- Formula: (value - mean)/standard deviation.
- Scaling to a mean of 0 and standard deviation of 1 is essential for equalizing variable ranges in machine learning models.
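The formula above, (value - mean) / standard deviation, can be checked on a tiny hypothetical "Annual Income" column (the values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Annual Income": [30_000, 45_000, 60_000, 120_000]})

# fit_transform computes the column mean and std, then returns a scaled array.
scaled = StandardScaler().fit_transform(df)

print(scaled.mean())  # ~0.0
print(scaled.std())   # ~1.0
```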
-
Importance:
- Prevents machine learning models from giving undue importance to columns with larger or more widely ranging values.
- Known as data standardization, a common practice before training machine learning models.
-
Application:
- Demonstration of applying standard scaling to a shopping dataset.
2.2.6: Apply Standard Scaling
- Overview: Application of standard scaling using a shopping dataset from the "Spending beyond your K-Means" activity.
Steps for Standard Scaling:
- Import the StandardScaler module and Pandas.
- Load data into a DataFrame and drop unnecessary columns like "CustomerID".
- Scale numeric columns "Age", "Annual Income", and "Spending Score" using StandardScaler's fit_transform function.
- Create a new DataFrame with scaled data.
- Important Notes:
- StandardScaler's fit_transform function returns an array, not a DataFrame.
- The array needs conversion to a DataFrame, with columns named in the order they were scaled.
- StandardScaler is suitable for continuously ranging data, not categorical data.
Handling Categorical Data:
- Use Pandas' get_dummies function to transform categorical data into numerical format.
- Example: Transform "Card Type" column into numerical values representing "Credit" and "Debit".
- get_dummies converts categorical columns into separate numerical columns.
Concatenating Transformed Data:
- Concatenate the scaled numerical data and transformed categorical data into one DataFrame using Pandas' concat function.
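The preparation steps above might look like this on a tiny hypothetical shopping DataFrame (column names follow the lesson; the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Age": [21, 35, 58],
    "Annual Income": [20_000, 50_000, 90_000],
    "Spending Score": [40, 60, 80],
    "Card Type": ["Credit", "Debit", "Credit"],
})

# Scale the continuous columns; fit_transform returns an array, so the
# result is converted back into a DataFrame with the same column names.
numeric_cols = ["Age", "Annual Income", "Spending Score"]
df_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric_cols]), columns=numeric_cols
)

# Encode the categorical column as separate numeric columns, then combine.
df_dummies = pd.get_dummies(df["Card Type"])
df_prepared = pd.concat([df_scaled, df_dummies], axis=1)
```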
Pro Tip:
-
Scaling numeric data and encoding categorical data are essential practices in data preparation for K-means clustering.
-
Ensures all data is on the same scale, preventing disproportionate weight to any variable.
-
Application:
- Practice standard scaling and encoding variables in the next activity.
2.2.7: Activity: Standardizing Stock Data
- Background: Standardize stock data and use the K-means algorithm for clustering.
Files:
- Begin with the files in activities/03-Standardizing_Stock_Data/Unsolved/.
Instructions:
- Read the tsx-energy-2018.csv file into a DataFrame, setting the "Ticker" column as the index.
- Note: Stock data includes yearly mean prices, volume, annual return, and variance from TSX-listed energy companies.
- Use the StandardScaler module and fit_transform function to scale the numerical columns.
- Review a sample of the scaled data.
- Create a new DataFrame, df_stocks_scaled, with the scaled data:
- Retain the original column labels.
- Add the ticker column from the original DataFrame and set it as the index.
- Review the new DataFrame.
- Encode the "EnergyType" column using pd.get_dummies, storing the result in df_oil_dummies.
- Concatenate df_stocks_scaled and df_oil_dummies using the pd.concat function along axis=1.
- Review the concatenated DataFrame.
- Cluster the data using the K-means algorithm with a k value of 3.
- Create a copy of the DataFrame and add a column with company segment values.
Solution:
- Compare your work with the solution in activities/03-Standardizing_Stock_Data/Solved/standardizing_stock_data_solution.ipynb.
- Reflect on the differences between your approach and the solution.
- Please feel free to seek help during instructor-led office hours for any confusion or questions.
2.2.8: Recap and Knowledge Check
-
Overview: Scaling or normalizing datasets prepares data for future analysis.
- Purpose: Adjust the scale of data for easier comparison.
- Methods: This can be done manually or using functions from Pandas and scikit-learn.
-
Recent Lesson: Explored data scaling by centering values around the mean, known as standard scaling.
-
Upcoming Lesson:
- Focus: Understanding the importance of scaling data for PCA (Principal Component Analysis).
- Content: Explore the steps involved in the PCA technique.
2.3.1: Introduction to Principal Component Analysis
-
Background: Addressing high dimensionality in machine learning algorithms using Principal Component Analysis (PCA).
-
Issue with High Dimensionality:
- Datasets with too many features can slow down algorithm execution.
- High dimensionality can be problematic in large data sets.
-
Principal Component Analysis (PCA):
- PCA is used to speed up machine learning algorithms with many features.
- It reduces dimensionality by transforming a large set of features into a smaller one.
- This process reduces the number of DataFrame columns while preserving helpful information.
- PCA helps increase interpretability and minimize information loss.
-
Dimensionality Reduction:
- Involves reducing dataset size but preserving as much helpful information as possible.
- PCA is a popular technique for dimensionality reduction.
-
Application of PCA:
- Example provided using credit card dataset segmentation with K-means.
- The process includes normalizing data, applying PCA, and reducing features.
Steps for Using PCA with K-Means Algorithm:
- Normalize and transform the data.
- Use StandardScaler and get_dummies for data preparation.
- Pick the number of components for PCA.
- Apply dimensionality reduction and get a variance for each component.
- Create a DataFrame with transformed data.
- Determine the optimal k value for segmentation with K-means.
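The steps above can be sketched on synthetic data: a minimal sketch assuming two components, with synthetic blobs standing in for the credit card dataset:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 150 samples with 5 features.
X, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=1)

# Step 1: normalize the data.
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3: pick the number of components and reduce dimensionality.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Step 4: the explained variance ratio shows how much information survives.
print(pca.explained_variance_ratio_.sum())

# Step 5: create a DataFrame with the transformed data.
df_pca = pd.DataFrame(components, columns=["PCA1", "PCA2"])
```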
-
Application Steps Explained:
- PCA reduces the number of variables while preserving as much original data information as possible.
- The variance of each PCA component indicates the amount of information it contains.
- After PCA, data is represented in principal components without specific meanings.
- Explained variance helps determine the importance of each principal component.
- Aim for a total explained variance between 70-90% for retaining essential data features.
- Overfitting risk increases with too many principal components.
- PCA transformation results in a more apparent distinction between customer segments.
-
Example: Credit Card Dataset
- Use of PCA for more precise segmentation in credit card data.
- K-means with 3 clusters applied to the PCA-transformed data.
- Scatter plot with PCA dimensions shows distinct customer segments.
- Despite some information loss, dimensionality reduction offers a clearer visualization.
2.3.2: Recap and Knowledge Check
-
Relationship Between K-means and PCA:
- PCA is used when applying K-means to datasets with many dimensions.
- It helps in accelerating algorithms by reducing dimensions to principal components.
- This process enhances interpretability and minimizes information loss.
-
Steps for Using PCA with K-means Algorithm:
- Normalize and transform the data.
- Utilize StandardScaler for standard scaling.
- Apply get_dummies for handling categorical variables.
- Reduce the number of features in the dataset.
- Choose the number of principal components for dimensionality reduction.
- Apply the dimensionality reduction process.
- Calculate and analyze the variance of each component.
- Convert the transformed data into a DataFrame.
- Determine the optimal k value for K-means clustering.
- Segment the PCA-transformed data using K-means.
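The recapped pipeline (scale, reduce with PCA, segment with K-means) can be condensed into one sketch. Synthetic data stands in for a real dataset, and n_init=10 is used for scikit-learn version compatibility:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic six-feature dataset with three underlying groups.
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=1)

# Normalize, then reduce to two principal components.
X_scaled = StandardScaler().fit_transform(X)
df_pca = pd.DataFrame(
    PCA(n_components=2).fit_transform(X_scaled), columns=["PCA1", "PCA2"]
)

# Segment the PCA-transformed data with K-means (k=3).
model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(df_pca)
df_pca["segment"] = model.labels_
```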
-
Learning Progression:
- Understanding PCA and its application.
- Upcoming focus: Simplifying stock data using standard scaling and PCA.
- Objective: To reduce the number of columns for clustering and then apply K-means.
2.3.3: Activity: Energize Your Stock Clustering
-
Background:
- Purpose: To use the K-means algorithm and clustering optimization techniques for clustering stocks and defining a portfolio investment strategy.
-
Files and Instructions:
- Access the files in activities/04-Energize_Your_Stock_Clustering/Unsolved/.
- Use the Jupyter notebook in the Unsolved folder for the following tasks:
- Read in tsx-energy-2018.csv, setting "Ticker" as the DataFrame index.
- Review and utilize four code cells to scale the df_stocks DataFrame and create a scaled DataFrame.
- Cluster the data in df_stocks_scaled using K-means with a lowercase-k value of 3, or determine the optimal k using the elbow method.
- Add the cluster values as a new column in df_stocks_scaled.
- Create a scatter plot with axes "AnnualVariance" and "Annual Return", color-coded by "StockCluster".
-
Advanced Analysis with PCA:
- Use PCA to reduce df_stocks_scaled to two principal components:
- Review the PCA data and calculate the explained variance ratio.
- Create a DataFrame df_stocks_pca with the PCA data and tickers from the original DataFrame.
- Set the new Tickers column as the index for df_stocks_pca.
- Rerun K-means using the PCA data and create a scatter plot of the two principal components.
-
Review and Comparison:
- Review the completed activity by comparing it with the solution in activities/04-Energize_Your_Stock_Clustering/Solved/energize_your_stock_clustering_solution.ipynb.
- Evaluate differences and your understanding of the process.
2.3.4: Recap and Knowledge Check
-
Overview:
- This lesson demonstrates the culmination of learnings in the module through the application of Principal Component Analysis (PCA).
-
Key Takeaways:
- Understanding the importance of PCA in data visualization and AI applications.
- The examples and activities provide practical experience in using PCA.
-
Next Steps:
- The module concludes with a recap in the form of a summary and reflection.
- Engaging with the final lesson helps solidify the learnings from the entire module.
2.4.1: Summary: Unsupervised Learning
-
Module Completion:
- Congratulations on finishing another module.
-
Learning Outcomes:
- Learned about the purpose and importance of unsupervised learning in AI.
- Defined clustering and its application in machine learning.
- The K-means algorithm was applied for cluster identification in datasets.
- Determined optimal cluster numbers using the elbow method.
- Practiced scaling and transforming categorical variables into numerical formats.
- Explained PCA and its role in dimensionality reduction in data.
- Used PCA to reduce feature numbers in datasets.
- Applied K-means algorithm post-PCA dimensionality reduction.
-
Practical Skills Gained:
- Clustered consumer spending data into different segments.
- Utilized the elbow method for optimal k-value identification in stock pricing datasets.
- Standardized stock data and clustered it using the K-means algorithm.
- Used PCA in combination with K-means for clustering stocks for investment strategies.
-
Unsupervised Learning Insights:
- Unsupervised learning models operate without human intervention, analyzing unlabeled datasets.
- They are particularly useful for handling large datasets.
- Unsupervised learning contributes significantly to AI advancements, such as customer segmentation, fraud detection, and outlier identification.
-
Next Steps:
- The next module will focus on supervised learning models.
- Encouraged to compare machine learning models and understand their specific values and applications.
- Reflect on learnings from the current module before progressing.
2.4.2: Reflect on Your Learning
-
Reflection on Unsupervised Learning:
-
New Knowledge and Skills:
- Understanding the principles and applications of unsupervised learning.
- Mastery of clustering techniques and the K-means algorithm.
- Skills in determining the optimal number of clusters using the elbow method.
- Learning about PCA for dimensionality reduction.
-
Surprises and Intrigues:
- The effectiveness of unsupervised learning in discovering patterns in unlabeled data.
- The versatility of PCA in simplifying complex datasets.
-
Challenges:
- Grasping the complexities of PCA and its implications on data.
- Finding the optimal number of clusters in K-means clustering.
-
Application:
- Using clustering to analyze consumer behavior in a professional setting.
- Applying PCA and K-means to personal projects involving large datasets.
-
Documentation:
- Writing down insights and applications for future reference.
- Keeping a record to refresh memory or guide future learning.
-
Personal Relevance:
- Reflecting on how unsupervised learning models can be integrated into job-related tasks or personal projects.
- Considering the practical impact of these models on making predictions or analyzing trends.
-
2.4.3: References
Thorndike, R.L. 1953. Who belongs in the family? Psychometrika. Available: https://www.semanticscholar.org/paper/Who-belongs-in-the-family-Thorndike/47706e9fdfe6b7d33d09579e60d6c9732cfa90e7