
COM6012 Scalable Machine Learning - University of Sheffield. Enjoy our resources? ⭐ Star this repository to show your support and help others discover it!


COM6012/ScalableML


COM6012 Scalable Machine Learning - University of Sheffield

Spring 2026

by Shuo Zhou and Robert Loftin, with Tahsin Khan

In this module, we will learn how to do machine learning at large scale using Apache Spark. We will use the university's High Performance Computing (HPC) clusters. To access the HPC clusters, log in over SSH with your university username and its associated password. Whether you connect from campus over Eduroam or from off campus, you must keep the university's VPN connected at all times. Multifactor authentication (MFA) is mandatory and uses the standard University DUO MFA.
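As a rough illustration of the login step described above, the following Python sketch assembles the SSH command for a given username. The host name `hpc.example.ac.uk`, the username `ab1cde`, and the helper `build_ssh_command` are hypothetical placeholders, not values from this module; use the login host given in the HPC instructions. The VPN connection, password prompt, and DUO MFA challenge still happen interactively when you run the command.

```python
import shlex

# Hypothetical placeholder host: replace it with the login host
# given in the module's HPC instructions.
HPC_HOST = "hpc.example.ac.uk"


def build_ssh_command(username: str, host: str = HPC_HOST) -> list[str]:
    """Return the argument list for an interactive SSH login as `username`."""
    # Reject values that are clearly not a bare login name.
    if not username or "@" in username or any(c.isspace() for c in username):
        raise ValueError("username must be a bare login name without '@' or spaces")
    return ["ssh", f"{username}@{host}"]


if __name__ == "__main__":
    # Print the command to type in a terminal once the VPN is connected.
    print(shlex.join(build_ssh_command("ab1cde")))
```

Printing the command instead of running it keeps the sketch free of side effects; in practice you would simply type the resulting `ssh` command in a terminal.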

  • Session 1: Introduction to Spark and HPC (Shuo Zhou)
  • Session 2: RDD, DataFrame, ML pipeline, & parallelization (Shuo Zhou)
  • Session 3: Scalable logistic regression and Spark configuration (Shuo Zhou)
  • Session 4: Scalable generalized linear models and Spark data types (Shuo Zhou)
  • Session 5: Scalable decision trees and ensemble models (Robert Loftin)
  • Session 6: Scalable matrix factorization for collaborative filtering in recommender systems (Robert Loftin)
  • Session 7: Scalable PCA for dimensionality reduction (Robert Loftin)
  • Session 8: Scalable k-means clustering (Robert Loftin)
  • Session 9: Scalable neural networks (Tahsin Khan)
  • Session 10: Apache Spark in the Cloud (Databricks) (Tahsin Khan)

You can also download the Spring 2025 version for preview or reference.

If you do not have a GitHub account yet, we recommend signing up for one to learn how to use this popular open-source software development platform.

We use US spelling in the slides and lab notes for consistency with the naming conventions in Spark.

An Introduction to Transparent Machine Learning

Shuo Zhou developed the open course An Introduction to Transparent Machine Learning with Prof. Haiping Lu, as part of the Alan Turing Institute’s online learning courses in responsible AI. This introductory course emphasizes transparency in machine learning and may support your learning of scalable machine learning.

Acknowledgement

The materials are built with references to the following sources:

Many thanks to

  • Haiping Lu and Mauricio A Álvarez for their significant contributions to this module from 2016 to 2025 and from 2016 to 2022, respectively. Their contributions remain reflected in the course materials.
  • Mike Croucher, Neil Lawrence, William Furnass, Twin Karmakharm, Mike Smith, Xianyuan Liu, Desmond Ryan, Steve Kirk, James Moore, and Vamsi Sai Turlapati for their input and inspiration since 2016.
  • Our teaching assistants and students who have contributed in many ways since 2017.
