Examples for the Big Data Mining class
- Create your virtual environment and install the required dependencies:
virtualenv -p `which python3` venv
source venv/bin/activate
pip install -r requirements.txt
# To use this environment in Jupyter, you will probably need to register it as a kernel:
ipython kernel install --user --name=venv
Alternatively, copy and paste the scripts into Google Colab.
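To confirm that Python is running inside the virtual environment created above rather than the system interpreter, a quick check along these lines may help. This is only a minimal sketch: the helper name `in_virtualenv` is hypothetical (it is not part of this repository), and it relies solely on the standard `sys` module.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment.

    Hypothetical helper for illustration; not part of the class scripts.
    """
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the first line reports `False` after running `source venv/bin/activate`, the shell is probably still picking up a different `python`.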
The examples cover the following topics:
- binning
- correlation
- feature selection
- principal component analysis (PCA)
- normalization
- categorical encoding
- discretization with k-means
- exploratory data analysis (EDA)
- decision trees for discretization, classification, rule extraction
- Bayesian network
- SVM
- KNN
- clustering examples with k-means, hierarchical clustering and DBSCAN
- silhouette coefficient examples
- Apriori implementation and example
- movie recommendation example
- outlier examples with statistical assumptions, boxplot, DBSCAN and Isolation Forest
- text classification with Naive Bayes
- topic modeling with LDA and information retrieval
- word2vec example with pre-trained embeddings
- language model example with simple n-grams, MLE and smoothing
- data playground
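To give a flavour of one topic in the list (n-gram language models with MLE and smoothing), here is a small self-contained sketch. It is not the code from this repository: the function names `train_bigram` and `bigram_prob`, the toy corpus, and the choice of add-k smoothing are all assumptions made for illustration. For simplicity, the vocabulary also counts the padding symbols `<s>` and `</s>`.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for tokens in sentences:
        # Pad each sentence with start and end symbols.
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab, k=0.0):
    """Estimate P(cur | prev) with add-k smoothing; k=0 gives the plain MLE."""
    denom = context_counts[prev] + k * len(vocab)
    if denom == 0:
        return 0.0  # unseen context and no smoothing
    return (bigram_counts[(prev, cur)] + k) / denom


# Toy corpus (made up for this sketch).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram(corpus)

mle_seen = bigram_prob("the", "cat", contexts, bigrams, vocab)          # 1/2
mle_unseen = bigram_prob("the", "sat", contexts, bigrams, vocab)        # 0/2
smoothed_unseen = bigram_prob("the", "sat", contexts, bigrams, vocab, k=1.0)  # 1/8
print(mle_seen, mle_unseen, smoothed_unseen)
```

The unsmoothed estimate assigns zero probability to any bigram that never occurs in the training data; add-k smoothing moves a little probability mass to such bigrams at the cost of lowering the estimates for observed ones.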
The data folder contains various datasets and toy data used for the examples in this class.
See requirements.txt to install the needed libraries in your virtual environment.