

[πŸ‡§πŸ‡· PortuguΓͺs] [πŸ‡ΊπŸ‡Έ English]



6 - Data Mining / Data Cleaning, Preparation, and Anomaly Detection (Outlier Detection)



Institution: Pontifical Catholic University of SΓ£o Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Dr. Daniel Rodrigues da Silva (PhD in Mathematics)



Sponsor: Quantum Software Development













🎢 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4

πŸ“Ί For better resolution, watch the video on YouTube.



Tip

This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.

Access Data Mining Main Repository

If you’d like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.



Explore datasets from the University of California, Irvine (UCI) Machine Learning Repository, such as the Balloons, Bank Marketing, and Mammographic Mass datasets, to practice these data pre-processing and mining concepts.






Real-world data are almost always incomplete, inconsistent, and noisy. These problems must be addressed via pre-processing to ensure clean, reliable data, a prerequisite for successful data mining.

The pre-processing step manipulates raw data into a form that enables better and more accurate knowledge extraction.


Missing values: Absent attribute values, records, or features.
Example: "?" in the credit card field, or entire missing rows.

Inconsistent data: Contradictory or conflicting entries within the data, e.g., units mixing kg with lbs.

Noise: Random variations or errors that obscure real trends in the data.

Poor-quality input data produce poor-quality outputs and insights, so cleaning the data beforehand is critical.



Types of Data


| Type | Description | Examples |
|------|-------------|----------|
| Structured | Fixed fields, clear schema | CSV, SQL tables |
| Semi-structured | Partial structure with markers | XML, JSON, emails |
| Unstructured | No strict structure or schema | Text, images, video files |




| Attribute Type | Description | Example |
|----------------|-------------|---------|
| Binary | Two possible values | Yes/No, 0/1 |
| Nominal | Categorical, no order | Marital status |
| Ordinal | Ordered categories | Education level |
| Ratio | Numeric with a meaningful zero | Age, salary |
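
A minimal pandas sketch (the column names and values below are invented for illustration) showing how these attribute types can be declared so that, for example, ordinal comparisons behave correctly:

```python
import pandas as pd

# Hypothetical records illustrating the four attribute types above
df = pd.DataFrame({
    "defaulted": [0, 1, 0],                            # binary (0/1)
    "marital":   ["single", "married", "single"],      # nominal: no order
    "education": ["primary", "secondary", "tertiary"], # ordinal
    "salary":    [2500.0, 4100.0, 3300.0],             # ratio: meaningful zero
})

# Nominal: unordered categorical dtype
df["marital"] = df["marital"].astype("category")

# Ordinal: an ordered categorical dtype preserves the ranking
edu_order = pd.CategoricalDtype(["primary", "secondary", "tertiary"], ordered=True)
df["education"] = df["education"].astype(edu_order)

print(df.dtypes)
print(df["education"] > "primary")  # ordinal comparisons now work
```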



Balloons dataset:

  • 20 samples, 5 attributes: color, size, action, age, inflated (True/False).
  • A simple dataset to illustrate basic concepts.

Bank Marketing dataset:

  • 4521 samples, 17 attributes related to direct marketing campaigns.
  • Task: predict whether a client will subscribe to a term deposit.

Example attributes:


| Attribute | Type | Description |
|-----------|------|-------------|
| age | Numeric | Client's age |
| job | Categorical | Job type |
| marital | Categorical | Marital status |
| education | Categorical | Education level |
| credit | Binary | Has credit line (yes/no) |
| balance | Numeric | Account balance |
| ... | ... | ... |




Mammographic Mass dataset:

  • 960 samples, 6 attributes related to breast masses.
  • Used to predict severity (benign/malignant).



Steps of Data Pre-Processing

  1. Cleaning: Handle missing, noisy, and inconsistent data.
  2. Integration: Combine data from multiple sources.
  3. Reduction: Reduce dimensionality or data volume.
  4. Transformation: Normalize and format data.
  5. Discretization: Convert continuous attributes into categorical intervals.
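
A hedged end-to-end sketch of how these five steps chain together, applied to a tiny synthetic frame; all data and column names below are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Tiny synthetic frame standing in for raw source data (hypothetical values)
raw = pd.DataFrame({
    "age":     [25, None, 47, 38, 61],
    "balance": [1200.0, 300.0, None, 2200.0, 900.0],
    "job":     ["admin.", "unknown", "technician", "admin.", "services"],
})

# 1. Cleaning: treat 'unknown' as missing, then impute
clean = raw.replace("unknown", pd.NA)
clean["age"] = clean["age"].fillna(clean["age"].mean())
clean["balance"] = clean["balance"].fillna(clean["balance"].median())
clean["job"] = clean["job"].fillna(clean["job"].mode()[0])

# 2. Integration: combine further sources with the same schema
# (skipped here; see the Integration section below)

# 3. Reduction: keep only the attributes relevant to the task
reduced = clean[["age", "balance"]]

# 4. Transformation: min-max normalization to [0, 1]
normalized = reduced.apply(minmax_scale)

# 5. Discretization: age into 3 equal-width bins
normalized["age_bin"] = pd.cut(normalized["age"], bins=3, labels=False)
print(normalized)
```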





Handling noise (smoothing):

  • Binning: Group values into intervals (equal-width or equal-frequency bins), then replace each value by its bin mean or by the bin boundaries (see the sketch after this list).
  • Clustering: Group similar data points and replace each with its cluster centroid or medoid.
  • Approximation: Fit the data to smoothing functions such as polynomials.
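
A minimal sketch of smoothing by bin means with equal-width bins; the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical noisy measurements
values = pd.Series([2, 3, 7, 8, 11, 12, 16, 18, 20])

# Equal-width binning into 3 intervals
# (note: pandas intervals are right-inclusive by default, so 8 falls in the first bin)
bins = pd.cut(values, bins=3, labels=False)

# Smooth by replacing each value with the mean of its bin
smoothed = values.groupby(bins).transform("mean")
print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```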

Handling inconsistencies:

  • Detect out-of-domain or conflicting values (a small domain-constraint check is sketched below).
  • Correct them through manual review or automated scripts that verify domain constraints.
  • Use visualization, statistical summaries, and input from domain experts.
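
As an illustration of automated domain-constraint checking, this sketch flags out-of-domain and misspelled values in hypothetical columns:

```python
import pandas as pd

# Hypothetical records with injected inconsistencies
df = pd.DataFrame({
    "age": [34, -2, 51, 240],  # -2 and 240 violate the age domain
    "marital": ["single", "married", "widowed", "maried"],  # typo in last row
})

# Declare domain constraints and flag violations
valid_marital = {"single", "married", "divorced", "widowed"}
bad_age = ~df["age"].between(0, 120)
bad_marital = ~df["marital"].isin(valid_marital)

print(df[bad_age | bad_marital])  # rows needing review or correction
```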



Integration issues:

  • Redundancy: Attributes or records that are duplicated or derivable from others.
  • Duplicity: Exact copies of objects or attributes.
  • Conflicts: Different units or representations of the same attribute.
  • Resolve these by normalizing and unifying units or standards (see the sketch after this list).
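
A small sketch of resolving a unit conflict (lb vs. kg) and duplicate records while integrating two hypothetical sources:

```python
import pandas as pd

# Two hypothetical sources describing the same subjects in different units
a = pd.DataFrame({"id": [1, 2], "weight_kg": [70.0, 82.5]})
b = pd.DataFrame({"id": [2, 3], "weight_lb": [181.9, 154.3]})

# Conflict: unify units by converting pounds to kilograms
b["weight_kg"] = b["weight_lb"] * 0.453592
b = b.drop(columns="weight_lb")

# Integrate, then resolve duplicity: id 2 appears in both sources
merged = pd.concat([a, b], ignore_index=True)
merged = merged.drop_duplicates(subset="id", keep="first")
print(merged)
```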





Normalization rescales data for algorithm compatibility:


Min-Max normalization maps a value $a$ to $a'$ in a new range $[new_{min}, new_{max}]$:

$$ \Huge a' = \frac{a - min_a}{max_a - min_a} \times (new_{max} - new_{min}) + new_{min} $$



Z-score normalization centers an attribute around zero and scales it by its standard deviation:

$$ \Huge a' = \frac{a - \bar{a}}{\sigma_a} $$
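
To make both formulas concrete, a small NumPy sketch with hypothetical values:

```python
import numpy as np

a = np.array([2.0, 8.0, 11.0, 20.0])

# Min-max normalization into [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
a_minmax = (a - a.min()) / (a.max() - a.min()) * (new_max - new_min) + new_min
print(a_minmax)  # [0.0, 0.333..., 0.5, 1.0]

# Z-score normalization: zero mean, unit standard deviation
a_z = (a - a.mean()) / a.std()
print(a_z.mean().round(10), a_z.std())  # ~0.0 and 1.0
```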




Discretization:

  • Converts numeric attributes into categorical bins.
  • Methods include equal-width, equal-frequency, histogram-based, and entropy-based discretization (the first two are sketched below).
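
A brief sketch contrasting equal-width (pd.cut) and equal-frequency (pd.qcut) discretization on hypothetical ages:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 61, 70])

# Equal-width: intervals have the same span
print(pd.cut(ages, bins=3, labels=["young", "middle", "senior"]).tolist())

# Equal-frequency: intervals hold (roughly) the same number of values
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]).tolist())
```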






```python
import pandas as pd
from sklearn.preprocessing import minmax_scale, scale

# Load the Bank Marketing dataset from UCI (use your local copy or a URL)
data_path = "bank-additional-full.csv"
df = pd.read_csv(data_path, sep=';')

# Drop unnamed columns (typical artifacts of CSV exports)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Handling missing values: replace placeholder strings with NaN for easier
# manipulation (this UCI dataset marks missing categoricals as 'unknown';
# other sources use '?')
df = df.replace(['?', 'unknown'], pd.NA)

# Option 1: remove rows with any missing value
# (not recommended if too many rows would be deleted)
df_clean = df.dropna()

# Option 2: impute missing values with the mode for a categorical
# attribute, e.g. 'job' (mode() returns a Series, so take the first entry)
df['job'] = df['job'].fillna(df['job'].mode()[0])

# Max-min normalization for numeric columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
df[num_cols] = df[num_cols].apply(minmax_scale)

# Z-score normalization (an alternative to min-max; applied to a copy so
# the two scalings are not stacked)
df_z = df.copy()
df_z[num_cols] = df_z[num_cols].apply(scale)

# Discretization example: age into 5 equal-width bins
df['age_binned'] = pd.cut(df['age'], bins=5, labels=False)

# Drop any remaining 'Unnamed' columns, if present
df = df.drop(columns=[col for col in df.columns if 'Unnamed' in col],
             errors='ignore')
```




```
+-------------------+
| Define Problem    |
+---------+---------+
          |
          v
+-------------------+
| Select Data       |
+---------+---------+
          |
          v
+-------------------+     +---------------------------+
| Choose Algorithm  +---->+ Apply Preprocessing Steps |
+---------+---------+     +---------------------------+
          |
          v
+-------------------+
| Apply Data Mining |
+-------------------+
```




Equal-width binning example:

```
Values: 2, 3, 7, 8, 11, 12, 16, 18, 20
Bins: 3 (width = 6)

Bin 1: 2 - 7   -> 2, 3, 7
Bin 2: 8 - 13  -> 8, 11, 12
Bin 3: 14 - 20 -> 16, 18, 20
```
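
The same assignment can be reproduced in code; a sketch using NumPy's digitize with the cut points 8 and 14 from the example above:

```python
import numpy as np

values = np.array([2, 3, 7, 8, 11, 12, 16, 18, 20])

# Bin edges at 8 and 14: bin 0 is [2, 8), bin 1 is [8, 14), bin 2 is [14, 20]
bins = np.digitize(values, [8, 14])
print(bins)  # [0 0 0 1 1 1 2 2 2]
```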




(a) Centroid vs. (b) Medoid:

Centroid = the mean point of a cluster (an artificial point that may not coincide with any data object)
Medoid = the actual most central object of the cluster
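
A small NumPy sketch contrasting the two, assuming Euclidean distance on hypothetical 2-D points:

```python
import numpy as np

# Hypothetical 2-D cluster
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# Centroid: the mean point (usually not one of the data points)
centroid = pts.mean(axis=0)

# Medoid: the actual point with the smallest total distance to all others
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
medoid = pts[dists.sum(axis=1).argmin()]

print("centroid:", centroid)  # [1.5 1.5], an artificial point
print("medoid:  ", medoid)    # an actual member of the cluster
```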



Data pre-processing ensures the quality of mining outcomes by addressing missing, noisy, and inconsistent data; integrating multiple data sources; reducing data size and dimensionality; normalizing; and discretizing attributes for algorithm compatibility.

UCI datasets like Balloon, Bank Marketing, and Mammographic Mass provide excellent real-world scenarios to practice these techniques using Python.



This code notebook guides you through essential data pre-processing steps applied to datasets similar to those from the UCI Machine Learning Repository, such as Bank Marketing or Breast Cancer datasets. The steps cover cleaning, handling missing values, normalizing, discretizing, and preparing data for mining.



```python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import minmax_scale, scale

# Load dataset from a local or online source
# Example: Breast Cancer Wisconsin dataset (adjust the path or URL as needed)
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/cancer.csv'  # Example URL
dados = pd.read_csv(url)

# Show the first rows to inspect the data
dados.head()

# List columns and data types
print(dados.dtypes)

# Check for 'Unnamed' columns (common in CSV exports)
unnamed_cols = [col for col in dados.columns if "Unnamed" in col]
print("Unnamed columns:", unnamed_cols)

# Drop unnamed columns, if any
dados.drop(columns=unnamed_cols, inplace=True)
print("Columns after dropping unnamed ones:", dados.columns)

# Check for missing values per column
print(dados.isnull().sum())

# Example: impute missing values using the mean for numerical columns
num_cols = dados.select_dtypes(include=['float64', 'int64']).columns.tolist()
for col in num_cols:
    mean_value = dados[col].mean()
    dados[col] = dados[col].fillna(mean_value)

# For categorical columns, impute the mode if needed
cat_cols = dados.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    mode_value = dados[col].mode()[0]  # mode() returns a Series; take the first entry
    dados[col] = dados[col].fillna(mode_value)

# Verify no missing values are left
print(dados.isnull().sum())

# Alternatively, remove rows with any missing value
dados_clean = dados.dropna()
print(f"Data shape after dropping missing values: {dados_clean.shape}")

# Select numeric columns for normalization, excluding identifiers or labels
cols = dados.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Example: remove columns if they are IDs or target variables
if 'id' in cols: cols.remove('id')
if 'diagnosis' in cols: cols.remove('diagnosis')

# Apply min-max normalization to the selected columns
dados_minmax = dados.copy()
dados_minmax[cols] = dados_minmax[cols].apply(minmax_scale)
dados_minmax.head()

# Apply z-score standardization: mean = 0, std = 1
dados_zscore = dados.copy()
dados_zscore[cols] = dados_zscore[cols].apply(scale)
dados_zscore.head()

# Example: discretize 'radius_mean' into 10 equal-width bins labeled 0-9
if 'radius_mean' in dados.columns:
    dados['radius_mean_binned'] = pd.cut(dados['radius_mean'], bins=10, labels=range(10))
    print(dados[['radius_mean', 'radius_mean_binned']].head())
else:
    print('Column radius_mean not found in dataset')

# Example: encoding categorical columns (if needed)
categorical_cols = dados.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_cols)

# Convert categorical columns to category dtype
for col in categorical_cols:
    dados[col] = dados[col].astype('category')

# Display categorical columns with their unique values
for col in categorical_cols:
    print(f"{col} unique values: {dados[col].cat.categories}")

# Final look at the data shape and first few rows after pre-processing
print("Data shape:", dados.shape)
dados.head()
```























References

1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.

2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd ed. LTC.

3. Larson, R. & Farber, B. (2015). Applied Statistics. Pearson.







πŸ›ΈΰΉ‹ My Contacts Hub





────────────── πŸ”­β‹† ──────────────

➣➒➀ Back to Top

Copyright 2025 Quantum Software Development. Code released under the MIT License.
