Lab 5: Sampling Distributions and Handling Imbalanced Data with SMOTE

From probability distributions to SMOTE: exploring sampling techniques and applying them to balance datasets for better machine learning performance.

Introduction

Sampling lies at the heart of data analysis and machine learning. Whether we’re simulating data, testing algorithms, or building predictive models, the way we sample influences the quality and fairness of results.

In this lab, I practiced sampling from a variety of distributions, explored multiple random variables, and applied sampling in the context of imbalanced datasets. Finally, I experimented with SMOTE (Synthetic Minority Over-sampling Technique), which balances a dataset by generating synthetic examples of the under-represented class. Datasets where one label dominates are a common challenge in real-world classification tasks.

Key Steps Covered

Sampling from Common Distributions
- Uniform: np.random.sample(), which draws from the interval [0, 1)
- Normal: np.random.normal(mean, std, n)
- Beta & Binomial distributions.
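
A minimal sketch of these draws, assuming NumPy's legacy np.random interface. The seed, sample size, and distribution parameters are illustrative choices, not values taken from the lab notebook.

```python
import numpy as np

# Seed the legacy global RNG so the draws are reproducible.
np.random.seed(0)  # arbitrary seed, chosen for illustration

n = 10_000  # sample size, chosen for illustration

uniform = np.random.sample(n)              # uniform on [0, 1)
normal = np.random.normal(5.0, 2.0, n)     # mean 5, standard deviation 2
beta = np.random.beta(2.0, 5.0, n)         # shape parameters a=2, b=5
binomial = np.random.binomial(10, 0.3, n)  # 10 trials, success probability 0.3

# Compare sample statistics against what each distribution predicts.
samples = {
    "uniform": uniform,
    "normal": normal,
    "beta": beta,
    "binomial": binomial,
}
for name, draws in samples.items():
    print(f"{name:>8}: mean={draws.mean():.3f}, std={draws.std():.3f}")
```

With a sample this large, the printed means should sit close to the theoretical ones: 0.5 for the uniform, 5 for the normal, 2/7 for the beta, and 3 for the binomial.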
Discrete Sampling
- Random choice from a collection, with and without probability weights.
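
One way to sketch both cases is np.random.choice, with and without its p argument. The collection and the weights below are hypothetical, picked only to make the difference visible.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, chosen for illustration

colors = ["red", "green", "blue"]  # hypothetical collection

# Without weights: every item is equally likely.
unweighted = np.random.choice(colors, size=5)

# With weights: p needs one probability per item, and they must sum to 1.
weights = [0.7, 0.2, 0.1]  # hypothetical weights
weighted = np.random.choice(colors, size=1000, p=weights)

print(unweighted)

# Count how often each item was drawn in the weighted sample.
values, counts = np.unique(weighted, return_counts=True)
print(dict(zip(values, counts)))
```

In the weighted sample, "red" should make up roughly 70% of the draws.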
Multiple Random Variables
- Summed independent variables (e.g., dice rolls).
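
The dice example can be sketched as follows, assuming two fair six-sided dice. The number of rolls and the seed are illustrative. The simulated frequencies are compared with the exact probabilities, which for a total s between 2 and 12 are (6 - |s - 7|) / 36.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed, chosen for illustration

n_rolls = 100_000

# randint's upper bound is exclusive, so (1, 7) yields faces 1 through 6.
die_a = np.random.randint(1, 7, size=n_rolls)
die_b = np.random.randint(1, 7, size=n_rolls)
total = die_a + die_b

# Empirical probability of each total from 2 to 12.
counts = np.bincount(total, minlength=13)[2:]
empirical = counts / n_rolls

# Exact probabilities for the sum of two fair dice.
sums = np.arange(2, 13)
exact = (6 - np.abs(sums - 7)) / 36

for s, e, x in zip(sums, empirical, exact):
    print(f"sum={s:2d}  empirical={e:.4f}  exact={x:.4f}")
```

Each die on its own is uniform, but the sum peaks at 7, which is the point of looking at several random variables together.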
Data Sampling
- Selecting subsets of arrays via index sampling.
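
A small sketch of index sampling, assuming a feature array X with a matching label array y (both hypothetical). Sampling the indices once and reusing them keeps rows and labels aligned.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed, chosen for illustration

# Hypothetical dataset: row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Draw 4 distinct row indices; replace=False prevents duplicates.
idx = np.random.choice(len(X), size=4, replace=False)

# Index both arrays with the same indices so each row keeps its label.
X_sub, y_sub = X[idx], y[idx]

print(idx)
print(X_sub)
print(y_sub)
```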
Handling Imbalanced Data
- Introduced the SMOTE algorithm.
- Generated synthetic examples to rebalance skewed datasets.
- Ran experiments showing how SVM classification improves with SMOTE.
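
To make the idea of synthetic examples concrete, here is a simplified sketch of SMOTE-style interpolation in plain NumPy. It is not the lab notebook's code and not a library implementation. The function name smote_sketch, its parameters, and the toy data are all hypothetical, and it assumes at least two minority samples with numeric features.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points from the minority samples X_min.

    Simplified illustration: pick a random minority sample, pick one of
    its k nearest minority neighbours, and place a new point a random
    fraction of the way along the segment between them.
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a point cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # starting sample for each new point
    pick = rng.integers(0, k, size=n_new)  # which of its neighbours to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))           # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])


rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(200, 2))  # hypothetical majority class
X_minor = rng.normal(3.0, 1.0, size=(20, 2))   # hypothetical minority class

# Generate enough synthetic minority points to match the majority count.
X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor), rng=rng)
X_minor_balanced = np.vstack([X_minor, X_new])

print(len(X_major), len(X_minor_balanced))
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies instead of being exact duplicates.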

Takeaway

This lab reinforced the importance of sampling as both a statistical concept and a practical tool. By the end, I saw how SMOTE can transform an imbalanced dataset into one that enables classifiers to learn more fairly and effectively — a critical step in applying machine learning responsibly.

🔗 View the full Lab Notebook on GitHub
▶️ Run in Google Colab

Written on August 22, 2025