Lab 6: Logistic Regression for Text Classification — From Odds Ratios to TF-IDF

Applying logistic regression to classify newsgroup posts: exploring odds ratios, feature importance, and TF-IDF text representations.


Introduction

This lab introduced me to one of the most widely used models in machine learning: logistic regression. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for categorical outcomes. Despite the “regression” in its name, it’s a powerful classifier, especially for text and other high-dimensional data.

I started with the mathematical foundation, including the sigmoid function and the concept of odds ratios for feature importance. Then, I moved on to a practical case study: classifying documents from the 20 Newsgroups dataset using word counts and TF-IDF features.
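The two theory ideas can be sketched in a few lines of code. This is a minimal illustration (the weight value `w = 0.7` is just an example, not from the lab): the sigmoid squashes a linear score into a probability, and exponentiating a coefficient gives the odds ratio for its feature.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary: probability 0.5.
print(sigmoid(0.0))  # → 0.5

# For a coefficient w, each unit increase in that feature multiplies
# the odds p / (1 - p) by exp(w) — the odds ratio.
w = 0.7          # example coefficient (illustrative, not from the lab)
p = sigmoid(w)
print(p / (1 - p), np.exp(w))  # the two values match
```

The identity `sigmoid(z) / (1 - sigmoid(z)) = exp(z)` is why ranking features by `exp(coef)` is a natural interpretability tool for logistic regression.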


Key Steps Covered

  • Theory of Logistic Regression
    • Sigmoid function as probability output.
    • Odds ratios for interpreting feature importance.
  • 20 Newsgroups Dataset
    • Focused on binary classification (e.g., sci.space vs soc.religion.christian).
    • Explored raw documents and labels.
  • Text Preprocessing
    • Bag-of-words and count vectorization.
    • TF-IDF encoding for better feature weighting.
  • Model Training
    • Implemented logistic regression using SGD.
    • Compared scikit-learn’s implementation with manual training.
  • Feature Insights
    • Examined which words were most predictive of each category.

Takeaway

This lab demonstrated why logistic regression is a workhorse baseline for classification problems. With just a linear model and TF-IDF features, we can already achieve strong results on text classification tasks — and odds ratios provide interpretable insight into which features matter most.


🔗 View the full Lab Notebook on GitHub
▶️ Run in Google Colab

Written on August 22, 2025