Lab 6: Logistic Regression for Text Classification — From Odds Ratios to TF-IDF

Applying logistic regression to classify newsgroup posts: exploring odds ratios, feature importance, and TF-IDF text representations.


Introduction

This lab introduced me to one of the most widely used models in machine learning: logistic regression. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for categorical outcomes. Despite the “regression” in its name, it’s a powerful classifier, especially for text and other high-dimensional data.

I started with the mathematical foundation, including the sigmoid function and the concept of odds ratios for feature importance. Then, I moved on to a practical case study: classifying documents from the 20 Newsgroups dataset using word counts and TF-IDF features.
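The two theory ideas can be sketched in a few lines of code. This is a minimal illustration (the weight value `w = 0.7` is just an example, not from the lab): the sigmoid squashes a linear score into a probability, and exponentiating a coefficient gives the odds ratio for its feature.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary: probability 0.5.
print(sigmoid(0.0))  # → 0.5

# For a coefficient w, each unit increase in that feature multiplies
# the odds p / (1 - p) by exp(w) — the odds ratio.
w = 0.7          # example coefficient (illustrative, not from the lab)
p = sigmoid(w)
print(p / (1 - p), np.exp(w))  # the two values match
```

The identity `sigmoid(z) / (1 - sigmoid(z)) = exp(z)` is why ranking features by `exp(coef)` is a natural interpretability tool for logistic regression.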


Key Steps Covered

  • Theory of Logistic Regression
    • Sigmoid function as probability output.
    • Odds ratios for interpreting feature importance.
  • 20 Newsgroups Dataset
    • Focused on binary classification (e.g., sci.space vs soc.religion.christian).
    • Explored raw documents and labels.
  • Text Preprocessing
    • Bag-of-words and count vectorization.
    • TF-IDF encoding for better feature weighting.
  • Model Training
    • Implemented logistic regression using SGD.
    • Compared scikit-learn’s implementation with manual training.
  • Feature Insights
    • Examined which words were most predictive of each category.

Takeaway

This lab demonstrated why logistic regression is a workhorse baseline for classification problems. With just a linear model and TF-IDF features, we can already achieve strong results on text classification tasks — and odds ratios provide interpretable insight into which features matter most.


🔗 View the full Lab Notebook on GitHub
▶️ Run in Google Colab

Written on August 22, 2025