Lab 6: Logistic Regression for Text Classification — From Odds Ratios to TF-IDF
Applying logistic regression to classify newsgroup posts: exploring odds ratios, feature importance, and TF-IDF text representations.
Introduction
This lab introduced me to one of the most widely used models in machine learning: Logistic Regression. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities for categorical outcomes. Despite the “regression” name, it’s a powerful classifier, especially for text and high-dimensional data.
I started with the mathematical foundation, including the sigmoid function and the concept of odds ratios for feature importance. Then, I moved on to a practical case study: classifying documents from the 20 Newsgroups dataset using word counts and TF-IDF features.
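The two theory ideas mentioned above fit in a few lines. A minimal sketch (variable names here are my own, not from the lab): the sigmoid squashes a linear score into a probability, and exponentiating a learned weight gives the odds ratio for that feature.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A weight w_j multiplies the odds p / (1 - p) by exp(w_j) for each
# one-unit increase in feature x_j -- that factor is the odds ratio.
w = 0.7                     # illustrative weight, not a fitted value
odds_ratio = np.exp(w)

print(sigmoid(0.0))         # a score of zero is a coin flip: 0.5
print(odds_ratio)
```

Weights near zero give odds ratios near 1 (the feature barely matters), which is what makes odds ratios a readable importance measure.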
Key Steps Covered
- Theory of Logistic Regression
  - Sigmoid function as probability output.
  - Odds ratios for interpreting feature importance.
- 20 Newsgroups Dataset
  - Focused on binary classification (e.g., sci.space vs. soc.religion.christian).
  - Explored raw documents and labels.
- Text Preprocessing
  - Bag-of-words and count vectorization.
  - TF-IDF encoding for better feature weighting.
- Model Training
  - Implemented logistic regression using stochastic gradient descent (SGD).
  - Compared scikit-learn's implementation with manual training.
- Feature Insights
  - Examined which words were most predictive of each category.
Takeaway
This lab demonstrated why logistic regression is a baseline workhorse for classification problems. With just a linear model and TF-IDF features, we can already achieve strong results on text classification tasks — and odds ratios provide interpretable insights into what features matter most.
🔗 View the full Lab Notebook on GitHub
▶️ Run in Google Colab
