Knowledge Distillation Without Cross-Entropy

Research on eliminating logit-based loss in knowledge distillation via intelligent layer selection, enhancing training efficiency and model accuracy for computer vision.


Research conducted at the University of Colorado Boulder
Research Supervisors: Professor Danna Gurari and PhD student Nick Cooper
Conference Submission: NeurIPS 2025


📜 Research Paper & Resources

  • NeurIPS 2025 Submission: Towards Knowledge Distillation Without Cross-Entropy (Under Review)
  • Research Institution: University of Colorado Boulder
  • Focus Area: Knowledge Distillation, Intermediate Layer Learning, Logit-Free Training
  • Code Repository: (Coming Soon!)
  • Project Page & Resources: (Coming Soon!)

🛠 Tech Stack & Tools

  • Machine Learning & CV: PyTorch, TorchVision, Vision Transformers (ViTs), VGG, ResNet
  • Optimization: Adam, One-cycle LR, PCA, SVD (see the setup sketch below)
  • Datasets: CIFAR-10, CIFAR-100, Tiny ImageNet
  • Evaluation Metrics: Accuracy, Adjusted Rand Index (ARI), Training Efficiency (% reduction in training epochs)
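
The optimizer pairing above (Adam with a one-cycle LR schedule) maps directly onto PyTorch's built-ins. A minimal setup sketch with placeholder hyperparameters; the paper's actual learning rate and epoch budget are not listed in this README:

```python
# Minimal sketch of the Adam + one-cycle LR setup named above.
# All hyperparameters (lr, epochs, steps_per_epoch) are placeholders,
# not the values used in the paper.
import torch
import torch.nn as nn

student = nn.Linear(512, 100)  # stand-in for a student model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,          # peak LR of the one-cycle schedule
    epochs=100,           # placeholder epoch budget
    steps_per_epoch=391,  # e.g. CIFAR-10 (50k images) at batch size 128
)

# The scheduler advances once per optimizer step:
# loss.backward(); optimizer.step(); scheduler.step()
```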

📖 Research Overview

This research introduces a knowledge distillation method that eliminates the need for logit-based (cross-entropy) losses when training student models. Traditional approaches use the teacher's logits as the primary supervisory signal, but that signal often conflicts with the knowledge carried in the teacher's intermediate layers.

To solve this, our method:

  • Proposes a novel Knowledge Quality (KQ) metric to select optimal teacher layers
  • Trains student backbones using only an intermediate feature loss, removing cross-entropy (CE) losses entirely (see the sketch below)
  • Achieves improved performance across CNNs and ViTs on image classification tasks
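
A minimal sketch of what a logit-free training step could look like under these assumptions: the student is supervised only by a feature-matching loss against a chosen teacher layer, with a linear projector bridging dimensionality. The MSE objective, the projector, and the dimensions are illustrative stand-ins; the paper's exact feature loss is not reproduced here.

```python
# Sketch: logit-free distillation step. Gradients come only from an
# intermediate-feature loss; there is no cross-entropy term on logits.
# The MSE loss, linear projector, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

TEACHER_DIM, STUDENT_DIM = 2048, 512              # hypothetical widths

projector = nn.Linear(STUDENT_DIM, TEACHER_DIM)   # align student to teacher
feature_loss = nn.MSELoss()

def distill_step(student_feats, teacher_feats, optimizer):
    """One update: pull student features toward the frozen teacher layer."""
    loss = feature_loss(projector(student_feats), teacher_feats.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy features (real training would also include the student
# backbone's parameters in the optimizer so the backbone itself is updated):
optimizer = torch.optim.Adam(projector.parameters(), lr=1e-3)
s = torch.randn(8, STUDENT_DIM)
t = torch.randn(8, TEACHER_DIM)
distill_step(s, t, optimizer)
```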

📊 Major Contributions

1. Logit-Free Knowledge Distillation

  • First method to train student backbones without any logit-based loss (CE)
  • Demonstrates significant gains in training stability and generalization

2. Knowledge Quality Metric for Layer Selection

  • Selecting teacher layers with the KQ metric yields superior student performance over baseline layer choices (an illustrative stand-in is sketched below)
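
The KQ metric itself is not defined in this README. Purely as an illustrative stand-in, one could rank candidate teacher layers by how class-discriminative their features are, for example via the Adjusted Rand Index (one of the evaluation metrics listed above) between k-means clusters and ground-truth labels:

```python
# Hypothetical layer-scoring sketch. This is NOT the paper's KQ metric;
# it only illustrates ranking teacher layers by how well their features
# cluster by class, scored with the Adjusted Rand Index (ARI).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def score_layer(feats: np.ndarray, labels: np.ndarray, n_classes: int) -> float:
    """Higher = this layer's features separate the classes more cleanly."""
    clusters = KMeans(n_clusters=n_classes, n_init=10).fit_predict(feats)
    return adjusted_rand_score(labels, clusters)

def select_teacher_layer(layer_feats: dict, labels: np.ndarray, n_classes: int) -> str:
    """Return the name of the highest-scoring candidate layer."""
    return max(layer_feats,
               key=lambda name: score_layer(layer_feats[name], labels, n_classes))
```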

3. Significant Accuracy & Efficiency Gains

  • Boosts top-1 accuracy by up to 15% over baselines
  • Reduces training time by up to 80% across datasets and teacher-student model pairs

4. Robust Evaluation Across Architectures

  • Validated the approach on VGG, ResNet, MobileNet, and ViT backbones (see the feature-extraction sketch below)
  • Demonstrated effectiveness on both small- and large-scale image datasets
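
Evaluating across heterogeneous backbones requires tapping intermediate features from arbitrary torchvision models. A minimal sketch using forward hooks; the layer name below is an example, whereas in this work the tap point would be chosen by the KQ metric:

```python
# Sketch: extracting an intermediate feature map from a torchvision backbone
# via a forward hook. "layer3" is an example tap point; in this work the
# layer would be selected by the KQ metric rather than hard-coded.
import torch
from torchvision import models

def grab_features(model: torch.nn.Module, layer_name: str, x: torch.Tensor):
    """Run `x` through `model` and capture the named layer's output."""
    captured = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(lambda m, i, out: captured.update(feats=out))
    with torch.no_grad():
        model(x)
    handle.remove()
    return captured["feats"]

teacher = models.resnet50(weights=None).eval()
feats = grab_features(teacher, "layer3", torch.randn(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 1024, 14, 14])
```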

🚀 Future Work & Applications

  • Extend KQ metric to multi-teacher/multi-task settings
  • Explore applicability to language models and multimodal learning
  • Develop lightweight mobile-compatible student models for real-time inference

For collaboration, feel free to reach out via LinkedIn or email.