Channel Distillation: Channel-Wise Attention for Knowledge Distillation (Paper Reading Notes)

Title

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Summary

Core idea: the teacher's channel-wise attention distribution is transferred to the student model as knowledge.

Main weakness: the ablation study is too limited to provide strong evidence for the hypotheses stated earlier in the paper.

Research Objective

Use knowledge distillation to obtain a student model that is both accurate and small in parameter count.

Problem Statement

  1. The teacher is not good enough, and the student cannot accurately learn the essential information from the teacher. (the student cannot accurately absorb the teacher's knowledge)
  2. If the teacher is not completely correct, during training, if the student makes a decision with reference to the decisive result of the teacher, the poor output of the teacher will have a bad influence on the student instead. (the teacher is not 100% accurate, so its wrong predictions can mislead the student)
  3. There is a margin between the teacher and the student since they have different structures, which will make the student unable to find its own optimization space if we always let the teacher supervise it. (the two networks have different architectures and hence different parameter spaces, so the student should not be trained under the teacher's supervision the whole time)

Method(s)

  1. Transfer channel-wise information via an attention mechanism: Channel Distillation (CD) (see the code sketch after this list)

    • SENet-style channel attention (global average pooling over each channel's feature map):

      $w_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_c(i,j)$

    • CD Loss:

      $CD(s,t)=\frac{\sum_{i=1}^{n}\sum_{j=1}^{c}\left(w^{ij}_s - w^{ij}_t\right)^2}{n\times c}$, where $i$ indexes samples and $j$ indexes channels.

  2. Guided Knowledge Distillation

    • Original KD: the standard distillation loss, i.e., the KL divergence between the teacher's and the student's temperature-softened output distributions.

    • GKD: addresses problem 2 above; only samples that the teacher classifies correctly contribute to the distillation loss, so the teacher's wrong predictions cannot mislead the student.

  3. Early Decay Teacher (EDT): addresses problem 3 above; the weight on the teacher-supervised losses is gradually decayed as training progresses, so that later in training the student relies mainly on the ground-truth labels and can explore its own optimization space.
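Below is a minimal PyTorch-style sketch of the three components above, written from the descriptions in this note rather than from the authors' released code; the function names, the linear EDT schedule, and the loss weighting in the final comment are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def channel_attention(features: torch.Tensor) -> torch.Tensor:
    """SE-style channel weights: global average pooling over H x W.
    features: (n, c, H, W) -> (n, c)."""
    return features.mean(dim=(2, 3))

def cd_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """CD loss: squared difference of channel attention weights,
    averaged over the n samples and c channels."""
    w_s = channel_attention(student_feat)
    w_t = channel_attention(teacher_feat)
    return ((w_s - w_t) ** 2).mean()

def gkd_loss(student_logits, teacher_logits, labels, temperature=4.0):
    """Guided KD (sketch): soft-target KL divergence computed only on the
    samples the teacher classifies correctly."""
    correct = teacher_logits.argmax(dim=1).eq(labels)  # (n,) bool mask
    if not correct.any():
        return student_logits.new_zeros(())
    p_t = F.softmax(teacher_logits[correct] / temperature, dim=1)
    log_p_s = F.log_softmax(student_logits[correct] / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

def edt_weight(epoch: int, total_epochs: int, initial_weight: float = 1.0) -> float:
    """Early Decay Teacher (illustrative linear schedule): the weight on the
    teacher-supervised losses shrinks as training progresses."""
    return initial_weight * max(0.0, 1.0 - epoch / total_epochs)

# Hypothetical total loss at a given epoch:
# loss = F.cross_entropy(student_logits, labels) \
#        + edt_weight(epoch, total_epochs) * (gkd_loss(student_logits, teacher_logits, labels)
#                                             + cd_loss(student_feat, teacher_feat))
```

Decaying the distillation terms with `edt_weight` is what lets the ground-truth cross-entropy dominate later in training, matching the motivation in problem 3.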

Evaluation

How do the authors evaluate their method? Are there any problems, or anything worth borrowing?

Metrics: Top-1 error, Top-5 error

Networks: ResNet-34 (teacher), ResNet-18 (student)
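For reference, a small sketch (not from the paper) of how the Top-1 / Top-5 error metrics can be computed from a model's logits:

```python
import torch

def topk_error(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is not among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices           # (n, k)
    hit = topk.eq(labels.unsqueeze(1)).any(dim=1)  # (n,)
    return 1.0 - hit.float().mean().item()

# top1_err = topk_error(logits, labels, k=1)
# top5_err = topk_error(logits, labels, k=5)
```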

Issues: no figures, and the analysis is too thin.

Conclusion

Transfer the channel attention information from the teacher to the student, instead of just mimicking the teacher's label-level representation.

Notes

  1. Apart from CD, the other two ideas (GKD and EDT) could be combined with other distillation methods; would that work better than the CD proposed here?
  2. This paper transfers channel knowledge and label knowledge at the same time. Existing work first transfers hidden-layer feature maps via attention and then performs label distillation. Which is better?
  3. Since a teacher model is available, why not use the teacher to do data augmentation on unlabeled data?
  4. The architecture diagram is crude.