Fine-Grained Visual Categorization State-of-the-Art Pro-and-Con Survey

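This is an emergency post.

I had never fully sorted out the pros and cons of the FGVC state of the art. Every group in this subfield tells its own story, and the work of the last three years shows no clear lineage or taxonomy across methods, which has been a real source of confusion for me. It is as if everyone working on this topic were sitting around one table. One person says, "I have a method, hmm (lays it out)." Right away another says, "Oh, what a coincidence, I have one too (holds it out)." Then a third speaks up: "Ah, I've got another one here (slaps it down)." Models and algorithms keep turning over this way, and some models differ in accuracy by only 0.1%.

Recently, building on an AAAI 2020 paper, I designed my own graph-construction method and introduced a GCN for classification. The model simply stopped converging, and I still have not rescued it. Earlier code problems also made me miss AAAI-21 (withdrawn). After today's group meeting with Ming-Hsuan, all three of us who presented were feeling rather deflated. I feel I am still too far behind my top-level peers, so far that I am not yet at the stage where guidance from an advisor of this caliber is even needed. I noted several useful points during the meeting, and one of them was this pros-and-cons survey. Time is tight, but it is worth doing and it is something I can start on right away. I also want to use it as a starting point for countering the cons of these works.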

Paper List

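This section collects the papers and code of the recent work I have been following. Some of these papers come from the Papers with Code FGVC leaderboard; the rest were picked and expanded from the comparison experiments and references of those papers. This paper list therefore does not cover the full SOTA picture. It is my subjective selection of works with relatively clear frameworks, modules, and methods, designed specifically for fine-grained tasks and preferably with public code, gathered here for easy lookup.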

Model	Paper	Code	Conference
MMAL-Net	Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization	w/ code	MMM 2021
PMG	Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches	w/ code	ECCV 2020
CIN	Channel Interaction Networks for Fine-Grained Image Categorization	w/o code	AAAI 2020
DCL	Destruction and Construction Learning for Fine-Grained Image Recognition	w/ code	CVPR 2019
Cross-X	Cross-X Learning for Fine-Grained Visual Categorization	w/ code	ICCV 2019
S3N	Selective Sparse Sampling for Fine-Grained Image Recognition	w/ code	ICCV 2019
DFL-CNN	Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition	w/o code	CVPR 2018
MaxEnt	Maximum-Entropy Fine Grained Classification	w/o code	NeurIPS 2018
MC-Loss	The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification	w/ code	IEEE TIP 2020
FCAN	Fully Convolutional Attention Networks for Fine-Grained Recognition	w/o code	n/a
RA-CNN	Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition	w/o code	CVPR 2017
ViT	An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale	w/ code	ICLR 2021

Pro-and-Con Comparison

Model	Pros / Contributions	Cons	Accuracy
MMAL-Net	√ End-to-end multi-branch network √ Learn object’s discriminative regions for recognition effectively √ The accuracy of Attention Object Location Module is achieved by only using category labels √ Present an Attention Part Proposal Method (APPM) without the need of part annotations √ Outperform the state-of-the-art methods and baselines on three standard benchmark datasets	×	CUB-200: 89.6% Stanford Cars: 94.7% FGVC-Aircraft: 95.0%
PMG	√ Novel progressive training strategy operates in different training steps √ Cultivate the inherent complementary properties across different granularities for fine-grained feature learning √ A simple yet effective jigsaw puzzle generator to form different levels of granularity √ PMG obtains state-of-the-art or competitive performances on all three standard FGVC benchmark datasets	×	CUB-200: 89.6% Stanford Cars: 95.1% FGVC-Aircraft: 93.4%
DFL-CNN	√ Enhance the mid-level learning capability of the classical CNN by introducing a bank of discriminative filters √ Simple and effective √ High human interpretability √ Consistent performance across different fine-grained visual domains and various network architectures	×	CUB-200: 87.4% Stanford Cars: 93.8% FGVC-Aircraft: 92.0%
CIN	√ Propose a self-channel interaction (SCI) module able to model the interplay between different channels within an image √ Propose a novel contrastive channel interaction (CCI) module to learn channel-wise relationships between images √ Method achieves better performance than the current state of the art	×	CUB-200: 88.1% Stanford Cars: 94.5% FGVC-Aircraft: 92.8%
Cross-X	√ Cross-X learning approach for fine-grained feature learning √ Cross-X learning explores relationships between features from different images and different network layers √ Address the issue of robust multi-scale feature learning through cross-layer regularization	×	CUB-200: 87.7% Stanford Cars: 94.6% FGVC-Aircraft: 92.7% Stanford Dogs: 88.9% NABirds: 86.4%
DCL	√ A novel Destruction and Construction Learning (DCL) framework √ State-of-the-art performances √ No need for extra part/object annotation √ No computational overhead at inference time	×	Stanford Cars: 93.0% FGVC-Aircraft: 94.5%
S3N	√ Novel Selective Sparse Sampling framework √ Substantial improvement over the baselines concerning model accuracy and the ability of mining visual evidence	×	CUB-200: 88.5% Stanford Cars: 94.7% FGVC-Aircraft: 92.8%
ViT	√ Latest trend √ Experiment with applying a standard Transformer directly to images, with the fewest possible modifications.	×	Oxford Flowers: 99.68% (ViT-H/14) 99.74% (ViT-L/16) Oxford-IIIT Pets: 97.56% (ViT-H/14) 97.32% (ViT-L/16)

Brief Conclusion

After compiling and comparing these works, I draw the following conclusions about the current SOTA.

First, it clarified the two types of end-to-end networks for fine-grained classification.

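The DFL-CNN paper points out that "more recent CNN-based approaches are usually trained end-to-end and can be roughly divided into two categories: localization-classification subnetworks and end-to-end feature encoding," which resembles the two-stage versus one-stage split in detection/segmentation. Judging by the direction and implementation details of my earlier experiments, my method falls under part-based end-to-end feature encoding FGVC. In practice, only image-level class labels are usually available, so the localization in loc-cls subnet methods is largely semi-supervised or weakly supervised.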

Second, the methods differ in design, emphasis, and starting point.

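Works such as MC-Loss and MaxEnt improve a plain network from the angle of the loss function or metrics, which demands a fair amount of derivation and proof about probability distributions. Multi-level networks generally sum multiple losses, while the only labels usually available for supervision are image labels. More and more work has dropped bbox/part annotations, so cross-entropy loss remains the most common loss, and some intermediate layers naturally end up semi-supervised or weakly supervised.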
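To make the "sum of several losses under a single image label" idea concrete, here is a minimal sketch in plain Python. It is not taken from any of the papers listed above; the names `cross_entropy` and `multi_level_loss` and the optional per-level `weights` are assumptions made for illustration. Each level of the network emits its own logits, and every level is scored against the same image-level class label.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed as logsumexp(logits) - logits[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def multi_level_loss(level_logits, label, weights=None):
    """Sum the cross-entropy of several heads that share one image-level label.

    level_logits: one logits list per network level (hypothetical heads).
    weights: optional per-level weights; defaults to 1.0 for every level.
    """
    if weights is None:
        weights = [1.0] * len(level_logits)
    if len(weights) != len(level_logits):
        raise ValueError("need exactly one weight per level")
    return sum(w * cross_entropy(l, label) for w, l in zip(weights, level_logits))
```

Because every head sees only the image label, nothing in this loss tells an intermediate level which region or part it should respond to, which is the sense in which those levels are weakly supervised.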

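Works such as MMAL-Net, PMG, DCL, and S3N present themselves as novel frameworks, designing new modules and new ways of using features. Part-based works differ slightly in how they split the image during preprocessing, but the splitting itself has become a staple of fine-grained classification. For example, MMAL-Net develops AOLM and APPM as a two-stage pair of modules for region pre-selection and object part proposal; the jigsaw generator in PMG directly cuts the image into puzzle patches, shuffles them, and reassembles them into an image, but paired with progressive training it does not seem independent enough, and the two may be coupled.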
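The cut, shuffle, and reassemble step described for the jigsaw generator can be sketched roughly as follows. This is an illustration under stated assumptions, not PMG's released code: the function name `jigsaw_generator`, the nested-list image representation, and the `rng` parameter are all hypothetical, and the image height and width are assumed to be divisible by `n`.

```python
import random

def jigsaw_generator(image, n, rng=None):
    """Cut a 2-D image (list of rows) into n x n patches, shuffle them,
    and stitch them back into an image of the same size."""
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    if h % n or w % n:
        raise ValueError("image height and width must be divisible by n")
    ph, pw = h // n, w // n  # patch height and width

    # Cut: collect the patches in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(n)
        for c in range(n)
    ]

    # Shuffle: permute patch positions; pixels inside a patch stay untouched.
    rng.shuffle(patches)

    # Reassemble: place the shuffled patches back on the n x n grid.
    out = []
    for r in range(n):
        for y in range(ph):
            row = []
            for c in range(n):
                row.extend(patches[r * n + c][y])
            out.append(row)
    return out
```

A larger `n` gives smaller patches and therefore a finer granularity. With `n = 1` the image comes back unchanged, since there is only one patch to shuffle.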

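Works that improve the use of feature maps innovate in directions such as channel selection, attention, and pooling, exploring feature weights, high- and low-frequency signals, and relationships across images and channels to raise overall model performance. Pushing further here easily draws a "limited novelty" warning, and the paper can look thin. Designing a module that truly impresses is very hard, but it is still worth trying.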