This is an emergency post.

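I had never fully organized the pros and cons of the FGVC state of the art. In this subfield every group tells its own story, and the work of the past three years shows no clear lineage or taxonomy of methods, which has caused me some confusion. It is as if everyone working in this direction were sitting around one table. One person says, "I have a method, um (sets it out)." Right after, another says, "Oh, what a coincidence, I have one too (holds it out)." Then a third speaks up: "Ah, I've got another one here (slaps it down)." Models and algorithms keep turning over this way, and some models differ in accuracy by only 0.1%.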

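Recently, building on an AAAI 2020 paper, I designed my own graph construction method and introduced a GCN for classification. This directly caused the model to stop converging, and I still have not rescued it. Earlier code problems also made me miss AAAI-21 (withdrawn). After today's group meeting with Ming-Hsuan, all three of us who presented were feeling rather deflated. I feel that my ability is still too far from that of top-level peers, to the point that I am not yet ready for guidance from an advisor of this caliber. During the meeting I noted a few useful points, one of which is this pros-and-cons survey. Time is tight, but it still needs doing, and it is something I can start on right away. I also want to use it to counter, or comment on, the cons of these works.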

Paper List

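This section collects the papers and code of the works I have been following recently. Some of these papers come from the Papers with Code FGVC leaderboard; the rest were picked out and extended from the comparison experiments and references of those papers. This paper list therefore does not cover the full SOTA picture. It is my subjective selection of works with a clear framework, modules, and method, designed specifically for fine-grained tasks, and preferably with public code, gathered here for easy lookup.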

| Model | Paper | Code | Conference |
| --- | --- | --- | --- |
| MMAL-Net | Multi-branch and Multi-scale Attention Learning for Fine-Grained Visual Categorization | w/ code | MMM 2021 |
| PMG | Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches | w/ code | ECCV 2020 |
| CIN | Channel Interaction Networks for Fine-Grained Image Categorization | w/o code | AAAI 2020 |
| DCL | Destruction and Construction Learning for Fine-Grained Image Recognition | w/ code | CVPR 2019 |
| Cross-X | Cross-X Learning for Fine-Grained Visual Categorization | w/ code | ICCV 2019 |
| S3N | Selective Sparse Sampling for Fine-Grained Image Recognition | w/ code | ICCV 2019 |
| DFL-CNN | Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition | w/o code | CVPR 2018 |
| MaxEnt | Maximum-Entropy Fine Grained Classification | w/o code | NeurIPS 2018 |
| MC-Loss | The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification | w/ code | IEEE TIP 2020 |
| FCAN | Fully Convolutional Attention Networks for Fine-Grained Recognition | w/o code | n/a |
| RA-CNN | Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition | w/o code | CVPR 2017 |
| ViT | An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale | w/ code | ICLR 2021 |

Pro-and-Con Comparison

| Model | Pros / Contributions | Cons | Accuracy |
| --- | --- | --- | --- |
| MMAL-Net | √ End-to-end multi-branch network; √ Effectively learns the object's discriminative regions for recognition; √ The Attention Object Location Module reaches its localization accuracy using only category labels; √ Presents an Attention Part Proposal Method (APPM) that needs no part annotations; √ Outperforms the state-of-the-art methods and baselines on three standard benchmark datasets | × | CUB-200: 89.6%; Stanford Cars: 94.7%; FGVC-Aircraft: 95.0% |
| PMG | √ Novel progressive training strategy that operates in different training steps; √ Cultivates the inherent complementary properties across different granularities for fine-grained feature learning; √ A simple yet effective jigsaw puzzle generator to form different levels of granularity; √ Obtains state-of-the-art or competitive performance on all three standard FGVC benchmark datasets | × | CUB-200: 89.6%; Stanford Cars: 95.1%; FGVC-Aircraft: 93.4% |
| DFL-CNN | √ Enhances the mid-level learning capability of the classical CNN by introducing a bank of discriminative filters; √ Simple and effective; √ High human interpretability; √ Consistent performance across different fine-grained visual domains and various network architectures | × | CUB-200: 87.4%; Stanford Cars: 93.8%; FGVC-Aircraft: 92.0% |
| CIN | √ Proposes a self-channel interaction (SCI) module able to model the interplay between different channels within an image; √ Proposes a novel contrastive channel interaction (CCI) module to learn channel-wise relationships between images; √ Achieves better performance than the current state of the art | × | CUB-200: 88.1%; Stanford Cars: 94.5%; FGVC-Aircraft: 92.8% |
| Cross-X | √ Cross-X learning approach for fine-grained feature learning; √ Explores relationships between features from different images and different network layers; √ Addresses the issue of robust multi-scale feature learning through cross-layer regularization | × | CUB-200: 87.7%; Stanford Cars: 94.6%; FGVC-Aircraft: 92.7%; Stanford Dogs: 88.9%; NABirds: 86.4% |
| DCL | √ A novel Destruction and Construction Learning (DCL) framework; √ State-of-the-art performance; √ No need for extra part/object annotation; √ No computational overhead at inference time | × | Stanford Cars: 93.0%; FGVC-Aircraft: 94.5% |
| S3N | √ Novel Selective Sparse Sampling framework; √ Substantial improvement over the baselines in model accuracy and in the ability to mine visual evidence | × | CUB-200: 88.5%; Stanford Cars: 94.7%; FGVC-Aircraft: 92.8% |
| ViT | √ Latest trend; √ Experiments with applying a standard Transformer directly to images, with the fewest possible modifications | × | Oxford Flowers: 99.68% (ViT-H/14), 99.74% (ViT-L/16); Oxford-IIIT Pets: 97.56% (ViT-H/14), 97.32% (ViT-L/16) |

Brief Conclusion

After the summary and comparison, I draw the following conclusions about the current SOTA.

First, it clarifies the two types of end-to-end networks for fine-grained classification.

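The DFL-CNN paper points out that "more recent CNN-based approaches are usually trained end-to-end and can be roughly divided into two categories: localization-classification subnetworks and end-to-end feature encoding," similar to the two-stage versus one-stage split in detection/segmentation. Following the line of my earlier experiments, and in terms of implementation details, my method falls under part-based end-to-end feature encoding FGVC. In practice only image-level class labels are usually available, so the localization in localization-classification subnetwork methods is largely semi-supervised or weakly supervised.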

Second, the methods differ in design, emphasis, and starting point.

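Works such as MC-Loss and MaxEnt improve a plain network from the angle of the loss function or metrics, which demands solid derivations and proofs about probability distributions. Multi-level networks generally sum multiple losses, while the only label usually available for supervision is the image label, and more and more works have abandoned bbox/part annotations. Cross-entropy loss therefore remains the most common loss, and some intermediate layers naturally end up semi-supervised or weakly supervised.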
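To make the "sum of multiple losses under a single image label" idea concrete, here is a minimal framework-free sketch. It assumes each branch of a multi-level network emits its own logits and that all branches are supervised by the same image-level label. The names `cross_entropy`, `total_loss`, `branch_logits`, and `weights` are hypothetical and not taken from any of the papers above, which each use their own weighting schemes.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw logits."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(branch_logits, label, weights=None):
    """Weighted sum of per-branch cross-entropy losses.

    Every branch is supervised by the same image-level label, because
    no bbox or part annotation is assumed to be available.
    """
    if weights is None:
        weights = [1.0] * len(branch_logits)  # plain sum (an assumption)
    if len(weights) != len(branch_logits):
        raise ValueError("need exactly one weight per branch")
    return sum(w * cross_entropy(logits, label)
               for w, logits in zip(weights, branch_logits))
```

For instance, `total_loss([coarse_logits, fine_logits], label)` would add the two branch losses with equal weight; whether equal weighting is appropriate depends on the specific model.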

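Works such as MMAL-Net, PMG, DCL, and S3N present themselves as designing a novel framework, with new modules and new ways of using features. Different part-based works split the image into parts slightly differently during preprocessing, but this step has become a must-have in fine-grained classification. For example, MMAL-Net develops AOLM and APPM as a two-level pair of modules for region preselection and object part proposal. The jigsaw generator in PMG directly cuts the image into puzzle patches, shuffles them, and reassembles them into an image, but paired with progressive training it does not seem independent enough, and the two may be coupled.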
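The cut, shuffle, and reassemble operation described for the jigsaw generator can be sketched as follows. This is an illustration of the idea only, not PMG's released code: it works on a plain nested list instead of an image tensor, and the function name `jigsaw_generator` and the parameters `n` and `rng` are assumptions made for this example.

```python
import random


def jigsaw_generator(image, n, rng=None):
    """Cut an image into n x n patches, shuffle them, and reassemble.

    `image` is a list of rows (each row a list of pixel values); both
    sides must be divisible by n. Returns a new image of the same size.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height % n or width % n:
        raise ValueError("image sides must be divisible by n")
    ph, pw = height // n, width // n  # patch height and width

    # Cut the image into patches in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(n)
        for c in range(n)
    ]
    rng.shuffle(patches)

    # Paste the shuffled patches back onto an n x n grid.
    out = []
    for r in range(n):
        for y in range(ph):
            new_row = []
            for c in range(n):
                new_row.extend(patches[r * n + c][y])
            out.append(new_row)
    return out
```

A larger `n` gives smaller patches and therefore a finer granularity; with `n = 1` the image comes back unchanged.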

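Works that improve the use of feature maps propose innovations in channel selection, attention, pooling, and so on, exploring feature weights, high- and low-frequency signals, and relationships between different images and channels to improve overall model performance. Further improvements along this line easily draw a "limited novelty" warning and make a paper look thin. Designing a module that truly impresses is very hard, but still worth trying.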