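This is an emergency post.
I had never fully sorted out the pros and cons of the FGVC state of the art. Every group in this subfield tells its own story, and the work of the last three years shows no clear lineage between methods and no agreed taxonomy, which has caused me some trouble. It is as if everyone working in this direction were sitting around one table. One person says, "I have a method, hmm" (lays it out). Right after, another says, "Oh, what a coincidence, I have one too" (holds it out). Then a third speaks up: "Ah, I've got yet another one here" (slaps it down). Models and algorithms keep iterating this way, and some models differ in accuracy by only 0.1%.
Recently, building on an AAAI 2020 paper, I designed my own graph-construction method and introduced a GCN for classification. The model immediately stopped converging and I have not managed to rescue it so far; earlier code problems had also made me miss AAAI-21 (withdrawn). After our group's meeting with Ming-Hsuan today, all three of us who presented were feeling rather deflated. I feel I am still far behind top-level peers, to the point that I am not yet at a stage where guidance from an advisor of this caliber is even needed. I noted a few useful points during the meeting, one of which is this pros-and-cons survey. Time is tight, but it still has to be done, and it is something I can start on right away. I also want to use it as a starting point for countering the cons of these works.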
Paper List
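This section collects the papers and code I have been following recently. Some of the papers come from the Papers with Code FGVC leaderboard; the others were picked from the comparison experiments and references of those papers. The list therefore does not cover the full SOTA landscape. It is my subjective selection of works with clear frameworks, modules, and methods that are designed specifically for fine-grained tasks and preferably have public code, gathered here for easy lookup.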
Pro-and-Con Comparison
Model | Pros / Contributions | Cons | Accuracy |
---|---|---|---|
MMAL-Net | √ End-to-end multi-branch network √ Learn object’s discriminative regions for recognition effectively √ The accuracy of Attention Object Location Module is achieved by only using category labels √ Present an Attention Part Proposal Method (APPM) without the need of part annotations √ Outperform the state-of-the-art methods and baselines on three standard benchmark datasets | × | CUB-200: 89.6%; Stanford Cars: 94.7%; FGVC-Aircraft: 95.0% |
PMG | √ Novel progressive training strategy operates in different training steps √ Cultivate the inherent complementary properties across different granularities for fine-grained feature learning √ A simple yet effective jigsaw puzzle generator to form different levels of granularity √ PMG obtains state-of-the-art or competitive performances on all three standard FGVC benchmark datasets | × | CUB-200: 89.6%; Stanford Cars: 95.1%; FGVC-Aircraft: 93.4% |
DFL-CNN | √ Enhance the mid-level learning capability of the classical CNN by introducing a bank of discriminative filters √ Simple and effective √ High human interpretability √ Consistent performance across different fine-grained visual domains and various network architectures | × | CUB-200: 87.4%; Stanford Cars: 93.8%; FGVC-Aircraft: 92.0% |
CIN | √ Propose a self-channel interaction (SCI) module able to model the interplay between different channels within an image √ Propose a novel contrastive channel interaction (CCI) module to learn channel-wise relationships between images √ Method achieves better performance over current state-of-the-art | × | CUB-200: 88.1%; Stanford Cars: 94.5%; FGVC-Aircraft: 92.8% |
Cross-X | √ Cross-X learning approach for fine-grained feature learning √ Cross-X learning explores relationships between features from different images and different network layers √ Address the issue of robust multi-scale feature learning through cross-layer regularization | × | CUB-200: 87.7%; Stanford Cars: 94.6%; FGVC-Aircraft: 92.7%; Stanford Dogs: 88.9%; NABirds: 86.4% |
DCL | √ A novel Destruction and Construction Learning (DCL) framework √ State-of-the-art performances √ No extra part/object annotation needed √ No computational overhead at inference time | × | Stanford Cars: 93.0%; FGVC-Aircraft: 94.5% |
S3N | √ Novel Selective Sparse Sampling framework √ Substantial improvement over the baselines concerning model accuracy and the ability of mining visual evidence | × | CUB-200: 88.5%; Stanford Cars: 94.7%; FGVC-Aircraft: 92.8% |
ViT | √ Latest trend √ Experiment with applying a standard Transformer directly to images, with the fewest possible modifications | × | Oxford Flowers: 99.68% (ViT-H/14), 99.74% (ViT-L/16); Oxford-IIIT Pets: 97.56% (ViT-H/14), 97.32% (ViT-L/16) |
Brief Conclusion
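After compiling and comparing these works, I draw the following conclusions about the current SOTA.
First, the survey clarified the two types of end-to-end networks for fine-grained classification.
The DFL-CNN paper points out that "more recent CNN-based approaches are usually trained end-to-end and can be roughly divided into two categories: localization-classification subnetworks and end-to-end feature encoding", which resembles the two-stage versus one-stage split in detection/segmentation. Judging by the design and implementation details of my earlier experiments, my method falls under part-based end-to-end feature encoding FGVC. In practice, usually only image-level class labels are available, so the localization in loc-cls subnetwork methods is largely semi-supervised or weakly supervised.
Second, the methods differ in design emphasis and starting point.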
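Works such as MC-Loss and MaxEnt improve a plain network from the angle of the loss function or metric, and they demand a fair amount of reasoning and proof about probability distributions. Multi-level networks generally sum several losses, yet the only supervision typically available is the image label, and more and more works drop bbox/part annotations. Cross-entropy therefore remains the most commonly used loss, and some intermediate layers naturally end up semi-supervised or weakly supervised.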
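To make the "sum of several losses against one image label" idea concrete, here is a minimal pure-Python sketch. It is not taken from any of the papers above: the function names `cross_entropy` and `multi_branch_loss`, the optional `weights` argument, and the example logits are all assumptions made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multi_branch_loss(branch_logits, label, weights=None):
    """Weighted sum of per-branch cross-entropy losses.

    Every branch is supervised by the same image-level label, because
    no bbox/part annotation is assumed to be available.
    """
    if weights is None:
        weights = [1.0] * len(branch_logits)  # plain sum by default
    return sum(w * cross_entropy(z, label) for w, z in zip(weights, branch_logits))


# Hypothetical logits from three branches for one image of class 0 (3 classes)
branches = [[2.0, 0.5, -1.0], [1.0, 1.0, 1.0], [0.2, 3.0, 0.1]]
total = multi_branch_loss(branches, label=0)
```

Since the intermediate branches only ever see the image label, any localization or part selection they learn is supervised indirectly, which is the weak-supervision situation described above.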
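Works such as MMAL-Net, PMG, DCL, and S3N present themselves as novel frameworks, designing new modules and new ways to use features. Part-based works differ slightly in how they split the image into patches during preprocessing, but patch splitting has become a staple of fine-grained classification. For example, MMAL-Net develops AOLM and APPM as a two-level pair of modules for region pre-selection and object part cropping; the jigsaw generator in PMG cuts the image into puzzle patches, shuffles them, and reassembles them into an image, but paired with progressive training it does not look independent enough, and the two may be coupled.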
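The cut, shuffle, and reassemble step of a jigsaw generator can be sketched in a few lines. This is only an illustration of the general idea, not PMG's released code: the function name `jigsaw_shuffle`, the nested-list image representation, and the `seed` parameter are assumptions, and a real pipeline would operate on image tensors.

```python
import random


def jigsaw_shuffle(image, n, seed=None):
    """Split a 2-D image (list of rows) into an n x n grid of patches,
    shuffle the patches, and stitch them back into an image of the same size."""
    h, w = len(image), len(image[0])
    if h % n or w % n:
        raise ValueError("image height and width must be divisible by n")
    ph, pw = h // n, w // n  # patch height and width

    # Cut the patches in row-major order
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(n)
        for c in range(n)
    ]
    random.Random(seed).shuffle(patches)

    # Stitch the shuffled patches back together, one grid row at a time
    out = []
    for r in range(n):
        row_patches = patches[r * n:(r + 1) * n]
        for y in range(ph):
            out.append([px for p in row_patches for px in p[y]])
    return out


# Toy 4x4 "image" whose pixel values are 0..15, cut into a 2x2 grid
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = jigsaw_shuffle(image, n=2, seed=0)
```

A larger `n` gives smaller patches and so a finer granularity, while `n=1` returns the image unchanged. That dependence on `n` is where the generator plausibly ties into a stage-wise training schedule.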
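Works that improve how feature maps are used propose innovations around channel selection, attention, pooling, and so on, exploring feature weights, high- and low-frequency signals, and relationships between different images and channels to raise overall model performance. Pushing further in this direction easily draws a "limited novelty" warning and makes a paper look thin. Designing a module that truly impresses is very hard, but it is still worth trying.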