FightingCV

2022/10/07

What tricks can help you chase SOTA in deep learning?

[Preface]

In deep learning, whether for academic research or real-world deployment, you want to squeeze out as much model performance as possible, and that usually takes tricks. Some tricks are broadly applicable (cyclical learning rates, BatchNorm, and so on); others are task-specific (data augmentation in CV, masking in NLP, negative down-sampling in recommendation).

Some of these tricks improve accuracy, some speed up convergence, and some deliver gains even larger than a model change.

In your own field, which tricks are commonly used, easy to apply, and easy to generalize?

Source: https://www.zhihu.com/question/540433389/answer/2549775065[1]

Author: Gordon Lee

  1. R-Drop: two forward passes plus a KL-divergence consistency loss.

  2. MLM: continue masked-language-model pre-training on in-domain corpora (post-training).

  3. EFL: in few-shot settings, recast classification as matching by formatting the input as an NSP-style pair.

  4. Mixed-precision fp16: speeds up training and cuts memory use, usually with no loss in accuracy.

  5. In multi-GPU DDP training with gradient accumulation, use no_sync to skip unnecessary gradient synchronization and speed things up.

  6. When the validation or test set is very large, try multi-GPU inference; what you need is dist.all_gather (or all_gather_object for non-tensor data).

  7. PET: in few-shot settings, turn classification into mask-position prediction and build a verbalizer; see PET (EACL 2021).

  8. ArcFaceLoss: for two-tower sentence matching, change the NT-Xent loss into an arccos-margin form; see ArcCSE (ACL 2022).

  9. Data augmentation for zero-shot cross-lingual transfer: code switching, machine translation, and so on; remember to add a consistency loss at the end; see "Consistency Regularization for Cross-Lingual Fine-Tuning".

  10. SimCSE: continue SimCSE pre-training on in-domain corpora.

  11. Focal loss: for handling class imbalance.

  12. Two-tower late interaction with the MaxSim operator: compute a similarity between each query token representation and each doc token representation, take the maximum per query token, then sum. A good balance of speed and accuracy; see ColBERT.

  13. Continual learning with less forgetting: EWC plus a strong pre-trained model works quite well. It simply adds a regularizer so that important parameters (importance measured by Fisher information) are not forgotten too much; training is slower, though.

  14. Adversarial training: FGM, PGD. Gains a few points, but training is slower.

  15. Memory bank to enlarge the effective batch size, though it sometimes feels marginal to me.

  16. PolyLoss: -log p_t + eps * (1 - p_t). Results are mixed: it did nothing for me, but others report gains.
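Item 1 above (R-Drop) amounts to two stochastic forward passes (dropout active both times) plus a symmetric KL term. A minimal PyTorch sketch; the function name and the `alpha` weight are illustrative choices:

```python
import torch
import torch.nn.functional as F

def rdrop_loss(logits1, logits2, labels, alpha=1.0):
    """Cross entropy on both forward passes plus a symmetric KL
    consistency term between their predicted distributions."""
    ce = 0.5 * (F.cross_entropy(logits1, labels) +
                F.cross_entropy(logits2, labels))
    kl = 0.5 * (
        F.kl_div(F.log_softmax(logits1, dim=-1),
                 F.softmax(logits2, dim=-1), reduction="batchmean") +
        F.kl_div(F.log_softmax(logits2, dim=-1),
                 F.softmax(logits1, dim=-1), reduction="batchmean"))
    return ce + alpha * kl

# Usage: call the model twice on the same batch so dropout masks differ.
# logits1, logits2 = model(x), model(x)
# loss = rdrop_loss(logits1, logits2, y)
```

When the two logits are identical the KL term vanishes and the loss reduces to plain cross entropy, which is a handy sanity check.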
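Item 11's focal loss down-weights easy examples by a factor of (1 - p_t)^gamma. A sketch for the multi-class case, with optional per-class `alpha` weights; with gamma = 0 and no alpha it reduces to ordinary cross entropy:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Focal loss: -(1 - p_t)^gamma * log(p_t), optionally class-weighted."""
    log_pt = F.log_softmax(logits, dim=-1).gather(
        1, targets.unsqueeze(1)).squeeze(1)   # log prob of the true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:                     # per-class weights, shape [C]
        loss = alpha[targets] * loss
    return loss.mean()
```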

Author: 昆特Alex

One guiding principle: AI performance = data (70%) + model (CNN, RNN, Transformer, BERT, GPT; 20%) + tricks (loss, warmup, optimizer, adversarial training, etc.; 10%). Remember: the data sets the upper bound on what your AI can achieve; the model and the tricks only help you approach that bound. As the old saying goes: garbage in, garbage out. Below are some concrete tricks for NLP:

I. Data Augmentation

1. Removing noisy data (maximum-entropy-based filtering, cleanlab, etc.)

2. Fixing mislabeled data: train several models via cross-validation and take the samples on which the models' predictions agree and have probability above a threshold (or the top N). The models can use different seeds, different train/test splits, or different architectures (e.g. BERT vs. TextCNN). Samples where most models' predictions disagree with the annotation are likely labeling errors; correct those.

3. Data augmentation:

  • Synonym Replacement: randomly pick n non-stopword tokens from the sentence and replace each with a randomly chosen synonym;

  • Random Insertion: randomly pick a non-stopword token from the sentence, choose one of its synonyms at random, and insert that synonym at a random position; repeat n times;

  • Random Swap: randomly pick two tokens in the sentence and swap their positions; repeat n times;

  • Random Deletion: remove each token in the sentence independently with probability p;

  • Back Translation: translate the source text into a pivot language and then back into the original language.
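The cross-validation cleanup in step 2 above can be sketched as follows, assuming each element of `probs_per_model` holds one model's predicted class probabilities for the same samples; the function name and the 0.9 threshold are illustrative:

```python
import numpy as np

def suspect_label_errors(probs_per_model, labels, threshold=0.9):
    """Flag samples where all models agree on a confident prediction
    that differs from the annotated label."""
    preds = [p.argmax(axis=1) for p in probs_per_model]
    confs = [p.max(axis=1) for p in probs_per_model]
    agree = np.all([pr == preds[0] for pr in preds], axis=0)
    confident = np.all([c >= threshold for c in confs], axis=0)
    mismatch = preds[0] != labels
    return np.where(agree & confident & mismatch)[0]
```

The flagged indices are candidates for relabeling (or for manual review), not automatic corrections.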
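Random Swap and Random Deletion from the bullets above take only a few lines when operating on whitespace tokens; the function names are illustrative:

```python
import random

def random_swap(tokens, n=1):
    """Swap two random token positions, n times."""
    tokens = tokens[:]                       # do not mutate the caller's list
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p."""
    if len(tokens) <= 1:
        return tokens[:]
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]  # never return empty

sentence = "the quick brown fox jumps over the lazy dog".split()
augmented = random_deletion(random_swap(sentence, n=2), p=0.1)
```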

II. Model Backbone

Transformers have dominated since BERT, and different pre-trained backbones suit different scenarios. If you have enough in-domain data and the budget allows, consider pre-training on industry corpora; failing that, do further in-domain pre-training; only then fall back to fine-tuning a public checkpoint. Small tips for choosing among public backbones:

  • RoBERTa-wwm-ext: performs well on single-sentence NLU tasks such as text classification and NER;

  • SimBERT: works well for sentence-similarity computation, sentence-pair relation judgment, and similar tasks;

  • GPT-family models: perform better on NLG tasks such as translation and summarization.

III. Other Training Tricks

  • Class imbalance: besides the data augmentation and oversampling described above, try focal loss or class-weighted losses.

  • Getting optimizer, learning rate, warmup, and batch size to work together can also buy surprisingly large gains (for example, when the batch size grows, the learning rate can usually be raised with it).

  • Training tricks: adversarial training (FGM, PGD), etc.

  • Multi-task learning: add an auxiliary loss.

  • Label smoothing: worth considering if accuracy is still unsatisfactory after noise removal, data augmentation, and the rest.

  • etc.
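The label-smoothing entry above is built into PyTorch's `CrossEntropyLoss` via the `label_smoothing` argument (available since PyTorch 1.10); the 0.1 below is just a common default:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# A confidently correct prediction: smoothing keeps the loss from
# collapsing toward zero, discouraging over-confident logits.
logits = torch.tensor([[8.0, -4.0, -4.0]])
target = torch.tensor([0])
plain = nn.CrossEntropyLoss()(logits, target)
smoothed = criterion(logits, target)
```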

Last but not least: AI performance = data (70%) + model (20%) + other tricks (10%). Spend your time on what improves the model most instead of stacking fancy tricks for buffs; tricks are only the icing, while the data and the backbone you choose are the cake.

Author: 爱睡觉的KKY

Here I list some fairly general tricks, all of which I have verified in my own papers or competitions:

  1. Try different weight-initialization schemes: different distributions and different distribution parameters. Networks initialized differently can perform noticeably differently; interested readers can see Kaiming He's paper Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[19].

  2. Initialize from different pre-training tasks. In the recent Google Universal Embedding competition, models pre-trained on ImageNet-1K (1K-class classification) versus ImageNet-21K behaved very differently. The main reason is that the 21K task has finer-grained classes, so the model attends more to fine-grained image detail and its outputs make better image-embedding vectors. Interested readers can see this discussion[20].

  3. Warmup + cosine LR scheduler: warm up first (the learning rate climbs gradually), then apply cosine decay. This schedule works very well for large models, and a ready-made implementation ships with the Hugging Face Transformers library.

  4. Layer-wise learning rates or per-layer LR schedules.

  5. Adversarial training to improve model robustness. There are many methods; the one I use most often is Adversarial Weight Perturbation (AWP); for an implementation, see this article[24].

  6. Stochastic Weight Averaging (SWA): average the model weights collected along the training trajectory to improve robustness; PyTorch has an official implementation.

  7. Pseudo labels / meta pseudo labels, a semi-supervised technique common in competitions:

  • Meta Pseudo Labels (https://arxiv.org/abs/2003.10580[27])

  • Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach (https://arxiv.org/pdf/2010.07835.pdf[28])

  8. TTA (test-time augmentation); pairs well with data augmentation.

  9. Data augmentation:

  • NLP: back translation, POS-based replacement, etc.

  • CV: resize, crop, flip, rotate, blur, HSV jitter, affine, perspective, Mixup, Cutout, CutMix, Random Erasing, Mosaic, CopyPaste, GAN-based domain transfer, etc.

  10. Distillation; see the paper Can Students Outperform Teachers in Knowledge Distillation Based Model Compression? (https://openreview.net/pdf?id=XZDeL25T12l[31])

  11. Structural re-parameterization; for details see the RepVGG paper (https://arxiv.org/abs/2101.03697[32]).

  12. Gradient checkpointing: saves GPU memory and buys you more modeling freedom.

  13. The ultimate answer: change the random seed (kidding).
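The warmup + cosine schedule mentioned above ships with the Hugging Face Transformers library as `get_cosine_schedule_with_warmup`; it is also easy to write directly against PyTorch's `LambdaLR`. A sketch, with arbitrary step counts:

```python
import math
import torch

def warmup_cosine(step, warmup_steps, total_steps):
    """LR multiplier: linear ramp to 1.0, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: warmup_cosine(s, warmup_steps=100, total_steps=1000))
# Call scheduler.step() once per optimizer step.
```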
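The Stochastic Weight Averaging entry above refers to PyTorch's official `torch.optim.swa_utils`. A toy sketch of what the running average does (the one-weight model is purely illustrative):

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(1, 1, bias=False)
swa_model = AveragedModel(model)   # keeps a running average of model's weights

# Pretend two checkpoints along training: weight 0.0, then weight 2.0.
with torch.no_grad():
    model.weight.fill_(0.0)
swa_model.update_parameters(model)
with torch.no_grad():
    model.weight.fill_(2.0)
swa_model.update_parameters(model)
# swa_model.module.weight now holds the running mean of the two snapshots.
```

In a real run you would call `update_parameters` once per epoch during the SWA phase, typically with `torch.optim.swa_utils.SWALR` holding the learning rate constant, and finish with `update_bn` to refresh BatchNorm statistics.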

All of the above are general single-model tricks. I will not enumerate domain-specific techniques here, but there is plenty of literature to draw on, for example:

  • Fine-grained classification / person re-ID: A Strong Baseline and Batch Normalization Neck for Deep Person Re-identification[35]

  • GAN training tips: Training Language GANs from Scratch[36]

[Recommended Projects]

A beginner-friendly library of core code from top-conference papers: https://github.com/xmu-xiaoma666/External-Attention-pytorch[37]

Must-read AI papers and video tutorials: https://github.com/xmu-xiaoma666/FightingCV-Course[38]

A beginner-friendly YOLO object-detection library: https://github.com/iscyy/yoloair[39]

Beginner-friendly walkthroughs of top-journal and top-conference papers: https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading[40]

[Community]

We run the deep-learning WeChat account FightingCV, focused on fresh paper walkthroughs, fundamentals, and academic exchange. Follow us!

Follow the FightingCV account and reply "ECCV2022" to receive the list of accepted ECCV papers.

You are also welcome to join the FightingCV chat group, which shares paper digests, algorithm and code tips, and academic discussion daily. To join, add the assistant on WeChat (FightngCV666) with the note: region-school(company)-name.

References

[1] https://www.zhihu.com/question/540433389/answer/2549775065

[19] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification: https://arxiv.org/abs/1502.01852

[20] Kaggle discussion: https://www.kaggle.com/competitions/google-universal-image-embedding/discussion/339554

[24] Fast AWP (Kaggle notebook): https://www.kaggle.com/code/junkoda/fast-awp

[27] Meta Pseudo Labels: https://arxiv.org/abs/2003.10580

[28] Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach: https://arxiv.org/pdf/2010.07835.pdf

[31] Can Students Outperform Teachers in Knowledge Distillation Based Model Compression?: https://openreview.net/pdf?id=XZDeL25T12l

[32] RepVGG: https://arxiv.org/abs/2101.03697

[35] A Strong Baseline and Batch Normalization Neck for Deep Person Re-identification: https://arxiv.org/abs/1906.08332

[36] Training Language GANs from Scratch: https://proceedings.neurips.cc/paper/2019/file/a6ea8471c120fe8cc35a2954c9b9c595-Paper.pdf

[37] https://github.com/xmu-xiaoma666/External-Attention-pytorch

[38] https://github.com/xmu-xiaoma666/FightingCV-Course

[39] https://github.com/iscyy/yoloair

[40] https://github.com/xmu-xiaoma666/FightingCV-Paper-Reading
