Multimodal Paper Reading: VLMo

news 2025/8/23 4:44:45

VLMo: A Quick Read

Title
Motivation
Contribution
Model
Experiments
Summary

Title

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Motivation

CLIP and ALIGN both use a dual-encoder architecture that encodes images and text separately, with the two modalities interacting only through cosine similarity. This approach is extremely effective for retrieval tasks, but such a shallow interaction between images and text is not enough to handle complex vision-language (VL) classification tasks. The ViLT paper, for example, finds that CLIP gives relatively low accuracy on a visual reasoning (VR) task. A later line of work adopts a fusion encoder instead: images and text are first processed separately, and a Transformer encoder then performs the cross-modal interaction. This architecture makes up for the drawback of the dual encoder, but for retrieval it has to jointly encode every possible image-text pair to compute similarity scores. The resulting quadratic time complexity makes inference much slower than with dual-encoder models, whose time complexity is linear. So, **is there a way to combine the two architectures?** Use the dual-encoder architecture for retrieval and the fusion encoder for classification. That is what this paper proposes with Mixture-of-Modality-Experts (MoME).
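To make the linear-versus-quadratic contrast concrete, here is a minimal sketch that only counts Transformer forward passes needed to score every image-text pair. The function name and the gallery sizes are illustrative assumptions, not taken from the paper.

```python
def encoder_passes(num_images: int, num_texts: int) -> dict:
    """Count Transformer forward passes needed to score all image-text pairs."""
    return {
        # Dual encoder: encode each image and each text once, then compare
        # the cached embeddings with cheap cosine similarities.
        "dual_encoder": num_images + num_texts,
        # Fusion encoder: every (image, text) pair is encoded jointly.
        "fusion_encoder": num_images * num_texts,
    }


# Hypothetical retrieval gallery: 1,000 images and 5,000 captions.
counts = encoder_passes(num_images=1000, num_texts=5000)
print(counts)
```

The dual encoder still computes one similarity per pair, but that is a single dot product on cached vectors rather than a full forward pass through the model.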
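VLMo is trained with three losses: image-text contrastive (ITC), image-text matching (ITM), and masked language modeling (MLM), the same as ALBEF. The paper also proposes a stagewise pre-training strategy that makes use of the large-scale corpora available separately in vision and NLP: first pre-train on image-only data, then pre-train the language experts on text-only data, and finally use the model for vision-language pre-training.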
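As an illustration of the ITC objective, the sketch below computes a symmetric in-batch contrastive loss over cosine similarities in plain Python. It assumes the usual formulation (softmax cross-entropy in both the image-to-text and text-to-image directions); the fixed `temperature=0.07` and all function names are assumptions for this example, not values confirmed by these notes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cross_entropy(logits, target):
    """Softmax cross-entropy for one row, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k in the batch is the positive for row k."""
    n = len(image_embs)
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # Image-to-text: each row of the similarity matrix is one classification problem.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: the same, but over the columns.
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return (i2t + t2i) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = itc_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = itc_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

Matched pairs drive the loss toward zero, while swapped pairs are penalized heavily, which is what pulls paired image and text embeddings together for retrieval.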

Contribution

Model improvement: Mixture-of-Modality-Experts
Training improvement: stagewise pre-training
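The stagewise schedule can be sketched as a small table of which modules get updated in each stage. This assumes the freezing scheme as described in the VLMo paper (self-attention and the vision expert are frozen while the language expert is trained on text); the module names are hypothetical labels, not identifiers from any codebase.

```python
ALL_MODULES = {"self_attention", "vision_ffn", "language_ffn", "vision_language_ffn"}

# Assumed schedule: (stage description, modules that receive gradient updates).
STAGES = [
    ("vision pre-training on image-only data", {"self_attention", "vision_ffn"}),
    ("language pre-training on text-only data", {"language_ffn"}),
    ("vision-language pre-training on image-text pairs", ALL_MODULES),
]


def untouched(trainable: set) -> set:
    """Modules that receive no gradient updates in a stage (frozen or unused)."""
    return ALL_MODULES - trainable


for name, trainable in STAGES:
    print(f"{name}: train {sorted(trainable)}, leave {sorted(untouched(trainable))}")
```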

Model

Overview of the model

All multi-head self-attention layers in the model share weights across modalities.
Inference is flexible: switch to whichever architecture the task calls for.
Stagewise training strategy
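A toy version of one MoME block may help show the routing idea: a single self-attention function is shared, and the caller picks which FFN expert to apply. This is a simplified sketch, not the paper's implementation. The attention uses Q = K = V with no learned projections, layer normalization is omitted, and each "expert" is a stand-in with one made-up parameter. The three expert keys mirror the vision, language, and vision-language experts of the paper.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Toy single-head attention with Q = K = V = tokens (no learned projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def make_ffn(scale: float) -> Callable[[Vector], Vector]:
    """Stand-in for an expert FFN: each expert owns its own parameter (`scale`)."""
    return lambda x: [max(0.0, scale * v) for v in x]


class MoMEBlock:
    def __init__(self) -> None:
        self.attention = self_attention  # one module shared by every modality
        self.experts: Dict[str, Callable[[Vector], Vector]] = {
            "vision": make_ffn(0.5),
            "language": make_ffn(1.0),
            "vision_language": make_ffn(2.0),
        }

    def forward(self, tokens: List[Vector], expert: str) -> List[Vector]:
        attended = self.attention(tokens)
        # Residual connection around the shared attention.
        h = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
        ffn = self.experts[expert]  # route to the chosen modality expert
        # Residual connection around the selected expert FFN.
        return [[x + y for x, y in zip(tok, ffn(tok))] for tok in h]


block = MoMEBlock()
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
as_vision = block.forward(image_tokens, "vision")
as_fused = block.forward(image_tokens, "vision_language")
print(as_vision != as_fused)
```

Because only the FFN differs between paths, the same stack of blocks can act as a dual encoder (vision or language expert per input) or as a fusion encoder (vision-language expert on the concatenated sequence) without duplicating the attention weights.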

Experiments

VLMo performs considerably better than ALBEF.
Training on a larger dataset improves the results further.

Summary

In essence, the FFN in each Transformer encoder block is split into several modality-specific FFNs.
