


CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Venue / Journal: arXiv

Peng Gao*1, Shijie Geng*2, Renrui Zhang*1, Teli Ma1, Rongyao Fang3,
Yongfeng Zhang2, Hongsheng Li3, Yu Qiao

1Shanghai AI Laboratory    2Rutgers University    3The Chinese University of Hong Kong

{gaopeng, zhangrenrui, qiaoyu}@pjlab.org.cn
sg1309@rutgers.edu, hsli@ee.cuhk.edu.hk


Abstract:

Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in (Radford et al., 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models beyond prompt tuning. Whereas prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
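
To make the adapter design concrete, the following PyTorch sketch shows a bottleneck adapter applied on top of frozen CLIP image features, followed by residual-style blending with the original features. It is a minimal illustration only: the feature dimension (512), reduction factor (4), blend ratio alpha (0.2), logit scale (100), and the names Adapter and residual_blend are assumptions for this sketch, not values prescribed by the abstract.

# Minimal sketch of a CLIP-Adapter-style module (illustrative values assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, ReLU, up-project."""
    def __init__(self, dim: int, reduction: int = 4):  # reduction factor is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

def residual_blend(adapted: torch.Tensor, original: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Residual-style blending of newly learned features with the frozen pre-trained features."""
    return alpha * adapted + (1.0 - alpha) * original

# Usage: f holds frozen CLIP image features, w holds text features of class prompts.
dim, num_classes = 512, 100                 # assumed sizes for illustration
f = torch.randn(8, dim)                     # batch of 8 pre-extracted image features
w = torch.randn(num_classes, dim)           # one text embedding per class prompt
adapter = Adapter(dim)                      # the only trainable parameters
f_star = residual_blend(adapter(f), f)      # adapted features blended with the originals
logits = 100.0 * F.normalize(f_star, dim=-1) @ F.normalize(w, dim=-1).t()
print(logits.shape)                         # torch.Size([8, 100])

The same adapter-and-blend pattern can be placed on the language branch instead; in either case only the small adapter is updated during few-shot training while the CLIP backbone stays frozen.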
