CoCa (Contrastive Captioners)
Pretraining method:
- encoder-decoder models
- encoder
- dual encoder
- decoder
- transfer learning
multimodal: CoCa is trained on paired text data + image data
modality: In the context of human–computer interaction, a modality is the classification of a single independent channel of sensory input/output between a computer and a human. A system is designated unimodal if it has only one modality implemented, and multimodal if it has more than one. When multiple modalities are available for some tasks or aspects of a task, the system is said to have overlapping modalities. If multiple modalities are available for a task, the system is said to have redundant modalities. Multiple modalities can be used in combination to provide complementary methods that may be redundant but convey information more effectively. Modalities can be generally defined in two forms: human-computer and computer-human modalities.
Losses
Contrastive loss:
pull positive samples (matched image-text pairs) together and push negative samples (mismatched pairs) apart in the shared embedding space
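A minimal sketch of a contrastive loss over a batch of image and text embeddings, written in plain Python for readability. It assumes the common symmetric softmax formulation (image-to-text plus text-to-image); the function names and the fixed `temperature=0.07` are illustrative assumptions, not values taken from these notes.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs[i] and text_embs[i] are assumed to be a matched (positive) pair;
    # every other combination in the batch acts as a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text: each image should rank its own text highest.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image: each text should rank its own image highest.
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return i2t + t2i
```

Minimizing this value raises the similarity on the diagonal (positives) relative to the off-diagonal entries (negatives), which is the pull/push behaviour described above.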
Captioning loss:
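A captioning loss is usually a next-token cross-entropy: the text decoder sees the image features and the caption tokens so far, and is penalized by the negative log-probability of the true next token. The sketch below assumes the decoder logits are already computed elsewhere; `step_logits` and `target_ids` are hypothetical names, and averaging over positions (rather than summing) is a choice made here for illustration.

```python
import math

def captioning_loss(step_logits, target_ids):
    # step_logits[t]: vocabulary logits for caption position t, assumed to be
    # produced by a decoder conditioned on the image and on tokens before t.
    # target_ids[t]: index of the ground-truth token at position t.
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        m = max(logits)
        # Stable log-sum-exp gives the log of the softmax normalizer.
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-likelihood of the correct token at this position.
        total += lse - logits[target]
    return total / len(target_ids)
```

With uniform logits over a vocabulary of size V the loss equals log(V); it shrinks toward zero as the decoder puts more probability on the correct tokens.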
zero-shot manner: