ArtiClarity — COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

An interesting attention-based approach to joint representation learning for video and text

Shayan Fazeli
Mar 17, 2021

In this post, I want to discuss the following article:

Ging, Simon, et al. “Coot: Cooperative hierarchical transformer for video-text representation learning.” arXiv preprint arXiv:2011.00597 (2020).

which I found to be one of the more interesting papers at NeurIPS 2020. So, let's go over the paper and highlight the parts I find important, in order to provide a reasonably complete summary of it.

Abstract

  • Motivation: video-text tasks often require reasoning at several different levels of granularity (e.g., frames, clips, and the whole video; words, sentences, and the whole paragraph).
  • They propose a hierarchical model that has three main components:
  • 1) an attention-based way of aggregating features
  • 2) transformer-based representation learning
  • 3) a cross-modal cycle-consistency loss (to bridge the learning of representations for video and text)
  • Code is available

Introduction

  • Imagine you have videos and text describing an action such as “making tea”. Such an action is composed of different sub-steps (e.g., “pouring tea into the cup”), each of which can be performed in different ways; to understand the overall action, one would naturally break the recording down into these components and grasp the gist of each. This natural way of thinking is what motivated the authors to design a hierarchical system.
  • Given that the domain they consider involves both video AND text, they follow the recent trend of utilizing cross-modal alignment losses (please refer to the article for the full list of references).
  • Let’s look at the schematic of their model (the architecture figure in the paper).
  • In that figure, they have temporal transformers (which, according to their codebase, use sine/cosine-based positional encodings to encode time).
  • To aggregate features, they use an “attention-based feature aggregation” module, referred to in the pipeline as Attention-FA. This is better than, for example, average pooling: such a generic aggregation risks incorporating a lot of uninformative content into the final representation, dampening the impact of the main points of focus (a minimal sketch of this kind of attention pooling follows this list).
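
To make the idea concrete, here is a minimal sketch of attention-based pooling over a sequence of per-frame features. The module name, dimensions, and the single learnable query are my own illustrative choices; the paper’s Attention-FA module is more elaborate, so treat this only as the general flavor of “learned weights instead of uniform averaging”.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPooling(nn.Module):
    """Toy attention-based feature aggregation: a learned query scores each
    time step, and the output is the attention-weighted sum of the features.
    (Illustrative sketch only -- not the exact Attention-FA module from the paper.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # feats: (batch, time, dim); mask: (batch, time), True for valid time steps.
        keys = self.key_proj(feats)                           # (B, T, D)
        scores = keys @ self.query / feats.shape[-1] ** 0.5   # (B, T)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = F.softmax(scores, dim=1)                    # attention over time
        return (weights.unsqueeze(-1) * feats).sum(dim=1)     # (B, D)


# Example: pool 64 frame features of dimension 384 into a single clip vector.
pool = AttentionPooling(dim=384)
clip_vec = pool(torch.randn(2, 64, 384))  # -> shape (2, 384)
```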

Preliminaries

  • The first part goes over semantic alignment losses. They mainly build on Zhang et al. [21], which introduces a max-margin contrastive loss (my reconstruction of it follows this list).
  • To better understand it, consider a positive pair (x, y) together with negative samples x' and y'. The loss is the sum of two triplet margin terms, once with x as the anchor and once with y as the anchor.
  • This part of the paper gives a clear introduction to the method of Zhang et al. [21]. It essentially applies this contrastive loss repeatedly, for different alignment objectives, in a shared embedding space.
  • The rest of Section 2 is fairly clear; it mainly explains the pipeline for combining global and local information to obtain representations at the different levels of the hierarchy.
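
For reference, here is my reconstruction of that alignment loss from the description above, where (x, y) is a positive pair, x' and y' are negatives, D is the distance in the shared embedding space, and α is the margin. The exact summation over negatives and the choice of D are in the paper, so treat this as a sketch of the two-anchor structure rather than a verbatim copy:

```latex
\ell(x, y) = \max\bigl(0,\; \alpha + D(x, y) - D(x, y')\bigr)
           + \max\bigl(0,\; \alpha + D(x, y) - D(x', y)\bigr)
```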

Cross-Modal Cycle Consistency

This is the loss they introduce, and it is very interesting. The idea: given representations of local segments (sentences and clips), start from a sentence representation and (“softly”) find the clip representation closest to it. Then use that soft clip representation to find the closest sentence representation. Did you get back to where you started? If yes, good; if not, the loss penalizes the model.
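
To make the “soft” cycling concrete, here is a minimal sketch of a cycle-consistency penalty in this spirit. The squared-distance softmax and the squared-index penalty are my assumptions about the general recipe, not a faithful reimplementation of the paper’s exact loss:

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(sentences: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
    """Soft cycle consistency between sentence and clip embeddings.

    sentences: (n, d) sentence embeddings; clips: (m, d) clip embeddings.
    For each sentence i, compute a soft nearest-neighbor clip, cycle back to the
    sentences, and penalize how far the soft "landing index" is from i.
    (Illustrative sketch only -- not the paper's exact formulation.)
    """
    # Soft nearest clip for every sentence: softmax over negative squared distances.
    d_sc = torch.cdist(sentences, clips) ** 2          # (n, m)
    alpha = F.softmax(-d_sc, dim=1)                     # (n, m)
    soft_clips = alpha @ clips                          # (n, d)

    # Cycle back: soft distribution over sentences for each soft clip.
    d_cs = torch.cdist(soft_clips, sentences) ** 2      # (n, n)
    beta = F.softmax(-d_cs, dim=1)                       # (n, n)

    # The expected landing index should match the starting index i.
    idx = torch.arange(sentences.shape[0], dtype=sentences.dtype)
    mu = beta @ idx                                      # (n,) soft landing positions
    return ((mu - idx) ** 2).mean()


# Example: 5 sentences and 7 clips with 128-dim embeddings.
loss = cycle_consistency_loss(torch.randn(5, 128), torch.randn(7, 128))
```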

The empirical results also validate their methodology, and I really suggest reading about the datasets, tasks, and evaluation mechanisms they employ in the paper (https://arxiv.org/abs/2011.00597).

Thank you.
