Figure 2. Illustration of the general paradigm of HCMoCo.
We group modalities of human data into dense representations I∗d and sparse representations I∗s.
Three levels of embeddings (i.e. global, dense and sparse embeddings) are extracted.
Considering the nature of human data and tasks, we design contrastive learning targets for each level of embedding.
Specifically, we present a) Sample-level Modality-invariant Representation Learning; b) Dense Intra-sample Contrastive Learning;
and c) Sparse Structure-aware Contrastive Learning.
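One way to make the dense intra-sample target in (b) concrete is an InfoNCE loss over spatially aligned feature maps, where corresponding locations of two modalities form positive pairs and the remaining locations of the same sample act as negatives. The sketch below is a minimal illustration under assumed tensor shapes; the function name, temperature, and shapes are ours, not the paper's released implementation.

```python
# Minimal sketch of a dense intra-sample contrastive (InfoNCE) loss.
# Assumptions: feat_a/feat_b are (B, C, H, W) dense features of the *same* samples
# from two modalities (e.g. RGB and depth); names are illustrative placeholders.
import torch
import torch.nn.functional as F

def dense_intra_sample_infonce(feat_a, feat_b, temperature=0.07):
    B, C, H, W = feat_a.shape
    a = F.normalize(feat_a.flatten(2), dim=1)                   # (B, C, H*W), unit-norm over channels
    b = F.normalize(feat_b.flatten(2), dim=1)                   # (B, C, H*W)
    # Pairwise cosine similarities between all locations of the two modalities.
    logits = torch.einsum('bci,bcj->bij', a, b) / temperature   # (B, H*W, H*W)
    # The matching spatial location is the positive class for each query location.
    target = torch.arange(H * W, device=feat_a.device).repeat(B)
    return F.cross_entropy(logits.reshape(-1, H * W), target)
```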
Figure 3. Pipelines of Two Applications of HCMoCo.
Building on the HCMoCo pre-training framework, we further extend it to two direct applications:
Cross-Modality Supervision and Missing-Modality Inference. Both
extensions rest on the key design of HCMoCo: the dense
intra-sample contrastive learning target. With the feature
maps of different modalities aligned, the two extensions are
straightforward to implement.
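As a rough illustration of why aligned feature maps make these extensions simple, the hypothetical sketch below shares a single dense prediction head across an RGB and a depth encoder; the class, argument names, and head architecture are assumptions for exposition, not the released HCMoCo code.

```python
# Hypothetical sketch: one dense head shared across modalities whose features
# were aligned during pre-training. Names and structure are illustrative only.
import torch.nn as nn

class SharedHeadParser(nn.Module):
    def __init__(self, rgb_encoder: nn.Module, depth_encoder: nn.Module,
                 feat_dim: int, num_parts: int):
        super().__init__()
        self.rgb_encoder = rgb_encoder      # pre-trained, e.g. with HCMoCo
        self.depth_encoder = depth_encoder  # dense features aligned with RGB
        self.head = nn.Conv2d(feat_dim, num_parts, kernel_size=1)

    def forward(self, rgb=None, depth=None):
        # Cross-Modality Supervision: labels available for one modality (e.g. RGB)
        # train the shared head, which then also applies to the other modality.
        # Missing-Modality Inference: if one modality is absent at test time, the
        # same head runs on the other modality's (aligned) features.
        feats = self.rgb_encoder(rgb) if rgb is not None else self.depth_encoder(depth)
        return self.head(feats)
```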
Figure 4. Illustration of the RGB-D human parsing dataset NTURGBD-Parsing-4K.
We contribute the first RGB-D human parsing
dataset:
NTURGBD-Parsing-4K. The RGB and depth data are
uniformly sampled from NTU RGB+D (60/120).
We annotate 24 human parts for paired
RGB-D data.
The train and test sets each contain 1963 samples; the whole
dataset contains 3926 samples. We hope that contributing
this dataset will promote the development of both human
perception and multi-modality learning.
Downloads are available
here.