EgoLM: Multi-Modal Language Model of Egocentric Motions

  • 1 Meta Reality Labs Research
  • 2 S-Lab, Nanyang Technological University
  • 3 University of Tübingen
TL;DR
EgoLM is a language model-based framework that tracks and understands egocentric motions
from multi-modal inputs, i.e., egocentric videos and sparse motion sensors.
Abstract
With the growing prevalence of wearable devices, learning egocentric motions becomes essential for developing contextual AI. In this work, we present EgoLM, a versatile framework that tracks and understands egocentric motions from multi-modal inputs, e.g., egocentric videos and motion sensors. EgoLM exploits rich contexts to disambiguate egomotion tracking and understanding, which are ill-posed under single-modality conditions. To facilitate this versatile, multi-modal framework, our key insight is to model the joint distribution of egocentric motions and natural language using large language models (LLMs). Multi-modal sensor inputs are encoded and projected into the joint latent space of the language model, and used to prompt motion generation or text generation for egomotion tracking or understanding, respectively. Extensive experiments on a large-scale multi-modal human motion dataset validate the effectiveness of EgoLM as a generalist model for universal egocentric learning.
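To make the joint-distribution idea concrete, below is a minimal sketch (not the released implementation) of how discrete motion tokens can be appended to a text vocabulary so a single autoregressive transformer models motion and language with one next-token objective. Vocabulary sizes, model widths, and the class name JointMotionLanguageModel are illustrative assumptions.

# Minimal sketch: joint autoregressive modeling of text and motion tokens.
# All sizes and names are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000          # assumed base LLM vocabulary size
MOTION_CODES = 512          # assumed motion VQ-VAE codebook size
VOCAB = TEXT_VOCAB + MOTION_CODES

class JointMotionLanguageModel(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) mixing text ids in [0, TEXT_VOCAB) and
        # motion ids in [TEXT_VOCAB, VOCAB); causal mask for next-token prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)

# usage: next-token cross-entropy over a mixed motion-language sequence
model = JointMotionLanguageModel()
tokens = torch.randint(0, VOCAB, (2, 16))
logits = model(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))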
Results of Egomotion Tracking and Understanding
Upper left: Input egocentric videos; Lower left: Input sparse motion sensors; Upper middle: Ground truth motion;
Upper right: Our motion tracking results; Lower middle: Our motion understanding results.
Method Overview
Figure 1. Overview of EgoLM.

EgoLM is trained in three steps. In the first step, we train a motion VQ-VAE as the motion tokenizer. The second step is motion pre-training, which learns the motion distribution. The last step is multi-modal instruction tuning, which guides the model to perform motion tracking and understanding.
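For the first step, the sketch below shows one common way a motion VQ-VAE tokenizer can be set up: a temporal encoder, nearest-neighbor vector quantization with a straight-through estimator, and a decoder that reconstructs the pose sequence. Pose dimension, codebook size, and layer widths are illustrative assumptions, not the paper's configuration.

# Minimal sketch of a motion VQ-VAE tokenizer (illustrative, not the released code).
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    def __init__(self, pose_dim=66, d_latent=128, n_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, d_latent, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_latent, d_latent, kernel_size=3, padding=1))
        self.codebook = nn.Embedding(n_codes, d_latent)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(d_latent, d_latent, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_latent, pose_dim, kernel_size=3, padding=1))

    def quantize(self, z):
        # z: (batch, d_latent, frames) -> nearest codebook entry per latent frame
        z = z.permute(0, 2, 1)                          # (B, T, C)
        dist = torch.cdist(z, self.codebook.weight)     # (B, T, n_codes)
        ids = dist.argmin(dim=-1)                       # discrete motion tokens
        z_q = self.codebook(ids)                        # (B, T, C)
        # straight-through estimator so gradients reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q.permute(0, 2, 1), ids

    def forward(self, motion):
        # motion: (batch, frames, pose_dim)
        z = self.encoder(motion.permute(0, 2, 1))
        z_q, ids = self.quantize(z)
        recon = self.decoder(z_q).permute(0, 2, 1)
        return recon, ids

# usage: tokenize a 64-frame motion clip into a sequence of code indices
vqvae = MotionVQVAE()
motion = torch.randn(1, 64, 66)
recon, ids = vqvae(motion)   # ids: (1, 32) motion tokens at half the temporal rate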

Figure 2. Details of Multi-Modal Instruction Tuning.

Different modalities, e.g., egocentric videos and sparse motion sensors, are encoded and projected into the language model's embedding space. Their features are concatenated in the order given by the instruction template and fed into the transformer layers of the language model.
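The sketch below illustrates this input path under stated assumptions: per-modality features are mapped by small projectors into the language-model hidden size and concatenated along the sequence dimension in template order. The encoders are stubbed out with dummy tensors, and names such as ModalityProjector and build_prompt are hypothetical.

# Minimal sketch of multi-modal projection and concatenation (illustrative only).
import torch
import torch.nn as nn

D_LM = 512  # assumed language-model hidden size

class ModalityProjector(nn.Module):
    """Project per-frame modality features into the LM embedding space."""
    def __init__(self, in_dim, d_lm=D_LM):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_lm)

    def forward(self, feats):            # (batch, frames, in_dim)
        return self.proj(feats)          # (batch, frames, d_lm)

video_proj = ModalityProjector(in_dim=768)   # e.g. features from a frozen video encoder
imu_proj = ModalityProjector(in_dim=54)      # e.g. sparse 6-DoF motion sensor signals

def build_prompt(text_embed, video_feats, imu_feats):
    # Concatenate embedded instruction text and projected modality features
    # along the sequence dimension, following the instruction template order.
    return torch.cat([text_embed,
                      video_proj(video_feats),
                      imu_proj(imu_feats)], dim=1)

# usage with dummy tensors standing in for encoder outputs
text_embed = torch.randn(1, 12, D_LM)      # embedded instruction tokens
video_feats = torch.randn(1, 30, 768)      # per-frame video features
imu_feats = torch.randn(1, 30, 54)         # sparse motion sensor features
prompt = build_prompt(text_embed, video_feats, imu_feats)
print(prompt.shape)                        # torch.Size([1, 72, 512])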

Bibtex
@article{EgoLM,
    title={EgoLM: Multi-Modal Language Model of Egocentric Motions},
    author={Fangzhou Hong and Vladimir Guzov and Hyo Jin Kim and Yuting Ye and Richard Newcombe and Ziwei Liu and Lingni Ma},
    journal={arXiv preprint arXiv:2409.18127},
    year={2024}
}
            
Related Projects

Project Aria: we use Project Aria glasses in our research to capture egocentric videos.

Nymeria Dataset: a large-scale, diverse, richly annotated human motion dataset collected in the wild with multi-modal egocentric devices.