Fangzhou Hong is currently a final-year Ph.D. student (2021-) at the College of Computing and Data Science, Nanyang Technological University, with MMLab@NTU, supervised by Prof. Ziwei Liu. Previously, he received a B.Eng. degree in Software Engineering from Tsinghua University (2016-2020). He was fortunate to intern at Meta Reality Labs Research in 2023. His research interests lie in 3D computer vision and its intersection with computer graphics.
One paper DiffTF++ accepted to TPAMI.
One paper HMD2 accepted to 3DV 2025.
Four papers accepted to ECCV 2024.
Invited talk at the 1st Workshop on EgoMotion.
Two papers accepted to CVPR 2024.
Two papers accepted to TPAMI (4D-DS-Net and MotionDiffuse).
Two papers accepted to NeurIPS 2023 (one spotlight, one poster).
We are hosting the OmniObject3D challenge.
Three papers accepted to ICCV 2023.
I am recognized as a CVPR 2023 Outstanding Reviewer.
One paper (AvatarCLIP) accepted to SIGGRAPH 2022 (journal track).
One paper (Garment4D) accepted to NeurIPS 2021.
I am awarded the Google PhD Fellowship 2021 (Machine Perception).
One paper (extended Cylinder3D) accepted to TPAMI.
Two papers (DS-Net and Cylinder3D) accepted to CVPR 2021.
Start my journey in MMLab@NTU!
3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors
arXiv Preprint, 2024
Text-to-3D Generation within 5 Minutes! A two-stage design, utilizing both 3D diffusion prior and 2D priors.
Unified 3D and 4D Panoptic Segmentation via Dynamic Shifting Networks
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Extension of the CVPR21 Version; Extend DS-Net to 4D panoptic LiDAR segmentation by the temporally unified instance clustering on aligned LiDAR frames.
SHERF: Generalizable Human NeRF from a Single Image
International Conference on Computer Vision (ICCV), 2023
Reconstruct human NeRF from a single image in one forward pass!
EVA3D: Compositional 3D Human Generation from 2D Image Collections
International Conference on Learning Representations (ICLR), 2023 (Spotlight)
EVA3D is a high-quality unconditional 3D human generative model that only requires 2D image collections for training.
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
ACM Transactions on Graphics (SIGGRAPH), 2022
AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages.
Versatile Multi-Modal Pre-Training for Human-Centric Perception
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
The first to leverage the multi-modal nature of human data (e.g. RGB, depth, 2D key-points) for effective human-centric representation learning.
Garment4D: Garment Reconstruction from Point Cloud Sequences
35th Conference on Neural Information Processing Systems (NeurIPS), 2021
The first attempt at separable and interpretable garment reconstruction from point cloud sequences, especially challenging loose garments.
LiDAR-based Panoptic Segmentation via Dynamic Shifting Network
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Rank 1st in the public leaderboard of SemanticKITTI panoptic segmentation (2020-11-16); A learnable clustering module is designed to adapt kernel functions to complex point distributions.
DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Extension of our ICLR 2024 paper DiffTF. Joint training of diffusion model and Triplane representation increases the generation quality.
HMD2: Environment-aware Motion Generation from Single Egocentric Head-Mounted Device
International Conference on 3D Vision (3DV), 2025
We propose HMD2, the first system for the online generation of full-body self-motion using a single head-mounted device (e.g. Project Aria Glasses) equipped with an outward-facing camera in complex and diverse environments.
Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
European Conference on Computer Vision (ECCV), 2024
A large-scale, diverse, richly annotated human motion dataset collected in the wild with multi-modal egocentric devices.
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
European Conference on Computer Vision (ECCV), 2024
LN3Diff creates high-quality 3D object mesh from text within 8 V100-SECONDS.
StructLDM: Structured Latent Diffusion for 3D Human Generation
European Conference on Computer Vision (ECCV), 2024
StructLDM is a diffusion-based unconditional 3D human generative model learned from 2D images.
Large Motion Model for Unified Multi-Modal Motion Generation
European Conference on Computer Vision (ECCV), 2024
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Dynamic human rendering with the joint modeling of motion dynamics and appearance.
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Unbounded 3D cities generated from 2D image collections!
Large-Vocabulary 3D Diffusion Model with Transformer
International Conference on Learning Representations (ICLR), 2024
DiffTF achieves state-of-the-art large-vocabulary 3D object generation performance with 3D-aware transformers.
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
The first diffusion-model-based text-driven motion generation framework with probabilistic mapping, realistic synthesis and multi-level manipulation ability.
Unified 3D and 4D Panoptic Segmentation via Dynamic Shifting Networks
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Extension of the CVPR21 Version; Extend DS-Net to 4D panoptic LiDAR segmentation by the temporally unified instance clustering on aligned LiDAR frames.
PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
PrimDiffusion performs the diffusion and denoising process on a set of primitives which compactly represent 3D humans.
4D Panoptic Scene Graph Generation
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023 (Spotlight)
To allow artificial intelligence to develop a comprehensive understanding of a 4D world, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding.
SHERF: Generalizable Human NeRF from a Single Image
International Conference on Computer Vision (ICCV), 2023
Reconstruct human NeRF from a single image in one forward pass!
DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields
International Conference on Computer Vision (ICCV), 2023
We learn a style field that deforms real 3D faces to stylized 3D faces.
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
International Conference on Computer Vision (ICCV), 2023
ReMoDiffuse is a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process, which enhances the generalizability and diversity.
EVA3D: Compositional 3D Human Generation from 2D Image Collections
International Conference on Learning Representations (ICLR), 2023 (Spotlight)
EVA3D is a high-quality unconditional 3D human generative model that only requires 2D image collections for training.
HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
European Conference on Computer Vision (ECCV), 2022 (Oral)
A large-scale multi-modal (color images, point clouds, keypoints, SMPL parameters, and textured meshes) 4D human dataset with 1000 human subjects, 400k sequences and 60M frames.
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
ACM Transactions on Graphics (SIGGRAPH), 2022
AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages.
Versatile Multi-Modal Pre-Training for Human-Centric Perception
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
The first to leverage the multi-modal nature of human data (e.g. RGB, depth, 2D key-points) for effective human-centric representation learning.
Garment4D: Garment Reconstruction from Point Cloud Sequences
35th Conference on Neural Information Processing Systems (NeurIPS), 2021
The first attempt at separable and interpretable garment reconstruction from point cloud sequences, especially challenging loose garments.
LiDAR-based Panoptic Segmentation via Dynamic Shifting Network
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Rank 1st in the public leaderboard of SemanticKITTI panoptic segmentation (2020-11-16); A learnable clustering module is designed to adapt kernel functions to complex point distributions.
Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Journal Extension of the CVPR21 version; Extend the cylindrical convolution to more general LiDAR-based perception tasks.
Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021 (Oral)
Rank 1st in the public leaderboard of SemanticKITTI semantic segmentation (2020-11-16); Cylindrical 3D convolution is designed to explore the 3D geometric pattern of LiDAR point clouds.
LRC-Net: Learning Discriminative Features on Point Clouds by Encoding Local Region Contexts
Computer Aided Geometric Design, 2020, 79: 101859. (SCI, 2017 Impact factor: 1.421, CCF B)
To learn discriminative features on point clouds by encoding the fine-grained contexts inside and among local regions simultaneously.
EgoLM: Multi-Modal Language Model of Egocentric Motions
arXiv Preprint, 2024
EgoLM is a language model-based framework that tracks and understands egocentric motions from multi-modal inputs, i.e., egocentric videos and sparse motion sensors.
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
arXiv Preprint, 2024
3DTopia-XL scales high-quality 3D asset generation using a Diffusion Transformer (DiT) built upon an expressive and efficient 3D representation, PrimX. The denoising process takes 5 seconds to generate a 3D PBR asset from text / image input that is ready for use in graphics pipelines.
GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation
arXiv Preprint, 2024
GaussianCity is a framework for efficient unbounded 3D city generation using 3D Gaussian Splatting.
FashionEngine: Interactive Generation and Editing of 3D Clothed Humans
arXiv Preprint, 2024
FashionEngine is an interactive 3D human generation and editing system with multimodal control (e.g., texts, images, hand-drawing sketches).
3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors
arXiv Preprint, 2024
Text-to-3D Generation within 5 Minutes! A two-stage design, utilizing both 3D diffusion prior and 2D priors.
HumanLiff: Layer-wise 3D Human Generation with Diffusion Model
arXiv Preprint, 2023
We generate 3D digital humans using a 3D diffusion model in a controllable, layer-wise way.
PointHPS: Cascaded 3D Human Pose and Shape Estimation from Point Clouds
arXiv Preprint, 2023
SMPL reconstruction from real depth sensor data, i.e., partial point cloud inputs.
ECCV 2024 Outstanding Reviewer
Google PhD Fellowship 2021
Outstanding Undergraduate Thesis of Tsinghua University
Outstanding Graduate of Tsinghua University
Outstanding Graduate of Beijing
Outstanding Graduate of School of Software, Tsinghua University
ICBC Scholarship (Top 3%)
Hua Wei Scholarship (Top 1%)
Tung OOCL Scholarship (Top 5%)
From High-Fidelity 3D Generative Models to Dynamic Embodied Learning
Conference Reviewer: CVPR’21/23/24/25, ICCV’23, ECCV’24, NeurIPS’22/23/24, ICML’23/24, ICLR’24, SIGGRAPH’23/24, SIGGRAPH Asia’23/24, AAAI’21/23, 3DV’25
Journal Reviewer: TPAMI, IJCV, TVCG, TCSVT, JABES, PR
Guest Lecture on 3D Generative Models @ UMich EECS 542.
NTU CE/CZ1115 Introduction to Data Science and Artificial Intelligence (Teaching Assistant)
NTU CE2003 Digital System Design (Teaching Assistant)
NTU CE/CZ1115 Introduction to Data Science and Artificial Intelligence (Teaching Assistant)
NTU SC1013 Physics for Computing (Teaching Assistant)