HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images

1University of Science and Technology of China, 2Nanjing University, 3Max-Planck-Institut für Informatik, 4Macau University of Science and Technology,
5Hong Kong University of Science and Technology, 6ShanghaiTech University, 7Texas A&M University
* Corresponding authors
HGGT teaser: 3D hand mesh reconstruction from uncalibrated multi-view images

We introduce Hand Geometry Grounding Transformer (HGGT), a scalable and generalizable solution for 3D hand mesh recovery. Our method unifies diverse data sources to achieve robust performance across varying camera viewpoints and environments.

TL;DR

We present the first feed-forward framework that jointly estimates 3D hand meshes and camera poses from uncalibrated multi-view images.

Abstract

Recovering high-fidelity 3D hand geometry from images is a critical task in computer vision, holding significant value for domains such as robotics, animation, and VR/AR. Crucially, scalable applications demand both accuracy and deployment flexibility: the ability to leverage massive amounts of unstructured image data from the internet, and to deploy on consumer-grade RGB cameras without complex calibration. However, current methods face a dilemma. While single-view approaches are easy to deploy, they suffer from depth ambiguity and occlusion. Conversely, multi-view systems resolve these uncertainties but typically demand fixed, calibrated setups, limiting their real-world utility. To bridge this gap, we draw inspiration from 3D foundation models that learn explicit geometry directly from visual data. By reformulating hand reconstruction from arbitrary views as a visual-geometry grounded task, we propose a feed-forward architecture that, for the first time in the literature, jointly infers 3D hand meshes and camera poses from uncalibrated views. Extensive evaluations show that our approach outperforms state-of-the-art methods on public benchmarks and demonstrates strong generalization to uncalibrated, in-the-wild scenarios.

Method

HGGT pipeline

The pipeline of HGGT. Given uncalibrated multi-view images, we first employ a VGGT Aggregator to extract image tokens and initial camera tokens. These are processed alongside randomly initialized, learnable hand tokens via a series of Cross-attention Blocks. Finally, two parallel heads predict the camera parameters and the canonical MANO parameters (θ, β, t), which can be re-projected onto the input views for verification.
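To make the data flow concrete, below is a minimal shape-level sketch of this pipeline in NumPy. It is an illustration under assumed dimensions, not the released implementation: the token sizes, the number of blocks, the 7-dim camera parameterization, and the 48/10/3 split of the MANO output (pose θ, shape β, translation t) are all placeholder choices, and real cross-attention blocks would also include normalization and feed-forward sublayers.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    # Single-head cross-attention: query tokens attend to image/camera tokens.
    # Weights are random here; in training they would be learned.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    return queries + attn @ v  # residual update

V, N_img, d = 4, 196, 64                             # views, tokens/view, dim
image_tokens  = rng.standard_normal((V * N_img, d))  # from the VGGT Aggregator
camera_tokens = rng.standard_normal((V, d))          # initial camera tokens
hand_tokens   = rng.standard_normal((1, d))          # learnable hand token(s)

# Hand and camera tokens query the pooled image/camera context through a
# small stack of cross-attention blocks.
context = np.concatenate([image_tokens, camera_tokens], axis=0)
for _ in range(4):
    hand_tokens   = cross_attention(hand_tokens, context, d)
    camera_tokens = cross_attention(camera_tokens, context, d)

# Two parallel linear heads: per-view camera parameters and canonical MANO
# parameters (theta, beta, t). Output sizes below are illustrative.
W_cam  = rng.standard_normal((d, 7))             # e.g. quaternion + translation
W_mano = rng.standard_normal((d, 48 + 10 + 3))   # theta | beta | t

cam_params  = camera_tokens @ W_cam              # (V, 7)
mano_params = (hand_tokens @ W_mano)[0]          # (61,)
theta, beta, t = mano_params[:48], mano_params[48:58], mano_params[58:]
print(cam_params.shape, theta.shape, beta.shape, t.shape)
```

The predicted MANO parameters would then be decoded to a mesh and re-projected into each view using the predicted cameras for verification, a step omitted here.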

Synthetic Dataset

Synthetic dataset samples

We introduce a new synthetic dataset and a mixed-data training strategy that effectively leverages real monocular data, real multi-view data, and synthetic multi-view data. The diversity of data sources significantly enhances the model's generalization across domains. Here are some samples from our synthetic dataset. It contains diverse photorealistic hand-object interactions rendered with randomized camera viewpoints, providing critical viewpoint diversity absent in real-world captures.
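One simple way to realize such a mixed-data strategy is to sample each training batch from one of the three sources according to fixed mixing ratios. The sketch below illustrates this idea only; the ratios and source names are hypothetical and not the schedule used in our training.

```python
import random

# Hypothetical mixing ratios -- the actual training schedule is not specified here.
SOURCES = {
    "real_monocular":      0.3,  # single-view real images
    "real_multiview":      0.4,  # calibrated multi-view captures
    "synthetic_multiview": 0.3,  # rendered hand-object scenes, random viewpoints
}

def sample_source(rng):
    """Pick the data source for the next batch according to the mixing ratios."""
    r, acc = rng.random(), 0.0
    for name, p in SOURCES.items():
        acc += p
        if r < acc:
            return name
    return name  # guard against floating-point rounding at the boundary

rng = random.Random(42)
schedule = [sample_source(rng) for _ in range(1000)]
counts = {s: schedule.count(s) for s in SOURCES}
print(counts)
```

Sampling sources per batch (rather than mixing within a batch) keeps each batch homogeneous, so monocular and multi-view examples can use different loss terms without padding or masking.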

Qualitative Comparison

Qualitative comparison on InterHand2.6M, OakInk, HO3D, Arctic and DexYCB. We compare our method against baselines on open benchmarks. Rows correspond to: Input RGB, Ground Truth, (a) Cameras Predicted by VGGT + POEM-large, (b) POEM-large, (c) Cameras Predicted by Ours + POEM-large, and (d) Ours (Full).

Interactive 3D Visualization

Select a dataset and sample to explore the 3D hand mesh reconstruction interactively. Each visualization contains three panels: predicted mesh, ground truth, and an overlay view.

In-the-Wild Video Inference

Camera setup

Uncalibrated phone camera placement for Demo 1, Demo 2, and Demo 3.
In-the-wild video inference: Demo 1, Demo 2, and Demo 3.

Qualitative results on in-the-wild video sequences. The videos were captured using only two phones placed at arbitrary, uncalibrated positions. Our model processes each video frame by frame and produces accurate 3D hand reconstructions.

Demo Video Captured with Uncalibrated Cameras