Kirin — Animal Motion Generation from In-the-Wild Video

01

Abstract

A unified framework that reconstructs 3D motion from in-the-wild video, learns priors at scale on the resulting dataset, and generates realistic motion conditioned on text and image.

Subject: Quadruped animals
Input: Single image + text prompt
Output: Rigged 3D animated mesh
Dataset: AiM3D — text · video · motion
Backbone: SMAL · MDM-adapted
Status: Under review

Understanding animal motion is fundamental to modeling behavior and biomechanics, yet progress has lagged far behind human motion research because high-quality data is scarce. Capturing motion in controlled environments is impractical for most species, leaving existing datasets small and domain-limited.

Kirin sidesteps capture entirely. We reconstruct 3D motion sequences from large collections of in-the-wild animal video, pair them with VLM-generated captions, and release AiM3D, the first large-scale dataset offering aligned video–text–motion tuples for quadrupeds. On top of it we train a visually guided generation model that conditions on both text and image, then use an off-the-shelf image-to-3D model to produce a mesh that is automatically rigged and animated, yielding ready-to-render animated animals from a single picture.

Contribution 01

AiM3D — a large-scale animal motion dataset with aligned text, video, and 3D motion.

Contribution 02

The first animal motion generation model conditioned on both text and image input.

Contribution 03

A fully automatic system that turns a single 2D image into a 3D animated mesh and outperforms prior 4D animation methods.

02

Method

A two-stage pipeline. A diffusion-based motion model conditioned on text and image generates the 3D motion; an automatic rigging module applies it to a T-posed mesh produced from the same image.
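The source does not say how the rig deforms the mesh once motion is applied. The following is a minimal sketch, assuming linear blend skinning (the scheme SMAL itself uses) and assuming each joint's rest-to-posed rigid transform has already been computed from the generated motion. `skin_vertices` and its argument names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]          # 3x3 rotation, row-major
JointTransform = Tuple[Mat3, Vec3]        # (rotation, translation), rest -> posed


def transform_point(rotation: Mat3, translation: Vec3, point: Vec3) -> Vec3:
    """Apply one rigid joint transform: rotation @ point + translation."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def skin_vertices(
    rest_vertices: List[Vec3],
    skin_weights: List[Sequence[float]],
    joint_transforms: List[JointTransform],
) -> List[Vec3]:
    """Linear blend skinning: each posed vertex is the weight-blended result
    of moving the rest-pose vertex with every joint that influences it."""
    posed = []
    for vertex, weights in zip(rest_vertices, skin_weights):
        blended = [0.0, 0.0, 0.0]
        for weight, (rotation, translation) in zip(weights, joint_transforms):
            if weight == 0.0:
                continue  # joint has no influence on this vertex
            moved = transform_point(rotation, translation, vertex)
            for axis in range(3):
                blended[axis] += weight * moved[axis]
        posed.append(tuple(blended))
    return posed


# Toy rig (illustrative values): joint 0 stays put, joint 1 lifts by one unit in y.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
transforms = [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (0.0, 1.0, 0.0))]
posed = skin_vertices([(1.0, 0.0, 0.0)], [(0.5, 0.5)], transforms)
```

Running this once per generated frame, with that frame's joint transforms, would yield the animated mesh sequence. The skinning weights themselves would come from the automatic rigging step.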

Fig. 02 · Pipeline

Left — generation pipeline. A text description and an image are provided as inputs. Both are used for motion generation, while the image is also used to generate a T-posed mesh. The animation module rigs the generated motion onto the mesh to produce the final animated 3D model.

Right — motion generation architecture. Text features are extracted with a frozen DistilBERT encoder; image features with a frozen DINOv3 encoder. Text, image, and denoising-step embeddings are combined and fed into a transformer decoder with cross-attention to produce the motion sequence.
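One way to read the right-hand panel, as a rough sketch: the denoising-step embedding, text tokens, and image tokens form a conditioning memory, and the noisy motion frames attend to it. Several details here are assumptions not stated in the source: the three embeddings are concatenated along the sequence axis, the step embedding is sinusoidal, and attention is shown single-head without learned projections. All function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def timestep_embedding(step: int, dim: int) -> Vector:
    """Sinusoidal embedding of the denoising step (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


def build_condition_memory(
    text_tokens: List[Vector], image_tokens: List[Vector], step: int, dim: int
) -> List[Vector]:
    """Assumed fusion: one step token, then text tokens, then image tokens."""
    return [timestep_embedding(step, dim)] + text_tokens + image_tokens


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Each motion-frame query takes a softmax-weighted average of the memory."""
    scale = 1.0 / math.sqrt(len(memory[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, m)) for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(len(q))]
        )
    return outputs


# Toy shapes: 1 text token, 2 image tokens, 3 noisy motion frames, width 4.
text_tokens = [[1.0, 0.0, 0.0, 0.0]]
image_tokens = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
memory = build_condition_memory(text_tokens, image_tokens, step=0, dim=4)
noisy_motion = [[0.0] * 4 for _ in range(3)]
attended = cross_attention(noisy_motion, memory)
```

In a real decoder the encoders would supply the text and image tokens, queries and memory would pass through learned projections, and several such layers would be stacked. The sketch only illustrates how the conditioning signals can reach every motion frame.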

Reconstruction backbone

Fig. 03 · Reconstruction

Overview of motion reconstruction. We leverage off-the-shelf 3D quadruped reconstruction and 3D tracking methods to infer articulation and global translation separately, then combine them to produce the final per-frame 3D motion that populates AiM3D.
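A minimal sketch of the final combination step, assuming articulation has already been converted to root-relative 3D joint positions and translation is one root offset per frame. Neither representation is specified in the source, and `compose_world_motion` is a hypothetical name.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def compose_world_motion(
    local_joints: List[List[Vec3]], root_translations: List[Vec3]
) -> List[List[Vec3]]:
    """Add each frame's global root translation to its root-relative joints."""
    if len(local_joints) != len(root_translations):
        raise ValueError("articulation and translation must cover the same frames")
    return [
        [tuple(joint[a] + offset[a] for a in range(3)) for joint in frame]
        for frame, offset in zip(local_joints, root_translations)
    ]


# Two frames, two joints (illustrative values): the animal moves one unit along x.
articulation = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.5)],
]
translation = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
world_motion = compose_world_motion(articulation, translation)
```

Keeping the two signals separate until this point lets each come from the method best suited to it, which matches the split between reconstruction and tracking described above.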

03

Dataset

A large-scale animal motion dataset with aligned video, text, and 3D motion. Reconstructions cover diverse species, behaviours, and environments far beyond what controlled capture can offer.

Fig. 04 · AiM3D samples

Examples from AiM3D. Each row shows a real-world video sequence aligned with two VLM-generated captions describing the animal's motion, and the corresponding reconstructed 3D motion (visualized as keypoints). Reconstructions span species, gait, and behaviour — rhinos grazing, foxes sniffing and resting, rabbits hopping, horses trotting under saddle.
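For illustration only, one aligned sample could be represented as below. The class name, field names, file path, and on-disk layout are all assumptions, since the release format of AiM3D is not described here. The only facts taken from the caption are that a sample pairs a video with captions and a keypoint-based motion sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple

Keypoint = Tuple[float, float, float]


@dataclass
class MotionSample:
    """Hypothetical container for one aligned video / text / motion tuple."""
    video_path: str
    captions: List[str]
    motion: List[List[Keypoint]]  # frames x keypoints, each an (x, y, z) triple

    def num_frames(self) -> int:
        return len(self.motion)

    def validate(self) -> None:
        """Check that the tuple is usable: captions exist, frames are uniform."""
        if not self.captions:
            raise ValueError("a sample needs at least one caption")
        if len({len(frame) for frame in self.motion}) > 1:
            raise ValueError("every frame must have the same number of keypoints")


# Placeholder path and values, not taken from the dataset.
sample = MotionSample(
    video_path="clips/example.mp4",
    captions=["A horse trots forward.", "The horse moves at a steady trot."],
    motion=[
        [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
        [(0.1, 0.0, 0.0), (0.1, 1.0, 0.0)],
    ],
)
sample.validate()
```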

Global trajectories

04

Comparisons

Side-by-side qualitative comparisons of generated mesh and skeletal animation against the strongest published baseline.

Generated mesh animation

Cmp.01 · "A moose walks straight ahead." Panels: reference input, Puppeteer baseline, Kirin (ours).

Cmp.02 · "A horse trots forward." Panels: reference input, Puppeteer baseline, Kirin (ours).

Generated skeletal motion

Fig. 05 · Skeleton

Generated skeletal motion comparison. Sequences from our model versus the strongest published baseline at matched frames. Kirin produces gait cycles with cleaner foot contact and a more plausible global trajectory.

Animated mesh — gallery comparison

Fig. 06 · Mesh gallery

Comparison against Puppeteer*. A reproduction of Puppeteer's setup (no public code at the time of writing) generates static T-posed meshes that do not animate the prompted action. Kirin produces full motion sequences faithful to the input image and prompt. * Reproduced from the paper.

05

Results

Across motion-generation metrics and 4D animation benchmarks, Kirin outperforms or matches prior methods. Best results in color; second-best underlined.

Tab. 01 · Text-conditioned animal motion generation. Higher R-Precision and Diversity are better; lower FID and MM-Dist are better.

Method	R-Prec. Top-1 ↑	R-Prec. Top-3 ↑	FID ↓	MM-Dist ↓	Diversity ↑
Real motion	0.512±.004	0.823±.003	0.002±.001	2.974±.008	9.503±.065
MDM	0.298±.005	0.612±.004	1.842±.040	4.918±.011	8.612±.078
MotionDiffuse	0.321±.004	0.658±.005	1.514±.034	4.671±.012	8.804±.080
T2M-GPT	0.354	0.701	1.288±.030	4.402	9.014
MoMask	0.346±.005	0.689±.004	1.196	4.477±.010	8.961±.071
Kirin (text-only)	0.392±.004	0.748±.003	0.974±.026	4.142±.009	9.182±.069

Even without image conditioning, Kirin's architecture trained on AiM3D outperforms the strongest text-to-motion baselines across all five standard metrics. Numbers reported on the AiM3D test split with 95% confidence intervals.
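For readers unfamiliar with the retrieval metrics in Tab. 01, here is a minimal sketch of how R-Precision and MM-Dist are commonly computed in text-to-motion work. It assumes paired text and motion embeddings from a pretrained evaluator and Euclidean distance. The evaluator, pool size, and exact protocol used for these numbers are not stated, so the details below are assumptions.

```python
import math
from typing import List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def r_precision(text_emb: List[Vector], motion_emb: List[Vector], top_k: int) -> float:
    """Fraction of captions whose paired motion (same index) ranks within the
    top_k nearest motions in the pool."""
    hits = 0
    for i, text in enumerate(text_emb):
        dists = [euclidean(text, motion) for motion in motion_emb]
        ranked = sorted(range(len(dists)), key=dists.__getitem__)
        if i in ranked[:top_k]:
            hits += 1
    return hits / len(text_emb)


def mm_dist(text_emb: List[Vector], motion_emb: List[Vector]) -> float:
    """Mean distance between each caption and its own paired motion."""
    pairs = zip(text_emb, motion_emb)
    return sum(euclidean(t, m) for t, m in pairs) / len(text_emb)


# Toy pool of three pairs (illustrative 2-D embeddings).
texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motions = [[0.1, 0.0], [0.9, 0.1], [1.0, 0.2]]
top1 = r_precision(texts, motions, top_k=1)
top3 = r_precision(texts, motions, top_k=3)
mean_dist = mm_dist(texts, motions)
```

In the toy pool the third motion sits far from its caption, so it is missed at Top-1 and recovered at Top-3, and it also dominates the mean distance. This is why the two metrics tend to move together in the table.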

Tab. 02 · Image + text-conditioned animal motion generation. Adding image conditioning further improves identity preservation and motion plausibility.

Method	R-Prec. Top-1 ↑	R-Prec. Top-3 ↑	FID ↓	MM-Dist ↓	Image-CLIP ↑
Real motion	0.512±.004	0.823±.003	0.002±.001	2.974±.008	0.298±.003
MDM + img	0.331±.005	0.671±.004	1.487±.038	4.604±.011	0.221±.003
MoMask + img	0.378	0.722	1.041	4.291	0.247
Kirin (full)	0.421±.004	0.776±.003	0.812±.022	3.978±.010	0.281±.003

Conditioning on the input image alongside the text prompt — through a frozen DINOv3 encoder fused with text and time embeddings — yields the largest gains on FID and Image-CLIP, indicating that motion better matches the appearance and identity of the depicted animal.
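As a quick check on the size of these gains, the relative change from the text-only row of Tab. 01 to the full row of Tab. 02 can be worked out for the four metrics the two tables share. This assumes both rows were evaluated on the same test split. Image-CLIP is left out because Tab. 01 does not report it for the text-only model.

```python
# Values copied from the "Kirin (text-only)" row of Tab. 01
# and the "Kirin (full)" row of Tab. 02.
text_only = {"R-Prec. Top-1": 0.392, "R-Prec. Top-3": 0.748, "FID": 0.974, "MM-Dist": 4.142}
full = {"R-Prec. Top-1": 0.421, "R-Prec. Top-3": 0.776, "FID": 0.812, "MM-Dist": 3.978}
lower_is_better = {"FID", "MM-Dist"}


def relative_gain(metric: str) -> float:
    """Improvement in percent of the text-only value, signed so that
    positive always means the full model is better."""
    before, after = text_only[metric], full[metric]
    change = (before - after) if metric in lower_is_better else (after - before)
    return 100.0 * change / before


gains = {metric: round(relative_gain(metric), 1) for metric in text_only}
largest = max(gains, key=gains.get)
# gains   -> {'R-Prec. Top-1': 7.4, 'R-Prec. Top-3': 3.7, 'FID': 16.6, 'MM-Dist': 4.0}
# largest -> 'FID'
```

Among the shared metrics, FID improves by roughly 17 % relative, more than twice the relative gain on any retrieval metric.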

Tab. 03 · 4D animation from a single image. Mesh-level evaluation against published 4D animation methods.

Method	CLIP-T ↑	CLIP-I ↑	Temporal Cons. ↑	User Pref. ↑
Animate124	0.218±.004	0.704±.005	0.812±.006	11.4 %
4D-fy	0.231±.004	0.728±.005	0.847±.005	14.8 %
Puppeteer*	0.262	0.781	0.881	21.7 %
Kirin (ours)	0.291±.003	0.812±.004	0.924±.004	52.1 %

In the user study, 38 participants compared 24 prompt × image pairs from a held-out evaluation set and picked the most realistic animation for each prompt. Kirin was preferred in 52.1 % of comparisons, a majority. * Reproduced; no public code at submission.

06

Citation

Bibliographic information will be finalised on release. A placeholder entry is shown below.

BibTeX · placeholder

@misc{kirin2026,
  title  = {Kirin: Animal Motion Generation from In-the-Wild Video},
  author = {Anonymous},
  year   = {2026},
  note   = {Preprint, under review}
}