Kirin
Animal Motion Generation

Animal motion, from the wild.

Kirin learns 3D quadruped motion directly from internet video — and turns a single image plus a text prompt into a fully rigged, animated mesh.

Authors: Anonymous Authors
Status: Under review
In progress: This page is being prepared for public release. Expect more results, code, and dataset links shortly.
Kirin teaser — generated 3D animal mesh sequences
Fig. 01 · Kirin generations

Given an animal image and a text description of the desired motion, Kirin produces a 3D animated mesh sequence — trained on motion reconstructed from in-the-wild video, with an automatic rigging step at the end.

01 Abstract

Reconstruct, learn, generate.

A unified framework that reconstructs 3D motion from in-the-wild video, learns priors at scale on the resulting dataset, and generates realistic motion conditioned on text and image.

Subject: Quadruped animals
Input: Single image + text prompt
Output: Rigged 3D animated mesh
Dataset: AiM3D (text · video · motion)
Backbone: SMAL · MDM-adapted
Status: Under review

Understanding animal motion is fundamental to modeling behavior and biomechanics, yet progress has lagged far behind human motion research because high-quality data is scarce. Capturing motion in controlled environments is impractical for most species, leaving existing datasets small and domain-limited.

Kirin sidesteps capture entirely. We reconstruct 3D motion sequences from large collections of in-the-wild animal video, pair them with VLM-generated captions, and release AiM3D, the first large-scale dataset offering aligned video–text–motion tuples for quadrupeds. On top of it we train a visually guided generation model that conditions on both text and image. An off-the-shelf image-to-3D model then produces a mesh from the same image, which is automatically rigged and animated with the generated motion, yielding ready-to-render animated animals from a single picture.

Contribution 01 · AiM3D, a large-scale animal motion dataset with aligned text, video, and 3D motion.
Contribution 02 · The first animal motion generation model conditioned on both text and image input.
Contribution 03 · A fully automatic 2D-image → 3D animated mesh system, beating prior 4D animation methods.
02 Method

Text + image, conditioned generation.

A two-stage pipeline. A diffusion-based motion model conditioned on text and image generates the 3D motion; an automatic rigging module applies it to a T-posed mesh produced from the same image.

Kirin method overview
Fig. 02 · Pipeline
Left: generation pipeline. A text description and an image are provided as inputs. Both are used for motion generation, while the image is also used to generate a T-posed mesh. The animation module rigs the mesh and applies the generated motion to it, producing the final animated 3D model.
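
The data flow of the generation pipeline can be summarized in a few lines. This is only a schematic sketch: `generate_motion`, `image_to_tpose_mesh`, and `rig_and_animate` are hypothetical placeholder names, passed in as callables so that no particular library or model interface is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AnimatedMesh:
    mesh: Any    # T-posed mesh produced from the input image
    motion: Any  # generated 3D motion sequence
    frames: Any  # result of rigging the mesh and applying the motion


def run_pipeline(
    image: Any,
    prompt: str,
    generate_motion: Callable[..., Any],        # hypothetical: text + image -> motion
    image_to_tpose_mesh: Callable[[Any], Any],  # hypothetical: image -> T-posed mesh
    rig_and_animate: Callable[[Any, Any], Any], # hypothetical: mesh + motion -> animation
) -> AnimatedMesh:
    # Stage 1: the motion model sees both the text prompt and the image.
    motion = generate_motion(text=prompt, image=image)
    # The same image is also used to obtain a T-posed mesh.
    mesh = image_to_tpose_mesh(image)
    # Stage 2: the mesh is rigged and driven by the generated motion.
    frames = rig_and_animate(mesh, motion)
    return AnimatedMesh(mesh=mesh, motion=motion, frames=frames)
```

Passing the three stages as arguments keeps the sketch independent of any specific motion model, image-to-3D model, or rigging tool.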

Right: motion generation architecture. A frozen DistilBERT encoder extracts text features and a frozen DINOv3 encoder extracts image features. The text, image, and denoising-step embeddings are combined and fed into a transformer decoder with cross-attention, which produces the motion sequence.
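
To make the conditioning path concrete, here is a minimal pure-Python sketch. It rests on several assumptions that the text above does not state: the three embeddings are "combined" by stacking them as three memory tokens, the step embedding is sinusoidal, and the attention is a single head with no learned projections. The feature vectors are toy stand-ins, not real DistilBERT or DINOv3 outputs.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of denoising step t (assumes dim is even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Each query token attends over all memory tokens (scaled dot product)."""
    dim = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[k] for w, m in zip(weights, memory)) for k in range(dim)])
    return out


dim = 4
text_feat = [0.1, 0.2, 0.3, 0.4]   # stand-in for a projected text feature
image_feat = [0.4, 0.3, 0.2, 0.1]  # stand-in for a projected image feature
# Assumption: conditions are stacked as three memory tokens.
memory = [text_feat, image_feat, timestep_embedding(10, dim)]
noisy_motion = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]  # two motion frames
attended = cross_attention(noisy_motion, memory)
```

Each output row is a weighted average of the condition tokens, which is the mechanism by which the decoder's motion tokens read text, image, and step information.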

Reconstruction backbone

Reconstruction method
Fig. 03 · Reconstruction
Overview of motion reconstruction. We leverage off-the-shelf 3D quadruped reconstruction and 3D tracking methods to infer articulation and global translation separately, then combine them to produce the final per-frame 3D motion that populates AiM3D.
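
The combination step can be illustrated with a small sketch. It assumes, purely for illustration, that articulation is available as root-relative 3D joint positions per frame and that tracking yields one root translation per frame. The actual representation (for example SMAL pose parameters) may differ, and `compose_motion` is a hypothetical helper name.

```python
def compose_motion(local_joints, root_translation):
    """Place root-relative joints in world space, frame by frame.

    local_joints: list of frames, each a list of (x, y, z) joints
                  expressed relative to the animal's root.
    root_translation: list of (x, y, z) root positions, one per frame.
    """
    if len(local_joints) != len(root_translation):
        raise ValueError("articulation and translation must cover the same frames")
    world = []
    for joints, (tx, ty, tz) in zip(local_joints, root_translation):
        world.append([(x + tx, y + ty, z + tz) for (x, y, z) in joints])
    return world


# Two frames, two joints; the root moves forward between frames.
local = [
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.5, 0.25, 0.0)],
]
root = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
world = compose_motion(local, root)
```

Keeping articulation and translation separate until this final step mirrors the description above, where the two are inferred by different off-the-shelf methods.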
03 Data

AiM3D — scraped from reality.

A large-scale animal motion dataset with aligned video, text, and 3D motion. Reconstructions cover diverse species, behaviours, and environments far beyond what controlled capture can offer.

AiM3D data examples — captions, video frames and reconstructed motion
Fig. 04 · AiM3D samples
Examples from AiM3D. Each row shows a real-world video sequence aligned with two VLM-generated captions describing the animal's motion, and the corresponding reconstructed 3D motion (visualized as keypoints). Reconstructions span species, gait, and behaviour — rhinos grazing, foxes sniffing and resting, rabbits hopping, horses trotting under saddle.
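
One way to picture an aligned video–text–motion sample is as a small record type. The field names below are hypothetical and do not reflect the released dataset format, which has not been published yet. The sketch only encodes what the caption describes: a video, its captions, and per-frame 3D keypoints.

```python
from dataclasses import dataclass
from typing import List, Tuple

Keypoint = Tuple[float, float, float]


@dataclass
class MotionSample:
    video_path: str                  # hypothetical field: source video clip
    captions: List[str]              # VLM-generated motion descriptions
    keypoints: List[List[Keypoint]]  # frames x joints, 3D positions

    @property
    def num_frames(self) -> int:
        return len(self.keypoints)

    def validate(self) -> None:
        """Check that the sample is internally consistent."""
        if not self.captions:
            raise ValueError("sample needs at least one caption")
        if not self.keypoints:
            raise ValueError("sample needs at least one motion frame")
        joints_per_frame = {len(frame) for frame in self.keypoints}
        if len(joints_per_frame) != 1:
            raise ValueError("all frames must have the same number of joints")


sample = MotionSample(
    video_path="clip.mp4",  # placeholder path
    captions=["A horse trots forward.", "The horse moves at a steady trot."],
    keypoints=[
        [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
        [(0.1, 0.0, 0.0), (0.6, 0.0, 0.0)],
    ],
)
sample.validate()
```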

Global trajectories

Global translation preserved across the sequence
04 Comparisons

Beating the baselines, in motion.

Side-by-side qualitative comparisons of generated mesh and skeletal animation against the strongest published baseline.

Generated mesh animation

Cmp.01 · "A moose walks straight ahead." Panels: reference input (moose), Puppeteer baseline, Kirin (ours).
Cmp.02 · "A horse trots forward." Panels: reference input (horse), Puppeteer baseline, Kirin (ours).

Generated skeletal motion

Skeletal motion comparison vs baseline
Fig. 05 · Skeleton
Generated skeletal motion comparison. Sequences from our model are shown against the strongest published baseline at matched frames. Kirin produces gait cycles with cleaner foot contact and a more plausible global trajectory.

Animated mesh — gallery comparison

Mesh animation comparison against Puppeteer baseline
Fig. 06 · Mesh gallery
Comparison against Puppeteer*. A reproduction of Puppeteer's setup (no public code at the time of writing) generates static T-posed meshes that do not animate the prompted action. Kirin produces full motion sequences faithful to the input image and prompt. * Reproduced from the paper.
05 Results

Quantitative evaluation.

Across motion-generation metrics and 4D animation benchmarks, Kirin outperforms or matches prior methods.

Tab. 01 · Text-conditioned animal motion generation
Higher R-Precision and Diversity are better; lower FID and MM-Dist are better.

| Method | R-Prec. Top-1 | R-Prec. Top-3 | FID | MM-Dist | Diversity |
|---|---|---|---|---|---|
| Real motion | 0.512±.004 | 0.823±.003 | 0.002±.001 | 2.974±.008 | 9.503±.065 |
| MDM | 0.298±.005 | 0.612±.004 | 1.842±.040 | 4.918±.011 | 8.612±.078 |
| MotionDiffuse | 0.321±.004 | 0.658±.005 | 1.514±.034 | 4.671±.012 | 8.804±.080 |
| T2M-GPT | 0.354 | 0.701 | 1.288±.030 | 4.402 | 9.014 |
| MoMask | 0.346±.005 | 0.689±.004 | 1.196 | 4.477±.010 | 8.961±.071 |
| Kirin (text-only) | 0.392±.004 | 0.748±.003 | 0.974±.026 | 4.142±.009 | 9.182±.069 |

Even without image conditioning, Kirin's architecture trained on AiM3D outperforms the strongest text-to-motion baselines across all five standard metrics. Numbers are reported on the AiM3D test split with 95% confidence intervals.
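
For readers unfamiliar with the retrieval-style metrics, the sketch below shows how R-Precision and MM-Dist are commonly computed in text-to-motion evaluation. It assumes the usual protocol: motions and texts are embedded in a shared feature space, and each motion's ground-truth text is ranked among a pool of candidates by Euclidean distance. The evaluation code behind the table has not been released, so this is an illustration rather than the exact procedure. The embeddings are toy 2D values.

```python
import math


def r_precision(motion_emb, text_emb, k):
    """Fraction of motions whose matching text (same index) ranks in the top k."""
    hits = 0
    for i, m in enumerate(motion_emb):
        dists = [math.dist(m, t) for t in text_emb]
        ranked = sorted(range(len(text_emb)), key=lambda j: dists[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(motion_emb)


def mm_dist(motion_emb, text_emb):
    """Mean distance between each motion and its own text (lower is better)."""
    pairs = list(zip(motion_emb, text_emb))
    return sum(math.dist(m, t) for m, t in pairs) / len(pairs)


# Toy pool of three motion/text pairs; only the first pair is well aligned.
motions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
texts = [(0.1, 0.0), (0.0, 0.9), (0.9, 0.0)]

top1 = r_precision(motions, texts, k=1)
top3 = r_precision(motions, texts, k=3)
avg_dist = mm_dist(motions, texts)
```

In this toy pool only the first motion retrieves its own text at rank 1, so Top-1 is one third, while Top-3 is trivially 1.0 because the pool has three candidates.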

Tab. 02 · Image + text-conditioned animal motion generation
Adding image conditioning further improves identity preservation and motion plausibility.

| Method | R-Prec. Top-1 | R-Prec. Top-3 | FID | MM-Dist | Image-CLIP |
|---|---|---|---|---|---|
| Real motion | 0.512±.004 | 0.823±.003 | 0.002±.001 | 2.974±.008 | 0.298±.003 |
| MDM + img | 0.331±.005 | 0.671±.004 | 1.487±.038 | 4.604±.011 | 0.221±.003 |
| MoMask + img | 0.378 | 0.722 | 1.041 | 4.291 | 0.247 |
| Kirin (full) | 0.421±.004 | 0.776±.003 | 0.812±.022 | 3.978±.010 | 0.281±.003 |

Conditioning on the input image alongside the text prompt — through a frozen DINOv3 encoder fused with text and time embeddings — yields the largest gains on FID and Image-CLIP, indicating that motion better matches the appearance and identity of the depicted animal.
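
The size of the gain from image conditioning can be checked directly from the reported means. The sketch below compares the Kirin (text-only) row of Tab. 01 with the Kirin (full) row of Tab. 02 on the four metrics the tables share. It assumes both rows were measured on the same test split. Image-CLIP is left out because Tab. 01 does not report it.

```python
# Mean values copied from the Kirin rows of Tab. 01 and Tab. 02.
text_only = {"R-Prec. Top-1": 0.392, "R-Prec. Top-3": 0.748, "FID": 0.974, "MM-Dist": 4.142}
full = {"R-Prec. Top-1": 0.421, "R-Prec. Top-3": 0.776, "FID": 0.812, "MM-Dist": 3.978}
lower_is_better = {"FID", "MM-Dist"}


def relative_gain(metric):
    """Improvement of the full model over text-only, as a fraction of the text-only value."""
    before, after = text_only[metric], full[metric]
    change = (before - after) if metric in lower_is_better else (after - before)
    return change / before


gains = {metric: relative_gain(metric) for metric in text_only}
for metric, gain in gains.items():
    print(f"{metric}: {100 * gain:.1f}%")
```

Under these assumptions the relative improvements are about 7.4% (Top-1), 3.7% (Top-3), 16.6% (FID), and 4.0% (MM-Dist), so FID shows the largest gain among the shared metrics.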

Tab. 03 · 4D animation from a single image
Mesh-level evaluation against published 4D animation methods.

| Method | CLIP-T | CLIP-I | Temporal Cons. | User Pref. |
|---|---|---|---|---|
| Animate124 | 0.218±.004 | 0.704±.005 | 0.812±.006 | 11.4 % |
| 4D-fy | 0.231±.004 | 0.728±.005 | 0.847±.005 | 14.8 % |
| Puppeteer* | 0.262 | 0.781 | 0.881 | 21.7 % |
| Kirin (ours) | 0.291±.003 | 0.812±.004 | 0.924±.004 | 52.1 % |

User preference comes from a study in which 38 participants compared 24 prompt × image pairs from a held-out evaluation set and picked the most realistic animation for each. Kirin was preferred in a majority of comparisons (52.1 %). * Reproduced; no public code was available at submission.

06 Cite

If Kirin helps your work, please cite.

Bibliographic information will be finalised on release. A placeholder entry is shown below.

BibTeX · placeholder
@misc{kirin2026,
  title  = {Kirin: Animal Motion Generation from In-the-Wild Video},
  author = {Anonymous},
  year   = {2026},
  note   = {Preprint, under review}
}