Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a "Chicken" style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution, low-quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability.
Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing.
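To make the low-rank shift concrete, the following PyTorch sketch shows a LoRA wrapper around a frozen linear projection of the kind found inside attention blocks; only the low-rank delta B·A is trained, so the pre-trained prior itself is left untouched. The class name, rank, and scaling below are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank delta (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # the generative prior stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no shift at start
        self.scale = alpha / rank

    def forward(self, x):
        # W x + (alpha / r) * B A x -- the low-rank term carries the style shift
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())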
We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
Our goal is to generate high-quality human motions in a specified style while preserving the diversity and generalization ability of a pre-trained motion diffusion model (MDM). Conditioning motion generation on style-specific data often leads to degradation in motion quality and generalization due to the limited diversity of style datasets. On the other hand, fine-tuning the whole model on the style data, which includes only a handful of motions per style, may lead to overfitting and forgetting of underrepresented actions.
Inspired by the personalization literature, we train low-rank adaptations (LoRA) for the attention layers of MDM to shift it toward a given style (e.g., "Chicken") while preserving its structure. The style is represented by a handful of reference motions and bound to a special text token, marked <·>. Style and prior-preservation loss terms, both based on the original simple diffusion loss, learn the style adaptation and preserve the manifold structure, respectively.
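The sketch below illustrates one possible form of this objective, assuming (as in MDM's simple loss) that the denoiser predicts the clean motion. The helper names (denoiser, q_sample, style_batch, prior_batch) and the weighting lambda_prior are assumptions for illustration, not the released training code.

import torch
import torch.nn.functional as F

def training_step(denoiser, q_sample, style_batch, prior_batch,
                  style_prompt, prior_prompt, lambda_prior=1.0):
    # Style term: reconstruct the reference motions conditioned on the style-token prompt.
    t_s = torch.randint(0, 1000, (style_batch.shape[0],), device=style_batch.device)
    x_t_s = q_sample(style_batch, t_s)                       # forward-diffuse to step t_s
    loss_style = F.mse_loss(denoiser(x_t_s, t_s, style_prompt), style_batch)

    # Prior-preservation term: reconstruct motions sampled from the frozen prior,
    # so the rest of the distribution stays in place while the LoRA weights adapt.
    t_p = torch.randint(0, 1000, (prior_batch.shape[0],), device=prior_batch.device)
    x_t_p = q_sample(prior_batch, t_p)
    loss_prior = F.mse_loss(denoiser(x_t_p, t_p, prior_prompt), prior_batch)

    return loss_style + lambda_prior * loss_prior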
One of the most important properties of in-domain generation is editability: although disentangled editing is generally challenging, within well-covered regions of its distribution a generative model can modify specific semantic aspects of a motion while preserving others. For regions the model is not familiar with, this is a nearly impossible task.
To demonstrate in-distribution generation, we integrate LoRA-MDM with an off-the-shelf motion editing technique [Shafir et al.], which showed that new controls can be added to the MDM backbone through simple self-supervised fine-tuning. Note how the body realistically shifts sideways and the hands rise correctly in an asymmetric manner to incorporate turning while jumping on one leg.
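For context, a common way to edit with a diffusion model is inpainting-style sampling, where the frames to be kept are overwritten with their noised ground truth at every denoising step and the model fills in the rest. The generic sketch below illustrates this idea; it is not necessarily the exact procedure of Shafir et al., and the scheduler interface and names are assumptions.

import torch

@torch.no_grad()
def edit_motion(model, scheduler, observed, mask, prompt, steps=50):
    # observed: [T, D] reference motion; mask: [T, 1], 1 where frames are kept fixed.
    x = torch.randn_like(observed)
    for t in scheduler.timesteps(steps):
        x0_pred = model(x, t, prompt)              # the (LoRA-adapted) model predicts the clean motion
        x = scheduler.step(x0_pred, t, x)          # one reverse-diffusion step
        x_obs = scheduler.add_noise(observed, t)   # noise the observed frames to level t
        x = mask * x_obs + (1.0 - mask) * x        # keep observed frames, synthesize the rest
    return x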
Each style is assigned a unique token, allowing multiple styles to be incorporated within a single set of LoRA weights by training on them simultaneously. By embedding multiple style tokens within the same LoRA module, we enable style mixing, where conditioning on multiple tokens blends their stylistic attributes. Although each style is learned independently, combining tokens within a prompt results in a coherent fusion of styles.
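At inference, style mixing then amounts to naming more than one token in the prompt. The snippet below is an illustrative sketch; the sample() call, model handle, and token strings are assumptions rather than the released API.

# Single style vs. blended styles, conditioned through the same LoRA-adapted model.
single_style = "a person walks forward in <chicken> style."
mixed_style = "a person walks forward in <chicken> and <old> style."

motion_a = sample(model_with_lora, prompt=single_style)
motion_b = sample(model_with_lora, prompt=mixed_style)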
Our first baseline, called style in prompt, or prompting, is the base MDM model without LoRA training, which receives the style description directly in the text prompt (e.g., "A person [...] in chicken style"). We introduce this baseline to assess whether the model can naturally interpret stylistic descriptors without explicit adaptation.
Additionally, we compare our method to two state-of-the-art approaches for stylized motion generation: SMooDi and MoMo. SMooDi leverages ControlNet along with classifier-based and classifier-free guidance to condition the generation process on a reference style motion. MoMo, on the other hand, is a training-free method that utilizes the attention mechanism of MDM to semantically fuse reference "follower" and "leader" motions into one motion. Each of these can be specified either by a text prompt or by a reference motion. The authors demonstrate stylized motion generation by injecting the style reference as the "follower" and the content as the "leader". We follow this framework and perform stylized motion generation using the style motion from 100STYLE as the "follower", and the HumanML3D text prompt as the "leader".
@misc{sawdayee2025chicken,
  title={Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion},
  author={Haim Sawdayee and Chuan Guo and Guy Tevet and Bing Zhou and Jian Wang and Amit H. Bermano},
  year={2025},
  eprint={2503.19557},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.19557},
}