DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 (2024)


Fan Zhang, Siyuan Zhao, Naye Ji, Zhaohan Wang, Jingmei Wu, Fuxing Gao, Zhenqing Ye, Leyao Yan, Lanxin Dai, Weidong Geng, Xin Lyu, Bozuo Zhao, Dingguo Yu, Hui Du, Bin Hu (✉)

Fan Zhang, Naye Ji, Fuxing Gao, Zhenqing Ye, Leyao Yan, Lanxin Dai, Dingguo Yu, and Hui Du are with the School of Media Engineering, Communication University of Zhejiang, China (e-mail: fanzhang@cuz.edu.cn; jinaye@cuz.edu.cn; fuxing@cuz.edu.cn; zhenqingye@stu.cuz.edu.cn; leyaoyan@stu.cuz.edu.cn; lanxindai@stu.cuz.edu.cn; yudg@cuz.edu.cn; duhui@cuz.edu.cn). Jingmei Wu is with the School of Broadcast Announcing Arts, Communication University of Zhejiang, China (e-mail: 20190095@cuz.edu.cn). Siyuan Zhao and Bin Hu are with the Faculty of Humanities and Arts, Macau University of Science and Technology, Macau, China (e-mail: 2109853jai30001@student.must.edu.mo; binhu@must.edu.mo). Zhaohan Wang and Xin Lyu are with the School of Animation and Digital Arts, Communication University of China, Beijing, China (e-mail: 2022201305j6018@cuc.edu.cn; lvxinlx@cuc.edu.cn). Weidong Geng is with the College of Computer Science and Technology, Zhejiang University, and the Research Center for Artificial Intelligence and Fine Arts, Zhejiang Lab, Zhejiang, China (e-mail: gengwd@zju.edu.cn). Bozuo Zhao is with the Changjiang Academy of Art and Design, Shantou University, China (e-mail: bzzhao@stu.edu.cn).

We would like to thank Jiayang Zhu, Weifan Zhong, Huaizhen Chen, and Qiuyi Shen from the College of Media Engineering at the Communication University of Zhejiang for their invaluable technical support in recording the dataset. Additionally, we extend our thanks to Xiaomeng Ma, Yuye Wang, Yanjie Cai, Xiaoran Chen, Jinyan Xiao, Jialing Ma, Zicheng He, Shuyang Fang, Shuyu Fang, Shixue Sun, Shufan Ma, Sen Xu, Jiabao Zeng, Yue Xu, and Senhua He from the School of Broadcast Arts and School of International Communication & Education at Communication University of Zhejiang for their contribution as professional TV broadcasters. This work was partially supported by the Pioneer and Leading Goose R&D Program of Zhejiang (No. 2023C01212), the Public Welfare Technology Application Research Project of Zhejiang (No. LGF21F020002, No. LGF22F020008), and the National Key Research and Development Program of China (No. 2022YFF0902305). Code and dataset can be accessed at https://github.com/zf223669/DiMGestures.

Abstract

Speech-driven gesture generation using transformer-based generative models represents a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexities, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on Mamba-2. The fuzzy feature extractor, integrating a Chinese pre-trained model with Mamba-2, autonomously extracts implicit, continuous speech features. These features are synthesized into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism to uniformly apply transformations across all sequence tokens, enabling precise modeling of the nuanced interplay between speech features and gesture dynamics. We utilize a diffusion model to train and infer diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures dataset corroborate the efficacy of our proposed model. Compared with a Transformer-based architecture, the assessments reveal that our approach delivers competitive results while reducing memory usage by approximately 2.4 times and enhancing inference speeds by 2 to 4 times. Additionally, we release the CCG dataset, a Chinese Co-Speech Gestures dataset comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.

Index Terms:

Speech-driven, Gesture synthesis, Gesture generation, AdaLN, Diffusion, Mamba.

I Introduction

[Figure 1]

Recent advancements in 3D virtual human technology have broadened its applications across sectors like animation, human-computer interaction, and digital hosting. A key focus is generating realistic, personalized co-speech gestures, now made feasible through deep learning. Speech-driven gesture generation offers a cost-efficient, automated alternative to traditional motion capture, reducing manual effort while enhancing the realism and adaptability of virtual avatars for diverse professional and recreational uses.

Achieving gesture-speech synchronization with naturalness remains challenging in speech-driven gesture generation. Transformer and Diffusion-based models have improved efficiency and flexibility, leading to innovations like Diffuse Style Gesture[1], Diffuse Style Gesture+[2], GestureDiffuClip[3], and LDA[4]. Notably, Persona-Gestor[5], using a Diffusion Transformer (DiT)[6] architecture, achieves state-of-the-art results by effectively modeling the speech-gesture relationship. However, its transformer-based design imposes high memory usage and slower inference speeds, underscoring the need for more efficient solutions for real-time applications.

The Mamba architecture [7] addresses the quadratic complexity of traditional transformers and has been validated across domains such as vision [8, 9, 10], segmentation [11], and image tasks [12, 13]. The improved Mamba-2 [14] confirms Mamba's theoretical equivalence to transformers via State Space Duality (SSD) while reducing complexity to linear. This advancement enables faster, resource-efficient processing, making Mamba a compelling alternative for tasks like speech-driven gesture generation, delivering performance comparable to transformers at reduced computational cost.

Training datasets in this field, including Trinity [15], ZEGGS [16], BEAT [17], and Hands 16.2M [18], largely focus on English content. Although BEAT offers 12 hours of Chinese speech data, it primarily features spontaneous speech, which limits its suitability for formal contexts such as TV broadcasting or structured dialogues.

In this study, we present DiM-Gestor, an innovative model leveraging Mamba-2 and diffusion-based architectures to synthesize personalized gestures. The framework utilizes a Mamba-2 fuzzy feature extractor to autonomously capture nuanced fuzzy features from raw speech audio. DiM-Gestor integrates an AdaLN Mamba-2 module within its diffusion-based architecture, effectively modeling the intricate dynamics between speech and gestures. Inspired by DiT [6] and PG [5], the inclusion of AdaLN significantly enhances the model's ability to accurately capture and reproduce the complex interplay between speech and gestures. Compared with the AdaLN transformer, the AdaLN Mamba-2 achieves competitive performance while substantially optimizing resource efficiency, reducing memory usage by approximately 2.4 times and improving inference speeds by a factor of 2 to 4.

Further, we released the CCG dataset, comprising 15.97 hours of 3D full-body skeleton gesture motion, encompassing six styles across five scenarios performed by professional Chinese TV broadcasters. This dataset provides high-quality, structured data, facilitating advanced research in Chinese speech-driven gesture generation, particularly for applications requiring formal and contextually appropriate non-verbal communication, as shown in Figure 1.

For clarity, our contributions are summarized as follows:

  • We introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture: Our approach achieves competitive performance while significantly optimizing resource efficiency, reducing memory usage by approximately 2.4 times and enhancing inference speeds by a factor of 2 to 4.

  • We released the CCG dataset, a Chinese Co-Speech Gestures dataset: This comprehensive dataset, captured using inertial motion capture technology, comprises 15.97 hours of 3D full-body skeleton gesture motion. It includes six distinct styles across five scenarios, performed by professional Chinese TV broadcasters, offering high-quality, structured data for advancing research in Chinese speech-driven gesture synthesis.

  • Extensive subjective and objective evaluations: These evaluations demonstrate that our model surpasses current state-of-the-art methods, highlighting its exceptional capability to generate credible, speech-appropriate, and personalized gestures while achieving reduced memory consumption and faster inference times.

II RELATED WORK

This section briefly overviews transformer- and diffusion-based generative models for speech-driven gesture generation.

II-A Transformer- and diffusion-based generative models

DiffMotion [19] represents a pioneering application of diffusion models in gesture synthesis, incorporating an LSTM to enhance gesture diversity. Cross-modal Quantization (CMQ) [20] jointly learns and encodes quantized representations of speech and gesture. However, these models support only the upper body. Alexanderson et al. [4] refined DiffWave by replacing its dilated convolutions, thereby unlocking the potential of transformer architectures for gesture generation. GestureDiffuCLIP (GDC) [3] employs transformers and AdaIN layers to integrate style guidance directly into the diffusion process. Similarly, DiffuseStyleGesture (DSG) [1] and its extension DSG+ [2] utilize cross-local attention and layer normalization within transformer models. While these methods have demonstrated significant progress, they often struggle to balance gesture and speech synchronization, leading to gestures that can appear either overly subtle or excessively synchronized with speech.

Persona-Gestor (PG) [5] addresses some of these challenges by introducing a fuzzy feature extractor. This approach uses 1D convolution to capture global features from raw speech audio, paired with an Adaptive Layer Normalization (AdaLN) transformer [6] to model the nuanced correlation between speech features and gesture sequences. Although Persona-Gestor achieves high-quality motion outputs, it is hindered by substantial memory requirements and slower inference speeds associated with convolutional and transformer-based architectures.

II-B Co-speech gesture training datasets

The datasets commonly employed for training co-speech gesture models include Trinity [15, 21], ZEGGS [16], BEAT [17], and Hands 16.2M [18]. These datasets predominantly feature native English speakers engaged in spontaneous conversational speech. Although the BEAT dataset includes 12 hours of Chinese content, this subset is characterized by unstructured speech patterns, making it less suitable for applications requiring formal contexts, such as event broadcasting or structured dialogues. Yoon et al. [22] present a dataset of skeleton motions extracted from TED talk videos, but its quality is insufficient for high-fidelity gesture synthesis.

To address these limitations, we adopt the Mamba-2 architecture [14], further adapting it with an AdaLN-based implementation. The Mamba-2 framework significantly reduces memory usage and improves inference speed, offering a more efficient solution for gesture synthesis in virtual human interactions. In addition, we introduce the Chinese Co-speech Gestures (CCG) dataset, comprising 15.97 hours of 3D full-body skeleton motion across six gesture styles and five scenarios. The dataset features high-quality performances by professional Chinese TV broadcasters. This dataset enables advanced research in speech-driven gesture generation, particularly in formal settings such as event hosting and professional presentations.

III Problem Formulation

We conceptualize co-speech gesture generation as a sequence-to-sequence translation task, where the goal is to map a sequence of speech audio features, $X=[x_{t}]_{t=1}^{T}\in\mathbb{R}^{T}$, to a corresponding sequence of full-body gesture features, $Y^{0}=[y^{0}_{t}]_{t=1}^{T}\in\mathbb{R}^{T\times(D+6)}$. Each gesture frame $y^{0}_{t}\in\mathbb{R}^{D+6}$ comprises 3D joint angles as well as root positional and rotational velocities, with $D$ representing the number of joint channels and $T$ denoting the sequence length. This formulation encapsulates the temporal and spatial complexity of mapping audio-driven dynamics to expressive human-like gestures.

We define the probability density function (PDF) $p_{\theta}(\cdot)$ to approximate the true gesture data distribution $p(\cdot)$, enabling efficient sampling of gestures. The objective is to generate a non-autoregressive full-body gesture sequence $Y^{0}$ from its conditional probability distribution, given the audio sequence $X$ as a covariate:

$$Y^{0}\sim p_{\theta}(Y^{0}\mid X)\approx p(Y^{0}\mid X) \qquad (1)$$

This formulation leverages a denoising diffusion probabilistic model trained to approximate the conditional distribution of gestures aligned with speech. This approach provides a robust framework for learning and synthesizing co-speech gestures with high fidelity and temporal coherence by modeling the intricate relationship between audio inputs and gesture outputs.

IV System Overview

DiM-Gestor, an end-to-end Mamba- and diffusion-based architecture, processes raw speech audio as the sole input to synthesize personalized gestures. The model balances naturalness with low resource consumption, as shown in Figure 2.

[Figure 2: Overview of the DiM-Gestor architecture]

IV-A Model Architecture

The architecture of DiM-Gestor, depicted in Figure 2, integrates four key components to efficiently generate personalized gestures directly from speech audio: (a) Mamba-2 Fuzzy Feature Extractor: This module employs a fuzzy inference strategy utilizing Mamba-2 to autonomously capture nuanced stylistic and contextual elements from raw speech audio. (b) Stack of AdaLN Mamba-2 Blocks: These blocks introduce an AdaLN Mamba-2 architecture that applies uniform transformations across all tokens, enabling the model to effectively capture the intricate interplay between speech and gestures while enhancing computational efficiency. (c) Gesture Encoder and Decoder: These modules encode gesture sequences into latent representations and decode them back into full-body motion outputs, ensuring accurate reconstruction of gesture dynamics. (d) Denoising Diffusion Probabilistic Model (DDPM): As the backbone for probabilistic generation, this module leverages diffusion processes to synthesize diverse and realistic gesture sequences aligned with the given speech context.

By combining these components into a unified framework, the DiM-Gestor architecture captures the complexity of human gestures in relation to speech while significantly reducing memory consumption and improving inference speed. This design ensures high-quality, personalized, and contextually appropriate gesture generation.

IV-A1 Mamba-2 Fuzzy Feature Extractor

This module employs a fuzzy inference strategy, which does not rely on explicit, manual classification inputs. Instead, it provides implicit, continuous, and fuzzy feature information, automatically learning and inferring the global style and specific details directly from raw speech audio. As illustrated in Figure 2, this module is a dual-component extractor comprising a global and a local extractor. The local extractor utilizes the Chinese HuBERT speech pre-trained model [23] to process the audio sequence into discrete tokens. This pre-trained model, chosen for its proficiency in capturing the complex attributes of speech audio, effectively represents universal Chinese speech latent features, denoted as $Z_{x}$.

We implement a Mamba-2 [14] global style extractor framework. In the Mamba architecture, the module scans the entire sequence $Z_{x}$ to capture the style feature. The last output token, $z_{s}$, is considered crucial as it encompasses the global style feature contained within the speech audio. This feature is then broadcast to align with the local fuzzy features, ensuring that the global context influences the local gesture synthesis. This process allows the model to maintain a holistic understanding of the style and emotional context throughout the gesture generation process. The unified latent representation is then channeled to the downsampling module for further refinement.

The downsampling module, crucial for aligning each latent representation with its corresponding sequence of encoded gestures, is seamlessly integrated into the condition extractor. We implement a Conv1D layer with a kernel size of 201 within our architecture [5]. This wide kernel reflects that a gesture is influenced not only by the current semantics but also by the preceding and subsequent semantic context [24]. The kernel size thus plays a crucial role in capturing the temporal dynamics across sequences, allowing for a more coherent and contextually integrated gesture generation that aligns with the natural flow of speech and its semantic shifts. The output of this module, $C=[c_{t}]_{t=1}^{T}$, serves as a unified latent representation that encapsulates both encoded acoustic features and the diffusion time step $n$, ensuring a coherent and accurate gesture generation process.
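A minimal PyTorch sketch of this dual-branch extractor is given below. It assumes the Hugging Face HubertModel interface for a Chinese HuBERT checkpoint (the checkpoint name is illustrative) and the Mamba2 block from the mamba_ssm package; the fusion of the diffusion time step into $C$ and other implementation details are omitted, so this is a sketch of the idea rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import HubertModel   # Chinese HuBERT checkpoint (name below is illustrative)
from mamba_ssm import Mamba2           # assumed Mamba-2 block from the mamba_ssm package


class FuzzyFeatureExtractor(nn.Module):
    """Local HuBERT features plus a global Mamba-2 style token, fused and aligned to gesture frames."""

    def __init__(self, hubert_name="TencentGameMate/chinese-hubert-base", d_model=1280):
        super().__init__()
        self.hubert = HubertModel.from_pretrained(hubert_name)      # frame-level speech features Z_x
        self.proj = nn.Linear(self.hubert.config.hidden_size, d_model)
        self.global_mamba = Mamba2(d_model=d_model, d_state=256, d_conv=4, expand=2)
        # Wide kernel (201) so each aligned frame is conditioned on past and future speech context.
        self.align = nn.Conv1d(2 * d_model, d_model, kernel_size=201, padding=100)

    def forward(self, audio, n_gesture_frames):
        z_x = self.proj(self.hubert(audio).last_hidden_state)        # (B, T_audio, d_model)
        z_s = self.global_mamba(z_x)[:, -1:, :]                      # last token = global style feature
        z = torch.cat([z_x, z_s.expand_as(z_x)], dim=-1)             # broadcast the style over local features
        z = F.interpolate(z.transpose(1, 2), size=n_gesture_frames)  # match the gesture frame rate
        return self.align(z).transpose(1, 2)                         # condition C: (B, T_gesture, d_model)
```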

IV-A2 AdaLN Mamba-2

The fundamental purpose of AdaLN is to incorporate a conditioning mechanism [6, 5] that uniformly applies a specific function across all tokens. It offers a more sophisticated and nuanced approach to modeling, enabling the system to capture and articulate the complex dynamics between various input conditions and their corresponding outputs. Consequently, this improves the model's predictive accuracy and its ability to generate outputs that are better aligned with the given conditions.

Diffusion Transformers (DiTs) exemplify a sophisticated advancement in diffusion model architectures, incorporating an AdaLN-infused transformer framework primarily for text-to-image synthesis. The utility of DiTs has recently expanded to include text-conditional video generation, illustrating their versatility. Furthermore, DiTs have shown potential in co-speech gesture generation [5], marking a significant step in applying these models to sequence-based tasks. However, the inherent quadratic space complexity associated with Transformers results in substantial memory consumption and slower inference speeds.

The AdaLN architecture regresses the dimension-wise scale and shift parameters ($\gamma\in\mathbb{R}^{T\times D}$ and $\beta\in\mathbb{R}^{T\times D}$) from the fuzzy feature extractor output $C$, rather than learning these parameters directly, as depicted in Figure 2 and Algorithm 1. In each AdaLN Mamba-2 stack, a latent feature $z^{n}_{1:T,m}$ is generated, combining condition information and gesture features using AdaLN and the Mamba-2 architecture. The index $m$ ranges from 1 to $M$, where $M$ is the total number of AdaLN Mamba-2 stacks. Furthermore, as illustrated in Figure 2, the final layer uses the same fuzzy features, supplemented by a scale-and-shift operation, to fine-tune the gesture synthesis.

Algorithm 1: AdaLN Mamba-2 stacks

Input: Encoded gestures $G_{1:T}$ and conditional features $C_{1:T}$

for $m = 0$ to $M-1$ do
    $\gamma_{1},\beta_{1},\alpha_{1},\gamma_{2},\beta_{2},\alpha_{2} = \mathrm{MLP}(C_{1:T}).\mathrm{chunk}(6)$
    $X_{1:T} = X_{1:T} + \alpha_{1}\cdot\mathrm{Mamba2}\big(\mathrm{LN}(G_{1:T})\cdot(1+\gamma_{1})+\beta_{1}\big)$
    $X^{m+1}_{1:T} = X_{1:T} + \alpha_{2}\cdot\mathrm{MLP}\big(\mathrm{LN}(X^{m}_{1:T})\cdot(1+\gamma_{2})+\beta_{2}\big)$
end for

$\gamma_{3},\beta_{3} = \mathrm{MLP}(C_{1:T}).\mathrm{chunk}(2)$
$Z_{1:T} = \mathrm{LN}(X_{1:T})\cdot(1+\gamma_{3})+\beta_{3}$

Return: $Z_{1:T}$
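A hedged PyTorch rendering of Algorithm 1 is sketched below. It assumes the Mamba2 module from the mamba_ssm package and condition features $C$ of the same width as the hidden state; the MLP sizes and initialization are illustrative and this is not the authors' implementation.

```python
import torch.nn as nn
from mamba_ssm import Mamba2            # assumed Mamba-2 block from the mamba_ssm package


def modulate(x, shift, scale):
    """AdaLN modulation from Algorithm 1: x * (1 + gamma) + beta, applied per token."""
    return x * (1 + scale) + shift


class AdaLNMamba2Block(nn.Module):
    """One stack of Algorithm 1: AdaLN-conditioned Mamba-2 mixing followed by a gated MLP."""

    def __init__(self, d_model, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mixer = Mamba2(d_model=d_model, d_state=256, d_conv=4, expand=2)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_ratio * d_model), nn.GELU(),
                                 nn.Linear(mlp_ratio * d_model, d_model))
        # Regress (gamma_1, beta_1, alpha_1, gamma_2, beta_2, alpha_2) from the condition C.
        self.ada_mlp = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 6 * d_model))

    def forward(self, x, c):
        g1, b1, a1, g2, b2, a2 = self.ada_mlp(c).chunk(6, dim=-1)
        x = x + a1 * self.mixer(modulate(self.norm1(x), b1, g1))   # Mamba-2 branch
        x = x + a2 * self.mlp(modulate(self.norm2(x), b2, g2))     # MLP branch
        return x


class FinalLayer(nn.Module):
    """Final scale/shift (gamma_3, beta_3) of Algorithm 1, applied before the gesture decoder."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.ada_mlp = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 2 * d_model))

    def forward(self, x, c):
        g3, b3 = self.ada_mlp(c).chunk(2, dim=-1)
        return modulate(self.norm(x), b3, g3)
```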

Mamba-2[14], an evolution within the Structured State Space Duality (SSD) framework, refines Mamba’s selective state-space model (SSM), markedly enhancing computational efficiency. This architecture is engineered to supplant the traditional attention mechanism in Transformers, specifically utilizing structured semiseparable matrices to optimize facets such as training speed and memory consumption. Its robust performance on modern hardware, versatility across varying sequence lengths, and the implementation of tensor parallelism firmly establish Mamba-2 as a potent alternative to traditional attention-based models, offering substantial efficiency gains and reduced operational costs.

The dual form of Structured State Space Duality (SSD) is characterized by a quadratic computation closely related to the attention mechanism. It can be defined as follows:

$$\big(L\circ(QK^{T})\big)\cdot V, \qquad L_{ij}=\begin{cases} a_{i}\times\cdots\times a_{j+1} & \text{if } i\geq j\\ 0 & \text{if } i<j\end{cases} \qquad (2)$$

where the $a_{i}$ are input-dependent scalars bounded within the interval $[0,1]$. These scalars represent the normalized attention weights in the structured state-space model, modulating the influence of each input token on the computed output. For indexing of $a$, $i{:}j$ refers to the range $(i,i+1,\dots,j-1)$ when $i<j$ and $(i,i-1,\dots,j+1)$ when $i>j$.

Let the Structured Masked Attention (SMA) be $M=L\circ(QK^{T})$. We choose a 1-semiseparable structured mask $L$ to construct the 1-SS structured attention, as shown in Figure 3. The output of the $n$-th AdaLN Mamba-2 block is then $Ma_{n}=M\cdot V$.
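For concreteness, a naive materialization of the quadratic dual form in Equation 2 can be sketched as follows; this reference computation builds the full $T\times T$ mask and is only for illustration, whereas Mamba-2 evaluates the same operator with a linear-time scan. Tensor shapes and names are illustrative.

```python
import torch


def ss1_masked_attention(Q, K, V, a):
    """
    Naive dual form of SSD with a 1-semiseparable mask (Eq. 2):
        output = (L o (Q K^T)) V,  with L_ij = a_i * ... * a_{j+1} for i >= j and 0 otherwise.
    Q, K, V: (B, T, d);  a: (B, T) input-dependent scalars in [0, 1].
    """
    log_a = torch.log(a.clamp_min(1e-12))                  # log space for numerical stability
    cum = torch.cumsum(log_a, dim=-1)                      # cum[t] = log(a_1 * ... * a_t)
    L = torch.exp(cum.unsqueeze(-1) - cum.unsqueeze(-2))   # exp(cum_i - cum_j) = a_{j+1} * ... * a_i
    L = torch.tril(L)                                      # zero out entries with i < j
    scores = torch.matmul(Q, K.transpose(-1, -2)) * L      # masked scores; no softmax, unlike attention
    return torch.matmul(scores, V)


# Example: a batch of 2 sequences of length 6 with 8 channels.
Q, K, V = (torch.randn(2, 6, 8) for _ in range(3))
a = torch.rand(2, 6)
print(ss1_masked_attention(Q, K, V, a).shape)              # torch.Size([2, 6, 8])
```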

[Figure 3: 1-semiseparable structured masked attention]

Compared to standard (self-)attention mechanisms, the State Space Duality (SSD) model with 1-semiseparable Structured Masked Attention (SMA) introduces significant optimizations. SSD eliminates the softmax normalization step, effectively reducing the requisite state size of the model from linear to constant and thereby improving computational complexity from quadratic to linear. Additionally, SSD incorporates a distinct elementwise mask matrix that is applied multiplicatively, further refining the model's efficiency and operational dynamics. This alteration not only simplifies the computational process but also improves the scalability and speed of the model, making it more adept at handling larger datasets and complex tasks without compromising performance. In contrast to Persona-Gestor [5], which utilizes a causal mask, the SSD model employs 1-semiseparable Structured Masked Attention (the 1-SS mask), offering a more structured approach to attention that enhances both computational efficiency and performance in sequence modeling tasks.

IV-A3 Gesture Encoder and Decoder

The architecture of the gesture encoder and decoder is meticulously designed to process gesture sequences, as illustrated in Fig. 2.

The gesture encoder employs a Convolution1D layer with a kernel size of 3 to encode the initial gesture sequence, denoted as $X$, into a hidden state $H_{x}=[h_{t}]_{t=1}^{T}$. Experimental results reveal that using a kernel size of 1 often leads to animation jitter, likely due to insufficient spatial-temporal feature capture. Conversely, a kernel size of 3 effectively mitigates this issue by capturing spatial-temporal relationships across adjacent frames, ensuring smoother gesture dynamics.

The decoder transforms the high-dimensional output from the AdaLN Mamba-2 layer to the original dimension, corresponding to the skeletal joint channels. This step involves generating the predicted noise (ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta}italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT), a critical component for gesture reconstruction. Utilizing a 1D kernel of size 1 in the decoder enables the model to extract relevant features and correlations between adjacent joint channels, thereby improving the overall quality and coherence of the generated gestures.
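A minimal PyTorch sketch of these two modules is shown below, assuming a hidden width of 1280 (from Section V-B) and a channel-last gesture layout; both choices are illustrative rather than the authors' exact configuration.

```python
import torch.nn as nn


class GestureEncoder(nn.Module):
    """Conv1D with kernel size 3: adjacent-frame context reduces the jitter seen with kernel size 1."""

    def __init__(self, n_channels, d_model=1280):
        super().__init__()
        self.conv = nn.Conv1d(n_channels, d_model, kernel_size=3, padding=1)

    def forward(self, y):                          # y: (B, T, n_channels) noisy gesture frames
        return self.conv(y.transpose(1, 2)).transpose(1, 2)


class GestureDecoder(nn.Module):
    """Conv1D with kernel size 1: maps the hidden state back to per-joint channels (predicted noise)."""

    def __init__(self, n_channels, d_model=1280):
        super().__init__()
        self.conv = nn.Conv1d(d_model, n_channels, kernel_size=1)

    def forward(self, h):                          # h: (B, T, d_model) output of the AdaLN Mamba-2 stacks
        return self.conv(h.transpose(1, 2)).transpose(1, 2)
```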

IV-B Training and Inferencing with DDPM

The diffusion process, leveraging the Denoising Diffusion Probabilistic Model (DDPM) [19, 5, 25, 26], enables the reconstruction of the conditional probability distribution between gestures and fuzzy features.

The model functions through two primary processes: the diffusion process and the generation process. During training, the diffusion process progressively transforms the original gesture data ($Y^{0}$) into white noise ($Y^{N}$) by optimizing a variational bound on the data likelihood. This gradual addition of noise is meticulously controlled to preserve the underlying data structure, facilitating effective learning of the conditional probability distribution.

During inference, the generation process seeks to reverse the transformation performed during training. It reconstructs the original gesture data from noise by reversing the noising process through a Markov chain, employing Langevin sampling [27]. This approach facilitates the accurate and effective recovery of gesture data from its perturbed state.

The Markov chains utilized in the diffusion and generation processes ensure a coherent and systematic transition between stages, thereby preserving the integrity and quality of the synthesized gestures. The specific Markov chains employed in the diffusion and generation processes are as follows:

$$p\left(Y^{n}\mid Y^{0}\right)=\mathcal{N}\!\left(Y^{n};\sqrt{\overline{\alpha}^{n}}\,Y^{0},\left(1-\overline{\alpha}^{n}\right)I\right) \quad \text{and} \quad p_{\theta}\left(Y^{n-1}\mid Y^{n},Y^{0}\right)=\mathcal{N}\!\left(Y^{n-1};\tilde{\mu}^{n}\left(Y^{n},Y^{0}\right),\tilde{\beta}^{n}I\right), \qquad (3)$$

where $\alpha^{n}:=1-\beta^{n}$ and $\overline{\alpha}^{n}:=\prod_{i=1}^{n}\alpha^{i}$. As shown by [25], $\beta^{n}$ belongs to an increasing variance schedule $\beta^{1},\dots,\beta^{N}$ with $\beta^{n}\in(0,1)$, and $\tilde{\beta}^{n}:=\frac{1-\overline{\alpha}^{n-1}}{1-\overline{\alpha}^{n}}\beta^{n}$.
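The forward chain in Equation 3 admits the usual closed-form sampling of $Y^{n}$ directly from $Y^{0}$. A short sketch using the schedule reported in Section V-B (linear betas from $10^{-4}$ to $8\times10^{-2}$, $N=1000$) is given below; variable names are illustrative.

```python
import torch

N = 1000                                          # total diffusion steps (Sec. V-B)
betas = torch.linspace(1e-4, 8e-2, N)             # linear schedule beta^1, ..., beta^N
alphas = 1.0 - betas                              # alpha^n = 1 - beta^n
alpha_bar = torch.cumprod(alphas, dim=0)          # \bar{alpha}^n = prod_i alpha^i


def q_sample(y0, n, noise=None):
    """Draw Y^n ~ N(sqrt(abar^n) Y^0, (1 - abar^n) I); n is a batch of 0-indexed step indices."""
    if noise is None:
        noise = torch.randn_like(y0)
    ab = alpha_bar.to(y0.device)[n].view(-1, 1, 1)          # broadcast over (B, T, D+6)
    return ab.sqrt() * y0 + (1.0 - ab).sqrt() * noise, noise
```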

The training objective is to optimize the model parameters $\theta$ by minimizing the Negative Log-Likelihood (NLL). This optimization is implemented through a Mean Squared Error (MSE) loss, which measures the deviation between the true noise, denoted as $\epsilon\sim\mathcal{N}(0,I)$, and the predicted noise $\epsilon_{\theta}$. The objective function can be expressed as:

$$\mathcal{L}_{\text{MSE}}=\mathbb{E}_{x^{0},\epsilon,n}\left[\left\|\epsilon-\epsilon_{\theta}(x^{n},n)\right\|^{2}_{2}\right], \qquad (4)$$

where $x^{0}$ represents the original data, $x^{n}$ is the noised version of the data at step $n$, and $n$ indicates the diffusion time step. This formulation ensures the model learns to accurately predict noise across all diffusion steps, thereby enabling the effective reconstruction of the original data during inference.

$$\mathbb{E}_{Y^{0}_{1:T},\epsilon,n}\left[\left\|\epsilon-\epsilon_{\theta}\!\left(\sqrt{\overline{\alpha}^{n}}\,Y^{0}+\sqrt{1-\overline{\alpha}^{n}}\,\epsilon,\,X,\,n\right)\right\|^{2}\right], \qquad (5)$$

Here $\epsilon_{\theta}$ is a neural network (see Figure 2) that takes the noised gestures, the speech condition $X$, and the diffusion step $n$ as input to predict $\epsilon$; it follows an architecture similar to that employed in [28].
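A compact sketch of one training step under this objective is shown below, reusing the schedule and the q_sample helper defined above; the model(y_n, audio, n) callable stands in for the DiM-Gestor denoiser $\epsilon_{\theta}$ and is a hypothetical interface.

```python
import torch
import torch.nn.functional as F


def training_step(model, y0, audio, optimizer):
    """One optimization step of Eq. (5): predict the noise injected into Y^0 at a random step n."""
    n = torch.randint(0, N, (y0.shape[0],), device=y0.device)   # uniform diffusion step per sample
    y_n, eps = q_sample(y0, n)                                  # forward-noised gestures and true noise
    eps_pred = model(y_n, audio, n)                             # epsilon_theta(Y^n, X, n)
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```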

After completing the training phase, we utilize variational inference to generate a full sequence of new gestures that align with the original data distribution, formulated as $Y^{0}\sim p_{\theta}(Y^{0}\mid X)$. In this generation phase, the entire sequence $Y^{0}$ is sampled from the learned conditional probability distribution, ensuring that the synthesized gestures accurately reflect the dynamics and nuances of the input speech features $X$.

The term $\sigma_{\theta}$ denotes the standard deviation of the conditional distribution $p_{\theta}(Y^{n-1}\mid Y^{n})$, playing a pivotal role in capturing the variability and intricacies of transitions across diffusion stages. In our model, we define $\sigma_{\theta}:=\tilde{\beta}^{n}$, where $\tilde{\beta}^{n}$ represents a predetermined scaling factor. This factor adjusts the noise level at each diffusion step, enabling a controlled balance between smoothing and preserving fine details during the generation process.

During inference, we send the entire sequence of raw audio to the condition extractor component. The output of this component is then fed to the diffusion model to generate the whole sequence of accompanying gestures ($Y^{0}$).
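A hedged sketch of this generation phase is given below: standard DDPM ancestral sampling that reverses the chain of Equation 3, reusing the schedule tensors defined earlier and the same hypothetical model interface, with the posterior variance $\tilde{\beta}^{n}$ used for the per-step noise.

```python
import torch


@torch.no_grad()
def sample_gestures(model, audio, T, n_channels, device="cuda"):
    """Reverse the diffusion chain to generate a full gesture sequence Y^0 from white noise."""
    b, a, abar = betas.to(device), alphas.to(device), alpha_bar.to(device)
    abar_prev = torch.cat([torch.ones(1, device=device), abar[:-1]])
    beta_tilde = (1 - abar_prev) / (1 - abar) * b                      # posterior variance from Eq. (3)
    y = torch.randn(1, T, n_channels, device=device)                   # start from white noise Y^N
    for n in reversed(range(N)):
        eps = model(y, audio, torch.full((1,), n, device=device))      # epsilon_theta(Y^n, X, n)
        y = (y - (1 - a[n]) / (1 - abar[n]).sqrt() * eps) / a[n].sqrt()  # posterior mean mu~^n
        if n > 0:
            y = y + beta_tilde[n].sqrt() * torch.randn_like(y)         # per-step noise
    return y                                                           # Y^0 aligned with the input speech
```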

V EXPERIMENTS

Our experiments focused on producing full 3D body gestures, including finger motions and locomotion, trained on our released Chinese Co-speech Gestures (CCG) dataset.

V-A Dataset Recording and Data Processing

V-A1 Datasets

We have constructed the Chinese Co-Speech Gestures (CCG) dataset, a high-quality, synchronized motion capture and speech audio dataset comprising 391 monologue sequences. The dataset features performances by 12 female and 5 male actors, all conducted in Chinese and covering six distinct emotion styles across five different scenarios. These styles were carefully selected to comprehensively represent various postures, hand gestures, and head movements. The total length of the dataset amounts to 958.3 minutes. Table I and Figure 4 illustrate the time distribution of the different motion styles across the various scenes within the dataset.

The style labels for each sequence were assigned according to predefined actor directives. However, it is essential to note that these labels may not always align with the subjective interpretations of independent external annotators regarding the observed movement styles. The motion capture data was recorded using the Noitom PNS system (https://www.noitom.com.cn/), which employs inertial motion capture technology.

The full-body motion capture was conducted at a frame rate of 100 frames per second (fps), with the motion data encoded using a skeletal model comprising j=59 joints. Concurrently, the audio data was recorded at a sampling rate of 44,100 Hz.

TABLE I: Duration of each motion style in the CCG dataset.

Style         Length (min)
Calm          134.72
Delightful    145.06
Excited       261.65
Happy         143.16
Sad           133.79
Serious       139.92
Total         958.3

[Figure 4: Time distribution of motion styles across scenes in the CCG dataset]

V-A2 Speech Audio Data Process

Because the Chinese HuBERT speech pre-trained model was pre-trained on speech audio sampled at 16 kHz, we uniformly resampled all audio from 44.1 kHz down to this frequency, ensuring compatibility and optimal performance.
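A hedged example of this preprocessing step using torchaudio (the file name is illustrative):

```python
import torchaudio

waveform, sr = torchaudio.load("broadcast_clip.wav")          # CCG audio recorded at 44.1 kHz
waveform = waveform.mean(dim=0, keepdim=True)                 # mono, as expected by HuBERT-style models
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
```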

V-A3 Gesture Data Process

We concentrate exclusively on full-body gestures, employing the data processing techniques detailed by Alexanderson et al. [29]. This includes capturing translational and rotational velocities to accurately delineate the root’s trajectory and orientation. The datasets are uniformly downsampled to a frame rate of 20 fps. To ensure precise and continuous representation of joint angles, we utilize the exponential map technique [30]. All data are segmented into 20-second clips for training and validation purposes.
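A sketch of this preprocessing pipeline is given below, assuming per-joint Euler angles as input and using SciPy's rotation vectors as the exponential-map representation; the joint layout and rotation order are assumptions, and root velocity extraction is omitted.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

FPS_IN, FPS_OUT, CLIP_SEC = 100, 20, 20


def process_motion(euler_deg):
    """euler_deg: (T, J, 3) per-joint Euler angles captured at 100 fps (assumed input layout)."""
    motion = euler_deg[:: FPS_IN // FPS_OUT]                          # downsample 100 fps -> 20 fps
    T, J, _ = motion.shape
    rot = R.from_euler("ZXY", motion.reshape(-1, 3), degrees=True)    # rotation order is an assumption
    expmap = rot.as_rotvec().reshape(T, J * 3)                        # exponential-map joint angles
    clip_len = FPS_OUT * CLIP_SEC                                     # 400 frames per 20-second clip
    n_clips = T // clip_len
    return expmap[: n_clips * clip_len].reshape(n_clips, clip_len, J * 3)
```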

V-B Model Settings

Our experiments utilized the Mamba-2 architecture for the global fuzzy feature extractor and six AdaLN Mamba-2 blocks, each Mamba-2 configured with an SSM state expansion factor of 256, a local convolution width of 4, and a block expansion factor of 2. The encoding process transforms each frame of the gesture sequence into hidden states $h\in\mathbb{R}^{1280}$. We employ the Chinese HuBERT speech pre-trained model (chinese-wav2vec2-base, https://github.com/TencentGameMate/chinese_speech_pretrain) for audio processing.

The diffusion model uses a variance schedule that increases linearly from $\beta_{1}=1\times10^{-4}$ to $\beta_{N}=8\times10^{-2}$, with a total of $N=1000$ diffusion steps. The training batch size is set to 32.

Our model was tested on an Intel i9 CPU with an NVIDIA GeForce RTX 4090 GPU, in contrast to the A100 GPU used by Persona-Gestor [5]. The training time was approximately 6 hours.

V-C Visualization Results

Our system excels in generating personalized gestures that are contextually aligned with speech by leveraging the Mamba-2 fuzzy feature extractor and the AdaLN Mamba-2.

[Figures 5-8: Qualitative comparison of gestures generated by DiM-Gestor, the PG model, and the Ground Truth]

Figures 5 - 8 illustrate the co-speech gesture effects generated by our proposed DiM-Gestor model, alongside a comparative analysis with the Ground Truth (GT) and the PG model. The results demonstrate that our model accurately captures the gesture styles corresponding to relevant emotions and contexts.

For instance, in a neutral-themed broadcast, the DiM-Gestor model generates gestures with the female announcer crossing her hands in front, reflecting a composed and professional demeanor. In contrast, the PG model depicts her with hands resting at her sides, missing the nuance of professional engagement (Figure 5). Similarly, when reciting content with a sad tone, the DiM-Gestor model positions the announcer’s hands down bilaterally, conveying subdued emotions. In comparison, the PG model displays more excited gestures, which deviate from the intended emotional tone, as shown in Figure 6.

In another example, for male broadcasters, as illustrated in Figure 7 and Figure 8, the GT data shows both hands bent forward, aligning with an expressive yet controlled delivery style. However, the PG model depicts hands pushed to the sides, failing to capture the gestural alignment with the speech context. These comparisons highlight the superior ability of the DiM-Gestor model to generate gestures that align with the emotional and contextual nuances of the speech, enhancing the authenticity and relatability of the digital human.

V-D Subjective and Objective Evaluation

In line with established practices in gesture generation research, we conducted a series of subjective and objective evaluations to assess the co-speech gestures generated by our proposed DiM-Gestor (DiM) model.

[Figure 9: Subjective evaluation results for human-likeness, appropriateness, and style-appropriateness]

We primarily carried out experiments and comparisons against models built on the transformer architecture, including LDA [4], DiffuseStyleGesture+ (DSG+) [2], Taming [31], GDC [3], and Persona-Gestor (PG) [5] as baseline models. DSG+ was the best-performing entry in the GENEA Challenge 2023 [32]. Our full model, which employs Mamba-2 both for the fuzzy feature extractor and for the AdaLN architecture, is abbreviated as DiM_m2_s_m2: the first 'm2' signifies the adoption of AdaLN Mamba-2, while the 'm2' after 's' indicates the use of the Mamba-2 architecture for the fuzzy feature extractor. The PG model configured with 12 blocks incorporates a stack of 12 AdaLN transformer blocks, whereas the variant with 6 blocks consists of a stack of 6 AdaLN transformer blocks.

The baseline models were initially trained using English speech datasets, including Trinity[15], ZEGGS[16], and BEAT[17]. Contrary to these settings, we employed our internally recorded Chinese dataset for training and inference.

V-D1 Subjective Evaluation

We utilize three distinct metrics for comprehensive subjective evaluations: human-likeness, appropriateness, and style-appropriateness. Human-likeness evaluates the naturalness and resemblance of the generated gestures to authentic human movements, independent of speech. Appropriateness assesses the temporal alignment of gestures with the speech’s rhythm, intonation, and semantics to ensure a seamless and natural interaction. Style-appropriateness quantifies the similarity between the generated gestures and their original human counterparts, ensuring fidelity to the intended gesture style.

We executed a user study utilizing pairwise comparisons, following the methodology outlined in [33]. During each session, participants were shown two 20-second video clips. These clips, generated by different models and including the Ground Truth (GT), were presented for direct comparative analysis. Participants were instructed to select the preferred clip based on predefined evaluation criteria. Preferences were quantified on a scale ranging from 0 to 2, with the non-selected clip in each pair receiving a corresponding inverse score. A score of zero was used to indicate a neutral preference. More details are described in [33, 5].

Given the extensive array of styles within the datasets, an individual evaluation of each style was considered unfeasible. A random selection methodology was employed to address this. Each participant was assigned a subset of five styles for assessment. Critically, none of the audio clips selected for evaluation were used in the training or validation sets. This strategy ensured a broad yet manageable coverage of the dataset’s diversity in a controlled and unbiased manner.

[Figure 10: Pairwise significance comparisons among models for the three subjective metrics]

We invited 30 native Chinese volunteers (14 males and 16 females, aged between 18 and 35) for the user study.

One-way ANOVA and post hoc Tukey multiple-comparison tests were conducted to determine whether there were statistically significant differences among the models' scores across the three evaluation aspects. The results are presented in Figure 9 and Table II, offering detailed insights into the performance variances observed among models regarding human-likeness, appropriateness, and style-appropriateness.

TABLE II: Subjective evaluation results (mean ± standard deviation; higher is better).

Model                 With fingers   Human-likeness↑    Appropriateness↑    Style-appropriateness↑
GT                    Y               0.694 ± 1.430      0.710 ± 1.268       /
DSG+                  Y              -1.494 ± 0.794     -1.472 ± 0.712      -1.326 ± 0.744
GDC                   N              -0.467 ± 1.399     -0.307 ± 1.394      -0.242 ± 1.306
LDA                   N              -1.069 ± 1.099     -0.955 ± 1.076      -0.854 ± 1.027
Taming                Y              -1.464 ± 0.692     -1.276 ± 0.796      -1.323 ± 0.769
PG-6blocks            Y               0.023 ± 1.390      0.096 ± 1.376       0.188 ± 1.188
PG-12blocks           Y               0.668 ± 1.192      0.687 ± 1.247       0.664 ± 1.216
DiM_m2_s_m2 (Ours)    Y               0.653 ± 1.323      0.682 ± 1.326       1.30 ± 0.770
DiM_m1_s_m2           Y               0.221 ± 1.355      0.148 ± 1.327       0.124 ± 1.139
DiM_m2_s_conv         Y               0.593 ± 1.222      0.425 ± 1.319       0.549 ± 1.111
DiM_m1_s_conv         Y               0.651 ± 1.279      0.571 ± 1.297       0.384 ± 1.186
DiM_m2_s_m1           Y               0.243 ± 1.348      0.106 ± 1.420       0.230 ± 1.221
DiM_m1_s_m1           Y               0.641 ± 1.236      0.580 ± 1.280       0.323 ± 1.136

Regarding the Human-likeness metric, our proposed model, DiM_m2_s_m2, achieves a score of 0.653 ± 1.323. This result closely approximates the top-performing PG-12blocks model (0.668 ± 1.192) and the Ground Truth (GT) benchmark (0.694 ± 1.430). Statistical analysis revealed no significant differences among these three models, as Figure 10 illustrates. This indicates that the DiM_m2_s_m2 model performs comparably to the human baseline in terms of perceived naturalness. In contrast, alternative methods such as DSG+ (-1.494 ± 0.794), GDC (-0.467 ± 1.399), and LDA (-1.069 ± 1.099) exhibit significantly lower scores on the Human-likeness metric.

In terms of the Appropriateness metric, DiM_m2_s_m2 attains a score of 0.682 ± 1.326, demonstrating high competitiveness and being nearly on par with PG-12blocks, the top-performing synthetic model with a score of 0.687 ± 1.247. The Ground Truth (GT) establishes the benchmark at a score of 0.710 ± 1.268. Additionally, no significant difference exists among these three models. The models' significant differences are visually depicted in Figure 10. Models such as DSG+, Taming, and GDC, which record scores within the negative range, evidently have difficulty synchronizing gestures with the speech context.

The model DiM_m2_s_m2 exhibits significant superiority in the Style-appropriateness metric, achieving the highest score of 1.30 ± 0.770 among all evaluated models. This metric underscores our method's capability to generate gestures stylistically congruent with the various Chinese speech styles. In contrast, the PG-12blocks model attains a lower score of 0.664 ± 1.216, while the other methods, including DSG+, GDC, and LDA, exhibit negative scores in this category. These findings emphasize the distinct advantage of DiM_m2_s_m2 in producing gestures that align closely with the intended stylistic attributes of the spoken language.

In conclusion, DiM_m2_s_m2 outperforms alternative models in Style-appropriateness and achieves highly competitive results in Human-likeness and Appropriateness. These findings suggest that DiM_m2_s_m2 effectively generates perceptually realistic gestures that are well aligned with the contextual and stylistic requirements of speech-driven gesture synthesis. This highlights the strength of our Mamba-2 approach in addressing the multi-dimensional challenges in this domain, setting a new standard for synthetic gesture quality compared to traditional transformer methods.

V-D2 Objective Evaluation

We employ three objective evaluation metrics to assess the quality and synchronization of generated gestures: Fréchet Gesture Distance (FGD) in both feature and raw data spaces[34], and BeatAlign[35]. Inspired by the Fréchet Inception Distance (FID)[36], FGD evaluates the quality of generated gestures and has shown moderate correlation with human-likeness ratings, surpassing other objective metrics[37]. BeatAlign, on the other hand, assesses gesture-audio synchrony by calculating the Chamfer Distance between audio beats and gesture beats, thus providing insights into the temporal alignment of gestures with speech rhythms.
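FGD follows the Fréchet formulation of FID: Gaussians are fitted to feature sets of real and generated gestures and compared in closed form. A minimal NumPy/SciPy sketch of the distance itself is shown below; the feature-extraction network (and the raw-data-space variant, which applies the same formula to raw pose vectors) is omitted.

```python
import numpy as np
from scipy import linalg


def frechet_gesture_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to (N, D) feature sets of real and generated gestures."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)          # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                                        # discard tiny imaginary residues
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```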

TABLE III: Objective Evaluation Metrics

Model               | FGD↓ (feature space) | FGD↓ (raw data space) | BeatAlign↑
GT                  | /                    | /                     | /
DSG+                | 42329.558            | 12206710.224          | 0.423
GDC                 | 148.937              | 2536.248              | 0.657
LDA                 | 34678.668            | 12305524.751          | 0.428
Taming              | 5699.738             | 394606.850            | 0.667
PG-6blocks          | 108.913              | 1952.618              | 0.667
PG-12blocks         | 100.899              | 2128.274              | 0.669
DiM_m2_s_m2 (Ours)  | 17.716               | 424.287               | 0.670
DiM_m1_s_conv       | 87.025               | 1293.644              | 0.672
DiM_m2_s_conv       | 44.282               | 862.673               | 0.678
DiM_m1_s_m1         | 130.119              | 1789.393              | 0.669
DiM_m2_s_m1         | 120.053              | 1911.956              | 0.669
DiM_m1_s_m2         | 184.300              | 2174.291              | 0.674

Table III provides a comparison of the objective evaluation metrics for several models in speech-driven gesture synthesis, including our proposed method, DiM_m2_s_m2, alongside GT, DSG+, GDC, LDA, Taming, PG-6blocks, and PG-12blocks. The metrics evaluated are the Fréchet Gesture Distance (FGD) in both feature space and raw data space, as well as the BeatAlign score. Lower FGD values and higher BeatAlign scores indicate better performance.

Our proposed model, DiM_m2_s_m2, achieves the lowest FGD on the feature space among all synthetic models with a score of 17.716, significantly outperforming the other methods. By comparison, PG-12blocks and PG-6blocks, which also performed well on other perceptual metrics, scored 100.899 and 108.913, respectively. High FGD values in DSG+ (42329.558) and LDA (34678.668) indicate poor alignment with the target gesture distribution, highlighting our approach’s substantial improvements in naturalness and similarity to natural gestures.

On the BeatAlign metric, DiM_m2_s_m2 scores 0.670, exceeding PG-12blocks (0.669) and approaching the highest score, achieved by DiM_m2_s_conv (0.678), which demonstrates the overall balance of our model across both spatial and temporal metrics. Models such as DSG+ (0.423) and LDA (0.428) scored significantly lower, indicating suboptimal temporal synchronization.

In summary, DiM_m2_s_m2 achieves superior performance across multiple objective metrics, with the lowest FGD values in both the feature and raw data spaces, indicating a close resemblance to natural gestures. Although its BeatAlign score is slightly below that of the top-performing DiM_m2_s_conv configuration, the overall results validate DiM_m2_s_m2 as a highly effective approach for generating temporally and spatially consistent gestures in speech-driven gesture synthesis.

V-E Ablation Studies

This section details an ablation study to assess the individual contributions of two key components of our proposed model: the fuzzy feature extractor and the AdaLN architectures equipped with different versions of the Mamba framework, as shown in Tables II and III and Figure 9.

V-E1 Effect of Fuzzy Feature Extractor

Table II and Figure 9 present the performance differences when using Mamba-1, Mamba-2, or a 1D convolution [5] as the fuzzy feature extractor across various model configurations. In all instances, the AdaLN module is integrated with the Mamba-2 architecture.
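As a rough illustration of how this ablation swaps extractors, the sketch below builds each variant behind a common interface; the module names are hypothetical, and the `Mamba`/`Mamba2` layers are assumed to come from the `mamba_ssm` package rather than from our released code.

```python
# Hedged sketch of the extractor swap used in the ablation (names illustrative).
import torch
import torch.nn as nn
from mamba_ssm import Mamba, Mamba2  # assumed import path

class Conv1DExtractor(nn.Module):
    """1D convolution over the time axis, keeping (batch, frames, channels)."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

def build_fuzzy_feature_extractor(kind: str, d_model: int = 512) -> nn.Module:
    """Return the sequence layer that infers fuzzy speaker/style features
    from frame-level audio embeddings of shape (batch, frames, d_model)."""
    if kind == "m2":                      # *_s_m2 configurations
        return Mamba2(d_model=d_model)
    if kind == "m1":                      # *_s_m1 configurations
        return Mamba(d_model=d_model)
    if kind == "conv":                    # *_s_conv configurations
        return Conv1DExtractor(d_model)
    raise ValueError(f"unknown extractor kind: {kind}")
```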

On the Human Likeness metric, both DiM_m2_s_m2 (0.653 ± 1.323) and DiM_m2_s_conv (0.593 ± 1.222) achieve high scores, with no statistically significant difference between the two (p = 1.0 > 0.05). These findings suggest that Mamba-2 and the 1D convolution are competitive alternatives for capturing human-like gestures. However, DiM_m2_s_m1, which employs Mamba-1, achieves a significantly lower score of 0.243 ± 1.348.

Similarly, on the Appropriateness metric, there is no significant difference (p = 0.21 > 0.05) between DiM_m2_s_m2 (0.682 ± 1.326) and DiM_m2_s_conv (0.425 ± 1.319); both models align well with the contextual requirements of speech-driven gestures. However, DiM_m2_s_m1 achieves a much lower score (0.106 ± 1.420).

In Style Appropriateness, DiM_m2_s_m2 outperforms the alternatives, achieving the highest score of 1.30 ± 0.770. This result underscores the superiority of Mamba-2 in capturing stylistic nuances and generating gestures that are contextually relevant and visually coherent with the speech content. In comparison, DiM_m2_s_conv scores moderately at 0.549 ± 1.111, while DiM_m2_s_m1 achieves a significantly lower score of 0.230 ± 1.221.

The objective evaluation results in Table III highlight the benefits of the Mamba-2 fuzzy feature extractor in achieving superior alignment and detail in gesture synthesis. Specifically, DiM_m2_s_m2 records the lowest Fréchet Gesture Distance (FGD) in feature space at 17.716, outperforming alternative configurations such as DiM_m2_s_conv (44.282) and DiM_m2_s_m1 (120.053).

On the BeatAlign metric, which measures temporal synchronization with speech, DiM_m2_s_conv achieves the highest score of 0.678, slightly ahead of DiM_m2_s_m2 (0.670) and DiM_m2_s_m1 (0.669); the Mamba-2 extractor in DiM_m2_s_m2 thus remains competitive.

This ablation of the fuzzy feature extractor demonstrates that our proposed model, DiM_m2_s_m2, which uses Mamba-2 as the fuzzy feature extractor, consistently outperforms the Mamba-1 configurations, particularly in Style Appropriateness. Although no significant difference was observed between DiM_m2_s_m2 and DiM_m2_s_conv in Human Likeness and Appropriateness, DiM_m2_s_m2 shows a clear advantage in Style Appropriateness. These findings validate the effectiveness of Mamba-2 in achieving high-quality gesture synthesis that balances naturalness, contextual alignment, and stylistic coherence.

V-E2 Effect of AdaLN Mamba Architecture

This experiment investigates the impact of employing different AdaLN Mamba architectures (AdaLN Mamba-1 and AdaLN Mamba-2) on generation quality, as shown in Tables II and III. On the Human Likeness metric, DiM_m2_s_m2 achieves a high score of 0.653 ± 1.323, comparable to DiM_m1_s_conv (0.651 ± 1.279) and DiM_m1_s_m1 (0.641 ± 1.236). Statistical analysis reveals no significant difference between these models on this metric (Figure 9). However, DiM_m2_s_m1, which uses the Mamba-1 architecture, scores considerably lower at 0.243 ± 1.348. This suggests that while both architectures can produce human-like gestures, the Mamba-2 architecture in DiM_m2_s_m2 captures nuanced human motion slightly better.

Style Appropriateness highlights a more pronounced distinction between the models. Our proposed model, DiM_m2_s_m2, achieves the highest score of 1.30 ± 0.770, indicating superior stylistic coherence and visual appeal in gesture synthesis. Both DiM_m1_s_conv (0.384 ± 1.186) and DiM_m1_s_m1 (0.323 ± 1.136) perform moderately, while DiM_m2_s_m1 again scores poorly (0.230 ± 1.221).

Quantitatively, Table III shows that models with AdaLN Mamba-2 architectures, such as DiM_m2_s_m2, consistently achieve lower FGD scores, underscoring their alignment with natural gestures. For instance, DiM_m2_s_conv records an FGD of 44.282 in feature space, while DiM_m1_s_m1, which uses AdaLN Mamba-1, records a significantly higher FGD of 130.119. This suggests that AdaLN Mamba-2 enhances alignment with the target gesture distribution.

The ablation study confirms that using Mamba-2 for both the fuzzy feature extractor and the AdaLN architecture yields the best results. Our proposed model, DiM_m2_s_m2, outperforms all other configurations on both perceptual and quantitative metrics, highlighting the combined benefits of the Mamba-2 configurations in generating realistic, contextually appropriate, and stylistically aligned gestures for speech-driven gesture synthesis.

Interestingly, experiments that mixed Mamba-1 and Mamba-2 architectures resulted in a noticeable decrease in performance. Conversely, utilizing a consistent Mamba architecture throughout substantially enhanced the scores. The best results were achieved using Mamba-2, underscoring its superior effectiveness in this application context.

VI Parameter Counts and Inference Speed

This section compares the parameter counts and inference speed of DiM-Gestor with different Mamba versions against PG, which utilizes the AdaLN Transformer architecture.

VI-A Parameter Counts

DiM_m2_s_m2 demonstrates a relatively low parameter count of 535M, making it considerably more compact than configurations such as DiM_m1_s_conv (836M) and DiM_m2_s_conv (826M). This compactness is advantageous, as it indicates that DiM_m2_s_m2 can achieve high-quality gesture synthesis with reduced computational overhead compared to larger models. While the AdaLN-based transformer model, PG-12blocks, can generate high-quality gestures, its significantly larger parameter count of 1.2B may impact memory requirements, posing challenges for deployment in resource-constrained environments.
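The parameter counts quoted above can be reproduced for any configuration by summing tensor sizes; the snippet below is a minimal sketch in which the model variable is a placeholder for the instantiated network, not part of a released API.

```python
# Minimal sketch for reproducing the reported parameter counts.
import torch.nn as nn

def count_parameters(model: nn.Module) -> str:
    """Sum all parameter tensor sizes and format as millions/billions."""
    n = sum(p.numel() for p in model.parameters())
    return f"{n / 1e9:.1f}B" if n >= 1e9 else f"{n / 1e6:.0f}M"

# e.g. count_parameters(dim_m2_s_m2)  ->  expected "535M" for our configuration
```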

These observations highlight the efficiency and practicality of DiM_m2_s_m2 as a robust yet computationally lightweight solution for speech-driven gesture synthesis.

VI-B Inference Speed

We aggregated the 20s audio segments into durations of 40s, 60s, 80s, and 100s to evaluate the computational efficiency of the models when processing gesture sequences of varying lengths. Table IV and Figure 11 detail each configuration's inference times across these sequence lengths. DiM_m2_s_m2 shows balanced performance, with an inference time of 10.16s for 20s sequences that scales to 23.27s for 100s sequences; it maintains competitive inference speed even for longer sequences, outperforming other configurations with similar parameter sizes. By comparison, PG-12blocks requires 20.153s to generate a 20-second gesture sequence, which grows to roughly 40s for a 40-second sequence and nearly 100s for a 100-second sequence, underscoring the substantially greater computational demands of the PG-12blocks model.
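The timing protocol can be sketched as follows; `load_audio` and `model.generate` are illustrative placeholders rather than a released API, and the explicit GPU synchronization ensures the wall-clock measurements include all device work.

```python
# Hedged sketch of the inference-speed measurement (placeholder model API).
import time
import torch

@torch.no_grad()
def time_inference(model, audio_clips, device="cuda"):
    """Measure wall-clock generation time per audio duration (in seconds)."""
    timings = {}
    for path, seconds in audio_clips:            # e.g. [("clip_20s.wav", 20), ...]
        audio = load_audio(path).to(device)      # placeholder loader
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        _ = model.generate(audio)                # placeholder generation call
        if device == "cuda":
            torch.cuda.synchronize()             # wait for all GPU work to finish
        timings[seconds] = time.perf_counter() - start
    return timings
```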

TABLE IV: Parameter Counts and Inference Time (seconds) by Gesture Length

Models              | Param. Counts↓ | 20s↓   | 40s↓   | 60s↓   | 80s↓   | 100s↓
PG-6blocks          | 812M           | 7.87   | 12.46  | 16.44  | 24.31  | 28.22
PG-12blocks         | 1.2B           | 20.153 | 39.637 | 59.785 | 81.344 | 99.214
DiM_m2_s_m2 (Ours)  | 535M           | 10.16  | 9.86   | 12.60  | 17.96  | 23.27
DiM_m1_s_m2         | 545M           | 9.51   | 15.34  | 21.08  | 28.95  | 33.32
DiM_m2_s_conv       | 826M           | 9.13   | 10.26  | 12.58  | 17.84  | 28.67
DiM_m1_s_conv       | 836M           | 8.49   | 15.30  | 20.98  | 28.93  | 34.50
DiM_m2_s_m1         | 536M           | 9.43   | 9.83   | 12.59  | 17.81  | 24.89
DiM_m1_s_m1         | 546M           | 8.51   | 15.31  | 21.00  | 28.99  | 35.52

Figure 11: Inference time comparison of the model configurations across gesture sequence lengths.

Compared to the AdaLN Transformer architecture that requires 12 blocks, the DiM model employs only 6 AdaLN Mamba-2 blocks to achieve comparable performance in generating co-speech gestures. This configuration results in a reduction of memory usage by approximately 2.4 times and an increase in inference speeds by 2 to 4 times.

These results underscore the efficiency of our proposed model, DiM_m2_s_m2, which achieves a favorable balance between parameter count and inference speed. With a lower parameter count of 535M and competitive inference times, DiM_m2_s_m2 is well suited for applications requiring both performance and efficiency in speech-driven gesture synthesis. These findings highlight DiM_m2_s_m2 as a compact and efficient model that outperforms the other tested architectures, particularly in long-duration gesture synthesis tasks.

In summary, our experimental findings confirm that DiM-Gestor offers competitive advantages over traditional Transformer-based models, achieving state-of-the-art results in Style Appropriateness and excelling on the Human Likeness and Appropriateness metrics. These outcomes validate the model's capability to produce gestures that are visually convincing and contextually synchronized with speech. Further, our ablation study highlights the Mamba-2 architecture's superior performance and efficiency, which is particularly beneficial for processing long sequences. Compared with PG and its 12-block AdaLN Transformer, this architecture reduces memory usage by approximately 2.4 times and increases inference speed by 2 to 4 times, underscoring its potential for scalable and real-time applications.

VII Discussion and Conclusion

This study introduces DiM-Gestor, a novel architecture for co-speech gesture synthesis, employing a Mamba-2 fuzzy feature extractor and AdaLN Mamba-2 within a diffusion framework. This model is designed to generate highly personalized 3D full-body gestures solely from raw speech audio, representing a significant advancement in gesture synthesis technologies for virtual human applications.

DiM-Gestor's fuzzy feature extractor leverages Mamba-2 to capture implicit, speaker-specific features from audio, enabling the synthesis of gestures that resonate with the style and rhythm of the speaker's voice without relying on predefined style labels. This approach improves generalization and usability, making the model applicable across diverse scenarios without requiring extensive label-specific training data. The integration of AdaLN Mamba-2 further enhances the model's efficiency and flexibility, surpassing traditional Transformer-based methods in memory efficiency and inference speed. By maintaining linear complexity, AdaLN Mamba-2 reduces computational overhead, making real-time applications feasible while preserving gesture quality comparable to the existing Transformer-based Persona-Gestor.
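For readers unfamiliar with the adaptive-normalization pattern, the sketch below shows one way an AdaLN Mamba-2 block can be organized in the DiT style, with the conditioning vector regressing shift, scale, and gate terms around a Mamba-2 token mixer. Layer names are illustrative and the `Mamba2` layer is assumed to come from the `mamba_ssm` package; this is a sketch of the general pattern, not a verbatim reproduction of our implementation.

```python
# Hedged sketch of a DiT-style AdaLN block with a Mamba-2 token mixer.
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed import path

class AdaLNMamba2Block(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mixer = Mamba2(d_model=d_model)         # linear-time sequence mixer
        self.ada = nn.Sequential(                    # condition -> (shift, scale, gate)
            nn.SiLU(), nn.Linear(d_model, 3 * d_model)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model); cond: (batch, d_model) fused speech/timestep features
        shift, scale, gate = self.ada(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift       # adaptive layer normalization
        return x + gate * self.mixer(h)              # gated residual update
```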

In addition to model advancements, we contribute a large-scale, high-quality Chinese Co-Speech Gesture Dataset (CCG dataset), recorded by professional broadcasters, encompassing various styles and scenarios. This dataset enriches the resources available for research and development in this field, particularly for applications involving formal and structured speech in the Chinese language.

For future development, we plan to accelerate inference, specifically by integrating the accelerated diffusion model sCM [38]. This adaptation could amplify inference speed by at least 50 times, meeting the stringent real-time performance requirements of interactive applications. Such an enhancement would refine the user experience and broaden the model's applicability across dynamic, user-centric platforms.

References

  • [1] S. Yang, Z. Wu, M. Li, Z. Zhang, L. Hao, W. Bao, M. Cheng, and L. Xiao, "DiffuseStyleGesture: Stylized audio-driven co-speech gesture generation with diffusion models," arXiv preprint arXiv:2305.04919, 2023.
  • [2] S. Yang, H. Xue, Z. Zhang, M. Li, Z. Wu, X. Wu, S. Xu, and Z. Dai, "The DiffuseStyleGesture+ entry to the GENEA Challenge 2023," in Proceedings of the 25th International Conference on Multimodal Interaction, 2023, pp. 779–785.
  • [3] T. Ao, Z. Zhang, and L. Liu, "GestureDiffuCLIP: Gesture diffusion model with CLIP latents," ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–18, 2023.
  • [4] S. Alexanderson, R. Nagy, J. Beskow, and G. E. Henter, "Listen, denoise, action! Audio-driven motion synthesis with diffusion models," ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–20, 2023.
  • [5] F. Zhang, Z. Wang, X. Lyu, S. Zhao, M. Li, W. Geng, N. Ji, H. Du, F. Gao, and H. Wu, "Speech-driven personalized gesture synthetics: Harnessing automatic fuzzy feature inference," IEEE Transactions on Visualization and Computer Graphics, 2024. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10508094/
  • [6] W. Peebles and S. Xie, "Scalable diffusion models with transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.
  • [7] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
  • [8] L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, "Vision Mamba: Efficient visual representation learning with bidirectional state space model," arXiv preprint arXiv:2401.09417, 2024.
  • [9] J. Ma, F. Li, and B. Wang, "U-Mamba: Enhancing long-range dependency for biomedical image segmentation," arXiv preprint arXiv:2401.04722, 2024.
  • [10] K. Li, X. Li, Y. Wang, Y. He, Y. Wang, L. Wang, and Y. Qiao, "VideoMamba: State space model for efficient video understanding," Springer, pp. 237–255, 2025.
  • [11] Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu, "SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 578–588.
  • [12] A. Behrouz and F. Hashemi, "Graph Mamba: Towards learning on graphs with state space models," in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 119–130.
  • [13] C. Wang, O. Tsepa, J. Ma, and B. Wang, "Graph-Mamba: Towards long-range graph sequence modeling with selective state spaces," arXiv preprint arXiv:2402.00789, 2024.
  • [14] T. Dao and A. Gu, "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," arXiv preprint arXiv:2405.21060, 2024. [Online]. Available: https://arxiv.org/abs/2405.21060
  • [15] Y. Ferstl and R. McDonnell, "Investigating the use of recurrent motion modelling for speech gesture generation," in Proc. of the 18th International Conf. on Intelligent Virtual Agents, 2018, pp. 93–98.
  • [16] S. Ghorbani, Y. Ferstl, D. Holden, N. F. Troje, and M.-A. Carbonneau, "ZeroEGGS: Zero-shot example-based gesture generation from speech," in Computer Graphics Forum, vol. 42, no. 1. Wiley Online Library, 2023, pp. 206–216.
  • [17] H. Liu, Z. Zhu, N. Iwamoto, Y. Peng, Z. Li, Y. Zhou, E. Bozkurt, and B. Zheng, "BEAT: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis," in Computer Vision, ECCV 2022, Pt. VII, vol. 13667, 2022, pp. 612–630.
  • [18] G. Lee, Z. Deng, S. Ma, T. Shiratori, S. S. Srinivasa, and Y. Sheikh, "Talking with hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 763–772.
  • [19] F. Zhang, N. Ji, F. Gao, and Y. Li, "DiffMotion: Speech-driven gesture synthesis using denoising diffusion model," in MultiMedia Modeling: 29th International Conf., MMM 2023, Bergen, Norway, January 9–12, 2023, Proc., Part I. Springer, 2023, pp. 231–242.