Temporal and Spatial Attention for 3D Pose Estimation

Imagine a dancer performing a routine. At every moment, her body is in a specific pose: arms raised, legs extended, and so on (space). As she starts to spin, each part of her body moves, adjusting in real time to maintain balance and fluidity (time). Her pose is constantly changing, but at each instant, it’s a precise combination of spatial positions.

For AI to replicate this in 3D pose estimation, it must not only track the spatial configuration of each body part at any given moment, but also understand how these parts move and evolve over time. The AI needs to keep track of where the body parts are at each frame and how they transition smoothly between frames.

This is where temporal and spatial attention come into play. Spatial attention helps the model focus on the relative positions of the joints at each frame, while temporal attention enables the model to track the motion of the joints over time, understanding the continuity of movement. By combining these two, AI systems can more accurately estimate human pose, capturing both the structure of the body and the dynamics of motion. Let’s take a closer look at the theory behind these concepts.


If you want to see a complete implementation of this model, you can look into this code.

The Ill-Posed Problem: Lifting 2D to 3D

At its core, the task of 2D-to-3D pose estimation is to lift a 2D representation of a person’s keypoints (essentially a list of coordinates for different joints) into 3D space. The challenge comes from the fact that we only have 2D projections of the 3D world, making it difficult to infer the third dimension, depth, directly. The problem gets even harder with complex motions, where joints move in ways that are difficult to capture with 2D information alone.

To bridge the gap, temporal and spatial attention come into play. These two mechanisms allow the model to not only focus on how the joints relate to each other in a given frame but also how they evolve over time, capturing both the motion dynamics and the anatomical structure of the human body.

Temporal Attention: Understanding Motion Over Time

Human motions, whether walking, running, or waving, are continuous: the pose at any given moment is influenced by the poses at the moments around it. This is where temporal attention shines. It allows the model to capture the flow of motion across frames, ensuring that each 3D pose prediction is consistent with the preceding and following frames.

Let’s imagine you have a sequence of 2D keypoints \{\mathbf{P}_1, \mathbf{P}_2, \ldots, \mathbf{P}_T\} representing the human poses over T frames, where each \mathbf{P}_t \in \mathbb{R}^{J \times 2} is a set of 2D keypoints for J joints at frame t. The goal is to predict the corresponding 3D poses \mathbf{P}_t^{3D} \in \mathbb{R}^{J \times 3}.
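
To make the setup concrete, here is a minimal PyTorch sketch of the tensor shapes involved. The values T = 243 and J = 17 are only illustrative choices (a common clip length and skeleton size), not something the formulation fixes.

```python
import torch

# Illustrative shapes only: T frames and J joints (17 joints is a common
# skeleton convention, e.g. Human3.6M); the method itself does not fix these.
T, J = 243, 17
poses_2d = torch.randn(T, J, 2)   # input sequence of 2D keypoints, each P_t in R^{J x 2}
target_3d = torch.randn(T, J, 3)  # corresponding 3D poses, each P_t^{3D} in R^{J x 3}

print(poses_2d.shape, target_3d.shape)  # torch.Size([243, 17, 2]) torch.Size([243, 17, 3])
```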

How Temporal Attention Works

The self-attention mechanism helps us here. It enables the model to figure out which other frames in the sequence should inform the current frame’s prediction. For each frame t, we calculate queries \mathbf{Q}_t, keys \mathbf{K}_t, and values \mathbf{V}_t from the 2D keypoints:

    \[\mathbf{Q}_t = \mathbf{P}_t W_Q, \quad \mathbf{K}_t = \mathbf{P}_t W_K, \quad \mathbf{V}_t = \mathbf{P}_t W_V\]

where W_Q, W_K, and W_V are learned weight matrices. The attention scores A_{t,i} between frames t and i are computed as follows:

    \[ A_{t,i} = \frac{\mathbf{Q}_t \cdot \mathbf{K}_i^T}{\sqrt{d_k}}, \quad \alpha_{t,i} = \text{softmax}(A_{t,i}) \]

This equation tells us how much each frame i in the sequence should influence the current frame t. The final 3D pose prediction for frame t is then a weighted sum of the values from all frames:

    \[ \mathbf{P}_t^{3D} = \sum_{i=1}^{T} \alpha_{t,i} \cdot \mathbf{V}_i \]

Strictly speaking, this weighted sum lives in the values’ feature space, so a final regression layer still maps it to 3D coordinates; conceptually, though, the mechanism ensures that the model transitions smoothly from one pose to the next, which is especially useful for dynamic motions.
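
As a rough illustration, here is a minimal PyTorch sketch of this temporal attention step. It follows the equations above, computed independently for each joint across frames, assumes d_k = d_v, and adds a small linear head to map the attended features to 3D coordinates. It is a simplified sketch, not the MotionBERT implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Each joint attends to the same joint across all T frames."""
    def __init__(self, d_k=64):
        super().__init__()
        self.W_Q = nn.Linear(2, d_k, bias=False)   # project 2D keypoints to queries
        self.W_K = nn.Linear(2, d_k, bias=False)   # ... to keys
        self.W_V = nn.Linear(2, d_k, bias=False)   # ... to values (d_v = d_k here)
        self.head = nn.Linear(d_k, 3)              # regress 3D coordinates from attended features
        self.d_k = d_k

    def forward(self, poses_2d):                   # poses_2d: (T, J, 2)
        Q = self.W_Q(poses_2d)                     # (T, J, d_k), i.e. Q_t in R^{J x d_k}
        K = self.W_K(poses_2d)                     # (T, J, d_k)
        V = self.W_V(poses_2d)                     # (T, J, d_k)
        # attention over frames, computed independently for every joint j
        scores = torch.einsum('tjd,ijd->jti', Q, K) / self.d_k ** 0.5   # A_{t,i}: (J, T, T)
        alpha = torch.softmax(scores, dim=-1)                           # weights over frames i
        context = torch.einsum('jti,ijd->tjd', alpha, V)                # weighted sum of values
        return self.head(context)                  # (T, J, 3) per-frame 3D pose estimates
```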

Temporal Attention: Changing Dimensions

In the attention mechanism, the input 2D pose matrix \mathbf{P}_t \in \mathbb{R}^{J \times 2} is transformed into new feature spaces for queries, keys, and values:

    \[\mathbf{Q}_t \in \mathbb{R}^{J \times d_k}, \quad \mathbf{K}_t \in \mathbb{R}^{J \times d_k}, \quad \mathbf{V}_t \in \mathbb{R}^{J \times d_v}\]

The dimension of the queries and keys, d_k, is a hyperparameter, and d_v is the size of the value vectors. This dimensional transformation allows the model to learn richer representations and capture temporal dependencies across frames.
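
Continuing the sketch above, a quick shape check makes this dimensional change visible (with the illustrative choice d_k = d_v = 64):

```python
T, J = 243, 17
poses_2d = torch.randn(T, J, 2)
attn = TemporalAttention(d_k=64)

Q_t = attn.W_Q(poses_2d)   # (243, 17, 64): one Q_t in R^{J x d_k} per frame
V_t = attn.W_V(poses_2d)   # (243, 17, 64): here d_v = d_k = 64 for simplicity
out = attn(poses_2d)       # (243, 17, 3): lifted 3D poses
print(Q_t.shape, V_t.shape, out.shape)
```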

Spatial Attention: Maintaining Anatomical Consistency

While temporal attention deals with motion over time, spatial attention focuses on how the joints in a single frame relate to each other. For example, the position of the knee is influenced by the position of the hip and ankle, and these anatomical constraints must be preserved in the 3D reconstruction.

How Spatial Attention Works

Spatial attention is built on the idea that joints are connected in a graph-like structure. Each joint is influenced by its neighbors, and the relationships between them must be modeled carefully. In the spatial attention mechanism, we compute a query \mathbf{Q}_j for each joint j, and keys \mathbf{K}_k and values \mathbf{V}_k for the joints k it attends to:

    \[\mathbf{Q}_j = \mathbf{P}_j W_Q, \quad \mathbf{K}_k = \mathbf{P}_k W_K, \quad \mathbf{V}_k = \mathbf{P}_k W_V\]

The attention score A_{jk} between joints j and k is computed as:

    \[A_{jk} = \frac{\mathbf{Q}_j \cdot \mathbf{K}_k^T}{\sqrt{d_k}}\]

The spatial attention weight \alpha_{jk} is given by:

    \[\alpha_{jk} = \text{softmax}(A_{jk})\]

The representation for joint j is then a weighted sum of the values of its neighboring joints k \in \mathcal{N}(j), from which its 3D position is regressed:

    \[ \mathbf{P}_j^{3D} = \sum_{k \in \mathcal{N}(j)} \alpha_{jk} \cdot \mathbf{V}_k \]

This ensures that the 3D pose respects the relative positioning of joints, preserving anatomical consistency.
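
Here is a matching PyTorch sketch of spatial attention over the joints within each frame. The optional boolean neighbor_mask is one simple way to realize the neighborhood \mathcal{N}(j) (e.g. derived from the kinematic skeleton); leaving it out lets every joint attend to every other joint. Again, this is an illustrative sketch rather than the paper’s implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Within each frame, every joint attends to (a subset of) the other joints."""
    def __init__(self, d_k=64):
        super().__init__()
        self.W_Q = nn.Linear(2, d_k, bias=False)
        self.W_K = nn.Linear(2, d_k, bias=False)
        self.W_V = nn.Linear(2, d_k, bias=False)
        self.head = nn.Linear(d_k, 3)              # regress a 3D position per joint
        self.d_k = d_k

    def forward(self, poses_2d, neighbor_mask=None):   # poses_2d: (T, J, 2); mask: (J, J) bool
        Q = self.W_Q(poses_2d)                          # (T, J, d_k)
        K = self.W_K(poses_2d)
        V = self.W_V(poses_2d)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5   # A_{jk}: (T, J, J)
        if neighbor_mask is not None:
            # restrict attention to the joints in N(j), e.g. kinematic neighbors
            scores = scores.masked_fill(~neighbor_mask, float('-inf'))
        alpha = torch.softmax(scores, dim=-1)                 # alpha_{jk}, rows sum to 1
        context = alpha @ V                                   # weighted sum over joints k
        return self.head(context)                             # (T, J, 3)

spatial = SpatialAttention(d_k=64)
print(spatial(torch.randn(243, 17, 2)).shape)   # torch.Size([243, 17, 3]), full joint-to-joint attention
```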

Fusion of Temporal and Spatial Streams

Figure: the DSTformer architecture, from MotionBERT.

The fusion mechanism in DSTformer (Dual-stream Spatio-temporal Transformer) is a key component that integrates the outputs of the spatial \mathbf{S} and temporal \mathbf{T} streams. Both streams independently capture spatial and temporal information, leveraging dedicated blocks within each stream for this purpose. The fusion combines these outputs to create a unified representation, mathematically expressed as:

    \[ \mathbf{X} = \mathbf{S} \cdot \boldsymbol{\alpha}_S + \mathbf{T} \cdot \boldsymbol{\alpha}_T \]

Here:

  • \mathbf{S} is the output of the spatial stream, encompassing information about spatial relationships and features across the input sequence.
  • \mathbf{T} is the output of the temporal stream, which captures motion dynamics and temporal dependencies.
  • \boldsymbol{\alpha} = \text{softmax}(\boldsymbol{\beta}), where \boldsymbol{\beta} are learnable parameters controlling the relative contributions of \mathbf{S} and \mathbf{T} to the fused output. The softmax ensures that \boldsymbol{\alpha}_S + \boldsymbol{\alpha}_T = 1.

In practice, this mechanism dynamically adjusts the weighting of \mathbf{S} and \mathbf{T} for each input frame, making the fusion context-aware. For example, scenes with more significant spatial variations might rely more on \mathbf{S}, while motion-heavy sequences may prioritize \mathbf{T}.

The architecture of the spatial and temporal streams ensures complementary feature extraction:

  • Spatial stream: Employs spatial attention blocks to focus on intra-frame relationships, identifying important features across spatial dimensions.
  • Temporal stream: Uses temporal attention blocks to model inter-frame dependencies, capturing the evolution of features over time.

By combining these streams through the fusion equation, DSTformer achieves a holistic understanding of both spatial and temporal dimensions, vital for tasks like pose estimation. The adaptability of the learned weights ensures that the model effectively balances spatial and temporal insights without requiring manual tuning, enabling robust performance across varied scenarios.
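
To tie the pieces together, here is a minimal fusion sketch built on the two attention sketches from earlier. It fuses the final stream outputs with a single global pair of learnable weights \boldsymbol{\beta}, whereas, as described above, DSTformer adapts the weights per input and fuses richer intermediate features; the simplification is only for illustration.

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Fuse spatial and temporal stream outputs with learnable softmax weights (sketch)."""
    def __init__(self, d_k=64):
        super().__init__()
        self.spatial = SpatialAttention(d_k)      # sketches defined earlier in this post
        self.temporal = TemporalAttention(d_k)
        self.beta = nn.Parameter(torch.zeros(2))  # learnable logits beta for (alpha_S, alpha_T)

    def forward(self, poses_2d):                  # poses_2d: (T, J, 2)
        S = self.spatial(poses_2d)                # spatial-stream output, (T, J, 3)
        Tm = self.temporal(poses_2d)              # temporal-stream output, (T, J, 3)
        alpha = torch.softmax(self.beta, dim=0)   # alpha_S + alpha_T = 1
        return alpha[0] * S + alpha[1] * Tm       # X = S * alpha_S + T * alpha_T

fusion = DualStreamFusion(d_k=64)
print(fusion(torch.randn(243, 17, 2)).shape)      # torch.Size([243, 17, 3])
```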

Conclusion

By combining temporal and spatial attention, DSTformer offers a powerful solution to the 2D-to-3D pose estimation problem. Temporal attention helps the model track motion over time, ensuring smooth transitions between poses, while spatial attention enforces anatomical constraints, ensuring that the 3D poses are realistic. The fusion of these two streams allows DSTformer to leverage the strengths of both temporal and spatial reasoning, ultimately leading to more accurate and robust 3D pose estimations.

This model’s ability to capture both the flow of motion and the anatomy of the human body makes it a promising tool for a wide range of applications, from motion capture in entertainment to human-computer interaction and beyond.

References:

  1. MotionBERT paper (link)
  2. VideoPose3D paper (link)

