Temporal and Spatial Attention for 3D Pose Estimation
At its core, 2D-to-3D pose estimation lifts a 2D representation of a person's keypoints (essentially a list of coordinates for different joints) into 3D space. The challenge is that we only have 2D projections of the 3D world, so the third dimension, depth, cannot be inferred directly. The problem gets even harder with complex motions, where joints move in ways that are difficult to represent with 2D information alone. Temporal and spatial attention help bridge this gap.
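To make the depth ambiguity concrete, here is a minimal sketch that assumes an idealized pinhole camera with no lens distortion. The `project` helper, the focal length, and the wrist coordinates are illustrative assumptions, not values from any particular dataset or model. It shows two different 3D joint positions that land on the same 2D keypoint.

```python
def project(point_3d, focal_length=1.0):
    """Pinhole projection of a camera-space point (x, y, z) to image coordinates (u, v).

    Assumes an idealized camera: no distortion, principal point at the origin.
    """
    x, y, z = point_3d
    return (focal_length * x / z, focal_length * y / z)


# Two hypothetical wrist positions on the same camera ray, at different depths.
wrist_near = (0.2, -0.1, 2.0)
wrist_far = (0.4, -0.2, 4.0)  # same direction from the camera, twice as far away

uv_near = project(wrist_near)
uv_far = project(wrist_far)

# Both 3D points map to the same 2D keypoint, so depth is lost in projection.
print("near:", uv_near)
print("far: ", uv_far)
```

Because every point along a camera ray projects to the same pixel, a single 2D keypoint is consistent with many 3D positions. Any lifting method has to draw on additional cues, such as the positions of other joints or neighboring frames, to choose among them.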