Beyond Frame-Level Matching: A Unified Contrastive-Regularized Framework for Dense Video Correspondence Across Viewpoints

Establishing dense temporal correspondences between videos depicting the same activity from different viewpoints is a fundamental challenge in video understanding, with applications spanning robotic imitation learning, skill assessment, and procedural analysis. Existing methods either rely on frame-level matching that ignores fine-grained temporal structure or employ pairwise alignment losses that fail to enforce global consistency. In this paper, we propose UniCRAF (Unified Contrastive-Regularized Alignment Framework), a novel approach that unifies contrastive temporal embedding learning with inverse dynamics regularization and differentiable soft alignment to achieve dense, viewpoint-invariant video correspondence. UniCRAF introduces three key components: (1) a dual-stream temporal encoder with cross-view attention bridges, (2) a contrastive inverse dynamics module (CIDM) that grounds frame embeddings in action-predictive semantics, and (3) a regularized soft-DTW alignment head that enforces monotonic temporal consistency while permitting local non-linear warping. Extensive experiments on three benchmarks (Penn Action, Pouring, and ProcedureVid) demonstrate that UniCRAF outperforms prior state-of-the-art methods by 3.2–5.7% in Kendall’s tau and 2.8–4.1 percentage points in phase classification accuracy, while maintaining real-time inference speed. Ablation studies confirm the complementary contributions of each proposed module.
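For readers unfamiliar with the soft-DTW recursion that the alignment head builds on, the following is a minimal NumPy sketch of the generic, unregularized formulation. It is not UniCRAF's alignment head: the function names `softmin` and `soft_dtw`, the squared-Euclidean frame cost, and the smoothing parameter `gamma` are assumptions chosen for illustration, and the regularization described in the abstract is omitted.

```python
import numpy as np


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=0.1):
    """Generic soft-DTW value between embedding sequences x (n, d) and y (m, d).

    Illustrative only: squared-Euclidean cost, no extra regularization.
    """
    n, m = len(x), len(y)
    # Pairwise squared distances between every frame of x and every frame of y.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # r[i, j] holds the soft-minimal cost of aligning x[:i] with y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Only monotonic moves are allowed: down, right, or diagonal.
            prev = [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]
            r[i, j] = cost[i - 1, j - 1] + softmin(prev, gamma)
    return r[n, m]


# Toy usage: y is a time-warped copy of x, z is an unrelated sequence.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [0.0], [1.0], [2.0]])
z = np.array([[5.0], [5.0], [5.0]])
warped_score = soft_dtw(x, y, gamma=0.01)
unrelated_score = soft_dtw(x, z, gamma=0.01)
```

Because only down, right, and diagonal moves are permitted, every alignment path is monotonic in time, while the soft minimum keeps the value differentiable with respect to the embeddings. In the toy usage, the warped copy scores near zero and the unrelated sequence scores much higher.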
