My research spans video understanding, explainable AI, temporal modeling, and geometry-aware generation. A common theme across these projects is building models that are not only effective but also interpretable, structured, or physically consistent.

Overview

  • Research areas: video understanding, explainable AI, temporal modeling, and geometry-aware generation
  • Common thread: interpretable or structured representations for vision models
  • Outputs: publications, open-source code, datasets, and visual demonstrations

Satellite-to-Street-View Video Generation

Contributed to a research series on generating street-view video from satellite imagery and target trajectories. These works explored geometry-aware generation through voxel-based and point-cloud-based intermediate scene representations, improving photorealism and frame-to-frame continuity.

Sat2Vid: Street-view video synthesis from a single satellite image

Sat2Scene: 3D urban scene generation from satellite images with diffusion
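
Both works share one mechanism: target frames are rendered from an explicit 3D scene rather than generated independently per frame. Below is a toy sketch of that idea, assuming a colored point cloud and a plain z-buffered splat; the papers' actual pipelines add learned geometry and neural refinement on top, so this is an illustration of why a shared 3D representation yields frame-to-frame consistency, not their implementation.

```python
import numpy as np

def render_view(points, colors, K, R, t, hw=(256, 256)):
    """Z-buffered pinhole projection of a colored point cloud into one camera."""
    h, w = hw
    cam = (R @ points.T + t[:, None]).T            # world -> camera coordinates
    keep = cam[:, 2] > 1e-6                        # keep points in front of the camera
    cam, cols = cam[keep], colors[keep]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], cam[ok, 2], cols[ok]
    image = np.zeros((h, w, 3), dtype=np.float32)
    depth = np.full((h, w), np.inf, dtype=np.float32)
    for ui, vi, zi, ci in zip(u, v, z, cols):      # simple z-buffer
        if zi < depth[vi, ui]:
            depth[vi, ui], image[vi, ui] = zi, ci
    return image

# One shared point cloud, many cameras along a trajectory: corresponding pixels
# in consecutive frames come from the same 3D points, so appearance is consistent.
rng = np.random.default_rng(0)
points = rng.uniform([0, -2, 0], [10, 2, 10], size=(5000, 3))  # toy geometry
colors = rng.uniform(size=(5000, 3))                           # toy appearance
K = np.array([[200.0, 0.0, 128.0], [0.0, 200.0, 128.0], [0.0, 0.0, 1.0]])
frames = [render_view(points, colors, K, np.eye(3), np.array([-5.0, 0.0, s]))
          for s in np.linspace(-2.0, 2.0, 8)]
```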

Focus

  • Geometry-aware street-view video synthesis from satellite imagery
  • 3D scene representations for temporal consistency
  • Collaborative contribution across method design, preprocessing, and baseline implementation

Model-Agnostic Visual Explanation for Video Understanding

Proposed a model-agnostic method for visually explaining video-understanding networks by identifying the spatiotemporal regions most responsible for a model’s prediction. Extended the work with a more detailed and objective evaluation framework for comparing explanation methods.
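
For intuition on the black-box setting, here is a simplified occlusion probe, not the proposed method: mask one spatiotemporal cube of the clip at a time and record how far the target-class score drops. Cube size and fill value are illustrative placeholders.

```python
import torch

def occlusion_saliency(model, clip, target, cube=(4, 56, 56), fill=0.0):
    """clip: (1, C, T, H, W); returns a (T, H, W) saliency volume.

    Model-agnostic: works with any classifier taking (B, C, T, H, W) clips.
    """
    model.eval()
    _, _, T, H, W = clip.shape
    with torch.no_grad():
        base = torch.softmax(model(clip), dim=1)[0, target].item()
    saliency = torch.zeros(T, H, W)
    dt, dh, dw = cube
    for t0 in range(0, T, dt):
        for y0 in range(0, H, dh):
            for x0 in range(0, W, dw):
                masked = clip.clone()
                masked[:, :, t0:t0 + dt, y0:y0 + dh, x0:x0 + dw] = fill
                with torch.no_grad():
                    score = torch.softmax(model(masked), dim=1)[0, target].item()
                # Score drop = how much the model relied on this cube.
                saliency[t0:t0 + dt, y0:y0 + dh, x0:x0 + dw] = base - score
    return saliency
```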

Figure: explanation examples on clips of basketball, fencing, opening a cupboard, opening a fridge, closing a drawer, and walking with a dog.

Focus

  • Model-agnostic explanation for video models
  • Spatiotemporal saliency and continuity
  • Objective evaluation metrics for explanation quality

Surgical Skill Assessment via Video Semantic Aggregation

Developed an interpretable video model for assessing da Vinci robotic surgery skill. The method clusters visual features into semantic abstractions before temporal modeling, improving both transparency and benchmark accuracy.
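
A minimal sketch of the aggregation idea, assuming learned prototypes with soft assignment (the class name, layer sizes, and number of groups are placeholders; the published architecture differs in detail): spatial features are grouped into a few semantic parts, the per-part descriptors feed the temporal model, and the assignment maps themselves can be inspected, which is what makes the intermediate representation transparent.

```python
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    """Soft-cluster spatial features into semantic groups, then model time."""

    def __init__(self, dim=512, groups=4, hidden=256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(groups, dim))  # one per group
        self.temporal = nn.LSTM(groups * dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                          # skill score

    def forward(self, feats):
        # feats: (B, T, D, H, W) spatial feature maps from a CNN backbone
        B, T, D, H, W = feats.shape
        x = feats.permute(0, 1, 3, 4, 2).reshape(B, T, H * W, D)
        assign = torch.softmax(x @ self.prototypes.t(), dim=-1)   # (B, T, HW, K)
        # Assignment-weighted average feature per semantic group.
        pooled = torch.einsum('btnk,btnd->btkd', assign, x)
        pooled = pooled / assign.sum(dim=2).clamp_min(1e-6).unsqueeze(-1)
        out, _ = self.temporal(pooled.reshape(B, T, -1))
        score = self.head(out[:, -1]).squeeze(-1)                 # (B,)
        maps = assign.reshape(B, T, H, W, -1)     # inspectable assignment maps
        return score, maps

# Example: score, maps = SemanticAggregation()(torch.randn(2, 16, 512, 7, 7))
```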

Figure: frame visualizations and semantic assignment maps for the suturing, knot-tying, and Hei-Chole tasks.

Focus

  • Video-based robotic surgical skill assessment
  • Semantic aggregation of spatiotemporal features
  • Transparent intermediate representations for analysis and supervision

Video Skill Assessment with Attention-Based RNN

Proposed an attention-based recurrent model for video-based skill assessment and built a new dataset for the task. The method outperformed prior approaches and produced per-frame spatial attention maps for interpretability.
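
A hedged sketch of the general recipe (the class name, layer sizes, and pooling are illustrative placeholders, not the published model): attention-weighted spatial pooling per frame, an LSTM over the pooled sequence, and a regression head, with the attention maps exposed for visualization.

```python
import torch
import torch.nn as nn

class AttentionRNNScorer(nn.Module):
    """Spatial attention pooling per frame + LSTM + skill-score regression."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.attn = nn.Conv2d(dim, 1, kernel_size=1)  # per-location attention logit
        self.rnn = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):
        # feats: (B, T, D, H, W) per-frame feature maps from a CNN backbone
        B, T, D, H, W = feats.shape
        x = feats.reshape(B * T, D, H, W)
        a = torch.softmax(self.attn(x).reshape(B * T, 1, H * W), dim=-1)
        pooled = (a * x.reshape(B * T, D, H * W)).sum(-1)  # attention-weighted pool
        out, _ = self.rnn(pooled.reshape(B, T, D))
        score = self.head(out[:, -1]).squeeze(-1)          # (B,) skill scores
        attn_maps = a.reshape(B, T, H, W)                  # per-frame attention
        return score, attn_maps
```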

Figure: attention visualizations contrasting better and worse performances on the suturing and sonic-drawing tasks.

Focus

  • Skill score regression from video
  • Spatial attention for interpretability
  • Dataset construction for a new task
