My research spans video understanding, explainable AI, temporal modeling, and geometry-aware generation. A common theme across these projects is building models that are not only effective but also interpretable, structured, or physically consistent.

Overview

  • Research areas: video understanding, explainable AI, temporal modeling, and geometry-aware generation
  • Common thread: interpretable or structured representations for vision models
  • Outputs: publications, open-source code, datasets, and visual demonstrations

Satellite-to-Street-View Video Generation

Contributed to a research series on generating street-view video from satellite imagery and target trajectories. These works explored geometry-aware generation through voxel-based and point-cloud-based intermediate scene representations, improving photorealism and frame-to-frame continuity.

Sat2Vid: Street-view video synthesis from a single satellite image

Sat2Scene: 3D urban scene generation from satellite images with diffusion
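
Both works share one mechanism: target frames are rendered from an explicit 3D scene rather than generated independently per frame. Below is a toy sketch of that idea, assuming a colored point cloud and a plain z-buffered splat; the papers' actual pipelines add learned geometry and neural refinement on top, so this is an illustration of why a shared 3D representation yields frame-to-frame consistency, not their implementation.

```python
import numpy as np

def render_view(points, colors, K, R, t, hw=(256, 256)):
    """Z-buffered pinhole projection of a colored point cloud into one camera."""
    h, w = hw
    cam = (R @ points.T + t[:, None]).T            # world -> camera coordinates
    keep = cam[:, 2] > 1e-6                        # keep points in front of the camera
    cam, cols = cam[keep], colors[keep]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[ok], v[ok], cam[ok, 2], cols[ok]
    image = np.zeros((h, w, 3), dtype=np.float32)
    depth = np.full((h, w), np.inf, dtype=np.float32)
    for ui, vi, zi, ci in zip(u, v, z, cols):      # simple z-buffer
        if zi < depth[vi, ui]:
            depth[vi, ui], image[vi, ui] = zi, ci
    return image

# One shared point cloud, many cameras along a trajectory: corresponding pixels
# in consecutive frames come from the same 3D points, so appearance is consistent.
rng = np.random.default_rng(0)
points = rng.uniform([0, -2, 0], [10, 2, 10], size=(5000, 3))  # toy geometry
colors = rng.uniform(size=(5000, 3))                           # toy appearance
K = np.array([[200.0, 0.0, 128.0], [0.0, 200.0, 128.0], [0.0, 0.0, 1.0]])
frames = [render_view(points, colors, K, np.eye(3), np.array([-5.0, 0.0, s]))
          for s in np.linspace(-2.0, 2.0, 8)]
```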

Focus

  • Geometry-aware street-view video synthesis from satellite imagery
  • 3D scene representations for temporal consistency
  • Collaborative contribution across method design, preprocessing, and baseline implementation

Model-Agnostic Visual Explanation for Video Understanding

Proposed a model-agnostic method for visually explaining video-understanding networks by identifying the spatiotemporal regions most responsible for a model’s prediction. Extended the work with a more detailed and objective evaluation framework for comparing explanation methods.
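
For intuition on the black-box setting, here is a simplified occlusion probe, not the proposed method: mask one spatiotemporal cube of the clip at a time and record how far the target-class score drops. Cube size and fill value are illustrative placeholders.

```python
import torch

def occlusion_saliency(model, clip, target, cube=(4, 56, 56), fill=0.0):
    """clip: (1, C, T, H, W); returns a (T, H, W) saliency volume.

    Model-agnostic: works with any classifier taking (B, C, T, H, W) clips.
    """
    model.eval()
    _, _, T, H, W = clip.shape
    with torch.no_grad():
        base = torch.softmax(model(clip), dim=1)[0, target].item()
    saliency = torch.zeros(T, H, W)
    dt, dh, dw = cube
    for t0 in range(0, T, dt):
        for y0 in range(0, H, dh):
            for x0 in range(0, W, dw):
                masked = clip.clone()
                masked[:, :, t0:t0 + dt, y0:y0 + dh, x0:x0 + dw] = fill
                with torch.no_grad():
                    score = torch.softmax(model(masked), dim=1)[0, target].item()
                # Score drop = how much the model relied on this cube.
                saliency[t0:t0 + dt, y0:y0 + dh, x0:x0 + dw] = base - score
    return saliency
```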

Figure: explanation examples on clips of basketball, fencing, opening a cupboard, opening a fridge, closing a drawer, and walking with a dog.

Focus

  • Model-agnostic explanation for video models
  • Spatiotemporal saliency and continuity
  • Objective evaluation metrics for explanation quality

Surgical Skill Assessment via Video Semantic Aggregation

Developed an interpretable video model for assessing da Vinci robotic surgery skill. The method clusters visual features into semantic abstractions before temporal modeling, improving both transparency and benchmark accuracy.
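
A minimal sketch of the aggregation idea, assuming learned prototypes with soft assignment (the class name, layer sizes, and number of groups are placeholders; the published architecture differs in detail): spatial features are grouped into a few semantic parts, the per-part descriptors feed the temporal model, and the assignment maps themselves can be inspected, which is what makes the intermediate representation transparent.

```python
import torch
import torch.nn as nn

class SemanticAggregation(nn.Module):
    """Soft-cluster spatial features into semantic groups, then model time."""

    def __init__(self, dim=512, groups=4, hidden=256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(groups, dim))  # one per group
        self.temporal = nn.LSTM(groups * dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                          # skill score

    def forward(self, feats):
        # feats: (B, T, D, H, W) spatial feature maps from a CNN backbone
        B, T, D, H, W = feats.shape
        x = feats.permute(0, 1, 3, 4, 2).reshape(B, T, H * W, D)
        assign = torch.softmax(x @ self.prototypes.t(), dim=-1)   # (B, T, HW, K)
        # Assignment-weighted average feature per semantic group.
        pooled = torch.einsum('btnk,btnd->btkd', assign, x)
        pooled = pooled / assign.sum(dim=2).clamp_min(1e-6).unsqueeze(-1)
        out, _ = self.temporal(pooled.reshape(B, T, -1))
        score = self.head(out[:, -1]).squeeze(-1)                 # (B,)
        maps = assign.reshape(B, T, H, W, -1)     # inspectable assignment maps
        return score, maps

# Example: score, maps = SemanticAggregation()(torch.randn(2, 16, 512, 7, 7))
```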

Figure: frame visualizations and semantic assignment maps for the suturing, knot-tying, and Hei-Chole tasks.

Focus

  • Video-based robotic surgical skill assessment
  • Semantic aggregation of spatiotemporal features
  • Transparent intermediate representations for analysis and supervision

Video Skill Assessment with Attention-Based RNN

Proposed an attention-based recurrent model for video-based skill assessment and built a new dataset for the task. The method outperformed prior approaches and produced per-frame spatial attention maps for interpretability.
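
A hedged sketch of the general recipe (the class name, layer sizes, and pooling are illustrative placeholders, not the published model): attention-weighted spatial pooling per frame, an LSTM over the pooled sequence, and a regression head, with the attention maps exposed for visualization.

```python
import torch
import torch.nn as nn

class AttentionRNNScorer(nn.Module):
    """Spatial attention pooling per frame + LSTM + skill-score regression."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.attn = nn.Conv2d(dim, 1, kernel_size=1)  # per-location attention logit
        self.rnn = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):
        # feats: (B, T, D, H, W) per-frame feature maps from a CNN backbone
        B, T, D, H, W = feats.shape
        x = feats.reshape(B * T, D, H, W)
        a = torch.softmax(self.attn(x).reshape(B * T, 1, H * W), dim=-1)
        pooled = (a * x.reshape(B * T, D, H * W)).sum(-1)  # attention-weighted pool
        out, _ = self.rnn(pooled.reshape(B, T, D))
        score = self.head(out[:, -1]).squeeze(-1)          # (B,) skill scores
        attn_maps = a.reshape(B, T, H, W)                  # per-frame attention
        return score, attn_maps
```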

Figure: attention visualizations contrasting better and worse performances on the suturing and sonic-drawing tasks.

Focus

  • Skill score regression from video
  • Spatial attention for interpretability
  • Dataset construction for a new task
