Multimodal Video Understanding
Aligning spoken, visual, and on-screen textual signals from presentation-style videos for chaptering and summarization.
Applied Scientist / ML Engineer Track
I am a PhD candidate in Computer Science at the University of Houston. My work focuses on computer vision, multimodal AI, and vision-language systems, with applications in educational video understanding, multimodal summarization, object detection, and robust evaluation pipelines.
About
My research sits at the intersection of computer vision, multimodal AI, and machine learning systems. I design pipelines that combine ASR transcripts, OCR text, slide visuals, and detected visual objects to improve navigation and review for long educational videos.
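One way to picture the alignment step these pipelines perform is grouping time-stamped ASR and OCR segments under the slide interval they overlap most. The sketch below is a minimal, stdlib-only illustration of that idea; the `Segment` structure and `align_to_slides` helper are hypothetical, not the actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A time-stamped unit from one modality (e.g. ASR or OCR)."""
    start: float   # seconds
    end: float
    modality: str  # "asr" | "ocr"
    content: str

def align_to_slides(slide_spans, segments):
    """Group segments under the slide span they overlap most (in seconds)."""
    aligned = {i: [] for i in range(len(slide_spans))}
    for seg in segments:
        overlaps = [
            (min(seg.end, s_end) - max(seg.start, s_start), i)
            for i, (s_start, s_end) in enumerate(slide_spans)
        ]
        best_overlap, best_idx = max(overlaps)
        if best_overlap > 0:
            aligned[best_idx].append(seg)
    return aligned

slides = [(0.0, 60.0), (60.0, 150.0)]  # (start, end) of each slide on screen
segs = [
    Segment(5.0, 20.0, "asr", "Welcome to the lecture..."),
    Segment(70.0, 90.0, "ocr", "Gradient Descent"),
]
grouped = align_to_slides(slides, segs)
# grouped[0] holds the ASR segment; grouped[1] holds the OCR segment
```

Anchoring everything to slide spans gives each chapter a single visual reference point that the transcript and on-screen text can be attached to.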
Beyond model development, I care about modular pipeline design, experiment reproducibility, evaluation rigor, and translating research ideas into systems that are easier to use, benchmark, and extend.
Focus Areas
Designing chaptering and summarization systems that align spoken, visual, and on-screen textual signals from presentation-style videos.
Building automated workflows to measure faithfulness, relevance, coherence, and ranking quality for generative multimodal systems.
Creating modular, configuration-driven experimentation frameworks for training, inference, debugging, and iterative model development.
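The configuration-driven workflow in the last focus area can be sketched in miniature: every run derives its settings from a typed base config plus explicit overrides, so the exact configuration can be logged alongside results. This is an illustrative stdlib sketch (in practice a tool like Hydra fills this role); `ExperimentConfig` and its defaults are invented for the example.

```python
import json
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class ExperimentConfig:
    model_name: str = "detector-base"  # illustrative defaults
    lr: float = 1e-4
    batch_size: int = 16
    max_epochs: int = 20

def with_overrides(base: ExperimentConfig, **overrides) -> ExperimentConfig:
    """Derive a new run config from a base config plus CLI/sweep overrides."""
    return replace(base, **overrides)

base = ExperimentConfig()
run = with_overrides(base, lr=3e-4, batch_size=32)
print(json.dumps(asdict(run), indent=2))  # log the exact config with the run
```

Because the config is frozen and overrides produce a new object, two runs can never silently share mutated state, which helps reproducibility.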
Featured Work
End-to-end multimodal pipeline for educational video chaptering and summary generation using ASR, OCR, slide visuals, and supporting visual evidence.
Slide-anchored segmentation workflow for dividing long educational videos into topically coherent chapters with validation and retry logic.
Reusable training and evaluation framework for object detection experiments with configuration-driven workflows and rapid iteration.
Pipeline for collecting, structuring, and analyzing listing data with geospatial filtering and multimodal recommendation workflows.
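The validation-and-retry logic mentioned for the segmentation workflow follows a simple pattern: propose chapters, check structural constraints, and re-prompt on failure. The sketch below shows that pattern with hypothetical stand-ins; `propose_chapters` simulates a model call, and the constraints (ordered, contiguous, full coverage) are one plausible validation rule set, not the exact ones used.

```python
def propose_chapters(transcript, attempt):
    """Stand-in for a model call proposing (start, end, title) chapters."""
    if attempt == 0:
        return [(0.0, 50.0, "Intro"), (40.0, 120.0, "Method")]  # overlap: invalid
    return [(0.0, 50.0, "Intro"), (50.0, 120.0, "Method")]

def validate(chapters, video_end):
    """Chapters must start at 0, end at video_end, and be contiguous."""
    if not chapters or chapters[0][0] != 0.0 or chapters[-1][1] != video_end:
        return False
    return all(prev[1] == cur[0] for prev, cur in zip(chapters, chapters[1:]))

def segment_with_retries(transcript, video_end, max_retries=3):
    for attempt in range(max_retries):
        chapters = propose_chapters(transcript, attempt)
        if validate(chapters, video_end):
            return chapters
    raise RuntimeError("no valid chaptering after retries")

chapters = segment_with_retries("...", video_end=120.0)
# first proposal fails validation; the retry returns contiguous chapters
```

Rejecting malformed proposals before they reach downstream summarization keeps a single bad generation from corrupting the whole pipeline.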
Technical Stack
Python, SQL, C++, MATLAB, R, PHP, JavaScript, HTML
PyTorch, TensorFlow, Hugging Face Transformers, OpenCV, Whisper, DeepEval
Docker, Linux, Git, Hydra, Slurm, MLflow, Weights & Biases, AWS EC2, AWS SageMaker
Selected Research
Research on presentation-style educational videos, chapter-based summarization, and multimodal evidence selection from transcripts, OCR, slides, and detected visual objects.
Work on detecting pedagogically meaningful objects such as charts, diagrams, tables, and text regions to complement generated summaries.
Contact
I am interested in Applied Scientist and Machine Learning Engineer opportunities in computer vision, multimodal AI, and video understanding.