Multimodal Video Understanding
Aligning spoken, visual, and on-screen textual signals from presentation-style videos for chaptering and summarization.
Applied Scientist / ML Engineer Track
I am a PhD candidate in Computer Science at the University of Houston. My work focuses on computer vision, multimodal AI, and vision-language systems, with applications in educational video understanding, multimodal summarization, object detection, and robust evaluation pipelines.
About
My research sits at the intersection of computer vision, multimodal AI, and machine learning systems. I design pipelines that combine ASR transcripts, OCR text, slide visuals, and detected visual objects to improve navigation and review for long educational videos.
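One way to picture the alignment step these pipelines perform is grouping time-stamped ASR and OCR segments under the slide interval they overlap most. The sketch below is a minimal, stdlib-only illustration of that idea; the `Segment` structure and `align_to_slides` helper are hypothetical, not the actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A time-stamped unit from one modality (e.g. ASR or OCR)."""
    start: float   # seconds
    end: float
    modality: str  # "asr" | "ocr"
    content: str

def align_to_slides(slide_spans, segments):
    """Group segments under the slide span they overlap most (in seconds)."""
    aligned = {i: [] for i in range(len(slide_spans))}
    for seg in segments:
        overlaps = [
            (min(seg.end, s_end) - max(seg.start, s_start), i)
            for i, (s_start, s_end) in enumerate(slide_spans)
        ]
        best_overlap, best_idx = max(overlaps)
        if best_overlap > 0:
            aligned[best_idx].append(seg)
    return aligned

slides = [(0.0, 60.0), (60.0, 150.0)]  # (start, end) of each slide on screen
segs = [
    Segment(5.0, 20.0, "asr", "Welcome to the lecture..."),
    Segment(70.0, 90.0, "ocr", "Gradient Descent"),
]
grouped = align_to_slides(slides, segs)
# grouped[0] holds the ASR segment; grouped[1] holds the OCR segment
```

Anchoring everything to slide spans gives each chapter a single visual reference point that the transcript and on-screen text can be attached to.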
Beyond model development, I care about modular pipeline design, experiment reproducibility, evaluation rigor, and translating research ideas into systems that are easier to use, benchmark, and extend.
Focus Areas
Designing chaptering and summarization systems that align spoken, visual, and on-screen textual signals from presentation-style videos.
Building automated workflows to measure faithfulness, relevance, coherence, and ranking quality for generative multimodal systems.
Creating modular, configuration-driven experimentation frameworks for training, inference, debugging, and iterative model development.
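The configuration-driven workflow in the last focus area can be sketched in miniature: every run derives its settings from a typed base config plus explicit overrides, so the exact configuration can be logged alongside results. This is an illustrative stdlib sketch (in practice a tool like Hydra fills this role); `ExperimentConfig` and its defaults are invented for the example.

```python
import json
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class ExperimentConfig:
    model_name: str = "detector-base"  # illustrative defaults
    lr: float = 1e-4
    batch_size: int = 16
    max_epochs: int = 20

def with_overrides(base: ExperimentConfig, **overrides) -> ExperimentConfig:
    """Derive a new run config from a base config plus CLI/sweep overrides."""
    return replace(base, **overrides)

base = ExperimentConfig()
run = with_overrides(base, lr=3e-4, batch_size=32)
print(json.dumps(asdict(run), indent=2))  # log the exact config with the run
```

Because the config is frozen and overrides produce a new object, two runs can never silently share mutated state, which helps reproducibility.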
Featured Work
End-to-end multimodal pipeline for educational video chaptering and summary generation using ASR, OCR, slide visuals, and supporting visual evidence.
Slide-anchored segmentation workflow for dividing long educational videos into topically coherent chapters with validation and retry logic.
Reusable training and evaluation framework for object detection experiments with configuration-driven workflows and rapid iteration.
Pipeline for collecting, structuring, and analyzing listing data with geospatial filtering and multimodal recommendation workflows.
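The validation-and-retry logic mentioned for the segmentation workflow follows a simple pattern: propose chapters, check structural constraints, and re-prompt on failure. The sketch below shows that pattern with hypothetical stand-ins; `propose_chapters` simulates a model call, and the constraints (ordered, contiguous, full coverage) are one plausible validation rule set, not the exact ones used.

```python
def propose_chapters(transcript, attempt):
    """Stand-in for a model call proposing (start, end, title) chapters."""
    if attempt == 0:
        return [(0.0, 50.0, "Intro"), (40.0, 120.0, "Method")]  # overlap: invalid
    return [(0.0, 50.0, "Intro"), (50.0, 120.0, "Method")]

def validate(chapters, video_end):
    """Chapters must start at 0, end at video_end, and be contiguous."""
    if not chapters or chapters[0][0] != 0.0 or chapters[-1][1] != video_end:
        return False
    return all(prev[1] == cur[0] for prev, cur in zip(chapters, chapters[1:]))

def segment_with_retries(transcript, video_end, max_retries=3):
    for attempt in range(max_retries):
        chapters = propose_chapters(transcript, attempt)
        if validate(chapters, video_end):
            return chapters
    raise RuntimeError("no valid chaptering after retries")

chapters = segment_with_retries("...", video_end=120.0)
# first proposal fails validation; the retry returns contiguous chapters
```

Rejecting malformed proposals before they reach downstream summarization keeps a single bad generation from corrupting the whole pipeline.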
Technical Stack
Python, SQL, C++, MATLAB, R, PHP, JavaScript, HTML
PyTorch, TensorFlow, Hugging Face Transformers, OpenCV, Whisper, DeepEval
Docker, Linux, Git, Hydra, Slurm, MLflow, Weights & Biases, AWS EC2, AWS SageMaker
Selected Research
Research on presentation-style educational videos, chapter-based summarization, and multimodal evidence selection from transcripts, OCR, slides, and detected visual objects.
Work on detecting pedagogically meaningful objects such as charts, diagrams, tables, and text regions to complement generated summaries.
Contact
I am interested in Applied Scientist and Machine Learning Engineer opportunities in computer vision, multimodal AI, and video understanding.