I am a PhD candidate in Computer Science at the University of Houston, advised by Prof. Shishir Shah (now at the University of Oklahoma), Prof. Jaspal Subhlok, and Dr. Pranav Mantini. My dissertation research focuses on multimodal AI for educational video understanding: I build systems that integrate speech, slide text, and visual frames with vision-language models to generate structured, chapter-level summaries, deployed on the NSF-supported VideoPoints platform.
My work spans the full pipeline: visual content detection (LogiForm, LVVO benchmark), LLM-based video segmentation, multimodal summarization, and automated evaluation. It improves summary faithfulness by 9.6 points and relevance by 5.2 points over unimodal baselines. I am broadly interested in vision-language models, multimodal reasoning, and building reproducible, deployment-ready ML systems.
Previously, I was an Assistant Professor at Khulna University of Engineering and Technology (KUET), Bangladesh, and a Software Engineer at Samsung R&D. I hold a B.Sc. from KUET, where I graduated as a University Gold Medalist.
Engineered a lightweight, config-driven framework for unified VLM inference across local and cloud backends, including Ollama, MLX-VLM, vLLM, HuggingFace Transformers, OpenAI, and Gemini. It provides a shared InferenceRequest interface for consistent text-image prompting and multi-step inference workflows.
Developed an end-to-end multimodal summarization system for the NSF-supported VideoPoints platform to improve educational video navigation and review. Integrated ASR transcripts, OCR text, and visual frames with vision-language models and structured in-context learning to generate chapter-level summaries grounded in slide-level evidence.
Developed a slide-guided chaptering pipeline that aligns speech transcripts with slide transitions to segment long lecture videos into coherent topical chapters. Incorporated automatic validation and correction to ensure accurate, contiguous boundaries and full video coverage.
A modular PyTorch framework for training and evaluating state-of-the-art object detectors (YOLO, Faster R-CNN, and DETR) on custom datasets. Hydra-based configuration makes it easy to switch models and tune hyperparameters, while a unified dataset format and a shared train/evaluate/predict interface keep experiments reproducible and easy to extend.
Enhanced RetinaFace for long-distance face detection by introducing a custom IoU-aware multi-task loss and an auxiliary IoU prediction head to improve localization quality on WIDER FACE. The approach improved robustness for detecting small, distant, and partially occluded faces across both ResNet50 and MobileNet0.25 backbones.
This paper investigates how modern object detectors can be adapted to lecture videos, where visual content is often semantically complex and weakly structured. It shows that YOLO-based transfer learning, combined with semi-supervised dataset enrichment, can substantially improve detection of educational visuals.
This work presents the LVVO dataset, a new benchmark for visual object detection in educational videos, with 4,000 lecture frames drawn from multiple subjects and instructors. The dataset provides high-quality annotations for key visual categories and supports both supervised and semi-supervised research.
This paper presents LogiForm, a graph-based approach for identifying meaningful visual objects in lecture video frames, where slide visuals are often diverse, custom-designed, and composed of parts without clear boundaries. By leveraging spatial, color, and local geometric cues, it better groups semantically related content and achieves a 15.8% mAP improvement over prior methods.
This paper develops a transfer learning model for plant disease recognition using ResNet50 on a large leaf-image dataset spanning 38 classes, reaching an overall accuracy of 99.80%.