Dipayan Biswas

I am a PhD candidate in Computer Science at the University of Houston, advised by Prof. Shishir Shah (now at the University of Oklahoma), Prof. Jaspal Subhlok, and Dr. Pranav Mantini. My dissertation research focuses on multimodal AI for educational video understanding: building systems that integrate speech, slide text, and visual frames with vision-language models to generate structured, chapter-level summaries, deployed on the NSF-supported VideoPoints platform.

My work spans the full pipeline: visual content detection (LogiForm, LVVO benchmark), LLM-based video segmentation, multimodal summarization, and automated evaluation — improving summary faithfulness by 9.6 pts and relevance by 5.2 pts over unimodal baselines. I am broadly interested in vision-language models, multimodal reasoning, and building reproducible, deployment-ready ML systems.

Previously, I was an Assistant Professor at Khulna University of Engineering and Technology (KUET), Bangladesh, and a Software Engineer at Samsung R&D. I hold a B.Sc. from KUET, where I graduated as University Gold Medalist.

Research Interests

Multimodal AI · Computer Vision · Vision-Language Models · Multimodal Video Summarization · Visual Grounding

Experience

Graduate Research Assistant
University of Houston · Houston, TX
Aug 2021 – Present
  • Advanced multimodal AI methods for educational video understanding, focusing on multimodal summarization that integrates text descriptions with representative supporting visuals.
  • Integrated multi-channel information, including ASR transcripts, OCR text, slide visuals, and visual object detection, into multimodal pipelines for chapter-level educational video summarization.
  • Built automated LLM-based evaluation workflows with DeepEval and ChatGPT/Gemini APIs to assess summaries, improving faithfulness by 9.6 pts and relevance by 5.2 pts over text-only baselines.
  • Improved visual content detection in educational videos by optimizing YOLOv11 with transfer learning and semi-supervised auto-labeling, increasing AP50 from 90.75% to 95.32%.
  • Developed LogiForm, a graph-based computer vision algorithm using color and keypoint similarity to detect meaningful visual objects in lecture videos, improving mAP by 15.8%.
Machine Learning Fellow · Team Lead
Fellowship.AI · Remote
Sep 2025 – Nov 2025
  • Engineered a real estate data pipeline to perform nearest-neighbor comparative market analysis on a 4.5 GB Redfin dataset, estimating property values across localized markets.
  • Built a Gemini-based multimodal workflow to deliver structured recommendations for pricing and content optimization by analyzing listing images and descriptions.
Assistant Professor / Lecturer
Khulna University of Engineering and Technology (KUET) · Khulna, Bangladesh
Feb 2017 – Aug 2021
  • Developed a ResNet50 transfer-learning model for plant disease detection, achieving 99.80% accuracy.
  • Designed custom CNN architectures for pneumonia detection from chest X-rays and ensemble models for breast cancer prediction, achieving up to 99.28% accuracy.
  • Taught and mentored undergraduate students, guiding projects and fostering student learning and engagement.
Software Engineer
Samsung R&D Institute Bangladesh (SRBD) · Dhaka, Bangladesh
Jul 2016 – Feb 2017
  • Developed panoramic, 360-view, and little-planet image rendering features for the Samsung Gear 360 iPhone app.
  • Optimized OpenCV and OpenGL image processing to improve visual quality and reduce latency on mobile devices.

Selected Projects

Unified VLM Inference Framework  [GitHub] Mar 2026 – Apr 2026

Engineered a lightweight, config-driven framework for unified VLM inference across local and cloud backends, including Ollama, MLX-VLM, vLLM, HuggingFace Transformers, OpenAI, and Gemini. It provides a shared InferenceRequest interface for consistent text-image prompting and multi-step inference workflows.

VLM · Multimodal Inference · LLM · vLLM · Transformers · OpenAI API · Gemini API · Python
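
To illustrate what a shared request interface across backends can look like, here is a minimal sketch. Only the name InferenceRequest comes from the project description above. Its fields, the Backend protocol, EchoBackend, build_backend, and run_steps are hypothetical stand-ins chosen for this example, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class InferenceRequest:
    """Backend-agnostic request: one text prompt plus optional image paths.

    The fields here are assumptions made for illustration.
    """
    prompt: str
    image_paths: tuple[str, ...] = ()
    max_tokens: int = 256


class Backend(Protocol):
    """Anything that can turn an InferenceRequest into text."""
    def generate(self, request: InferenceRequest) -> str: ...


class EchoBackend:
    """Hypothetical stand-in; a real backend would call Ollama, vLLM, etc."""
    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, request: InferenceRequest) -> str:
        return f"[{self.name}] {request.prompt} ({len(request.image_paths)} image(s))"


def build_backend(config: dict) -> Backend:
    """Select a backend from a config mapping; only the stand-in is registered."""
    registry = {"echo": EchoBackend}
    kind = config["backend"]
    if kind not in registry:
        raise ValueError(f"unknown backend: {kind}")
    return registry[kind](config.get("name", kind))


def run_steps(backend: Backend, prompts: list[str],
              image_paths: tuple[str, ...] = ()) -> list[str]:
    """Multi-step workflow: each prompt may reference the previous output as {prev}."""
    outputs: list[str] = []
    prev = ""
    for template in prompts:
        request = InferenceRequest(prompt=template.format(prev=prev),
                                   image_paths=image_paths)
        prev = backend.generate(request)
        outputs.append(prev)
    return outputs
```

Because every backend receives the same request object, switching from a local to a cloud model in this sketch only changes the config passed to build_backend, not the workflow code.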
Multimodal Video Summarization Engine  [Platform] [Demo] Dec 2025 – Mar 2026

Developed an end-to-end multimodal summarization system for the NSF-supported VideoPoints platform to improve educational video navigation and review. Integrated ASR transcripts, OCR text, and visual frames with vision-language models and structured in-context learning to generate chapter-level summaries grounded in slide-level evidence.

Multimodal LLM · In-Context Learning · Whisper · Tesseract OCR · Gemma 3 · Qwen3-VL
LLM-Based Video Temporal Segmentation  [Platform] Jun 2025 – Aug 2025

Developed a slide-guided chaptering pipeline that aligns speech transcripts with slide transitions to segment long lecture videos into coherent topical chapters. Incorporated automatic validation and correction to ensure accurate, contiguous boundaries and full video coverage.

LLM · Whisper · Video Segmentation · Temporal Segmentation · Gemini Flash · Python
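
The validation-and-correction step mentioned above can be pictured with a small sketch. It assumes chapters arrive as (start, end) pairs in seconds and uses one simple repair policy: keep each chapter's start and stretch its end to the next start. Both function names and the policy are assumptions for illustration, not the pipeline's actual logic.

```python
def is_valid(chapters, video_end):
    """True if chapters are non-empty, contiguous, and cover [0, video_end]."""
    if not chapters or chapters[0][0] != 0.0 or chapters[-1][1] != video_end:
        return False
    # Every chapter must have positive length.
    if any(end <= start for start, end in chapters):
        return False
    # Each chapter must begin exactly where the previous one ended.
    return all(prev[1] == cur[0] for prev, cur in zip(chapters, chapters[1:]))


def repair_chapters(chapters, video_end):
    """Return contiguous chapters covering the whole video.

    Keeps each usable chapter start, drops empty or inverted chapters,
    and closes gaps and overlaps by ending each chapter at the next start.
    """
    # Clamp starts into the video and deduplicate them.
    starts = sorted({min(max(start, 0.0), video_end)
                     for start, end in chapters if end > start})
    # A chapter starting at the very end would have zero length.
    starts = [s for s in starts if s < video_end]
    if not starts:
        return [(0.0, video_end)]
    starts[0] = 0.0  # the first chapter always begins at the start of the video
    ends = starts[1:] + [video_end]
    return list(zip(starts, ends))
```

For example, boundaries with a gap at the beginning, a gap in the middle, and an overlap that runs past the end of the video are all repaired in a single pass, and the result passes is_valid.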
SOTA Object Detection Lab  [GitHub] Dec 2024 – Feb 2025

A modular PyTorch framework for training and evaluating state-of-the-art object detectors — YOLO, Faster R-CNN, and DETR — on custom datasets. Hydra-based configuration enables effortless model switching and hyperparameter tuning, while a unified dataset format and shared train/evaluate/predict interface keep experiments reproducible and easy to extend.

PyTorch · Object Detection · YOLOv11 · Transfer Learning · Hydra
Long-Distance Face Detection  [GitHub] Feb 2023 – May 2023

Enhanced RetinaFace for long-distance face detection by introducing a custom IoU-aware multi-task loss and an auxiliary IoU prediction head to improve localization quality on WIDER FACE. The approach improved robustness for detecting small, distant, and partially occluded faces across both ResNet50 and MobileNet0.25 backbones.

PyTorch · Face Detection · IoU-Aware Loss · Multi-Task Learning · RetinaFace · WIDER FACE

Publications  [Google Scholar]

D. Biswas, S. Shah, J. Subhlok
IEEE MIPR 2025

This paper investigates how modern object detectors can be adapted to lecture videos, where visual content is often semantically complex and weakly structured. It shows that YOLO-based transfer learning, combined with semi-supervised dataset enrichment, can substantially improve detection of educational visuals.

D. Biswas, S. Shah, J. Subhlok
arXiv 2025

This work presents the LVVO dataset, a new benchmark for visual object detection in educational videos, with 4,000 lecture frames drawn from multiple subjects and instructors. The dataset provides high-quality annotations for key visual categories and supports both supervised and semi-supervised research.

D. Biswas, S. Shah, J. Subhlok
IEEE ISM 2023

This paper presents LogiForm, a graph-based approach for identifying meaningful visual objects in lecture video frames, where slide visuals are often diverse, custom-designed, and composed of parts without clear boundaries. By leveraging spatial, color, and local geometric cues, it better groups semantically related content and achieves a 15.8% mAP improvement over prior methods.

I. Z. Mukti, D. Biswas
IEEE EICT 2019

Develops a transfer learning model for plant disease recognition using ResNet50 on a large leaf-image dataset spanning 38 classes, reaching an overall accuracy of 99.80%.

Certifications

AWS & DeepLearning.AI
Generative AI with Large Language Models
Coursera
DeepLearning.AI
Machine Learning in Production
Coursera
Amazon Web Services
AWS Cloud Practitioner Essentials
AWS Skill Builder

Honors & Awards

  • Ph.D. Research Showcase Award (Audience's Choice) — University of Houston, 2025  [Certificate]
  • Cullen Graduate Student Fellowship — Graduate School, University of Houston, 2021–2026
  • University Gold Medalist — Khulna University of Engineering & Technology (KUET), 2018  [Certificate]

Service & Leadership

Team Lead, Real Estate Listing Optimization Project — Fellowship.AI, Remote
Sep 2025 – Nov 2025
Organizing Secretary, Graduate Student Association of Bangladesh, University of Houston, USA
Aug 2022 – Jul 2025
Academic Advisor & Project Mentor, Khulna University of Engineering and Technology, Bangladesh
Feb 2017 – Aug 2021
Workshop Instructor, Programming & Logic Circuit Designing — Electronics Club (MEC), KUET, Bangladesh
2020

Education

University of Houston  Houston, TX, USA
Ph.D. in Computer Science  |  GPA: 3.88
Aug 2021 – Jul 2026 (Expected)
Dissertation: Synthesizing Text and Visuals: A Multimodal Approach to Summarizing Presentation-Style Educational Videos
Khulna University of Engineering and Technology (KUET)  Khulna, Bangladesh
B.Sc. in Electronics and Communication Engineering  |  GPA: 3.88
Apr 2012 – May 2016
🏆 University Gold Medalist · Dean's Award · 1st in Merit Position