Dipayan Biswas

I am a PhD candidate in Computer Science at the University of Houston, advised by Prof. Shishir Shah (now at the University of Oklahoma), Prof. Jaspal Subhlok, and Dr. Pranav Mantini. My dissertation research focuses on multimodal AI for educational video understanding: building systems that integrate speech, slide text, and visual frames with vision-language models to generate structured, chapter-level summaries, deployed on the NSF-supported VideoPoints platform.

My work spans the full pipeline: visual content detection (LogiForm, LVVO benchmark), LLM-based video segmentation, multimodal summarization, and automated evaluation. Together, these improve summary faithfulness by 9.6 points and relevance by 5.2 points over unimodal baselines. I am broadly interested in vision-language models, multimodal reasoning, and building reproducible, deployment-ready ML systems.

Previously, I was an Assistant Professor at Khulna University of Engineering and Technology (KUET), Bangladesh, and a Software Engineer at Samsung R&D. I hold a B.Sc. from KUET, where I graduated as the University Gold Medalist.

Email · LinkedIn · CV · GitHub

News

Apr 2026 · Project · Released Unified VLM Inference Framework, a Python toolkit with a consistent text-image prompting interface across Ollama, vLLM, Transformers, OpenAI, and Gemini. [GitHub]
Mar 2026 · Project · Deployed Multimodal Video Summarization Engine on VideoPoints, integrating speech, slide text, and visual frames to generate visually grounded chapter-level summaries.
Feb 2026 · Certificate · Completed Generative AI with Large Language Models (DeepLearning.AI & AWS), Coursera. [Verify]
Jan 2026 · Certificate · Completed Machine Learning in Production (DeepLearning.AI), Coursera. [Verify]
Nov 2025 · Award · Received Audience's Choice at the Ph.D. Research Showcase, University of Houston. [Certificate]
Nov 2025 · Industry · Completed ML Fellowship at Fellowship.AI as Team Lead for a real estate multimodal pipeline project.
Nov 2025 · Certificate · Completed AWS Cloud Practitioner Essentials (Amazon Web Services). [Certificate]
Aug 2025 · Talk · Presented at IEEE MIPR 2025 (San Jose, CA): “Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment”.
Aug 2025 · Project · Deployed LLM-Based Video Temporal Segmentation on VideoPoints for slide-guided lecture video chaptering.
Jul 2025 · Course · Completed Building and Evaluating Advanced RAG (DeepLearning.AI).
Jul 2025 · Service · Concluded 3 years as Organizing Secretary, Graduate Student Association of Bangladesh (GSAB), University of Houston.
Jun 2025 · Dataset · Released LVVO, a new benchmark for visual object detection in lecture videos. [arXiv]
May 2025 · Paper · “Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment” accepted at IEEE MIPR 2025.
Feb 2025 · Project · Released SOTA Object Detection Lab, a modular PyTorch framework for training YOLO, Faster R-CNN, and DETR with Hydra config. [GitHub]
Dec 2024 · Course · Completed Community Computer Vision Course (Hugging Face).
2023 · Course · Completed Docker Tutorial for Beginners (YouTube).
2022 · Service · Joined as Organizing Secretary, Graduate Student Association of Bangladesh (GSAB), University of Houston.

Research Interests

Multimodal AI · Computer Vision · Vision-Language Models · Multimodal Video Summarization · Visual Grounding

Experience

Graduate Research Assistant

University of Houston · Houston, TX

Aug 2021 – Present

Advanced multimodal AI methods for educational video understanding, focusing on multimodal summarization that integrates text descriptions with representative supporting visuals.
Integrated multi-channel information, including ASR transcripts, OCR text, slide visuals, and visual object detection, in multimodal pipelines for chapter-level educational video summarization.
Built automated LLM-based evaluation workflows with DeepEval and ChatGPT/Gemini APIs to assess summaries, improving faithfulness by 9.6 points and relevance by 5.2 points over text-only baselines.
Improved visual content detection in educational videos by optimizing YOLOv11 with transfer learning and semi-supervised auto-labeling, increasing AP50 from 90.75% to 95.32%.
Developed LogiForm, a graph-based computer vision algorithm using color and keypoint similarity to detect meaningful visual objects in lecture videos, improving mAP by 15.8%.

Machine Learning Fellow · Team Lead

Fellowship.AI · Remote

Sep 2025 – Nov 2025

Engineered a real estate data pipeline to perform nearest-neighbor comparative market analysis on a 4.5 GB Redfin dataset, estimating property values across localized markets.
Built a Gemini-based multimodal workflow to deliver structured recommendations for pricing and content optimization by analyzing listing images and descriptions.

Assistant Professor / Lecturer

Khulna University of Engineering and Technology (KUET) · Khulna, Bangladesh

Feb 2017 – Aug 2021

Developed a ResNet50 transfer-learning model for plant disease detection, achieving 99.80% accuracy.
Designed custom CNN architectures for pneumonia detection from chest X-rays and ensemble models for breast cancer prediction, achieving up to 99.28% accuracy.
Taught and mentored undergraduate students and guided their projects.

Software Engineer

Samsung R&D Institute Bangladesh (SRBD) · Dhaka, Bangladesh

Jul 2016 – Feb 2017

Developed panoramic, 360-view, and little-planet image rendering features for the Samsung Gear 360 iPhone app.
Optimized OpenCV and OpenGL image processing to improve visual quality and reduce latency on mobile devices.

Selected Projects

Unified VLM Inference Framework [GitHub] Mar 2026 – Apr 2026

Engineered a lightweight, config-driven framework for unified VLM inference across local and cloud backends, including Ollama, MLX-VLM, vLLM, Hugging Face Transformers, OpenAI, and Gemini. It provides a shared InferenceRequest interface for consistent text-image prompting and multi-step inference workflows.

VLM · Multimodal Inference · LLM · vLLM · Transformers · OpenAI API · Gemini API · Python

Multimodal Video Summarization Engine [Platform] [Demo] Dec 2025 – Mar 2026

Developed an end-to-end multimodal summarization system for the NSF-supported VideoPoints platform to improve educational video navigation and review. Integrated ASR transcripts, OCR text, and visual frames with vision-language models and structured in-context learning to generate chapter-level summaries grounded in slide-level evidence.

Multimodal LLM · In-Context Learning · Whisper · Tesseract OCR · Gemma 3 · Qwen3-VL

LLM-Based Video Temporal Segmentation [Platform] Jun 2025 – Aug 2025

Developed a slide-guided chaptering pipeline that aligns speech transcripts with slide transitions to segment long lecture videos into coherent topical chapters. Incorporated automatic validation and correction to ensure accurate, contiguous boundaries and full video coverage.

LLM · Whisper · Video Segmentation · Temporal Segmentation · Gemini Flash · Python

SOTA Object Detection Lab [GitHub] Dec 2024 – Feb 2025

A modular PyTorch framework for training and evaluating state-of-the-art object detectors — YOLO, Faster R-CNN, and DETR — on custom datasets. Hydra-based configuration enables effortless model switching and hyperparameter tuning, while a unified dataset format and shared train/evaluate/predict interface keep experiments reproducible and easy to extend.

PyTorch · Object Detection · YOLOv11 · Transfer Learning · Hydra

Long-Distance Face Detection [GitHub] Feb 2023 – May 2023

Enhanced RetinaFace for long-distance face detection by introducing a custom IoU-aware multi-task loss and an auxiliary IoU prediction head to improve localization quality on WIDER FACE. The approach improved robustness for detecting small, distant, and partially occluded faces across both ResNet50 and MobileNet0.25 backbones.

PyTorch · Face Detection · IoU-Aware Loss · Multi-Task Learning · RetinaFace · WIDER FACE

Publications [Google Scholar]

Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment

D. Biswas, S. Shah, J. Subhlok

IEEE MIPR 2025

This paper investigates how modern object detectors can be adapted to lecture videos, where visual content is often semantically complex and weakly structured. It shows that YOLO-based transfer learning, combined with semi-supervised dataset enrichment, can substantially improve detection of educational visuals.

[Paper] [arXiv] [Code]

Lecture Video Visual Objects (LVVO) Dataset: A Benchmark for Visual Object Detection

D. Biswas, S. Shah, J. Subhlok

arXiv 2025

This work presents the LVVO dataset, a new benchmark for visual object detection in educational videos, with 4,000 lecture frames drawn from multiple subjects and instructors. The dataset provides high-quality annotations for key visual categories and supports both supervised and semi-supervised research.

[arXiv] [Dataset]

Identification of Visual Objects in Lecture Videos with Color and Keypoints Analysis

D. Biswas, S. Shah, J. Subhlok

IEEE ISM 2023

This paper presents LogiForm, a graph-based approach for identifying meaningful visual objects in lecture video frames, where slide visuals are often diverse, custom-designed, and composed of parts without clear boundaries. By leveraging spatial, color, and local geometric cues, it better groups semantically related content and achieves a 15.8% mAP improvement over prior methods.

[Paper]

Transfer Learning Based Plant Diseases Detection Using ResNet50

I. Z. Mukti, D. Biswas

IEEE EICT 2019

Develops a transfer learning model for plant disease recognition using ResNet50 on a large leaf-image dataset spanning 38 classes, reaching an overall accuracy of 99.80%.

[Paper]

Certifications

AWS & DeepLearning.AI

Generative AI with Large Language Models

Coursera

View Certificate →

DeepLearning.AI

Machine Learning in Production

Coursera

View Certificate →

Amazon Web Services

AWS Cloud Practitioner Essentials

AWS Skill Builder

Certificate Course

DeepLearning.AI

Deep Learning Specialization

Coursera — 4 Courses

Neural Networks & Deep Learning · Improving Deep Neural Networks · Structuring ML Projects · Convolutional Neural Networks

Self-Study Courses

Community Computer Vision Course · Hugging Face

Building and Evaluating Advanced RAG · DeepLearning.AI

Docker Tutorial for Beginners · YouTube

Honors & Awards

Ph.D. Research Showcase Award (Audience's Choice) — University of Houston, 2025 [Certificate]
Cullen Graduate Student Fellowship — Graduate School, University of Houston, 2021–2026
University Gold Medalist — Khulna University of Engineering & Technology (KUET), 2018 [Certificate]

Service & Leadership

Team Lead, Real Estate Listing Optimization Project — Fellowship.AI, Remote

Sep 2025 – Nov 2025

Organizing Secretary, Graduate Student Association of Bangladesh, University of Houston, USA

Aug 2022 – Jul 2025

Academic Advisor & Project Mentor, Khulna University of Engineering and Technology, Bangladesh

Feb 2017 – Aug 2021

Workshop Instructor, Programming & Logic Circuit Designing — Electronics Club (MEC), KUET, Bangladesh

2020

Education

University of Houston · Houston, TX, USA

Ph.D. in Computer Science | GPA: 3.88

Aug 2021 – Jul 2026 (Expected)

Dissertation: Synthesizing Text and Visuals: A Multimodal Approach to Summarizing Presentation-Style Educational Videos

Khulna University of Engineering and Technology (KUET) · Khulna, Bangladesh

B.Sc. in Electronics and Communication Engineering | GPA: 3.88

Apr 2012 – May 2016

🏆 University Gold Medalist · Dean's Award · 1st in Merit Position