I teach machines to read documents — to look at a page and understand it the way a literate person does, as pixels and layout and language together, rather than any of those alone.
My work sits at the intersection of multimodal machine learning, computer vision, and their applications in healthcare. I think mostly about three things: robustness, generalization, and explainability.
“How do we build models that combine heterogeneous signals — text, images, audio, layout — without collapsing into any one of them?”
Selected Work
- 2025 · Interspeech · Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos? (A. Kommineni, T. Feng, D. Bose, S. Narayanan)
- 2024 · ICMI · Can Text-to-image Models Assist Multi-modal Learning with Visual Modality Missing? (T. Feng, D. Yang, D. Bose, S. Narayanan)
- 2023 · ACM MM · MM-AU: Towards Multimodal Understanding of Advertisement Videos (D. Bose, R. Hebbar, T. Feng, K. Somandepalli, A. Xu, S. Narayanan)
- 2023 · WACV · MovieCLIP: Visual Scene Recognition in Movies (D. Bose, R. Hebbar, K. Somandepalli, et al.)
Recent Dispatches
- 2025 · Jun 13 · Award: Recognized as an outstanding reviewer at CVPR 2025.
- 2025 · Jan 27 · New role: Joined Adobe Research, Bengaluru as a Research Scientist.
- 2024 · Dec 18 · Thesis: Defended my Ph.D. thesis, "Multimodal Perception Guided Computational Media Understanding."
The page you are reading is set in EB Garamond and Cormorant Garamond, with Inter for metadata. The writing is mine; errors are mine too. Corrections and correspondence are welcome at dbose [at] usc [dot] edu.
— D. B.