I teach machines to read documents — to look at a page and understand it the way a literate person does, as pixels and layout and language together, rather than any of those alone.
My work sits at the intersection of multimodal machine learning, computer vision, and their applications in healthcare. I think mostly about three things: robustness, generalization, and explainability.
“How do we build models that combine heterogeneous signals — text, images, audio, layout — without collapsing into any one of them?”
Selected Work
- 2025 · Interspeech · Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos? (A. Kommineni, T. Feng, D. Bose, S. Narayanan)
- 2024 · ICMI · Can Text-to-image Models Assist Multi-modal Learning with Visual Modality Missing? (T. Feng, D. Yang, D. Bose, S. Narayanan)
- 2023 · ACM MM · MM-AU: Towards Multimodal Understanding of Advertisement Videos (D. Bose, R. Hebbar, T. Feng, K. Somandepalli, A. Xu, S. Narayanan)
- 2023 · WACV · MovieCLIP: Visual Scene Recognition in Movies (D. Bose, R. Hebbar, K. Somandepalli, et al.)
Recent Dispatches
- 2025 · Jun 13 · Award: Recognized as an outstanding reviewer at CVPR 2025.
- 2025 · Jan 27 · New role: Joined Adobe Research, Bengaluru as a Research Scientist.
- 2024 · Dec 18 · Thesis: Defended my Ph.D. thesis, "Multimodal Perception Guided Computational Media Understanding."
The page you are reading is set in EB Garamond and Cormorant Garamond, with Inter for metadata. The writing is mine; errors are mine too. Corrections and correspondence are welcome at dbose [at] usc [dot] edu.
— D. B.