Research
Notes on a small set of questions I keep returning to about multimodal learning.
My work sits at the intersection of multimodal machine learning, computer vision, and applications of computer vision in healthcare. These threads don’t always look related on a publication list, but they share a common preoccupation: how to build models that read the world through more than one sense at once, and hold up when the world cooperates less than the benchmark did.
In practice, the work orbits three concerns — each of which I’ve been chasing, in one form or another, since the middle of graduate school. What follows is a short note on each, with a few pointers to the papers that live under them.
Three Questions
Robustness
Multimodal models tend to break in the wild for an unglamorous reason: one of the modalities is missing or corrupted, and the system has no way to route around it. A video file arrives with silent audio. A page scan is blurred beyond OCR. A sensor drops out mid-inference.
Most of my recent work has been about teaching models to degrade gracefully — or, better, to re-route — when the input they expected isn’t there. Generative models have become useful collaborators here: synthetic imagery can stand in for missing visual modalities at both training and test time, improving data efficiency and robustness at once.
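To make that idea concrete, here is a minimal PyTorch sketch of one common recipe, not a description of any specific paper of mine: modality dropout during training, plus a learned placeholder embedding that stands in for an absent stream at inference. The class name, dimensions, and dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DropoutRobustFusion(nn.Module):
    """Toy late-fusion model meant to tolerate a missing modality.

    During training, one modality's features are occasionally replaced
    by a learned placeholder (modality dropout), so the fusion head
    learns not to lean on any single input. At inference, the same
    placeholder stands in when a modality is genuinely absent.
    All names and sizes here are illustrative.
    """

    def __init__(self, vis_dim=512, aud_dim=128, hidden=256,
                 num_classes=10, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.aud_proj = nn.Linear(aud_dim, hidden)
        # Learned stand-ins for an absent modality.
        self.vis_placeholder = nn.Parameter(torch.zeros(hidden))
        self.aud_placeholder = nn.Parameter(torch.zeros(hidden))
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, num_classes))

    def forward(self, vis=None, aud=None):
        batch = (vis if vis is not None else aud).shape[0]
        v = self.vis_proj(vis) if vis is not None else self.vis_placeholder.expand(batch, -1)
        a = self.aud_proj(aud) if aud is not None else self.aud_placeholder.expand(batch, -1)
        if self.training:
            # Modality dropout: hide one stream at random so the head
            # cannot become dependent on it.
            if torch.rand(()) < self.p_drop:
                v = self.vis_placeholder.expand(batch, -1)
            elif torch.rand(()) < self.p_drop:
                a = self.aud_placeholder.expand(batch, -1)
        return self.head(torch.cat([v, a], dim=-1))


# The audio track is missing at inference; the model still produces logits.
model = DropoutRobustFusion().eval()
logits = model(vis=torch.randn(4, 512), aud=None)
```

In a fuller system the placeholder would be produced by a generative model conditioned on whatever modalities survived, rather than learned as a free parameter; the dropout trick is the part that makes the fusion head willing to accept such a substitute.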
Generalization
Most multimodal benchmarks are built from a small, Anglophone-Western slice of the world. Faces, voices, advertising, storytelling, TV and film — the training distribution of a modern multimodal model is almost never the world. I care about systems that extend gracefully to novel tasks, diverse cultures, and long-tailed domains.
A recent study of demographic representation in Indian television, carried out jointly with Google Research and the Geena Davis Institute on Gender in Media, was one way of asking what generalization means when the training distribution is not the world. Advertisement understanding, movie scene recognition, and cross-domain sentiment analysis are other ways I’ve approached the same question.
Explainability
A multimodal model’s reasoning hides inside its fusion layers, where attention and feature mixing make opaque choices we later rationalize as predictions. I’m interested in architectures whose reasoning a human can inspect — attention maps that mean something, feature contributions that can be audited, and evaluations that look past accuracy toward whether we can trust what a system did and why.
This matters most in the healthcare applications I work on, where a model’s output can shape a clinical decision. Whether it’s analyzing facial asymmetry in paralysis patients, estimating physiological arousal during driving, or studying sleep patterns as a biomarker, the model’s reasoning has to survive scrutiny by someone who isn’t a machine-learning researcher.
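One simplified way to picture the inspectable fusion described above is a gate that exposes its own weights. The sketch below, again an illustrative assumption rather than a published model, computes an explicit softmax weight per modality and returns it alongside the prediction, so a clinician or reviewer can see how much each stream contributed to a given decision.

```python
import torch
import torch.nn as nn


class InspectableFusion(nn.Module):
    """Gated fusion whose per-modality weights are a model output.

    Rather than burying the mixing inside an opaque attention block,
    the model scores each modality, normalizes the scores with a
    softmax, and returns those weights with the logits. Illustrative
    names and dimensions throughout.
    """

    def __init__(self, dims=(512, 128, 768), hidden=256, num_classes=2):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.scorer = nn.Linear(hidden, 1)   # one relevance score per modality
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, *modalities):
        # (batch, n_modalities, hidden)
        feats = torch.stack([p(x) for p, x in zip(self.projs, modalities)], dim=1)
        weights = torch.softmax(self.scorer(feats).squeeze(-1), dim=1)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)
        return self.head(fused), weights


model = InspectableFusion()
logits, weights = model(torch.randn(2, 512), torch.randn(2, 128), torch.randn(2, 768))
print(weights)  # per-example share attributed to vision, audio, and text
```

The particular gating scheme matters less than the contract it sets up: the explanation is part of the model’s output, not a visualization reconstructed after the fact.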
Collaborations & Labs
The work above was shaped by the people I’ve had the luck to work with. In graduate school, by Prof. Shrikanth Narayanan and the community at the Signal Analysis and Interpretation Laboratory at USC. Before that, by Prof. Subhasis Chaudhuri at IIT Bombay. Industry collaborators at Adobe Research, IBM Research, NVIDIA Maxine, and Google Research, along with clinicians at USC’s Keck School of Medicine, have all left their mark.
What I’d Like to Hear About
If you’re working on any of the following, I’d be glad to hear from you:
— Multimodal foundation models for document and media understanding.
— Robustness under missing modalities, whether via generation, fusion, or architectural design.
— Applications of computer vision and audio in clinical workflows — especially where interpretability matters.
— Fair and culturally diverse multimodal evaluation.
Reach me at dbose [at] usc [dot] edu.
A longer academic record, including talks, teaching, and service, lives on the CV page. The complete list of papers is on the Publications page.