Research
Open problems I keep returning to — and what they suggest about where AI is headed.
01 overview
My work sits at the intersection of multi-agent systems, multimodal understanding, and controllable generation.
02 three questions
Multi-agent AI systems — where multiple LLM-based agents perceive, reason, and act together — concentrate a peculiar set of difficulties that single-agent setups largely sidestep. Problems of interest include:
- Structuring collaboration. Agents may cooperate, compete, or do both at once (coopetition), under centralized, decentralized, or hierarchical topologies, governed by rule-, role-, or model-based strategies — and no single combination dominates across tasks.
- Governance and collective decision-making. There is still no agreed framework for distributing tasks, assigning roles, and aggregating heterogeneous agent outputs into accurate joint decisions — without collapsing the useful disagreement that motivated using multiple agents in the first place.
- Scalability. Communication overhead, context length, and orchestration complexity grow quickly with agent count, and the field's cleaner demos rarely survive that growth intact.
- Reliability, safety, and accountability. LLM hallucinations, bias, and vulnerabilities do not stay isolated; they propagate through inter-agent dialogue, raising the stakes of misalignment, prompt injection, and unclear responsibility for joint outcomes.
Multimodal understanding — getting a single model to perceive, align, and reason across text, images, video, 3D, and audio — has become the dominant theme of recent vision-language research. Problems of interest include:
- Cross-modal alignment and fusion. Designing interfaces — cross-attention, Q-Formers, projectors, prompt and adapter tuning — that let heterogeneous encoders share a usable representation without erasing the structure each modality contributes.
- Multimodal reasoning. Moving past alignment to composition: treating each modality as a partial projection of a shared world-knowledge manifold and unrolling evidence across modalities before producing an answer.
- Robustness and adversarial safety. Modern VLMs are "semantically strong but spatially fragile" — low-severity geometric and resampling perturbations often degrade performance — while the continuous, high-dimensional nature of images also leaves them open to imperceptible adversarial perturbations that transfer across models and bypass safety mechanisms.
Controllable image and video generation — where a model must obey the user's intent down to the level of pose, layout, identity, or motion — concentrates difficulties that unconditional generation sidesteps. Problems of interest include:
- Multi-signal conditioning. Combining text, sketches, depth, pose, masks, references, and audio into a single coherent generation, when the signals often conflict or under-specify.
- Fidelity and compositionality. Honouring complex specifications without sacrificing visual quality. Coherent long-form composition and generation from short-form videos are particularly challenging.
- Temporal and identity consistency. Keeping subjects, objects, and physics stable across frames or repeated generations, as small per-step errors compound.
- Evaluation and efficiency. Metrics for evaluating the controllability of generated output remain an open problem.
03 collaborations
The work above was shaped by the people I've had the luck to work with. In graduate school: Prof. Shrikanth Narayanan and the SAIL community at USC. Before that: Prof. Subhasis Chaudhuri at IIT Bombay. Industry collaborators at Adobe Research, IBM Research, NVIDIA Maxine, and Google Research — and at USC Keck — have all contributed immensely to my research journey.