Research Papers

My recent research spans AI agents and multi-agent systems, robotics, biomedical AI, LLM evaluation and multimodal learning.

The work covers several research modes: algorithmic and mathematical modeling for biomedical prediction; agent workflow and benchmark questions around reliability and claim strength; and robotics and human-machine interaction problems where signals may be missing, noisy or incomplete.

The common thread is building AI methods that remain useful under imperfect evidence: choosing representations, designing algorithms, handling uncertainty and checking whether the resulting system is robust enough for its setting.

Current Research Themes

Agents and multi-agent workflows

Robotics

Biomedical AI

Agent reliability and observability

Evaluation and benchmark reliability

Multimodal learning under uncertainty

LLM evaluation and benchmark reliability

FrontierAudit: Stability Audits for Cost-Aware LLM Benchmark Claims

Abstract

Cost-aware LLM benchmark papers often present a single frontier winner as if the best policy were exact and stable. We argue that this style of reporting can overstate what the evidence supports when frontier comparisons are close, cost schedules are partly conventional and subgroup or family structure is hidden by pooled summaries. We introduce FrontierAudit, a reporting-oriented audit protocol for calibrating claim strength in cost-aware routing and evidence-acquisition evaluations. FrontierAudit separates exact winner promotion from weaker but still useful conclusions by combining a threshold-based taxonomy, an empirical exact-promotion certificate and conservative follow-on checks for near ties, subgroup stability, transfer and budget semantics. Across audited routing-style and biomedical case studies we find that many apparent exact winners do not support exact headline language once these checks are applied and instead justify weaker claim forms such as cost-profile dependence, partial support or transfer-limited conclusions. At the same time, the audit identifies the narrower conditions under which exact promotion is genuinely supported. The main contribution is not a new router or benchmark release but a reusable protocol for matching benchmark claims to the strength of the available evidence.

Research Contribution

We designed an audit protocol for cost-aware LLM benchmark claims and applied it to routing-style and biomedical case studies. The goal was to test whether the evidence supports the strength of the claim being made.

Hypothesis and Significance

Many benchmark winners are less stable than they look. Changes in cost schedules, subgroups, model families or budget assumptions can make exact headline language too strong.

For LLM evaluation the main contribution is a stricter connection between empirical evidence and the claims made from it. The protocol fits a broader shift toward agent and model evaluation that cares about reliability, robustness and claim strength beyond leaderboard position alone.
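
To make the certificate step concrete, here is a minimal, illustrative sketch in Python (the utility form, the jittered cost schedule and the 0.95 promotion threshold are assumptions for illustration, not the paper's exact procedure): bootstrap the evaluation items, re-derive the cost-aware winner under each resample and each perturbed cost schedule, and promote exact headline language only if the nominal winner survives nearly every replicate.

import random

def cost_aware_utility(acc, cost, lam):
    # Utility used to pick the frontier winner: accuracy traded off against cost.
    return acc - lam * cost

def winner(scores, costs, items, lam):
    # scores[c]: per-item correctness (0/1) for candidate c; costs[c]: its cost.
    best, best_util = None, float("-inf")
    for c in scores:
        acc = sum(scores[c][i] for i in items) / len(items)
        util = cost_aware_utility(acc, costs[c], lam)
        if util > best_util:
            best, best_util = c, util
    return best

def exact_promotion_certificate(scores, costs, lam=1.0, n_boot=1000,
                                lam_jitter=0.25, threshold=0.95, seed=0):
    # Bootstrap the items and jitter the cost schedule; promote an exact-winner
    # claim only if the nominal winner survives nearly every replicate.
    rng = random.Random(seed)
    n = len(next(iter(scores.values())))
    nominal = winner(scores, costs, range(n), lam)
    wins = 0
    for _ in range(n_boot):
        boot_items = [rng.randrange(n) for _ in range(n)]
        lam_b = lam * (1.0 + rng.uniform(-lam_jitter, lam_jitter))
        if winner(scores, costs, boot_items, lam_b) == nominal:
            wins += 1
    stability = wins / n_boot
    claim = "exact winner" if stability >= threshold else "cost-profile dependent"
    return nominal, stability, claim

When stability falls below the threshold, the result is reported as a weaker claim form such as cost-profile dependence rather than promoted as an exact winner.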

Keywords: LLM benchmarks, Cost-aware evaluation, Claim stability, Agent reliability, Routing

AI agents and multi-agent systems

When Usefulness Is Not Identifiability: A Benchmark and Protocol for MAS Interventions under Partial Observability

Abstract

Multi-agent systems are increasingly evaluated through workflow interventions such as verifier placement, but existing papers often summarize these settings with winner-only scores, implicitly treating intervention usefulness as if it guaranteed identifiability from available observables. We show that this assumption fails under partial observability. We introduce an intervention-observability benchmark and reporting protocol, instantiated here through verifier-placement policies but designed for broader workflow interventions and grounded in both controlled workflow generators and trace-backed real-artifact settings. The protocol requires supported slice-pair reporting, ambiguity-aware identifiability metrics and strong-null baseline comparisons together. Across the supported benchmark, ambiguity persists, positive headroom beyond the strong null remains and the nominal global winner does not behave like a stable dominant policy. Winner-only reporting can therefore materially misstate intervention evaluation in partially observed MAS settings, including under supported holdout tests. The resulting benchmark and protocol provide a reusable framework for evaluating workflow interventions under partial observability and for making negative identifiability results scientifically interpretable rather than merely inconclusive.

Research Contribution

We built a protocol for evaluating workflow interventions when the available trace reveals only part of the causal story. The benchmark focuses on verifier-placement policies but is designed to generalize to broader intervention settings.

Hypothesis and Significance

An intervention can be useful even when identifiability from the stored observables remains limited. With incomplete traces, global-winner summaries can materially overstate what the evidence proves.

The contribution is a more disciplined evaluation framework for agent workflows. It surfaces ambiguity alongside the score, so the reader can see where the evidence is strong and where it remains limited. This connects directly to the current need for trace-based evaluation and reliability checks in long-running agent systems.
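
The sketch below illustrates that reporting shape in miniature (the placement names, tie tolerance and strong-null score are hypothetical, not values from the benchmark): instead of returning only the top-scoring verifier placement, it reports the ambiguity set of placements the observables cannot separate, together with headroom over a strong-null baseline.

def ambiguity_report(observed_scores, null_score, tol=0.02):
    # observed_scores: mapping placement -> score estimated from stored traces.
    # null_score: score of a strong-null baseline (e.g. no added verifier).
    nominal = max(observed_scores, key=observed_scores.get)
    top = observed_scores[nominal]
    ambiguity_set = sorted(
        p for p, s in observed_scores.items() if top - s <= tol
    )
    return {
        "nominal_winner": nominal,
        "ambiguity_set": ambiguity_set,      # more than 1 entry: not identifiable
        "headroom_over_null": top - null_score,
        "identifiable": len(ambiguity_set) == 1,
    }

scores = {"verify_after_plan": 0.71, "verify_after_code": 0.70, "verify_at_end": 0.62}
print(ambiguity_report(scores, null_score=0.66))

Here the positive headroom over the null survives, but the two leading placements are reported as ambiguous rather than collapsed into a single winner.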

Keywords: Multi-agent systems, Workflow interventions, Partial observability, Trace-based evaluation, Verification

Biomedical AI

Recursive Multimodal Survival Transformer with Chebyshev Logit Modulation for Histology and Genomics

Abstract

Computational pathology increasingly blends whole-slide imaging with genomics to model cancer prognosis under challenging conditions: whole-slide images (WSIs) yield thousands of visual tokens; genomic features are high-dimensional and heterogeneous; cohorts are small and often signal-sparse. We present a transformer-based multimodal survival model with two novelties: a stable, parameter-efficient way to fuse very long histology sequences with genomic vectors, and attention with improved focus on clinically relevant cues via logit modulation. Specifically, our design leverages cross-modal recursive layers that tie attention blocks across depths and modalities to form a continuous fusion path, and ChebMax, a low-order Chebyshev transform applied to attention logits that improves attention concentration. Evaluations on five major TCGA datasets show improved performance against leading multimodal approaches. Ablations support the hypothesis that these components provide additional gains on top of a strong overall architecture. Qualitative analyses suggest that WSI attention emphasizes tumor-relevant regions and that genomic attributions align with known pathways, supporting interpretability. Together, these choices provide an effective approach for multimodal survival prediction that balances accuracy, efficiency and interpretability.

Research Contribution

We designed a multimodal survival modeling architecture for histology and genomics. We evaluated it across TCGA cohorts and studied whether recursive fusion and logit modulation improve performance over existing multimodal approaches.

Hypothesis and Significance

Biomedical multimodal models need to combine rich visual evidence and high-dimensional genomic evidence while staying compact and stable enough for small patient cohorts.

This matters because multimodal biomedical prediction has to balance rich evidence with scarce reliable samples. The architecture is designed around that constraint and treats parameter efficiency as part of the modeling problem. It also sits within the broader movement toward pathology foundation models, histology-genomics alignment and clinically meaningful validation.
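
The paper's exact ChebMax formulation is not reproduced here; the PyTorch sketch below only illustrates the general idea of a low-order Chebyshev transform on attention logits, with the tanh squashing, polynomial order and learned coefficients all being illustrative assumptions.

import torch
import torch.nn as nn

class ChebyshevLogitModulation(nn.Module):
    # Illustrative: map logits into [-1, 1], expand in Chebyshev polynomials
    # T_0..T_K, and take a learned combination before the softmax.
    def __init__(self, order=3):
        super().__init__()
        self.order = order
        self.coeffs = nn.Parameter(torch.zeros(order + 1))
        with torch.no_grad():
            self.coeffs[1] = 1.0  # start close to the identity map

    def forward(self, logits):
        x = torch.tanh(logits)  # squash into the Chebyshev domain [-1, 1]
        t_prev, t_curr = torch.ones_like(x), x
        out = self.coeffs[0] * t_prev + self.coeffs[1] * t_curr
        for k in range(2, self.order + 1):
            t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev  # T_{k} recurrence
            out = out + self.coeffs[k] * t_curr
        return out

mod = ChebyshevLogitModulation(order=3)
attn = torch.softmax(mod(torch.randn(2, 8, 16, 16)), dim=-1)

The three-term recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x) keeps the transform cheap and low-order, which is one way a logit modulation can reshape attention concentration with very few extra parameters.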

Keywords: Computational pathology, Survival modeling, Histology, Genomics, Multi-omics

Robotics and human-machine interaction

Behaviour4All: in-the-wild Facial Behaviour Analysis Toolkit

Abstract

In this paper we introduce Behavior4All, a comprehensive open-source toolkit for in-the-wild facial behavior analysis. It integrates Face Localization, Valence-Arousal Estimation, Basic Expression Recognition and Action Unit Detection within a single framework. Available in both CPU-only and GPU-accelerated versions, Behavior4All leverages 12 large-scale in-the-wild datasets consisting of over 5 million images from diverse demographic groups. It introduces a novel framework that leverages distribution matching and label co-annotation to address tasks with non-overlapping annotations and encode prior knowledge of their relatedness. In the largest study of its kind, Behavior4All outperforms both state-of-the-art methods and toolkits in overall performance as well as fairness across all databases and tasks. It also demonstrates superior generalizability on unseen databases and on compound expression recognition. Finally, Behavior4All is substantially faster than other toolkits.

Research Contribution

We contributed to a unified toolkit that brings multiple facial behavior tasks into one framework, with CPU-only and GPU-accelerated versions and evaluation across diverse in-the-wild datasets.

Hypothesis and Significance

Related facial behavior tasks can benefit from shared structure when the framework handles missing and non-overlapping annotations as part of a unified learning problem.

The broader value is practical infrastructure for robotics and human-machine interaction. Stronger facial behavior tools can support social AI, embodied interaction and multimodal systems that need to interpret human reactions robustly.
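
One ingredient of such a framework is training across datasets whose annotations do not overlap. The PyTorch sketch below shows a generic masked multi-task loss (the -1/NaN missing-label convention and the three task heads are assumptions for illustration, not the toolkit's implementation): each task's loss is computed only over samples that actually carry that task's labels, so differently annotated datasets can share one batch.

import torch
import torch.nn.functional as F

def masked_multitask_loss(expr_logits, au_logits, va_pred,
                          expr_labels, au_labels, va_labels):
    # Labels use -1 (or NaN for valence-arousal) where a dataset provides no
    # annotation; each task's loss is averaged only over its labeled samples.
    total = expr_logits.new_zeros(())
    expr_mask = expr_labels >= 0
    if expr_mask.any():  # basic expression recognition
        total = total + F.cross_entropy(expr_logits[expr_mask],
                                        expr_labels[expr_mask])
    au_mask = (au_labels >= 0).all(dim=1)
    if au_mask.any():    # multi-label action unit detection
        total = total + F.binary_cross_entropy_with_logits(
            au_logits[au_mask], au_labels[au_mask].float())
    va_mask = ~torch.isnan(va_labels).any(dim=1)
    if va_mask.any():    # valence-arousal regression
        total = total + F.mse_loss(va_pred[va_mask], va_labels[va_mask])
    return total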

Keywords: Facial behavior, Robotics, Human-machine interaction, Open-source toolkit

Robotics and human-machine interaction

Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation

Abstract

The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner. Current methodologies typically assume continuous availability of speech and facial modality data and neglect real-world scenarios where these data may be intermittently unavailable, which often results in model failures. Furthermore, despite utilising advanced deep learning models to extract information from the speaker's multimodal inputs, these models fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, preserving their advantages while significantly enhancing performance and robustness in scenarios with missing modalities. Our framework ensures resilience against missing modality data through the Compensatory Modality Alignment (CMA) module and generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the entire encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2% compared to the original model structure. When speech modality data is missing, the performance of appropriate generation improves, and when facial data is missing it exhibits only minimal degradation.

Research Contribution

This work addresses robust listener-reaction generation under missing-modality conditions, with an emphasis on maintaining appropriate behavior when audio or facial information is unavailable.

Hypothesis and Significance

Reaction-generation systems need to handle imperfect multimodal input. Emotion-aware compensation can help preserve contextual appropriateness when one modality is missing or unreliable.

The broader value is more realistic multimodal generation for interactive systems. In robotics, virtual agents and embodied human-machine interaction, graceful behavior under missing information matters as much as realism under ideal inputs.
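
A minimal sketch of the compensation idea in PyTorch (the dimensions, the alignment loss and the module itself are illustrative assumptions, not the paper's CMA module): learn a mapping from the available modality into the missing modality's feature space, supervise it when both streams are present and substitute its output when one drops out.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityCompensator(nn.Module):
    # Predicts speech features from facial features so a downstream reaction
    # generator can keep running when the speech stream is missing.
    def __init__(self, face_dim=256, speech_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, 256), nn.ReLU(), nn.Linear(256, speech_dim))

    def forward(self, face_feat):
        return self.proj(face_feat)

comp = ModalityCompensator()
face_feat = torch.randn(4, 256)
speech_feat = torch.randn(4, 128)
# Training (both modalities observed): align compensated and real features.
align_loss = F.mse_loss(comp(face_feat), speech_feat)
# Inference (speech missing): substitute the compensated features.
speech_hat = comp(face_feat)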

Keywords: Facial reaction generation, Emotion-aware AI, Modality compensation, Robotics