Ax Yejin Kim, Wilbert Pumacay, Omar Rayyan, Max Argus, Winson Han, Eli VanderBilt, Jordi Salvador, Abhay Deshpande, Rose Hendrix, Snehal Jauhri, Shuo Liu, Nur Muhammad Mahi Shafiullah, Maya Guru, Ainaz Eftekhar, Karen Farley, Donovan Clay, Jiafei Duan, Arjun Guru, Piper Wolters, Alvaro Herrasti, Ying-Chun Lee, Georgia Chalvatzaki, Yuchen Cui, Ali Farhadi, Dieter Fox, Ranjay Krishna 2/20/2026

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

MolmoSpaces is an open ecosystem for robot navigation and manipulation with diverse benchmarks for evaluating generalization in real-world robotic tasks.

Ax Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu 2/20/2026

CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography

CT-Bench is a benchmark dataset with 20K+ lesion annotations from CT studies, built for multimodal lesion understanding and report generation.

Ax Beatrix M. G. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz 2/20/2026

Logit Distance Bounds Representational Similarity

Proves that logit distance bounds representational similarity for discriminative models, including autoregressive language models.

Ax Md. Najib Hasan, Touseef Hasan, Souvika Sarkar 2/20/2026

Are LLMs Ready to Replace Bangla Annotators?

Evaluates LLMs as zero-shot annotators for Bangla hate speech detection, examining reliability and bias in low-resource language settings.

Ax Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Somesh Jha 2/20/2026

Policy Compiler for Secure Agentic Systems

The PCAS system enforces deterministic authorization policies in LLM agents for customer service, workflows, and compliance, without relying on prompts.

Ax Xidong Wang, Shuqi Guo, Yue Shen, Junying Chen, Jian Wang, Jinjie Gu, Ping Zhang, Lei Liu, Benyou Wang 2/20/2026

LiveClin: A Live Clinical Benchmark without Leakage

LiveClin is a live benchmark for clinical LLM evaluation that uses contemporary peer-reviewed cases, updated biannually, to prevent contamination.

Ax Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen 2/20/2026

Multi-Agent Lipschitz Bandits

A communication-free, decentralized multi-agent bandit protocol for Lipschitz-structured action spaces with hard collision constraints.

Ax Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen 2/20/2026

A Unified Framework for Locality in Scalable MARL

A unified framework for exploiting locality in scalable multi-agent RL, with relaxed conditions on the exponential decay property.

Ax Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong 2/20/2026

Fail-Closed Alignment for Large Language Models

Fail-closed alignment is a design principle for robust LLM safety that relies on redundant refusal mechanisms across latent features.