Ax Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu 8d ago

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

HiL-Bench evaluates whether coding agents know when to request help with incomplete specifications, exposing judgment gaps in frontier models.

Ax Charlie F. Ruan, Yucheng Qin, Akaash R. Parthasarathy, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen 8d ago

WebLLM: A High-Performance In-Browser LLM Inference Engine

WebLLM is an inference engine that runs LLMs at high performance directly in web browsers, enabling on-device deployment without server GPUs.

Ax Yves-Simon Zeulner, Simon Crämer, Sandeep Selvaraj, Roberto Calandra 8d ago

Learning to Play Piano in the Real World

First robotic system using learning-based approaches for real-world piano playing, advancing manipulation capabilities in robotics.

Ax Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim 8d ago

Reliable Evaluation Protocol for Low-Precision Retrieval

Protocol for reliable evaluation of low-precision retrieval systems, addressing the spurious ties and score variability that arise when relevance is scored at reduced numerical precision.