Ax Md Atik Ahamed, Mihir Parmar, Palash Goyal, Yiwen Song, Long T. Le, Qiang Cheng, Chun-Liang Li, Hamid Palangi, Jinsung Yoon, Tomas Pfister 25d ago

TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

TFRBench: Benchmark for evaluating reasoning capabilities of time-series forecasting systems beyond numerical accuracy metrics.

Ax Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, Murali Krishna Ramanathan 25d ago

CODESTRUCT: Code Agents over Structured Action Spaces

CODESTRUCT: LLM-based code agents using structured AST action spaces instead of text matching for reliable code editing and repository interaction.

Ax Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, Yue Zhao 25d ago

Auditable Agents

Auditable Agents framework establishing definitions of accountability, auditability, and auditing for LLM agents with external effects, addressing post-deployment answerability.

Ax Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai 25d ago

Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

Market-Bench: comprehensive benchmark evaluating LLM capabilities in economically relevant tasks via a configurable multi-agent supply chain model with LLM retailer agents.

Ax Chenghao Li, Jun Liu, Songbo Zhang, Huadong Jian, Hao Ni, Lik-Hang Lee, Sung-Ho Bae, Guoqing Wang, Yang Yang, Chaoning Zhang 25d ago

Experience Transfer for Multimodal LLM Agents in Minecraft Game

Echo memory framework for multimodal LLM agents enabling transfer of reusable knowledge across Minecraft tasks by decomposing experience into five interpretable dimensions.

Ax Florent Capelli, YooJung Choi, Stefan Mengel, Martín Muñoz, Guy Van den Broeck 25d ago

A canonical generalization of OBDD

Introduces Tree Decision Diagrams generalizing OBDD for Boolean function representation with improved succinctness and tractable operations like model counting and conditioning.

Ax Maria Nesterova, Mikhail Kolosov, Anton Andreychuk, Egor Cherepanov, Oleg Bulichev, Alexey Kovalev, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik 25d ago

MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Foundation model enabling a single GPT-based agent to perform across diverse multi-agent reinforcement learning tasks and environments.

Ax Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Ravi Mukkamala, Peter Foytik, Safdar H. Bouk, Abdul Rahman, Xueping Liang, Amin Hass, Tharaka Hewa, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan 25d ago

Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

AI agents for retail supply chain operations, automating demand forecasting, procurement, and inventory replenishment in supermarket chains.

Ax Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang 25d ago

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Introduces Claw-Eval, an end-to-end evaluation suite for autonomous agents addressing trajectory-opaque grading, safety, and interaction modality coverage.