Ax Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani 4/3/2026

I Think, Therefore I Am

Evidence that reasoning language models encode tool-calling decisions before generating their chain of thought, with an analysis of when in the pipeline these decisions are made.

Ax Weyl Lu, Chenjie Hao, Yubei Chen 4/3/2026

Deep Networks Favor Simple Data

Analysis of the out-of-distribution (OOD) anomaly in which deep networks assign higher density to simple OOD data than to in-distribution test data.

Ax Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang 4/3/2026

Do Phone-Use Agents Respect Your Privacy?

MyPhoneBench, an evaluation framework measuring privacy compliance of phone-use agents during mobile task completion.

Ax Marawan Gamal Abdel Hameed, Derek Tam, Pascal Jr Tikeng Notsawo, Colin Raffel, Guillaume Rabusseau 4/3/2026

Model Merging via Data-Free Covariance Estimation

A principled layer-wise optimization approach to model merging, using data-free covariance estimation and requiring no task-specific training.

Ax Urs Hackstein, Jordi Alastruey, Philip Aston, Ciaran Bench, Peter H. Charlton, Loic Coquelin, Nando Hegemann, Vaidotas Marozas, Mohammad Moulaeifard, Manasi Nandi, Andrius Petrenas, Oskar Pfeffer, Mantas Rinkevicius, Andrius Solosenko, Nils Strodthoff, Sara Vardanega 4/3/2026

Benchmark Problems and Benchmark Datasets for the evaluation of Machine and Deep Learning methods on Photoplethysmography signals: the D4 report from the QUMPHY project

Benchmark datasets and evaluation protocols for machine learning methods on photoplethysmography medical signals.

Ax Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala 4/3/2026

Test-Time Scaling Makes Overtraining Compute-Optimal

Train-to-test scaling laws that jointly optimize model size, training tokens, and inference samples for compute-optimal LLM deployment.
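The tradeoff the last entry studies can be sketched with the standard compute approximations (training FLOPs ≈ 6·N·D, inference FLOPs ≈ 2·N per generated token, with N parameters and D training tokens). This is a toy illustration of why test-time sampling favors smaller, longer-trained models, not the paper's actual scaling law; all numbers and names below are illustrative assumptions.

```python
# Toy joint train/inference compute accounting.
# Assumes the common approximations: training ~ 6*N*D FLOPs,
# inference ~ 2*N FLOPs per generated token. Illustrative only.

def total_compute(n_params, train_tokens, n_samples, tokens_per_sample, n_queries):
    """Total FLOPs to train once, then serve n_queries,
    each answered with n_samples test-time samples."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * n_samples * tokens_per_sample * n_queries
    return train + inference

# With equal training budgets (6*N*D identical), the smaller "overtrained"
# model is cheaper overall once inference is counted, because inference
# cost scales linearly with parameter count.
small = total_compute(1e9, 200e9, n_samples=16, tokens_per_sample=1024, n_queries=1e6)
large = total_compute(10e9, 20e9, n_samples=16, tokens_per_sample=1024, n_queries=1e6)
print(f"small, overtrained model:  {small:.3e} FLOPs")
print(f"large, undertrained model: {large:.3e} FLOPs")
```

Under this accounting the two configurations spend the same 1.2e21 FLOPs on training, so the gap comes entirely from serving cost, which is 10x higher for the 10B model.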