Ax Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, Xiaowen Dong 9d ago

A Mechanistic Analysis of Looped Reasoning Language Models

Examines the internal dynamics and latent-state evolution of looped reasoning language models, compared with standard feedforward models.

Ax Mihir Prabhudesai, Aryan Satpathy, Yangmin Li, Zheyang Qin, Nikash Bhardwaj, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak 9d ago

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Uses reinforcement learning on physics simulators to train models to solve Physics Olympiad problems, addressing the lack of large-scale physics QA datasets for reasoning models.

Ax Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques 9d ago

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Improved benchmark for evaluating AI systems and agents on real-world biology research tasks.

Ax Magda Dubois, Ekin Zorer, Maia Hamin, Joe Skinner, Alexandra Souly, Jerome Wynne, Harry Coppock, Lucas Satos, Sayash Kapoor, Sunischal Dev, Keno Juchems, Kimberly Mai, Timo Flesch, Lennart Luettgau, Charles Teague, Eric Patey, JJ Allaire, Lorenzo Pacchiardi, Jose Hernandez-Orallo, Cozmin Ududec 9d ago

Seven simple steps for log analysis in AI systems

Pipeline and best practices for analyzing AI system logs to understand model behavior, with code examples in the Inspect framework.

Ax Yaniv Leviathan, Dani Valevski, Matan Kalman, Danny Lumen, Eyal Segalis, Eyal Molad, Shlomi Pasternak, Vishnu Natchu, Valerie Nygaard, Srinivasan (Cheenu) Venkatachary, James Manyika, Yossi Matias 9d ago

Generative UI: LLMs are Effective UI Generators

Demonstrates that LLMs can generate user interfaces and content together, given proper prompting and tool integration.

Ax Jash Vira, Ashley Harris 9d ago

Spatial Competence Benchmark

Spatial Competence Benchmark (SCBench) for evaluating large models on spatial reasoning, environment representation, and planning tasks.

Ax Justin Li, Daniel Ding, Asmita Yuki Pritha, Aryana Hou, Xin Wang, Shu Hu 9d ago

Robust Fair Disease Diagnosis in CT Images

Deep learning approach for fair disease diagnosis in chest CT that addresses compound failures arising from class imbalance and demographic underrepresentation.

Ax Kyle Waters, Lucas Nuzzi, Tadhg Looram, Alessandro Tomasiello, Ariel Ghislain Kemogne Kamdoum, Bikun Li, Damien Sileo, Egor Kretov, Francesco Fournier-Facio, Georgios Soloupis, Haile Kassahun, Hew Wolff, Jiaqi Cai, Lianghui Li, Marc Roth, Mohinder Naiya, Naixu Guo, Qicheng Tang, Richard Wheeler, Samuele Sala, Serguei Popov, Steven Dillman, Yuqi Li 9d ago

COMPOSITE-STEM

COMPOSITE-STEM benchmark with 70 expert-written tasks for evaluating AI agents on physics, biology, chemistry, and materials science problems.

Ax Aayush Mishra, Daniel Khashabi, Anqi Liu 9d ago

Steered LLM Activations are Non-Surjective

Research on activation steering in LLMs showing that steered states are non-surjective, with implications for interpretability and safety.

Ax Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos 9d ago

MEMENTO: Teaching LLMs to Manage Their Own Context

MEMENTO teaches LLMs to compress their reasoning into dense summaries, reducing context and compute requirements. Releases the OpenMementos dataset of 228K examples.

HN jeyzolo 9d ago

All in One for AI Chatbot

Commercial tool that aggregates multiple AI model APIs behind a single interface. Generic LLM comparison service.