HN steadeepanda 3/26/2026

Show HN: Agent Ruler new update v0.1.9

Agent Ruler v0.1.9 update: a reference monitor with confinement for AI agent workflows, adding a security/safety layer outside the agent's own guardrails.

Ax Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, Philippe Beardsell 3/26/2026

GTO Wizard Benchmark

Public API and evaluation framework for benchmarking poker algorithms against GTO Wizard, a superhuman heads-up no-limit (HUNL) poker agent.

Ax Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa 3/26/2026

LLMs Do Not Grade Essays Like Humans

Evaluation comparing LLM essay scoring with human grading across GPT and Llama models, finding weak agreement in standard settings.

Ax Franck Ndzomga 3/26/2026

Efficient Benchmarking of AI Agents

Study on efficient benchmarking of AI agents showing how task subsets can preserve agent rankings while reducing evaluation costs.

Ax Christopher M. Ackerman, Nina Panickssery 3/26/2026

Mitigating Many-Shot Jailbreaking

Analysis of the many-shot jailbreaking technique, which exploits long context windows; probes its effectiveness and develops mitigation strategies for LLM safety.

Ax Christopher Ackerman 3/26/2026

Evidence for Limited Metacognition in LLMs

Novel methodology quantitatively evaluating metacognitive abilities in LLMs, testing self-awareness without relying on model self-reports.

Ax Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang 3/26/2026

Internal Safety Collapse in Frontier Large Language Models

Identifies Internal Safety Collapse (ISC), a failure mode in which frontier LLMs generate harmful content under certain task conditions; presents the TVD framework to trigger and study ISC.

Ax Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke 3/26/2026

Visuospatial Perspective Taking in Multimodal Language Models

Evaluation of visuospatial perspective-taking abilities in multimodal language models using adapted tasks from human studies (Director Task, Rotating F task).

Ax Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi 3/26/2026

DISCO: Document Intelligence Suite for COmparative Evaluation

DISCO benchmark suite for evaluating OCR pipelines and vision-language models on document parsing and QA across diverse document types, including handwritten and multilingual text.

Ax Samridhi Vaid, Mike Weldon, Jesse Dunn, Sacha Davis, Kevin Lonergan, Henry Li, Jeffrey Franc, Mohamed Abdalla, Daniel C. Baumgart, Jake Hayward, J Ross Mitchell 3/26/2026

Berta: an open-source, modular tool for AI-enabled clinical documentation

Berta: open-source modular platform for AI-enabled clinical documentation with institutional data governance and workflow integration, deployed at Alberta Health Services.

Ax Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi 3/26/2026

Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

Cluster-R1 reframes instruction-following clustering as a generative task, enabling reasoning models to autonomously infer corpus structure while respecting user instructions.

Ax Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik 3/26/2026

Qworld: Question-Specific Evaluation Criteria for LLMs

Qworld framework generates question-specific evaluation criteria for LLMs on open-ended tasks, capturing context-dependent response quality requirements.