Isolater - Feed

Ax Chaomeng Lu, Bert Lagaisse 29d ago

From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

Evaluates deep learning and LLM-based vulnerability detection in real-world conditions, revealing gaps between benchmark and production performance for cybersecurity.

Ax Sophie Zhao 29d ago

Probing Spectrum-Like Organization of States of Mind in Transformer Representation Spaces

Investigates spectrum-like organization of mental states in transformer representation spaces using annotated natural language sentences with continuous and ordinal scores.

Ax Hyeonjeong Ha, Jinjin Ge, Bo Feng, Kaixin Ma, Gargi Chakraborty 29d ago

NarrativeTrack: Evaluating Entity-Centric Reasoning for Narrative Understanding

NarrativeTrack benchmark evaluates multimodal LLMs on entity-centric reasoning and temporal understanding in video narratives with dynamic visual contexts.

Ax Masum Hasan, Junjie Zhao, Ehsan Hoque 29d ago

HAL: Inducing Human-likeness in LLMs with Alignment

HAL framework aligns LLMs to conversational human-likeness using interpretable, data-driven methods rather than relying solely on scale or broad supervised training.

Ax Fengyuan Liu, Jay Gala, Nilaksh, Dzmitry Bahdanau, Siva Reddy, Hugo Larochelle 29d ago

BRIDGE: Predicting Human Task Completion Time From Model Performance

BRIDGE maps model benchmark performance to human task completion time via psychometric framework, scaling AI capability evaluation without extensive annotations.

Ax Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Haoxuan Li, Hao Wang, Shijian Wang, Guanting Dong, Jiajie Jin, Yinuo Wang, Yuan Lu, Ji-Rong Wen, Zhicheng Dou, Zhouchen Lin 29d ago

OmniGAIA: Towards Native Omni-Modal AI Agents

OmniGAIA is a benchmark for evaluating omni-modal AI agents with vision, audio, and language capabilities for complex reasoning and tool usage tasks.

Ax Somjit Roy, Pritam Dey, Bani K. Mallick 29d ago

VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees

VaSST introduces a probabilistic framework for symbolic regression using variational inference and soft symbolic trees with uncertainty quantification for scientific discovery.

Ax Drew Prinster, Clara Fannjiang, Ji Won Park, Kyunghyun Cho, Anqi Liu, Suchi Saria, Samuel Stanton 29d ago

Conformal Policy Control

Conformal Policy Control uses safe reference policies to regulate untested agent behaviors, balancing exploration and safety constraints in high-stakes environments.

Ax Yinpeng Wu, Yitong Chen, Lixiang Wang, Jinyu Gu, Zhichao Hua, Yubin Xia 29d ago

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

FlexServe enables privacy-preserving LLM inference on mobile devices using ARM TrustZone for secure model weight and user data protection against OS-level attacks.

Ax Eden Saig, Tamar Garbuz, Ariel D. Procaccia, Inbal Talgam-Cohen, Jamie Tucker-Foltz 29d ago

Adaptive Contracts for Cost-Effective AI Delegation

Adaptive contracts framework for cost-effective AI delegation balancing evaluation noise and costs in pay-for-performance tasks.

Ax Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen 29d ago

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Analysis of Claude Code agentic system architecture with comparison to OpenClaw and Hermes Agent identifying design principles.

Ax Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Lin, Zheng Zheng 29d ago

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

LGMT: logic-grounded metamorphic testing framework for evaluating LLM reasoning robustness using first-order logic.

Ax Minhao Yao, Ruoyu Wang, Xihong Lin, Lin Liu, Zhonghua Liu 29d ago

Gradient-Flow Optimization as Dynamic Random-Effects Inference: Testing and Early Stopping with Applications to Deep Learning

Statistical framework viewing gradient-flow optimization as random-effects inference with applications to early stopping in deep learning.

Ax Hyunjin Cho, Youngji Roh, Jaehyung Kim 29d ago

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Unsupervised feature discovery aligning semantics and mechanisms for auditing internal LLM computations via mechanistic interpretability.

Ax Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo 29d ago

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

GF-DiT: dynamic parallelism scheduler for efficient diffusion transformer serving with heterogeneous workloads.

Ax Shuwen Chai, Qiaosen Wang 29d ago

Sample Complexities of Estimating Gumbel-Max Watermark Proportions with and without Reduction to Pivotal Statistics

Statistical framework for estimating proportion of LLM-generated text in mixed documents using Gumbel-max watermarking.

HN jjmerino 29d ago

Show HN: Dabs spawns dumb agents in boxes for free

Dabs is a local sandbox tool for spawning agents without cloud infrastructure, presented as a free alternative to cloud-based solutions.

HN unliftedq 29d ago

Show HN: Imagent – agentic image/video/speech generation

Imagent: Agent framework providing unified interface for image, video, and speech generation across providers and models with asset organization.

HN millereffect 29d ago

Reducing AI costs with smart pricing

Case study on cost management for AI services using reverse trials and pricing tiers to control expenses.

HN sollawen 29d ago

AI coding is a nightmare. Am I the only one experiencing this?

Developer experience critique of AI coding assistants: issues with duplicate code generation, context window limitations, and preference for adding over modifying code.

HN wslh 29d ago

GPT-5.5-Cyber built a zlib fuzzing lab in a day

OpenAI's GPT-5.5-Cyber used for security fuzzing of open-source projects, in a collaboration to find and patch bugs before they can be exploited.

HN 1vuio0pswjnm7 29d ago

AI is 'not smart' so what's next in artificial intelligence?

Yann LeCun interview on AI limitations beyond current systems. Discussion of physical world understanding gap.

HN sherlockxu 29d ago

Open Source LLM Statistics and Trends (2026)

Statistics report on open-source LLM ecosystem growth in 2026. Covers usage, performance, inference costs, and market trends.

HN mxfeinberg 29d ago

TurboQuant can reduce vector index size by 10x at 100M Row Scale

TurboQuant vector index compression reduces size by 10x at scale. Open-source implementations for KV-cache compression in LLMs.

HN sudo_cowsay 29d ago

What is agentic AI today, and what do we want it to be?

Analysis of agentic AI deployment trends. MIT research on current state and future potential of automated software agents.

HN bmcdresson 29d ago

GLM-5.2: The Open-Source Chinese Model Challenging Claude at One-Fifth the Cost

GLM-5.2: open-source Chinese frontier LLM matching Claude/GPT performance at 1/5 cost under MIT license.

HN vaishcodescape 29d ago

Using AI Agents with Databases

Data-Spear: autonomous SQL agent for PostgreSQL that plans queries, verifies results, and cites sources.

HN matt_d 29d ago

Stealing 50 Years of Database Ideas for AI Agents

OneWill system for safely sandboxing autonomous agents using database-inspired write-ahead logging and state isolation.

HN Anon84 29d ago

Learning to Replicate Expert Judgment in Financial Tasks

Research on using LLMs to replicate expert judgment in financial decision-making tasks.

HN matt_d 29d ago

Empirical Computation: Prompting versus Programming [pdf]

PDF comparing prompting and programming approaches; no summary available.

HN healsdata 29d ago

The $1.3M theft that exposed AI's blind spot

Cybersecurity article about AI blind spots in threat detection.

HN ctrlnode-ai 29d ago

We Ran a Complex Task – A LangChain Repo Analysis with Claude Fable Models

Comparative analysis of Claude Fable models (Opus, Sonnet, Haiku) on complex engineering task using multi-agent approach.

HN tomcupr 29d ago

Sous-Chef, a Claude Code plugin where Fable reviews, Codex implements

Claude Code plugin that splits work between two LLMs: Claude for planning/review, Codex for implementation. Optimizes token spending by delegating tasks by model strength.

HN MediaSquirrel 29d ago

Show HN: Gist Discover – TikTok for ArXiv Summaries

Gist: AI tool summarizing ArXiv papers into layered slide decks with counter-arguments and steelman critiques.

HN ankurchrungoo 29d ago

Show HN: A 155K-param transformer builds a map of a world it's never shown

A 155K-parameter transformer trained only on movement symbols learns an internal map of a world it is never shown. Researchers use linear probes to decode the learned representation and manipulate agent behavior.

HN tiahura 29d ago

Consult-LLM – A second opinion from another model right in your existing agent

Consult-LLM: tool for multi-model agent workflows. Gets second opinions from different LLMs (GPT, Claude, Gemini, etc.) for planning, review, and debugging within existing agents.

HN aisinghal 29d ago

Show HN: Mirrors – test AI agent changes by replaying real production traces

Mirrors: tool for testing AI agent changes by replaying production traces in isolated environments.

HN RiccardoMus 29d ago

Show HN: Meanwhile – turn Claude Code idle time into learning

IDE extension adding learning pane alongside Claude Code execution, delivering curated AI news and engineering concepts via spaced repetition.

HN ShivamNayak11 29d ago

Show HN: Declaw Arena – a CTF-style challenge to break an AI agent in a microVM

CTF-style security challenge to exploit vulnerabilities in AI agents running in microVMs. Practical testing tool.

HN skidrow 29d ago

Toward Better HIP Kernel Generation for AMD GPUs

Research on using language models to generate optimized HIP kernels for AMD GPUs, including synthetic dataset generation and multi-agent optimization pipeline.

HN Bnjoroge 29d ago

Show HN: Run multiple Docker Compose instances for your agents

Docker Compose tool for running multiple agent instances simultaneously without port and naming collisions, managing isolated environments automatically.

HN 0xkaz 29d ago

Show HN: Self-hostable dashboard to govern a team's AI coding spend

Self-hostable dashboard for governing AI coding agent spend across teams. LiteLLM proxy with per-user API keys and usage tracking.

HN abratabia 29d ago

Show HN: Auto Learning Agents, a self-hosted AI agent platform on Elixir/OTP

Open source self-hosted AI agent platform built with Elixir/OTP. Single Docker deployment with bundled runtime, Python services, database, and tooling.

HN themobiusstrip 29d ago

Your Coding Agent Will Always Tell You It's Safe

Analysis of security risks in coding agents that blindly execute commands from repository files, examining trust assumptions in autonomous AI code execution.

HN Gooblebrai 29d ago

Show HN: Neural Fit game – Adjust the network's weights and biases

Interactive game for learning neural networks: adjust weights and biases to match target outputs. Educational tool.

HN nielka 29d ago

Borrowing the Night: Reclaiming Idle Inference GPUs for Research

GPU capacity controller that reallocates inference resources between production and research using queueing theory optimization, maximizing utilization during off-peak hours.

HN robinhouston 29d ago

The Ramanujan Challenge for AI

Benchmark dataset for evaluating AI mathematical reasoning using formulas for mathematical constants. Research evaluation tool.

HN Gabrieliam42 29d ago

AI Agent Qubitz

Local-first AI agent using 7B-35B GGUF models with specialized harness for instruction following and tool orchestration. Open source project.

HN cwwc 29d ago

Zuckerberg says AI agent development going slower than expected

Meta's Mark Zuckerberg reports AI agent development progressing slower than anticipated. Industry development pace update.