A static recompiler for original GameBoy ROMs
Static recompiler translating original GameBoy Z80 assembly into portable C code.
News article on AI's role in Iran conflict and military decision-making systems.
Personal blog about tracking coffee habits with an iOS app and building a custom data system.
Emacs extension using Kitty graphics protocol to display images in terminal mode via Claude API.
AutoKernel uses AI agents to autonomously optimize PyTorch models into Triton GPU kernels via iterative testing and refinement.
LLMSec framework for testing and evaluating agentic AI applications with autonomous security testing and attack simulation.
Microsoft patents AI system to automatically complete game sections for players.
PromptVault desktop app for versioning prompts in multi-agent pipelines, logging outputs, and tracking agent configurations locally.
HN discussion on forecasting and managing API costs for LLM-based agent workflows in production.
TADA: Novel text-acoustic tokenization schema for faster, more reliable LLM-based text-to-speech synthesis.
Self-hosted DCF valuation tool using LLM narratives and Damodaran datasets with transparent assumptions.
OWASP analysis of security vulnerabilities specific to AI agents: non-determinism, mixed instruction/data, and API access risks.
Armalo Context Packs: NPM-like package manager for agent knowledge with trust and commerce layers for multi-agent systems.
Brief headline on advertising effectiveness in chatbots.
Discussion of a hypothesis that intelligence is a phase transition at scale, one that requires grounding rather than architecture alone.
Mnemos: Scoped memory system for coding agents with project/workspace/global separation, MCP integration, adaptive retrieval.
Discussion on responsibility and human judgment in shipping AI-assisted code; emphasis on quality over speed.
Case study: 100K-line enterprise aircraft MRO app built in one week with AI, which generated 50-60% of the production code.
HN discussion on using AI agents for infrastructure operations: migrations, deployments, provisioning, and MCP servers.
Discussion on separating AI agent reasoning from execution with cryptographic binding.
IH-Challenge: training dataset and research improving instruction hierarchy, safety steerability, and prompt injection robustness in frontier LLMs.
Pseudo-Code-Flow: Claude-based tool that lets developers write pseudocode and automatically translates it into real code.
MASEval: benchmark extending multi-agent evaluation beyond models to system components, comparing topologies, orchestration logic, and error handling across LLM frameworks.
LDP: AI-native communication protocol for multi-agent LLM systems exposing model identity, reasoning profile, quality calibration, and cost as first-class primitives.
BCAS: controlled measurement study quantifying how search depth, retrieval strategy, and token budget affect accuracy and cost in agentic RAG systems.
Guardian system combining reinforcement learning with LLM-based QA to generate interpretable spatiotemporal risk surfaces for missing-child search planning from unstructured case data.
AgentOS: operating system architecture enabling locally-hosted LLM agents to autonomously operate computing environments, orchestrate workflows, and integrate external tools.
Guardian: multi-LLM pipeline system for missing-person investigations using consensus-driven LLM coordination for intelligent information extraction and search planning.
FABRIC strategy for backward reachability analysis and verification of neural feedback systems (feedback systems controlled by neural networks).
Meissa: open-source multi-modal medical agentic system combining medical image understanding with tool use and multi-agent collaboration, deployable on-premise without frontier models.
MEMO: memory-augmented optimization reducing run-to-run variance in multi-turn multi-agent LLM games by stabilizing prompt policies and improving ranking reliability.
Philosophical analysis of temporal coherence and consciousness evaluation in LLM agents, examining whether agents' self-descriptions match actual decision constraints.
EPOCH: engineering protocol for autonomous agents to perform iterative multi-round optimization of prompts, code, and ML systems in heterogeneous environments.
Sentinel: autonomous AI agent for remote patient monitoring clinical triage using Model Context Protocol and 21 clinical tools, reducing manual review from days to minutes.
Research on stability and chaotic dynamics in multi-LLM committee systems using Lyapunov exponents to measure inter-run sensitivity across policy scenarios.
Deep Tabular Research agentic framework for multi-step reasoning over complex hierarchical tables using closed-loop decision-making.
DataFactory multi-agent framework for table question answering, addressing context constraints, hallucination, and complex reasoning over structured data.
TrustBench framework for real-time action verification in autonomous agents, preventing harmful actions during execution rather than post-hoc evaluation.
Explainable Innovation Engine upgrades RAG with methods-as-nodes, weighted provenance trees, and hierarchical clustering for traceable multi-step synthesis.
Research on logical reasoning as mechanistic pathway to situational awareness in LLMs, examining risks of advanced reasoning capabilities.
EvalAct framework converts implicit retrieval quality assessment into explicit action for improving multi-step reasoning in retrieval-augmented agents.
Macro-financial analysis of rapid AI adoption examining economic distribution mismatch and institutional anchoring to human cognitive scarcity.
PrivPRISM framework detecting discrepancies between Google Play data safety declarations and developer privacy policies using language models.
Data synthesis approach for domain-adapting LLMs to space situational awareness through cognitively layered supervision and engineering specifications.
Social-R1 framework enhancing social reasoning in LLMs for perceiving social cues and inferring mental states in human-AI collaboration.
Logos reasoning engine for molecular design combining machine learning predictions with transparent chemical reasoning and validity guarantees.
Study of verbalized confidence scales in LLMs, showing heavy discretization around round numbers and implications for uncertainty estimation.
Analysis of nonlinear activation steering in LLMs, questioning linear representation hypothesis and demonstrating inconsistent behavior of linear interventions.
Offline reinforcement learning approach using robust policy optimization to handle distribution shift and transition uncertainty.
Open evaluation benchmark for assessing NLP and RAG systems' compliance with EU AI Act regulatory standards.