Tag
inference
Inference is the production stage where models serve predictions or generate outputs, so latency, throughput, GPU scheduling, memory footprint, and cost all matter. Recent work spans Kubernetes as an AI control plane, quantization, and TensorRT-LLM optimizations.
6 articles

Jalapeño turns OpenAI into a chip designer
OpenAI and Broadcom’s Jalapeño shows how to turn a model company into a custom silicon builder.

OpenAI’s custom chip is the right move against Nvidia
OpenAI’s Broadcom chip is a necessary move that should reduce Nvidia dependence and improve AI economics.

AE-LLM aims to make LLMs more efficient
AE-LLM proposes adaptive efficiency optimization for large language models, but the source material includes no benchmark details.

Nvidia’s MLPerf Gains Show Software Still Matters
Nvidia posted up to 2.77x MLPerf gains on GB300 NVL72, with software tricks like Dynamo and TensorRT-LLM doing the heavy lifting.

Kubernetes Is Becoming AI’s Control Plane
KubeCon Europe 2026 showed Kubernetes moving from app orchestration to AI ops, with inference, GPUs, and open standards leading the shift.

Five AI Infra Frontiers Bessemer Expects for 2026
Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.