Tag

inference

Inference is the production stage where models serve predictions or generate outputs, so latency, throughput, GPU scheduling, memory footprint, and cost all matter. Recent coverage spans Kubernetes as an AI control plane, quantization, and TensorRT-LLM optimizations.

6 articles

Industry News/Jun 26

Jalapeño turns OpenAI into a chip designer

OpenAI and Broadcom’s Jalapeño shows how to turn a model company into a custom silicon builder.

Industry News/Jun 25

OpenAI’s custom chip is the right move against Nvidia

OpenAI’s Broadcom chip is a necessary move that should reduce Nvidia dependence and improve AI economics.

Research/May 6

AE-LLM aims to make LLMs more efficient

AE-LLM proposes adaptive efficiency optimization for large language models, though no benchmark details are available.

Research/Apr 3

Nvidia’s MLPerf Gains Show Software Still Matters

Nvidia posted up to 2.77x MLPerf gains on GB300 NVL72, with software such as Dynamo and TensorRT-LLM doing the heavy lifting.

Industry News/Apr 3

Kubernetes Is Becoming AI’s Control Plane

KubeCon Europe 2026 showed Kubernetes moving from app orchestration to AI ops, with inference, GPUs, and open standards leading the shift.

Industry News/Apr 3

Five AI Infra Frontiers Bessemer Expects for 2026

Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.