[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-gke-system-metrics-tpu-hpa-cloud-monitoring-en":3,"article-related-gke-system-metrics-tpu-hpa-cloud-monitoring-en":30,"series-tools-416d35fb-69b8-4d1c-a423-3fe0d54d502d":73},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"416d35fb-69b8-4d1c-a423-3fe0d54d502d","gke-system-metrics-tpu-hpa-cloud-monitoring-en","GKE system metrics expose TPU and HPA data","\u003Cp data-speakable=\"summary\">\u003Ca href=\"\u002Ftag\u002Fgoogle-cloud\">Google Cloud\u003C\u002Fa>’s GKE system metrics add TPU, accelerator, and autoscaling data to Cloud Monitoring.\u003C\u002Fp>\u003Cp>Google Cloud’s \u003Ca href=\"https:\u002F\u002Fcloud.google.com\u002Fmonitoring\" target=\"_blank\" rel=\"noopener\">Cloud Monitoring\u003C\u002Fa> now documents a dense set of \u003Ca href=\"https:\u002F\u002Fcloud.google.com\u002Fkubernetes-engine\" target=\"_blank\" rel=\"noopener\">Google Kubernetes Engine\u003C\u002Fa> system metrics, including TPU partition state, accelerator memory, and HPA recommendation latency. The reference page was last generated on 2026-06-18 17:12:37 UTC, and many of the metrics are sampled every 60 seconds.\u003C\u002Fp>\u003Cp>That matters because the new entries are not just generic cluster health counters. 
They expose TPU-specific state, autoscaler behavior, and container accelerator usage in a format you can query directly from Monitoring or MQL.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>Stage\u003C\u002Fth>\u003Cth>Kind \u002F Type\u003C\u002Fth>\u003Cth>Sample interval\u003C\u002Fth>\u003Cth>Visibility delay\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>accelerator\u002Fpartition\u002Fstate\u003C\u002Ftd>\u003Ctd>BETA\u003C\u002Ftd>\u003Ctd>GAUGE \u002F INT64\u003C\u002Ftd>\u003Ctd>60 seconds\u003C\u002Ftd>\u003Ctd>up to 120 seconds\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>accelerator\u002Fslice\u002Fformation_durations\u003C\u002Ftd>\u003Ctd>BETA\u003C\u002Ftd>\u003Ctd>CUMULATIVE \u002F DISTRIBUTION\u003C\u002Ftd>\u003Ctd>60 seconds\u003C\u002Ftd>\u003Ctd>up to 120 seconds\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>autoscaler\u002Flatencies\u002Fper_hpa_recommendation_scale_latency_seconds\u003C\u002Ftd>\u003Ctd>GA\u003C\u002Ftd>\u003Ctd>GAUGE \u002F DOUBLE\u003C\u002Ftd>\u003Ctd>60 seconds\u003C\u002Ftd>\u003Ctd>up to 20 seconds\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>container\u002Faccelerator\u002Fduty_cycle\u003C\u002Ftd>\u003Ctd>GA\u003C\u002Ftd>\u003Ctd>GAUGE \u002F INT64\u003C\u002Ftd>\u003Ctd>60 seconds\u003C\u002Ftd>\u003Ctd>up to 120 seconds\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>container\u002Fcpu\u002Fcore_usage_time\u003C\u002Ftd>\u003Ctd>GA\u003C\u002Ftd>\u003Ctd>CUMULATIVE \u002F DOUBLE\u003C\u002Ftd>\u003Ctd>60 seconds\u003C\u002Ftd>\u003Ctd>varies by metric\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>What Google is exposing in GKE\u003C\u002Fh2>\u003Cp>The \u003Ca href=\"https:\u002F\u002Fdocs.cloud.google.com\u002Fmonitoring\u002Fapi\u002Fmetrics_kubernetes\" target=\"_blank\" rel=\"noopener\">GKE system metrics reference\u003C\u002Fa> covers metrics that appear only when GKE system metrics are enabled. 
The page groups them under the Kubernetes metrics family and marks their launch stage as either GA or BETA.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782137890166-5zeg.png\" alt=\"GKE system metrics expose TPU and HPA data\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>In practice, this means Cloud Monitoring is now surfacing more of the machinery behind a Kubernetes cluster, especially for \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa> and TPU-heavy workloads. You can see partition metadata, slice state, accelerator duty cycle, memory bandwidth utilization, and memory totals without stitching together separate telemetry sources.\u003C\u002Fp>\u003Cp>The documentation also reminds you of the basics that matter when you are building dashboards: metric kinds such as GAUGE, CUMULATIVE, and DISTRIBUTION behave differently, string values need MQL conversion before charting, and metric units are defined in the MetricDescriptor reference.\u003C\u002Fp>\u003Cul>\u003Cli>Metrics are written at the project level by default unless the descriptor says otherwise.\u003C\u002Fli>\u003Cli>String-type metrics require Monitoring Query Language before you can chart them.\u003C\u002Fli>\u003Cli>Some metrics are visible only after a delay of up to 240 seconds.\u003C\u002Fli>\u003Cli>The metric type strings use the \u003Ccode>kubernetes.io\u002F\u003C\u002Fcode> prefix, which the table omits for readability.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>TPU metrics show how much of the stack is now observable\u003C\u002Fh2>\u003Cp>The most interesting part of the page is the TPU coverage. 
\u003Ca href=\"\u002Ftag\u002Fgoogle\">Google\u003C\u002Fa> documents metrics for accelerator partitions, slices, and their metadata, which means operators can inspect not just whether a TPU exists, but whether it is healthy, active, degraded, or failed.\u003C\u002Fp>\u003Cp>That level of detail is useful in clusters where accelerator scheduling and topology matter as much as pod placement. A slice can be formed, torn down, or flagged with an end state, while partition state can expose HEALTHY or UNHEALTHY conditions. For teams training models on TPU-backed nodes, that is the difference between seeing a vague resource problem and understanding exactly where the failure sits.\u003C\u002Fp>\u003Cblockquote>“The AI industry is at an inflection point, and the next wave of progress will be driven by systems that can reason, plan and act.” — Thomas Kurian, Google Cloud Next 2024 keynote\u003C\u002Fblockquote>\u003Cp>Kurian’s comment was about AI systems, but the same logic applies here. More capable infrastructure needs more precise observability, and GKE’s TPU metrics give operators a better view of what the cluster is actually doing.\u003C\u002Fp>\u003Cp>A few of the TPU-related entries are worth calling out because they tell you what Google thinks matters operationally:\u003C\u002Fp>\u003Cul>\u003Cli>\u003Ccode>accelerator\u002Fpartition\u002Fstate\u003C\u002Fcode> reports partition health with a 1 or 0 signal.\u003C\u002Fli>\u003Cli>\u003Ccode>accelerator\u002Fslice\u002Fformation_durations\u003C\u002Fcode> measures how long slice assembly takes.\u003C\u002Fli>\u003Cli>\u003Ccode>accelerator\u002Fslice\u002Fdeformation_durations\u003C\u002Fcode> measures teardown and resource release time.\u003C\u002Fli>\u003Cli>\u003Ccode>accelerator\u002Fslice\u002Fmetadata\u003C\u002Fcode> emits streams for discovered slice and partition combinations.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Autoscaling metrics are the practical win\u003C\u002Fh2>\u003Cp>If you run ordinary 
application workloads, the autoscaler metrics may matter more than the TPU entries. Google exposes recommended CPU request cores, recommended memory bytes, and HPA recommendation latency. Those numbers tell you whether your scaling logic is reacting quickly enough to workload changes.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782137892746-bcnx.png\" alt=\"GKE system metrics expose TPU and HPA data\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The latency metric is especially useful because the documentation defines it as the time between metrics being created and the corresponding recommendation being applied to the apiserver. That makes it a direct signal for autoscaling lag, not a vague proxy.\u003C\u002Fp>\u003Cp>Here is the comparison that jumps out from the doc:\u003C\u002Fp>\u003Cul>\u003Cli>\u003Ccode>autoscaler\u002Flatencies\u002Fper_hpa_recommendation_scale_latency_seconds\u003C\u002Fcode> is GA and has up to 20 seconds of visibility delay.\u003C\u002Fli>\u003Cli>\u003Ccode>autoscaler\u002Fcontainer\u002Fcpu\u002Fper_replica_recommended_request_cores\u003C\u002Fcode> is GA and can take up to 240 seconds before data appears.\u003C\u002Fli>\u003Cli>\u003Ccode>autoscaler\u002Fcontainer\u002Fmemory\u002Fper_replica_recommended_request_bytes\u003C\u002Fcode> is GA and also has up to 240 seconds of delay.\u003C\u002Fli>\u003Cli>\u003Ccode>container\u002Faccelerator\u002Fduty_cycle\u003C\u002Fcode> is GA and sampled every 60 seconds, which makes it better for steady-state utilization checks than instant debugging.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Those differences matter when you build alerting. 
A metric that shows up in 20 seconds can support faster feedback loops, while a metric with a four-minute delay is better for trend analysis and right-sizing decisions.\u003C\u002Fp>\u003Cp>Google also documents the label fields for each metric, and those labels are where the real filtering power lives. For TPU metrics, labels like \u003Ccode>partition_id\u003C\u002Fcode>, \u003Ccode>slice_topology\u003C\u002Fcode>, \u003Ccode>accelerator_type\u003C\u002Fcode>, and \u003Ccode>block_id\u003C\u002Fcode> let you narrow queries to a specific hardware slice or topology.\u003C\u002Fp>\u003Ch2>What this means for teams running accelerator-heavy clusters\u003C\u002Fh2>\u003Cp>The page is a reference document, but it reveals a product direction: GKE observability is moving deeper into hardware-aware operations. That is good news for teams running model training, \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>, or mixed CPU-accelerator workloads, because the monitoring layer now speaks the same language as the infrastructure.\u003C\u002Fp>\u003Cp>It also means the boring parts of operations get easier to automate. If a TPU slice is unhealthy, if formation time spikes, or if HPA recommendations lag behind demand, those conditions can be turned into alerts and dashboards instead of manual checks.\u003C\u002Fp>\u003Cp>For teams already using \u003Ca href=\"https:\u002F\u002Fcloud.google.com\u002Fmonitoring\u002Fdocs\u002Fmonitoring-query-language\" target=\"_blank\" rel=\"noopener\">Monitoring Query Language\u003C\u002Fa>, the doc’s note about string metrics is a reminder that not every field charts cleanly by default. For everyone else, the page is a sign that Cloud Monitoring expects users to work at a finer level of detail than a simple CPU-plus-memory dashboard.\u003C\u002Fp>\u003Cp>That is probably the main takeaway: GKE system metrics are no longer limited to node health and container basics. 
They now include the signals you need to understand accelerator state, autoscaling decisions, and the timing gaps that can make a cluster feel slow even when the workload itself is fine.\u003C\u002Fp>\u003Cp>If you are running GKE with TPUs or aggressive autoscaling, the next thing to check is whether these metrics are enabled in your project and wired into your dashboards. If they are not, you are leaving a lot of operational context on the table.\u003C\u002Fp>\u003Cp>For related observability coverage, see \u003Ca href=\"\u002Fnews\u002Fgoogle-cloud-monitoring-mql-guide\">our MQL guide\u003C\u002Fa> and \u003Ca href=\"\u002Fnews\u002Fkubernetes-observability-best-practices\">our Kubernetes observability primer\u003C\u002Fa>.\u003C\u002Fp>","Google Cloud’s GKE system metrics add TPU, accelerator, and autoscaling data to Cloud Monitoring with 60-second sampling.","docs.cloud.google.com","https:\u002F\u002Fdocs.cloud.google.com\u002Fmonitoring\u002Fapi\u002Fmetrics_kubernetes",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782137890166-5zeg.png","tools","en","3c99ee7a-64ef-459f-9cd2-6fc420bd9e4b",[17,18,19,20,21],"GKE","Cloud Monitoring","TPU metrics","autoscaling","MQL",[23,24,25],"Google Cloud’s GKE system metrics add TPU and accelerator visibility to Cloud Monitoring.","Several metrics sample every 60 seconds, but visibility delays vary from 20 to 240 seconds.","Autoscaler latency and TPU slice state are the most operationally useful additions.",0,"2026-06-22T14:17:43.247326+00:00","2026-06-22T14:17:43.241+00:00","bd027ff9-99dc-4092-8d81-59f5c8c8cc5d",{"tags":31,"relatedLang":32,"relatedPosts":36},[],{"id":15,"slug":33,"title":34,"language":35},"gke-system-metrics-tpu-hpa-cloud-monitoring-zh","GKE 系統指標開始看見 TPU 與 
HPA","zh",[37,43,49,55,61,67],{"id":38,"slug":39,"title":40,"cover_image":41,"image_url":41,"created_at":42,"category":13},"507b9c2f-a9f2-44ca-817c-db878ca21269","rust-forum-checkins-turn-vague-work-into-plans-en","Rust forum check-ins turn vague work into plans","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782151405705-qpuu.png","2026-06-22T18:02:50.801572+00:00",{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"6dc55ef5-7d20-4012-8eef-c4795a7ea38b","googles-99-speaker-turns-home-into-gemini-chat-en","Google’s $99 speaker turns home into Gemini chat","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782122605293-a634.png","2026-06-22T10:03:02.737417+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"573d2a49-84d8-4017-9118-55bc5586dab9","install-openclaw-windows-powershell-wsl2-en","Install OpenClaw on Windows with PowerShell","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782093770144-gllk.png","2026-06-22T02:02:28.698991+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"cd0c44d6-1db4-4050-beba-6c3dfc74112a","anthropic-github-repositories-claude-code-push-en","91 Anthropic GitHub repos showcase Claude Code push","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782082971121-zlxw.png","2026-06-21T23:02:28.93858+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"ac744d9a-e6d0-4bef-893e-a0963d46f939","mistral-models-guide-turns-picking-easier-en","Mistral Models Guide Turns Picking 
Easier","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782079396132-wsei.png","2026-06-21T22:02:51.767694+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"decd40da-6ddd-45f0-835f-7981d0f45111","cudf-turns-pandas-code-into-gpu-runs-en","cuDF turns pandas code into GPU runs","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1782058729869-s0tn.png","2026-06-21T16:18:27.628499+00:00",[74,79,84,89,94,99,104,109,114,119],{"id":75,"slug":76,"title":77,"created_at":78},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":80,"slug":81,"title":82,"created_at":83},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":85,"slug":86,"title":87,"created_at":88},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 
2026","2026-03-26T08:28:09.752027+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"6d1bf3f6-e191-4d30-b55b-8a0722fa6afe","ai-trending-github-repos-and-research-feeds-en","AI Trending Tracks Repos and Research Feeds","2026-03-27T01:31:35.709532+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"010539a1-4c3a-4bd3-937a-26616422ee0d","awesome-ai-for-science-research-tools-map-en","Awesome AI for Science Is Becoming a Real Research Map","2026-03-27T01:46:50.89513+00:00"]