Learn how to effortlessly monitor system, pod, node, and user-defined metrics in GKE with the Cloud Monitoring API and Google Cloud Managed Service for Prometheus.

GKE Monitoring and Metrics

Google Kubernetes Engine (GKE) is a managed service by Google Cloud that simplifies the deployment, management, and scaling of containerized applications using Kubernetes, a popular open-source container orchestration platform. It helps developers focus on building applications while abstracting the complexities of infrastructure management.

Given the popularity of Kubernetes and Google Cloud, effective GKE monitoring is integral to ensuring system reliability and enabling site reliability engineering (SRE) initiatives for teams running containerized workloads in Google’s cloud infrastructure. This article explores GKE monitoring in depth, including key metrics, limitations, and third-party cloud cost monitoring tools.

Summary of key GKE monitoring concepts

The table below summarizes the eight essential GKE monitoring concepts this article will explore in more detail.

Key Concept | Description
Observability Metrics in GKE | With GKE, you can effortlessly monitor various metrics, including system metrics, control plane, workload, third-party, and user-defined metrics.
Benefits of collecting GKE metrics | Monitoring GKE metrics in real-time prevents issues like memory leaks, unscheduled Pods, and high ingestion rates. Alerts can help minimize recovery times, while thorough monitoring helps optimize costs by identifying which resources to reduce, keep, or increase.
GKE Metrics Server | The Metrics Server provides container resource metrics to Horizontal and Vertical Pod Autoscalers for GKE autoscaling.
GKE Usage Metering | GKE usage metering is a tool that helps you track how many resources your cluster's workloads use, enabling you to plan your resources efficiently.
GCP Prometheus | Google Cloud Managed Service for Prometheus monitors metrics and workloads globally in hybrid and multi-cloud environments without operational overhead, ensuring portability with upstream Prometheus.
GKE Monitoring dashboard | You can visualize your GKE metrics data using GKE-specific or custom dashboards and third-party integrations.
Limitations of GKE Monitoring | GKE monitoring has limitations around data retention and quotas, relies on Google Cloud Monitoring, and may experience delays in metric ingestion. Users can retrieve data using MQL and PromQL, but setting up alerting policies and custom metrics can be challenging.
Monitoring with third-party cost optimization tools | To optimize the costs of your Kubernetes clusters, you can use third-party tools like Kubecost instead of relying on GKE's default features. Kubecost uses open-source Prometheus as a time-series database to provide cost allocation calculations and optimization insights for your Kubernetes clusters.

The eight essential aspects of GKE monitoring

GKE monitoring covers the workloads running on GKE clusters and tracks core system metrics such as CPU, memory, and disk utilization across all of them. When you create a GKE cluster on Google Cloud, the following services are enabled by default: Cloud Logging, Cloud Monitoring, and Google Cloud Managed Service for Prometheus.

GKE integrates well with Cloud Monitoring to monitor the health of your Kubernetes components and the workloads running on them. You can also:

  • Monitor the metrics populated on custom dashboards.
  • Generate alerts based on specific metrics and log messages.
  • Create service-level objectives (SLOs) and give third-party services access to the data through the Cloud Monitoring API.

In the sections below, we’ll explore eight essential concepts that enable teams to effectively implement and scale GKE monitoring.

Observability metrics in GKE

There are several essential observability metrics relevant to GKE monitoring.

Keeping on top of these metrics ensures the system runs smoothly without issues and allows you to monitor the service health for defining service level objectives (SLOs).

Using Google Cloud Managed Service for Prometheus, you can monitor third-party applications on GKE clusters, such as ArgoCD, Kafka, Jenkins, and MongoDB, using Prometheus exporters. Metrics generated by Prometheus are custom metrics.

The table below shows the metrics available in GKE and their default behavior:

Metric | Metric Source | Enabled by default?
Kubernetes Metrics | System | Yes
Third-Party | Google Managed Prometheus | Yes
User Defined | Google Managed Prometheus | Yes
Kubernetes API server | Control Plane Metrics | No
Scheduler | Control Plane Metrics | No
Controller Manager | Control Plane Metrics | No
Kubernetes objects | Kube-State Metrics | No

In the sections below, we’ll take a closer look at the system, control plane, and Kube state metrics, key aspects of effective GKE monitoring.

System metrics

System metrics related to memory, storage, CPU, network throughput, and so on are forwarded to Cloud Monitoring by default in GKE Standard and Autopilot clusters. These metrics are captured at the container, pod, and node levels and have the prefix kubernetes.io/. The complete list of available system metrics is in the Google Cloud documentation.
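To make the naming scheme concrete, here is a minimal sketch that builds a Cloud Monitoring filter string for one system metric. The helper name system_metric_filter and the cluster name demo-cluster are hypothetical; kubernetes.io/container/cpu/core_usage_time is one of the container-level system metrics, and the string follows the Cloud Monitoring filter syntax.

```python
def system_metric_filter(metric_suffix: str, cluster_name: str) -> str:
    """Build a Cloud Monitoring filter for a GKE system metric in one cluster.

    metric_suffix is the part after the "kubernetes.io/" prefix,
    for example "container/cpu/core_usage_time".
    """
    metric_type = f"kubernetes.io/{metric_suffix}"
    return (
        f'metric.type = "{metric_type}" '
        f'AND resource.labels.cluster_name = "{cluster_name}"'
    )


# "demo-cluster" is a placeholder cluster name used only for illustration.
print(system_metric_filter("container/cpu/core_usage_time", "demo-cluster"))
```

A string like this could be supplied as the filter when listing time series through the Cloud Monitoring API.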

During cluster creation or later, you can enable additional observability options in your GKE cluster, such as Control plane Metrics and Kube State Metrics.

Control plane metrics

Control plane metrics enable health monitoring for the Kubernetes control plane by collecting metrics from its components, such as the Kubernetes API server, Scheduler, and Controller Manager.

How to use the Cloud Shell with GKE

You can follow the steps in the Google Cloud documentation to activate Cloud Shell, a built-in shell for interacting with Google Cloud resources. Once Cloud Shell is activated, follow the steps below to enable Control Plane metrics:

Note: Replace the variables ($CLUSTER_NAME, $REGION, $ZONE, $PROJECT_ID) in the commands below with your own values.

Step 1: You can get all the information about the GKE cluster using the following command:

gcloud container clusters describe $CLUSTER_NAME --region=$REGION

Step 2: Download the cluster kubeconfig to your local workstation. The following command generates the kubeconfig and adds it to the ~/.kube/config file.

gcloud container clusters get-credentials $CLUSTER_NAME --region=$REGION

Step 3: Run the command below to enable the Control Plane Metrics. For a regional cluster, use --region=$REGION in place of --zone=$ZONE:

gcloud container clusters update $CLUSTER_NAME \
  --zone=$ZONE \
  --project=$PROJECT_ID \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER

Kube state metrics

Kube state metrics is an open-source project that you can install on a Kubernetes cluster to expose various cluster metrics in the Prometheus format. It helps you monitor the health of Kubernetes objects such as deployments, nodes, and pods. GKE comes with a pre-packaged Kube State Metrics installation that users don't need to manage, but you can install your own copy of the open-source version if you need to tweak custom parameters.

Benefits of collecting GKE metrics

Keeping track of metrics can help identify and solve issues in GKE clusters. These issues include high CPU or memory usage, which may indicate memory leaks in the application or insufficient resources assigned at the container or node level.

Collecting GKE metrics can help you catch many critical issues early. For instance:

  • High container restarts could indicate that containers are crashing. A high number of unscheduled pods may indicate that the cluster needs more resources or that the pods have configuration errors.
  • High ingestion rates for Cloud Logging or Google Cloud Managed Service for Prometheus can also increase the cost of Google Cloud's operations suite. You can reduce this cost by lowering ingestion rates and proactively monitoring the ingested data.
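As an illustration of turning the container-restart signal into an alert condition, the sketch below assembles a PromQL expression. It assumes that the kube-state-metrics counter kube_pod_container_status_restarts_total is being collected in your cluster; the helper name, the namespace, the 15-minute window, and the threshold of 3 are hypothetical values chosen only for the example.

```python
def restart_alert_expr(namespace: str, window: str = "15m", max_restarts: int = 3) -> str:
    """Return a PromQL condition that is true when containers restart too often.

    Assumes kube-state-metrics exposes kube_pod_container_status_restarts_total.
    """
    selector = f'kube_pod_container_status_restarts_total{{namespace="{namespace}"}}'
    # increase() estimates how much the restart counter grew over the window.
    return f"increase({selector}[{window}]) > {max_restarts}"


print(restart_alert_expr("prod"))
```

The resulting string can be pasted into a PromQL-based alerting policy or a Prometheus rule file and tuned to your own tolerance for restarts.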

By tracking resource usage in GKE, you can ensure you are using Kubernetes resources effectively. Monitoring GKE resources can help identify their real-time status and ensure optimal availability, preventing service outages.

With the right alerts, you can quickly take action if something is about to or does go wrong, preventing minor issues from becoming more significant problems. Monitoring can also help you identify the root cause of a problem, making it easier for Kubernetes administrators to find the issue quickly and reduce recovery times.

Finally, monitoring GKE supports thorough cost optimization, allowing you to maximize your returns on GKE investment by identifying which resources to cut, keep, and increase.

GKE Metrics Server

The Metrics Server is essential for GKE's built-in autoscaling pipelines. It provides container resource metrics to the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA), which use them to trigger autoscaling. The Metrics Server retrieves the metrics from Kubelets and then exposes them through the Kubernetes Metrics API.
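To show why these metrics matter for autoscaling, here is a minimal sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default in upstream Kubernetes). The function name and the sample numbers are hypothetical, and the real controller also accounts for pod readiness, missing metrics, and min/max replica bounds.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    # Skip scaling when the observed metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Four pods averaging 90% CPU against a 60% target scale out.
print(desired_replicas(4, 90, 60))
```

If the Metrics Server stops reporting, the observed value in this calculation is missing, which is why autoscaling stalls when the Metrics Server is unhealthy.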

Keeping the Metrics Server healthy is crucial to ensure proper GKE autoscaling. The GKE metrics-server deployment includes an addon-resizer, which adjusts the metrics-server container's resources based on the cluster's node count. As of Kubernetes 1.26, updating Pods in place is not supported, so the addon-resizer restarts the metrics-server Pod to apply the new resource values. In-place Pod resizing is available as an alpha feature in Kubernetes 1.27.
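The scaling behavior of the addon-resizer can be approximated with a simple linear model: a base amount plus a per-node increment, with a resize (and therefore a restart) only when the current value drifts from the expected value by more than a percentage threshold. The sketch below is a simplified model under those assumptions; the function names and all numbers are hypothetical and are not GKE's actual configuration.

```python
def expected_resource(base: int, extra_per_node: int, node_count: int) -> int:
    """Linear sizing: a fixed base plus an increment for every node."""
    return base + extra_per_node * node_count


def needs_resize(current: int, expected: int, threshold_pct: int) -> bool:
    """True when current deviates from expected by more than threshold_pct percent."""
    return abs(current - expected) * 100 > threshold_pct * expected


# Hypothetical values: 100 MiB base memory, 2 MiB per node, 50 nodes.
expected_mib = expected_resource(base=100, extra_per_node=2, node_count=50)
print(expected_mib, needs_resize(current=150, expected=expected_mib, threshold_pct=5))
```

Under this model, a cluster that grows quickly crosses the threshold repeatedly, so metrics-server restarts during rapid scale-up are expected rather than a sign of failure.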

The GKE metrics server architecture diagram shows the integration with different components.

GKE Usage Metering

GKE usage metering is a helpful tool that provides a clear understanding of the resource usage in your GKE clusters. These resources include CPU, GPU, TPU, memory, storage, and network egress usage.

GKE stores the data in BigQuery, which allows you to view it directly or analyze it with external tools like Looker Studio. GKE usage metering helps track resource usage based on Kubernetes namespaces and labels. This tracking associates resource usage with various projects or tenants and can highlight opportunities for optimization by helping you identify workloads where the resource requests and resource consumption differ.
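As a rough sketch of the request-versus-consumption comparison, the example below aggregates usage records by namespace and flags namespaces whose consumption is well below what they requested. The rows, the field layout, and the 50% cut-off are hypothetical; in practice the records would come from a query against the BigQuery dataset that usage metering writes to.

```python
from collections import defaultdict

# Hypothetical rows: (namespace, cpu_core_hours_requested, cpu_core_hours_used)
ROWS = [
    ("checkout", 120.0, 30.0),
    ("checkout", 80.0, 20.0),
    ("search", 50.0, 45.0),
]


def utilization_by_namespace(rows):
    """Return used/requested per namespace, skipping namespaces with no requests."""
    requested = defaultdict(float)
    used = defaultdict(float)
    for namespace, req, use in rows:
        requested[namespace] += req
        used[namespace] += use
    return {ns: used[ns] / requested[ns] for ns in requested if requested[ns] > 0}


def overprovisioned(rows, min_utilization=0.5):
    """List namespaces that consume less than min_utilization of their requests."""
    ratios = utilization_by_namespace(rows)
    return sorted(ns for ns, ratio in ratios.items() if ratio < min_utilization)


print(overprovisioned(ROWS))
```

Namespaces returned by overprovisioned are candidates for lower resource requests, which is the optimization opportunity described above.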

There are limitations to GKE Usage Metering that may hinder adoption if your use case runs up against them. GKE Usage Metering monitors usage but does not associate this with cost by default. It takes a few steps, but a combination of BigQuery queries, Looker Studio, and exported cloud billing data can give you an approximate cost breakdown. Additionally, there are limitations to tracking resources not created within GKE, such as those created when working in a multi-cloud environment.
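To illustrate what an approximate cost breakdown involves, the sketch below splits a billed cluster cost across namespaces in proportion to their metered usage. This is a deliberately simplified assumption, not the method GKE or any specific tool uses; the function name, the cost figure, and the usage numbers are hypothetical.

```python
def allocate_cost(total_cost: float, usage_by_namespace: dict) -> dict:
    """Split total_cost across namespaces in proportion to their usage."""
    total_usage = sum(usage_by_namespace.values())
    if total_usage == 0:
        # Nothing was metered, so nothing can be attributed.
        return {ns: 0.0 for ns in usage_by_namespace}
    return {
        ns: total_cost * usage / total_usage
        for ns, usage in usage_by_namespace.items()
    }


# Hypothetical: a 100.00 bill split over CPU core-hours per namespace.
print(allocate_cost(100.0, {"checkout": 30.0, "search": 10.0}))
```

A real breakdown would also need to handle idle capacity, multiple resource types, and differing unit prices, which is where the manual effort mentioned above comes from.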

Third-party solutions like Kubecost provide more flexible and precise cost monitoring for your projects, which can overcome the limitations of GKE usage metering. The section on monitoring with third-party cost optimization tools later in this article covers this in more detail.

Google Managed Prometheus

Google Cloud Managed Service for Prometheus is a fully managed, multi-cloud, cross-project solution for collecting and monitoring Prometheus metrics. With this service, you can globally monitor, visualize, and generate alerts on your workloads without worrying about Prometheus's operational tasks and scalability.

Managed Service for Prometheus collects metrics from Prometheus exporters. It lets you query the data globally using PromQL, similar to how we query Cloud Monitoring data, which means you can continue to use any existing Grafana dashboards, PromQL-based alerts, and workflows.

This service is hybrid and multi-cloud compatible. It can monitor both Kubernetes and VM workloads. Additionally, it retains data for 24 months and remains compatible with upstream Prometheus, ensuring portability.

The open-source Prometheus consists of a single deployment handling various functions such as data collection, query evaluation, rule and alert evaluation, and data storage. However, Managed Service for Prometheus divides these features into multiple components, such as:

  • Data collectors. A Kubernetes administrator can configure data collection using managed collectors, self-deployed collectors, the OpenTelemetry Collector, or the Ops Agent. These scrape local exporters and forward the collected data to Monarch, a globally distributed in-memory time-series database developed by Google. Most of Google’s internal systems, such as Spanner and Bigtable, rely on Monarch for monitoring.
  • Queries. Monarch executes queries, aggregates the results across all Google Cloud regions, and supports up to 1,000 projects.
  • Alerts. Once Monarch aggregates the data results, you can write PromQL alerts in Cloud Monitoring. A rule evaluator executes the rules and forwards the fired alerts to the Prometheus Alertmanager.
  • Data storage. Monarch also provides data storage and stores the Prometheus data for 24 months at no additional cost.
  • Visualizations. You can then use Grafana to visualize the data ingested from the global Monarch data store.

Google Managed Prometheus architecture. (Source)

GKE monitoring dashboard

You can use GKE-specific dashboards or create custom dashboards and even third-party integrations per your requirements. Additionally, you can import your existing Grafana dashboards into the Cloud Monitoring Dashboard.

  • The GKE dashboard provides a comprehensive, filterable view of your clusters, workloads, services, nodes, and other resources. You can click a resource to view metrics and log details. You can also view and create Service Level Objectives (SLOs) from the detail view for namespaces, workloads, and Kubernetes services. The dashboard also displays the level of activity or inactivity of GKE clusters based on their usage, giving users a clear understanding of their cluster's status.
  • Other GKE dashboards and playbooks focus on specific resources or conditions, such as at-risk workloads.

GKE monitoring dashboard sample - Node View with default widgets.

Limitations of GKE monitoring

While native GKE monitoring has its benefits, there are limitations. Data retention rules for GKE monitoring metrics can constrain some use cases: Cloud Monitoring retains full-resolution data only for a limited time, so detailed metrics may be unavailable for long-term analysis. For example, metric data is stored at its original sampling frequency for six weeks before being down-sampled to 10-minute intervals for extended storage.

On the other hand, metric data from Google Cloud Managed Service for Prometheus is stored for one week at its original sampling frequency before being down-sampled to one-minute intervals for the next five weeks. After that, it is down-sampled to 10-minute intervals for extended storage of up to 24 months.

Data / Duration | One week | Six weeks | 24 months
Regular Metrics Data | Original sampling frequency | Original sampling frequency | Down-sampled to 10-minute intervals
Metrics Data by GMP | Original sampling frequency | Down-sampled to 1-minute intervals | Down-sampled to 10-minute intervals
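The retention tiers in the table can be expressed as a small lookup, which is handy when deciding whether a historical query will still see full-resolution data. This is a sketch of the rules as described in this article; the function name is hypothetical, and treating 24 months as 730 days is an approximation made for the example.

```python
def stored_resolution(age_days: int, managed_prometheus: bool) -> str:
    """Return the resolution at which a sample of the given age is stored."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days > 730:  # assumption: 24 months approximated as 730 days
        return "expired"
    if managed_prometheus:
        if age_days <= 7:
            return "original"
        if age_days <= 42:  # remainder of the first six weeks
            return "1-minute"
        return "10-minute"
    # Regular Cloud Monitoring metrics keep the original frequency for six weeks.
    if age_days <= 42:
        return "original"
    return "10-minute"


print(stored_resolution(20, managed_prometheus=True))
print(stored_resolution(20, managed_prometheus=False))
```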

Other limitations related to native GKE monitoring include:

  • Data ingestion delays. The monitoring system may experience delays in ingesting specific metrics from their source, which delays their availability. As a result, not every metric is available in real time.
  • Tight coupling to Google Cloud Monitoring. GKE monitoring heavily relies on Google Cloud Monitoring. Changes or limitations in this service impact GKE monitoring capabilities.
  • Query and data retrieval complexity. Users can retrieve data from the cloud monitoring system using MQL (Monitoring Query Language) and PromQL (Prometheus Query Language). However, setting up and managing alerting policies and custom metrics can be challenging for those unfamiliar with these query languages. It may take some time to understand and utilize their capabilities thoroughly.
  • Quotas that limit usage. Google Cloud enforces quotas to ensure fair usage and reduce spikes in resource consumption. These quotas apply to APIs, monitored projects, alerting, custom metrics, uptime checks, dashboards, SLOs, and other features.

Monitoring with third-party cost optimization tools

Relying on GKE's out-of-the-box functionality for cost allocation calculations and optimization insights requires significant manual effort and implementation work. Consider using third-party tools such as Kubecost to simplify this process.

Kubecost is a tool that leverages open-source Prometheus as a time-series database and post-processes Prometheus-generated data to provide cost allocation calculations and optimization insights for your Kubernetes clusters. Kubecost is built on OpenCost, a Cloud Native Computing Foundation (CNCF) Sandbox project. Acceptance into the CNCF brings the project more attention, facilitates community collaboration to enhance its quality, and increases its usage.

Kubecost integrates with GKE in one of two ways:

  • Deploy Kubecost using Google Cloud Marketplace. The Kubecost image is available on Google Cloud Marketplace and is production-ready within five minutes for organizations seeking a quick setup.
  • Install Kubecost as a GKE deployment. For those who prefer more configuration control, you can install Kubecost on your GKE cluster and use the Google Managed Prometheus (GMP) binary for metric ingestion into the GMP database.

Whichever option you choose, Kubecost can be a valuable asset to your operations.

Conclusion

GKE monitoring is a crucial aspect of successful Kubernetes management for many teams that run containerized apps on the Google Cloud platform. Components required for comprehensive GKE monitoring include understanding observability metrics and exploring tools like the GKE Metrics Server, Usage Metering, and Google Managed Prometheus.

Additionally, teams should collect key GKE metrics and understand how they benefit Kubernetes management, for example by enabling performance improvements and efficient resource utilization. It is also important to acknowledge the role of the GKE monitoring dashboard and to be mindful of the limitations of native GKE monitoring to ensure a balanced approach.

Finally, understanding when to leverage third-party tooling can help teams strengthen their GKE monitoring. For example, integrating third-party cost optimization tools like Kubecost can be crucial in the journey toward streamlined, optimized, and resilient container orchestration.
