Let's look at the following real query from the Node Exporter Full dashboard:
(
(
node_memory_MemTotal_bytes{instance=~"$node:$port", job=~"$job"}
-
node_memory_MemFree_bytes{instance=~"$node:$port", job=~"$job"}
)
/
node_memory_MemTotal_bytes{instance=~"$node:$port", job=~"$job"}
)
*
100
It is clear the query calculates the percentage of used memory for the given $node, $port and $job. Isn't it? :)
What's wrong with this query? The same label filters are copy-pasted across three time series selectors, which makes it easy to mistype one of them when the query is modified. Let's simplify the query with WITH expressions:
WITH (
commonFilters = {instance=~"$node:$port",job=~"$job"}
)
(
node_memory_MemTotal_bytes{commonFilters}
-
node_memory_MemFree_bytes{commonFilters}
)
/
node_memory_MemTotal_bytes{commonFilters} * 100
Now the label filters are located in a single place instead of three distinct places. However, the query still mentions the node_memory_MemTotal_bytes metric twice and {commonFilters} three times. WITH expressions can improve this too, by defining a template function:
WITH (
my_resource_utilization(free, limit, filters) = (limit{filters} - free{filters}) / limit{filters} * 100
)
my_resource_utilization(
node_memory_MemFree_bytes,
node_memory_MemTotal_bytes,
{instance=~"$node:$port",job=~"$job"},
)
Now the template function my_resource_utilization() may be used for monitoring arbitrary resources - memory, CPU, network, storage, you name it.
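As an illustration of that reuse, the same template could be pointed at filesystem space instead of memory. This is only a sketch: node_filesystem_avail_bytes and node_filesystem_size_bytes are node_exporter metrics that do not appear in the queries above, and the mountpoint="/" filter is an assumed example value.

```metricsql
# Same template function as above, unchanged.
WITH (
  my_resource_utilization(free, limit, filters) = (limit{filters} - free{filters}) / limit{filters} * 100
)
# Assumption: filesystem metrics and the mountpoint filter are illustrative choices,
# not taken from the dashboard queries discussed here.
my_resource_utilization(
  node_filesystem_avail_bytes,
  node_filesystem_size_bytes,
  {instance=~"$node:$port",job=~"$job",mountpoint="/"}
)
```

Only the two metric names and the filter set change; the utilization formula itself stays in one place.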
Let's take another nice query from the Node Exporter Full dashboard:
(
(
(
count(
count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)
)
)
-
avg(
sum by (mode) (rate(node_cpu_seconds_total{mode='idle',instance=~"$node:$port",job=~"$job"}[5m]))
)
)
*
100
)
/
count(
count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu)
)
Do you understand what this mess does? Is it manageable? :) WITH expressions are happy to help in a few iterations.
1. Extract common filters used in multiple places into a commonFilters variable:
WITH (
commonFilters = {instance=~"$node:$port",job=~"$job"}
)
(
(
(
count(
count(node_cpu_seconds_total{commonFilters}) by (cpu)
)
)
-
avg(
sum by (mode) (rate(node_cpu_seconds_total{mode='idle',commonFilters}[5m]))
)
)
*
100
)
/
count(
count(node_cpu_seconds_total{commonFilters}) by (cpu)
)
2. Extract "count(count(...) by (cpu))" into a cpuCount variable:
WITH (
commonFilters = {instance=~"$node:$port",job=~"$job"},
cpuCount = count(count(node_cpu_seconds_total{commonFilters}) by (cpu))
)
(
(
cpuCount
-
avg(
sum by (mode) (rate(node_cpu_seconds_total{mode='idle',commonFilters}[5m]))
)
)
*
100
) / cpuCount
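3. Extract the rate(...) part into a cpuIdle variable, since it is now clear that this part calculates the number of idle CPUs. The avg(sum by (mode) (...)) wrapper can be reduced to a plain sum(...), because the mode='idle' filter leaves only one mode group to average over: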
WITH (
commonFilters = {instance=~"$node:$port",job=~"$job"},
cpuCount = count(count(node_cpu_seconds_total{commonFilters}) by (cpu)),
cpuIdle = sum(rate(node_cpu_seconds_total{mode='idle',commonFilters}[5m]))
)
((cpuCount - cpuIdle) * 100) / cpuCount
4. Put node_cpu_seconds_total{commonFilters} into its own variable with the name cpuSeconds:
WITH (
cpuSeconds = node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"},
cpuCount = count(count(cpuSeconds) by (cpu)),
cpuIdle = sum(rate(cpuSeconds{mode='idle'}[5m]))
)
((cpuCount - cpuIdle) * 100) / cpuCount
Now the query is much clearer than the initial one.
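To check that the refactoring preserved the meaning, the final query can be expanded by hand by substituting each variable back in. The block below is a manual sketch, not output copied from VictoriaMetrics; the filter order and whitespace of a real expansion may differ.

```metricsql
# cpuCount and cpuIdle are replaced by their definitions.
# cpuSeconds{mode='idle'} becomes the original selector with mode='idle' added to its filters.
(
  (
    count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))
    -
    sum(rate(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job",mode='idle'}[5m]))
  )
  *
  100
)
/
count(count(node_cpu_seconds_total{instance=~"$node:$port",job=~"$job"}) by (cpu))
```

Compared with the dashboard query, the only structural difference is that avg(sum by (mode) (...)) has been reduced to sum(...).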
WITH expressions may be nested and may be put anywhere. Try expanding the following query:
WITH (
    f(a, b) = WITH (
        f1(x) = b-x,
        f2(x) = x+x
    ) f1(a)*f2(b)
) f(foo, with(x=bar) x)
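One way to work through this by hand is to substitute from the outside in. The steps below are a manual sketch written as comments, with the final line being the resulting query; the exact formatting produced by a real expansion may differ.

```metricsql
# 1. The inner "with(x=bar) x" binds x to bar, so the call is:
#      f(foo, bar)
# 2. Inside f, a = foo and b = bar, so the body f1(a)*f2(b) becomes:
#      f1(foo) * f2(bar)
# 3. f1(x) = b-x uses b from the enclosing f, i.e. bar; f2(x) = x+x:
(bar - foo) * (bar + bar)
```

Note that f1 refers to b without taking it as an argument: a nested WITH expression can see the arguments of the function that encloses it.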