Health
Colors in the graph represent the health of your service mesh. A node colored red or orange might need attention. The color of an edge between components represents the health of the requests between those components. The node shape indicates the type of component such as services, workloads, or apps.
The health of nodes and edges is refreshed automatically based on the user’s preference. The graph can also be paused to examine a particular state, or replayed to re-examine a particular time period.
Health Configuration
Kiali calculates health by combining the individual health of several indicators, such as pods and request traffic. The global health of a resource reflects the most severe health of its indicators.
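The "most severe indicator wins" rule can be sketched in a few lines. This is only an illustration, not Kiali's implementation: the `Status` enum, its numeric ordering, and the `global_health` helper are hypothetical names, and the ordering assumes that the four statuses Kiali reports rank as NA, Healthy, Degraded, Failure from least to most severe.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical severity ordering; a higher value is more severe."""
    NA = 0
    HEALTHY = 1
    DEGRADED = 2
    FAILURE = 3


def global_health(indicators):
    """Return the most severe status among a resource's indicators."""
    return max(indicators.values(), default=Status.NA)


# Example: pods are fine, but request traffic is degraded.
print(global_health({"pod_status": Status.HEALTHY,
                     "traffic": Status.DEGRADED}).name)  # prints DEGRADED
```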
Health Indicators
The table below lists the current health indicators and whether the indicator supports custom configuration for its health calculation.
Indicator | Supports Configuration |
---|---|
Pod Status | No |
Traffic Health | Yes |
Icons and colors
Kiali uses icons and colors to indicate the health of resources and associated request traffic.
- No Health Information (NA)
- Healthy
- Degraded
- Failure
Default Values
Request Traffic
By default, Kiali uses the traffic rate configuration shown below. Application errors have minimal tolerance, while client errors have a higher tolerance, reflecting that some level of client errors is often normal (e.g. 404 Not Found):
- For the http protocol, 4xx codes are client errors and 5xx codes are application errors.
- For the grpc protocol, all codes 1-16 are errors (0 is success).
So, for example, if the rate of application errors is >= 0.1%, Kiali shows Degraded health, and if it is >= 10%, Kiali shows Failure health.
# ...
health_config:
  rate:
  - namespace: ".*"
    kind: ".*"
    name: ".*"
    tolerance:
    - code: "^5\\d\\d$"
      direction: ".*"
      protocol: "http"
      degraded: 0
      failure: 10
    - code: "^4\\d\\d$"
      direction: ".*"
      protocol: "http"
      degraded: 10
      failure: 20
    - code: "^[1-9]$|^1[0-6]$"
      direction: ".*"
      protocol: "grpc"
      degraded: 0
      failure: 10
# ...
Configuration
Custom health configuration is specified in the Kiali CR. To see the supported configuration syntax for health_config, visit Kiali CR.
Kiali applies the first matching rate configuration (namespace, kind, etc) and calculates the status for each tolerance. The reported health will be the status with highest priority (see below).
Rate Option | Definition | Default |
---|---|---|
namespace | Matching Namespaces (regex) | .* (match all) |
kind | Matching Resource Types (workload\|app\|service) (regex) | .* (match all) |
name | Matching Resource Names (regex) | .* (match all) |
tolerance | Array of tolerances to apply. | |
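The first-match selection of a rate entry can be sketched as follows. The helper name `first_matching_rate` and the sample `rates` list are hypothetical, and the sketch assumes unanchored regex matching (`re.search`), so a pattern such as `alpha` would also match a namespace named `alpha-2` unless it is anchored with `^` and `$`.

```python
import re


def first_matching_rate(rates, namespace, kind, name):
    """Return the first rate entry whose three regexes all match, else None."""
    for rate in rates:
        # Omitted options fall back to ".*" (match all), as in the table above.
        if (re.search(rate.get("namespace", ".*"), namespace)
                and re.search(rate.get("kind", ".*"), kind)
                and re.search(rate.get("name", ".*"), name)):
            return rate
    return None


# Hypothetical rate entries; order matters because the first match wins.
rates = [
    {"namespace": "alpha"},
    {"namespace": "beta", "kind": "app"},
    {"namespace": ".*"},  # catch-all entry, similar to the default
]

print(first_matching_rate(rates, "beta", "app", "x-server"))
```

Because only the first matching entry is used, more specific entries should be listed before broader ones.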
Tolerance Option | Definition | Default |
---|---|---|
code | Matching Response Status Codes (regex) [1] | required |
direction | Matching Request Directions (inbound\|outbound) (regex) | .* (match all) |
protocol | Matching Request Protocols (http\|grpc) (regex) | .* (match all) |
degraded | Degraded Threshold (% matching requests >= value) | 0 |
failure | Failure Threshold (% matching requests >= value) | 0 |
[1] The status code typically depends on the request protocol. The special code -, a single dash, is used for requests that don’t receive a response, and therefore no response code.
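To make the tolerance options concrete, the sketch below checks whether a request (protocol, direction, response code) falls under a tolerance. The function `tolerance_matches` is a hypothetical helper, the patterns are copied from the default configuration, and unanchored regex matching is an assumption.

```python
import re

# Patterns taken from the default configuration shown earlier.
DEFAULT_TOLERANCES = [
    {"code": r"^5\d\d$", "protocol": "http"},
    {"code": r"^4\d\d$", "protocol": "http"},
    {"code": r"^[1-9]$|^1[0-6]$", "protocol": "grpc"},
]


def tolerance_matches(tolerance, protocol, direction, code):
    """True when code, protocol and direction all match the tolerance regexes."""
    return bool(
        re.search(tolerance["code"], code)  # code is required
        and re.search(tolerance.get("protocol", ".*"), protocol)
        and re.search(tolerance.get("direction", ".*"), direction)
    )


print(tolerance_matches(DEFAULT_TOLERANCES[0], "http", "inbound", "503"))
print(tolerance_matches(DEFAULT_TOLERANCES[2], "grpc", "outbound", "0"))
# A tolerance for requests that received no response uses the special code "-".
print(tolerance_matches({"code": "^-$"}, "http", "inbound", "-"))
```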
Kiali reports traffic health with the following top-down status priority:
Priority | Rule (value=% matching requests) | Status |
---|---|---|
1 | value >= FAILURE threshold | FAILURE |
2 | value >= DEGRADED threshold AND value < FAILURE threshold | DEGRADED |
3 | value > 0 AND value < DEGRADED threshold | HEALTHY |
4 | value = 0 | HEALTHY |
5 | No traffic | No Health Information |
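The priority table can be read as a small decision function. The sketch below is one interpretation, not Kiali's code: `traffic_status` is a hypothetical name, `None` stands for "no traffic", and a value of exactly 0 is assumed to be HEALTHY even when the degraded threshold is 0 (rule 4).

```python
def traffic_status(value, degraded=0.0, failure=0.0):
    """Map a percentage of matching requests to a status.

    value is None when there is no traffic at all.
    """
    if value is None:
        return "No Health Information"
    if value > 0:
        if value >= failure:
            return "FAILURE"
        if value >= degraded:
            return "DEGRADED"
    # Either no matching requests, or below the degraded threshold.
    return "HEALTHY"


# Default http 5xx tolerance: degraded 0, failure 10.
print(traffic_status(5, degraded=0, failure=10))
# Default http 4xx tolerance: degraded 10, failure 20.
print(traffic_status(5, degraded=10, failure=20))
```

With both thresholds left at their default of 0, any non-zero percentage of matching requests is reported as FAILURE.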
Examples
These examples use the repo _https://github.com/kiali/demos/tree/master/error-rates_.
In this repo we can see 2 namespaces: alpha and beta (Demo design).
The server nodes return the following responses (the responses are configurable):
App (alpha/beta) | Code | Rate |
---|---|---|
x-server | 200 | 9 |
x-server | 404 | 1 |
y-server | 200 | 9 |
y-server | 500 | 1 |
z-server | 200 | 8 |
z-server | 201 | 1 |
z-server | 201 | 1 |
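The thresholds are expressed as a percentage of matching requests, so it helps to convert the table into percentages. The sketch below assumes the Rate column holds relative request rates, so that each app's rows sum to its total traffic; `RESPONSES` and `matching_percentage` are hypothetical names.

```python
import re

# Rows copied from the table above: (response code, rate).
RESPONSES = {
    "x-server": [("200", 9), ("404", 1)],
    "y-server": [("200", 9), ("500", 1)],
    "z-server": [("200", 8), ("201", 1), ("201", 1)],
}


def matching_percentage(responses, code_regex):
    """Percentage of requests whose response code matches code_regex."""
    total = sum(rate for _, rate in responses)
    if total == 0:
        return None  # no traffic
    matched = sum(rate for code, rate in responses
                  if re.search(code_regex, code))
    return 100.0 * matched / total


print(matching_percentage(RESPONSES["x-server"], r"^4\d\d$"))  # 10.0
print(matching_percentage(RESPONSES["y-server"], r"^5\d\d$"))  # 10.0
print(matching_percentage(RESPONSES["z-server"], r"^[45]\d\d$"))  # 0.0
```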
The applied traffic rate configuration is:
# ...
health_config:
  rate:
  - namespace: "alpha"
    tolerance:
    - code: "404"
      failure: 10
      protocol: "http"
    - code: "[45]\\d[^\\D4]"
      protocol: "http"
  - namespace: "beta"
    tolerance:
    - code: "[4]\\d\\d"
      degraded: 30
      failure: 40
      protocol: "http"
    - code: "[5]\\d\\d"
      protocol: "http"
# ...
After Kiali adds the default configuration, the effective configuration is the following (as shown in Kiali's Debug Info):
{
  "healthConfig": {
    "rate": [
      {
        "namespace": "/alpha/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/404/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/[45]\\d[^\\D4]/",
            "degraded": 0,
            "failure": 0,
            "protocol": "/http/",
            "direction": "/.*/"
          }
        ]
      },
      {
        "namespace": "/beta/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/[4]\\d\\d/",
            "degraded": 30,
            "failure": 40,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/[5]\\d\\d/",
            "degraded": 0,
            "failure": 0,
            "protocol": "/http/",
            "direction": "/.*/"
          }
        ]
      },
      {
        "namespace": "/.*/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/^5\\d\\d$/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/^4\\d\\d$/",
            "degraded": 10,
            "failure": 20,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/^[1-9]$|^1[0-6]$/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/grpc/",
            "direction": "/.*/"
          }
        ]
      }
    ]
  }
}
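The way omitted options pick up their defaults can be sketched as a dictionary merge. This is an illustration only: `with_defaults` and the two defaults dictionaries are hypothetical names, with values taken from the Default columns of the option tables.

```python
# Defaults as listed in the rate and tolerance option tables.
RATE_DEFAULTS = {"namespace": ".*", "kind": ".*", "name": ".*"}
TOLERANCE_DEFAULTS = {"direction": ".*", "protocol": ".*",
                      "degraded": 0, "failure": 0}


def with_defaults(rate):
    """Return a copy of a rate entry with every omitted option filled in."""
    filled = {**RATE_DEFAULTS, **rate}
    filled["tolerance"] = [{**TOLERANCE_DEFAULTS, **tolerance}
                           for tolerance in rate.get("tolerance", [])]
    return filled


# The alpha entry from the custom configuration above.
alpha = with_defaults({
    "namespace": "alpha",
    "tolerance": [
        {"code": "404", "failure": 10, "protocol": "http"},
        {"code": r"[45]\d[^\D4]", "protocol": "http"},
    ],
})
print(alpha["kind"], alpha["tolerance"][1]["failure"])
```

The second alpha tolerance ends up with both thresholds at 0, which is why any matching error in that tolerance is reported as FAILURE.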
What are we applying?
- For namespace alpha, all resources:
  - Protocol http: if the % of requests with error code 404 is >= 10 then FAILURE, if it is > 0 then DEGRADED.
  - Protocol http: if the % of requests with other error codes is > 0 then FAILURE.
- For namespace beta, all resources:
  - Protocol http: if the % of requests with error code 4xx is >= 40 then FAILURE, if it is >= 30 then DEGRADED.
  - Protocol http: if the % of requests with error code 5xx is > 0 then FAILURE.
- For other namespaces, Kiali applies the defaults:
  - Protocol http: if the % of requests with error code 5xx is >= 10 then FAILURE, if it is >= 0.1 then DEGRADED.
  - Protocol http: if the % of requests with error code 4xx is >= 20 then FAILURE, if it is >= 10 then DEGRADED.
  - Protocol grpc: if the % of requests with an error code matching /^[1-9]$|^1[0-6]$/ is >= 10 then FAILURE, if it is >= 0.1 then DEGRADED.
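Putting the rules together, the sketch below evaluates the three demo apps against the alpha and beta entries. It is a simplified model under stated assumptions, not Kiali's implementation: all traffic is treated as http, direction is ignored, the Rate column is read as relative request rates, a value of 0 counts as HEALTHY, and `health`, `status_for`, `RATES` and `RESPONSES` are hypothetical names.

```python
import re

# Alpha and beta entries with defaults already applied (thresholds in %).
RATES = [
    {"namespace": "alpha", "tolerance": [
        {"code": "404", "degraded": 0, "failure": 10},
        {"code": r"[45]\d[^\D4]", "degraded": 0, "failure": 0},
    ]},
    {"namespace": "beta", "tolerance": [
        {"code": r"[4]\d\d", "degraded": 30, "failure": 40},
        {"code": r"[5]\d\d", "degraded": 0, "failure": 0},
    ]},
]

# (response code, rate) rows from the demo table.
RESPONSES = {
    "x-server": [("200", 9), ("404", 1)],
    "y-server": [("200", 9), ("500", 1)],
    "z-server": [("200", 8), ("201", 1), ("201", 1)],
}

SEVERITY = ["HEALTHY", "DEGRADED", "FAILURE"]  # least to most severe


def status_for(value, degraded, failure):
    """Apply the threshold rules to a percentage of matching requests."""
    if value > 0:
        if value >= failure:
            return "FAILURE"
        if value >= degraded:
            return "DEGRADED"
    return "HEALTHY"


def health(namespace, app):
    """Worst status over all tolerances of the first matching rate entry."""
    # Only alpha and beta are modelled here; other namespaces would raise.
    rate = next(r for r in RATES if re.search(r["namespace"], namespace))
    responses = RESPONSES[app]
    total = sum(n for _, n in responses)
    worst = "HEALTHY"
    for tolerance in rate["tolerance"]:
        matched = sum(n for code, n in responses
                      if re.search(tolerance["code"], code))
        status = status_for(100.0 * matched / total,
                            tolerance["degraded"], tolerance["failure"])
        worst = max(worst, status, key=SEVERITY.index)
    return worst


for ns in ("alpha", "beta"):
    for app in RESPONSES:
        print(ns, app, health(ns, app))
```

Under these assumptions, x-server (10% 404 responses) is FAILURE in alpha but HEALTHY in beta, y-server (10% 500 responses) is FAILURE in both namespaces, and z-server is HEALTHY in both.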