the apiserver is a good source of metrics because, unlike most other components, every other component has to talk to it. it exposes /metrics plus health endpoints (/healthz, and /livez and /readyz as of 1.16).
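a minimal sketch of poking those endpoints, assuming your kubeconfig has enough permission to hit them (kubectl get --raw reuses your credentials):

```
kubectl get --raw /healthz
kubectl get --raw /livez     # 1.16+
kubectl get --raw /readyz    # 1.16+
kubectl get --raw /metrics | head
```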
audit logs are ostensibly a security feature, but they're also a great source of queryable, structured logs from the apiserver.
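a sketch of what "queryable" can look like, assuming the JSON log backend and an --audit-log-path of /var/log/kubernetes/audit.log (both are apiserver flags, and that path is just an example):

```
# who has been deleting things, and what did they delete?
jq -r 'select(.verb == "delete")
       | [.user.username, .objectRef.resource, .objectRef.name] | @tsv' \
  /var/log/kubernetes/audit.log
```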
etcd is the database that backs the apiserver, so it's another juicy metrics target.
etcd has a /metrics and a /health endpoint, but there are also tools like auger and etcdctl.
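a sketch of hitting those endpoints directly; etcd usually requires client TLS, and the kubeadm-style cert paths below are an assumption:

```
ca=/etc/kubernetes/pki/etcd/ca.crt
cert=/etc/kubernetes/pki/etcd/server.crt
key=/etc/kubernetes/pki/etcd/server.key
curl --cacert $ca --cert $cert --key $key https://127.0.0.1:2379/health    # {"health":"true"}
curl --cacert $ca --cert $cert --key $key https://127.0.0.1:2379/metrics | head
```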
auger is built specifically for kubernetes, so unlike etcdctl it can decode the kubernetes protobufs stored in etcd.
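a sketch of that, with the key path as an illustrative example and the etcdctl TLS flags omitted for brevity:

```
# pull a raw value out of etcd and let auger turn the protobuf back into something readable
ETCDCTL_API=3 etcdctl get /registry/pods/kube-system/kube-apiserver-master-0 \
  --print-value-only | auger decode
```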
etcd isn't a kubernetes component, so things like its health endpoint and logs are in different formats.
time for an example! problem: a node is down. how do we know? you can look for the nodes whose prometheus scrapes are failing.
but things aren't always so neat in production. a grey failure could just be a very slow node, as indicated by a long scrape duration.
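two PromQL sketches for that, assuming the nodes are scraped under a job called "kubelet" and that 5s counts as suspiciously slow in your setup:

```
# hard down: the scrape itself is failing
up{job="kubelet"} == 0

# grey failure: the scrape works, but it's slow
scrape_duration_seconds{job="kubelet"} > 5
```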
this doesn't need any innate knowledge of the node, perfect for black-box debugging!
another problem: a crash-looping apiserver, detected either by monitoring its healthcheck endpoints or by the master kubelet's liveness and readiness metrics.
is the process crashing immediately, or is it getting killed by the kubelet (e.g. because the liveness check failed)?
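one way to tell them apart, assuming a kubeadm-style static-pod apiserver (the component label is a kubeadm convention):

```
kubectl -n kube-system describe pod -l component=kube-apiserver
# "Last State: Terminated, Exit Code: 1"             -> the process is dying on its own
# an event about a failed liveness probe + "Killing" -> the kubelet is restarting it
```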
/healthz?verbose can help us here, if we can get a curl off before the crashloop. in this example, we can see etcd is down. and because etcd is down, the apiserver fails its own healthcheck and crashloops.
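a sketch of that check; 6443 is the usual secure port, and anonymous access to /healthz is allowed by default, so the curl fallback usually works even when kubectl auth is being awkward:

```
kubectl get --raw '/healthz?verbose'
# or, straight at the apiserver:
curl -k 'https://127.0.0.1:6443/healthz?verbose'
# healthy checks show up as "[+]ping ok", failing ones as "[-]etcd failed: ..."
```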
the default storage limit for etcd is 2GB, so etcd_object_count can be a clue to an overwhelmed etcd.
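a quick PromQL sketch to see what's filling it up (etcd_object_count is exposed by the apiserver itself, with one series per resource type):

```
# which resource types have the most objects?
topk(10, etcd_object_count)
```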
etcd objects aren't all the same size. the watch mechanism means etcd has to store historical versions of an object, so a very frequently updated object can take up a disproportionate amount of space in etcd!
etcd only compacts every 5 minutes, so you can easily blow past the storage limit in between. you can use kubectl to investigate if the apiserver is up, but if it's not, this is where auger shines!
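a sketch of poking at this from the etcd side, TLS flags omitted again:

```
# how big is the database right now? (see the DB SIZE column)
ETCDCTL_API=3 etcdctl endpoint status --write-out=table

# what's actually in there? list keys under the kubernetes prefix
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | head -50
```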
probably don't store images in etcd, btw. it's not meant for that.
last example: a slow apiserver! let's take a look at request latency. it turned out these metrics were useless, because the bucket sizes were too coarse.
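for reference, the usual latency query looks something like this; the metric name is the post-1.14 one (older releases exposed apiserver_request_latencies_* in microseconds instead):

```
# p99 apiserver request latency by verb, over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))
```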
bad metrics make life difficult for cluster operators, but updating them is hard! changing metrics could cause false positives or negatives in people's monitoring.
bad metrics can't be disabled, so getting rid of one needs a full upgrade. sig-instrumentation did an overhaul of broken metrics in 1.14, fixing things like:
- labels didn't match
- wrong data types
- units weren't standardized, and many were even outright incorrect!
how do we deprecate bad metrics? sig-instrumentation is going to treat metrics as a proper API. we'll have stability levels for all metrics going forward, so we can mark metrics as deprecated.
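the stability level (and any deprecation) ends up annotated right in the HELP text on /metrics, roughly like this (the second line is a made-up example of the deprecation format):

```
# HELP apiserver_request_total [STABLE] Counter of apiserver requests ...
# HELP some_old_metric [ALPHA] (Deprecated since 1.14.0) Counter of ...
```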
what metrics are stable? we still need criteria and a promotion process. and we need runtime flags for disabling individual metrics.
distributed tracing and structured logs are (hopefully) coming soon too!
(that's all for today. Thanks @ehashdn and @LogicalHan!) #KubeCon