Thanos – a Scalable Prometheus with Unlimited Storage

Improbable, a British technology company that focuses on large-scale simulations in the cloud, has a large Prometheus deployment for monitoring its dozens of Kubernetes clusters. An out-of-the-box Prometheus setup had difficulty meeting its requirements around querying historical data, querying across distributed Prometheus servers in a single API call, and merging replicated data from multiple Prometheus HA setups.

Prometheus has existing high-availability features – highly available alerting and federated deployments. In a federation, a global Prometheus server aggregates data across other Prometheus servers, potentially in multiple datacenters. Each server sees only a portion of the metrics. To handle more load, multiple Prometheus servers can run in a single datacenter with horizontal sharding.
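As a sketch, federation on the global server is configured as a regular scrape job against the `/federate` endpoint of the lower-level servers (the job name, selector, and targets below are illustrative):

```yaml
# Scrape config on the global Prometheus server.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true            # keep the original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'   # pull only selected series upward
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
          - 'prometheus-dc2:9090'
```

The `match[]` selector is what keeps each federated scrape to a subset of the series, so the global server aggregates rather than duplicates the full load.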

In a sharding setup, slave servers scrape subsets of the data and a master aggregates them. Querying a specific machine involves querying the specific slave that scraped its data. By default, Prometheus stores time series data for 15 days. To store data for unlimited periods, Prometheus has a remote write endpoint for sending data to another datastore alongside its local one. However, de-duplication of data is a problem with this approach. Other solutions like Cortex provide scalable long-term storage when used as a remote write endpoint, and a compatible querying API.
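Pointing Prometheus at such a remote write endpoint takes a single configuration block; the URL below is illustrative, not a real Cortex endpoint:

```yaml
# prometheus.yml fragment.
remote_write:
  - url: 'http://cortex.example.com/api/prom/push'
```

The local 15-day window itself is governed by Prometheus' `--storage.tsdb.retention` flag, independently of any remote write target.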

Thanos' architecture introduces a central query layer across all the servers via a sidecar component, which sits alongside each Prometheus server, and a central Querier component that responds to PromQL queries. Together these make up a Thanos deployment. Inter-component communication is via the memberlist gossip protocol. The Querier can scale horizontally since it is stateless and acts as an intelligent reverse proxy, passing on requests to the sidecars, aggregating their responses and evaluating the PromQL query against them.
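A minimal deployment sketch follows; it assumes the static `--store` flags of later Thanos releases in place of the gossip-based peer discovery described above, and all addresses and ports are illustrative:

```shell
# Run a sidecar next to each Prometheus server.
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/prometheus \
  --grpc-address=0.0.0.0:10901

# Run one Querier that fans out to all sidecars over gRPC.
thanos query \
  --http-address=0.0.0.0:19192 \
  --store=prometheus-a:10901 \
  --store=prometheus-b:10901
```

Because the Querier holds no state of its own, several identical `thanos query` instances can be run behind a load balancer for horizontal scaling.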

Thanos solves the storage retention problem by using an object storage as the backend. The sidecar StoreAPI component detects whenever Prometheus writes data to disk, and uploads them to object storage. The same Store component also functions as a retrieval proxy over the gossip protocol, letting the Querier component talk to it to fetch data.

In Thanos, the query is always evaluated in a single place – the root Querier, which listens over HTTP for PromQL queries. The vanilla PromQL engine from Prometheus 2.2.1 is used to evaluate the query, which deduces which time series, and for which time ranges, data needs to be fetched. Basic filtering (based on time ranges and external labels) filters out the StoreAPIs (leafs) that cannot provide the desired data, and the remaining ones are invoked. The results from the different sources are then merged – appended together time-wise.
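The time-wise merge can be sketched as follows; the `Sample` type and `mergeTimewise` function are illustrative names, not Thanos' actual internals:

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one (timestamp, value) point of a series.
type Sample struct {
	T int64
	V float64
}

// mergeTimewise appends partial results for the same series from several
// StoreAPIs into one ascending sequence, dropping exact timestamp duplicates
// (as can happen when a sidecar and a store both return a recent block).
func mergeTimewise(parts ...[]Sample) []Sample {
	var all []Sample
	for _, p := range parts {
		all = append(all, p...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].T < all[j].T })

	out := all[:0]
	var lastT int64 = -1
	for _, s := range all {
		if s.T == lastT {
			continue // duplicate timestamp from another source
		}
		out = append(out, s)
		lastT = s.T
	}
	return out
}

func main() {
	sidecar := []Sample{{T: 300, V: 1}, {T: 360, V: 2}}
	store := []Sample{{T: 180, V: 0}, {T: 300, V: 1}}
	fmt.Println(mergeTimewise(sidecar, store))
	// merged and deduplicated: [{180 0} {300 1} {360 2}]
}
```

The PromQL engine then evaluates the query against this single merged sequence, exactly as it would against local data.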

StoreAPIs propagate their external labels and the time range they have data for, so basic filtering can be done on these. However, if you don't specify any of them in the query (you just want all the "up" series), the Querier concurrently asks all the StoreAPI servers. Also, there might be some duplication of results between sidecar and store data, which might not be easy to avoid.
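The pre-filtering logic can be sketched like this; `StoreInfo` and `matches` are illustrative names, and the matchers are simplified to label equality only:

```go
package main

import "fmt"

// StoreInfo is what each StoreAPI advertises: its external labels and the
// time range it holds data for.
type StoreInfo struct {
	ExternalLabels map[string]string
	MinT, MaxT     int64
}

// matches reports whether a store could hold data for a query over
// [minT, maxT] with the given equality label matchers. Stores failing this
// check are never contacted.
func matches(s StoreInfo, minT, maxT int64, matchers map[string]string) bool {
	if s.MaxT < minT || s.MinT > maxT {
		return false // no time overlap
	}
	for name, want := range matchers {
		if got, ok := s.ExternalLabels[name]; ok && got != want {
			return false // an external label contradicts the matcher
		}
		// A label absent on the store cannot rule it out.
	}
	return true
}

func main() {
	eu := StoreInfo{ExternalLabels: map[string]string{"region": "eu"}, MinT: 0, MaxT: 1000}
	us := StoreInfo{ExternalLabels: map[string]string{"region": "us"}, MinT: 0, MaxT: 1000}
	q := map[string]string{"region": "eu"}
	fmt.Println(matches(eu, 100, 200, q), matches(us, 100, 200, q)) // true false
}
```

With no matchers at all, every store with an overlapping time range passes the filter – which is exactly the all-stores fan-out described above for a bare "up" query.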

The StoreAPI component understands the Prometheus data format, so it can optimize the query execution plan and cache specific indices of the blocks to respond quickly to user queries. This avoids the need to cache huge amounts of data. In an HA Prometheus setup with Thanos sidecars, would there be issues with multiple sidecars attempting to upload the same data blocks to object storage? Płotka responded:

Thanos also provides compaction and downsampled storage of time series data. Prometheus has a built-in compaction model where existing smaller data blocks are rewritten into larger ones and restructured to improve query performance. Thanos utilizes the same mechanism in a Compactor component that runs as a batch job and compacts the object storage data. The Compactor downsamples the data too, and "the downsampling interval is not configurable at this point but we have chosen some sane intervals – 5m and 1h", says Płotka. Compaction is a common feature in other time series databases like InfluxDB and OpenTSDB.
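The idea behind downsampling can be sketched as bucketing raw samples into fixed windows while keeping per-window aggregates, so that rate, average, min, and max queries still work over old data. This mirrors the concept, not the exact on-disk format, of Thanos' 5m/1h downsampling; all type and function names are illustrative, and the input is assumed to be sorted by timestamp:

```go
package main

import "fmt"

// Sample is one raw (timestamp, value) point; T is in milliseconds.
type Sample struct {
	T int64
	V float64
}

// AggrSample is one downsampled point: aggregates over a fixed window.
type AggrSample struct {
	Window              int64 // window start timestamp
	Count               int
	Sum, Min, Max, Last float64
}

// downsample buckets time-sorted raw samples into windows of res milliseconds.
func downsample(raw []Sample, res int64) []AggrSample {
	var out []AggrSample
	for _, s := range raw {
		w := (s.T / res) * res
		if len(out) == 0 || out[len(out)-1].Window != w {
			out = append(out, AggrSample{Window: w, Min: s.V, Max: s.V})
		}
		a := &out[len(out)-1]
		a.Count++
		a.Sum += s.V
		a.Last = s.V
		if s.V < a.Min {
			a.Min = s.V
		}
		if s.V > a.Max {
			a.Max = s.V
		}
	}
	return out
}

func main() {
	// Three raw samples downsampled at 5m (300000ms) resolution.
	raw := []Sample{{T: 0, V: 1}, {T: 60000, V: 3}, {T: 300000, V: 2}}
	ds := downsample(raw, 300000)
	fmt.Println(len(ds), ds[0].Count, ds[0].Max) // 2 2 3
}
```

Keeping count/sum/min/max/last per window, rather than a single averaged value, is what lets the different PromQL aggregation functions remain answerable after the raw samples are gone.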