Postgres parallel indexing in Citus

Indexes are an essential tool for optimizing database performance and are becoming ever more important with big data. However, as the volume of data increases, index maintenance often becomes a write bottleneck, especially for advanced index types which use a lot of CPU time for every row that gets written. Index creation may also become prohibitively expensive, as it may take hours or even days to build a new index on terabytes of data in Postgres.

As of Citus 6.0, we've made creating and maintaining indexes much faster through parallelization.

Citus can be used to distribute PostgreSQL tables across many machines. One of the many advantages of Citus is that you can keep adding machines with more CPUs, so you can keep increasing your write capacity even if indexes are becoming the bottleneck. As of Citus 6.0, CREATE INDEX can also be performed in a massively parallel fashion, allowing fast index creation on large tables. Moreover, the COPY command can write multiple rows in parallel when used on a distributed table, which greatly improves performance for use cases that can use bulk ingestion (e.g. sensor data, click streams, telemetry).

To show the benefits of parallel indexing, we'll walk through a small example of indexing ~200k rows containing large JSON objects from the GitHub archive. To run the examples, we set up a formation on Citus Cloud consisting of 4 worker nodes with 4 cores each, running PostgreSQL 9.6 with Citus 6.0.

You can download and decompress the sample data by running the following commands:

wget http://examples.citusdata.com/github_archive/github_events-2015-01-01-{0..24}.csv.gz
gzip -d github_events-2015-01-01-*.gz

Next, let's create the table for the GitHub events, first as a regular PostgreSQL table, and then distribute it across the 4 nodes:
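The column list below follows the schema used for this data set in the Citus tutorials (a sketch; adjust it if your CSV files differ), and the table is distributed by repo_id, the shard key this post queries on later:

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    actor jsonb,
    org jsonb,
    created_at timestamp
);

-- hash-distribute the table across the worker nodes (Citus 6.0 syntax)
SELECT create_distributed_table('github_events', 'repo_id');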

Each event in the GitHub data set has a detailed payload object in JSON format. Building a GIN index on the payload gives us the ability to quickly perform fine-grained searches on events, such as finding commits from a specific author. However, building such an index can be very expensive. Fortunately, parallel indexing makes this a lot faster by using all cores at the same time and building many smaller indexes:

CREATE INDEX github_events_payload_idx ON github_events USING GIN (payload);

| | Regular table | Distributed table | Speedup |
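Under the hood, Citus runs the CREATE INDEX separately on every shard, so each worker ends up with one small GIN index per local shard rather than a single big one. As a quick way to see this, you can connect to a worker node and list the shard indexes; Citus appends the shard ID to the index name (the exact IDs vary per cluster):

\di github_events_payload_idx_*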

To test how well this scales we took the opportunity to run our test multiple times. Interestingly, parallel CREATE INDEX exhibits a superlinear speedup, giving >16x speedup despite having only 16 cores. This is likely due to the fact that inserting into one big index is less efficient than inserting into a small, per-shard index (insertion cost grows with O(log N) for N rows), which gives an additional performance benefit to sharding.

| | Regular table | Distributed table | Speedup |
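As a rough back-of-the-envelope model (an assumption for illustration; real GIN insertion costs are more complicated), treat the cost of inserting a row as logarithmic in the size of the index it lands in. Splitting N rows over S shards then shrinks the log factor on top of the parallelism across C cores:

speedup ≈ C * log2(N) / log2(N / S)

For N = 219,000 rows, C = 16 cores, and a hypothetical S = 32 shards, log2(N) ≈ 17.7 and log2(N/S) ≈ 12.7, so the model predicts roughly 16 * 1.4 ≈ 22x, in line with the superlinear speedup we observed.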

Once the index is created, the COPY command also takes advantage of parallel indexing. Internally, COPY sends a large number of rows over multiple connections to different workers asynchronously, which then store and index the rows in parallel. This allows for much faster load times than a single PostgreSQL process could achieve. How much speedup you get depends on the data distribution: if all data goes to a single shard, performance will be very similar to plain PostgreSQL.

\COPY github_events FROM PROGRAM 'cat github_events-*.csv' WITH (FORMAT CSV)

| | Regular table | Distributed table | Speedup |
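As a sanity check on the distribution, the Citus metadata tables (available in Citus 6) let you count how many shards of the table each worker hosts; with a hash-distributed key like repo_id, the shards, and hence the COPY traffic, spread evenly across the nodes:

SELECT nodename, count(*) AS shard_count
FROM pg_dist_shard_placement
JOIN pg_dist_shard USING (shardid)
WHERE logicalrelid = 'github_events'::regclass
GROUP BY nodename
ORDER BY nodename;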

Finally, it's worth measuring the effect that the index has on query time. We try two different queries, one across all repos and one with a specific repo_id filter. This distinction is relevant to Citus because the github_events table is sharded by repo_id. A query with a specific repo_id filter goes to a single shard, whereas the other query is parallelised across all shards.

-- Get all commits by [email protected] from all repos
SELECT repo_id, jsonb_array_elements(payload -> 'commits')
FROM github_events
WHERE event_type = 'PushEvent'
  AND payload @> '{"commits":[{"author":{"email":"[email protected]"}}]}';

-- Get all commits by [email protected] from a single repo
SELECT repo_id, jsonb_array_elements(payload -> 'commits')
FROM github_events
WHERE event_type = 'PushEvent'
  AND payload @> '{"commits":[{"author":{"email":"[email protected]"}}]}'
  AND repo_id = 17330407;
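If you want to check how Citus routes each query, prepend EXPLAIN: the plan shows whether the query is sent to a single shard or fanned out across all shards (output not shown here, since its exact shape depends on the Citus version):

EXPLAIN SELECT repo_id, jsonb_array_elements(payload -> 'commits')
FROM github_events
WHERE event_type = 'PushEvent'
  AND payload @> '{"commits":[{"author":{"email":"[email protected]"}}]}'
  AND repo_id = 17330407;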

On 219k rows, this gives us the query times below. Times marked with * are for queries that are executed in parallel by Citus. Parallelisation adds some fixed overhead, but also allows for more heavy lifting, which is why it can be either much faster or a bit slower than queries on a regular table.

| | Regular table | Distributed table |

Indexes in PostgreSQL can dramatically reduce query times, but at the same time dramatically slow down writes. Citus gives you the possibility of scaling out your cluster to get good performance on both sides of the pipeline. A particular sweet spot for Citus is parallel ingestion combined with single-shard queries, which gives querying performance better than regular PostgreSQL, with much higher and more scalable write throughput.

If you would like to learn more about parallel indexes or other ways in which Citus helps you scale, consult our documentation or give us a ping on Slack. You can also get started with Citus in minutes by setting up a managed cluster on Citus Cloud or spinning up a cluster on your desktop.
