Optimizing Read and Write Latency
Reducing latency in reads and writes in YugabyteDB
Today’s global and distributed applications often need to serve user requests from a single data source across different regions. While a distributed database provides data scaling and protection against network outages, it must also ensure low-latency access to data in order to deliver a seamless user experience. YugabyteDB, a distributed SQL database, is designed to handle global data workloads efficiently. In this blog post, I’ll share some techniques to optimize read and write latency in a multi-region YugabyteDB cluster.
My Test Setup
For my tests, I set up a three-node YugabyteDB cluster spanning the us-central1, us-east1, and us-west1 regions on Google Cloud Platform (GCP). I designated the central and west nodes as preferred leaders, meaning they would host leader tablets for the cluster unless one of these preferred regions goes offline. In some cases, a single preferred region is, well, preferable… but for this test, the assumption is that application servers in both central and west are required to read and write data concurrently.
Understanding Latency
Latency is generally understood to be the amount of time it takes for a request to be served to an application. By far the largest share of this time is the network time it takes for data to travel between different parts of a distributed system. In a global database cluster, this network latency can significantly impact the performance of read and write operations, especially when data must traverse long distances between regions.
Measuring Network Latency
To measure base network latency, I used the ping utility to send packets to the database nodes and measured the round-trip time. As expected, the ping times to the node in the same region were significantly lower (under 1ms) compared to cross-region pings (around 32ms). The application server (app01) used is in the central region.
[valerie@app01 bin]$ ping 10.200.0.63
PING 10.200.0.63 (10.200.0.63) 56(84) bytes of data.
64 bytes from 10.200.0.63: icmp_seq=1 ttl=64 time=0.903 ms
64 bytes from 10.200.0.63: icmp_seq=2 ttl=64 time=0.183 ms
64 bytes from 10.200.0.63: icmp_seq=3 ttl=64 time=0.193 ms
64 bytes from 10.200.0.63: icmp_seq=4 ttl=64 time=0.185 ms
[valerie@app01 bin]$ ping 10.204.0.78
PING 10.204.0.78 (10.204.0.78) 56(84) bytes of data.
64 bytes from 10.204.0.78: icmp_seq=1 ttl=64 time=32.5 ms
64 bytes from 10.204.0.78: icmp_seq=2 ttl=64 time=31.9 ms
64 bytes from 10.204.0.78: icmp_seq=3 ttl=64 time=31.10 ms
64 bytes from 10.204.0.78: icmp_seq=4 ttl=64 time=32.0 ms
[valerie@app01 bin]$ ping 10.202.0.47
PING 10.202.0.47 (10.202.0.47) 56(84) bytes of data.
64 bytes from 10.202.0.47: icmp_seq=1 ttl=64 time=32.2 ms
64 bytes from 10.202.0.47: icmp_seq=2 ttl=64 time=31.6 ms
64 bytes from 10.202.0.47: icmp_seq=3 ttl=64 time=31.7 ms
64 bytes from 10.202.0.47: icmp_seq=4 ttl=64 time=31.6 ms
The ping values, and therefore the network latency contribution to read and write latency in a database cluster, will be affected by which regions are selected for the cluster. For example, a cluster across 3 data centers within a single region would have much lower network latency. (However, using a single region exposes you to outage risks caused by localized network failures, a topic for another day.)
Test Data Provisioning
The above ping tests give some hints to the read and write latencies in this cluster. To test the results of read and write operations, I created a test table with 10,000 rows and verified the two preferred leader regions (configured in the control plane UI). Note the use of the yb-admin tool included with YugabyteDB; it has many useful options that are covered in the YugabyteDB documentation.
create database db1;
\c db1
create table table1 (
id bigserial primary key,
name text
);
insert into table1 (name)
select substr(md5(random()::text), 1, 10)
from generate_series(1, 10000);
export MASTERS=10.200.0.63:7100,10.204.0.78:7100,10.202.0.47:7100
export CERTS_DIR=/home/yugabyte/yugabyte-client-tls-config
export PATH=$PATH:/home/yugabyte/tserver/bin
yb-admin --master_addresses $MASTERS --certs_dir_name $CERTS_DIR list_tablets ysql.db1 table1
# output
Tablet-UUID Range Leader-IP Leader-UUID
d83cb27d8dd0492c805cc9d9331a9173 partition_key_start: "" partition_key_end: "\200\000" 10.200.0.63:9100 b3f14a31fc7d400bb08e11a48a837088
b06e07e987e14219b68af03ffc52230f partition_key_start: "\200\000" partition_key_end: "" 10.202.0.47:9100 45da9591dfed433f8134cc4dd1c11e03
After writing the data, I flushed it to disk so the SST files would be available for the following tests.
yb-admin --master_addresses $MASTERS --certs_dir_name $CERTS_DIR flush_table ysql.db1 table1
# output
Flushed [db1.table1] tables.
Identifying IDs to Use in Testing
The sst_dump tool (also included with distributions of YugabyteDB) allows you to read the contents of the SST files. The primary key “id” from table1 is shown in the first set of brackets on each line.
# set environment variables if needed
export MASTERS=10.200.0.63:7100,10.204.0.78:7100,10.202.0.47:7100
export CERTS_DIR=/home/yugabyte/yugabyte-client-tls-config
export PATH=$PATH:/home/yugabyte/tserver/bin
# node1 (central, 10.200.0.63) is leader for d83cb27d8dd0492c805cc9d9331a9173
# and follower for b06e07e987e14219b68af03ffc52230f (see above)
sst_dump --command=scan --file=/mnt/d0/yb-data/tserver/data/rocksdb/table-00004000000030008000000000004402/tablet-d83cb27d8dd0492c805cc9d9331a9173 --output_format=decoded_regulardb | head -n 10
# output
SubDocKey(DocKey(0x0000, [4443], []), [SystemColumnId(0); HT{ physical: 1715012874025300 w: 4356 }]) -> null
...
sst_dump --command=scan --file=/mnt/d0/yb-data/tserver/data/rocksdb/table-00004000000030008000000000004402/tablet-b06e07e987e14219b68af03ffc52230f --output_format=decoded_regulardb | head -n 10
# output
SubDocKey(DocKey(0x8006, [1529], []), [SystemColumnId(0); HT{ physical: 1715012874025300 w: 8352 }]) -> null
...
To get a sample ID for a row whose leader tablet resides in each region, I looked on the central node for both a leader tablet and a follower tablet whose leader is on the west node. Note that since the east node isn’t a preferred leader, all of its tablets are followers.
By finding the follower on the central node for a leader on the west node, I could identify a row ID that belongs to a leader tablet in the west region. This works because in this three-node cluster, each tablet has two followers in addition to its leader, so every node holds a copy of every tablet.
Baseline Read and Write Performance
Next, I performed read and write operations from the us-central1 and us-west1 regions. The results showed that when the leader tablet was local, reads and writes were fast (around 1-2ms). However, when the leader tablet was in a different region, the latency increased by roughly the cross-region network round-trip time.
\c db1
\timing ON
-- leader tablet is local
select id, name from table1 where id=4443;
Time: 1.530 ms
-- leader tablet is in west
select id, name from table1 where id=1529;
Time: 33.399 ms
Optimizing Reads: Follower Reads
For read-heavy workloads where some data staleness is acceptable, YugabyteDB offers follower reads. Setting the yb_read_from_followers session variable allows queries to be served from follower tablets instead of leader tablets, enabling local reads even when the leader tablet is in a different region.
SET session characteristics as transaction read only;
SET yb_read_from_followers = true;
SET yb_follower_read_staleness_ms = 30;
The amount of tolerable data staleness is configured using the yb_follower_read_staleness_ms parameter, which defaults to 30 seconds (30000 ms). Follower reads return data as of the current time minus the configured staleness; if a follower has not yet caught up to that timestamp, the request is forwarded to the leader tablet instead of being served locally.
My tests showed that follower reads reduced cross-region read latency from around 33ms to around 1.5ms.
-- query from us-central1 app connected to us-central1 database node
\c db1
\timing ON
-- leader tablet is local
select id, name from table1 where id=4443;
Time: 1.554 ms
-- leader tablet is in west
select id, name from table1 where id=1529;
Time: 1.398 ms
-- query from us-west1 app connected to us-west1 database node
\c db1
\timing ON
-- leader tablet is local
select id, name from table1 where id=1529;
Time: 1.487 ms
-- leader tablet is in central
select id, name from table1 where id=4443;
Time: 1.517 ms
Writes: Updates
When performing updates, the time spent includes the update on the leader tablet and the subsequent replication to the follower tablets. For a local leader tablet, the initial update is very fast, taking less than 1ms. Replicating to the followers takes around 30ms due to network latency; with a replication factor of 3 (RF 3), one follower must acknowledge the write (giving a majority of the three replicas) before the write is acknowledged to the application. In the cluster described in these tests, if the leader tablet is in a remote region, the update incurs an additional ~30ms of network latency to reach the leader first, and then another ~30ms to replicate the update from that node to a follower.
The 30ms latency seen in this test cluster is due to the network distance between the us-central1 and us-west1 regions. (Again, your specific latency numbers may vary depending on the regions your cluster spans.)
The code examples show the time taken for updates when the leader is local (around 35ms) versus when the leader is remote (around 67ms), tested from both the us-central1 and us-west1 regions.
-- update from us-central1 app connected to us-central1 database node
\c db1
\timing ON
SET yb_read_from_followers = false;
-- leader tablet is local
update table1 set name='updated1' where id=4443;
Time: 35.549 ms
-- leader tablet is in west
update table1 set name='updated2' where id=1529;
Time: 67.882 ms
-- update from us-west1 app connected to us-west1 database node
\c db1
\timing ON
-- leader tablet is local
update table1 set name='updated3' where id=1529;
Time: 35.661 ms
-- leader tablet is in central
update table1 set name='updated4' where id=4443;
Time: 67.556 ms
Writes: Inserts
For insert operations in a multi-region cluster, the latency again depends on whether the insert goes to a local or remote preferred leader node first.
-- insert from us-central1 app connected to us-central1 database node
\c db1
\timing ON
-- new row hashed to a tablet with a local leader
insert into table1 (name) values ('newname1');
Time: 34.951 ms
-- new row hashed to a tablet with a remote leader
insert into table1 (name) values ('newname2');
Time: 67.278 ms
-- insert from us-west1 app connected to us-west1 database node
\c db1
\timing ON
-- new row hashed to a tablet with a local leader
insert into table1 (name) values ('newname3');
Time: 33.916 ms
-- new row hashed to a tablet with a remote leader
insert into table1 (name) values ('newname4');
Time: 67.242 ms
If the insert lands on a tablet whose leader is local, the write to the leader is very fast (< 1ms); the leader then needs to replicate to one follower in another region, which incurs around 30ms of network latency. If the insert lands on a tablet whose leader is remote, reaching the leader costs ~30ms, and the leader must still replicate to one follower before acknowledging, adding another ~30ms.
So for inserts, in this deployment pattern, you will see latencies around 30-35ms if the preferred leader is local, and around 65-70ms if the preferred leader is in a remote region. With two preferred regions in a 3-node cluster and two application servers in those regions, about 50% of the writes will be local and about 50% will be remote. The code examples above show these latencies when inserting from both the us-central1 and us-west1 regions, connected to their local database nodes.
Optimizing Reads: Duplicate Indexes
For read-heavy workloads, duplicate indexes are an optimization supported by YugabyteDB. By creating identical covering indexes on the same table in different tablespaces, each with a region-specific leader preference, reads can be served locally in every region. In each region, rows whose main-table tablet leader is local are read from the table itself, while other rows are read from the local copy of the index.
-- tablespace for west data
CREATE TABLESPACE west WITH (
replica_placement= '{
"num_replicas" : 3,
"placement_blocks" : [
{"cloud":"gcp","region":"us-west1","zone":"us-west1-a","leader_preference": 1,"min_num_replicas":1},
{"cloud":"gcp","region":"us-east1","zone":"us-east1-b","min_num_replicas":1},
{"cloud":"gcp","region":"us-central1","zone":"us-central1-a","min_num_replicas":1}
]}');
-- tablespace for central data
CREATE TABLESPACE central WITH (
replica_placement= '{
"num_replicas" : 3,
"placement_blocks" : [
{"cloud":"gcp","region":"us-west1","zone":"us-west1-a","min_num_replicas":1},
{"cloud":"gcp","region":"us-east1","zone":"us-east1-b","min_num_replicas":1},
{"cloud":"gcp","region":"us-central1","zone":"us-central1-a","leader_preference": 1,"min_num_replicas":1}
]}');
-- tablespace for east data
CREATE TABLESPACE east WITH (
replica_placement= '{
"num_replicas" : 3,
"placement_blocks" : [
{"cloud":"gcp","region":"us-west1","zone":"us-west1-a","min_num_replicas":1},
{"cloud":"gcp","region":"us-east1","zone":"us-east1-b","leader_preference": 1,"min_num_replicas":1},
{"cloud":"gcp","region":"us-central1","zone":"us-central1-a","min_num_replicas":1}
]}');
CREATE INDEX idx_west ON table1 (name) INCLUDE (id) TABLESPACE west;
CREATE INDEX idx_east ON table1 (name) INCLUDE (id) TABLESPACE east;
CREATE INDEX idx_central ON table1 (name) INCLUDE (id) TABLESPACE central;
The performance impact of the duplicate indexing strategy is demonstrated by running queries from the central database node with different index combinations. Here, the leader tablet for ‘monday1’ is in the central region, and the leader tablet for ‘monday2’ is in the west region (however, its index leader tablet is in the central region).
\c db1
\timing ON
select id, name from table1 where name='monday1';
Time: 1.402 ms
select id, name from table1 where name='monday2';
Time: 1.444 ms
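To check which index the planner actually chose for the second query, you can inspect the query plan. This is a sketch; the exact plan output varies by YugabyteDB version and by which node you run it from, but the expectation is that the central node's planner prefers idx_central, whose leader tablets are local, over the remote indexes.

```sql
-- Run on the central node. Expect an index scan on idx_central
-- rather than on idx_west or idx_east (output omitted; plans vary).
EXPLAIN (COSTS OFF) SELECT id, name FROM table1 WHERE name = 'monday2';
```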
As usual, adding indexes increases the number of writes required for each statement (the main table plus one or more index tables). This optimization therefore comes at the cost of increased write latency, since every duplicate index must be updated.
insert into table1 (name) values ('tuesday2');
Time: 98.483 ms
update table1 set name='tuesday3' where id=10502;
Time: 156.993 ms
The duplicate indexing strategy allows for geographically distributed data replication across the cluster to optimize for both query performance and data availability.
Optimizing for Geolocation: Latency-Optimized Geo-Partitioning
Another optimization technique is latency-optimized geo-partitioning. This approach involves adding a column to the table that stores the region name, which can be populated using the yb_server_region() function.
-- on central database node
select yb_server_region();
yb_server_region
------------------
us-central1
-- on west database node
select yb_server_region();
yb_server_region
------------------
us-west1
Queries can then filter data based on the region column, ensuring that data is retrieved from the local region whenever possible. Although this feature is called “geolocation,” the column value is not required to correspond to geography; the optimization comes from ensuring that an application that writes to a local region will generally read from that same region. This can be accomplished in a variety of ways (more on this in a later post).
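As a sketch of what this can look like (the table and partition names here are illustrative, not from the tests above, and the example reuses the west/central/east tablespaces defined in the duplicate-index section), the table can be list-partitioned on the region column so each region's rows get leader tablets in that region:

```sql
-- Illustrative geo-partitioned table; names are hypothetical.
-- The partition key column defaults to the local region name,
-- and must be part of the primary key on a partitioned table.
CREATE TABLE users (
    id bigserial,
    name text,
    geo_partition text DEFAULT yb_server_region(),
    PRIMARY KEY (id, geo_partition)
) PARTITION BY LIST (geo_partition);

CREATE TABLE users_west PARTITION OF users
    FOR VALUES IN ('us-west1') TABLESPACE west;
CREATE TABLE users_central PARTITION OF users
    FOR VALUES IN ('us-central1') TABLESPACE central;
CREATE TABLE users_east PARTITION OF users
    FOR VALUES IN ('us-east1') TABLESPACE east;

-- A row inserted without specifying geo_partition lands in the
-- partition for the region of the node serving the insert, and a
-- query filtering on the column prunes to that local partition:
SELECT id, name FROM users WHERE geo_partition = yb_server_region();
```

With the tablespaces above, each partition keeps replicas in all three regions for availability, while its leader preference keeps leaders (and therefore low-latency reads and writes) in the partition's home region.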
Conclusion
Distributed databases are primarily used to scale data size and increase availability while reducing the impact of network outages. Optimizing read and write latency remains an additional critical requirement in such a system in order to provide a seamless user experience.
By leveraging techniques like follower reads, duplicate indexes, and geo-partitioning, databases can be architected to meet the specific requirements of an application. Whether an application requires optimization for reads or writes, or a balance between the two, YugabyteDB offers techniques to optimize latency in global data workloads.