Monitoring
Monitor the throughput and latency graphs in the Cloud console. As more splits are created, the tool will be able to generate progressively higher throughput, eventually reaching and sustaining the expected production peak traffic.
p99 latencies typically stabilize within the first ~30 minutes, while p99.9 latencies can take close to an hour. Throughput should approach the expected level within the first ~30 minutes as well.
Best Practices
Splits are created based on the usage and schema of the database, so reusing an existing database may exhibit different characteristics than a new one.
Reusing a database by only dropping a table may preserve unrelated splits and cause unexpected behavior.
If you reuse an existing table for warmup, track the random data the tool writes to the database. One suggested approach is to create a nullable commit_timestamp column that gcsb will auto-fill; after warmup, the synthetic rows can be deleted and the commit_timestamp column dropped.
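As a sketch of the suggested tracking column (the table name `Warmup` is a placeholder for your own schema), the column can be added and later dropped with Cloud Spanner DDL:

```sql
-- Nullable column that Spanner fills with the commit timestamp when a
-- row is written with the PENDING_COMMIT_TIMESTAMP() sentinel value.
ALTER TABLE Warmup
  ADD COLUMN commit_timestamp TIMESTAMP OPTIONS (allow_commit_timestamp = true);

-- After warmup (and after deleting the synthetic rows), drop it again.
ALTER TABLE Warmup DROP COLUMN commit_timestamp;
```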
Create the database with the complete production schema, including indexes.
Caveat: The tool currently doesn’t support the following features.
The warmup tool provides the ability to tune the generated data so that it is representative of the production workload:
All primary keys must be in the same keyspace as in the production database to create appropriate splits.
Column data can also be configured for size so that the overall row length matches the production load.
It is recommended that all your secondary indexes (including interleaved indexes) are created before starting the warmup. Indexes will also split automatically based on the data load. It is important to load index primary keys in the same keyspace as the production workload to get the appropriate splits.
For interleaved tables, the warmup process may differ based on how your child tables are used:
If your child table is expected to have many rows (1000+) under a single parent row, then it is advisable to add rows (approximately in the same order of magnitude as it would be in production) within the child table during the warmup process.
If you only have a few rows under each parent row, however, warming up the parent table is sufficient.
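The schema elements discussed above can be sketched in Cloud Spanner DDL (the table, column, and index names are hypothetical examples, not part of the tool):

```sql
-- Hypothetical secondary index, created before warmup so that it
-- splits alongside the base table as data is loaded.
CREATE INDEX SingersByName ON Singers(Name);

-- Hypothetical interleaved child table; if it will hold many rows per
-- parent in production, load it during warmup as well.
CREATE TABLE Albums (
  SingerId INT64 NOT NULL,
  AlbumId  INT64 NOT NULL,
  Title    STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
```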
The warmup process may take from several minutes up to an hour to stabilize your system. It is recommended to run the warmup for at least the ballpark durations below:
For a Spanner instance with up to 50 nodes, it may take about an hour to warm up and perform at stable QPS and latencies. For larger instances, add about 5-10 minutes each time the instance size doubles. Example: a 50-node instance takes up to 60 minutes, 100 nodes may take ~70 minutes, and 1000 nodes about 100 minutes.
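The ballpark above amounts to a simple rule of thumb: roughly 60 minutes for up to 50 nodes, plus up to ~10 minutes per doubling beyond that. A small sketch of that arithmetic (the function name and the 10-minutes-per-doubling constant are illustrative assumptions, not part of the tool):

```python
import math

def estimated_warmup_minutes(nodes: int) -> float:
    """Rough upper-bound warmup estimate: ~60 minutes for up to 50
    nodes, plus ~10 minutes per doubling of instance size beyond 50."""
    if nodes <= 50:
        return 60.0
    return 60.0 + 10.0 * math.log2(nodes / 50)

print(estimated_warmup_minutes(50))           # 60.0
print(estimated_warmup_minutes(100))          # 70.0
print(round(estimated_warmup_minutes(1000)))  # 103, i.e. "about 100 minutes"
```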
It is also recommended to tune the warmup tool configuration so that it executes within the recommended CPU utilization threshold.
Make sure to track and delete the synthetic data created by the tool before your production application launches. The data can be deleted using Partitioned DML.
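As a sketch, assuming the nullable commit_timestamp tracking column suggested earlier (the instance, database, and table names below are placeholders), the synthetic rows could be removed with a Partitioned DML statement via gcloud:

```shell
# Delete rows written by the warmup tool, identified by the
# auto-filled commit_timestamp tracking column.
gcloud spanner databases execute-sql example-db \
  --instance=example-instance \
  --enable-partitioned-dml \
  --sql='DELETE FROM Warmup WHERE commit_timestamp IS NOT NULL'
```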
As a prerequisite, tune the thread count and the number of GKE pods to generate peak traffic. Scale throughput by adding more GKE pods rather than by increasing the number of threads per run, to avoid CPU contention within each pod.
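For example, assuming the tool runs as a Kubernetes Deployment named gcsb-run (the name is a placeholder for your own workload), adding pods is a single scale operation:

```shell
# Scale out the load-generating pods; keep the per-pod thread count
# constant to avoid CPU contention within each pod.
kubectl scale deployment/gcsb-run --replicas=20
```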
Verification
To ensure that your Cloud Spanner database will achieve the expected throughput of your production application, execute a read/write workload after the warmup. This workload will continue to create further splits if needed. More importantly, it will provide insight into the latency and throughput to expect at application launch.
This step can also be used to pre-warm an existing database or table. You can specify the key ranges for the tables (config) that should be split before the launch.
To generate read/write traffic, edit the gke_run.yaml file, supplying the Spanner resource information and the expected read and write traffic. The configuration also allows you to configure strong or snapshot reads to mimic the production workload closely.
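As an illustrative sketch only (the container arguments below are placeholders for where the Spanner resource information and traffic mix are supplied, not gcsb's documented flags), gke_run.yaml broadly takes the shape of a Kubernetes workload manifest:

```yaml
# Sketch of a Kubernetes Job running the load generator; every value
# in angle brackets, and the read/write knobs, are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: gcsb-run
spec:
  parallelism: 10              # one load-generating pod per slot
  template:
    spec:
      containers:
        - name: gcsb
          image: <your-gcsb-image>
          args:
            - run
            - --project=<your-project>
            - --instance=<your-instance>
            - --database=<your-database>
            # placeholder knobs for the read/write traffic mix
            - --reads=80
            - --writes=20
      restartPolicy: Never
```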