Alex's blog

OutbackCDX replication

One of the exciting new features in OutbackCDX 0.7.0 is primary-secondary replication support, implemented by James Kafader and contributed by the Archive-It team. It enables deployments such as high-availability failover, load balancing, or hosting indexes in multiple geographic locations to reduce query latency.

How it works

One instance of OutbackCDX is designated the primary and configured to preserve its transaction log.

Secondary instances poll the primary’s transaction log for changes at /{collection}/changes?since={seqno}. This uses RocksDB’s GetUpdatesSince API to return a list of write batches which the secondary applies as an incremental update to its index.

[{"sequenceNumber": "1", "writeBatch": "... base64 data ..."},
 {"sequenceNumber": "3", "writeBatch": "... base64 data ..."}]

Secondary instances are read-only and will refuse record updates made directly via POST and DELETE API calls.
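
To make the response format above concrete, here is a minimal Python sketch that parses a changes response, decodes each base64 write batch, and reports the range of sequence numbers it covers. The sample payloads are made up for illustration (they are not real RocksDB write batches), and the helper names parse_changes and summarize are hypothetical rather than part of OutbackCDX.

```python
import base64
import json

# Sample response in the shape shown above. The payload bytes are invented
# for illustration and are NOT real RocksDB write batches.
SAMPLE_RESPONSE = json.dumps([
    {"sequenceNumber": "1",
     "writeBatch": base64.b64encode(b"batch-one").decode("ascii")},
    {"sequenceNumber": "3",
     "writeBatch": base64.b64encode(b"batch-two").decode("ascii")},
])


def parse_changes(body):
    """Return a list of (sequence_number, raw_batch_bytes) tuples."""
    changes = []
    for entry in json.loads(body):
        seqno = int(entry["sequenceNumber"])  # sent as a JSON string
        batch = base64.b64decode(entry["writeBatch"])
        changes.append((seqno, batch))
    return changes


def summarize(changes):
    """Return (first_seqno, last_seqno, total_decoded_bytes), or None if empty."""
    if not changes:
        return None
    total = sum(len(batch) for _, batch in changes)
    return changes[0][0], changes[-1][0], total


changes = parse_changes(SAMPLE_RESPONSE)
print(summarize(changes))
```

A real secondary hands each decoded batch to RocksDB rather than inspecting it; the sketch only shows how the JSON envelope is laid out.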

Configuring the primary instance

When running in primary mode, use the --replication-window option to tell OutbackCDX how many seconds of transaction log to keep. For example, if you’re certain you’ll never need to update a secondary that’s more than 7 days out of date, you could use 604800 to save some disk space. In this case we’ll use 0, which means the transaction log is kept forever.
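
As a quick check on the arithmetic, the 7-day figure can be derived rather than memorised. This is a small Python sketch; the helper name replication_window_seconds is hypothetical and only converts days to the seconds value the option expects.

```python
from datetime import timedelta


def replication_window_seconds(days):
    """Convert a retention period in days to whole seconds."""
    return int(timedelta(days=days).total_seconds())


# 7 days * 24 hours * 60 minutes * 60 seconds
print(replication_window_seconds(7))  # 604800
```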

Let’s create a new directory to store the primary’s index data and start the instance:

$ mkdir /tmp/primary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/primary --replication-window 0
OutbackCDX http://localhost:8080

Now that the primary is running, let’s create a collection named ‘example’ and populate it with a CDX line:

$ echo '- 20190101000000 text/html 200 - - - 1043 333 example.warc.gz' > example.cdx
$ curl --data-binary @example.cdx http://localhost:8080/example
Added 1 records

Configuring a secondary instance

We’ll run a secondary instance with a data directory of /tmp/secondary on port 8081. A secondary needs the URL of the primary collection it should replicate from, which is given with the --primary option. If you need to replicate multiple collections, use --primary multiple times.

$ mkdir /tmp/secondary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/secondary -p 8081 --primary http://localhost:8080/example
OutbackCDX http://localhost:8081
Tue Jan 14 17:21:57 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.647s from http://localhost:8080/example/changes?size=10485760&since=0 and our latest sequence number is now 2
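
The log line above shows the URL the secondary polled. As a rough Python sketch, that URL can be rebuilt from the primary collection URL and a sequence number. The function name changes_url is hypothetical, and the size value is simply copied from the log line; its exact meaning is not described in this post.

```python
from urllib.parse import urlencode


def changes_url(primary_collection_url, since, size=10485760):
    """Build the polling URL for a primary collection.

    The default size is the value seen in the log output above; treat it
    as an assumption rather than a documented default.
    """
    query = urlencode({"size": size, "since": since})
    return primary_collection_url.rstrip("/") + "/changes?" + query


print(changes_url("http://localhost:8080/example", 0))
```

Fetching the same URL with curl while the primary is running is a convenient way to inspect what a secondary would receive.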

Changing the replication interval

By default, secondaries poll the primary for changes every 10 seconds. This can be adjusted with the --update-interval option.
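
To pull the options together, here is a hedged Python sketch that assembles a secondary's command line with several --primary URLs and a custom polling interval. The helper secondary_command and the second collection name 'other' are hypothetical, and the sketch assumes --update-interval takes a value in seconds, matching the 10-second default.

```python
import shlex


def secondary_command(jar, data_dir, port, primary_urls, update_interval=None):
    """Return the argv list for starting a secondary instance."""
    cmd = ["java", "-jar", jar, "-d", data_dir, "-p", str(port)]
    for url in primary_urls:
        # --primary is repeated once per replicated collection
        cmd += ["--primary", url]
    if update_interval is not None:
        # Assumption: the interval is given in seconds
        cmd += ["--update-interval", str(update_interval)]
    return cmd


cmd = secondary_command(
    "outbackcdx-0.7.0.jar",
    "/tmp/secondary",
    8081,
    ["http://localhost:8080/example", "http://localhost:8080/other"],
    update_interval=30,
)
print(shlex.join(cmd))
```

Building the arguments as a list keeps each URL a separate argument, so the result can be passed straight to subprocess.run without shell quoting concerns.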