Replication

Replication, available from CollectiveAccess version 1.7, allows data to be replicated from one CollectiveAccess system to another. To do so, CollectiveAccess uses a specialized version of the Web Service API.

Usage

The replicator can be run on either the source or target system, or even on a “neutral” third system. All communication is done via RESTful HTTP web services. The configuration for the replicator is replication.conf. Essentially, it has two big arrays: “sources” and “targets”.

Note

Unlimited sources and targets can be configured. To understand the implications, please take a look at the “Protocol” section below.

An example of a working configuration might look like this:

sources = {
     test = {
             url = http://sync.dev/,
             service_user = admin,
             service_key = dublincore,
             from_log_timestamp = 2016-03-29,

             skipIfExpression = {
                     ca_objects = "^ca_objects.type_id !~ /image/",
             }
     }
}

targets = {
     test = {
             url = http://providence.dev/,
             service_user = admin,
             service_key = dublincore,

             setIntrinsics = {
                     __default__ = {
                             ca_objects = {
                                     source_id = external
                             }
                     },
                     29f91051-3833-4e45-892e-7e833d9af4f0 = {
                             ca_objects = {
                                     source_id = internal
                             }
                     }
             }
     }
}

Typically, exactly one source and one target will be defined; however, the syntax allows defining several of each. Individual settings are described in the tables below; both example systems above have the name/code “test”.

Settings for Sources

url
    Points to the URL where the CollectiveAccess system can be accessed from the replicator system. Note that the replicator will try to communicate with that system via the Web Service API, so point it to your Providence setup (which has service.php), and not to Pawtucket.
    Example: http://my.collection.example.org/admin/

service_user
    Name of the user the replicator is going to use to log into the Web Service API on this source system. Don’t use an administrator account here; make an extra user account that can only access the Web Service API.
    Example: api_user

service_key
    The password/key for the user that the replicator is logging in with on this source system.
    Example: foobar

from_log_id
    Optional. When set, the replicator will only pull change log entries with a primary key (ca_change_log.log_id) greater than the value set here. This is useful if sync source and target start out as exact copies but diverge from a given point in time; in that case, you want to start syncing at that point. The current change log ID can be viewed under Manage > Administration > Configuration Check.
    Example: 2179

from_log_timestamp
    Same as above, except it can be a verbose date or time expression, which is then parsed by the CollectiveAccess TimeExpressionParser. The from_log_id setting overrides this one if both are set.
    Example: 2016-03-28

skipIfExpression
    Don’t export log entries where one of the subjects matches the given expression. This is an array of table_name > expression mappings. You can use this, for instance, to only sync images to the target.
    Example: skipIfExpression = { ca_objects = "^ca_objects.type_id !~ /image/", }

push_media
    Set this to 1 if you have firewall or other networking restrictions in place that would prevent the target side from pulling media from each source via normal HTTP. This enables a protocol addition where media is pushed from the source(s) to the target(s) after the change log segment is generated, and stashed there locally until the change log is processed.
    Example: 0 or 1

Note

For the push_media setting to work, the source needs a copy of the same replication.conf that’s being used by the replicator (which can run anywhere). This is so that target login credentials don’t have to be sent from the replicator to the source; the target is identified by its code in replication.conf. With this feature enabled, the source(s) could potentially be used for denial-of-service attacks, so only enable it if you’re sure you need it.

Settings for Targets

For the replication process to pull media through, set allow_fetching_of_media_from_remote_urls to 1 in app.conf. The default is 0.

url
    Points to the URL where the CollectiveAccess system can be accessed from the replicator system. Note that the replicator will try to communicate with that system via the Web Service API, so point it to your Providence setup (which has service.php), and not to Pawtucket.
    Example: http://my.collection.example.org/admin/

service_user
    Name of the user the replicator is going to use to log into the Web Service API on this target system.

service_key
    Password or key for the user that the replicator is logging in with on this target system.

setIntrinsics
    This can be set to a list of hashes that define which intrinsics to set on the target side, in addition to the data that comes from the source change log. It can be broken out by source system, which is useful, for instance, if you’re syncing multiple systems into one and want to tag or mark each record with its original source.

    Each of these hashes has a system GUID as its key. This is the unique identifier of the source system, which can be found on the Configuration Check screen under Manage > Administration.

    If a record comes in from that GUID, the settings from that configuration block are applied. Below the GUID is a list of tablename > fieldname > value mappings. There can also be a __default__ block that always gets applied; fields from the GUID-specific blocks override fields set there. If no GUID matches, only the default block is applied, if any. Note that this feature can only set intrinsics, at least for now.
    Example: setIntrinsics = { __default__ = { ca_objects = { source_id = external } }, 29f91051-3833-4e45-892e-7e833d9af4f0 = { ca_objects = { source_id = internal } } }

deduplicateAfterReplication
    If this is set to a list of tables, the replicator will run a simple deduplication algorithm for these tables on the target system. This is done by computing a checksum for each record in the database and then merging records that have the same checksum.
    Example: deduplicateAfterReplication = {ca_entities, ca_places}
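The checksum-based merge can be sketched in a few lines. This is a conceptual sketch only: the record shape, the SHA-256 checksum, and the first-one-wins merge policy are illustrative assumptions, not the actual CollectiveAccess implementation.

```python
import hashlib
import json

def record_checksum(record):
    # Compute a stable checksum over the record's content fields,
    # ignoring system-specific identifiers such as the primary key.
    content = {k: v for k, v in record.items() if k != "id"}
    serialized = json.dumps(content, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

def deduplicate(records):
    # Keep the first record seen for each checksum; report which
    # record IDs would be merged into which surviving record.
    survivors = {}   # checksum -> surviving record
    merges = []      # (duplicate_id, survivor_id) pairs
    for rec in records:
        key = record_checksum(rec)
        if key in survivors:
            merges.append((rec["id"], survivors[key]["id"]))
        else:
            survivors[key] = rec
    return list(survivors.values()), merges
```

For example, two ca_entities rows with identical content but different primary keys produce the same checksum and would be merged into one surviving record.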

Running the Replicator

Once replication.conf is set up, the replicator can be run. It is recommended to keep a backup of the target system(s) at hand while experimenting with the configuration: selectively rolling back changes made by the sync is not currently possible.

The replicator is a simple script in caUtils:

support/bin/caUtils replicate-data

It will create a log file in the location specified in replication.conf.

Protocol

The rough protocol outline is as follows. For each source/target combination, adhere to the following:

  • Get the system GUID for the source

  • Get the last replicated log ID for the source at the target, if any

  • Determine the log start point for the source and target (taking into account the “from_log_timestamp” and “from_log_id” settings)

  • Get the log from s.getlog, taking into account both skipIfExpression and the above log start point

  • If no (new) log entries are found, abort

  • Forward that log to t.applylog, also passing s.guid and the setIntrinsics configuration for that system

  • Check over the results
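The steps above can be sketched as a driver loop. The source and target objects here are stand-ins for the HTTP calls to the replication service; their method names mirror the endpoint names, not any real client library.

```python
def replicate(source, target, from_log_id=None, from_log_timestamp=None,
              skip_if_expression=None, set_intrinsics=None, limit=100):
    # source/target are assumed to expose the replication service
    # endpoints as methods (getsysguid, getlastreplicatedlogid, ...).
    source_guid = source.getsysguid()

    # Resolve the start point: explicit settings win over the last
    # replicated log_id recorded at the target.
    if from_log_timestamp is not None:
        start = source.getlogidfortimestamp(from_log_timestamp)
    elif from_log_id is not None:
        start = from_log_id
    else:
        start = target.getlastreplicatedlogid(source_guid) or 0

    # Pull the next change log segment, honoring skipIfExpression.
    log = source.getlog(start + 1, limit, skip_if_expression)
    if not log:
        return None  # nothing new to replicate, abort

    # Apply the segment at the target, passing the source GUID and
    # any setIntrinsics configuration for that system.
    result = target.applylog(log, source_guid, set_intrinsics)
    return result["replicated_log_id"]
```

In a real deployment this loop would run repeatedly until getlog returns no new entries.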

Replication Service

All communication is done via the newly implemented replication service, which provides both the “source” and the “target” functionality through the endpoints below. Note that all endpoint names are case-insensitive; their CamelCase equivalents work just as well.

GET getlog

Returns the change log for that system. Parameters are:

from (int) = log_id to start from
limit (int) = limit to this many entries
skipIfExpression (string) = json-encoded skipIfExpression config fragment (see above)

The response body is the JSON-encoded change log.
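Assembling a getlog request might look like the sketch below. The parameter names follow the description above, but the service.php/replication URL layout is an assumption made for illustration, not a documented path.

```python
import json
from urllib.parse import urlencode

def build_getlog_url(base_url, from_log_id, limit, skip_if_expression=None):
    # skipIfExpression travels as a JSON-encoded config fragment.
    params = {"from": from_log_id, "limit": limit}
    if skip_if_expression:
        params["skipIfExpression"] = json.dumps(skip_if_expression)
    # Hypothetical endpoint path; adjust to your actual service URL.
    return base_url.rstrip("/") + "/service.php/replication/getlog?" + urlencode(params)
```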

GET getsysguid

Returns the system GUID for this target or source. The response body will have the GUID under the “system_guid” key.

GET getlastreplicatedlogid

Returns the last replicated log ID for a given source at that particular target system. This parameter is mandatory:

system_guid (string) = system GUID for the source system

The log ID will be under the “replicated_log_id” key in the response body.

GET getlogidfortimestamp

Translates a given timestamp into a log ID for that system. This facilitates the “from_log_timestamp” setting (see above). There is one mandatory parameter:

timestamp (int) = the Unix timestamp to translate

The log id will be under the “log_id” key in the response body.

POST applylog

Applies the given log at the target system. Takes the log (in the exact format returned by “getlog”) as the request body. Additional parameters:

system_guid (string) = system GUID of the source system, mandatory
setIntrinsics (string) = JSON-encoded config fragment for the setIntrinsics functionality (see above)

Returns the last replicated log_id under the “replicated_log_id” key in the response body, and any warnings as an array under the “warnings” key.
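A caller might unpack the applylog response like this; the key names follow the description above, while the function itself is purely illustrative.

```python
import json

def parse_applylog_response(body):
    # The response body carries the last replicated log_id and an
    # array of warnings, keyed as described above.
    data = json.loads(body)
    return data["replicated_log_id"], data.get("warnings", [])
```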

POST dedup

Runs deduplication for a given list of tables. There is one mandatory parameter:

tables (string) = JSON-encoded list of tables to run deduplication on

Implementation Details

The main functionality of the feature is in the getlog and applylog functions.

Getlog

The actual implementation is not in the ReplicationService, but in ca_change_log::getLog(). For the most part, it just gets the change log from the given start point and pulls in ca_change_log_snapshots and ca_change_log_subjects for each of the resulting rows. It then goes to some lengths to make these records useful for sync by adding GUIDs for all system-specific *_id columns.

It also processes the skipIfExpression rules. They’re applied to the change log subjects for each change log entry. The whole change log entry is skipped if the expression (and the table) matches for one of the subjects.
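The filtering described above can be modeled as follows. The entry/subject shape and the evaluate callback are stand-ins for CollectiveAccess’s own change log rows and expression engine, chosen only to make the skip logic concrete.

```python
def filter_log(entries, skip_rules, evaluate):
    # skip_rules maps table names to expressions. An entry is dropped
    # when the expression matches for at least one of its subjects.
    kept = []
    for entry in entries:
        skip = False
        for subject in entry.get("subjects", []):
            expression = skip_rules.get(subject["table"])
            if expression and evaluate(expression, subject):
                skip = True
                break
        if not skip:
            kept.append(entry)
    return kept
```

Note that matching any single subject is enough to suppress the whole change log entry, mirroring the behavior described above.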

Applylog

The ReplicationService pulls the log out of the request body and applies some basic sanity checks. It also figures out whether setIntrinsics was set and prepares it as an option to pass to the change log entry implementations. It then loops through the log entries and calls CASyncLogEntryBase::getInstance() for each entry. That class method returns one of the implementations of CASyncLogEntryBase, based on what kind of record the log entry represents:

Attribute -- ca_attributes
AttributeValue -- ca_attribute_values
Bundleable -- something like ca_objects
Label -- something like ca_object_labels
Relationship -- something like ca_objects_x_occurrences

It will then call apply() on the log entry object. Each row is processed in a transaction, which is rolled back if the log entry object throws an exception. Because of the interdependencies between log entries, not everything can be processed in a single pass.
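The dispatch can be sketched as a small registry keyed on table name. The class names mirror the list above, but the selection heuristics and the empty apply() bodies are illustrative assumptions, not the actual CASyncLogEntryBase::getInstance() logic.

```python
class SyncLogEntry:
    def __init__(self, entry):
        self.entry = entry
    def apply(self):
        pass  # placeholder for the table-specific apply logic

class Attribute(SyncLogEntry): pass       # ca_attributes
class AttributeValue(SyncLogEntry): pass  # ca_attribute_values
class Bundleable(SyncLogEntry): pass      # e.g. ca_objects
class Label(SyncLogEntry): pass           # e.g. ca_object_labels
class Relationship(SyncLogEntry): pass    # e.g. ca_objects_x_occurrences

def entry_class_for_table(table):
    # Pick the handler class based on what kind of record the log
    # entry touches; naming conventions stand in for real metadata.
    if table == "ca_attributes":
        return Attribute
    if table == "ca_attribute_values":
        return AttributeValue
    if table.endswith("_labels"):
        return Label
    if "_x_" in table:
        return Relationship
    return Bundleable
```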