Bulk Loading Matchgrid

COmanage Match supports the bulk load of records from a file into a Matchgrid using the command line.

Preparing the Source File

The source file must be in CSV (RFC 4180, though more specifically as supported by PHP) format. The first line of the file must be a record header, indicating the columns present in the file. The columns must be, in order,

sor: The Systems of Record label, as configured
sorid: The System of Record unique identifier for the represented person
One or more Attribute names, as configured
referenceid: The Reference ID for the represented person, if the record has already been matched

Multiple Systems of Record may be represented in the same file. referenceid is optional; if it is not provided, the behavior of the load is described below.

Attributes are provided in the same format as over the API, and are subject to the Matchgrid Attributes configuration. Note that, other than verifying that sor is a valid SoR label, no attribute validation is performed.

Sample Source File

sor,sorid,firstname,lastname,dob,ssn,referenceid
hrms,4699738,Pat,Lee,1983-03-18,999-00-1234,b3e7ca03-af37-44df-9446-e92c48f129df
hrms,5460736,Joe,Baron,1988-08-08,628-92-6566,
hrms,8549113,Amanda,Gantt,1987-05-14,630-56-2233,
hrms,6100178,Vicki,Petty,1987-03-23,366-22-3556,
hrms,6425809,Valerie,Lutz,1990-04-24,055-20-5960,
hrms,3406461,Henry,Carrion,1974-10-15,529-12-7984,

Loading the Matchgrid

Command Line Usage

./bin/cake bulk_load [-qv] [--skip-match] --matchgrid-id id input.csv

where

--skip-match: Causes the load to operate in Skip Match Rules mode (see below); the default is Process Match Rules
id: The numeric Matchgrid ID (as available, eg, in a URL like /match/matchgrids/manage/5)
input.csv: The name of the file to read records from, in the format specified above

The Matchgrid table must be already be configured and built before a bulk load can be performed.

The bulk load process will rebuild the Matchgrid table. Any configuration changes (such as defining new attributes) that have not yet been applied (by rebuilding the Matchgrid) will be applied.

Load Modes

Bulk loading the Matchgrid supports two modes:

Process Match Rules: Match Rules and Rule Attributes are applied for each record that does not already have a Reference ID. This mode is suitable when bulk loading new data that has not yet been deduplicated. If a record does not have a Reference ID, one may be assigned, unless the Match Rules result in a pending match.
Skip Match Rules: Match Rules are not applied. Furthermore, to speed operations the Matchgrid's database indexes are dropped (and then recreated after the load). This mode is suitable when bulk loading an existing set of data that has already been deduplicated, probably from a legacy system. If a record does not have a Reference ID, one will be assigned.

Regardless of the mode, records are appended to the existing Matchgrid. To start with an empty Matchgrid, truncate the table (via database provided tools) before starting the bulk load. Attempting to load a record for an SOR+SORID that is already in the Matchgrid may throw a unique constraint violation, depending on the processing mode. No specific behavior when a record is loaded a second time should be relied upon, and no specific behavior is guaranteed.

In Process Match Rules mode, it is possible for a record to result in potential matches. When this happens, the pending match may be resolved using the standard methods. Note, however, that no email notification will be generated.

In multi-tenant deployments, bulk loading one Matchgrid will not functionally affect any other Matchgrid, though depending on the circumstances it is possible the database or application server may exhibit performance degradation.

When running in Skip Rules mode, the Matchgrid should not be available via the APIs, or data inconsistency may result. However, Match does not currently enforce this, so deployers should coordinate appropriately given their own circumstances.

Updating the Matchgrid Reference Identifier Sequence

If all of the following are true:

The bulk load is run in Process Match Rules mode, and
The Matchgrid Reference ID Assignment Method is set to Sequence, and
At least one Reference ID was specified in the input file

then it will be necessary to manually update the database sequence to the next available number. (The bulk load command is not currently capable of tracking manually assigned Reference IDs.) This can be done with the following SQL:

match=> select setval('mg_seq_ID_seq', NEXTVAL);