This example is a person feed. The feed goes from an IDM database to a Banner database, carrying a list of users and attributes (some from Grouper).
Set up a view with a where clause that constrains by a group (note: you can provision Grouper memberships to the IDM database with a separate SQL sync).
CREATE OR REPLACE FORCE VIEW VIEW_WITH_PEOPLE_V ( PENN_ID ) BEQUEATH DEFINER AS
SELECT subject_id AS penn_id
-- table provisioned from grouper which is like grouper_memberships_lw_v
FROM penn_memberships_lw gmlv
WHERE gmlv.subject_source = 'pennperson'
-- policy group that constrains the population
AND gmlv.group_name = 'a:b:c:somePolicyGroup'
AND gmlv.list_name = 'members';
This is the view we will sync:
CREATE OR REPLACE FORCE VIEW SOME_VIEW_V ( EMPLOYEE_ID, GUID, LAST_NAME, FIRST_NAME, MIDDLE_NAME, NAME_PREFIX, NAME_SUFFIX, EMAIL_ADDRESS, STREET1, STREET2, STREET3, CITY, STATE, POSTAL_CODE, PHONE_NUMBER, EXTENSION, ALL_ACTIVE_AFFILIATIONS, ORG, TITLE, DESCRIPTION, LAST_UPDATED ) BEQUEATH DEFINER AS
SELECT cp.id, cp.guid, ...
FROM COMPUTED_PERSON cp, VIEW_WITH_PEOPLE_V vwpv
WHERE cp.char_penn_id = vwpv.PENN_ID;

GRANT SELECT ON SOME_VIEW_V TO AUTHZADM;
Create a table in the destination database to sync to. Either create it explicitly, or, if it is the same database (or one reachable via a dblink), you can have the database create it for you:
create table SOME_TABLE as (select * from SOME_VIEW_V where 1!=1);
Now we need to decide which type of SQL sync to use. Let's try a few of them and see how they perform. Start with fullSyncFull. We need the databases, tables, columns, and primary key.
grouper.client.properties
# grouper client or loader database key where copying data from
grouperClient.syncTable.some_view.databaseFrom = myDb

# table or view where copying data from, include the schema if needed
grouperClient.syncTable.some_view.tableFrom = pcdadmin.some_view_v

# grouper client or loader database key where copying data to
grouperClient.syncTable.some_view.databaseTo = anotherDb

# table or view where copying data to, include the schema if needed
grouperClient.syncTable.some_view.tableTo = some_schema.some_table

# columns must match in from and to tables, you can specify columns or do all with an asterisk
grouperClient.syncTable.some_view.columns = *

# if there is a primary key, list it, else list the composite keys.  note, this doesn't
# have to literally be the database primary key, it just needs to be a unique col(s) in the table
grouperClient.syncTable.some_view.primaryKeyColumns = employee_id
fullSyncFull
Now let's set up the full sync and run one.
grouper-loader.properties
# Object Type Job class
otherJob.some_view_full.class = edu.internet2.middleware.grouper.app.tableSync.TableSyncOtherJob

# Object Type Job cron
otherJob.some_view_full.quartzCron = 0 0 5 * * ?

# this is the key in the grouper.client.properties that represents this job
otherJob.some_view_full.grouperClientTableSyncConfigKey = some_view

# fullSyncFull, fullSyncGroups, fullSyncChangeFlag, incrementalAllColumns, incrementalPrimaryKey
otherJob.some_view_full.syncType = fullSyncFull
You need to start a loader to schedule this. I will just start one from the command line, then kill it:
gsh -loader
Run it manually from the UI, and it runs for a while. Run it a second time and it syncs the 1.3 million records in just under 4 minutes. Note that it holds both datasets in memory at once, so it uses a lot of memory.
2020-02-13 21:19:24,769: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncFull, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581646699935
2020-02-13 21:20:24,987: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncFull, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581646699935
2020-02-13 21:21:25,250: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncFull, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581646699935
2020-02-13 21:22:01,757: [main] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncFull, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581646699935, retrieveDataFromCount: 1290590, retrieveDataFromMillis: 184418, retrieveDataToCount: 1290590, retrieveDataToMillis: 206597, deletesCount: 0, deletesMillis: 0, insertsCount: 0, insertsMillis: 0, updatesCount: 74, updatesMillis: 53, queryCount: 11, tookMillis: 217202, took: 0:03:37.202
Here is an example final log of updating 400k records. That works out to roughly 130k updates per minute (the job took 6.5 minutes instead of 3.5).
finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncFull, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581644355937, retrieveDataFromCount: 1290590, retrieveDataFromMillis: 182013, retrieveDataToCount: 1290590, retrieveDataToMillis: 202774, deletesMillis: 0, insertsMillis: 0, updatesCount: 390847, updatesMillis: 180620, queryCount: 519, tookMillis: 394049, took: 0:06:34.049
fullSyncGroups
We can set the primary key column (since the primary key is a single column) as the grouping column. The sync will select all groupings from each side, sort them, do the inserts/deletes, then select swaths of 100k records' worth of groupings with one query each. This is still a full sync, but it is more efficient and does not use as much memory.
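As a sketch of the queries involved (table and column names follow the earlier example; the exact SQL the sync generates may differ):

```sql
-- step 1: select the grouping column from each side, then sort and diff in memory
select employee_id from pcdadmin.some_view_v order by employee_id;
select employee_id from some_schema.some_table order by employee_id;

-- step 2: page through in swaths of groupingSize (default 100000),
-- fetching full rows for each swath with a range on the grouping column
select * from pcdadmin.some_view_v
  where employee_id >= :swath_start and employee_id <= :swath_end;
```

Because only the grouping column is held for the whole dataset, and full rows are fetched one swath at a time, the memory footprint stays bounded by the grouping size.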
grouper.client.properties
# the grouping column is what is uniquely selected, and then batched through to get data.  Optional.
# for groups this should be the group uuid
# {valueType: "string"}
grouperClient.syncTable.some_view.groupingColumn = employee_id

# the grouping size is how many records are batched through at once, defaults to global setting
# {valueType: "integer"}
# grouperClient.syncTable.some_view.groupingSize = 100000
grouper-loader.properties
# fullSyncFull, fullSyncGroups, fullSyncChangeFlag, incrementalAllColumns, incrementalPrimaryKey
otherJob.some_view_full.syncType = fullSyncGroups
This took 6 minutes for 1.3 million rows. Here are example logs. Note that before the final log, the times are reported in micros rather than millis, for more precision.
2020-02-13 21:06:48,480: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610
2020-02-13 21:07:48,574: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610
2020-02-13 21:08:48,914: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: syncData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610, retrieveGroupsFromCount: 1290590, retrieveGroupsFromMillis: 147822470, deleteGroupingsCount: 0, deleteGroupingsGroupsCount: 0, deleteGroupingsMillis: 46, retrieveDataToCount: 100000, retrieveDataToMillis: 19050025, retrieveDataFromCount: 100000, retrieveDataFromMillis: 23414151, deletesCount: 0, deletesMillis: 13, insertsCount: 0, insertsMillis: 13, updatesCount: 0, updatesMillis: 12
2020-02-13 21:09:48,987: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: syncData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610, retrieveGroupsFromCount: 1290590, retrieveGroupsFromMillis: 147822470, deleteGroupingsCount: 0, deleteGroupingsGroupsCount: 0, deleteGroupingsMillis: 46, retrieveDataToCount: 400000, retrieveDataToMillis: 75379944, retrieveDataFromCount: 400000, retrieveDataFromMillis: 68221662, deletesCount: 0, deletesMillis: 58, insertsCount: 0, insertsMillis: 51, updatesCount: 0, updatesMillis: 59
2020-02-13 21:10:49,270: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: syncData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610, retrieveGroupsFromCount: 1290590, retrieveGroupsFromMillis: 147822470, deleteGroupingsCount: 0, deleteGroupingsGroupsCount: 0, deleteGroupingsMillis: 46, retrieveDataToCount: 800000, retrieveDataToMillis: 142089149, retrieveDataFromCount: 800000, retrieveDataFromMillis: 126142709, deletesCount: 0, deletesMillis: 154, insertsCount: 0, insertsMillis: 98, updatesCount: 164, updatesMillis: 105758
2020-02-13 21:11:49,356: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: syncData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782610, retrieveGroupsFromCount: 1290590, retrieveGroupsFromMillis: 147822470, deleteGroupingsCount: 0, deleteGroupingsGroupsCount: 0, deleteGroupingsMillis: 46, retrieveDataToCount: 1100000, retrieveDataToMillis: 187511824, retrieveDataFromCount: 1200000, retrieveDataFromMillis: 182821138, deletesCount: 0, deletesMillis: 191, insertsCount: 0, insertsMillis: 134, updatesCount: 164, updatesMillis: 105801
2020-02-13 21:12:05,001: [main] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncGroups, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581645944789, retrieveGroupsToCount: 1290590, retrieveGroupsToMillis: 18782, retrieveGroupsFromCount: 1290590, retrieveGroupsFromMillis: 147822, deleteGroupingsCount: 0, deleteGroupingsGroupsCount: 0, deleteGroupingsMillis: 0, retrieveDataToCount: 1290590, retrieveDataToMillis: 217165, retrieveDataFromCount: 1290590, retrieveDataFromMillis: 195605, deletesCount: 0, deletesMillis: 0, insertsCount: 0, insertsMillis: 0, updatesCount: 164, updatesMillis: 105, queryCount: 25, tookMillis: 376643, took: 0:06:16.643
fullSyncChangeFlag
This mode selects the primary keys and a change flag column (e.g. a last_updated column or a checksum column). It is not a true full sync: if the change flag does not accurately reflect the data, changes can be missed. If the change flag is accurate, it works fine. If you use this, you might want to run a fullSyncFull or fullSyncGroups weekly as well. If you have a lot of large columns, this is a more efficient sync mechanism.
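If the source has no reliable last_updated column, a checksum column can be computed in the source view to act as the change flag. For Oracle, ORA_HASH is one option; the column choices and delimiter below are illustrative, not part of the example above:

```sql
-- hypothetical: compute a checksum over the columns being synced,
-- so any change to those columns changes the flag value
select cp.id,
       cp.guid,
       cp.last_name,
       cp.first_name,
       cp.email_address,
       ora_hash(cp.last_name || '|' || cp.first_name || '|' || cp.email_address) as check_sum
from computed_person cp;
```

Note that the checksum must cover every synced column, or edits to uncovered columns will be missed, which is exactly the inaccuracy warned about above.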
grouper.client.properties
# if doing fullSyncChangeFlag (look for a col that says if the rows are equal, e.g. a timestamp or a checksum)
# {valueType: "string"}
grouperClient.syncTable.some_view.changeFlagColumn = last_updated
grouper-loader.properties
# fullSyncFull, fullSyncGroups, fullSyncChangeFlag, incrementalAllColumns, incrementalPrimaryKey
otherJob.some_view_full.syncType = fullSyncChangeFlag
This took about 4 minutes for 1.3 million rows and used much less memory than the other full syncs. Here are example logs:
2020-02-13 21:32:15,049: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncChangeFlag, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581647463773
2020-02-13 21:33:15,266: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncChangeFlag, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581647463773
2020-02-13 21:34:15,539: [Thread-13] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: false, state: retrieveData, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncChangeFlag, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581647463773
2020-02-13 21:35:11,493: [main] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: fullSyncChangeFlag, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, initialMaxProgressAllColumns: 1581647463773, retrieveChangeFlagFromCount: 1290590, retrieveChangeFlagFromMillis: 204685, retrieveChangeFlagToCount: 1290590, retrieveChangeFlagToMillis: 220549, deletesCount: 0, deletesMillis: 0, insertsCount: 0, insertsMillis: 0, selectAllColumnsCount: 110, selectAllColumnsMillis: 49, updatesCount: 110, updatesMillis: 52, queryCount: 12, tookMillis: 236604, took: 0:03:56.604
incrementalAllColumns
This mode is used when there are few deletes (e.g. a person table) and the data has a column that is a timestamp or an increasing sequence integer, so records can be selected since the last update. Note: once you configure the column to check, you need to run a full sync job to initialize the incremental_index or incremental_timestamp field in the grouper_sync table, since the incremental sync will abort if the relevant field is empty.
grouper.client.properties
# name of a column that has a sequence or last updated date.
# must be in the main data table if incremental all columns
# {valueType: "string"}
grouperClient.syncTable.some_view.incrementalAllColumnsColumn = last_updated
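Conceptually, each incremental run then selects only the rows changed since the stored progress marker, along the lines of this sketch (the actual generated SQL may differ; the bind variable name is illustrative):

```sql
-- fetch only rows updated since the last successful run,
-- using the configured incrementalAllColumnsColumn
select * from pcdadmin.some_view_v
  where last_updated > :last_retrieved_timestamp;
```

This is why the progress field in grouper_sync must be initialized first: with no stored marker there is nothing to bind, and the incremental sync aborts.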
Run a full sync
See the incremental progress. Note that it is correct, so we are good to go:
select provisioner_name, incremental_index, incremental_timestamp from grouper_sync where provisioner_name = 'some_view'
grouper-loader.properties
# Object Type Job class
otherJob.some_view_incremental.class = edu.internet2.middleware.grouper.app.tableSync.TableSyncOtherJob

# Object Type Job cron
otherJob.some_view_incremental.quartzCron = 0 * * * * ?

# this is the key in the grouper.client.properties that represents this job
otherJob.some_view_incremental.grouperClientTableSyncConfigKey = some_view

# fullSyncFull, fullSyncGroups, fullSyncChangeFlag, incrementalAllColumns, incrementalPrimaryKey
otherJob.some_view_incremental.syncType = incrementalAllColumns
Here is a sample log entry. It takes less than a second to update about 100 records:
2020-02-13 21:49:38,356: [main] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: incrementalAllColumns, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, lastRetrieved: 2020-02-13 21:31:14.739, incrementalChangesCount: 113, incrementalChangesMillis: 174, selectAllColumnsCount: 113, selectAllColumnsMillis: 54, deletesCount: 0, deletesMillis: 0, insertsCount: 0, insertsMillis: 0, updatesCount: 113, updatesMillis: 59, queryCount: 11, tookMillis: 920, took: 0:00:00.920
incrementalPrimaryKey
This is the more precise method for incremental SQL syncing, since it syncs deletes as well as inserts and updates. It requires a separate table that has the primary keys and a timestamp or sequentially incrementing column. You can implement this with a SQL trigger pretty easily. Note that you should have a job that deletes old records from this table (e.g. more than a day old). Note: there is no column in the changes table indicating whether a change was an insert, update, or delete; each record processed is checked against both source and destination to determine what to do.
Example trigger (Oracle). This uses a sequence for the change number, but it could be a timestamp instead. A sequence is the better choice since there will not be conflicts.
CREATE OR REPLACE TRIGGER computed_person_chglog_trg
after INSERT OR UPDATE OR DELETE on computed_person
for each row
declare
  employee_id number(8);
  the_change_id integer;
begin
  select audit_sequence.nextval into the_change_id from dual;
  if (inserting) then
    employee_id := :new.employee_id;
  elsif (updating) then
    employee_id := :new.employee_id;
    -- can't update the employee_id
    if (:new.employee_id <> :old.employee_id) then
      raise_application_error(-20123, 'Cannot change the employee_id of a row.  You can delete and insert if you like');
    end if;
  elsif (deleting) then
    employee_id := :old.employee_id;
  end if;
  insert into computed_person_chglog (id, last_edited, employee_id)
    values (the_change_id, systimestamp, employee_id);
end;
/
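The cleanup job mentioned above (purging old change-log rows) could be a scheduled statement along these lines; the one-day retention is just an example, pick whatever safely exceeds your sync interval:

```sql
-- purge change-log rows older than a day; safe because the incremental
-- sync's progress marker is already past these entries
delete from computed_person_chglog
  where last_edited < systimestamp - interval '1' day;
commit;
```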
In my case, my change log table has a numeric employee ID, but my sync table has a text-based employee ID, so I need a view to convert between them:
create view some_view_chglog as
select m.char_employee_id as employee_id, cpc.id as incrementing_index
from pcdadmin.computed_person_chglog cpc, comadmin.member m
where cpc.employee_id = m.employee_id;
grouper.client.properties
First, REMOVE the incremental all columns config:

grouperClient.syncTable.some_view.incrementalAllColumnsColumn

Then add the table and column for changes:

# if querying a real time table, this is the table, needs to have primary key columns.
# each record will check the source and destination and see what to do
# {valueType: "string"}
grouperClient.syncTable.some_view.incrementalPrimaryKeyTable = some_view_chglog

# name of a column that has a sequence or last updated date.
# must be in the incrementalPrimaryKeyTable if incremental primary key sync
# {valueType: "string"}
grouperClient.syncTable.some_view.incrementalProgressColumn = incrementing_index
grouper-loader.properties
# fullSyncFull, fullSyncGroups, fullSyncChangeFlag, incrementalAllColumns, incrementalPrimaryKey
otherJob.some_view_incremental.syncType = incrementalPrimaryKey
Since we are switching incremental types (from all columns to a primary key table), clear out the progress:
update grouper_sync set incremental_index = null, incremental_timestamp = null
  where provisioner_name = 'some_view';
commit;

update grouper_sync_job gsj set last_sync_index = null
  where grouper_sync_id = (select id from grouper_sync gs where gs.provisioner_name = 'some_view');
commit;
Run a full sync to make sure the incremental progress is reset
Now see the max incremental progress:
select max( svc.incrementing_index ) from some_view_chglog svc
This should be approximately the current index (depending on how much time has passed and how many records have updated in the meantime):
select id, provisioner_name, incremental_index, incremental_timestamp from grouper_sync where provisioner_name = 'some_view';
Kick it off and see how it does. Sample log:
2020-02-14 02:11:34,831: [main] DEBUG GcTableSyncLog.debugLog(33) - - finalLog: true, state: done, sync: sqlTableSync, provisionerName: personSourceProd, syncType: incrementalPrimaryKey, databaseFrom: pcom, tableFrom: PERSON_SOURCE2_v, databaseTo: awsProd, tableTo: PERSON_SOURCE, tableIncremental: person_source_chglog, lastRetrieved: 112279636, incrementalChangesCount: 12, incrementalChangesMillis: 14, selectAllColumnsCount: 24, selectAllColumnsMillis: 36, deletesCount: 0, deletesMillis: 0, insertsCount: 0, insertsMillis: 0, updatesCount: 12, updatesMillis: 25, queryCount: 12, tookMillis: 848, took: 0:00:00.848