The Format Identifier Assignment Plugin generates Identifiers based on a format specification and attributes associated with the subject entity.

Supported Contexts

  • Department
  • Group
  • Person

Configuration

  1. The Format Identifier Assigner is part of the CoreAssigner Registry Plugin, which is activated by default.
  2. When adding a new Identifier Assignment, the Plugin is CoreAssigner.FormatAssigners.
  3. Plugin configuration options are described below.

Format

Identifier formats are specified using three components:

  • Substitutions: In a format specification, a substitution is replaced with some other string. Substitutions are delimited with parentheses.
  • Collision Numbers: A number added to a string that is not unique in order to make it unique. For example, the string j.smith with a collision number added might become j.smith.3.
  • Sequenced Segments: In a format specification, sequenced segments allow adding additional components to a string to help generate a unique result. An Identifier is first generated from a format without any sequenced segments. If that Identifier is not unique, sequenced segments are added one at a time until a unique Identifier is generated.

If no format is specified, Identifiers are assigned as plain integers, e.g. 109 or 523788.

Substitutions

Substitutions replace a segment of a format with some other value, usually (but not necessarily) associated with the entity for which the Identifier is being assigned. Not all Substitutions are supported in all contexts. Substitutions are delimited with parentheses, for example (G). The following Substitutions are available:

  • Person Name
    • (G): Given Primary Name
    • (M): Middle Primary Name
    • (F): Family Primary Name
    • (g): Given Primary Name (lowercased)
    • (m): Middle Primary Name (lowercased)
    • (f): Family Primary Name (lowercased)
  • Department, Group Name
    • (N): Entity name
    • (n): Entity name (lowercased)
  • Identifier
    • (I): See below
  • Random, see below
    • (h): Hexadecimal characters (0-9a-f)
    • (L): Random letters (A-Z, but no O to avoid confusion with zero)
    • (l): Random letters (a-z, but no l to avoid confusion with one)
  • Collision Number
    • (#): See below

Substitutions can be limited in length using the colon notation (X:n). For example, an initial can be used in lieu of the full given name by using (g:1). (This does not apply to Collision Numbers, see below.)
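
The name Substitutions and the colon notation can be illustrated with a minimal sketch. This is not the plugin's implementation: the function name expand_substitutions and the dictionary layout are hypothetical, and the sketch assumes that (X:n) keeps the first n characters of the value.

```python
import re

# Matches (G), (M), (F), (g), (m), (f), each with an optional :n length limit.
NAME_SUBSTITUTION = re.compile(r"\(([GMFgmf])(?::(\d+))?\)")


def expand_substitutions(fmt: str, names: dict) -> str:
    """Replace name Substitutions in fmt; any other token is left untouched.

    names is a hypothetical mapping with the keys "G", "M" and "F".
    """

    def replace(match: re.Match) -> str:
        code, width = match.group(1), match.group(2)
        value = names.get(code.upper(), "")
        if code.islower():
            value = value.lower()
        # Assumption: (X:n) keeps only the first n characters.
        return value[: int(width)] if width else value

    return NAME_SUBSTITUTION.sub(replace, fmt)


person = {"G": "Albert", "M": "", "F": "Einstein"}
print(expand_substitutions("(G).(F)@myvo.org", person))
print(expand_substitutions("(g:1).(f)@myvo.org", person))
```

Tokens that the sketch does not handle, such as (#) or (I/uid), pass through unchanged so that later steps could process them.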

Identifier Substitutions

Existing Identifiers can be embedded in the format string, using the parameter (I/name) where name is the alphanumeric database value of the Registry Type (not the Display Name), for example (I/uid)@myvo.org. This capability is available in all contexts. Note that an Identifier of the specified type must already exist or the Identifier Assignment will fail. Identifier Assignments can be ordered, so it is possible to ensure that the first Identifier is generated before the second Identifier that uses it.

If more than one Identifier of a given type exists, it is non-deterministic as to which Identifier will be used for the Substitution.

Random Substitutions

Random Substitutions operate differently from Collision Numbers. Random Substitutions are generated once as part of the Identifier construction, and are not guaranteed to make a unique string. In contrast, if a Collision Number does not generate a unique string, it will be replaced until a unique string is found (or a limit is reached).

Random Substitutions support width specifiers, so, for example, (l:5) will generate a five character string of lowercase letters, such as hxnwp.

Random sequences can, and probably should, be combined with collision numbers. For example, (L:3)(#:2) will generate a string like DGP23. If that Identifier is already in use, the random string portion (DGP) will be preserved, but a new collision number will be generated, resulting in an Identifier like DGP77.
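
The interaction between a Random Substitution and a Collision Number, as in (L:3)(#:2), might look like the following sketch. The function names, the 0 to 99 collision range and the in-memory set of taken Identifiers are assumptions made for illustration; only the alphabets and the limit of 10 attempts come from this page.

```python
import random
import string

# Alphabets as described above: hex digits, A-Z without O, a-z without l.
ALPHABETS = {
    "h": "0123456789abcdef",
    "L": string.ascii_uppercase.replace("O", ""),
    "l": string.ascii_lowercase.replace("l", ""),
}


def random_substitution(code: str, width: int, rng: random.Random) -> str:
    """Build a random string of the given width from the alphabet for code."""
    return "".join(rng.choice(ALPHABETS[code]) for _ in range(width))


def with_collision_number(prefix: str, taken: set, rng: random.Random,
                          attempts: int = 10):
    """Keep the random prefix fixed and retry only the two digit number."""
    for _ in range(attempts):
        # Hypothetical range 0-99, zero-padded to mimic (#:2).
        candidate = f"{prefix}{rng.randint(0, 99):02d}"
        if candidate not in taken:
            return candidate
    return None  # no unique Identifier found within the attempt limit


rng = random.Random()
prefix = random_substitution("L", 3, rng)  # generated once, e.g. DGP
print(with_collision_number(prefix, {prefix + "23"}, rng))
```

The random portion is generated a single time, so every retry in the loop differs only in the collision number.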

Collision Numbers

A Collision Number is simply a type of substitution that uses a number to generate a unique identifier. How the number is generated is controlled by Collision Mode, and Minimum and Maximum Collision Values, described below.

Only one Collision Number is permitted in a format.

The Collision Number can be made fixed width by specifying the number of characters n in the Substitution as (#:n).
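
As a small illustration of the fixed width notation, the sketch below pads a collision number with leading zeros, matching the C(#:8) example later on this page. The function name is hypothetical, and how the plugin treats a number that is longer than the requested width is not covered here.

```python
from typing import Optional


def render_collision_number(value: int, width: Optional[int] = None) -> str:
    """Render a collision number, zero-padded on the left when a width is given."""
    text = str(value)
    return text if width is None else text.zfill(width)


print("C" + render_collision_number(109))     # format C(#)
print("C" + render_collision_number(109, 8))  # format C(#:8)
```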

Sequenced Segments

Sequenced Segments are fragments of an identifier format that are added one at a time in order to help generate a unique Identifier. This can be useful for situations like:

  • Adding a collision number, but only to the second Identifier of a given form.
  • Inserting a middle name, but only if an Identifier already exists using only the given and family names.

A Sequenced Segment is denoted in brackets as a number followed by a colon, and includes the text (including Substitutions) to be used when that Sequenced Segment is in effect. When assigning Identifiers, all Sequenced Segments will initially be ignored. Then, starting with 1 and incrementing by 1 each time, Sequenced Segments will be added in until a unique Identifier is generated. Currently, up to 9 Sequenced Segments may be defined.

For example, consider the format (G)[1:.(M:1)].(F)[2:.(#)]@myvo.org. This somewhat confusing string will first generate Werner.Heisenberg@myvo.org. If that isn't unique, it will then generate Werner.K.Heisenberg@myvo.org. If that isn't unique either, it will generate Werner.K.Heisenberg.1@myvo.org. (The Minimum Collision Value should probably be set to 2 when used with sequenced segments. That would generate Werner.K.Heisenberg.2@myvo.org instead, which is presumably less confusing if there is already a Werner.K.Heisenberg@myvo.org assigned.)

There are actually two types of Sequenced Segments: additive and single use. Additive sequenced segments are denoted with [ and ], and are inserted starting with their designated sequence and remain in place for future identifier attempts. Single use sequenced segments are indicated with an additional = inserted after the open bracket. So, for example, the segment [1:.(M:1)] will be inserted into the second and each subsequently generated Identifier candidate, while the segment [=1:.(M:1)] will only be inserted into the second generated candidate (and no subsequent candidates).
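
The difference between additive and single use segments can be sketched by computing the effective format for each attempt. This is an illustrative reading of the rules above, not the plugin's code; format_for_step is a hypothetical name, and step 0 stands for the first attempt, where all segments are ignored.

```python
import re

# [n:text] is additive, [=n:text] is single use; n is a single digit (1-9).
SEGMENT = re.compile(r"\[(=?)(\d):([^\]]*)\]")


def format_for_step(fmt: str, step: int) -> str:
    """Return fmt with only the Sequenced Segments in effect at this step."""

    def replace(match: re.Match) -> str:
        single_use = match.group(1) == "="
        number = int(match.group(2))
        text = match.group(3)
        # Additive segments stay once reached; single use ones apply only once.
        active = (number == step) if single_use else (number <= step)
        return text if active else ""

    return SEGMENT.sub(replace, fmt)


additive = "(G)[1:.(M:1)].(F)[2:.(#)]@myvo.org"
single_use = "(G)[=1:.(M:1)].(F)[2:.(#)]@myvo.org"
for step in range(3):
    print(step, format_for_step(additive, step), format_for_step(single_use, step))
```

At step 2 the additive format still contains .(M:1), while the single use variant has dropped it and only adds the collision number.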

Example Formats

Description | Format | Example Identifiers
Identifier consisting of the letter C followed by a collision number | C(#) | C109, C523788
Identifier consisting of the letter C followed by an eight character collision number | C(#:8) | C00000109, C00523788
Use given and family names to generate an Email Address | (G).(F)@myvo.org | Albert.Einstein@myvo.org
Use first initial and family name to generate a lowercase Email Address | (g:1).(f)@myvo.org | a.einstein@myvo.org
Create a Network ID (netid) based on initials and a collision number | (g:1)(m:1)(f:1)(#) | rdm75
Generate an Email Address based on the Network ID | (I/netid)@myvo.org | rdm75@myvo.org

Collision Mode

The Collision Mode controls how collision numbers are assigned. Supported modes are:

  • Random: The Collision Number is generated randomly.
  • Sequential: The Collision Number is generated sequentially (for each string constructed prior to assigning the collision number), using the next unassigned integer beginning with the Minimum Collision Value.

Sequential Collision Number assignment requires storing state in the database, using the format_assigner_sequences table. No specific action is required for this; however, deployers may wish to be aware of the (minor) additional overhead of using Sequential Collision Numbers.
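
Sequential mode keeps one counter per constructed string. The sketch below models that idea with an in-memory dictionary standing in for the format_assigner_sequences table; the class and method names are hypothetical, and the real plugin persists this state in the database.

```python
class SequentialCollisionCounter:
    """Hand out the next unassigned integer for each affix, e.g. 'rdm%s'."""

    def __init__(self, minimum: int = 1):
        self.minimum = minimum  # Minimum Collision Value
        self.last = {}          # affix -> last number handed out

    def next_value(self, affix: str) -> int:
        # An unseen affix starts at the minimum; otherwise continue from last.
        value = self.last.get(affix, self.minimum - 1) + 1
        self.last[affix] = value
        return value


counter = SequentialCollisionCounter(minimum=2)
print("rdm%s" % counter.next_value("rdm%s"))
print("rdm%s" % counter.next_value("rdm%s"))
print("jms%s" % counter.next_value("jms%s"))
```

Each affix advances independently, so the second rdm Identifier receives the next number while the first jms Identifier starts again at the minimum.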

Permitted Characters

The substitutions described in this document are controlled by the Permitted Characters. Consider someone with the given name "Mary Anne" and the family name "Johnson-Smith". It might not be desirable to allow spaces or dashes in the generated identifier, so specifying AlphaNumeric Only as the permitted characters would result in an identifier like "maryanne.johnsonsmith" instead of "mary anne.johnson-smith". AlphaNumeric and Dot, Dash, Underscore would generate "maryanne.johnson-smith".
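
The "Mary Anne Johnson-Smith" example can be reproduced with a short sketch. It assumes that filtering is applied to each substituted value rather than to the literal text of the format, which is why the dot between the names survives, and that "AlphaNumeric" means ASCII letters and digits. The dictionary and function names are hypothetical.

```python
import re

# Each setting maps to a pattern matching the characters that are NOT permitted.
NOT_PERMITTED = {
    "AlphaNumeric Only": re.compile(r"[^A-Za-z0-9]"),
    "AlphaNumeric and Dot, Dash, Underscore": re.compile(r"[^A-Za-z0-9._-]"),
}


def filter_value(value: str, setting: str) -> str:
    """Strip characters that the Permitted Characters setting does not allow."""
    return NOT_PERMITTED[setting].sub("", value)


given, family = "Mary Anne", "Johnson-Smith"
for setting in NOT_PERMITTED:
    # Equivalent to the format (g).(f) with each value filtered first.
    identifier = ".".join(filter_value(v, setting).lower() for v in (given, family))
    print(f"{setting}: {identifier}")
```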

If any Sequenced Segment generates text consisting only of non-permitted characters, it will be skipped.

Warning: Auto-generated Identifiers are subject to Identifier Validation. Identifier Validator Plugins can be used to further constrain auto-generated Identifiers.

Minimum Collision Value

For Random Collision Numbers, the minimum value that may be assigned. For Sequential Collision Numbers, the first value to be assigned.

The Minimum Collision Value is useful for avoiding collision numbers starting with the number 1, which may be confused with the letter l.

Maximum Collision Value

For Random Collision Numbers, the maximum value that may be assigned.  Currently, the maximum may not exceed the value returned by PHP's mt_getrandmax() function, which is typically 2,147,483,647.

Maximum Collision Values cannot be set for Sequential Collision Numbers.

Pre-populating Identifier Assignment Collision Numbers

Warning: This section is for advanced use cases.

It is possible to manually pre-populate sequential collision numbers, which may be useful when migrating data from another system. There is not currently a user interface to handle this (CO-386), so the steps must be handled manually.

First, define an Identifier Assignment using the Format Assigner as described above, if not already done. Obtain the ID for the Format Assigner, which can be found via the plugin's configuration page. In a URL like the following, the ID is 3:

http://localhost/registry-pe/core-assigner/format-assigners/edit/3

Next, determine the affix or affixes. These are equivalent to the format with parameters substituted (with %s replacing (#)). For example, a format used to generate identifiers consisting of a person's lowercased initials might be (g:1)(m:1)(f:1)(#). This would translate into a set of rows, one for each initial sequence, e.g.:

SQL> insert into format_assigner_sequences
     (format_assigner_id, last, affix)
     values (3, 122, 'jms%s');
SQL> insert into format_assigner_sequences
     (format_assigner_id, last, affix)
     values (3, 176, 'rdm%s');
SQL> insert into format_assigner_sequences
     (format_assigner_id, last, affix)
     values (3, 143, 'rlm%s');

Note that rows in this table are not automatically created until an Identifier with a given affix is assigned.
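
When migrating, the values to pre-populate usually have to be derived from the Identifiers that already exist in the old system. The sketch below is one hypothetical way to do that for Identifiers made of lowercase initials followed by a number. It assumes that the last column should hold the highest number already in use for each affix, which should be verified before loading any data.

```python
import re

# Lowercase letters followed by digits, e.g. rdm75 (an assumed legacy layout).
LEGACY_IDENTIFIER = re.compile(r"([a-z]+)(\d+)")


def last_values_by_affix(identifiers):
    """Group legacy Identifiers by affix and keep the highest number seen."""
    last = {}
    for identifier in identifiers:
        match = LEGACY_IDENTIFIER.fullmatch(identifier)
        if not match:
            continue  # skip anything that does not fit the assumed layout
        affix = match.group(1) + "%s"
        number = int(match.group(2))
        last[affix] = max(last.get(affix, 0), number)
    return last


legacy = ["jms7", "jms122", "rdm176", "rlm143", "rdm75"]
format_assigner_id = 3  # taken from the plugin's configuration URL
for affix, last in sorted(last_values_by_affix(legacy).items()):
    print((format_assigner_id, last, affix))
```

Each printed tuple corresponds to one row of (format_assigner_id, last, affix) in the insert statements above.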

Plugin Application Rules

  1. A maximum of 10 attempts will be made to assign an Identifier.

See Also

Changes From Earlier Versions

Prior to Registry v5.0.0

  • The functionality provided by the Format Assigner plugin was provided by the core Identifier Assignment code.