Consensus - Identifiers

DRAFT: Work in Progress

The working group has studied and weighed a number of problems with current approaches to user identification in R&E SAML federations. Recently-discovered security concerns, long-standing deployment challenges, and emerging approaches in alternate technologies such as OIDC have been studied, together with the other requirements communicated to the group by its participants, representing varied communities.

The group's emerging consensus is to propose a pair of SAML Attributes to be used for long-term identification of subjects. These may be entirely new Attributes, or may (in part) reuse existing work, depending on decisions yet to be made.

Identifier Requirements

The requirements for these attributes have been shaped by studying a number of well-understood properties of identifiers and discussing them in the context of the use cases for which the working group believes its output can best be applied. This also serves as a natural scoping exercise; applications for which the requirements cannot hold are necessarily not in-scope for the profile. As will be seen below, this rules out (as one example) applications that can support only a single identifier that must be a recognizable e-mail address.

To that end:

Persistent: YES
- This is non-controversial; identifiers must be stable for at least a reasonable period of time to be useful to most applications, and for most use cases, the longer, the better.
Non-reassignability: YES
- The reassignability of both "eduPersonPrincipalName" and "mail" has been a source of long-standing frustration and continues to lead to complex workarounds and out-of-band agreements to maintain their usability for many applications. Any new identifier clearly needs to fully and unequivocally disallow reassignment in all cases.
Human-friendliness: NO
- The overlapping of identification with search and selection has been a major source of frustration, and the group believes the other requirements cannot be reasonably met by continuing to allow that overlap. We explicitly believe it should not be necessary for identifiers to be displayable or known by users themselves. By extension, it is not necessary that identifiers be derived from a person's name.
Size-Bounded: YES
- Applications frequently require, and should be able to expect, reasonable constraints on the possible size of an identifier.
Portable: NO
- While there are absolutely use cases for globally-managed identifiers that can follow users as they move between organizations, the working group feels this use case is, for the time being, better left as an account linking exercise by services and a subject of future study.
Correlatable: YES
- A significant proportion of important applications require an identifier that can be used to correlate activity with other applications. Thus, it must be possible for identifiers to be omni-directional, having the same value in all cases.
Targeted / Non-correlatable: YES
- Another significant proportion of important applications, and their users, believe it is a necessity to maintain the ability to prevent the correlation of activity between different applications. Thus, it must be possible for identifiers to be uni-directional, having a different value for some sets of different cases.

One or Two

As is obvious, the last two requirements above (Correlatable and Targeted / Non-correlatable) are mutually exclusive. It follows that either two identifier types are required, or a single identifier must implicitly vary in its semantics along that axis. OpenID Connect has seemingly chosen the latter approach, allowing its "sub" claim to behave in either fashion, depending on the manner in which the OP operates and the options used by the client when it registers.

The working group's consensus is that SAML is better served with two explicitly separate constructs to meet both use cases. This separation allows applications to deliberately support one or both types of identifier, and to signal their requirements using existing standards, avoiding the need for new work in this area or for out-of-band signaling.

Proposals

Define a Pair of Identifier Attributes

We believe two attributes are required, notionally referred to here as "uniqueID" and "directedID" (the names are merely placeholders for discussion). Both attributes would carry stable/persistent, non-reassignable values, unique within the scope of the issuing/controlling organization. The former would be a correlatable, globally uniform value and the latter would be a value that may (and should) vary across different services in a manner appropriate to deployment of the SAML protocol.

Thus:

"uniqueID"
1. public/correlatable
2. non-reassignable
3. persistent
"directedID"
1. targeted/non-correlatable
2. non-reassignable
3. persistent
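
As a rough illustration of how the two sets of properties differ, the following Python sketch derives a "directedID" value per service from a keyed hash, while the "uniqueID" value stays the same for every service. The function names, the HMAC-SHA-256 construction, the Base32 encoding, and the use of the SP's entityID as the varying input are all assumptions made for this example; none of them is a decision of the working group.

```python
import base64
import hashlib
import hmac


def unique_id(local_id: str) -> str:
    """Correlatable: every service receives the same value for a subject."""
    # Assumption: local_id is already persistent and never reassigned.
    return local_id


def directed_id(local_id: str, sp_entity_id: str, secret: bytes) -> str:
    """Targeted: the value differs per service but is stable for each one."""
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    message = f"{local_id}\x00{sp_entity_id}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    # Base32 uses a single-case alphabet; padding is dropped for brevity.
    return base64.b32encode(digest).decode("ascii").rstrip("=")
```

Under these assumptions, a subject receives the same directed value on every visit to one service and a different value at each other service. The output is always 52 characters long, which also fits the size-bounded requirement.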

Consistent with the earlier requirements summary, neither type of identifier is expected to be human-friendly, suitable for display, or appropriate for searching or user selection. Applications requiring such features should rely on the "mail" attribute for that purpose.

Scoping

Short of requiring cryptographically random values, any identifier scheme must contain some notion of scope or namespace to prevent unintentional collisions. Most federation-safe applications already apply some notion of namespace separation in order to support commonly-used identifiers like student or employee numbers, and some organizations may well find those values suitable as a "uniqueID", depending on their IDM practices. It's also been observed that applications without this notion of scope tend to be vulnerable to identifier collision/spoofing attacks. Thus, it behooves us to support collision-avoidance in whatever manner seems best.

The working group has yet to reach a consensus on the best approach to this problem.

Today there are two common scoping schemes:

domain-based scoping (e.g., "mail", "eduPersonPrincipalName", "eduPersonUniqueID")
explicit or implicit scoping by issuer/context (e.g., "eduPersonTargetedID", SAML persistent NameIDs, OIDC "sub" claims)

Anecdotally, the domain syntax has proven easier for developers to understand and apply, and is flexible, but at the cost of a tendency to treat any domain-scoped value as an email address. It also requires a policy layer to map protocol-specific issuers to scopes, and gaps in implementing or applying that policy layer can create holes. In addition, the decision to internationalize domain names has made processing them far more complex in theory than implementations tend to handle in practice, creating opportunities for exploits that we likely have not begun to understand.
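
The policy layer mentioned above can be sketched as a lookup from issuer to the scopes that issuer is permitted to assert. This is a minimal Python illustration only: the entityIDs, the policy table, and the function names are hypothetical, and the lowercase comparison assumes ASCII-only scopes, which sidesteps exactly the internationalization issues raised in the previous paragraph.

```python
from typing import Tuple

# Hypothetical policy: which scopes each issuer (IdP entityID) may assert.
ALLOWED_SCOPES = {
    "https://idp.example.org/idp/shibboleth": {"example.org"},
    "https://idp.example.edu/idp/shibboleth": {"example.edu"},
}


def split_scoped(value: str) -> Tuple[str, str]:
    """Split 'local@scope' at the last '@' and reject malformed values."""
    local, sep, scope = value.rpartition("@")
    if not sep or not local or not scope:
        raise ValueError(f"not a scoped value: {value!r}")
    # Assumption: ASCII-only scopes; internationalized names need more care.
    return local, scope.lower()


def scope_permitted(issuer: str, value: str, policy=ALLOWED_SCOPES) -> bool:
    """Return True only if the issuer is allowed to assert the value's scope."""
    _, scope = split_scoped(value)
    return scope in policy.get(issuer, set())
```

An application that skips this check and trusts the scope on its own would accept "abc123@example.org" from any issuer, which is the kind of hole the paragraph above describes.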

On the other hand, both SAML and OIDC rely on URIs to identify "issuers", and storing/managing identifier/issuer pairs is less well-understood by developers and, when only a single field is available for storage, leads to unexpected syntax such as "identifier@https://idp.example.org" that looks odd and sometimes breaks software.

On the subject of relying party scoping, SAML made this necessary by introducing the notion of affiliations of services that can collectively receive a common identifier. Thus, the receiver of an identifier isn't always the implicit scope for a SAML "persistent" NameID. If we dispense with this notion, then the need for explicit identification of the receiver as a qualifier goes away, which is also consistent with OIDC, and avoids the challenges associated with managing "triples" of data.

Define a new entity category "privacy-preserving"

This entity category will signal to IdPs that the SP wants to preserve the privacy of the user by limiting the release of personally identifiable information. IdPs that support this entity category would be required to release only the eduPersonPairwiseId attribute (and optionally, any consented attributes as well) to SPs that are members of this entity category.
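
The release rule described above might look roughly like the following Python sketch on the IdP side. The category URI is a placeholder invented for this example (no identifier has been assigned in this draft), and the function names and the dictionary representation of attributes are likewise assumptions.

```python
# Hypothetical category URI; no identifier has been assigned in this draft.
PRIVACY_PRESERVING = "https://example.org/entity-category/privacy-preserving"
PAIRWISE_ATTRIBUTE = "eduPersonPairwiseId"


def is_privacy_preserving(sp_entity_categories) -> bool:
    """True if the SP's metadata lists the (hypothetical) category URI."""
    return PRIVACY_PRESERVING in sp_entity_categories


def attributes_to_release(available: dict, consented=frozenset()) -> dict:
    """Rule for SPs in the category: pairwise ID plus consented attributes."""
    allowed = {PAIRWISE_ATTRIBUTE} | set(consented)
    return {name: value for name, value in available.items() if name in allowed}
```

For an SP in the category, a subject's "mail" value would be withheld unless the user had consented to its release; how SPs outside the category are handled is left to the IdP's ordinary release policy and is not modeled here.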

Deprecate SAML Persistent nameID and eduPersonTargetedID

There are a number of disadvantages to continuing to promote the use of SAML's built-in pairwise identifier and many of them also apply to the concept of pairwise identifiers in general, leading the working group to conclude that a different approach is warranted. Some of the problems are inherent to any use of the SAML NameID construct to carry personally-identifiable information:

Combining the NameID with the use of SAML Attributes is inherently confusing because of the separation of the constructs and the need to understand and configure both mechanisms.
There are few accepted standards for describing NameID formats, and many implementations handle them improperly without recognizing the Format or properly handling the NameQualifier attributes, leading to interoperability problems.
The NameID is used in LogoutRequest messages, which are frequently passed via redirect. This results in the NameID value ending up inside HTTP logs in a decodable way unless XML Encryption or the POST binding is used. The former requires support for XML Encryption in a direction opposite the usual one, from SP to IdP. This is poorly supported by SP implementations and typically will require that new keys be deployed at every IdP and published in metadata. It simply adds complexity.

The specific formulation of the SAML pairwise identifier has a number of problems, one of which is simply fatal: It was defined to be case-sensitive, which allows issuers to supply identifiers for different users that differ only by case. This literal requirement is not met by a variety of, if not the majority of, common web applications. Even though an e-mail address is not itself defined to be case-insensitive, in practice it's treated that way by applications, many of which assume all identifiers should be handled that way. While there are techniques to fix SAML implementations such that even generated identifiers are not case-sensitive (e.g., using Base32), existing deployments would have to rekey users.
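
The case-sensitivity problem can be made concrete with a short Python sketch. It compares a mixed-case encoding (standard Base64, chosen here only as an example of such an alphabet) with Base32, and checks whether two distinct identifiers become equal once an application ignores case. The function names are assumptions made for this illustration.

```python
import base64


def collide_when_folded(a: str, b: str) -> bool:
    """True if two distinct identifiers become equal once case is ignored."""
    return a != b and a.casefold() == b.casefold()


def encode_base32(raw: bytes) -> str:
    """Encode to lowercase Base32 without padding (single-case alphabet)."""
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def encode_base64(raw: bytes) -> str:
    """Encode to standard Base64, whose alphabet mixes upper and lower case."""
    return base64.b64encode(raw).decode("ascii")
```

For instance, the byte sequences 00 10 83 and 69 B7 1D encode to "ABCD" and "abcd" in Base64, so an application that compares identifiers without regard to case would treat two different users as one. The Base32 encodings of the same bytes remain distinct after case folding, because the alphabet contains only one case.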

If that weren't enough, the SAML formulation has other challenges, such as its comparatively large size and the use of a "triple" containing SAML entityIDs to namespace-qualify the value, which is at odds with the much more common use of a simple domain suffix found in every other identifier. And while it was defined by eduPerson to be usable as a SAML Attribute, its complex XML syntax is unsupported by all but a few open source implementations.

Finally, there are problems with pairwise identifiers themselves. For many deployments in research and academia, they create a barrier to cross-application correlation of identity, and even to users' own desire to be non-anonymous. Often these deployments have no choice but to resort to proxy systems that consume a single pairwise identifier on behalf of many applications. The mechanism in SAML to directly support this use case without proxies has simply not been adopted by federations. It's also unclear that such identifiers provide the legal protections under EU privacy law that many have used to argue for their use.

Taken as a whole, the working group believes these are compelling arguments for a change in strategy as we pursue this profiling work. We may find that the concept of pairwise identifiers is too important to lose, but if so, we need a different way to communicate it that breaks with current practice.
