Final Report of the Per-Entity Metadata Working Group (12/7/2016)

Repository ID: TI.5.1
Authors: Scott Koranda, David Walker <https://orcid.org/0000-0003-2540-0644>, and the Per-Entity Metadata Working Group
Sponsor: InCommon Technical Advisory Committee
Superseded documents: (none)
Proposed future review date: December 1, 2017
Subject tags: federation, metadata

© 2016 Internet2. This work is licensed under a Creative Commons Attribution 4.0 International License.

Table of Contents
1. Executive Summary
2. Introduction
3. Current State of Metadata Distribution
4. Per-Entity Metadata Distribution
5. Risks of Per-Entity Metadata Distribution
5.1. Unavailability of the MDQ Service
5.2. Poor Responsiveness of the MDQ Service
5.3. Network Failure or Isolation of the Metadata Consumer
5.4. Security Related Risks
5.5. Unavailability of the Metadata Production Infrastructure
5.6. Cost
5.7. MDQ Client Software and Risk Mitigation
6. MDQ Service Architecture
6.1. Existing Infrastructure
6.1.1. Producing Local Metadata
6.1.2. Importing Global Metadata
6.2. Adding Per-Entity Metadata to the Infrastructure
6.2.1. Content Delivery Network Based Distribution
6.2.2. Traditional Server-Based Distribution
7. Requirements for the InCommon MDQ Service
7.1. Security
7.2. Availability
7.3. Responsiveness
7.4. Metadata Production
7.5. Monitoring
8. Other Issues
8.1. Discovery
8.2. Deployment Profiles
8.3. Local Site Caching
8.4. Open Access
9. InCommon Per-Entity Distribution Roadmap
9.1. Short Term (1-3 months)
9.2. Medium Term (2-12 months)
9.3. Long Term (12-24 months)
9.4. Longer Term (24+ months)
10. Appendix: State of MDQ support in IdP and SP software
11. Appendix: Per-Entity Metadata Working Group Charter
12. Appendix: Working Group Participants

1. Executive Summary

In its 10+ years, the InCommon federation has grown from serving a very limited number of applications (“Service Providers” or “SPs”) with an equally small number of participants (“Identity Providers” or “IdPs” - mostly large research universities), to a federation that now supports thousands of different SPs across approximately 450 IdPs. This growth and InCommon’s recent production support for the eduGAIN interfederation service have caused a rapid increase in the size of InCommon’s metadata aggregate, a large file containing information about all interfederated SPs and IdPs. The growth of the federation and the metadata aggregate puts the federation at risk of becoming a victim of its own success. The aggregate model for metadata distribution parallels how services used a “hosts” file for hostname resolution before DNS existed, and like the hosts file the single large metadata aggregate file has reached the end of its sustainability.

Per-entity metadata distribution addresses scalability and sustainability by enabling metadata consumers (SPs and IdPs) to obtain just the metadata they need, when they need it, rather than consuming the full aggregate. Whenever a consumer requires the metadata for another entity, it uses the Metadata Query (MDQ) Protocol to query an MDQ service and retrieve the metadata for only that single entity.

In order to sustain InCommon’s capacity for growth, the Per-Entity Metadata Working Group recommends the following.
● InCommon must deploy an MDQ service.
● Availability of the MDQ service should be at least 99.99%.
It should be engineered so that 99% of all queries are satisfied within 200ms, exclusive of network latency.
● Utilization of InCommon’s MDQ service will require reconfiguration of participants’ Identity Providers (IdPs) and Service Providers (SPs). InCommon should provide communication and education to facilitate that work.
● SAML implementations other than Shibboleth and SimpleSAMLphp (e.g., Microsoft ADFS, Ping Identity, Ellucian/WSO2) will likely require community pressure to support federation-distributed metadata and per-entity distribution.
● The per-entity metadata support in Shibboleth and SimpleSAMLphp should be enhanced to mitigate the effect of network and service outages and slowdowns affecting InCommon’s MDQ service. InCommon should advocate for resources and community support for those efforts.
● As a short-term measure, InCommon should produce a metadata aggregate targeted at Service Providers that contains only the metadata for Identity Providers, in order to temporarily address operational issues for Service Providers caused by the current size of the full metadata aggregate.
● Support for IdP discovery in the absence of an aggregate was explicitly not part of the charge for this working group. It is, nonetheless, critical to address before SPs providing discovery services for their users can no longer handle the growing aggregate. Another working group should be formed quickly to address discovery.

2. Introduction

This is the final report of InCommon’s Per-Entity Metadata Working Group, which was charged by the InCommon Technical Advisory Committee with the following tasks:
1. Develop a roadmap for addressing the immediate needs for reduced aggregate size, as well as intermediate milestones along a trajectory to a sustainable future state, based on the MDQ protocol for per-entity distribution of federation metadata.
2. Address issues related to reliance on this new model, including but not limited to:
a. High availability
b. Performance
c. Site redundancy
3. Develop requirements, risks, and recommended risk mitigation strategies for a production per-entity metadata service delivered by InCommon, including a firm definition of the scope of the service, aligned with the immediate needs addressed in the roadmap.
4. Advise InCommon staff on implementation of a solution, based on the requirements of the service.
5. Compile the outcomes of these investigations into a report to the TAC.

In its 10+ years, the InCommon federation has grown from serving a very limited number of applications with an equally small number of participants (mostly large research universities), to a federation that now supports thousands of different applications across approximately 650 active participants. This growth and InCommon’s recent production support for the eduGAIN interfederation service have caused rapid growth in the size of InCommon’s metadata aggregate, putting the federation at risk of becoming a victim of its own success.

Metadata aggregates, that is, metadata made up of more than one SAML entity descriptor element, are static lists of entity descriptors that are aggregated, validated, signed, and distributed to consumers of federation metadata in a single, large file. This model is analogous to how hostname resolution was done before DNS existed, using a “hosts” file, and it has reached the end of its sustainability the same way the hosts file did long ago.
The metadata aggregate distribution strategy has a number of major drawbacks: 1. An error in a single entity descriptor can cause denial-of-service for consumers of the aggregate when malformed entity descriptors are created erroneously or imported from other federations. Final Report of the Per-Entity Metadata Working Group Page 4 2. Significant amounts of memory are needed to process the aggregate - now on the order of gigabytes. This will only increase over time, is a waste of deployer resources, and precludes resource-constrained deployments from full federation participation. 3. Increased bandwidth is utilized by the Federation Operator to distribute a large file that consumers almost certainly don’t need in its entirety. 4. Every IdP and SP requires increased time and bandwidth to obtain and process the aggregate, thus increasing the time to start up a SAML deployment. At the aggregate’s current size, this has already become a critical issue for some deployments. 3. Current State of Metadata Distribution Since its inception InCommon has realized the federation trust fabric as a monolithic digitally signed SAML metadata file, aggregating the metadata for all IdPs and SPs (entities). Today the InCommon Federation operator generates three large files colloquially known as the preview, main, and fallback aggregates. All deployments are encouraged to retrieve an InCommon metadata aggregate on at least a daily basis, and to verify the authenticity of the aggregate by checking the digital signature of the file using the well-known InCommon metadata signing certificate. Metadata consumers download an aggregate file from an HTTP server operated directly by InCommon and housed in an Internet2 data center. A second HTTP server operated in a geographically distant location acts as a hot standby for InCommon aggregate metadata distribution that can be put into service when necessary. InCommon federation operators have recognized for a number of years that distributing and consuming large monolithic aggregates containing all entities would not scale as the number of entities increased over time. In early 2016, federation operators began importing metadata from international federations as part of InCommon’s participation in eduGAIN. At that time, the InCommon metadata aggregate file grew dramatically in size. It continues to grow steadily as federations export more entities to eduGAIN and more federations participate in eduGAIN. With the growth of the InCommon aggregate, metadata consumers operated by InCommon Participants have reached a tipping point, with resource-constrained deployments experiencing slow startup times and even crashes due to the size of the aggregate. The vast majority of InCommon metadata consumers do not operationally need to consume and have available the SAML metadata for every entity since most participate in transactions with only a handful of relying parties. Even entities that do actively transact with many unique relying parties do so relatively infrequently. Further, the nature of SAML federation is such that no IdP needs to consume metadata about any other IdP and no SP needs to consume metadata about any other SP. These observations, taken as a whole, indicate not only that aggregate metadata distribution has reached the end of its useful life, but also that the needs of the vast majority of federation participants can be met using a new model — Per-Entity Metadata Distribution. Final Report of the Per-Entity Metadata Working Group Page 5 4. 
Per-Entity Metadata Distribution Per-entity metadata distribution addresses the problems of large, monolithic aggregates and exploits the relatively modest metadata consumption needs of most consumers by enabling them to obtain just the metadata they need, when they need it, rather than consuming the full aggregate. Whenever a consumer requires the metadata for another entity, it uses the​ Metadata Query (MDQ) Protocol​ to query an MDQ service and retrieve the metadata for only that single entity. The benefits of per-entity metadata distribution via the MDQ protocol include: ● Reduced memory and resource consumption by a metadata consumer since it need only request and consume metadata for entities with which it needs to federate. ● Reduced load on the metadata distribution service since consumers only query for and download the actual entity descriptors they need. ● Decoupling of entity descriptors so that errors for any single entity need not impede the distribution and consumption of the metadata for all other entities. 5. Risks of Per-Entity Metadata Distribution While a transition to per-entity distribution of the InCommon metadata would help reduce resource consumption, network traffic, and resolve the brittleness of large aggregates, it is not without risk. Below, we examine categories of risk and discuss changes in the risk posture for InCommon Participants and the Federation as a whole as part of a transition away from monolithic aggregates to per-entity metadata distribution. These risks are presented in no particular order. 5.1. Unavailability of the MDQ Service When an MDQ consumer, either IdP or SP, requires the metadata for an entity, it must query an MDQ service to obtain the metadata. If the MDQ service is unavailable and cannot answer the query, the consumer does not receive the necessary metadata for the entity, resulting in a service disruption for the MDQ consumer. An MDQ service may be unavailable for any number of reasons, some inherent and some not: ● Server failures including power failures and resource exhaustion, be it memory or disk. ● Failed or misconfigured software components such as HTTP web servers. ● Incorrect or incomplete DNS entries that prevent resolution of the IP address(es) for the service. Final Report of the Per-Entity Metadata Working Group Page 6 ● Network outages for any network component without redundancy between the MDQ consumer and the service. ● Outages caused by poor consumer software implementations. ● DDoS and other attacks on the service by malicious actors. Each of these reasons for service disruption exist today in InCommon’s current metadata distribution strategy. However, since most IdPs and SPs obtain the metadata periodically, recover gracefully from download errors, and the signed aggregate is valid for two weeks, such disruptions are transparent and do not commonly result in user-visible failures. The change in risk posture for InCommon Participants when transitioning to per-entity metadata distribution is that service disruptions in the MDQ service are more likely to result in visible failures. Caching MDQ query results by consumers does not entirely mitigate this risk, but does provide moderate outage tolerance assuming prior successful queries. 
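To make the per-entity lookup pattern described in section 4 concrete, the sketch below shows roughly what a single metadata query might look like from a consumer's point of view. It is an illustration only: the base URL is hypothetical, the request shape follows the working group's reading of the MDQ protocol draft (GET {base}/entities/{percent-encoded entityID}), and signature verification is deliberately left as a comment rather than implemented.

```python
"""Minimal sketch of a per-entity metadata lookup (MDQ-style).

Assumptions, not InCommon specifics: the service base URL below is a
placeholder, and the request shape follows the MDQ protocol draft.
"""
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

MDQ_BASE_URL = "https://mdq.example.org/"  # hypothetical endpoint


def fetch_entity_metadata(entity_id: str, timeout: float = 5.0) -> ET.Element:
    """Fetch and parse the metadata for a single entity."""
    # The entityID serves as the identifier and must be percent-encoded.
    url = MDQ_BASE_URL + "entities/" + urllib.parse.quote(entity_id, safe="")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        document = resp.read()
    # A real client must verify the XML signature on the response before
    # trusting it; that step is omitted here for brevity.
    return ET.fromstring(document)


if __name__ == "__main__":
    md = fetch_entity_metadata("https://idp.example.edu/idp/shibboleth")
    print(md.tag)  # expected: a SAML metadata EntityDescriptor element
```

A consumer following this pattern downloads a few kilobytes per relying party it actually federates with, instead of the full multi-hundred-megabyte aggregate.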
While other mitigations, including preloading caches and failover to tiered services, are discussed below, it is immediately apparent that the transition to per-entity metadata distribution substantially increases the high availability (HA) and related requirements for InCommon metadata distribution services.

5.2. Poor Responsiveness of the MDQ Service

An MDQ service may be available but may not respond quickly enough to any individual MDQ query. Specifically, the service may take a relatively long time to return a valid and expected response to the client. Since the IdP or SP making the query cannot continue the SSO web flow until it receives the correct response from the MDQ service, the user that initiated the SSO flow will see a delayed and degraded SSO experience.

An MDQ service may respond slowly for any number of reasons, including:
● A lack of service or server capacity, so that the service is unable to respond in a timely way.
● Degraded performance of other services on which an MDQ service may depend.
● Degraded network performance for any part of the network between the consumer and the service.
● DDoS and other attacks on the service by malicious actors.

Again, each of these reasons for service degradation exists today in InCommon’s metadata aggregate distribution strategy, but since most IdPs and SPs download the metadata aggregate asynchronously to specific user behavior, and the signed aggregate is valid for two weeks, such degradations are transparent and do not degrade the sign-on experience for users.

The change in risk posture for InCommon Participants when transitioning to per-entity metadata distribution is that a slow or unresponsive MDQ service will result in a degraded user experience. Mitigations for this risk are discussed below, but again it is obvious that the transition to per-entity metadata distribution substantially increases the requirements on the InCommon metadata distribution service for responsiveness, including capacity and the ability to meet peak demand during particular times in the academic calendar such as late August and early September.

5.3. Network Failure or Isolation of the Metadata Consumer

Above, we considered network failures and performance factors that can contribute to MDQ service degradation, but here we consider specifically the risk inherent in per-entity distribution when the network connecting a campus or other organization to the Internet fails.

There is no change in risk posture for off-campus users, since they can contact neither the campus IdP nor any SP on campus, and this is the same regardless of whether the campus is relying on the InCommon monolithic metadata aggregate or an InCommon MDQ service. On-campus users can contact the campus IdP [1] but not off-campus SPs. Again the risk posture is no different whether relying on an aggregate or an MDQ service, although the error a user experiences may manifest differently depending on where in the SSO flow the lost connectivity to the Internet is first encountered. On-campus users can, however, contact both the campus IdP and campus SPs during such a network event, and it is expected that basic intra-campus services continue to operate normally.
If a campus relies on its own mechanisms, and not InCommon-signed metadata, for bootstrapping the trust between the campus IdP and SPs, then again there is no change in risk posture, since there is no reliance on the InCommon trust fabric for interoperability between the campus IdP and campus SPs.

Some campuses or organizations do, however, rely on InCommon metadata for bootstrapping the trust between the campus IdP and campus SPs. That is, they submit metadata for both the IdP and SPs into InCommon metadata and configure both the IdP and SPs to consume the InCommon metadata. For these campuses, the change to per-entity metadata distribution does result in a change in risk posture, since the inability to query an MDQ service external to campus during a network isolation event will result in intra-campus SSO flows failing. Those campuses may require mitigation strategies before adopting per-entity metadata distribution.

[1] We do not consider here the details of a campus operating its IdP off site, perhaps in the cloud.

5.4. Security Related Risks

A full and detailed analysis of the security risks associated with the per-entity distribution of InCommon metadata is out of scope here. Rather, we consider one specific change in the security-related risk posture.

InCommon digitally signs each of the metadata aggregates so that consumers can verify the integrity of the download using the well-known InCommon metadata signing certificate. Weaknesses in XML digital signature implementations have been found in the past, however, and will plausibly be discovered in the future. Malicious actors could exploit such weaknesses and prepare a rogue file to distribute to their target. A successful attack requires the target to consume the rogue file.

Another possible attack is the substitution of old metadata that is no longer current, but still within its validity period, for the current metadata. Because an MDQ query occurs in-band and just-in-time, an attacker may be able to induce a query to the MDQ service on demand and so more readily attempt to intercept the query from the client and inject the old metadata. Use of TLS transport can mitigate this risk, assuming the MDQ client verifies the MDQ server’s certificate as being authentic and either self-issued or issued by a trusted certificate authority. There are, however, issues of how well that certificate’s private key can be protected in a content delivery network that the working group did not explore extensively. In particular, the metadata signing certificate should not be used as the end-entity server certificate for TLS.

5.5. Unavailability of the Metadata Production Infrastructure

It is assumed that a common technical and process infrastructure will be used to produce metadata for both per-entity and aggregate distribution. If the servers, equipment, or people involved in the daily production of metadata are not available, additions, updates, and deletions of metadata will not occur. Continued distribution of the current set of metadata will be uninterrupted, however.

The impact of missing a production cycle is the same for per-entity distribution as it has always been for aggregate production. Under normal circumstances, the impact of not producing metadata on a particular day is low. There have been, however, emergency situations where it has been necessary to produce more than once in a single day. Loss of the metadata production infrastructure on such a day could have severe implications.
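One client-side safeguard against the stale-metadata substitution described in section 5.4 is to check the validity window on every retrieved entity descriptor, rather than accepting anything that parses and verifies. The sketch below is illustrative only: validUntil is a standard SAML metadata attribute, but the maximum accepted lifetime is an assumed local policy choice (it simply mirrors the two-week aggregate validity), and the timestamp parsing is simplified.

```python
"""Sketch: reject per-entity metadata that is expired or missing a validity window.

Illustrative only; the maximum accepted lifetime below is a hypothetical
local policy knob, not an InCommon requirement.
"""
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

MAX_ACCEPTED_LIFETIME = timedelta(days=14)  # assumed policy, mirrors the two-week aggregate validity


def check_validity(entity_descriptor: ET.Element) -> None:
    valid_until = entity_descriptor.get("validUntil")
    if valid_until is None:
        raise ValueError("metadata carries no validUntil; refusing to use it")
    # Simplified parse: assumes a UTC timestamp such as 2017-01-01T00:00:00Z.
    expires = datetime.strptime(valid_until, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    if expires <= now:
        raise ValueError("metadata has expired")
    if expires - now > MAX_ACCEPTED_LIFETIME:
        raise ValueError("validUntil is further in the future than local policy allows")
```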
5.6. Cost The infrastructure to support per-entity metadata distribution will require reconfiguration and upgrades over time to address an increasing, sometimes unpredicted, workload. Early experience with MDQ service deployments including both the InCommon MDQ pilot project and the initial rollout of an MDQ service for the UK Access Management Federation indicates that this will not require very large expenditures, even if the future InCommon MDQ service leverages commercial content delivery networks (CDNs). Still, it behooves InCommon to track workload and cost over time to ensure that sufficient resources are available when needed. Final Report of the Per-Entity Metadata Working Group Page 9 5.7. MDQ Client Software and Risk Mitigation Much of the risk detailed above for a per-entity metadata distribution architecture for the InCommon Federation can be mitigated by carefully designing, deploying, and operating the MDQ service. An MDQ service that is always available and responsive goes a long way to addressing much of the risk of transitioning to a per-entity approach. Service outages do happen, however, and even well designed deployments can face uncertain technical challenges as the InCommon and wider communities continue to grow and add new entities to the metadata. It follows that MDQ consumers have a role in reducing and mitigating the risks of per-entity metadata distribution. That role then directly translates into technical requirements for the MDQ consumer software stacks operated by InCommon Participants. Specifically, mitigating the risks of MDQ service unavailability, poor responsiveness, and high latency can only be achieved if MDQ clients (SPs and IdPs) have the capability to detect and then appropriately respond to those service conditions. Such capabilities should include: ● A persistent caching mechanism that retains previously-retrieved metadata across software restart so that it may be re-used if the software is restarted when the MDQ service is not available. A likely mechanism is caching to local disk and then consumption from the cache on restart. ● A mechanism for pre-loading metadata for high-value IdPs and SPs and keeping it available. This enables successful operation the first time a high-value entity’s metadata is needed, even if the MDQ service is not available. ● The ability to detect a failed query, retry appropriately, and after repeated completed but failed queries failover to a secondary MDQ service. A complete implementation would include the ability to mark an MDQ service as unavailable for some time but later test again and return to using it when the service is again available and completing successfully. ● Likewise the ability to detect unresponsive (hanging) MDQ services or MDQ services that do not answer queries fast enough and similarly retry, mark as unavailable, and then later test for restoring into service such MDQ services. Clients implementing the capabilities above should allow administrators to tune thresholds for detecting and responding to failures to accommodate local deployment needs. The Shibboleth development team has added significant capabilities in version 3.3 of the Shibboleth Identity Provider. Clients without such capabilities can still leverage a per-entity metadata distribution infrastructure and interoperate with MDQ services but they risk lower availability for their users. 
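The sketch below illustrates how several of the capabilities listed above (a persistent cache, failover to a secondary service, and temporarily marking a failed service as unavailable) might fit together in client software. It is a simplified model of the behavior the working group is recommending, not a description of any existing Shibboleth or SimpleSAMLphp feature; the endpoints, cache location, timeout, and hold-down interval are all hypothetical.

```python
"""Sketch of an MDQ client with failover and a persistent on-disk cache.

A simplified model of the capabilities described above; all URLs and
tuning values are hypothetical and would be administrator-configurable.
"""
import pathlib
import time
import urllib.parse
import urllib.request


class ResilientMDQClient:
    def __init__(self, services, cache_dir="/var/cache/mdq", timeout=3.0, hold_down=300):
        self.services = {url: 0.0 for url in services}   # base URL -> time it may be retried
        self.cache_dir = pathlib.Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.timeout = timeout                            # per-query timeout in seconds
        self.hold_down = hold_down                        # seconds to avoid a failed service

    def _cache_path(self, entity_id):
        # Percent-encoding the entityID yields a filesystem-safe cache key.
        return self.cache_dir / urllib.parse.quote(entity_id, safe="")

    def fetch(self, entity_id):
        encoded = urllib.parse.quote(entity_id, safe="")
        for base, retry_at in self.services.items():
            if time.time() < retry_at:
                continue                                  # service is in hold-down after a failure
            try:
                with urllib.request.urlopen(base + "entities/" + encoded,
                                            timeout=self.timeout) as resp:
                    document = resp.read()
                self._cache_path(entity_id).write_bytes(document)   # refresh persistent cache
                return document
            except OSError:
                self.services[base] = time.time() + self.hold_down  # mark unavailable, try next
        cached = self._cache_path(entity_id)
        if cached.exists():
            return cached.read_bytes()                    # ride out the outage with cached metadata
        raise RuntimeError("no MDQ service reachable and no cached metadata for " + entity_id)


# Hypothetical primary and secondary endpoints:
client = ResilientMDQClient(["https://mdq.example.org/", "https://mdq-backup.example.net/"])
```

The same structure also shows where preloading fits: writing the metadata of high-value partners into the cache directory out of band makes them resolvable even if the first query for them happens during an outage.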
As the predominant client software used in InCommon, the working group recommends that the InCommon community request that the SimpleSAMLphp Final Report of the Per-Entity Metadata Working Group Page 10 development team add these capabilities. Specific guidance is detailed below and included in the discussion of timelines. 6. MDQ Service Architecture As noted above, the design and implementation of the MDQ service itself will have the greatest impact on mitigating the risks of per-entity metadata distribution. Before discussing risk mitigations and translating them into specific requirements on the InCommon MDQ service, however, it is helpful to examine proposed MDQ service architectures to help put them into context. While all risks described above should be addressed, the singular requirement for a per-entity metadata distribution architecture for the InCommon Federation is that users do not observe any change in the behavior of their InCommon-backed (SAML) authentication flows during and after the transition from aggregate-based distribution to per-entity distribution. This requirement distills down to each relying party, both IdPs and SPs, the requirement that every time a client queries an MDQ service for metadata for a particular entity, the query is answered (high availability) and answered quickly (high responsiveness or low latency) with very high probability. Put simply, any MDQ service operated by InCommon must "just work" in the same way that DNS services "just work". The architecture for the MDQ service will share much of the infrastructure to produce metadata that already exists to create the aggregates. Where it differs is in the addition of processing steps to sign each entity’s metadata, as well as a highly available and responsive distribution layer. 6.1. Existing Infrastructure 6.1.1. Producing Local Metadata InCommon metadata is processed daily. Metadata submitted by InCommon Site Administrators via the Federation Manager (FM) is vetted and approved by the InCommon Registration Authority at approximately 2:30 pm ET every Internet2 business day. Fresh aggregates are signed and published immediately thereafter, at approximately 3:00 pm ET. See the ​InCommon Hours of Operation​ page for more detail. InCommon distributes multiple ​metadata aggregates​ for various purposes. Structural changes to metadata (which often involve extension schema) are systematically pushed through a pipeline of aggregates to avoid breakage. Clients consume whatever aggregate is most appropriate for their particular deployment. Of special interest is the ​fallback aggregate​, a temporary alternative for deployments that experience metadata issues as changes are pushed through the pipeline. Final Report of the Per-Entity Metadata Working Group Page 11 Each aggregate is digitally signed for authenticity and integrity. The daily ​metadata signing process​ is basically a manual process. The metadata signing key resides on an offline laptop stored in a safe with strict access controls. 6.1.2. Importing Global Metadata By the end of Q1 2016, InCommon was fully integrated with the ​eduGAIN​ metadata aggregation service. During that time, InCommon began importing global metadata directly into the InCommon metadata aggregate. Likewise InCommon exported local metadata to eduGAIN. As a consequence of the eduGAIN integration, there are now two distinct sources of metadata: 1) local metadata registered by InCommon, and 2) global metadata registered by other federations. 
The daily metadata signing process combines entity descriptors from both sources into a single, comprehensive metadata aggregate. The diagram below illustrates the expanded infrastructure that incorporates eduGAIN metadata (compare with the diagram shown in the previous section). Final Report of the Per-Entity Metadata Working Group Page 12 In preparation for the daily metadata signing process described in the previous section, a repetitive ​metadata import process​ ensures that fresh global metadata is available for aggregation and signing at 3:00 pm ET. The import process is implemented using a customized instance of the ​Shibboleth Metadata Aggregator​ software, which filters imported metadata according to published ​technical policy rules​. To complete the circular flow of metadata among federations, a subset of entity metadata registered by InCommon is assembled into an ​export aggregate​ and made available for download by eduGAIN operations. As a matter of policy, IdP metadata registered by InCommon is exported to eduGAIN by default whereas InCommon SP owners explicitly opt into metadata export. The initial eduGAIN integration caused the InCommon metadata aggregate to nearly double in size. To compensate for the ever-increasing size of the metadata file, and because some SP deployments will be unable to leverage per-entity metadata (at least initially), an ​IdP-only aggregate​ for SP deployments was introduced in October 2016. IdP deployments, on the other hand, are expected to leverage per-entity metadata as soon as it becomes available. 6.2. Adding Per-Entity Metadata to the Infrastructure In order to provide per-entity metadata distribution, two things must be added to the infrastructure, processing to sign each entity’s metadata, and a highly available and responsive distribution layer. This is illustrated below: Final Report of the Per-Entity Metadata Working Group Page 13 The new processing is shown as “Per-entity metadata” in the center bubble, and the new distribution layer is the “mdq.incommon.org” box in the lower right. The working group’s consensus was that the MDQ service must provide very high availability to the federation’s IdPs and SPs, at least 99.99%. Unfortunately, popular content distribution services, such as commercial CDNs, typically guarantee only 99.9% availability. This is the equivalent of about 43 minutes of downtime per month. In order to achieve at least 99.99% availability (4.3 minutes of downtime per month), the group recommends that both primary and second distribution services be deployed to “ride out” outages in the primary service. The existing SAMLbits CDN, likely augmented with additional nodes contributed by InCommon, could be a good candidate for the secondary service. Note that the combination of a highly available primary distribution service, a secondary distribution service, and MDQ clients with appropriate configuration for failover and sophisticated persistent caching, tunable for high-value relying parties, will provide a solution for the large majority of InCommon Participants. Finally, it is not expected that most deployers will leverage a local caching service or out of band cache filling processes, but the opportunity exists for extremely critical sites, or sites with problematic Internet connectivity to implement a local metadata cache. The following diagram illustrates this distribution architecture. 
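For reference, the downtime figures quoted above follow from straightforward arithmetic, sketched below. The combined figure for a primary/secondary pair rests on the optimistic assumption that outages of the two services are independent; correlated failures (for example, a shared DNS or routing problem) would make the real improvement smaller.

```python
# Back-of-the-envelope arithmetic for the availability figures above. The
# combined figure assumes primary and secondary outages are independent.
MINUTES_PER_MONTH = 30 * 24 * 60          # ~43,200 minutes

for availability in (0.999, 0.9999):
    downtime = (1 - availability) * MINUTES_PER_MONTH
    print(f"{availability:.2%} availability ~= {downtime:.1f} minutes of downtime per month")

both_down = (1 - 0.999) ** 2              # probability both services are down at once
print(f"two independent 99.9% services: ~{both_down * MINUTES_PER_MONTH:.2f} minutes per month")
```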
Final Report of the Per-Entity Metadata Working Group Page 14 In order to meet the high availability and low latency requirements for distribution, the working group has discussed two strategies for the primary and secondary distribution services: 1. Content delivery network based distribution 2. Traditional server based distribution Either strategy or both could be selected for the deployed solution architecture. 6.2.1. Content Delivery Network Based Distribution A content delivery network (CDN) is a distributed network of proxy servers deployed in multiple data centers that serve content to consumers with high availability and high performance. CDNs serve a large portion of web content today, including many of the standard JavaScript libraries, Cascading Style Sheets (CSS), and images and other media files. Besides better performance and availability, CDNs also offload the traffic served directly from a content provider's origin infrastructure and can provide a degree of protection from DoS attacks by using their large distributed server infrastructure to absorb the attack traffic. Considerations for the use of CDNs for distribution include: Final Report of the Per-Entity Metadata Working Group Page 15 ● Pros ○ CDNs already mitigate common risks to cloud-based services, such as DDoS. ○ Capacity scaling is automatic. ○ The cost will likely be lower, although this will require more detailed analysis. ● Cons ○ Most CDNs are optimized for use by browsers which can, for example, switch quickly among multiple IP addresses for a server. Clients of services like MDQ may or may not be that agile. ○ CDNs must have access to the private key used for TLS. While multiple mitigation strategies are typically provided, they are only partial. The security risks must be analyzed. ○ MDQ is new to InCommon; so is the use of CDNs. It may be prudent to introduce only one new technology at a time. 6.2.2. Traditional Server-Based Distribution The alternative to use of CDNs is to deploy a more traditional server-based distribution layer, operated by Internet2. Such an infrastructure could be based either in the cloud or in geographically-distributed Internet2 data centers. Considerations for this alternative include: ● Pros ○ The service would be optimized to the needs of MDQ client software, rather than browsers. ○ Access to the private key used for TLS can be managed directly by Internet2. ○ The architecture is more familiar. Fewer new technologies would be introduced simultaneously. ● Cons ○ The service will need to address other risks, such as DDoS attacks, that are inherently addressed by CDNs. ○ Capacity scaling will likely not be as easy/automatic. ○ The cost will likely be higher, although this will require more detailed analysis. 7. Requirements for the InCommon MDQ Service InCommon’s MDQ service must be designed and deployed to address the risks and other issues described in this document. Specific requirements are detailed below. Final Report of the Per-Entity Metadata Working Group Page 16 7.1. Security The security of the MDQ service must, as much as possible, be equivalent to existing aggregate service. As noted above, the nature of MDQ may increase the risk of a metadata consumer receiving out of date information that is still within its validity period, due to a man-in-the-middle attack between the distribution layer and the consumer; this must be mitigated through the use of TLS unless some other mitigation is determined to be more appropriate. 7.2. 
Availability

Any MDQ client operated by an InCommon Participant must find the service available. An outage is any event in which a client queries the service and, due to an issue with the service delivery infrastructure, the service does not respond in compliance with the MDQ specification. The time period during which the service suffers no outages is known as the service uptime. The monthly service uptime percentage is the percentage of client service transactions in which the service responds to client queries and delivers the requested metadata without error. The InCommon MDQ service must realize a monthly service uptime percentage of at least 99.99%.

This figure does not include or address outages which occur due to failures of the network or infrastructure at sites running per-entity metadata clients. These failures are not addressable by federation operations, and are explicitly out of scope for service availability targets.

7.3. Responsiveness

The distribution layer of the InCommon MDQ service must provide a response time of no more than 200ms [2] for at least 99% of all queries received each month from a test probe on or near the Internet2 backbone. The test probe will select one entity per minute, retrieve its metadata, and record the response time. InCommon will post monthly reports of the response time distribution on the web.

It is understood that not all InCommon participants will experience the response times observed by the test probe, depending on each participant’s server and network topology with respect to the Internet2 backbone. It is also understood that it is the participants’ observation of response time that is truly important, even if it is difficult or impossible to generalize into this service specification. For this reason, response time measurements should also be taken from participants’ sites. We recommend that InCommon identify representative participants and ask them to contribute records of their response times for inclusion with the monthly reports. We also recommend that TIER include metadata query response times in its current efforts to instrument Shibboleth software.

[2] 200ms is a small enough fraction of the overall response time of an SSO flow that it is not expected to significantly change the user experience.
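To make the measurement described in section 7.3 (and reported on under section 7.5) concrete, the sketch below shows one way a probe might record per-query response times and summarize a month of one-per-minute samples. The endpoint, the entity list, and the naive percentile calculation are all assumptions for illustration, not part of the service specification.

```python
"""Sketch of a responsiveness probe in the spirit of section 7.3.

Assumptions: the endpoint and entity list are placeholders, and the 99th
percentile is computed naively over one month of one-per-minute samples.
"""
import random
import time
import urllib.parse
import urllib.request

MDQ_BASE_URL = "https://mdq.example.org/"                 # hypothetical endpoint
ENTITY_IDS = ["https://idp.example.edu/idp/shibboleth"]   # would be drawn from published metadata


def probe_once():
    entity_id = random.choice(ENTITY_IDS)
    url = MDQ_BASE_URL + "entities/" + urllib.parse.quote(entity_id, safe="")
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return (time.monotonic() - start) * 1000.0            # response time in milliseconds


def percentile_99(samples):
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]


samples = []
# One query per minute; a month of samples would then be summarized in a report.
for _ in range(60 * 24 * 30):
    try:
        samples.append(probe_once())
    except OSError:
        samples.append(float("inf"))                      # count failed queries against the target
    time.sleep(60)
print("99th percentile response time: %.1f ms" % percentile_99(samples))
```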
7.4. Metadata Production

The working group does not recommend any changes to the processes and systems that produce the metadata that is distributed both as aggregates and via the MDQ service. In particular, we do not see a need to change the current once-per-day metadata production or the two-week metadata validity period. If at some time, however, InCommon were to decide to leverage per-entity distribution to institute more frequent, or real-time, publishing of metadata updates, then the availability of the metadata production infrastructure should be addressed in light of such new service commitments.

A decision to publish metadata more frequently would require reconsideration of the metadata validity and caching intervals. Shortening the validity interval decreases the risk of old metadata being reintroduced by a malicious actor, but it also increases the risk that current metadata will expire during an outage of the metadata production infrastructure. Shortening the caching interval shortens the time required for metadata updates to propagate to IdPs and SPs, but it also increases load on the distribution infrastructure. Achieving a proper balance for both parameters will be essential if more frequent publishing is desired at some time in the future.

7.5. Monitoring

In addition to the monitoring for responsiveness described in section 7.3, InCommon Operations should monitor the availability and performance of the Per-Entity Metadata Service from multiple geographic locations on a regular basis to demonstrate compliance with the service requirements in this section. These results should be made transparent and accessible to all federation participants, in a manner to be determined by Federation Operations.

8. Other Issues

8.1. Discovery

In order to authenticate a user and retrieve attribute information about them, an SP must redirect the user to their correct IdP. This process of determining which IdP to use is called Discovery. Discovery takes many forms, but many of them rely upon the SP having a list of all IdPs in metadata so that its users can select from any of the available IdPs. With per-entity distribution, such a list is not readily available; SPs that rely upon it thus cannot migrate entirely to per-entity metadata in its current form.

Discovery was explicitly not part of this working group’s charge, but a solution is critical for any SP that provides a discovery service for its users. InCommon should convene a working group to address this issue quickly to assure that affected SPs have a path forward before the growing metadata aggregate impacts them.

8.2. Deployment Profiles

The MDQ protocol adds new criteria that must be addressed in the deployment of SAML services. InCommon should work with its own Deployment Profile Working Group to assure consideration of MDQ in their specification. At a minimum, a requirement to support MDQ and per-entity metadata distribution should be specified.

8.3. Local Site Caching

Some sites will have higher requirements for availability and/or responsiveness than is offered by the primary and secondary CDNs. Likely reasons for this are insufficient Internet connectivity for the site or heavy reliance on metadata for SAML-based authentication internal to the site. Such sites can create local caching servers, or even a private CDN, to address this. They can also deploy site-specific out-of-band processes to pre-fill caches for local services and high-value partners. While InCommon would not support such local infrastructure, it should provide documentation to enable a site to support its own. InCommon should track and document lessons it learns about the effectiveness of caching and other parameters to improve availability and performance. The TAC should consider creation of a working group to work with InCommon Operations on these issues.

8.4. Open Access

The software, documentation, etc. for the MDQ service must be openly available to others wishing to follow in InCommon’s footsteps. This applies to the MDQ software itself, as well as tools for managing and monitoring the service.

9. InCommon Per-Entity Distribution Roadmap

9.1. Short Term (1-3 months)

There is concern that many SPs will soon be impacted by the current, growing metadata aggregate. As a short-term workaround, InCommon will deploy an IdP-only aggregate file on a daily basis.

9.2. Medium Term (2-12 months)

Upon acceptance of this report, InCommon will develop a detailed plan and allocate resources for the deployment of per-entity metadata services.
This plan will include the following elements:
● A detailed architecture that addresses the risks and service commitments described in this document
● Formal requests to the Shibboleth and SimpleSAMLphp development communities to address the capabilities to weather short service outages described in MDQ Client Configuration and Caching Capabilities
● Deployment of the detailed architecture
● A communications plan, including
○ What participants must do
○ What participants may choose to do
○ What participants cannot do (e.g., discovery) with workarounds, if possible
○ Timeline of events (conditioned on the timing of Shibboleth and SimpleSAMLphp development)
○ Participant feedback

9.3. Long Term (12-24 months)

Per-entity metadata distribution is in production during this period. A solution for discovery that is analogous to existing discovery approaches is identified and deployed. Operational issues are discovered and resolved. IdPs and SPs that consume metadata, but have not adopted per-entity metadata distribution, are increasingly experiencing problems due to the size of the aggregate.

9.4. Longer Term (24+ months)

The vast majority of IdPs and SPs that consume metadata have migrated to per-entity metadata distribution in the 24-36 month time frame, even those that were delayed due to discovery issues. More work is done to improve discovery strategies. InCommon develops a plan for the future of aggregate distribution in the 36-48 month time frame, depending on how vast that majority is.

10. Appendix: State of MDQ support in IdP and SP software

The working group reached out to software providers to determine their support for MDQ. At this time, Shibboleth and SimpleSAMLphp are the only two that have released or are working on MDQ support; they are also the two implementations that support federation-provided metadata via any distribution method. We remind sites that operate SAML software stacks other than Shibboleth or SimpleSAMLphp that only those projects have historically and consistently supported functionality highly desired for the best interoperability in the higher education and research federations.

The following entries capture the current and future state of client software capable of requesting and consuming per-entity metadata via the Metadata Query Protocol, listing for each product its MDQ support, notes on current capability, security model(s), and known future capabilities or enhancements.

Shibboleth SP (current: V2.6.0)
Supports MDQ protocol? Yes
Notes on current capability: See the Dynamic MetadataProvider topic in the Shibboleth wiki. This feature (first introduced in SP V2.0) is mostly untested (which means there are probably bugs) but is already being enhanced in response to this group’s feedback.
Security Model(s): XML Signature, TLS validation against explicit anchors
Known future capabilities or enhancements: New "file://" feature in SP V2.6.0. Committed to add additional caching support in 2017.

Shibboleth IdP (current: V3.3)
Supports MDQ protocol? Yes
Notes on current capability: See the DynamicHTTPMetadataProvider topic in the Shibboleth wiki. This feature (new in IdP V3.3) is probably the most capable client implementation available but has seen little use to date.
Security Model(s): XML Signature, TLS validation against explicit anchors
Known future capabilities or enhancements: V3.3 has introduced substantial caching enhancements in line with the group’s suggestions.
SimpleSAMLphp (current: V1.14.8)
Supports MDQ protocol? Yes
Notes on current capability: MDQ metadata handler merged on March 16, 2015. There is no formal documentation (search for "MDQ" in config.php). This feature is mostly untested.
Security Model(s): XML Signature (via cert fingerprint)

ADFS 2.0 (Server 2008 and Server 2008 R2)
Supports MDQ protocol? Partial
Notes on current capability: ADFS will fetch and cache a single SAML EntityDescriptor at a configured endpoint location beginning with "https://"
Security Model(s): TLS

ADFS 3.0 (Server 2012 R2)
Supports MDQ protocol? Partial
Notes on current capability: ADFS will fetch and cache a single SAML EntityDescriptor at a configured endpoint location beginning with "https://"
Security Model(s): TLS

ADFS 4.0 (Server 2016 Tech Preview)
Supports MDQ protocol? Partial
Notes on current capability: ADFS will fetch and cache a single SAML EntityDescriptor at a configured endpoint location beginning with "https://"
Security Model(s): TLS
Known future capabilities or enhancements: This version may load an aggregate.

Ping
Supports MDQ protocol? No
Notes on current capability: Ticket filed for next release to enable the needed 'Accepts' header value.
Security Model(s): TLS

11. Appendix: Per-Entity Metadata Working Group Charter

Problem Statement

In its 10+ years, the InCommon federation has grown from serving a very limited number of applications with an equally small number of participants (mostly large research universities), to a federation that now supports thousands of different applications across approximately 650 active participants. Add to this a rapid growth in the size of InCommon metadata, due to InCommon’s production support for the eduGAIN interfederation service, and the federation is at risk of becoming a victim of its own success.

Metadata aggregates, that is, metadata made up of more than one SAML entity descriptor element, are static lists of entity descriptors that are aggregated, validated, signed and distributed to consumers of federation metadata. This model is analogous to how hostname resolution was done before DNS existed, using ‘hosts’ files, and it has reached the end of its sustainability the way that solution did long ago. Aggregates are inherently brittle - an error in a single entity descriptor can cause issues loading an entire aggregate. Additionally, very large metadata aggregates, as InCommon now distributes on a daily basis, have a number of other major drawbacks:
1. Increased bandwidth use to distribute a large file that consumers almost certainly don’t need in its entirety
2. Inefficient use of client bandwidth to download a large aggregate on a regular basis
3. Increased time to canonicalize (XML document normalization) a large XML document so that the signature on it may be verified - thus increased time to start up a SAML deployment consuming a large aggregate
4. Increased memory needed to canonicalize a large XML document - now on the order of gigabytes, and this will only increase over time. A waste of deployer resources.
5. Intentional or unintentional denial-of-service for consumers of an entire aggregate based on malformed entity descriptors imported from other federations.

To address these and other concerns with the aggregate, InCommon’s previous Metadata Distribution Working Group[1] recommended a test deployment of the Metadata Query Protocol (MDQ)[2]. For over two years, InCommon has been running an MDQ testbed to gain experience with the technology and this new model. This new working group is charged with items necessary to allow InCommon Ops to move this technology into a production-ready service.

Stakeholders/Influencers/Influences

Different audiences can impact different aspects of this problem:
1. SAML deployers - IdP, SP, AA, Discovery Services, etc.
of various implementations: Shibboleth, SimpleSAMLphp, ADFS, Ping, etc. 2. SAML implementers - Latest versions of both Shibboleth and SimpleSAMLphp support the MDQ protocol, but implementation issues may exist that have not been found due to the need for operational exercising of these features. Other implementations such as ADFS may be enabled to participate in the federation in ways they have not been able to previously. 3. InCommon Operations and Internet2 Technical Services Group (TSG) - Running a highly reliable service that responds to requests for entity descriptors in real-time is a service delivery model that is new for InCommon and will require additional resources to support. 4. Participants - what needs do they have for local per-entity metadata installations to allow for local generation and consumption of per-entity site-specific metadata? Do they have a need for a local copy of a cache of per-entity metadata for redundancy reasons? Etc…. 5. International community - how will an InCommon per-entity metadata service align with plans that other federation operators may have? Charter The Per-Entity Metadata Working Group will: 1. Work based on the premise that InCommon will be moving toward per-entity MDQ[​2​] protocol-based distribution of metadata. 2. Develop a roadmap for addressing the immediate needs for reduced aggregate size, as well as intermediate milestones along a trajectory to a sustainable future state, to be determined. The first items on this roadmap should include building a production service which allows production SAML deployments to exercise their per-entity metadata capabilities, and include checkpoints to improve the service and software when issues are encountered. If short-term steps such as InCommon producing separate IdP and SP feeds are deemed necessary, these items should be included in this roadmap. This roadmap should also address the issue of continued creation (or eventual decommissioning) of multi-entity aggregates. 3. Address issues and questions that have arisen about the process of moving from where we are to relying on this new model, including but not limited to: a. High availability b. Performance c. Site redundancy 4. Develop requirements, risks, and recommended risk mitigation strategies for a production per-entity metadata service delivered by InCommon, including a firm definition of the scope of the service, aligned with the immediate needs addressed in the roadmap from (1). Final Report of the Per-Entity Metadata Working Group Page 24 5. Advise InCommon staff on implementation of a solution, based on the requirements of the service documented in (4). 6. Compile the outcomes of these investigations into a report to the TAC Explicitly out-of scope is: 1. A ‘full DNS’ model which would require changes to the MDQ protocol or current software implementations of the protocol. 2. Other items that need to be resolved by the international community. These items should be addressed via appropriate forums such as REFEDS, or better yet, the IETF. 3. Determining a concrete roadmap for ceasing production of multi-entity aggregates. This item is important, but must be the work of a later group, after we build experience using a production-quality per-entity metadata service. 4. A solution to the IdP discovery problem in light of per-entity metadata. Discussion or debate of options is reasonable, as long as the WG's main deliverables are not sidetracked. 5. 
Choice of a per-entity metadata server application or specific testing of software related to a specific choice of server application. That is the realm of operationalizing a production service and is up to InCommon staff. 6. Support for querying entities by anything other than entityID (already put out-of-scope by previous work:​ ​https://spaces.internet2.edu/x/BoGDAg​) Membership Membership in the Working Group is open to all interested parties. Solicitation will take place on lists such as the InCommon Participants list and the REFEDS list, explicitly seeking international participation. Members join the Working Group by subscribing to the mailing list, participating on the phone calls, and otherwise actively engaging in the work of the group. Work Products 1. September, 2016 - Draft Report to the TAC, Report out at TechExchange 2. November, 2016 - Final Report to the TAC Related Resources 1. Metadata Distribution Working Group​ recommendation on pilot study of MDQ 2. MDQ protocol draft 3. Draft call for participation in MDQ testbed​ (restricted access) Final Report of the Per-Entity Metadata Working Group Page 25 12. Appendix: Working Group Participants ● Jorj Bauer, Temple University ● Scott Cantor, Ohio State University ● Steve Carmody, Brown University ● Paul Caskey, Internet2 ● Tommy Doan, Southern Methodist University ● Michael Domingues, <​http://orcid.org/0000-0001-6978-2803​>, University of Iowa ● Paul Engle, Rice University ● Lukas Hämmerle, SWITCH ● Walter Hoehn, University of Memphis ● Chris Hubing, <​http://orcid.org/0000-0002-8565-1966​>, Internet2 ● John Kazmerzak, <​https://orcid.org/0000-0002-6575-3340​>, University of Iowa ● IJ Kim, Internet2 ● Scott Koranda, <​https://orcid.org/0000-0003-4478-9026​>, LIGO, Chair ● Tom Mitchell, GENI ● Kevin Morooney, InCommon / Internet2 ● Chris Phillips <​http://orcid.org/0000-0001-5567-4916​>, CANARIE ● Phil Pishioneri, Penn State ● Nick Roy, InCommon / Internet2 ● Tom Scavo, InCommon / Internet2 ● Rhys Smith, Jisc ● David Walker, <​https://orcid.org/0000-0003-2540-0644​>, InCommon/Internet2, Flywheel ● Ann West, InCommon / Internet2 ● Ian Young, InCommon / Internet2 Final Report of the Per-Entity Metadata Working Group Page 26