2021-10-07 MDQ intermittent outage

As of 2021-10-08 15:15 MDT, this issue is resolved. See below for further information.

Update 2021-10-07 09:27 MDT

We are aware of failed MDQ requests that are going to a number of CloudFront edge servers. We are able to reproduce these failed requests, and are looking into the failure.

Update 2021-10-07 10:00 MDT

We have opened a support case with Amazon Web Services with regard to the edge cache locations which appear to be failing:

13.32.208.20
13.32.208.84
13.32.208.58
13.32.208.38

Update 2021-10-07 12:30 MDT

We have escalated the issue within AWS and are currently working with an AWS support engineer who is able to reproduce the issue.

Update 2021-10-07 14:21 MDT

A possible work-around has been identified.

This work-around is temporary and must not be left in place long-term, or it will cause another service outage when AWS moves edge locations. The values below are examples: they worked at the time of testing, but are not guaranteed to keep working. You may need to use a VPN or similar facility to see how DNS resolves in other locations, and substitute your own known-working values. You can test metadata resolution with curl on *NIX, or an equivalent tool, as follows:

curl --resolve mdq.incommon.org:443:65.8.242.13 https://mdq.incommon.org/entities/https%3A%2F%2Fidp-prod.cc.ucf.edu%2Fidp%2Fshibboleth

Substitute the IP address you want to check after the ":443:".
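If you need to check several candidate addresses, the same curl command can be generated for each one. The following is a minimal sketch, not an official tool: the function name curl_resolve_command is hypothetical, the URL is copied from the example above, and the candidate addresses are the example values on this page, which you should replace with your own.

```python
import ipaddress
import shlex

MDQ_HOST = "mdq.incommon.org"
# Test URL copied from the curl example on this page.
TEST_URL = (
    "https://mdq.incommon.org/entities/"
    "https%3A%2F%2Fidp-prod.cc.ucf.edu%2Fidp%2Fshibboleth"
)


def curl_resolve_command(ip, url=TEST_URL, host=MDQ_HOST, port=443):
    """Return the curl argument list that pins `host` to `ip` for one request."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return ["curl", "--resolve", f"{host}:{port}:{ip}", url]


if __name__ == "__main__":
    # Example candidates only; substitute the addresses you want to check.
    for candidate in ["65.8.242.13", "65.8.242.100", "65.8.242.52", "65.8.242.99"]:
        print(shlex.join(curl_resolve_command(candidate)))
```

The sketch only prints the commands, so you can review them before running any of them in a shell.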

You can create or update the /etc/hosts file on *NIX-based hosts which are failing to retrieve metadata, or the %WinDir%\System32\Drivers\Etc\hosts file on Windows.

Put the following entry in the file:

65.8.242.99 mdq.incommon.org

This will cause that system to resolve mdq.incommon.org to a currently known-good edge location. Other current known-good edge locations are:

65.8.242.13
65.8.242.100
65.8.242.52
65.8.242.99

Note that when the outage is over, you MUST remove any mdq.incommon.org hosts file entries, or you will cause an outage for the relevant system in the future.
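To help find leftover entries once the outage is over, a hosts file can be scanned for active lines that name mdq.incommon.org. This is a minimal sketch under stated assumptions: the function name find_pinned_entries is hypothetical, and the parsing assumes the usual hosts file layout (an address, then one or more names, with "#" starting a comment).

```python
def find_pinned_entries(hosts_text, hostname="mdq.incommon.org"):
    """Return (line_number, ip) pairs for active hosts-file lines naming `hostname`."""
    found = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop anything after "#", then split the rest on whitespace.
        fields = line.split("#", 1)[0].split()
        names = [name.lower() for name in fields[1:]]
        if hostname.lower() in names:
            found.append((number, fields[0]))
    return found


if __name__ == "__main__":
    # Sample content for illustration; pass in your real hosts file text instead.
    sample = "127.0.0.1 localhost\n65.8.242.99 mdq.incommon.org\n"
    for number, ip in find_pinned_entries(sample):
        print(f"line {number}: mdq.incommon.org is pinned to {ip}")
```

In practice you would read the text from /etc/hosts, or from the hosts file under %WinDir% on Windows, and remove any line the function reports.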

Update 2021-10-07 15:05 MDT

Amazon has escalated this problem within their CloudFront subject matter expert team. We have requested half-hourly updates from them. They continue to work to identify the cause of the intermittent outage. InCommon staff are monitoring the situation. We are aware that other CloudFront edge location IP addresses beyond the ones first identified are affected; we see failures for other edge locations in our CloudFront access logs.

Update 2021-10-08 09:12 MDT

We continue to work with AWS engineers on the issue. They have requested some specific log data which we have provided.

Update 2021-10-08 10:26 MDT

We are still awaiting further information from the AWS engineering team to which the issue has been assigned.

Update 2021-10-08 12:18 MDT

AWS has informed us that there is a problem with certain edge nodes handling encoding of certain special characters which are a part of typical MDQ requests. This results in incorrect requests going from CloudFront into our S3 metadata origin. They have assured us that they are working on this with the highest possible internal escalation.
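For context on which characters are involved, the sketch below shows how an entity ID is percent-encoded into an MDQ request URL. It does not reproduce the AWS fault, whose details we do not have; it only illustrates the expected encoding. The function name mdq_request_url is hypothetical, and the base URL and entity ID are taken from the curl example earlier on this page.

```python
from urllib.parse import quote, unquote

# Base URL taken from the curl example on this page.
MDQ_BASE = "https://mdq.incommon.org/entities/"


def mdq_request_url(entity_id, base=MDQ_BASE):
    """Build an MDQ request URL with the entity ID percent-encoded."""
    # safe="" makes quote() encode ":" and "/" as well, so the whole
    # entity ID travels as a single path segment.
    return base + quote(entity_id, safe="")


if __name__ == "__main__":
    entity_id = "https://idp-prod.cc.ucf.edu/idp/shibboleth"
    url = mdq_request_url(entity_id)
    print(url)
    # Decoding the final path segment should give back the original entity ID.
    print(unquote(url[len(MDQ_BASE):]) == entity_id)
```

The ":" and "/" characters in the entity ID become "%3A" and "%2F" in the request path, which is why most MDQ requests contain encoded special characters.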

Update 2021-10-08 13:59 MDT

The InCommon operations team is testing some formerly failing edge locations, and they appear to be starting to work again. We are cautiously optimistic. We are still awaiting formal word from AWS.

Update 2021-10-08 14:55 MDT

We have received reports that services which were affected by this outage are now behaving normally. We have tested all the known-bad edge locations that we were tracking, and they are all now correctly returning MDQ results with an HTTP "200" status code in response to correctly-formed metadata requests for published SAML entity descriptors. We have not yet received confirmation from AWS that this issue has been fully resolved.

Update 2021-10-08 15:12 MDT

AWS has officially notified us that they have resolved the issue, and our testing bears this out. We are marking this issue as "resolved", but will await word from AWS on what caused this outage, what was done to fix it, and what will be done to try to prevent its recurrence. We will prepare an after-action report, which will be made available once we have complete data. The data may take a while to gather, assemble, and make ready for public consumption. We will also work with the relevant governance body, the InCommon Technical Advisory Committee, on next steps.
