Summary

At approximately 14:30 EDT on September 8, 2022, the eduroam.us national proxy servers stopped responding to requests.  Service was restored by 14:55 EDT, and eduroam administrators were notified by 16:50 EDT.  Since then, additional checks and monitoring have been instituted to allow a quicker recovery in the event of a similar incident.

Root cause analysis

The outage occurred when the load balancing system’s health testing processes removed all of the IPv4 addresses from the load balancer.  The removal was triggered by a transient network failure of the Virtual Private Network (VPN) between the six geographically distributed AWS (Amazon Web Services) sites running the two Top Level RADIUS Servers (tlrs1.eduroam.us and tlrs2.eduroam.us).

The cause of the network failure is not known.  The data centers running these tasks communicate via the open Internet, which Internet2 does not control.  IPv6 traffic was not impacted, so the monitoring system, which did not check IPv4 and IPv6 availability separately at the time, did not detect the issue.

Mitigations

Health Testing Improvements

  1. The health testing subsystem has been updated to better handle any loss of connectivity within the eduroam infrastructure. 
  2. The process by which previously unresponsive infrastructure elements are brought back online and returned to production has been improved.
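
One way to picture the two health testing changes above is the sketch below.  It is an illustration only, not the eduroam implementation: the Pool class, its min_members floor and its rejoin_after threshold are all hypothetical names and values chosen for the example.  It shows a pool that refuses to drain itself completely when every check fails at once, and that re-admits a member only after several consecutive passing checks.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical load balancer pool; not the eduroam implementation."""
    min_members: int = 1   # assumed floor: never drain the pool below this
    rejoin_after: int = 3  # assumed number of consecutive passes to rejoin
    active: set = field(default_factory=set)
    passes: dict = field(default_factory=dict)

    def apply(self, results: dict) -> set:
        """Update the pool from one round of health checks (addr -> bool)."""
        # Track consecutive passing checks per address; a failure resets it.
        for addr, ok in results.items():
            self.passes[addr] = (self.passes.get(addr, 0) + 1) if ok else 0

        failed = {a for a in self.active if not results.get(a, True)}
        # Fail open: if removing the failed members would leave fewer than
        # min_members, keep the current pool rather than emptying it.
        if len(self.active - failed) >= self.min_members:
            self.active -= failed

        # Re-admit members only after enough consecutive passing checks.
        for addr, count in self.passes.items():
            if count >= self.rejoin_after:
                self.active.add(addr)
        return set(self.active)
```

With these assumed settings, a round in which every member fails leaves the pool unchanged, while a single recovered member rejoins after its third consecutive passing check.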

System Monitoring Improvements

  1. The system now generates informational messages any time the pool of containers for the load balancer changes.
  2. The load balancer pool itself is now monitored and issues operator alerts whenever the pool reaches critically low levels.  
  3. The overall system checks have been modified to separate IPv4 and IPv6 availability, since almost all subscriber traffic uses IPv4.
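
The three monitoring changes above could be sketched roughly as follows.  This is a hedged illustration rather than the monitoring code actually deployed: the function names, the message formats and the CRITICAL_MIN threshold are assumptions, and the addresses in the usage note come from the reserved documentation ranges.

```python
import ipaddress
from collections import Counter

CRITICAL_MIN = 2  # assumed threshold for a "critically low" pool


def family(addr: str) -> str:
    """Return 'IPv4' or 'IPv6' for a literal IP address."""
    return f"IPv{ipaddress.ip_address(addr).version}"


def pool_change_message(before, after):
    """Informational message when pool membership changes, else None."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    if not added and not removed:
        return None
    return f"INFO: pool changed; added={added} removed={removed}"


def pool_alerts(members, critical_min=CRITICAL_MIN):
    """Check each address family separately, so that a healthy IPv6 pool
    cannot mask an empty IPv4 pool."""
    counts = Counter(family(a) for a in members)
    alerts = []
    for fam in ("IPv4", "IPv6"):
        n = counts.get(fam, 0)
        if n < critical_min:
            alerts.append(
                f"CRITICAL: {fam} pool has {n} member(s), below {critical_min}"
            )
    return alerts
```

For example, pool_alerts(["2001:db8::1", "2001:db8::2"]) returns a single IPv4 alert even though the IPv6 side of the pool is fully populated, which is the condition that went undetected during this incident.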

Conclusion

The outage of September 8 was caused by a transient network failure.  As noted above, Internet2 cannot have full visibility into or control over the open Internet.  The mitigations put in place focus on monitoring and detection, recovery, and redundancy.

The system has been updated to recover on its own after a network outage and to alert operators when the load balancer pool reaches critically low levels.  This will not eliminate the possibility of network outages, but it will automate the process of restoring the eduroam service when the network returns to normal.

eduroam operations staff will continue to improve monitoring capabilities, identify single points of failure, and mitigate them where possible.

Timeline

All times are in Eastern Daylight Time.

September 8, 2022:

September 9, 2022:

September 12, 2022: