At approximately 14:30 EDT on September 8, 2022, the eduroam.us national proxy servers stopped responding to requests. Service was restored by 14:55 EDT, and eduroam administrators were notified by 16:50 EDT. Since then, additional checks and monitoring have been put in place to allow quicker recovery from a similar incident.
The outage was caused by the load balancing system’s health testing processes, which removed all of the IPv4 addresses from the load balancer. They did so in response to a transient network failure of the Virtual Private Network (VPN) between the six geographically distributed Amazon Web Services (AWS) sites running the two Top Level RADIUS Servers (tlrs1.eduroam.us and tlrs2.eduroam.us).
The cause of the network failure isn’t known. The data centers running these tasks communicate via the open Internet, which Internet2 does not control. IPv6 traffic was not impacted, which prevented the monitoring system from detecting the issue.
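The detection gap described above can be illustrated with a minimal sketch, assuming a monitor that evaluates each address family on its own instead of treating any successful response as healthy. The names `check_families` and `service_status` and the injected `probe` callable are hypothetical; this is not the eduroam monitoring code.

```python
from typing import Callable, Dict

# Address families evaluated independently (labels are illustrative).
FAMILIES = ("ipv4", "ipv6")


def check_families(probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Run the supplied probe once per address family.

    `probe` is a hypothetical callable that returns True when the
    service answered over the given family.
    """
    return {family: probe(family) for family in FAMILIES}


def service_status(results: Dict[str, bool]) -> str:
    """Summarize per-family results.

    "degraded" means at least one family works while another fails,
    which is the condition a combined check would report as healthy.
    """
    if all(results.values()):
        return "ok"
    if any(results.values()):
        return "degraded"
    return "down"
```

Under these assumptions, an IPv4-only failure yields `"degraded"` rather than `"ok"`, so an alert rule can fire on it.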
Health Testing Improvements
System Monitoring Improvements
The September 8 outage was caused by a transient network failure. As stated earlier, Internet2 cannot have full visibility into or control over the open Internet. The mitigations put in place focus on monitoring and detection, recovery, and redundancy.
The system has been updated to recover automatically after a network outage and to alert operators when the load balancing pool becomes empty. This will not eliminate the possibility of network outages, but it will automate the process of restoring the eduroam service when the network returns to normal.
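A minimal sketch of the recovery behavior described here, assuming a health checker that re-evaluates every configured address on each pass, restores any address that passes again, and raises one alert when the pool becomes empty. The `Pool` class, its `reconcile` method, and the `alert` callback are hypothetical names used for illustration, not the production implementation.

```python
from typing import Callable, Iterable, List, Set


class Pool:
    """Tracks which configured addresses are currently in service."""

    def __init__(self, members: Iterable[str], alert: Callable[[str], None]):
        self.members: List[str] = list(members)  # every configured address
        self.active: Set[str] = set(self.members)  # addresses in service
        self.alert = alert  # hypothetical operator notification hook

    def reconcile(self, is_healthy: Callable[[str], bool]) -> Set[str]:
        """Re-test all configured members, including removed ones.

        Testing removed members is what lets the pool refill on its
        own once the network recovers.
        """
        was_empty = not self.active
        for member in self.members:
            if is_healthy(member):
                self.active.add(member)
            else:
                self.active.discard(member)
        # Alert once, on the transition to an empty pool.
        if not self.active and not was_empty:
            self.alert("load balancing pool is empty")
        return set(self.active)
```

In this sketch a failed pass empties the pool and notifies operators once; the next pass in which the health test succeeds puts the addresses back without manual intervention.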
eduroam operations staff will continue to improve monitoring capabilities, identify single points of failure, and mitigate them where possible.
All times are in Eastern Daylight Time.
September 8, 2022:
September 9, 2022:
September 12, 2022: