
Performance WG Face-to-Face Meeting

at Internet2 Spring Member Meeting
April 27, 2009

Attendees

Carla Hunt, MCNC (Chair)

Jeff Boote, Internet2

Katsuhiro Sebayashi, Nippon Telegraph and Telephone Corp (NTT)

Hisao Uose, Nippon Telegraph and Telephone Corp (NTT)

Kenji Shimizu, Nippon Telegraph and Telephone Corp (NTT)

Takehito Suzuki, Nippon Telegraph and Telephone Corp (NTT)

Kazuto Noguchi, Nippon Telegraph and Telephone Corp (NTT)

Tom Throckmorton, MCNC (via phone)

Chris Hawkinson, CENIC (via phone)

Brian Tierney, ESnet (via phone)

Peter O'Neil, MAX (Mid-Atlantic Crossroads)

Rich Carlson, Internet2

Andrea Blome, Internet2

Linda Winkler, University of Chicago

Scott Colburn, U.S. Department of Commerce Boulder Labs

Don McLaughlin, Indiana University

John Streck, University of North Carolina at Chapel Hill

Charles Hollingsworth, Georgia State University

Martin Swany, University of Delaware

Maciej Strozyk, PSNC/PIONIER

Kazunori Konishi, APAN

Jon Dugan, ESnet

John Bartin, Washington University in St. Louis

Hans Wallberg, SUNET

Per Nihlen, NORDUnet

Grant Miller, National Coordination Office, Computing, Information, and Communications

Bob Gerdes, Rutgers

John Stier, Stony Brook University, State University of New York

Emily Eisbruch (scribe)

Discussion

Internet2 Update [Jeff Boote]

An update on the software releases available for Internet2 performance and measurement tools and a preview of the roadmap.

Release candidate for perfSONAR-PS 3.1 RC1 is available at http://software.internet2.edu.

REDDnet's Use of Performance Tools

REDDnet has disk depots placed throughout the U.S. that cache data, and it has been experiencing challenges moving data between the depots.

Ezra Kissel, University of Delaware, presented on REDDnet Performance Monitoring

View Ezra Kissel's REDDnet presentation

Highlights:

  • REDDnet provides "Working storage" to help manage the logistics of sharing, moving and staging large datasets across wide areas and distributed collaborations.
  • Participating Institutions: Vanderbilt, Tennessee, Stephen F. Austin, NC State, Nevoa Networks, Delaware
  • Host Sites: Caltech, Florida, Michigan, ORNL, SDSC, TACC, UC Santa Barbara (Stephen F. Austin, Tennessee, Vanderbilt)
  • Tools used for performance monitoring include:
    • OWAMP (3.1)
    • BWCTL (1.3)
    • NDT client (3.5)
    • perfSONAR-PS perfSONAR-BUOY (regular testing framework for bwctl)
  • Performance troubleshooting approach:
    • Ensure TCP is tuned on all hosts
    • Pick a set of hosts to investigate from the "worst offenders"
    • Divide and conquer approach (Test from depot to POP, Break up path into smaller segments, Narrow down the source of the problem by seeing which sub-segments have the same symptoms)
    • Examples:
      • REDDnet Umich and CHIC I2 POP
      • REDDnet Vanderbilt to Atlanta I2 POP
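The divide-and-conquer step above can be sketched in a few lines: break the end-to-end path into sub-segments, measure each one (e.g. with repeated owping/bwctl runs), and flag the hops that reproduce the end-to-end symptom. The segment names, loss figures, and the helper function here are illustrative placeholders, not part of REDDnet's actual tooling.

```python
# Sketch of the divide-and-conquer approach: flag the sub-segments
# whose measured packet loss shares the end-to-end symptom.
# Hop names and loss percentages below are hypothetical examples.

def find_suspect_segments(segment_loss, loss_threshold=0.1):
    """Return the sub-segments whose loss (percent) exceeds the threshold.

    segment_loss: dict mapping (src, dst) hops to measured loss percent,
    e.g. gathered from repeated owping/bwctl runs along each sub-path.
    """
    return sorted(
        hop for hop, loss in segment_loss.items() if loss > loss_threshold
    )

# Hypothetical measurements for a depot-to-POP path broken into hops:
measurements = {
    ("depot", "campus-edge"): 0.0,
    ("campus-edge", "regional-pop"): 1.7,  # same symptom as end-to-end
    ("regional-pop", "i2-pop"): 0.02,
}

suspects = find_suspect_segments(measurements)
print(suspects)  # the hop(s) reproducing the end-to-end loss
```

Testing each sub-segment independently narrows the problem to the hop(s) that show the same loss symptom as the full path, which is exactly the depot-to-POP narrowing described in the examples above.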

Internet2 and Cisco Telepresence

A behind-the-scenes look at the planning and setup for the Cisco Telepresence Demo shown at Wednesday's General Session.

Aaron Brown, of Internet2, presented on "Regular Latency Monitoring Or: How I Learned to Start Worrying and Hate the Jitter"

View Aaron Brown's slides on Latency Monitoring

Highlights

  • Throughput testing is being done by LHC, ESnet, and others
  • But what about latency testing for latency-sensitive applications such as the Cisco Telepresence demo?
  • Cisco Telepresence Limits
    • 10 ms jitter
    • 160 ms delay
    • 0.05% loss
  • Polycom Limits
    • 30-35 ms jitter
    • 300 ms delay
    • <1% loss
  • Goals:
    • Measure delay/jitter/loss between points
    • Be able to fix any issues that come up
  • Approach: Deployed measurement machines at the endpoints and a number of hosts in between and set up regular latency tests between the machines
    • Benefits: Shows end-to-end problems, and allows a "Divide-and-Conquer" approach to narrow down the source of the problem
    • Tools: OWAMP (Latency Tester) and perfSONAR-BUOY (Test Scheduling Framework)
    • Analysis software was written or modified to make it easy to view and understand the data.
  • Results: Several potential performance issues, in both the network and the monitoring systems, were identified, and all were solved and verified through diagnostics and monitoring
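The application limits listed above amount to a simple pass/fail check on each measured path. A minimal sketch, using the Telepresence and Polycom figures from the slides; the function name and the sample measurements are hypothetical, not part of the monitoring software described:

```python
# Check measured latency statistics against application limits.
# Limit figures are taken from the slides above; the function and
# the example measurements are illustrative only.

LIMITS = {
    # application: (max jitter ms, max one-way delay ms, max loss percent)
    "cisco_telepresence": (10, 160, 0.05),
    "polycom": (35, 300, 1.0),
}

def meets_limits(app, jitter_ms, delay_ms, loss_pct):
    """Return True if a measured path satisfies the application's limits."""
    max_jitter, max_delay, max_loss = LIMITS[app]
    return jitter_ms <= max_jitter and delay_ms <= max_delay and loss_pct < max_loss

# A path with 4 ms jitter, 90 ms delay, 0.01% loss is fine for both:
print(meets_limits("cisco_telepresence", 4, 90, 0.01))   # True
print(meets_limits("polycom", 4, 90, 0.01))              # True
# 200 ms delay exceeds the Telepresence limit but not Polycom's:
print(meets_limits("cisco_telepresence", 4, 200, 0.01))  # False
```

Running this check against the regular OWAMP test results is one way to turn the raw delay/jitter/loss data into an alert before the demo itself degrades.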

Interoperability Testing with Europe - Update

THIS SECTION IS STILL BEING EDITED

Tom Throckmorton, of MCNC, presented an update on the Multi-Vendor 10 Gigabit Testing that Matt Zekauskas and Tom discussed at the Performance WG meeting at the Feb. 2009 Joint Techs in College Station. The goal is to determine how well 1G and higher-speed circuits work between differing vendor hardware over long distances.

Tom reported that interoperability testing on the connection between Internet2 and Dante is ongoing. A prior set of tests at 1 Gbps was reported on in February 2009. There had been limitations and problems with interoperability.

Since the Feb. 2009 update, Dante, Internet2, and CNC ?? have done a product evaluation of interoperability test equipment from a company (?) out of Denmark. This has been an opportunity to jointly evaluate the test equipment before turning the circuit over to production.
This system is a tenth of the cost of other interoperability testers (SPGA ?? systems); its very high performance led to low cost.

Dante had received testers in Jan. 2009. Internet2 got the testers toward the end of Feb. 2009 and had an aggressive timeframe for completing testing. Having equipment on hand allowed us to complete the testing in a timely fashion and also to complete tests with a higher degree of confidence than with using ?BCs for commodity systems. There were issues around driving scars?? at a sufficient rate with this test equipment in place. We drove the circuit almost to full capacity and ran a suite of tests at various packet sizes. We were able to iterate through the same set of tests independently and get the same results consistently, leading to high confidence in the numbers.
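The consistency criterion described above (repeating the same test and accepting the numbers only when the runs agree) can be sketched in a few lines. The tolerance value and the throughput figures below are hypothetical, not the actual test data:

```python
# Sketch of the repeated-run consistency check: accept a measurement
# only if all runs fall within a relative tolerance of their mean.
# The throughput figures below are illustrative examples.

def consistent(runs, rel_tolerance=0.02):
    """True if every run is within rel_tolerance of the mean of all runs."""
    mean = sum(runs) / len(runs)
    return all(abs(r - mean) <= rel_tolerance * mean for r in runs)

# Three repeated runs at one packet size, in Gbit/s:
print(consistent([9.41, 9.43, 9.40]))  # True: results repeat closely
print(consistent([9.4, 9.4, 7.1]))     # False: one outlier run
```

Iterating this check across the suite of packet sizes is what gives the "same results consistently" confidence the update describes.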

One issue emerged as a result of this testing. In one direction, the frame size got below 64 bytes. After a number of back-to-back tests, and repeated tests to be sure we got the numbers accurately, we surmised a limitation on ? on one side of the connection. Based on ... it is not a problem interoperability-wise.

Overall we got excellent results from this gear, more consistent than out of PCs.
Another positive was the interaction with the vendor: they were eager to please and responsive to the issues we raised, and made corrections for us based on the feedback we had given them. The underlying circuit was turned over to production in mid-April, and it has been carved up in different ways to serve connections between Dante and a couple of points in the U.S.

We hope to provide a general interoperability test report at the end of May and a product evaluation around the end of June.

Some ideas surfaced on how we could make improvements in using commodity systems to do testing, and there are more things to look at.

Dante is interested in purchasing these testers for some uses; it is not clear they are attractive otherwise.
If someone wants to learn more, contact Tom Throckmorton or Matt Zekauskas.

Assembling a Performance Enhancement and Response Team (PERT) team in the U.S.

Discussion of establishing a team of network engineers representing each of the RONs that would be available on a rotating basis to troubleshoot complex, multi-domain issues.

This is the link to the Geant PERT team:

http://www.geant2.net/server/show/conWebDoc.1602

Jeff Boote is on the PERT team. The team undertakes operational debugging efforts for multi-domain paths. It's labor-intensive: knowledgeable engineers interpret the data, and an operational component of some sort is needed. Important points:

  • To start a U.S. team, we can't do things exactly the same way that Geant does. They have a more hierarchical organizational structure, and they are more centrally funded.
  • However, we can work together and get a rotating on-call person to help with multi-domain issues.
  • Lesson learned from the PERT team: at first, responsibilities rotated through member countries. That was not a successful model; a system is needed where the group that opens a ticket sticks with it.
  • Possibility of getting NSF or DOE funding.
  • There are not a lot of people with experience analyzing longer-latency paths.
  • Physics organizations already have people on staff dealing with this. ESnet has about 3-4 engineers focusing on performance problems. Smaller scientific groups don't have the experience; we could address needs there.
  • Much of the community doesn't realize that bad performance is not acceptable. We should educate folks on expectations and get them to complain when they don't get good performance.

Q: Is the PERT team only engaged for troubleshooting once a network is already in operation, or are they also involved in design?

A: In Europe, PERT focuses on troubleshooting, but we can design the U.S. team how we want.

Advantage of developing a U.S. team: spread the knowledge of how to do this to more of the community. It's not just about solving the problem; it's also about spreading the knowledge. Another goal is to put the knowledge into a process: make checklists, create a knowledge base, and use techniques to make the effort scale. GEANT is building a knowledge base. Also, we can help end users gather the information necessary to start analyzing problems.

Anyone interested in working on defining this team, please send Jeff or Carla an email.

WG Charter

Carla presented the draft WG Charter and invited comments. Carla would like volunteers to serve with her as co-chair of the working group.
