MW-E2ED BoF
EDDY (End-to-End Diagnostic Discovery) concept and effort status
May 2, 2005
Chas DiFatta (chas@cmu.edu)
Mark Poepping (poepping@cmu.edu)

Outline
• Initiative vision and direction
• Concept
• Architecture
• Campus Department/Group Involvement
• Conclusion
• Next steps

Problem
Banes of the Distributed System Diagnostician
• No access to the diagnostic data
• Discovering valuable information in a sea of data
• Correlating different diagnostic data types
• Providing evidence for non-repudiation of a diagnosis
• Finding time to create tools to transfer diagnostic knowledge to less skilled organizations and/or individuals

State of Practice
• Network, application, system and security events are kept separate and are therefore extremely difficult to correlate
• Data represents only what has faulted
• No end-to-end accountability of transactions, e.g. email, web, VoIP, intrusion

Vision
Create an activity audit ledger/application that...
• Provides a means to study the behavior of faults and anomalies
• Explores the impact of an Internet with assured electronic communications and its influence on infrastructure, security, reliability, privacy and trust
• Assures the ‘default’ electronic interaction by creating a means of non-repudiation between two or more parties

Initial Direction
Enabling mechanism for investigating:
• Machine to machine interaction
• Taxonomic risk analysis of security anomalies
• Automated diagnostic practices, not just what has faulted but how the fault occurred
• Perceived anomalies versus actual faults
• Embedded system events
• High volume event driven systems
• Rapid tool development platform for diagnostic applications

Effort Timeline
Activity                 Status
Startup                  Done
Discovery                Done
Preliminary Design       Done
Pilot Design             Done
Pilot Implementation     Done
Pilot Verification       Done
Findings and Redesign    Done
EDDY Implementation      Active
EDDY alpha/beta          Active
EDDY Distribution        -

Months: Oct – Dec 03, Jan – Jun 04, Jul – Dec 04, Jan – Jun 05, Jul – Dec 05

Major Milestones
• Advisory group formed
• CER conceived
• High level architecture finalized
• Pilot delivered
• EDDY backplane operational
• EDDY release

EDDY: End-to-end Diagnostic DiscoverY
Goals of the effort:
• Enable the collection of a wide array of network, system, application, security, and environmental events
• Provide a feature rich event dissemination infrastructure that can scale
• Introduce an API that enables diagnostic tool developers to build the next generation of tools or retrofit existing ones

Separate Event Domains
[Figure: distributed system events (Environmental, Application, Security, System, Network) feed separate diagnostic tools.]
[Figure: example events per domain. Security: Port Scan, Denial of Service Attack. Network: Network Transaction (Sendmail), Network Transaction (port 8080), Network Transaction (to router). Application/System: Sendmail Process Dies, Sendmail Process Restarted.]

Combined Event Domains
[Figure: Network, Application/System and Security events shown together.]

EDDY Event Evolution
[Figure, built up in stages. Event sources: Security, Network, Application, System, Environmental. Backplane functions: Normalization, Routing, Filtering, Anonymization, Aggregation, Transformation, Archiving, Database. Consumers: Application In-band, Application Out-of-band, API, Tools, NMS, AMS, Alert.]
[Figure: EDDY Event Evolution, later builds. Visualization and Analysis are added to the labels of the earlier builds.]

Combined Event Domains
[Figure: a single timeline with ticks 0 to 7 carrying events from all domains. Events shown: Port Scan; Network Transaction (Sendmail); Sendmail Process Dies; Network Transaction (port 8080); Sendmail Process Restarted; Network Transaction (to router); Denial of Service Attack.]
ALERT: sendmail-worm-ID:2353

Enterprise Implementation
[Figure: campus network with Wireless, Internet, Internet2 Abilene, Remote Campuses, Departmental LANs and the Core Switch Fabric. Later builds add Edge Hosts, Backplane Hosts, an EDDY Cluster and Diagnostic Applications.]

EDDY Cluster Functionality
[Figure: Security, Network, System and Environmental events enter through Normalization on Edge Nodes; Backplane Nodes carry Transformation, Analysis, Archive, Database and Application functions. Successive builds highlight Bandwidth Abuse, Security Forensics, Network Diagnostics and Application Diagnostics.]

The Scale Issue
• Scalable store and forward
– Project only what is needed to the next level
– Select back to get data that you don’t have
– Only cook data that you need
• Data lifecycle

[Figure, built up in stages: events arrive at 5k/sec and pass from an Archiver to two successive Application or Database stages; the stages are labelled 7 days, 30 days and 365 days.]
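The store-and-forward bullets above (project forward, select back, age data out) can be illustrated with a small sketch. Everything here is hypothetical: the `Tier` class and the projection sets are not an EDDY API, and pairing the 7, 30 and 365 day labels with an archiver and two downstream stores is only one reading of the figure. The field names `eventID` and `occurredStamp` are borrowed from the CER base information in these slides.

```python
from dataclasses import dataclass, field
from typing import Optional

DAY = 86_400  # seconds in a day


@dataclass
class Tier:
    """One store-and-forward level (hypothetical; not an EDDY API)."""
    name: str
    retention_s: int                          # how long events are kept here
    forward_fields: frozenset = frozenset()   # projection sent to the next level
    next_tier: Optional["Tier"] = None
    store: dict = field(default_factory=dict)

    def ingest(self, event: dict) -> None:
        """Store the event, then forward only the projected fields."""
        self.store[event["eventID"]] = event
        if self.next_tier is not None:
            projected = {k: v for k, v in event.items()
                         if k in self.forward_fields}
            self.next_tier.ingest(projected)

    def select_back(self, event_id: str) -> Optional[dict]:
        """Return the event as held at this tier, or None if it is gone."""
        return self.store.get(event_id)

    def scour(self, now: float) -> int:
        """Drop events older than the retention window; return the count."""
        expired = [eid for eid, ev in self.store.items()
                   if now - ev["occurredStamp"] > self.retention_s]
        for eid in expired:
            del self.store[eid]
        return len(expired)


# Assumed three-level chain: archiver (7 days) -> store (30 days) -> store (365 days).
yearly = Tier("database-365d", 365 * DAY)
monthly = Tier("database-30d", 30 * DAY,
               frozenset({"eventID", "occurredStamp", "eventType"}), yearly)
archiver = Tier("archiver-7d", 7 * DAY,
                frozenset({"eventID", "occurredStamp", "eventType",
                           "eventHostname"}), monthly)

archiver.ingest({"eventID": "e1", "occurredStamp": 0.0,
                 "eventType": "network", "eventHostname": "edge-1",
                 "payload": "raw flow record"})
```

In this sketch the raw payload stays at the archiver, so a consumer of the long-lived store has to select back to the archiver within its retention window to see it. That is the trade-off the slide describes: each level carries only what the next one needs.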
Diagnostic Data Lifecycle
[Figure: lifecycle diagram with the labels Collection, Anonymize/Filter, Archive, Summarize, Access, Scour and Policy.]

Solution
• Import a wide variety of event data easily
• Disseminate the events to elements in a distributed backplane that provides core functionality for diagnostics
• Provide access to the diagnostic data and a platform for rapid tool development

Diagnostic Backplane
• Accommodates a wide variety of event classes easily
• Enables almost any device to produce events
• Supports extensible classification models
• Event routing via simple select/project functionality

Diagnostic Backplane Cont.
• Edge hosts:
– Servers, clients, and embedded devices
– Indirectly collecting flow and security data from switches, routers and security devices
• Backplane hosts:
– Forward, manipulate and store event flows from edge hosts
– Provide an API to query the backplane for event information
– Control and manage the backplane itself

Backplane Transport Channels
[Figure: Event, Query and Control channels connect the agents. Base agents: Normalizing, Transformation, Anonymization, Application, Analysis, Display. Storage agents: Archive, DB, Directory. Control agents: Agent-manager, Backplane-manager. Display agents serve consoles.]

Basic Agents
[Figure: events arrive over Transport and Authentication, pass through Filtering (XPath) and Functionality (XSLT or Java) with an optional Copy, and leave over Authentication and Transport.]

Basic Agents + Query
[Figure: the basic agent stack, with Functionality in Java, plus a query path: Transport (HTTP), Authentication, Query (SOAP).]

Storage Agents
[Figure: the same stack and query path, with Functionality in XSLT or Java.]

Agent Control
[Figure: the same stack, with both queries and control arriving over Transport (HTTP), Authentication, Query (SOAP).]
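The agent stack described above (XPath filtering, a functionality step, an optional copy, then forwarding) can be sketched as follows. This is an illustration under stated assumptions, not the EDDY implementation: the slides name XPath, XSLT and Java, while this sketch uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset, and a plain function in place of XSLT. The `BasicAgent` class and the `cer` element layout are hypothetical; transport and authentication are left out.

```python
import copy
import hashlib
import xml.etree.ElementTree as ET
from typing import Callable, List


class BasicAgent:
    """Filter, transform and forward XML events (hypothetical sketch)."""

    def __init__(self, xpath_filter: str,
                 function: Callable[[ET.Element], ET.Element],
                 keep_copy: bool = False) -> None:
        self.xpath_filter = xpath_filter
        self.function = function
        self.keep_copy = keep_copy
        self.copies: List[str] = []            # optional local copy of output
        self.downstream: List["BasicAgent"] = []
        self.delivered: List[ET.Element] = []  # events that passed this agent

    def receive(self, event: ET.Element) -> None:
        # Filtering: wrap the event so the path is evaluated against it.
        wrapper = ET.Element("events")
        wrapper.append(event)
        if wrapper.find(self.xpath_filter) is None:
            return  # no match, so the event is dropped here
        # Functionality: apply this agent's transformation.
        result = self.function(event)
        if self.keep_copy:
            self.copies.append(ET.tostring(result, encoding="unicode"))
        self.delivered.append(result)
        # Transport: hand the result to every downstream agent.
        for agent in self.downstream:
            agent.receive(result)


def anonymize_address(event: ET.Element) -> ET.Element:
    """Replace the host address with a short hash (illustrative policy)."""
    out = copy.deepcopy(event)
    addr = out.find("eventHostAddress")
    if addr is not None and addr.text:
        addr.text = hashlib.sha256(addr.text.encode()).hexdigest()[:12]
    return out


def make_event(event_type: str, address: str) -> ET.Element:
    """Build a tiny stand-in for a cooked event record (assumed layout)."""
    cer = ET.Element("cer")
    ET.SubElement(cer, "eventType").text = event_type
    ET.SubElement(cer, "eventHostAddress").text = address
    return cer


anonymizer = BasicAgent("./cer[eventType='security']", anonymize_address,
                        keep_copy=True)
console = BasicAgent("./cer", lambda e: e)
anonymizer.downstream.append(console)

anonymizer.receive(make_event("security", "192.0.2.10"))
anonymizer.receive(make_event("network", "192.0.2.11"))
```

Only the security event reaches the console, and it arrives with a hashed address. Chaining agents this way is one reading of how anonymization, transformation and display agents could be composed from the same basic stack.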
Base Agent Types
• Normalization: rapidly put external events into the backplane as raw CERs. Small footprint; can be ported to embedded systems.
• Transformation: convert raw CERs into cooked CERs (payload parsed into XML) and/or manipulate CERs
• Anonymization: anonymize specific fields of the CER
• Application: take out-of-band action
• Analysis: inject analysis CERs into the backplane based on observed events
• Display: act as a filter/preprocessor for display consoles

Storage Agent Types
• Archive: repository of events indexed on the base correlation structure of their CER
• Database: repository of events indexed on a specific schema (can be very granular)
• Directory: provide an event location service
– Where do I find this type of event?
– What is the granularity of it?

Control Agent Types
• Agent-manager: operate and manage base and storage agents on each host
• Backplane-manager: operate and manage the host-configuration agents to build and operate a specific backplane topology

Display Agent Architecture
[Figure: a Display Agent sits between the Event Transport Channel and a Query/Response Channel that serves several Display Consoles.]

Display Agent - Forensic
[Screenshot: forensic display console. Input sources: RT-Agent, DB/Archive-Agent, File. Authentication: user and password. Filters and display fields: EventInfoType, EventCorrelationDescriptor, cooked fields. Graph fields. Output source: File. Columns include StopTime, Type and Alert. The sample listing shows network/info TCP events from 24.6.125.34 port 42964 to 209.195.187.23, destination ports 22 through 44 in sequence, between 14 Mar 05 21:51:00 and 21:51:22, with several entries for port 25 and per-row event counts.]

Display Agent - Specialized

Common Agent Capabilities
• Every agent can forward, combine, split and filter event flows to other agents within the diagnostic backplane
• All transport channels (event, query, control) between agents are encrypted
• Mutual authentication based on certificates
• Initial design intended to scale to at least 5000 events/sec
• Can easily morph into new agent types

Common Event Record (CER)
• Accommodates a wide variety of event classes easily (network, system, application, security)
• Enables high correlation between events through time, location, type and/or extensible tags
• Can be lightweight to conserve space but can be transformed into a highly descriptive structure
• Highly flexible structure that can morph to accommodate new correlation schemes

Event Progression
[Figure: External Event Record -> Normalizing Agent -> CER: Raw (external event record) -> Transformation Agent -> CER: Cooked (parsed payload) -> Analysis Agent -> CER: Analyzed (analysis payload).]

Common Event Record Type
Raw – no parsing of event payload
[Figure: Event Descriptor followed by Raw Event Data]
Base Information
• Version – version of CER
• typeID – event type (NetFlow, /var/log/messages, MS security event, etc.)
• eventID – identifier unique across the backplane
• occurredStamp – time of the event
• eventHostname – where the event occurred
• eventHostAddress – address where the event occurred
• eventType – network, system, security, application or environmental
• normalizerHostname – host where the normalization agent was run
• normalizerAddress – address of the host where the normalization was run
• warningLevelType – emergency, alert, critical, error, warning, notice, informational, debug
• correlationDescriptor – highly flexible structure to aid correlation (one for every major event type)
• userTag – tag:value pairs defined at the setup of the backplane to give unique meaning to events

Common Event Record Type
Cooked – raw event payload is parsed into XML
[Figure: Event Descriptor followed by Parsed Event Data]
XML Structure
• Schema of raw event data
• Can be highly granular
• Defined by transformation agent

Common Event Record Type
Analyzed – high order diagnostic event
[Figure: Event Descriptor followed by Analyzed Event Data]
Diagnosis of observed events
• DiagnosisID – specific name of diagnosis
• Hypothesis – what it thought happened
• EventPointers – pointers to all the events that contributed to the hypothesis

Common Event Record Examples
• Raw
– Network: Cisco NetFlow version 9 in payload
– Security: Snort or MS security event
– Application: /var/log/smtpd or MS application event
– System: /var/log/dmesg or MS system event
– Environmental: temperature
• Cooked
– XML representation of raw events
– Specific fields of the XML representation of raw events
• Analyzed
– Diagnosis of DoS attack based on raw and/or cooked events

Rapid Enabling of Diagnostic Applications
• Enable the forensic process
• Feed NMSs to enhance their functionality
• New visualizations to represent real-time and historical events
• Feed research with an enormous set of data

EDDY Enabled Devices
• Workstations and servers
• Network devices (routers and switches)
• Security devices (firewalls and IDS)
• Embedded EDDY
– Environmental devices (premises control/monitoring)
– Transportation (automotive, etc.)
– Robotics

What EDDY is
• Consolidates events into a simple framework to enable correlation
• Event dissemination environment
• Diagnostic tool platform that leverages and enhances existing tools while enabling the next generation

What EDDY is not
• A system/network/application/security management platform
• The analysis engine; it enables the analysis to happen with domain expertise

Unleashing the Genie
Exposing an unprecedented wealth of diagnostic information for
• Enabling new and enhancing existing diagnostic and security applications
• Visualizing events
• Security forensics
• Researchers through the establishment of a diagnostic observatory
• Modeling new policy configurations to assess their impact on daily operations
• Analysis, validation and troubleshooting of distributed composite applications

Next Generation
• Network, application, system and security events combined
• Data represents discrete events that make up successful or failed service delivery
• True end-to-end accountability of transactions
• Auditing the behavior of an electronic transaction to establish an event profile

Seeding the Environment
• EDDY as an enabling technology provides:
– Event dissemination and correlation infrastructure
  • Gives researchers access to (anonymized) event data in the security, application and network domains
– A development platform for diagnostic research in the areas of
  • Applications and Middleware
  • Networking
  • Security

Enabling Campus Members
• Funding for extended research
– A platform to discover new diagnostic application methods
– Exposing a “petri-dish” for researchers to gain access to security, system, application, environmental and network events
• Enterprise diagnostics
– Within CMU Computer Services
– Other federated applications

Dragnet (use case for scale)
• Real-time security analysis using network flow records across campus core
[Figure, built up in stages: I2, CMU Wireless, Commodity and CMU Core feed a Flow Engine and Dragnet. Later builds add Normalizer, Transformation, Anonymization, Application and Archive agents at 5k-8k events/sec, and a second site, Campus X.]

Intelligent Workplace – School of Architecture (use case for CER)
• Capturing events from all aspects of a physical environment
[Figure, built up in stages: Heating, Motion and Lighting Sensors each feed a Collection Engine and Workspace Analysis. The later build adds a Normalizer per Collection Engine, plus Transformation and Application agents.]

Year Two Goals
Mature the Common Event Record
• Solicit input on completeness of version 1.0
• Must be able to morph to new CER formats while providing backward compatibility
• Address scaling issues with respect to the record size and consider other data representation formats
• Include second order events such as measurement and performance
• Incorporate a mechanism for more granular correlation of events

Year Two Goals Cont.
Scale the diagnostic backplane
• Adopt a real Authz/Authn methodology
– We use certificates at this time, but management is an issue
– Shibboleth non-web version ready
• Provide event anonymization
– Specific agent devoted to policy based functionality
• Transport method evolution
– Removed the dependency on SCP
– Add real-time flow capability
• Migration from Python or offload compute intensive areas
– Now Java
• Management and configuration
– Centralized configuration
– Keep the configuration work on the clients hands-free

Year Two Goals Cont.
Add Applications...
Domain specific
• Work with middleware application, network, system and security groups to build focused apps based on what we’ve learned from the scenario writing process
• Discuss performance/measurement with external groups
Mature and establish a base application with a GUI for forensics and reporting
• Reporting – feed applications like Cricket and Crystal Reports
• Forensics – need a client GUI that is ported to Linux, Mac and Windows

Year Two Goals Cont.
Add more applications...
Build simple but high value tools that extract information from the archive and not the DB
• Summaries of events
• Top event hosts
• For retrieving data that is not sent to the DB

Version 1 of the Event API
• Acquiring a real-time event flow from any node
• Simple data locator service (where can I find this data?)
• Querying data repositories directly, while staying conscious of future capabilities where agents may mine data over multiple repositories

Status
• Development
– Core developers driving to core release 5/05
• Campus Adopters – initial use cases
– CS/CyLab – security research, real time flow events from commodity Internet
  • Dragnet – network flow event security analysis
– Architecture – environmental monitoring and control
  • Environmental event data from many ultra small devices and embedded systems
– Computing Services ISAM/Security Office
  • Consolidation of application log files, fault analysis
  • Conduit for reporting and high level event consumption

Status Cont.
• Outreach
– Involving others in the development process
– Expand to other use cases external to CMU
• Funding
– Sponsored by the National Science Foundation under the NSF Middleware Initiative - Grant No. ANI-0330626
– Expanding the effort by increasing funding to
  • Mature base technology
  • Spawn effort for diagnostic application development
  • Enable multi-subsystem correlation
  • Experiment with extending research data flow analysis into multicampus; federating/automating some diagnostic data sharing
– Soliciting development partners in both industry and government

Enabling other Efforts and Tools
Diagnostic assistance is provided through the system in several ways:
• Existing diagnostic tools have been or can be fitted with EDDY normalizers and translators to join into the backplane and make their data available to other applications or to specific help desk/service personnel.
• Applications can be fitted with similar EDDY normalizers to inject their error logs and diagnostic information into the backplane.
• Existing diagnostic tools can be enriched through access to additional diagnostic data by tapping into other sources of information within the backplane.
• New diagnostic consoles can be developed and assembled from components that access and analyze the rich resources on the backplane.
• Applications can utilize diagnostic data at lower levels of the protocol stack and present better information to users about problems in access or performance.
• The diagnostic capabilities can be positioned to provide audit mechanisms as well.

This material is based upon work supported by the National Science Foundation under Grant No. 0330626, Carnegie Mellon University, and Internet2. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Discussion

Special thanks to the following for making the effort possible...
• Jim Gargani (CMU) – lead developer/design – core
• Kevin Miller (Duke) – design – core
• Tom Neuendorffer (CMU) – design/developer – visualization
• Walter Wong (CMU) – developer/design – core