Keynote 2: presented at PRDC 2013

by Kenny Gross

Summary: Comprehensive Prognostics for Enhanced Dependability of Integrated Hardware/Software Enterprise Servers and Clusters
Abstract: Business-critical enterprise computing servers are now being integrated with advanced telemetry agentry to collect and archive hundreds of system performance, throughput, and load metrics (called "soft telemetry"), as well as digitized time-series signatures from distributed internal physical transducers (called "physics telemetry"). The merged, resampled, and phase-synchronized soft and physics telemetry are monitored in real time by pattern recognition "Detectors" to enhance the reliability, availability, serviceability, and energy efficiency of servers and clusters. Electronic Prognostics (EP) is a comprehensive methodology for proactively detecting and avoiding failures to achieve condition-based maintenance (CBM), in which components, including N+1 redundant components, are proactively swapped based on their condition and operational history, in contrast to conventional reactive "fix-on-failure" maintenance. For the last decade, Oracle has been developing EP innovations and productizing prognostic agentry in enterprise servers. In the early days of pattern-recognition-triggered Software Aging and Rejuvenation (SAR), completely separate tools and techniques were developed in separate business units to address h/w-centric EP and s/w-centric SAR. Now, a unified monitoring methodology has been developed that integrates the previously separate EP and SAR approaches to achieve proactive fault monitoring of integrated h/w-s/w systems and database appliances. The key enabler for integrated EP-SAR functionality is Oracle's continuous system telemetry harness (CSTH). CSTH integrates physics telemetry (distributed temperatures, voltages, currents, fan RPMs, vibration levels), soft telemetry, and various quality-of-service (QOS) metrics that now include vibration-driven I/O throughput latencies in systems with spinning disk drives. CSTH signals are continuously archived to a circular file (i.e., the "Black Box Recorder") and are also processed in real time using advanced pattern recognition for proactive anomaly detection. The ability to distinguish software aging phenomena from vibration-related performance degradation is a vital functional requirement for integrated EP/SAR prognostic health management systems going forward. Oracle's CSTH, coupled with advanced pattern recognition agentry to achieve EP plus SAR, is helping to increase component reliability margins and meet system dependability goals while reducing, through improved root cause analysis, costly "no trouble found" (NTF) events that have become a significant sparing-logistics issue across the enterprise computing industry.
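The circular "Black Box Recorder" archive mentioned in the abstract can be illustrated with a minimal Python sketch. It is not Oracle's CSTH implementation: the class name `BlackBoxRecorder`, the signal names, and the in-memory buffer (standing in for a circular file on disk) are all assumptions made for illustration. The only point it shows is the circular behaviour, where the newest sample displaces the oldest once capacity is reached.

```python
from collections import deque


class BlackBoxRecorder:
    """Hypothetical fixed-capacity telemetry archive (illustration only).

    Once the buffer is full, each new sample displaces the oldest one,
    which is the defining behaviour of a circular archive.
    """

    def __init__(self, capacity):
        # deque with maxlen drops the oldest entry automatically on append.
        self._samples = deque(maxlen=capacity)

    def record(self, timestamp, signals):
        # Copy the signal dict so later changes by the caller do not
        # alter what has already been archived.
        self._samples.append((timestamp, dict(signals)))

    def snapshot(self):
        # Return the retained samples, oldest first.
        return list(self._samples)


# Assumed example signals: one temperature and one fan speed per sample.
recorder = BlackBoxRecorder(capacity=3)
for t in range(5):
    recorder.record(t, {"cpu_temp_c": 50.0 + t, "fan_rpm": 9000 + 10 * t})

history = recorder.snapshot()
```

With a capacity of 3 and five recorded samples, `history` retains only the three most recent ones. A real recorder would presumably persist to disk and hold many more signals per sample.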