Overview

Health rate calculation offers a single measurement that reflects overall system health, enabling analysts to detect high-level problems at a glance, even if they have little familiarity with the system under test. By reversing health rate calculations, analysts can readily see whether an error is an availability problem, an accuracy problem, or a performance problem. Ultimately, analysts can drill down to the relevant components that generate errors.

Health rates don't replace data measurements; they simply contextualize them. Specific measurement values remain important for in-depth analysis, experienced analysts, service target agreements, and notifications.

Benefits of health rates

Health rates offer analysts a shortcut for evaluating project health and directing development efforts. If the overall health rate of a project is high, there is no need for further analysis. Health rates have values between "0" and "100" ("0" being the worst, "100" being the best) and are independent of projects, amounts of data analyzed, and frequency of individual measurements.

Because high-level health rates are the aggregate of low-level health rates, analysts have the option of reversing rate calculations to determine how specific low-level rates influence overall rates. Such causal analysis can be used to drill down to the specific low-level data that is negatively affecting overall rates, thereby pinpointing the system components that are having a negative impact on system health. Meanwhile, measurements that fall within acceptable ranges can be disregarded.
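
The drill-down described above can be sketched in a few lines of Python. This is only an illustration, not Performance Manager's implementation: the nested dictionary layout, the names rate and drill_down, the use of a plain average for aggregation, and the threshold of 90 are all assumptions made for the example.

```python
from statistics import mean


def rate(node):
    """Health rate of a node: its own rate for a leaf, else the mean of its children."""
    if "children" in node:
        return mean(rate(child) for child in node["children"])
    return node["rate"]


def drill_down(node, threshold=90.0, path=()):
    """Return (path, rate) for every low-level rate that drags a branch below threshold."""
    here = path + (node["name"],)
    node_rate = rate(node)
    if node_rate >= threshold:
        # Healthy branch: nothing below it needs to be inspected.
        return []
    if "children" not in node:
        return [(here, node_rate)]
    found = []
    for child in node["children"]:
        found.extend(drill_down(child, threshold, here))
    return found


# Hypothetical project with two monitored transactions.
project = {
    "name": "shop",
    "children": [
        {"name": "login", "children": [
            {"name": "availability", "rate": 100.0},
            {"name": "performance", "rate": 40.0},
        ]},
        {"name": "checkout", "children": [
            {"name": "availability", "rate": 100.0},
            {"name": "performance", "rate": 96.0},
        ]},
    ],
}

unhealthy = drill_down(project)
```

In this example the overall rate is 84, so the sketch descends into the tree, skips the healthy checkout branch, and reports only the login performance rate of 40.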

Because low-level health rates reveal the fitness of actual measurement values, analysts typically don't need to understand the significance of the measurement values themselves. For example, without familiarity with a certain monitored application, it isn't readily apparent whether a business transaction that takes 15 seconds is faster or slower than usual. A health rate of "95%", however, is readily understood to be a healthy rate.

Calculating health rates

Performance Manager is designed to monitor your perception of your system's health. You specify boundaries (100% and 0%) between which performance health is calculated as a logarithmic function. Outside these boundaries, health is considered to be either 100% or 0%. In scenarios where baseline information from which meaningful boundaries could be derived isn't available, Performance Manager can configure boundaries based on historic traffic patterns.

See Calculating Health for more details regarding health-rate calculation.
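
As a rough illustration of the boundary behavior described above, the following Python sketch interpolates logarithmically between the two boundaries and clamps values outside them. The exact function that Performance Manager uses is not given in this section, so the formula, the name performance_health, and the parameter names are assumptions. The sketch also assumes a timer-style measurement where lower values are better and both boundaries are positive.

```python
import math


def performance_health(value, good_bound, bad_bound):
    """Map a measured value to a health rate between 0 and 100.

    Assumes 0 < good_bound < bad_bound (lower measurements are better).
    """
    if value <= good_bound:
        return 100.0  # at or beyond the 100% boundary
    if value >= bad_bound:
        return 0.0  # at or beyond the 0% boundary
    # Assumed logarithmic interpolation between the two boundaries.
    span = math.log(bad_bound) - math.log(good_bound)
    return 100.0 * (math.log(bad_bound) - math.log(value)) / span
```

With boundaries of 2 and 20 seconds, a response time of 1 second rates 100, a response time of 30 seconds rates 0, and values in between fall on a logarithmic curve.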

Health dimensions

Overall health values are influenced by three health dimensions:
  • Availability
  • Accuracy
  • Performance

Each of these health dimensions, and the overall health rate itself, provides values in the range of "0" to "100"; the higher the value, the better the health of the system.

Only availability and accuracy health values are expressed as percentages. Performance and overall health values are expressed as absolute rates ("10," "20," "30," etc.).

When you review health dimension values offered by Performance Manager, you are normally reviewing values that were calculated based on multiple monitoring transactions. When evaluating health dimension values, keep in mind that the value for a set of monitoring transactions equals the average of the corresponding health dimension values that exist for the individual monitoring transactions.
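
The averaging rule can be sketched as follows. The function name dimension_health and the use of None to represent a transaction that has no value for the dimension are assumptions made for this example.

```python
from statistics import mean


def dimension_health(values):
    """Average the health values that exist; None marks a missing value."""
    existing = [v for v in values if v is not None]
    if not existing:
        return None  # no transaction produced a value for this dimension
    return mean(existing)


# Three monitoring transactions, one of which has no value for the dimension.
combined = dimension_health([100.0, 80.0, None])
```

Only the two existing values are averaged here, so the missing value does not lower the result.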

Availability

Availability is the most basic health dimension. It measures the percentage of time during which a monitored system is available to a subset of selected data. The availability rate provides information regarding whether a monitored system is running and whether it provides basic responsiveness to client requests.

A system is judged "available" when a monitoring transaction testing a system completes without detecting any errors. Most errors indicate that a system is not available. Exceptions include those errors that indicate that a system is available, but not working correctly.

When several monitors supervise a system and some of those monitors detect that the system is not available while others detect that it is available, the availability of the system is rated between 0% and 100%.
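
A minimal sketch of such a mixed result, assuming that every monitor carries equal weight (this section does not state how monitors are weighted) and using the hypothetical function name availability_rate:

```python
def availability_rate(monitor_results):
    """Percentage of monitors that judged the system available.

    monitor_results is a non-empty list of booleans, one per monitor.
    """
    available = sum(1 for result in monitor_results if result)
    return 100.0 * available / len(monitor_results)


# Three of four monitors completed without detecting errors.
mixed = availability_rate([True, True, False, True])
```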

Accuracy

The Accuracy rating for monitored systems is calculated only after systems are judged "available."

Accuracy rates are calculated with the assumption that monitored systems are working as designed and that the information transmitted to clients is correct. Useful functions that can be evaluated to determine accuracy include link checking, content validation, title validation and response data verification. If a monitoring script contains customized functions that are used to ascertain system accuracy, those functions will be factored into the accuracy rate as well.

Accuracy rating goes far beyond the simple checking of availability. A server may be available even when the application it hosts isn't responding. Likewise, dynamic pages may be corrupt, database queries may produce empty result sets, and warehouses may run short of stocked merchandise. The simple checking of availability won't alert one to such failures. Only complex transactions that compare results to benchmarks will detect these problems.

Performance

Once a monitored system is judged available and accurate, the performance health dimension of a system is calculated. Performance is not as objective a measure as availability and accuracy; what qualifies as "good" and "bad" performance is subjective, varying from one system to the next.
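
The order in which the three dimensions are evaluated for a single monitoring transaction can be sketched as follows. The Health class, the evaluate function, and the choice of rating a single transaction as either 0 or 100 for availability and accuracy are assumptions made for illustration; None stands for a dimension that was not calculated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Health:
    availability: float
    accuracy: Optional[float] = None      # None: not calculated
    performance: Optional[float] = None   # None: not calculated


def evaluate(available: bool, accurate: bool, performance_rate: float) -> Health:
    """Rate one transaction, calculating each dimension only when the previous one passed."""
    if not available:
        # Accuracy and performance are not rated for an unavailable system.
        return Health(availability=0.0)
    if not accurate:
        # Performance is not rated when the results are wrong.
        return Health(availability=100.0, accuracy=0.0)
    return Health(availability=100.0, accuracy=100.0, performance=performance_rate)
```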

The most common measures of performance are timers. Users who are quite familiar with the behavior of their systems can have performance rates calculated against established timer results that were determined through baseline testing.

For users who have not run baseline tests with Silk Performer, and who consequently do not know what the boundaries for good and bad performance are, Performance Manager offers a means of calculating performance ratings based on historical response time values. Performance Manager calculates dynamic bounds for "good" and "bad" performance from historical data, then compares the actual performance values of monitor executions against those historical values to determine system performance.