Reasoning about the performance of a system is a rather tedious and complex task. A microscopic approach suggests constructing a model that factors in various components, such as the load and utilization of resources like CPU and I/O when certain code paths are executed. This gets complicated very quickly once we also have to reason about visit ratios (e.g. two times I/O due to database access, ...). Of course, this needs to be isolated per unit of work, e.g. a single request to an HTTP endpoint that triggers a specific use case. This level of detail, however, is only necessary if we want a deep dive into the individual software components that comprise our system.
Looking at the system's performance as a whole oftentimes suffices to achieve a general understanding of its behavior - be it in the steady state or in a situation where system components operate under stress. Therefore, we propose a pragmatic approach: looking at the system from a macroscopic perspective. Note that this does not prevent us from evaluating the performance of individual endpoints of an HTTP-enabled web application. Macroscopic means that we observe the system by its external characteristics, using response times or service times, rather than taking a fine-grained peek at the internals of each software component.
Measuring these parameters over a period of time gives us a good indication of the overall performance of the system. Keeping a time series of these data points for longer periods even allows us to compare the performance of newer releases with that of older ones. This enables us to identify performance degradations between software releases. It also allows us to reason about the impact of future development: new features that drive business opportunity attract more customers, which in turn puts more load on the system (sustainability).
Data Points
From the perspective of the classifier, we are mostly concerned with either service times or response times.
Type of Data Point | Definition | Remarks | How to obtain |
---|---|---|---|
Service Time | The service time is the time that a server or service takes to process a request. It does not include the time that the request resided in any kind of queue. | Use service times to reason about the performance of the server (and just the server). | Data points are taken within the server process (e.g. Micrometer) and exposed to some monitoring tool (e.g. Prometheus); see the sketch below the table. |
Response Time | The response time is measured end-to-end, e.g. by a load testing tool. As such, it comprises not only the service time, but also the time spent in the **load driver** that acts as the client as well as transmission times between client and server. | Measuring response times in addition to service times can give insights into potential bottlenecks at the level of the infrastructure. | Load testing tools like Gatling or Locust collect response times, as they - naturally - work end-to-end. |
Sojourn Time | The sojourn time comprises the time the request was queued plus its service time. | We will not consider sojourn times in the proposed classification. | Not required. |
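To illustrate how service times could be collected within the server process, here is a minimal sketch using Micrometer with its Prometheus registry. The metric name and the instrumented handler are made up for this example and would differ in a real service.

```java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class ServiceTimeMetrics {

    // Registry that renders all collected meters in the Prometheus text format.
    private final PrometheusMeterRegistry registry =
            new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

    // Timer for the service time of a single endpoint. The published percentiles
    // correspond to the output parameters used later for the classification.
    private final Timer serviceTime = Timer.builder("checkout.service.time") // hypothetical metric name
            .publishPercentiles(0.5, 0.75, 0.95, 0.99, 0.999)
            .register(registry);

    public void handleRequest() {
        // record() measures only the time spent in the handler itself, i.e. the
        // service time, not any queueing that happened before dispatch.
        serviceTime.record(this::processRequest);
    }

    private void processRequest() {
        // Placeholder for the actual business logic.
    }

    public String scrape() {
        // Typically exposed via an HTTP endpoint that Prometheus scrapes periodically.
        return registry.scrape();
    }
}
```

A load testing tool such as Gatling or Locust would measure the corresponding response times from the outside, so both perspectives are available during a load test.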
Input Parameters
Performance characteristics that must be met to comply with SLAs can be defined in a variety of ways. Let us assume that we have the following parameters defined over service times as our data point type of choice. The sample input parameters can be applied to response times as well.
Input Parameter | Symbol | Description | Example |
---|---|---|---|
Service Time - Limit | \(S_{limit}\) | This parameter refers to the service time that the server should not exceed under regular operation, i.e. the limit defined by the SLA. | 2000ms |
Service Time - Upper Boundary | \(S_{max}\) | This parameter refers to the service time that the server should at most exhibit at times of high load. Service times close to this boundary are still considered tolerable. Note: This can be expressed in terms of \(S_{limit}\) by multiplying it with some factor. The factor should be in a reasonable range, e.g. \([1.5; 2.0]\) (see the sketch below the table). | 3000ms, or \(1.5 \times S_{limit}\) |
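As a small illustration, these two input parameters could be captured as configuration values, with \(S_{max}\) derived from \(S_{limit}\). The type and field names below are made up for this sketch.

```java
import java.time.Duration;

// Illustrative holder for the SLA input parameters. The factor of 1.5 is the
// example value from the table above and should be chosen per service.
public record SlaParameters(Duration serviceTimeLimit, double upperBoundaryFactor) {

    // S_max expressed in terms of S_limit, e.g. 2000ms * 1.5 = 3000ms.
    public Duration serviceTimeUpperBoundary() {
        long millis = Math.round(serviceTimeLimit.toMillis() * upperBoundaryFactor);
        return Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        SlaParameters sla = new SlaParameters(Duration.ofMillis(2000), 1.5);
        System.out.println("S_limit = " + sla.serviceTimeLimit().toMillis() + "ms");          // 2000ms
        System.out.println("S_max   = " + sla.serviceTimeUpperBoundary().toMillis() + "ms");  // 3000ms
    }
}
```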
Output Parameters
To conduct the classification, we require a couple of output parameters that are derived from the collected data points.
Output Parameter | Description | Relevance | Why? |
---|---|---|---|
Average | Refers to the arithmetic mean of all collected data points. | Little-Medium | Averages are unstable in the presence of outliers; use the \(P_{50}\) instead for performance analysis. It still might be good to have that value at your disposal. If you work with Little's Law in your analysis, averages might come in handy; in this case, though, we'd still use service times and explicitly not sojourn times (time spent in queue + service time). |
\(P_{50}\) | Refers to the 50th percentile, or median. | High | - |
\(P_{75}\) | Refers to the 75th percentile. | Medium-High | Consider it as an alternative to \(P_{50}\) if you want to express a stricter categorization into performance classes. |
\(P_{90}\) | Refers to the 90th percentile. | Little | Mostly unused, as SLAs tend to define limits in terms of the \(P_{95}\). |
\(P_{95}\) | Refers to the 95th percentile. | High | - |
\(P_{99}\) | Refers to the 99th percentile. | Medium-High | Consider it when evaluating the increase from \(P_{95}\) to \(P_{99}\). |
\(P_{99.9}\) | Refers to the 99.9th percentile. | Medium-High | Consider it when evaluating the increase from \(P_{99}\) to \(P_{99.9}\). |
Minimum | Refers to the data point with the smallest measured value. | Little | More of an interesting side note. Include it in your analysis if you want, but do not express any rules on this parameter. |
Maximum | Refers to the data point with the highest measured value. | Little | More of an interesting side note. Include it in your analysis if you want, but do not express any rules on this parameter. |
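If raw measurements are available outside of a monitoring backend (e.g. exported from a load test), the percentile output parameters can be derived directly. Below is an illustrative sketch using the nearest-rank method; monitoring tools may use interpolation and therefore report slightly different values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class Percentiles {

    private Percentiles() {
    }

    // Nearest-rank percentile: p is in (0, 100], serviceTimesMillis holds one value per request.
    static long percentile(List<Long> serviceTimesMillis, double p) {
        List<Long> sorted = new ArrayList<>(serviceTimesMillis);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        // Made-up service times in milliseconds.
        List<Long> samples = List.of(120L, 180L, 200L, 210L, 220L, 240L, 950L, 3100L);
        System.out.println("P50 = " + percentile(samples, 50) + "ms"); // 210ms
        System.out.println("P95 = " + percentile(samples, 95) + "ms"); // 3100ms
    }
}
```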
Classification
We categorize the performance of a given service into four different categories.
Class | Class Description | Call-to-action | Why? |
---|---|---|---|
A | Response times exceed expectations. | no | The service performs much better than expected. We can safely assume that there is a performance buffer big enough that the system will cope without problems. |
B | Response times that are considered good. | no | Response times are well within our SLAs. |
C | Response times that are close to what our SLAs allow for at maximum. | yes | The performance of the service should be improved in the intermediate term so that we continue to meet SLAs. |
D | Response times exceed the limit as defined in SLAs. | yes | The service fails to meet SLAs. Its performance must be improved in order to satisfy them. |
This classification also enables us to reason about the impact of continuous feature development. For instance, if we have a service that is rated at C and add a new feature to it that increases the number of requests (quantifiable) to the system, we can reason about the sustainability of that feature in terms of the server's current capacity.
Classification Rules
We will start off with a few simple rules. However, these can be further refined to include more criteria per class. Consider the alternatives for a stricter evaluation, and possibly extend the criteria with evaluations of the progression along the percentile curve.
The base rule set that we propose is derived from Michael Pichler's article Schlechte Performance - Aussagekräftige Bewertung von Performancedaten leicht gemacht, published in JavaMagazin 9/2011. We have replaced the use of averages in the presented criteria in favor of the more stable \(P_{50}\). Any additional criteria presented here (progressions, ...) are sourced from our own experience. All of these rules are easily translated into PromQL or whatever query language your monitoring tool offers; a small code sketch encoding the base rule set follows the rulesets below.
It is possible that your SLAs dictate less strict or even stricter rules. If you see a statement like
Only 10% of requests may exceed a response time of 2000ms.
then you'll define the classification in terms of the \(P_{90}\), instead of the \(P_{95}\).
Ruleset for Class A
1. \(P_{95} \lt S_{limit}\)
Alternatives
- Instead of 1: \(P_{99} \lt S_{limit}\)
Ruleset for Class B
1. \(P_{50} \lt S_{limit}\)
2. \(S_{limit} \lt P_{95} \lt 1.5 \times S_{limit}\), or expressed as \(S_{limit} \lt P_{95} \lt S_{max}\)
Alternatives
- Instead of 1: \(P_{75} \lt S_{limit}\)
Ruleset for Class C
1. \(P_{50} \lt S_{limit}\)
2. \(S_{max} \lt P_{95}\)
Alternatives
- Instead of 1: \(P_{75} \lt S_{limit}\)
Ruleset for Class D
1. \(S_{limit} \lt P_{50}\)
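As a minimal sketch, the base rule set could be encoded as follows, assuming the percentile values have already been obtained (e.g. from Prometheus). The class and method names are made up for illustration; boundary cases (exact equality), which the rules above leave open, are resolved here by the order of the checks.

```java
import java.time.Duration;

public final class PerformanceClassifier {

    public enum PerformanceClass { A, B, C, D }

    private final Duration limit; // S_limit, e.g. 2000ms
    private final Duration max;   // S_max, e.g. 1.5 * S_limit = 3000ms

    public PerformanceClassifier(Duration limit, Duration max) {
        this.limit = limit;
        this.max = max;
    }

    // p50 and p95 are the output parameters collected over the observation period.
    public PerformanceClass classify(Duration p50, Duration p95) {
        if (p50.compareTo(limit) > 0) {
            return PerformanceClass.D; // S_limit < P50
        }
        if (p95.compareTo(limit) < 0) {
            return PerformanceClass.A; // P95 < S_limit
        }
        if (p95.compareTo(max) < 0) {
            return PerformanceClass.B; // P50 < S_limit and S_limit <= P95 < S_max
        }
        return PerformanceClass.C;     // P50 < S_limit and S_max <= P95
    }
}
```

For the stricter alternatives, substitute \(P_{75}\) for \(P_{50}\) (Classes B and C) or \(P_{99}\) for \(P_{95}\) (Class A) in the comparisons.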
Additional Rules
It is also advisable to look at some progressions on the percentile curve. Be sure to have collected lots of data points over a longer period of time, though, before you read too much into this analysis. It is not suitable in the context of a short load test that runs for only a couple of minutes.
The progressions from
- \(P_{95}\) to \(P_{99}\)
- \(P_{99}\) to \(P_{99.9}\)
are of interest. Consider the service to be healthy if service times do not increase by more than a reasonable factor when moving to the next percentile. This factor usually ranges in \([1.0; 2.5]\). If the computed factor is not in this range, then there might be something off with your service that requires attention.
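A small sketch of this progression check follows; it computes the factor between adjacent tail percentiles and flags values outside the range of \([1.0; 2.5]\) mentioned above. The percentile values in the example are made up.

```java
public final class PercentileProgression {

    private static final double MAX_HEALTHY_FACTOR = 2.5;

    // Multiplicative increase when moving from the lower to the higher percentile.
    static double progressionFactor(double lowerPercentileMillis, double higherPercentileMillis) {
        return higherPercentileMillis / lowerPercentileMillis;
    }

    static boolean looksHealthy(double factor) {
        // Percentiles are non-decreasing, so the factor is at least 1.0; anything
        // well above 2.5 hints at a long tail that deserves closer attention.
        return factor <= MAX_HEALTHY_FACTOR;
    }

    public static void main(String[] args) {
        double p95 = 1800, p99 = 2600, p999 = 7200; // example values in milliseconds

        double f1 = progressionFactor(p95, p99);   // ~1.44 -> healthy
        double f2 = progressionFactor(p99, p999);  // ~2.77 -> requires attention

        System.out.printf("P95 -> P99:   %.2f (healthy: %b)%n", f1, looksHealthy(f1));
        System.out.printf("P99 -> P99.9: %.2f (healthy: %b)%n", f2, looksHealthy(f2));
    }
}
```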