First, a little background. Here at OnceHub, we have amazing opportunities as employees. One of those opportunities is that we can work from home. Having employees working from home everywhere from California to New Zealand has certainly helped us understand scheduling needs, but it has also taught us the importance of making an extra effort to stay in shape.
A group of us at OnceHub have started an initiative called StepOnce. Using the gifts we were given at our company retreats, we encourage each other to take those 10,000 steps, or 60 ‘move minutes’ every day to stay healthy. But something very surprising happened. From our training in Site Reliability Engineering, we learned something else that helps us improve our health.
Leading vs Lagging Indicators
When measuring telemetry to best understand our system and keep our services up and running, there are two meta-categories we can use to keep everything running smoothly. The first category is called ‘Leading Indicators’. These are events or metrics that give us a clue about what we can expect in the future. For example, if the number of requests per second starts going up, we can expect that we will need to scale up a service. We might deploy another pod in Kubernetes to ensure that future requests do not experience a service interruption.
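To make the idea concrete, here is a minimal sketch of a leading-indicator check. It is not how our systems or Kubernetes actually decide to scale; the function names, the 80% headroom threshold, and the simple "average change per sample" projection are all assumptions chosen for illustration.

```python
def projected_rps(samples, lookahead=1):
    """Project requests per second `lookahead` samples into the future.

    Uses the average change between consecutive samples as a naive trend.
    (Hypothetical helper, for illustration only.)
    """
    if not samples:
        return 0.0
    if len(samples) < 2:
        return float(samples[-1])
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    avg_delta = sum(deltas) / len(deltas)
    return samples[-1] + avg_delta * lookahead


def should_scale_up(samples, capacity_per_pod, pods, headroom=0.8, lookahead=3):
    """Return True when projected traffic exceeds the share of capacity we allow.

    `capacity_per_pod` and `headroom` are assumed values you would tune yourself.
    """
    allowed = headroom * capacity_per_pod * pods
    return projected_rps(samples, lookahead) > allowed


if __name__ == "__main__":
    rising = [100, 120, 140, 160]  # requests per second, oldest first
    print(should_scale_up(rising, capacity_per_pod=100, pods=2))
```

The point is that the decision is made from where traffic appears to be heading, before any request has actually failed.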
Another meta-category of telemetry is called ‘Lagging Indicators’. These are events or metrics that tell us about things that actually happened. Prediction is great, but reality is better. An example of a lagging indicator is a metric known as ‘Mean Time to Recovery’ (MTTR). MTTR measures how long it took us to recover from a degradation in service, be it slower response times or an actual outage. These lagging indicators give us an honest look at how well we performed, and our goal is to make sure they head in the right direction.
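As a rough illustration of how a lagging indicator like MTTR can be computed, here is a small sketch. It assumes each incident is recorded as a pair of timestamps (when the degradation started and when service recovered); that record format and the function name are assumptions for the example, not a description of any particular monitoring tool.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery time over a list of (started, recovered) datetime pairs.

    Hypothetical helper: the incident format is an assumption for this sketch.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (recovered - started for started, recovered in incidents),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(incidents)


if __name__ == "__main__":
    incidents = [
        (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 30)),    # 30 minutes
        (datetime(2021, 3, 8, 14, 0), datetime(2021, 3, 8, 15, 30)),  # 90 minutes
    ]
    print(mean_time_to_recovery(incidents))
```

Tracking this number from one retrospective to the next shows whether recovery is actually getting faster, regardless of what we predicted.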
When it comes to site reliability, we want to measure both leading and lagging indicators. Mostly, we want to focus on leading indicators, to better predict the future. It’s always better to avoid a problem than to congratulate ourselves on how quickly we fixed it. However, as more systems become automated and we increase our rate of improvement, problems will inevitably increase. Because of this, many in the DevOps community put extra focus on tracking and learning from lagging indicators.
In a talk about telemetry, Roy Osherove explained the importance of lagging indicators with the following example. Say you are trying to lose weight. You can attempt to measure the number of calories you eat and the number of calories you burn, and this will give you a good indication of your ability to lose weight. It can even be very predictive. Theoretically, you are also in complete control. However, as life becomes more chaotic and the system more complex, these leading indicators become harder and less realistic to manage. Restaurants will add extra fat or sugar. A family event will pressure you into eating lots of food.
However, the lagging indicators will always be honest with you. At the end of the day, you put yourself on the scale and know how much weight you lost or gained. If every day you measure your weight, and perform a retrospective, then even though you may not have as much control over your leading indicators as you would like, you can help nudge your lagging indicators in the right direction.
The lesson for system operators is clear. As we have more and more microservices, and as systems become more complex, it becomes harder to properly measure, let alone control, all the leading indicators in the system. But you can easily measure and watch your lagging indicators, and if you hold frequent retrospectives, you can also learn from them.
Sounds good on paper, but is this really true? Lagging indicators can take a long time to measure and learn from, so this can be risky. It might take months to find out if you are headed in the right direction or not. It doesn’t sound very Agile or Lean.
Putting Theory into Practice
Various experts on weight loss will advise that it isn’t a good idea to weigh yourself every day. Body weight can fluctuate from day to day by a few pounds or kilograms. When we dig deeper, we see that the reason for this advice is that daily weighing can be demoralizing. However, if we look at weight as a lagging indicator, and engage in personal retrospectives, then in theory it will be more effective than counting steps and keeping to a strict diet or lifestyle.
For months, I worked on strict meal plans, took up Brazilian Jiu Jitsu and longsword dueling to stay active, and managed to walk about 5,000 steps every day. Every two months I would weigh myself, and my doctor was happy with the results. But I wasn’t. As an experiment, I incorporated the lessons of leading and lagging indicators, weighing myself every day and doing a quick mental accounting of what I did that day that was the same as or different from other days. At first this was hard, because it’s hard to confront ourselves when our actions don’t live up to what we wanted to do, but eventually I got into a nice rhythm.
Two months later, the results have really surprised me. By measuring my weight every day, and thinking about my previous day’s actions, I was able to start subconsciously regulating what I ate that day. I found that when my weight spiked up, I tended to be less hungry for the next day or two. There have also been conscious differences in my behavior. My intuition about how the leading indicators affect the lagging indicator is much stronger, and I can make better decisions during the day with that knowledge.
What you measure, how often you measure it, and how carefully you think about why the measurements changed can have a bigger impact on decision making and behavior than just trying to control traditional leading indicators. Be sure to measure your lagging indicators and engage in regular retrospectives. If you stay behind to measure results, it will eventually pull you ahead.
If you would like to join us on your own journey of moving ahead, see our Careers page.
Avi Kessner, Software Architect
Avi started as a Flash Animator in 1999, moving into programming as ActionScript evolved. He’s worked as a Software Architect at OnceHub since September 2019, where he focuses on Cloud Native Automated workflows and improving team practices. In his free time, he enjoys playing games, Dungeons & Dragons, Brazilian Jiu Jitsu, and historical European sword fighting.