Systems fail, and everybody has to accept that outages will eventually occur. What matters is how well prepared you are and how you react when one happens. When an outage occurs, two measurements indicate how successful the recovery is: the RTO and the MTO.
RTO – Recovery Time Objective
The RTO is the target time within which a service should be recovered after an outage. The shorter this value is, the better.
MTO – Maximum Tolerable Outage
The MTO is the maximum time a business can tolerate its services being down. Each service can have a different MTO, depending on factors such as how critical it is to the business.
Every system administrator should know the RTO and MTO of each service in advance, with the values calculated as part of the business recovery plan. During an outage, these values show which systems should be brought up first.
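As a minimal sketch of how these values can drive the recovery order, the Python snippet below sorts services so that the one with the smallest MTO is recovered first, using the RTO as a tie-breaker. The service names and minute values are made up for illustration, and ordering by MTO is only one possible policy; a real plan may also need to account for dependencies between systems.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    rto_minutes: int  # target time to recover the service
    mto_minutes: int  # maximum time the business tolerates it being down


def recovery_order(services):
    """Return services sorted by smallest MTO first, then smallest RTO."""
    return sorted(services, key=lambda s: (s.mto_minutes, s.rto_minutes))


# Hypothetical services and values, for illustration only.
services = [
    Service("reporting", rto_minutes=240, mto_minutes=1440),
    Service("checkout", rto_minutes=15, mto_minutes=60),
    Service("product-catalog", rto_minutes=30, mto_minutes=120),
]

for service in recovery_order(services):
    print(service.name, service.mto_minutes)
```

With these example values, checkout comes first because the business can tolerate its outage for the shortest time.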
The RTO and MTO values must be technically achievable, which means they need to be tested in business recovery simulations. For example, if recovery takes 40 minutes during a simulation, when there is no pressure, then the RTO for that situation cannot be 15 minutes. If the RTO needs to be 15 minutes, the plan has to change completely and the technical requirements have to be modified so that the target can be met.
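The check described here can be written down as a small Python sketch. The function name rto_is_achievable and the safety_factor parameter are assumptions made for this example: the factor simply scales the measured time to allow for a real incident being slower than a calm simulation.

```python
def rto_is_achievable(measured_minutes, rto_minutes, safety_factor=1.0):
    """Return True if the slowest simulated recovery fits within the RTO.

    measured_minutes: recovery times observed in simulations, in minutes.
    safety_factor: hypothetical multiplier for real-incident pressure.
    """
    if not measured_minutes:
        raise ValueError("at least one simulation result is required")
    worst_case = max(measured_minutes) * safety_factor
    return worst_case <= rto_minutes


# The example from the text: a 40-minute simulated recovery and a 15-minute RTO.
print(rto_is_achievable([40], rto_minutes=15))  # False
```

If the check fails, either the RTO is unrealistic or the recovery plan has to change until the simulated time fits the target.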
Although the MTO and RTO are typically fixed, there are situations where they can take different values. Take, for example, an online shop for gifts and gadgets. During the holiday period, traffic and sales are much higher than on other days, so a different MTO and RTO need to be defined for that period.
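One way to record period-dependent targets is a simple lookup by date, sketched below in Python. The peak window (20 November to 31 December) and all minute values are hypothetical and only illustrate the idea for the gift shop example.

```python
from datetime import date

# Hypothetical targets, in minutes, for illustration only.
DEFAULT_TARGETS = {"rto_minutes": 60, "mto_minutes": 240}
PEAK_TARGETS = {"rto_minutes": 15, "mto_minutes": 60}

# Assumed holiday window as (month, day), inclusive on both ends.
PEAK_START = (11, 20)
PEAK_END = (12, 31)


def targets_for(day):
    """Return the RTO/MTO targets that apply on the given date."""
    if PEAK_START <= (day.month, day.day) <= PEAK_END:
        return PEAK_TARGETS
    return DEFAULT_TARGETS


print(targets_for(date(2024, 12, 10)))  # peak targets
print(targets_for(date(2024, 6, 1)))    # default targets
```

Keeping both sets of targets in the recovery plan makes it clear which values to verify in a simulation run before the busy period starts.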
Ultimately, it is important to be prepared for when systems go down, because they will! The only way to be prepared is to run simulations of your business recovery plan so that you can verify your RTOs and MTOs. Monitoring such as Netumo also helps you stay within the predefined RTO and MTO, because you learn about an issue sooner and can start the recovery immediately.