How do you calculate/agree the Recovery targets for services?

Wed, 21 Jul 2021 10:57:21 GMT

Recently, we've been reviewing the SLAs for Business Services with the business. It should be done on a yearly basis, but this seems to have slipped with some service delivery managers.

This review has shown that some of the RTO and RPO targets for recovering a service when underlying systems are unavailable are unrealistic, and that they don't match the business criticality levels assigned to those services.

The business continuity team tells us they have a standard that defines the RTO and RPO targets for each Business Criticality level, but it seems these were defined quite a while ago and might not be realistic any more.

How do you define these targets please? We work with the business to define the services as they consume them and I'm interested to find out how other companies construct their targets.

Wed, 21 Jul 2021 15:54:45 GMT

Hello and thank you for your question,

Firstly, I commend you for looking to follow the good practice of reviewing Service Level Agreements with the business on an annual basis. I once inherited some very out-of-date SLAs in a large organisation, and so much had changed that by that point the agreements were no longer relevant to either the business or the IT department. From that point on, a refresh date has been part of any SLA, OLA or XLA that I've drafted and agreed.

As I understand it, there are two parts to this.

Firstly, what RTO and RPO can actually be achieved when recovering those services after a major failure or disaster? If Business Continuity and Service Continuity have been working on a solid approach to continuity plans and Disaster Recovery tests, then this will provide hard facts rather than theoretical design parameters.

I've seen services with what were thought to be achievable RTO and RPO targets get an uncomfortable wake-up call when a full Data Centre recovery test was run. That surfaced a number of prerequisite underpinning domain, network and authentication services that each needed to be recovered before the business service could in turn be recovered. The actual RTO and RPO in the event of a Data Centre failure and failover were well outside the previous targets.

What helps here is to work through failure scenarios, from the point of consumption of the service by the business:
- What happens if a server fails?
- What happens if a database is corrupt?
- What happens if an underpinning service fails?
- What happens if a Data Centre is offline, and a complete recovery to a secondary DC is needed?

Planning and testing each of these will give you the achievable RTO and RPO for the service.
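To illustrate why prerequisite services push out the achievable RTO, here is a minimal Python sketch. The service names, recovery times and dependencies are invented for illustration, and it assumes that independent prerequisites can be recovered in parallel and that a service's own recovery only starts once all of its prerequisites are back. Real figures should come from your recovery tests.

```python
# Hypothetical figures: hours to recover each component once its
# prerequisites are available. Replace with results from DR tests.
RECOVERY_HOURS = {
    "network": 2.0,
    "domain": 1.0,
    "authentication": 1.5,
    "database": 3.0,
    "payroll": 2.0,
}

# Hypothetical dependency map: each service lists what must be up first.
DEPENDS_ON = {
    "domain": ["network"],
    "authentication": ["domain"],
    "database": ["network"],
    "payroll": ["authentication", "database"],
}


def achievable_rto(service: str) -> float:
    """Hours until `service` is back, counted from the start of recovery.

    Assumes prerequisites are recovered in parallel, so the service
    waits for the slowest prerequisite chain and then adds its own time.
    """
    prerequisites = DEPENDS_ON.get(service, [])
    wait = max((achievable_rto(p) for p in prerequisites), default=0.0)
    return wait + RECOVERY_HOURS[service]


print(f"Payroll alone: {RECOVERY_HOURS['payroll']} h")
print(f"Payroll after a full DC failover: {achievable_rto('payroll')} h")
```

In this made-up example the business service takes 2 hours to recover on its own, but 7 hours once the network, database and authentication chain is included, which is the kind of gap a full Data Centre test tends to reveal.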

Secondly you have the questions of whether the criticality levels are correct for each service, and whether the RTO and RPO targets meet business requirements.

Good practice would be to look at a range of factors for the criticality of a service: reputation risk, customer impact, lost revenue, potential for regulatory fines etc., and then assess those against the impact and "cost" of each service failing.

Note this is where you need service delivery managers to really understand how the business uses the services and the impact of failure. The business cycle might mean that a service isn't important for most of the month, but is critical on a couple of days. For example, I was responsible for HR services and a comment was made by an IT SME that these must be low criticality. I politely pointed out that Payroll was one of these, and I was sure that the SME and every other permanent member of staff would agree getting paid was important to them on pay day!

Finally let's assume you have the achievable RTO and RPO from continuity tests and plans, and have a Business Criticality level with RTO and RPO targets that aligns to business strategy and needs. If (or more likely when) there is a gap where a service cannot be recovered within those targets, then that's where a risk to service should be raised. This would then trigger service improvement work on mitigating actions to close that gap, or if the costs and benefits don't stack up then a risk acceptance signoff from the business.
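The comparison described above can be sketched in a few lines of Python. This is only an illustration: the `ServiceRecovery` structure, the service names and all of the figures are hypothetical, and in practice the output would feed whatever risk register you already use.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecovery:
    """Targets from the criticality level vs. results from recovery tests."""
    name: str
    target_rto_hours: float
    achievable_rto_hours: float
    target_rpo_hours: float
    achievable_rpo_hours: float


def recovery_gaps(services):
    """Return (name, rto_gap, rpo_gap) for services that miss a target.

    A gap is the number of hours by which the tested figure exceeds
    the target; zero means that particular target is met.
    """
    gaps = []
    for s in services:
        rto_gap = max(s.achievable_rto_hours - s.target_rto_hours, 0)
        rpo_gap = max(s.achievable_rpo_hours - s.target_rpo_hours, 0)
        if rto_gap > 0 or rpo_gap > 0:
            gaps.append((s.name, rto_gap, rpo_gap))
    return gaps


# Hypothetical example data
services = [
    ServiceRecovery("payroll", 4, 7, 1, 1),
    ServiceRecovery("intranet", 24, 8, 24, 24),
]

for name, rto_gap, rpo_gap in recovery_gaps(services):
    print(f"Raise risk for {name}: RTO gap {rto_gap} h, RPO gap {rpo_gap} h")
```

Each service listed is a candidate for either mitigating actions or a risk acceptance signoff.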

A word too on projects introducing new services. I recommend pushing back on any sponsor or PM who claims there isn't time or budget to hold a full DR/continuity test. That's unacceptable in my view, as the ideal time to prove the recovery (and RTO & RPO) is during the project and before the service is live.

Hope that helps, and if I've misunderstood the question then do please let us know.

I'd also recommend our Communities of Practice as a great forum for discussing topics such as this: https://itsmfuk.site-ym.com/members/group_select.asp?type=28699

Mon, 26 Jul 2021 18:26:47 GMT

Hi, Mark has given some good info. I would just like to add: keep it simple when talking to the business about RPO and RTO. I find that questions such as "how often do you run data backups?" help, since the answer clearly drives your RPO. If they say every 24 hours, then your RPO can be 24 hours, because that's the last time a data backup was completed and all changes since then will be lost. If backups are hourly, then the RPO could be one hour. With RTO, it's how long you can be without the system, which should have been clearly defined during the Business Criticality Analysis.
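
As a rough sketch of that backup arithmetic in Python: the function names and figures are made up for illustration, and it assumes the worst case, where the failure happens just before the next backup and every backup taken is usable.

```python
def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Worst case: failure strikes just before the next backup runs,
    so everything since the previous backup is lost."""
    return backup_interval_hours


def meets_rpo_target(backup_interval_hours: float, target_rpo_hours: float) -> bool:
    """True if the backup schedule can support the target RPO."""
    return worst_case_data_loss_hours(backup_interval_hours) <= target_rpo_hours


# Hypothetical examples: daily backups against a 4-hour target,
# hourly backups against the same target.
print(meets_rpo_target(backup_interval_hours=24, target_rpo_hours=4))  # False
print(meets_rpo_target(backup_interval_hours=1, target_rpo_hours=4))   # True
```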