Implementing Availability, Capacity, and Service Continuity Management

Overview: The Assurance Practices

Availability management, capacity management, and service continuity management are the three assurance practices. Together they ensure that services can be delivered at agreed levels: in normal operations (availability), in the future as demand grows (capacity), and during disruption (continuity). These practices are less visible day-to-day than incident and change management, but they are equally important for audit compliance and for demonstrating management competence to customers. Incident management responds to failures that have already occurred; the assurance practices are proactive. They aim to prevent failures and to ensure recovery when failures do occur.

Implementing Availability Management

Defining Availability

Availability is the percentage of agreed service time during which the service is actually available and performing at acceptable levels. It is calculated as: (Total Service Time - Downtime) / Total Service Time × 100%. Define clearly what counts as downtime: only a full outage (service completely unavailable), or also degraded performance (service available but performing below acceptable thresholds). Availability must be measured per service, not as an overall aggregate, because services differ in criticality and therefore in their agreed availability targets.
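
The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, its parameters, and the sample month (90 minutes of full outage, 120 minutes of degraded performance) are assumptions made for the example.

```python
def availability_percent(total_service_minutes: float,
                         outage_minutes: float,
                         degraded_minutes: float = 0.0,
                         count_degraded_as_downtime: bool = False) -> float:
    """(total service time - downtime) / total service time x 100."""
    if total_service_minutes <= 0:
        raise ValueError("total_service_minutes must be positive")
    downtime = outage_minutes
    if count_degraded_as_downtime:
        # Only applies when the agreed definition treats degradation as downtime.
        downtime += degraded_minutes
    if downtime > total_service_minutes:
        raise ValueError("downtime cannot exceed agreed service time")
    return (total_service_minutes - downtime) / total_service_minutes * 100.0


# Hypothetical month: 30 days of 24x7 agreed service time.
agreed_minutes = 30 * 24 * 60

lenient = availability_percent(agreed_minutes, outage_minutes=90, degraded_minutes=120)
strict = availability_percent(agreed_minutes, outage_minutes=90, degraded_minutes=120,
                              count_degraded_as_downtime=True)

print(f"full outage only:          {lenient:.3f}%")
print(f"outage + degraded counted: {strict:.3f}%")
```

With these sample figures the same month lands at roughly 99.79% under the lenient definition and roughly 99.51% under the strict one, which is why the downtime definition has to be agreed before targets are set.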

Availability Targets

Availability targets in SLAs are customer-facing commitments. Internal availability monitoring targets should be set higher than SLA commitments to provide a buffer. Example: if the SLA commitment is 99.5% availability, the internal target might be 99.7%. This buffer ensures that even if availability drifts slightly below the internal target, the SLA commitment is still met. Availability targets vary by service criticality. Critical business services might have 99.9% targets; non-critical services might have 95% targets. Availability target setting must be based on what is achievable given your actual infrastructure capability.
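
To make the buffer concrete, the targets above can be converted into a monthly downtime budget. The sketch below assumes a 30-day month of 24x7 service hours; the function name is illustrative.

```python
def downtime_budget_minutes(target_percent: float, service_minutes: float) -> float:
    """Minutes of downtime that a target allows within the agreed service time."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return service_minutes * (100.0 - target_percent) / 100.0


service_minutes = 30 * 24 * 60  # assumption: 30-day month, 24x7 service hours

targets = [("SLA commitment", 99.5), ("internal target", 99.7),
           ("critical service", 99.9), ("non-critical service", 95.0)]
for label, target in targets:
    budget = downtime_budget_minutes(target, service_minutes)
    print(f"{label:<22}{target:>6.1f}%  ->  {budget:8.1f} min/month")

# The buffer is the extra downtime the SLA tolerates beyond the internal target.
buffer_minutes = (downtime_budget_minutes(99.5, service_minutes)
                  - downtime_budget_minutes(99.7, service_minutes))
print(f"buffer between internal target and SLA: {buffer_minutes:.1f} min/month")
```

Under these assumptions a 99.5% commitment allows 216 minutes of downtime per month and a 99.7% internal target allows about 130, so the buffer is a little under an hour and a half.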

Availability Monitoring

Availability monitoring is continuous, automated monitoring that measures whether services are available and performing at acceptable levels. Monitoring approaches: (1) Synthetic monitoring: automated transactions that simulate user activity and confirm the service can process them (e.g., automated user login test, automated transaction submission); (2) Agent-based monitoring: agents installed on infrastructure report availability data (an agent on a server reports "server is up and responding"); (3) SNMP monitoring: infrastructure devices report their status via SNMP (a network device reports its state). Monitoring must cover all in-scope services and their critical components. Monitoring frequency (how often availability is checked) determines the precision of availability reporting, because an outage that begins and ends between two checks is never recorded. Hourly checks reliably capture only outages lasting an hour or more; more frequent checks (15-minute intervals or shorter) catch brief outages that hourly checks miss. Alert thresholds define when availability has degraded enough to warrant action.
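
The effect of check frequency can be illustrated with a small simulation. It is a simplified model resting on two assumptions: checks fire at fixed times (minute 0, then every interval), and an outage is recorded only if at least one check falls inside it. The outage list is invented for the example.

```python
import math


def is_detected(outage_start: float, outage_end: float, check_interval: float) -> bool:
    """Return True if a check at t = 0, interval, 2*interval, ... lands in [start, end)."""
    first_check = math.ceil(outage_start / check_interval) * check_interval
    return first_check < outage_end


# Hypothetical outages during one day, as (start_minute, end_minute).
outages = [(125, 132), (400, 480), (905, 925), (1210, 1213)]

detected_counts = {}
for interval in (60, 15, 1):
    detected = [o for o in outages if is_detected(o[0], o[1], interval)]
    detected_counts[interval] = len(detected)
    print(f"check every {interval:>2} min: "
          f"{len(detected)} of {len(outages)} outages detected")
```

In this sample, hourly checks record only the 80-minute outage, 15-minute checks also record the 20-minute outage, and only one-minute checks see the two short ones. Undetected outages never reach the availability report, so a coarse interval overstates availability.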

Availability Reporting

Monthly availability reports per service are required. Report content: availability percentage achieved in the month, comparison to target, trend analysis (is availability improving or declining?), list of outages (date, duration, reason), SLA achievement (were SLA availability commitments met?), and recurring patterns (do the same causes or components appear repeatedly?). Availability reports must be prepared and reviewed by the service manager before management review. These reports constitute documented information for audit. Auditors will examine 3–6 months of availability reports to assess whether availability management is functioning.
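
A monthly report of this kind can be assembled from the outage log. The sketch below is one possible shape, not a prescribed format: the record fields, the function name, the service name, and the sample outages are all hypothetical, and month-over-month trend analysis is left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Outage:
    date: str        # ISO date of the outage
    minutes: float   # duration counted as downtime
    reason: str


def monthly_report(service: str, service_minutes: float, outages: List[Outage],
                   internal_target: float, sla_target: float) -> dict:
    """Summarize one service's month: achieved %, targets met, outages, repeats."""
    downtime = sum(o.minutes for o in outages)
    achieved = (service_minutes - downtime) / service_minutes * 100.0
    reason_counts = Counter(o.reason for o in outages)
    return {
        "service": service,
        "availability_percent": round(achieved, 3),
        "internal_target_met": achieved >= internal_target,
        "sla_met": achieved >= sla_target,
        "outages": [(o.date, o.minutes, o.reason) for o in outages],
        # Reasons seen more than once point to a recurring pattern.
        "repeating_reasons": sorted(r for r, n in reason_counts.items() if n > 1),
    }


# Hypothetical data for a 31-day month of 24x7 service hours.
march_outages = [
    Outage("2024-03-07", 45, "storage controller fault"),
    Outage("2024-03-15", 30, "failed deployment"),
    Outage("2024-03-22", 60, "storage controller fault"),
]
report = monthly_report("customer portal", 31 * 24 * 60, march_outages,
                        internal_target=99.7, sla_target=99.5)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this example the SLA commitment is met but the internal target is narrowly missed, and the repeated storage fault is flagged as a pattern worth investigating.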

Availability Improvement

When availability consistently falls below target, an availability improvement plan is required. The improvement plan identifies the root cause of availability problems (is a particular infrastructure component failing repeatedly?), and the improvement actions (infrastructure upgrades, design changes, additional redundancy). Availability improvement actions often generate change records (upgrade a failing component, implement redundancy).

Availability Records for Audit

Auditors examine: monitoring data (showing that availability monitoring actually occurred), monthly availability reports (showing that availability was analyzed and reported), SLA compliance history (did we meet our commitments?), and improvement records (when availability fell short, was action taken?). Organizations should maintain at least 3 months of availability records before Stage 2 audit.

Implementing Capacity Management

Capacity Planning Scope

Capacity management must address all resources that constrain service delivery. Common resource categories: compute (CPU utilization on servers—when CPU reaches 80–90%, performance degrades), storage (disk utilization—when disk reaches 85%, performance degrades), network bandwidth (link utilization—when links reach saturation, congestion occurs), application concurrency (how many users can the application support simultaneously?), and human resources (do you have enough support staff to meet incident SLA targets?). Prioritize capacity management by service criticality—plan capacity for critical services first.

Demand Forecasting

Demand forecasting translates business requirements into capacity needs. Input data: business growth projections (how much will transaction volume increase?), new services being launched (each new service adds demand), seasonal patterns (do services peak at certain times of year?). Translate business demand into technical capacity requirements: "we expect 20% transaction volume growth" translates to "we need 20% increase in database capacity and application server capacity." Demand forecasting should occur quarterly or when major business changes occur.
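
The translation from business growth to technical capacity can be sketched as a simple calculation. It makes two assumptions: capacity scales linearly with transaction volume, as in the 20% example above, and the organization reserves headroom so that forecast peak load stays below full utilization. The function name and all figures are illustrative.

```python
def required_capacity(current_peak: float, growth_rate: float,
                      headroom: float = 0.2) -> float:
    """Capacity needed so the forecast peak uses at most (1 - headroom) of it."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    forecast_peak = current_peak * (1 + growth_rate)
    return forecast_peak / (1 - headroom)


# Hypothetical figures: peak load in transactions per second.
current_peak_tps = 400.0
installed_capacity_tps = 550.0

needed = required_capacity(current_peak_tps, growth_rate=0.20, headroom=0.20)
shortfall = max(0.0, needed - installed_capacity_tps)

print(f"capacity needed after 20% growth: {needed:.0f} tps")
print(f"shortfall against installed capacity: {shortfall:.0f} tps")
```

Real workloads rarely scale perfectly linearly, so a forecast like this is a starting point to be validated against monitoring data.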

Capacity Monitoring

Real-time monitoring tracks resource utilization: CPU, storage, network bandwidth, application threads. Thresholds are set that trigger alerts when utilization approaches limits. Example thresholds: alert at 80% CPU utilization (action needed before critical shortage); alert at 85% storage utilization (need to free space or add capacity). Trending of capacity data over time shows utilization trajectory—is utilization rising toward the limit, or is it stable? Trend data feeds into capacity planning.
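
Trending can be as simple as fitting a straight line to recent utilization samples and estimating when the alert threshold will be crossed. The sketch below uses an ordinary least-squares fit. The function name and the weekly storage samples are assumptions, and a linear trend is itself a simplification.

```python
from typing import List, Optional, Tuple


def days_until_threshold(samples: List[Tuple[float, float]],
                         threshold: float) -> Optional[float]:
    """Estimate days from the last sample until utilization reaches the threshold.

    samples are (day, utilization_percent) pairs. Returns None when the fitted
    trend is flat or falling, i.e. the threshold is not being approached.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, crossing_day - last_day)


# Hypothetical weekly storage utilization samples: (day, percent used).
storage = [(0, 70.0), (7, 71.5), (14, 73.0), (21, 74.5), (28, 76.0)]
remaining = days_until_threshold(storage, threshold=85.0)
print(f"estimated days until 85% storage alert: {remaining:.0f}")

stable = [(0, 60.0), (7, 60.0), (14, 60.0)]
print(f"stable series: {days_until_threshold(stable, threshold=85.0)}")
```

An estimate like this gives capacity planning a lead time to compare against how long a capacity change takes to approve and implement.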

Capacity Planning Documents

The capacity plan is a documented plan covering: current capacity (we have X units of storage), projected demand (based on business forecast, we will need Y units in Q3), planned capacity changes (we will add Z units via this change record on this date), and timeline. The capacity plan is reviewed on a defined schedule (quarterly recommended). When capacity monitoring shows utilization rising toward limits, capacity planning responds by scheduling capacity increases via change management.

Capacity Incidents

When capacity is exhausted (disk full, bandwidth exhausted, CPU maxed), service performance degrades or service becomes unavailable. This triggers incident management. The incident record captures the capacity problem. Capacity management investigates and implements a solution (add more capacity, optimize utilization). Recurring capacity incidents indicate that capacity planning is not keeping pace with demand and trigger capacity problem management (identify the root cause of repeated capacity exhaustion and implement a structural fix).

Capacity Records for Audit

Auditors examine: utilization monitoring data (showing that capacity was actually monitored), capacity plans (showing that demand was forecasted and capacity was planned), evidence of capacity reviews (when did you last review capacity data and update plans?), and capacity-related change records (when capacity was exhausted, what action did you take?). At least 3–6 months of capacity data should be retained.

Implementing Service Continuity Management

Alignment with ISO 22301

ISO 22301 (Business Continuity Management) is the broader standard that addresses how organizations prepare for and recover from major disruptions. ISO 20000 service continuity management (Clause 8.7) is narrower in scope: it covers recovery of the in-scope services rather than continuity of the whole business. For organizations certified to ISO 22301, service continuity plans should be derived from the organizational Business Continuity Plan. Organizations not certified to ISO 22301 must still maintain service continuity plans covering all in-scope services, because ISO 20000 requires them, but those plans focus on service recovery rather than full business continuity.

Business Impact Analysis (BIA) for Services

BIA identifies two recovery objectives for each in-scope service. The Recovery Time Objective (RTO) is the maximum tolerable downtime before the service must be restored (e.g., email service RTO = 4 hours; non-critical reporting service RTO = 2 days). The Recovery Point Objective (RPO) is the maximum tolerable data loss (e.g., email RPO = latest backup [usually same-day]; transaction system RPO = zero data loss [continuous replication required]). BIA also identifies the business impact of the service being unavailable for defined durations (if email is down for 4 hours, we lose X business value; if our transaction system is down for 8 hours, we cannot process customer orders). RTO and RPO values drive infrastructure design: a long RTO (days) can be met with simple backup and restore, while a short RTO (an hour or less) requires active redundancy.
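
One way to use BIA output is to compare each service's objectives against the recovery design actually in place. The sketch below is illustrative only: the class and function names are hypothetical, the email RPO is taken as 24 hours to represent "same-day", and the 1-hour RTO for the transaction system is an assumed value, not one stated above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class ServiceBia:
    name: str
    rto: timedelta   # maximum tolerable downtime
    rpo: timedelta   # maximum tolerable data loss


@dataclass
class RecoveryDesign:
    backup_interval: timedelta          # timedelta(0) means continuous replication
    estimated_restore_time: timedelta   # how long recovery is expected to take


def design_gaps(bia: ServiceBia, design: RecoveryDesign) -> List[str]:
    """List the objectives that the current recovery design cannot meet."""
    gaps = []
    if design.backup_interval > bia.rpo:
        gaps.append(f"{bia.name}: backup interval {design.backup_interval} "
                    f"exceeds RPO {bia.rpo}")
    if design.estimated_restore_time > bia.rto:
        gaps.append(f"{bia.name}: restore time {design.estimated_restore_time} "
                    f"exceeds RTO {bia.rto}")
    return gaps


email = ServiceBia("email", rto=timedelta(hours=4), rpo=timedelta(hours=24))
# Assumed RTO of 1 hour for the transaction system (not given in the text).
transactions = ServiceBia("transactions", rto=timedelta(hours=1), rpo=timedelta(0))

nightly_backup = RecoveryDesign(backup_interval=timedelta(hours=24),
                                estimated_restore_time=timedelta(hours=3))

for service in (email, transactions):
    gaps = design_gaps(service, nightly_backup)
    print(service.name, "->", gaps if gaps else "design meets RTO and RPO")
```

In this example a nightly backup satisfies the email objectives but fails both objectives for the transaction system, which is the kind of gap a BIA is meant to surface before a disruption does.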

Service Continuity Plans

A service continuity plan describes how a service will be recovered if a major failure occurs. Plan content: recovery procedures (step-by-step actions to restore the service), roles and responsibilities during a continuity event (who makes decisions, who executes recovery, who communicates with customers?), communication plan (who is notified of the disruption, what information is communicated, how often are updates provided?), alternative delivery options (if the service cannot be recovered from primary systems, can it be delivered from alternative systems?), escalation to senior management (when does continuity event trigger executive escalation?), return to normal operations (how do you know recovery is complete, how do you return to primary systems?). One plan per service is preferred to a consolidated plan, so that each service team owns their own recovery process.

Testing Continuity Plans

ISO 20000 requires continuity plans to be tested at defined intervals; treat annual testing as the minimum. Testing approaches: (1) tabletop exercise: the team reviews the plan and discusses what they would do if an actual disruption occurred (no systems involved, low cost); (2) technical recovery test: the recovery procedures are actually executed, but in a lab environment rather than production (verifies that the procedures work); (3) full simulation: an actual disruption is simulated in the production environment, recovery is executed, and recovery time and data loss are measured (most realistic, but highest risk and cost). Testing must be documented: test plan, test execution log, test results, lessons learned, and recommendations for plan improvements. Plans must be updated following test findings. Evidence of testing (test reports) is required for audit.
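
Test results become more useful when they are compared against the service's RTO and RPO and against the testing interval. The following sketch shows one way to do that. The record structure, the function name, the 365-day default interval, and the sample test are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ContinuityTest:
    service: str
    test_type: str                                   # "tabletop", "technical", "simulation"
    performed_on: date
    measured_recovery: Optional[timedelta] = None    # None for a tabletop exercise
    measured_data_loss: Optional[timedelta] = None


def review_test(test: ContinuityTest, rto: timedelta, rpo: timedelta, today: date,
                max_interval: timedelta = timedelta(days=365)) -> List[str]:
    """Return findings that should feed back into the continuity plan."""
    findings = []
    if test.measured_recovery is not None and test.measured_recovery > rto:
        findings.append(f"recovery took {test.measured_recovery}, RTO is {rto}")
    if test.measured_data_loss is not None and test.measured_data_loss > rpo:
        findings.append(f"data loss was {test.measured_data_loss}, RPO is {rpo}")
    if today - test.performed_on > max_interval:
        findings.append("next test is overdue")
    return findings


# Hypothetical technical recovery test for the email service.
email_test = ContinuityTest("email", "technical", date(2024, 2, 10),
                            measured_recovery=timedelta(hours=5),
                            measured_data_loss=timedelta(hours=2))

findings = review_test(email_test, rto=timedelta(hours=4), rpo=timedelta(hours=24),
                       today=date(2024, 9, 1))
print(findings)
```

Here the test shows recovery taking longer than the 4-hour RTO, a finding that should lead either to a plan improvement or to a revised RTO.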

Continuity Plan Triggers

Define what constitutes a continuity event as opposed to a normal major incident. A continuity event is typically a disruption severe enough that normal incident management procedures are insufficient. Examples: data center failure (affects multiple services simultaneously), multi-day outage (recovery will take longer than normal SLA timeframes), major disaster (earthquake, flood, power loss at the data center). A major incident that can be resolved within normal SLA timeframes is not a continuity event. The decision to invoke continuity plans must be authorized by the service manager or incident manager, and the decision must be documented in the continuity event log.
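
Trigger criteria are easier to apply consistently when they are written down as explicit rules. The sketch below encodes two of the criteria described above (loss of a site, and recovery expected to exceed the normal SLA timeframe). The class, the field names, and the sample disruptions are hypothetical. The function only recommends; invoking the plan still requires authorization by the service manager or incident manager.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disruption:
    description: str
    estimated_recovery_hours: float
    sla_resolution_hours: float
    site_lost: bool = False   # e.g. data center failure, flood, earthquake


def continuity_reasons(d: Disruption) -> List[str]:
    """Return the reasons a disruption qualifies as a continuity event (empty if none)."""
    reasons = []
    if d.site_lost:
        reasons.append("primary site or data center is unavailable")
    if d.estimated_recovery_hours > d.sla_resolution_hours:
        reasons.append("estimated recovery exceeds the normal SLA timeframe")
    return reasons


db_failover = Disruption("database node failure", estimated_recovery_hours=1,
                         sla_resolution_hours=4)
dc_flood = Disruption("flooding at primary data center", estimated_recovery_hours=72,
                      sla_resolution_hours=4, site_lost=True)

for event in (db_failover, dc_flood):
    reasons = continuity_reasons(event)
    verdict = "recommend invoking continuity plan" if reasons else "handle as major incident"
    print(f"{event.description}: {verdict} {reasons}")
```

Recording the returned reasons in the continuity event log alongside the name of the authorizing manager gives auditors a clear trail for the decision.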

Evidence for Audit

Auditors examine: BIA documentation (showing RTO and RPO were defined), current versions of service continuity plans (showing plans exist and are maintained), test records and results (showing plans are tested), evidence of plan review after tests (showing testing led to plan improvements), and, where continuity events have occurred, continuity event logs (showing the procedure is used when needed). Plans that have never been tested are considered unproven and generate audit findings.

Common Implementation Pitfalls

Pitfall 1: Availability monitoring covers only infrastructure (servers, databases), not end-to-end service availability. For example, every server is up, but the application cannot connect to the database because of a network issue: infrastructure monitoring shows green while the service is actually down. Solution: monitoring must simulate actual user transactions to capture end-to-end availability.

Pitfall 2: Capacity planning done once, then abandoned. A capacity plan is created, then never reviewed or updated. Demand changes, but the plan is not updated. When capacity is exhausted, the organization is caught by surprise. Solution: capacity plans should be reviewed quarterly and updated when business demand changes.

Pitfall 3: Service continuity plans written but never tested. Plans exist on the shelf but have never been executed. When an actual disruption occurs, the plan does not work (steps are incorrect, resources are not available, procedures have become obsolete). Solution: test plans at least annually. Testing is where you discover plan problems in a controlled environment before a real disruption occurs.

Pitfall 4: Continuity plans not updated when service architecture changes. A service is redesigned and moved to a new infrastructure, but the continuity plan is not updated. The plan describes recovery procedures for the old architecture, which are no longer applicable. Solution: trigger a continuity plan review whenever a service architecture change occurs.

Pitfall 5: RTO and RPO targets set without checking feasibility. An RTO of 1 hour requires active redundancy (expensive), but budget is not allocated. When the RTO cannot be met, the customer is disappointed and auditors find a control gap. Solution: ensure RTO/RPO targets are feasible within budget and infrastructure constraints.

KEY CONCEPT

The difference between availability management and service continuity management: Availability management operates in normal circumstances, aiming to maintain agreed performance day-to-day through monitoring and improvement. Service continuity management operates in exceptional circumstances, addressing how a service will be recovered from major failures. Both are required. Availability management prevents failures; service continuity management enables recovery from the failures that do occur.

Assurance Practice Implementation Requirements

Practice	Planning Requirement	Operational Activity	Evidence for Audit	Common Gap
Availability	Define availability target per service based on criticality; align internal targets to SLA commitments	Continuous automated monitoring; monthly availability reporting	Monitoring data, monthly reports, SLA compliance history, improvement records	Monitoring exists but is infrastructure-only, not service-level; reports not maintained
Capacity	Capacity plan per resource category; demand forecasting based on business growth	Quarterly capacity reviews, trending utilization data, threshold alerting	Capacity plan, utilization monitoring data, capacity reviews, capacity change records	Capacity planning one-time effort, then abandoned; no ongoing reviews or updates
Continuity	BIA for each service defining RTO and RPO; service continuity plan per service	Plan testing at least annually; plan review and update following testing	BIA documentation, current service continuity plans, test records and results, evidence of plan review and updates	Plans exist but have never been tested; RTO/RPO targets infeasible given infrastructure

Service Continuity Plan Content Template

Section	Content Required	Owner	Review Frequency
Service Description	Service name, scope, criticality classification, RTO, RPO, impact of unavailability	Service manager	Annually or when service changes
Recovery Resources	Primary infrastructure, backup/recovery infrastructure, alternative delivery options, resource contacts	Technical team	Quarterly during capacity reviews
Recovery Procedures	Step-by-step procedures to recover service from failure, failover procedures, restoration procedures, validation steps	Technical team	After each test or service architecture change
Roles & Responsibilities	Who is continuity event coordinator, who executes technical recovery, who communicates with customers, escalation authority	Service manager	Annually or when organizational structure changes
Communication Plan	Internal notification list, customer notification procedures, status update frequency, escalation points	Service manager	Annually
Testing Schedule	Testing frequency (minimum annual), test type (tabletop, technical, simulation), test scope	Service manager	Annually to schedule next test

IMPORTANT

Continuity plans that have never been tested provide false assurance. A plan that looks good on paper may not work in practice—recovery procedures may have steps that are no longer valid, recovery resources may not be available, procedure complexity may be higher than estimated. ISO 20000 requires testing at defined intervals and requires documented evidence of that testing. An untested continuity plan is a finding in an audit. The first test should be at least 6 months before Stage 2 audit.

BITLION INSIGHT

Bitlion GRC availability tracking provides real-time service availability monitoring with threshold-based alerting and automated availability reporting. The capacity planning module supports demand forecasting and capacity planning workflows. The service continuity plan library includes ISO 20000-aligned templates for BIA, RTO/RPO definition, and recovery procedures. Testing workflows enable documentation of tabletop exercises and technical tests, with lessons-learned capture.
