Overview: The Assurance Practices
Availability management, capacity management, and service continuity management are the three assurance practices that work together to ensure services can be delivered at agreed levels—in normal operations (availability), in the future with growing demand (capacity), and in disruption scenarios (continuity). These practices are less visible day-to-day than incident and change management but equally important for audit compliance and for demonstrating management competence to customers. While incident management responds to failures that have already occurred, the assurance practices are proactive—they aim to prevent failures from occurring and ensure recovery if they do.
Implementing Availability Management
Defining Availability
Availability is the percentage of agreed service time during which the service is actually available and performing at acceptable levels. It is calculated as: Availability (%) = (Total Service Time - Downtime) / Total Service Time × 100. The definition of downtime must be explicit: state whether only a full outage (service completely unavailable) counts, or whether degraded performance (service available but performing below acceptable thresholds) counts as well. Availability must be measured per service, not as an overall aggregate, because services differ in criticality and therefore in their agreed availability targets.
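As a worked illustration of the calculation above, here is a minimal Python sketch. The function name and the sample figures (a 30-day month of 24x7 service with 216 minutes of counted downtime) are assumptions chosen for the example, not values from any standard.

```python
def availability_pct(agreed_minutes: float, downtime_minutes: float) -> float:
    """Availability as a percentage of agreed service time."""
    if agreed_minutes <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_minutes <= agreed_minutes:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    return (agreed_minutes - downtime_minutes) / agreed_minutes * 100


# Hypothetical month: 30 days of 24x7 service, 216 minutes of counted downtime.
agreed = 30 * 24 * 60  # 43,200 minutes of agreed service time
print(round(availability_pct(agreed, 216), 3))  # 99.5
```

Whether degraded-performance minutes are added to `downtime_minutes` depends on the downtime definition agreed for the service.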
Availability Targets
Availability targets in SLAs are customer-facing commitments. Internal availability monitoring targets should be set higher than SLA commitments to provide a buffer. Example: if the SLA commitment is 99.5% availability, the internal target might be 99.7%. This buffer ensures that even if availability drifts slightly below the internal target, the SLA commitment is still met. Availability targets vary by service criticality. Critical business services might have 99.9% targets; non-critical services might have 95% targets. Availability target setting must be based on what is achievable given your actual infrastructure capability.
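The buffer between the internal target and the SLA commitment is easier to reason about when expressed as minutes of permitted downtime. The sketch below does that conversion for the 99.5% and 99.7% figures from the example; the function name and the 30-day, 24x7 service window are assumptions.

```python
def allowed_downtime_minutes(target_pct: float, agreed_minutes: float) -> float:
    """Downtime that can accumulate in the period before the target is missed."""
    return agreed_minutes * (1 - target_pct / 100)


agreed = 30 * 24 * 60  # assumed 30-day month of 24x7 service, in minutes

sla_budget = allowed_downtime_minutes(99.5, agreed)       # about 216 minutes
internal_budget = allowed_downtime_minutes(99.7, agreed)  # about 129.6 minutes
buffer = sla_budget - internal_budget                      # about 86.4 minutes

print(round(sla_budget, 1), round(internal_budget, 1), round(buffer, 1))
```

Under these assumptions, missing the internal target still leaves roughly 86 minutes of downtime before the SLA commitment itself is breached.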
Availability Monitoring
Availability monitoring is continuous, automated monitoring that measures whether services are available and performing at acceptable levels. Monitoring approaches: (1) Synthetic monitoring: automated transactions that simulate user activity and confirm the service can process them (e.g., automated user login test, automated transaction submission); (2) Agent-based monitoring: monitoring agents installed on infrastructure report availability data (an agent on a server reports "server is up and responding"); (3) SNMP monitoring: infrastructure devices report status via SNMP (a network device reports its state). Monitoring must cover all in-scope services and their critical components. The check interval limits the precision of availability reporting, because an outage shorter than the interval can fall between two checks and go unrecorded. Hourly checks reliably detect only outages that last an hour or longer; more frequent checks (15-minute intervals) catch brief outages that hourly checks would miss. Alert thresholds define when availability has degraded enough to warrant action.
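The effect of check frequency can be illustrated with a small model. This sketch assumes probes run at fixed times 0, interval, 2 × interval, and so on, and that an outage is recorded only if a probe lands inside it; real monitoring tools may behave differently, and the function name is hypothetical.

```python
import math


def outage_detected(start_min: float, duration_min: float, interval_min: float) -> bool:
    """True if a fixed-interval probe falls inside the window [start, start + duration)."""
    # First probe time that is at or after the start of the outage.
    first_probe = math.ceil(start_min / interval_min) * interval_min
    return first_probe < start_min + duration_min


# Hypothetical outage: begins 20 minutes past the hour and lasts 25 minutes.
print(outage_detected(20, 25, 60))  # False: hourly probes at 0 and 60 both miss it
print(outage_detected(20, 25, 15))  # True: the probe at minute 30 lands inside it
```

In this model, any outage at least as long as the check interval is always seen by some probe; shorter outages are seen only if a probe happens to fall inside them.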
Availability Reporting
Monthly availability reports per service are required. Report content: availability percentage achieved in the month, comparison to target, trend analysis (is availability improving or declining, and are there repeating patterns?), list of outages (date, duration, reason), and SLA achievement (did we meet SLA availability commitments?). Availability reports must be prepared and reviewed by the service manager before management review. These reports constitute documented information for audit. Auditors will examine 3–6 months of availability reports to assess whether availability management is functioning.
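The report fields listed above can be derived from a simple outage log. The following is a minimal sketch, not a reporting tool: the `Outage` record, the `monthly_summary` function, and all sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Outage:
    day: date
    minutes: int
    reason: str


def monthly_summary(service, agreed_minutes, sla_target_pct, outages):
    """Summarize one service's month: achieved availability, SLA result, repeat causes."""
    downtime = sum(o.minutes for o in outages)
    achieved = (agreed_minutes - downtime) / agreed_minutes * 100
    reason_counts = Counter(o.reason for o in outages)
    return {
        "service": service,
        "availability_pct": round(achieved, 3),
        "sla_target_pct": sla_target_pct,
        "sla_met": achieved >= sla_target_pct,
        "outage_count": len(outages),
        # Reasons that occur more than once hint at a repeating pattern.
        "repeat_reasons": sorted(r for r, n in reason_counts.items() if n > 1),
    }


# Hypothetical outage log for a 31-day month of 24x7 service.
march_outages = [
    Outage(date(2024, 3, 4), 45, "storage controller fault"),
    Outage(date(2024, 3, 19), 60, "storage controller fault"),
    Outage(date(2024, 3, 27), 20, "certificate expiry"),
]
report = monthly_summary("email", 31 * 24 * 60, 99.5, march_outages)
print(report)
```

A repeated reason in `repeat_reasons` is the kind of pattern that would feed an availability improvement plan.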
Availability Improvement
When availability consistently falls below target, an availability improvement plan is required. The improvement plan identifies the root cause of availability problems (is a particular infrastructure component failing repeatedly?), and the improvement actions (infrastructure upgrades, design changes, additional redundancy). Availability improvement actions often generate change records (upgrade a failing component, implement redundancy).
Availability Records for Audit
Auditors examine: monitoring data (showing that availability monitoring actually occurred), monthly availability reports (showing that availability was analyzed and reported), SLA compliance history (did we meet our commitments?), and improvement records (when availability fell short, was action taken?). Organizations should maintain at least 3 months of availability records before Stage 2 audit.
Implementing Capacity Management
Capacity Planning Scope
Capacity management must address all resources that constrain service delivery. Common resource categories: compute (CPU utilization on servers—when CPU reaches 80–90%, performance degrades), storage (disk utilization—when disk reaches 85%, performance degrades), network bandwidth (link utilization—when links reach saturation, congestion occurs), application concurrency (how many users can the application support simultaneously?), and human resources (do you have enough support staff to meet incident SLA targets?). Prioritize capacity management by service criticality—plan capacity for critical services first.
Demand Forecasting
Demand forecasting translates business requirements into capacity needs. Input data: business growth projections (how much will transaction volume increase?), new services being launched (each new service adds demand), seasonal patterns (do services peak at certain times of year?). Translate business demand into technical capacity requirements: "we expect 20% transaction volume growth" translates to "we need 20% increase in database capacity and application server capacity." Demand forecasting should occur quarterly or when major business changes occur.
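The translation from business growth to technical capacity can be sketched as a simple calculation. This example assumes demand scales linearly with transaction volume, which real systems often do not; the function name, the optional utilization ceiling, and the sample figures are illustrative assumptions.

```python
def required_capacity(current_peak: float, growth_pct: float,
                      max_utilization_pct: float = 100.0) -> float:
    """Capacity needed so the projected peak stays at or below a utilization ceiling."""
    if not 0 < max_utilization_pct <= 100:
        raise ValueError("max_utilization_pct must be in (0, 100]")
    projected_peak = current_peak * (1 + growth_pct / 100)
    return projected_peak / (max_utilization_pct / 100)


# Hypothetical service peaking at 1,000 transactions per second, 20% growth forecast.
print(round(required_capacity(1000, 20)))      # 1200: capacity equal to the projected peak
print(round(required_capacity(1000, 20, 80)))  # 1500: keeps projected utilization at 80%
```

The second call shows why a forecast of 20% growth can call for more than 20% extra capacity once an alert threshold such as 80% is taken into account.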
Capacity Monitoring
Real-time monitoring tracks resource utilization: CPU, storage, network bandwidth, application threads. Thresholds are set that trigger alerts when utilization approaches limits. Example thresholds: alert at 80% CPU utilization (action needed before critical shortage); alert at 85% storage utilization (need to free space or add capacity). Trending of capacity data over time shows utilization trajectory—is utilization rising toward the limit, or is it stable? Trend data feeds into capacity planning.
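Trending can be turned into a rough "days until threshold" estimate. The sketch below fits a least-squares straight line to daily utilization samples; linear growth is an assumption, and the function name and sample data are hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(samples: Sequence[float], threshold_pct: float) -> Optional[float]:
    """Days after the last sample until a linear trend reaches the threshold.

    Returns None when utilization is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2                     # sample days are 0, 1, ..., n-1
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx                        # percentage points per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - (n - 1))  # 0.0 if the trend is already past the threshold


# Hypothetical daily storage utilization (%), rising one point per day.
print(days_until_threshold([70, 71, 72, 73, 74], 85))  # 11.0
print(days_until_threshold([60, 60, 60], 85))          # None: utilization is stable
```

An estimate like this gives capacity planning a lead time for scheduling the change record before the alert threshold is reached.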
Capacity Planning Documents
The capacity plan is a documented plan covering: current capacity (we have X units of storage), projected demand (based on business forecast, we will need Y units in Q3), planned capacity changes (we will add Z units via this change record on this date), and timeline. The capacity plan is reviewed on a defined schedule (quarterly recommended). When capacity monitoring shows utilization rising toward limits, capacity planning responds by scheduling capacity increases via change management.
Capacity Incidents
When capacity is exhausted (disk full, bandwidth exhausted, CPU maxed), service performance degrades or service becomes unavailable. This triggers incident management. The incident record captures the capacity problem. Capacity management investigates and implements a solution (add more capacity, optimize utilization). Recurring capacity incidents indicate that capacity planning is not keeping pace with demand and trigger capacity problem management (identify the root cause of repeated capacity exhaustion and implement a structural fix).
Capacity Records for Audit
Auditors examine: utilization monitoring data (showing that capacity was actually monitored), capacity plans (showing that demand was forecasted and capacity was planned), evidence of capacity reviews (when did you last review capacity data and update plans?), and capacity-related change records (when capacity was exhausted, what action did you take?). At least 3–6 months of capacity data should be retained.
Implementing Service Continuity Management
Alignment with ISO 22301
ISO 22301 (Business Continuity Management) is the broader standard that addresses how organizations prepare for and recover from major disruptions. ISO 20000 service continuity management (Clause 8.7.2 of ISO/IEC 20000-1:2018) is narrower in scope: it focuses on recovery of in-scope services rather than full business continuity. For organizations certified to ISO 22301, service continuity plans should be derived from the organizational Business Continuity Plan. For organizations not certified to ISO 22301, ISO 20000 still requires service continuity plans covering all in-scope services.
Business Impact Analysis (BIA) for Services
BIA identifies, for each in-scope service: the Recovery Time Objective (RTO), which is the maximum tolerable downtime before the service must be restored (e.g., email service RTO = 4 hours; non-critical reporting service RTO = 2 days); and the Recovery Point Objective (RPO), which is the maximum tolerable data loss (e.g., email RPO = latest backup [usually same-day]; transaction system RPO = zero data loss [continuous replication required]). BIA also identifies the business impact of the service being unavailable for defined durations (if email is down for 4 hours, we lose X business value; if our transaction system is down for 8 hours, we cannot process customer orders). RTO and RPO values drive infrastructure design: a long RTO (measured in days) can be met by restoring from a simple backup, whereas a short RTO (for example, 1 hour) requires active redundancy.
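The way RTO and RPO drive design choices can be sketched as a lookup. The 1-hour cut-off below is an illustrative assumption, not a rule from ISO 20000, and the function name and returned labels are hypothetical.

```python
from datetime import timedelta
from typing import Tuple

# Assumed cut-off for illustration only: at or below this RTO, restoring from
# backup is treated as too slow and active redundancy is indicated.
ACTIVE_REDUNDANCY_RTO = timedelta(hours=1)


def recovery_approach(rto: timedelta, rpo: timedelta) -> Tuple[str, str]:
    """Return (restoration approach, data protection approach) for a service."""
    restoration = "active redundancy" if rto <= ACTIVE_REDUNDANCY_RTO else "restore from backup"
    protection = "continuous replication" if rpo == timedelta(0) else "periodic backup"
    return restoration, protection


# Hypothetical BIA entries.
print(recovery_approach(timedelta(days=2), timedelta(hours=24)))  # reporting service
print(recovery_approach(timedelta(hours=1), timedelta(0)))        # transaction system
```

In practice the cut-off depends on how quickly a restore from backup can actually be completed for the service in question, which is something recovery testing would establish.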
Service Continuity Plans
A service continuity plan describes how a service will be recovered if a major failure occurs. Plan content: recovery procedures (step-by-step actions to restore the service), roles and responsibilities during a continuity event (who makes decisions, who executes recovery, who communicates with customers?), communication plan (who is notified of the disruption, what information is communicated, how often are updates provided?), alternative delivery options (if the service cannot be recovered from primary systems, can it be delivered from alternative systems?), escalation to senior management (when does a continuity event trigger executive escalation?), and return to normal operations (how do you know recovery is complete, how do you return to primary systems?). One plan per service is preferred to a consolidated plan, so that each service team owns its own recovery process.
Testing Continuity Plans
ISO 20000 requires continuity plans to be tested at planned intervals; annual testing is the practical minimum. Testing approaches: (1) tabletop exercise: the team reviews the plan and discusses what they would do if an actual disruption occurred (no systems involved, low-cost testing); (2) technical recovery test: actually execute the recovery procedures in a lab environment rather than in production (verifies that the recovery procedures work); (3) full simulation: simulate an actual disruption in the production environment, execute recovery, and measure recovery time and data loss (most realistic but highest risk and cost). Testing must be documented: test plan, test execution log, test results, lessons learned, and recommendations for plan improvements. Plans must be updated following test findings. Evidence of testing (test reports) is required for audit.
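Recovery time and data loss measured during a test only become useful evidence once they are compared with the RTO and RPO. The sketch below shows one way to record that comparison; the `ContinuityTestResult` class and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class ContinuityTestResult:
    service: str
    rto: timedelta
    rpo: timedelta
    measured_recovery: timedelta
    measured_data_loss: timedelta

    def findings(self) -> List[str]:
        """List objectives that the test failed to meet (empty list if all were met)."""
        issues = []
        if self.measured_recovery > self.rto:
            issues.append(
                f"recovery took {self.measured_recovery}, exceeding RTO of {self.rto}")
        if self.measured_data_loss > self.rpo:
            issues.append(
                f"data loss window was {self.measured_data_loss}, exceeding RPO of {self.rpo}")
        return issues


# Hypothetical technical recovery test for an email service.
result = ContinuityTestResult(
    service="email",
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
    measured_recovery=timedelta(hours=5, minutes=30),
    measured_data_loss=timedelta(hours=6),
)
for issue in result.findings():
    print(f"{result.service}: {issue}")
```

Each finding would normally become a lessons-learned entry and a recommendation for updating the plan.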
Continuity Plan Triggers
Define what constitutes a continuity event vs. a normal major incident. A continuity event is typically a disruption that affects the service severely enough that normal incident management procedures are insufficient. Examples: data center failure (affects multiple services simultaneously), multi-day outage (recovery will take longer than normal SLA timeframes), major disaster (earthquake, flood, power loss at data center). A major incident that can be resolved within normal SLA timeframes is not a continuity event. The decision to invoke continuity plans must be authorized by the service manager or incident manager, and the decision must be documented in the continuity event log.
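Trigger criteria are easier to apply consistently when written down as an explicit rule. The sketch below encodes two criteria taken from the examples above (loss of a site, or expected recovery longer than the SLA timeframe); the function name and parameters are hypothetical, and the result only flags a candidate, since the decision to invoke still rests with the authorized manager.

```python
from datetime import timedelta


def is_continuity_candidate(estimated_recovery: timedelta,
                            sla_resolution_time: timedelta,
                            site_lost: bool) -> bool:
    """Flag a disruption that normal incident management is unlikely to resolve in time."""
    return site_lost or estimated_recovery > sla_resolution_time


# Hypothetical cases, each with an 8-hour SLA resolution timeframe.
sla = timedelta(hours=8)
print(is_continuity_candidate(timedelta(hours=3), sla, site_lost=False))  # False: major incident
print(is_continuity_candidate(timedelta(days=2), sla, site_lost=False))   # True: multi-day outage
print(is_continuity_candidate(timedelta(hours=3), sla, site_lost=True))   # True: data center failure
```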
Evidence for Audit
Auditors examine: BIA documentation (showing RTO and RPO were defined), current version of service continuity plans (showing plans exist and are maintained), test records and results (showing plans are tested), evidence of plan review after tests (showing testing led to plan improvements), and continuity event logs (if any) (showing the procedure is used when needed). Plans that have never been tested are considered unproven and generate audit findings.
Common Implementation Pitfalls
Pitfall 1: Availability monitoring only covers infrastructure (servers, databases), not end-to-end service availability. A typical case: the servers are up, but the application cannot connect to the database because of a network issue, so infrastructure monitoring shows green while the service is actually down. Solution: monitoring must simulate actual user transactions to capture end-to-end availability.
Pitfall 2: Capacity planning done once, then abandoned. A capacity plan is created, then never reviewed or updated. Demand changes, but the plan is not updated. When capacity is exhausted, the organization is caught by surprise. Solution: capacity plans should be reviewed quarterly and updated when business demand changes.
Pitfall 3: Service continuity plans written but never tested. Plans exist on the shelf but have never been executed. When an actual disruption occurs, the plan does not work (steps are incorrect, resources are not available, procedures have become obsolete). Solution: test plans at least annually. Testing is where you discover plan problems in a controlled environment before a real disruption occurs.
Pitfall 4: Continuity plans not updated when service architecture changes. A service is redesigned and moved to a new infrastructure, but the continuity plan is not updated. The plan describes recovery procedures for the old architecture, which are no longer applicable. Solution: trigger a continuity plan review whenever a service architecture change occurs.
Pitfall 5: RTO and RPO targets set without checking feasibility. An RTO of 1 hour requires active redundancy (expensive), but budget is not allocated. When the RTO cannot be met, the customer is disappointed and auditors find a control gap. Solution: ensure RTO/RPO targets are feasible within budget and infrastructure constraints.
| KEY CONCEPT | The difference between availability management and service continuity management: Availability management operates in normal circumstances, aiming to maintain agreed performance day-to-day through monitoring and improvement. Service continuity management operates in exceptional circumstances, addressing how a service will be recovered from major failures. Both are required. Availability management prevents failures; service continuity management enables recovery from the failures that do occur. |
Assurance Practice Implementation Requirements
| Practice | Planning Requirement | Operational Activity | Evidence for Audit | Common Gap |
|---|---|---|---|---|
| Availability | Define availability target per service based on criticality; align internal targets to SLA commitments | Continuous automated monitoring; monthly availability reporting | Monitoring data, monthly reports, SLA compliance history, improvement records | Monitoring exists but is infrastructure-only, not service-level; reports not maintained |
| Capacity | Capacity plan per resource category; demand forecasting based on business growth | Quarterly capacity reviews, trending utilization data, threshold alerting | Capacity plan, utilization monitoring data, capacity reviews, capacity change records | Capacity planning one-time effort, then abandoned; no ongoing reviews or updates |
| Continuity | BIA for each service defining RTO and RPO; service continuity plan per service | Plan testing at least annually; plan review and update following testing | BIA documentation, current service continuity plans, test records and results, evidence of plan review and updates | Plans exist but have never been tested; RTO/RPO targets infeasible given infrastructure |
Service Continuity Plan Content Template
| Section | Content Required | Owner | Review Frequency |
|---|---|---|---|
| Service Description | Service name, scope, criticality classification, RTO, RPO, impact of unavailability | Service manager | Annually or when service changes |
| Recovery Resources | Primary infrastructure, backup/recovery infrastructure, alternative delivery options, resource contacts | Technical team | Quarterly during capacity reviews |
| Recovery Procedures | Step-by-step procedures to recover service from failure, failover procedures, restoration procedures, validation steps | Technical team | After each test or service architecture change |
| Roles & Responsibilities | Who is continuity event coordinator, who executes technical recovery, who communicates with customers, escalation authority | Service manager | Annually or when organizational structure changes |
| Communication Plan | Internal notification list, customer notification procedures, status update frequency, escalation points | Service manager | Annually |
| Testing Schedule | Testing frequency (minimum annual), test type (tabletop, technical, simulation), test scope | Service manager | Annually to schedule next test |
| IMPORTANT | Continuity plans that have never been tested provide false assurance. A plan that looks good on paper may not work in practice—recovery procedures may have steps that are no longer valid, recovery resources may not be available, procedure complexity may be higher than estimated. ISO 20000 requires testing at defined intervals and requires documented evidence of that testing. An untested continuity plan is a finding in an audit. The first test should be at least 6 months before Stage 2 audit. |
| BITLION INSIGHT | Bitlion GRC availability tracking provides real-time service availability monitoring with threshold-based alerting and automated availability reporting. Capacity planning module supports demand forecasting and capacity planning workflows. Service continuity plan library includes ISO 20000-aligned templates for BIA, RTO/RPO definition, and recovery procedures. Testing workflows enable documentation of tabletop exercises and technical tests with lessons learned capture. |