Why Resolution Practices Matter
Resolution practices (incident management, problem management, and service request management) are the highest-priority implementation area for ISO 20000 compliance, for three reasons. First, these are the practices customers experience directly. When a service fails, the customer interaction is with incident management; when a customer needs something new, it is with service request management. Customer satisfaction depends largely on how well these practices work. Second, resolution practices generate more audit evidence than any other management system area. Incident records, problem investigation documentation, and service request fulfillment records are examined in minute detail during Stage 2 audits: an auditor examining a sample of 25 incident records will check each one for proper classification, correct priority assignment, appropriate escalation, SLA compliance, and closure verification. Third, resolution practices are the most common source of nonconformities in ISO 20000 audits. Organizations implement change management and availability management, but without sound incident, problem, and service request management, the audit will reveal systemic gaps.
Implementing Incident Management
Define the Incident Classification Scheme
The incident classification scheme is the foundation of incident management, and it must align with your service portfolio. Typical primary categories include hardware (servers, storage, network devices, laptops), software (applications, middleware, operating systems), network (connectivity, firewalls, routers), security (security incidents, data breaches, unauthorized access), and application (software defects, performance, integration issues). For each primary category, define secondary categories; under "software," for example, you might have "application performance degradation," "application unavailable," "integration failure," and "data corruption." The scheme must be exhaustive enough to cover every incident you encounter, specific enough that each category is meaningful, and taught to all frontline technical staff so that incidents are classified consistently. Maintain the classification scheme as a documented procedure; update it annually or when new incident categories emerge.
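A classification scheme can be held as simple structured data that the ticketing tool validates against, which is what keeps classification consistent in practice. A minimal sketch in Python (the category names echo the examples above and are illustrative, not a prescribed taxonomy):

```python
# Two-level incident classification scheme: primary category -> secondary categories.
# Category names are illustrative; align them with your own service portfolio.
CLASSIFICATION_SCHEME = {
    "hardware": ["server failure", "storage failure", "network device failure", "laptop fault"],
    "software": ["application performance degradation", "application unavailable",
                 "integration failure", "data corruption"],
    "network": ["connectivity loss", "firewall issue", "routing issue"],
    "security": ["security incident", "data breach", "unauthorized access"],
    "application": ["software defect", "performance issue", "integration issue"],
}

def validate_classification(primary: str, secondary: str) -> bool:
    """Accept only category pairs that exist in the documented scheme,
    which forces consistent classification at ticket creation."""
    return secondary in CLASSIFICATION_SCHEME.get(primary, [])

assert validate_classification("software", "data corruption")
assert not validate_classification("software", "laptop fault")
```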
Design the Priority Matrix
The priority matrix is the instrument that translates incident characteristics into priority levels. It uses two dimensions: impact (how many users or services are affected, and how critical the affected service is to the business) and urgency (how time-sensitive the issue is). The output is a priority level (P1, P2, P3, P4 or equivalent). Define each level concretely. For example: P1 = full outage of a critical service affecting more than 100 users; P2 = partial degradation affecting 50 or more users, or full outage of a non-critical service affecting more than 100 users; P3 = partial degradation affecting fewer than 50 users, or isolated impact on critical functionality; P4 = no business impact, informational, or cosmetic issues. Make the priority levels unambiguous so that different technical staff members assign the same incident to the same priority level.
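Encoding the matrix as a lookup table removes interpretation from the assignment step entirely. A sketch, assuming illustrative three-point impact and urgency scales (the mapping itself is an example, not a mandated one):

```python
# Priority matrix: (impact, urgency) -> priority level.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P4",
}

def assign_priority(impact: str, urgency: str) -> str:
    """Deterministic priority assignment, so two staff members
    classify the same incident identically."""
    return PRIORITY_MATRIX[(impact, urgency)]

print(assign_priority("high", "high"))  # P1: full outage of a critical service
```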
Set SLA Targets by Priority
SLA targets must be set by priority level and include response time (how quickly the incident is acknowledged and investigation begins) and resolution time (how quickly the incident is fully resolved). Example targets: P1 response 15 minutes, resolution 4 hours; P2 response 1 hour, resolution 8 hours; P3 response 4 hours, resolution 2 business days; P4 response 1 business day, resolution 5 business days. Critically, SLA targets must be achievable given your actual technical capability: setting a 1-hour P1 resolution target when your average P1 investigation takes 3 hours guarantees SLA non-compliance. Connect internal targets to the commitments in customer SLAs. If you commit to customers that critical incidents are resolved within 4 hours, your internal P1 target must be 4 hours or better.
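Targets can be stored next to the priority matrix and turned into concrete deadlines when an incident is opened. A sketch using the example targets above (business-day targets are treated as calendar time here for brevity; a real implementation would apply a business calendar):

```python
from datetime import datetime, timedelta

# Response/resolution targets per priority, mirroring the example targets above.
SLA_TARGETS = {
    "P1": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4)},
    "P2": {"response": timedelta(hours=1),    "resolution": timedelta(hours=8)},
    "P3": {"response": timedelta(hours=4),    "resolution": timedelta(days=2)},
    "P4": {"response": timedelta(days=1),     "resolution": timedelta(days=5)},
}

def sla_deadlines(priority: str, opened_at: datetime) -> dict:
    """Compute the response and resolution deadlines for a new incident."""
    targets = SLA_TARGETS[priority]
    return {
        "respond_by": opened_at + targets["response"],
        "resolve_by": opened_at + targets["resolution"],
    }

deadlines = sla_deadlines("P1", datetime(2025, 3, 3, 9, 0))
print(deadlines["resolve_by"])  # 2025-03-03 13:00:00, the 4-hour P1 target
```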
Design Escalation Procedures
Escalation procedures ensure that incidents exceeding normal resolution capability are escalated appropriately. Functional escalation moves an incident to a more experienced resolver or specialized support team (frontline support → specialized technical team → vendor support). Hierarchical escalation moves an incident to management (incident manager → service manager → director). Define the escalation triggers: when functional escalation occurs (typically after the current resolver has spent a defined amount of effort without resolving the incident), what the escalation path is, how the escalation is communicated, and whether there are time gates between escalation levels. Example: if a P1 is not resolved in 2 hours, escalate to Level 3 Engineering; if not resolved in 4 hours, escalate to the Incident Manager.
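Time gates translate directly into a scheduled check against the escalation path. A sketch mirroring the P1 example (the P2 gates and team names are illustrative):

```python
from datetime import datetime, timedelta

# Time gates per priority: elapsed time -> escalation target.
ESCALATION_GATES = {
    "P1": [(timedelta(hours=2), "Level 3 Engineering"),
           (timedelta(hours=4), "Incident Manager")],
    "P2": [(timedelta(hours=4), "Level 3 Engineering"),
           (timedelta(hours=8), "Incident Manager")],
}

def due_escalations(priority: str, opened_at: datetime, now: datetime) -> list[str]:
    """Return the escalation targets whose time gate has elapsed
    for an unresolved incident."""
    elapsed = now - opened_at
    return [target for gate, target in ESCALATION_GATES.get(priority, [])
            if elapsed >= gate]

opened = datetime(2025, 3, 3, 9, 0)
print(due_escalations("P1", opened, datetime(2025, 3, 3, 11, 30)))
# ['Level 3 Engineering']: the 2-hour gate has passed, the 4-hour gate has not
```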
Major Incident Procedure
Define the criteria for declaring a "major incident." Typically, a major incident is a P1 incident (full outage of a critical service) or an incident with widespread customer impact or regulatory implications. When a major incident is declared, a major incident manager is assigned, a war room or bridge call is convened, and accelerated communication protocols are activated: updates go to stakeholders every 15–30 minutes instead of at standard intervals. A post-incident review is scheduled and conducted within 2–5 business days. The major incident procedure must be documented and practiced.
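The declaration criteria are simple enough to encode as a guard in tooling or a runbook check. A minimal sketch, assuming a three-part test that mirrors the criteria above:

```python
def is_major_incident(priority: str, widespread_customer_impact: bool,
                      regulatory_implications: bool) -> bool:
    """Declare a major incident for any P1, or for any incident with
    widespread customer impact or regulatory implications."""
    return priority == "P1" or widespread_customer_impact or regulatory_implications

assert is_major_incident("P1", False, False)
assert is_major_incident("P2", widespread_customer_impact=True,
                         regulatory_implications=False)
```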
Incident Record Structure
Incident records are the direct evidence auditors examine. Required fields in an incident record: incident ID (unique identifier), date/time opened, reporter (who reported the incident), affected service or component, initial priority assessment, classification (the category assigned), assigned resolver or team, status (open, investigating, resolved, closed), description of what happened, actions taken to resolve, root cause (if identified), resolution applied, date/time closed, actual SLA status (Met/Missed), and for missed SLA incidents, the justification or root cause of the SLA miss. These fields form the basis of the audit evidence trail. Auditors will sample 20–30 incident records and examine each field for completeness and accuracy.
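The required fields map naturally onto a record schema that tooling can enforce at closure. A sketch in Python (field names are illustrative, not mandated by the standard):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IncidentRecord:
    """The fields auditors sample during Stage 2; names are illustrative."""
    incident_id: str
    opened_at: datetime
    reporter: str
    affected_service: str
    priority: str                      # e.g. "P1".."P4"
    classification: tuple[str, str]    # (primary, secondary) category
    assigned_to: str
    status: str                        # open / investigating / resolved / closed
    description: str
    actions_taken: list[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    resolution: Optional[str] = None
    closed_at: Optional[datetime] = None
    sla_status: Optional[str] = None   # "Met" or "Missed"
    sla_miss_justification: Optional[str] = None  # required when SLA is missed

    def is_audit_complete(self) -> bool:
        """A closed record must carry a resolution, closure date, SLA status,
        and, if the SLA was missed, a documented justification."""
        if self.status != "closed":
            return True
        if not (self.resolution and self.closed_at and self.sla_status):
            return False
        return self.sla_status != "Missed" or bool(self.sla_miss_justification)
```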
Closure and Verification
Before an incident is closed, verification from the reporter or customer that the incident is truly resolved is required. This prevents "we fixed it from our perspective but the customer still has the problem" outcomes. Closure verification may be conducted via email, call, or portal confirmation. Auto-closure (system automatically closing an incident after X days of no activity) carries risk—the customer may never have confirmed resolution. If auto-closure is used, the confirmation attempt must be logged and the incident must be escalated for manual verification if the customer does not respond.
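If auto-closure is used, the safety conditions above can be enforced in code. A sketch, assuming a five-day inactivity window and the field names shown (both are illustrative):

```python
from datetime import datetime, timedelta

AUTO_CLOSE_AFTER = timedelta(days=5)  # illustrative inactivity window

def try_auto_close(incident: dict, now: datetime) -> str:
    """Auto-close only when a confirmation attempt was logged; otherwise
    route the incident to manual verification instead of closing silently."""
    inactive = now - incident["last_activity_at"] >= AUTO_CLOSE_AFTER
    if incident["status"] != "resolved" or not inactive:
        return "no_action"
    if not incident.get("confirmation_attempt_logged"):
        return "escalate_for_manual_verification"
    incident["status"] = "closed"
    incident["closure_note"] = "Auto-closed: confirmation attempt logged, no customer response."
    return "auto_closed"
```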
Incident Metrics
Track and report: incident volume by priority (are P1 incidents increasing or stable?), SLA achievement rate by priority (what % of P1 incidents met SLA), mean time to resolve by priority (average P1 resolution time), and reopened incident rate (what % of incidents were reopened after closure—indicates closure without true resolution). These metrics feed into Clause 9 performance evaluation and provide data for management review.
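All four metrics fall out of the incident records directly. A sketch, assuming each record carries its priority, SLA status, resolution time in hours, and a reopened flag:

```python
from statistics import mean

def incident_metrics(records: list[dict]) -> dict:
    """Compute volume, SLA achievement, and mean time to resolve per
    priority, plus the overall reopened-incident rate."""
    closed = [r for r in records if r.get("resolution_hours") is not None]
    by_priority = {}
    for p in sorted({r["priority"] for r in closed}):
        subset = [r for r in closed if r["priority"] == p]
        by_priority[p] = {
            "volume": len(subset),
            "sla_achievement_pct": 100 * sum(r["sla_status"] == "Met" for r in subset) / len(subset),
            "mean_hours_to_resolve": mean(r["resolution_hours"] for r in subset),
        }
    reopened_pct = 100 * sum(r["reopened"] for r in closed) / len(closed) if closed else 0
    return {"by_priority": by_priority, "reopened_pct": reopened_pct}
```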
Implementing Service Request Management
Define the Service Request Catalogue
Service requests are requests from users for something to be provided: a new account, hardware, a software installation, information. A service request is not a problem to be fixed; it is a standard fulfillment process. Common service request types: access provisioning (new user account, group membership, application access), hardware requests (new laptop, monitor, phone), software installation or license requests, information requests (password reset, documentation, how-to guidance), and configuration changes within policy (update personal proxy settings, change email forwarding). Define clearly what is a service request versus an incident: "the email server is down" is an incident; "reset my email password" is a service request. A service request can also escalate into another record type: if the fulfillment process uncovers a technical failure, an incident is raised, and if the request exposes a recurring policy or process issue, it may be escalated as a problem.
Pre-Approved Fulfillment Procedures
For each service request type in the catalogue, define a pre-approved fulfillment procedure. A procedure includes the steps to fulfill the request, required approvals (e.g., manager approval for a hardware purchase), the SLA target (how quickly the request must be fulfilled), and the documentation to be generated. This prevents ad-hoc handling that produces no records. Example for a new user account request: user submits via portal → manager approves → IT creates account in AD → credentials sent → CMDB updated → request closed. SLA target: 1 business day. All steps must be documented in the service request record.
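Pre-approved procedures work well as catalogue configuration that the fulfillment tool walks through step by step. A sketch using the new-account example (the hardware entry and all names are illustrative):

```python
# One catalogue entry per request type; the first entry mirrors the
# new-account example above. Structure and names are illustrative.
SERVICE_REQUEST_CATALOGUE = {
    "new_user_account": {
        "approvals": ["line manager"],
        "sla": "1 business day",
        "steps": [
            "user submits request via portal",
            "manager approves",
            "IT creates account in AD",
            "send credentials to user",
            "update CMDB",
            "close request",
        ],
    },
    "hardware_request": {
        "approvals": ["line manager", "budget holder"],
        "sla": "5 business days",
        "steps": ["submit via portal", "obtain approvals", "procure",
                  "deliver", "update CMDB", "close request"],
    },
}

def fulfillment_checklist(request_type: str) -> list[str]:
    """Return the documented steps, so every fulfillment follows the
    pre-approved procedure and leaves a complete record."""
    return SERVICE_REQUEST_CATALOGUE[request_type]["steps"]
```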
Distinguishing Service Requests from Changes
Standard, low-risk changes that are pre-approved may be handled as service requests rather than requiring a full change management process. Example: "reset password" is a low-risk standard change handled as a service request. "Install new version of application" is a change requiring change management. The boundary must be clearly defined. Generally, if the change is pre-approved, repeatable, and poses minimal risk, it can be a service request. Otherwise, it is a change.
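The boundary rule reduces to a simple conjunction that intake tooling can apply consistently; a minimal sketch:

```python
def route_as_service_request(pre_approved: bool, repeatable: bool,
                             minimal_risk: bool) -> bool:
    """Apply the boundary rule: pre-approved + repeatable + minimal risk
    routes to service request fulfillment; anything else goes to change
    management."""
    return pre_approved and repeatable and minimal_risk

assert route_as_service_request(True, True, True)        # e.g. password reset
assert not route_as_service_request(False, True, False)  # e.g. application upgrade
```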
Service Request Records
Service request records must show: request ID, date opened, requester, type (from the catalogue), description of what is requested, approvals required and their status, fulfillment steps completed, completion date, and SLA status (Met/Missed). Service request records generate the evidence for audit: auditors will examine them to verify that requests are fulfilled consistently and on time.
Implementing Problem Management
Reactive Problem Management
Reactive problem management is triggered by recurring incidents (the same root cause producing multiple incidents), major incidents (the post-incident review process is invoked), or customer escalation (the customer demands root cause analysis). When triggered, a problem record is created and linked to the incident records that triggered it. The problem team conducts root cause analysis (RCA) using a defined methodology. Common RCA methodologies: Five Whys (ask why the incident occurred, then why that cause occurred, and iterate until the root cause is identified), fishbone diagram (list contributing factors in categories: people, process, technology, environment), and fault tree analysis (decompose the failure into causal chains). The RCA must be documented in the problem record, and the identified root cause recorded. If a permanent fix can be implemented immediately, the problem record is marked for fix implementation via change management. If a permanent fix is not yet available, a known error record is created (see below) and the problem remains open until the fix is implemented.
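Because the RCA must survive in the problem record rather than in conversation, it helps to capture the chain as structured data. A Five Whys sketch (the IDs and the example chain are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class FiveWhysRCA:
    """A documented Five Whys chain attached to a problem record.
    Field names are illustrative."""
    problem_id: str
    linked_incidents: list[str]
    whys: list[str] = field(default_factory=list)  # each entry answers "why?" for the previous one

    def root_cause(self) -> str:
        """The last answer in the chain is recorded as the root cause."""
        return self.whys[-1] if self.whys else "RCA not yet completed"

rca = FiveWhysRCA(
    problem_id="PRB-0042",
    linked_incidents=["INC-1001", "INC-1017", "INC-1033"],
    whys=[
        "The application timed out under load.",
        "Database queries slowed during the morning peak.",
        "An index was dropped during last month's schema migration.",
        "The migration script was not peer-reviewed against the index list.",
    ],
)
print(rca.root_cause())
```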
Proactive Problem Management
Proactive problem management actively hunts for problems before they cause multiple incidents. Review incident data monthly to identify trends: are the same components failing repeatedly? Is a particular application generating increasing incident volume? Conduct availability and performance data review to identify emerging issues: is a database slowly degrading in performance, approaching a crisis? Is a storage array approaching capacity? Identify capacity-related problems: when utilization reaches defined thresholds, trigger a capacity problem. Maintain a proactive problem register and review it monthly or quarterly. Assign owners to proactive problems and set target resolution dates.
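Threshold-based triggers make the hunt systematic rather than ad hoc. A minimal sketch, with illustrative thresholds:

```python
# Illustrative thresholds; tune these to your own capacity plan.
CAPACITY_THRESHOLDS = {"storage_utilization_pct": 80, "db_response_ms": 500}

def proactive_problem_triggers(measurements: dict) -> list[str]:
    """Compare current measurements against thresholds and return the
    metrics that should open a proactive problem record."""
    return [metric for metric, limit in CAPACITY_THRESHOLDS.items()
            if measurements.get(metric, 0) >= limit]

print(proactive_problem_triggers({"storage_utilization_pct": 84, "db_response_ms": 310}))
# ['storage_utilization_pct'] -> open a capacity problem with an owner and target date
```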
Known Error Database
A known error is an incident or problem that has a documented root cause and a documented workaround or temporary solution, but the permanent fix has not yet been implemented. The known error is recorded in a known error database so that if the same symptoms occur in an incident, the incident resolver can quickly apply the workaround without waiting for root cause analysis. Example: "Email connection failures on morning startups—known error: network load during business start window. Workaround: restart Outlook. Permanent fix: network infrastructure redesign, planned for Q3." The known error database must be actively used during incident management. When an incident is being investigated, the resolver should search the known error database for matching symptoms and apply the workaround if found. The known error database is reviewed when permanent fixes are implemented (the known error is marked as resolved and removed or archived).
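The "actively used" requirement can be supported by a symptom search that runs during incident investigation. A keyword-matching sketch (a real tool would use full-text search; the entry mirrors the email example above):

```python
# Minimal known error database keyed by symptom keywords. Structure is illustrative.
KNOWN_ERRORS = [
    {
        "id": "KE-0007",
        "symptoms": {"email", "connection", "failure", "morning"},
        "root_cause": "network load during business start window",
        "workaround": "restart Outlook",
        "permanent_fix": "network infrastructure redesign, planned for Q3",
        "status": "open",
    },
]

def search_known_errors(incident_description: str) -> list[dict]:
    """Match open known errors whose symptom keywords appear in the
    incident description, so the resolver can apply the workaround."""
    words = set(incident_description.lower().split())
    return [ke for ke in KNOWN_ERRORS
            if ke["status"] == "open" and len(ke["symptoms"] & words) >= 2]

for ke in search_known_errors("email connection failure reported at 09:05"):
    print(ke["id"], "->", ke["workaround"])
```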
Problem Management Metrics
Track: open problem count and trend (is problem count increasing or decreasing?), problems by age (how many problems are >30 days old?), mean time to root cause (average time from problem opening to RCA completion), known error database utilization rate (what % of incidents match a known error and apply the documented workaround?), and percentage of problems leading to permanent fixes (are problems actually being fixed, or just documented?).
Integration with Change Management
When an RCA identifies a required fix, a change record must be raised. The problem record and change record must be linked. Example: problem record states root cause is "outdated network driver causing intermittent connectivity," change record describes installing updated driver on affected servers. The change goes through the change management approval process. Once the change is implemented, the problem record is closed and the associated known error is archived.
Common Implementation Pitfalls
Pitfall 1: Implementing incident management without service request management. Service requests are handled informally, no records are kept, and there is no documented fulfillment process. This creates confusion (is this an incident or a request?) and no audit evidence for service request handling. Solution: implement both incident and service request management from the start.
Pitfall 2: Creating problem records but never progressing them. A problem queue fills with aging, stale records. Root cause analysis is never completed, and no known errors are created. The queue becomes an appendix and evidence of a non-functional process. Solution: assign ownership, set closure targets, and review the problem queue at management review.
Pitfall 3: Conducting RCA verbally but not documenting it. "We found the root cause" but no record exists. When the same issue occurs weeks later, the knowledge is lost. Solution: require RCA findings to be documented in the problem record and reviewed by a second party for completeness.
Pitfall 4: Creating a known error database but never using it. The database exists but incident resolvers do not search it. Known errors accumulate and become a useless appendix. Solution: make searching the known error database a required step in incident investigation.
Pitfall 5: Setting SLA targets without checking achievability. SLA targets are set to please customers but are unachievable given actual operational capability. Incidents routinely miss SLA. Solution: analyze historical incident data to understand achievable targets before setting them.
| KEY CONCEPT | The three-tier resolution structure operates as an integrated loop: Incident (restore service immediately) → Problem (investigate root cause) → Change (implement permanent fix) → back to CMDB (update configuration record). All three practices must work together. Without problem management, incidents repeat. Without change management, root causes are never fixed. Without CMDB accuracy, change impact assessment fails. |
Incident Priority Matrix
| Priority | Impact Description | Urgency Level | Resolution Target | Update Frequency | Escalation Trigger |
|---|---|---|---|---|---|
| P1 | Full outage of critical service, >100 users affected | Immediate | 4 hours | 30 minutes | Not resolved in 2 hours |
| P2 | Partial degradation or full outage non-critical service, 50–100 users | High | 8 hours | 1 hour | Not resolved in 4 hours |
| P3 | Partial degradation <50 users or isolated functionality impact | Medium | 2 business days | 4 hours | Not resolved in 8 hours |
| P4 | No business impact, cosmetic issues, informational | Low | 5 business days | Daily | As needed |
| Emergency | Widespread customer impact, regulatory breach, data loss | Critical | 1 hour | 15 minutes | Immediate executive escalation |
Resolution Practice Implementation Checklist
| Practice | Design Elements Required | Records Required | Operational Readiness Criteria |
|---|---|---|---|
| Incident Management | Classification scheme, priority matrix, SLA targets by priority, escalation procedures, major incident procedure | Incident records with complete fields: ID, date/time, classification, priority, resolver, status, resolution, SLA status | All staff trained, 2 weeks of incident records, SLA targets achievable, 95% incidents classified correctly in sample audit |
| Service Request Management | Service request catalogue, pre-approved fulfillment procedures per request type, approval authorities | Service request records with request type, approvals, fulfillment steps, closure verification, SLA status | 3 common request types with documented procedures, 2 weeks of request records, all requests have closure verification, 100% SLA achievement |
| Problem Management (Reactive) | Trigger criteria for problem creation, RCA methodology, problem record template, known error criteria | Problem records with RCA documentation, known error records with workaround and permanent fix status, problem-to-change links | RCA methodology trained to problem team, 1 problem record per audit sample showing complete RCA, known errors searchable in incident management |
| Problem Management (Proactive) | Incident data analysis process, availability/performance review process, capacity problem criteria, proactive problem review schedule | Monthly incident trend analysis, proactive problem register with owner and target date, trend reports | Monthly trend analysis conducted, proactive problem register reviewed monthly at operations meeting, 1–2 proactive problems identified per month |
| IMPORTANT | Incident records are the single most-examined document type in Stage 2 audits. Auditors will sample 20–30 records and examine each in detail for: (1) classification accuracy: is the incident assigned to the correct category? (2) priority correctness: does the priority match the incident impact and urgency? (3) escalation compliance: were escalation procedures followed? (4) SLA status: was the SLA met or missed, and if missed, is there documented justification? (5) closure verification: is there evidence the customer confirmed resolution? Incident records with incomplete data, missing escalation documentation, or incorrect SLA status will generate audit findings. |
| BITLION INSIGHT | Bitlion GRC includes an integrated ITSM module with incident, problem, and service request workflows specifically designed to meet the ISO 20000 Clause 8.6 resolution and fulfilment requirements (8.6.1 incident management, 8.6.2 service request management, 8.6.3 problem management). Incident records include all required ISO 20000 fields. The service request catalogue and pre-approved procedures are configurable. The problem management workflow includes an RCA template and a known error database. Integration with change management ensures problem-to-change linkage. All records generate the documented information required for audit. |