Why Problem Management Matters
Organizations that only do incident management are in permanent firefighting mode. Every incident is treated as a discrete event; when the same issue recurs, it is handled as a brand new incident with no memory of lessons learned. Problem management breaks this cycle by investigating root causes and implementing permanent fixes. The difference is stark: without problem management, a recurring email authentication failure is resolved by resetting user credentials five times a month; with problem management, the authentication system bug is fixed once, and the issue is eliminated.
Reactive Problem Management Triggers
Reactive problem management starts when one of the following occurs: a major incident (P1), which automatically triggers a problem record for investigation; an emerging pattern (the same CI or the same symptom recurring three or more times within a defined period, e.g., 30 days); a customer escalation over recurring issues; or a systemic problem identified by auditors or management during review. The trigger must be documented: a problem record is created and linked to the incidents that triggered it.
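The pattern trigger can be illustrated with a short sketch. It assumes incidents are available as simple (ID, CI, date) tuples and uses the example thresholds from the text (three occurrences within 30 days). The function name and data shape are hypothetical, not taken from any particular ITSM tool.

```python
from collections import defaultdict
from datetime import date, timedelta

# Example thresholds from the text: three or more incidents within 30 days.
MIN_OCCURRENCES = 3
WINDOW = timedelta(days=30)


def find_recurring_cis(incidents, min_occurrences=MIN_OCCURRENCES, window=WINDOW):
    """Return CIs with at least `min_occurrences` incidents inside any `window`.

    `incidents` is an iterable of (incident_id, ci, opened_on) tuples.
    The result maps each flagged CI to the incident IDs that formed the pattern,
    so the new problem record can be linked to them.
    """
    by_ci = defaultdict(list)
    for incident_id, ci, opened_on in incidents:
        by_ci[ci].append((opened_on, incident_id))

    flagged = {}
    for ci, entries in by_ci.items():
        entries.sort()
        start = 0
        for end in range(len(entries)):
            # Advance the left edge until the span fits inside the window.
            while entries[end][0] - entries[start][0] > window:
                start += 1
            if end - start + 1 >= min_occurrences:
                flagged[ci] = [iid for _, iid in entries[start:end + 1]]
                break
    return flagged


incidents = [
    ("INC-1", "mail-auth", date(2024, 3, 1)),
    ("INC-2", "mail-auth", date(2024, 3, 9)),
    ("INC-3", "vpn-gw", date(2024, 3, 10)),
    ("INC-4", "mail-auth", date(2024, 3, 20)),
    ("INC-5", "vpn-gw", date(2024, 5, 2)),
]
recurring = find_recurring_cis(incidents)
print(recurring)  # {'mail-auth': ['INC-1', 'INC-2', 'INC-4']}
```

Grouping by symptom instead of CI would follow the same shape; only the grouping key changes.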
Problem Investigation Process
The problem record is created with a clear description of the recurring issue. The problem owner is assigned—someone with authority and time to investigate. Investigation methodology is structured: collect data (incident records, system logs, CMDB change history, monitoring data from the days surrounding the incidents, any customer-reported patterns). Then analyze the data to identify root causes. The structured data collection phase is critical; investigations that rely on vague recollection or assumptions lead to wrong root causes and ineffective fixes.
Root Cause Analysis Techniques
Three RCA techniques are common: 5 Whys (iteratively asking "why" until the root cause is reached; simple to apply, but it risks stopping at a superficial cause), the Fishbone or Ishikawa diagram (categorizing potential causes into People, Process, Technology, and Environment; useful for complex problems), and Fault Tree Analysis (a top-down deductive approach that identifies the chain of failures leading to the incident; rigorous enough for high-impact problems). Choose the technique based on complexity and impact; all three require documented evidence and reasoning.
| KEY CONCEPT | The known error database is the institutional memory of the SMS. It converts individual investigations into organizational knowledge. If the database is empty or unused, problem management is not adding value regardless of how many records are open. |
Root Cause Documentation
The problem record must contain: the problem description and symptom, linked incidents showing the pattern, investigation timeline, root cause conclusion with supporting evidence, and recommended fix. Root cause analysis without documentation is worthless for audit and organizational learning. A documented root cause creates a reference point: future incidents with the same symptom can be quickly diagnosed; staff can review past investigations to understand system vulnerabilities; auditors can verify that investigation methodology was sound.
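A completeness check for the required elements could look like the sketch below. The `ProblemRecord` class and its field names are hypothetical; they only mirror the elements listed above, with the root cause conclusion and its supporting evidence kept as separate fields.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ProblemRecord:
    """Hypothetical record shape mirroring the required elements."""
    description: str = ""                               # problem description and symptom
    linked_incidents: list = field(default_factory=list)
    investigation_timeline: list = field(default_factory=list)
    root_cause: str = ""
    supporting_evidence: list = field(default_factory=list)
    recommended_fix: str = ""


def missing_elements(record):
    """Return the names of required fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = ProblemRecord(
    description="Recurring email authentication failures",
    linked_incidents=["INC-1", "INC-2", "INC-4"],
    root_cause="Bug in the authentication system",
)
print(missing_elements(record))
# ['investigation_timeline', 'supporting_evidence', 'recommended_fix']
```

A check like this could gate the transition to a "root cause identified" status, so that undocumented conclusions are caught before an audit does.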
Known Error Management
Often the root cause is identified but a permanent fix cannot be implemented immediately because it requires a major change, complex testing, or significant cost. In this case, a known error record is created containing: a description of the issue, the root cause, a documented workaround (e.g., "restart the service"), affected CIs and services, a severity rating, and an estimated fix timeline. The known error database is made accessible to incident management; when a new incident with matching symptoms is logged, the resolver can query the known error database and apply the documented workaround, which shortens resolution time. The known error remains open until the permanent fix is implemented and verified; at that point, it is closed and archived.
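The lookup a resolver performs against the known error database can be sketched as follows. This is a minimal in-memory illustration: the `KnownError` fields follow the record contents described above, while the keyword-overlap matching and all names are assumptions. A real database would more likely use full-text search.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """Hypothetical known error record; fields follow the list in the text."""
    error_id: str
    description: str
    root_cause: str
    workaround: str
    affected_cis: tuple
    severity: str
    estimated_fix: str
    keywords: frozenset      # assumed: symptom keywords used for matching
    status: str = "open"     # stays "open" until the permanent fix is verified


def match_known_errors(known_errors, ci, symptom_text):
    """Return open known errors affecting `ci`, best keyword overlap first."""
    words = set(symptom_text.lower().split())
    scored = []
    for ke in known_errors:
        if ke.status != "open" or ci not in ke.affected_cis:
            continue
        overlap = len(ke.keywords & words)
        if overlap:
            scored.append((overlap, ke))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in scored]


kedb = [
    KnownError("KE-1", "Logins fail intermittently", "Authentication system bug",
               "restart the service", ("mail-auth",), "high", "next release",
               frozenset({"login", "authentication", "token"})),
    KnownError("KE-2", "Old login issue", "Expired certificate",
               "renew certificate", ("mail-auth",), "low", "done",
               frozenset({"login"}), status="archived"),
]
matches = match_known_errors(kedb, "mail-auth", "Users report authentication errors at login")
print(matches[0].workaround)  # restart the service
```

Archived entries are skipped so that resolvers are not offered workarounds for errors that have already been permanently fixed.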
Fix Implementation and Verification
The permanent fix is implemented as a change record linked to the problem record. After implementation, the organization monitors to verify that the problem is resolved. If a new incident matching the same pattern occurs post-fix, the fix was ineffective. Problem closure criteria: permanent fix implemented, tested, and deployed; no new incidents matching the known error symptom have occurred in a defined post-fix period (e.g., 30 days); the known error is closed. The problem record is then closed, and the fix enters the organization's institutional knowledge.
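The closure criteria can be expressed as a simple check. The sketch below assumes the 30-day observation period from the example and treats any matching incident dated after the deployment day as a recurrence; the function name and parameters are hypothetical.

```python
from datetime import date, timedelta


def can_close_problem(fix_deployed_on, matching_incident_dates, known_error_closed,
                      today, observation=timedelta(days=30)):
    """Evaluate the closure criteria and return (ok, reason).

    `matching_incident_dates` holds the dates of all incidents that match
    the known error symptom, both before and after the fix.
    """
    if fix_deployed_on is None:
        return False, "permanent fix not yet deployed"
    # Assumption: incidents on the deployment day itself are not counted.
    if any(d > fix_deployed_on for d in matching_incident_dates):
        return False, "pattern recurred after the fix"
    if today - fix_deployed_on < observation:
        return False, "observation period still running"
    if not known_error_closed:
        return False, "known error still open"
    return True, "all closure criteria met"


print(can_close_problem(date(2024, 4, 1), [date(2024, 3, 20)], True, date(2024, 5, 5)))
# (True, 'all closure criteria met')
```

Returning the reason alongside the boolean makes it easy to show the owner exactly which criterion is blocking closure.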
Proactive Problem Management
Beyond reactive investigation, proactive problem management identifies emerging issues before they cause significant incidents. Weekly incident trend analysis examines the top recurring incidents by category, CI, and service, along with capacity and availability data from monitoring for signs of degradation (error rate rising, throughput dropping, latency increasing). A change-related risk assessment asks which recent changes correlate with new incident patterns. The proactive problem register tracks emerging trends that have been identified but have not yet reached the incident threshold. Proactive problem management is what separates a mature SMS from a baseline one.
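One simple way to run the weekly trend analysis is to compare the current week's incident count per category against the average of previous weeks. The sketch below does that; the 1.5x factor and the minimum count of three are illustrative assumptions, not values from any standard, and the function name is hypothetical.

```python
from collections import Counter


def rising_categories(weekly_counts, factor=1.5, min_count=3):
    """Flag categories whose current-week count exceeds `factor` x the baseline.

    `weekly_counts` is a list of per-week {category: count} mappings,
    oldest first; the last entry is the current week.
    Returns (category, current_count, baseline) tuples, highest count first.
    """
    if len(weekly_counts) < 2:
        raise ValueError("need at least one week of history plus the current week")
    *history, current = weekly_counts
    flagged = []
    for category, count in current.items():
        baseline = sum(week.get(category, 0) for week in history) / len(history)
        if count >= min_count and count > factor * baseline:
            flagged.append((category, count, baseline))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


weeks = [
    Counter({"email": 2, "vpn": 4}),
    Counter({"email": 3, "vpn": 4}),
    Counter({"email": 7, "vpn": 4, "storage": 1}),
]
print(rising_categories(weeks))  # [('email', 7, 2.5)]
```

Flagged categories are candidates for the proactive problem register. The same comparison could be applied per CI or per service by changing the grouping key.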
Problem Management Metrics
Metrics to track: open problem count by age (highlight stale problems opened months ago with no progress), mean time from incident pattern to problem opening, mean time from root cause identification to permanent fix implementation, percentage of major incidents with associated problem records, known error database utilization rate in incident resolution (percentage of incidents where known error was applied), problem effectiveness ratio (percentage of problems where fix resolved the recurring pattern).
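A few of these metrics can be computed directly from problem and incident data, as in the sketch below. The dictionary keys, the function name, and the 90-day staleness threshold are assumptions for illustration; the text does not define exactly when a problem counts as stale.

```python
from datetime import date


def problem_metrics(problems, incident_count, known_error_hits, today,
                    stale_after_days=90):
    """Compute a subset of the problem management metrics.

    Each problem is a dict with keys "opened_on" (date), "closed" (bool) and
    "fix_effective" (True/False once a fix has been assessed, else None).
    """
    open_ages = [(today - p["opened_on"]).days for p in problems if not p["closed"]]
    assessed = [p for p in problems if p["fix_effective"] is not None]
    return {
        "open_problems": len(open_ages),
        "stale_problems": sum(age > stale_after_days for age in open_ages),
        "kedb_utilization_pct": (
            100 * known_error_hits / incident_count if incident_count else 0.0
        ),
        "fix_effectiveness_pct": (
            100 * sum(p["fix_effective"] for p in assessed) / len(assessed)
            if assessed else 0.0
        ),
    }


problems = [
    {"opened_on": date(2024, 1, 5), "closed": False, "fix_effective": None},
    {"opened_on": date(2024, 5, 20), "closed": False, "fix_effective": None},
    {"opened_on": date(2024, 2, 1), "closed": True, "fix_effective": True},
    {"opened_on": date(2024, 3, 1), "closed": True, "fix_effective": False},
]
metrics = problem_metrics(problems, incident_count=200, known_error_hits=40,
                          today=date(2024, 6, 1))
print(metrics)
```

With this sample data, one of the two open problems is older than 90 days, known errors were applied in 20% of incidents, and half of the assessed fixes resolved the pattern.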
| IMPORTANT | Stale problem records—opened months ago with no progress—are a significant audit finding. They signal that problem management exists on paper but not in practice. Weekly review and escalation of stale problems is essential governance. |
Maturity Progression
The SMS maturity progression: reactive (only investigate after major incidents) → systematic reactive (investigate all significant patterns as a standard process) → proactive (identify and prevent issues before incidents occur). ISO 20000 requires at least systematic reactive problem management. Proactive is the hallmark of a genuinely mature SMS—the organization is learning and improving, not just responding.
| BITLION INSIGHT | Bitlion GRC provides integrated problem management with automatic incident linkage detection, an RCA template library (5 Whys, Fishbone, FTA), a known error database with full-text search, and stale problem alerting for governance. |