AIOps – The Future of IT Operations is AI

AI technologies are at the peak of their hype cycle. Some of the claims about what AI can do are certainly bold, but it is not all hype, and the technology continues to show promise. Organizations are being swamped with data, and the challenge is how to make use of it all. One of the key promises of AI is to turn this large accumulation of data into insights and automated decisions. IT departments, IT Service Management, and operations are among the most promising areas, because they generate a great deal of data.

IT operations faces a dual challenge: reducing cost while managing increasing operational complexity. The cost pressure is obvious; the complexity is born of continuous innovation in business and IT operations technologies. It shows up in the volume, variety, and velocity of data:

  • Explosive growth (volume) of data generated by both applications and IT infrastructure (2-3x per year).
  • Increasing variety of sources and formats of data produced by machines, humans, logs, network traffic, and documents (e.g. knowledge documents, runbooks).
  • Escalating velocity of data being generated combined with the rate of change from the adoption of cloud-native (SaaS), serverless, and microservice architectures.

These pressures are driving the AIOps trend.

AIOps – Intelligent Operations

Gartner defines AIOps as “the application of machine learning (ML) and data science to IT operations problems.” AIOps platforms combine big data and machine learning to support all primary IT operations functions. These capabilities can scale ingestion and analysis of the growing volume, variety and velocity of data generated by IT. The AIOps vision enables the concurrent use of multiple data sources, data collection methods, analytical and presentation technologies.

In the IT arena, we are already seeing AIOps make headway. Vendors of well-known IT Service Management (ITSM) platforms (e.g. Splunk, ServiceNow, BMC Helix/Remedy, Symphony Summit, Ivanti, IBM, ServiceAide, and Freshworks) are building cognitive technology into their products for uses such as:

  • Correlating event data for incidents and cases to facilitate faster triage.
  • Identifying patterns by continuously segmenting and grouping similar items to find areas of improvement.
  • Correlating content (knowledge base) with incidents, cases, and alerts to facilitate quicker resolution.
  • Automating ticket creation (for example, at the help desk) by classifying incidents and requests, to reduce human effort and prioritize work.
  • Automating the routing of tickets or alarms to the proper resources, for shorter cycle times and fewer errors.
  • Predicting imminent incidents on servers, and in some cases preventing them before they cause an outage (e.g. resource and capacity utilization).
  • Preventing zero-day desktop attacks, with detection reported to be as much as 99.9% effective (Al Farakh, 2019).
  • Automating detection of desktop issues and resolving them before the user knows there is an issue (e.g. software version, browser, or printer issues).
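
As a rough illustration of the first item in the list above, event correlation can be as simple as grouping alerts from the same source that arrive close together in time. The sketch below is a minimal example under stated assumptions, not how any of the vendors named here implement it: the Event fields, the correlate function, the sample alerts, and the five-minute window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    host: str            # hypothetical field: the system that raised the alert
    timestamp: datetime  # when the alert was raised
    message: str         # free-text alert description


def correlate(events, window=timedelta(minutes=5)):
    """Group events into incidents.

    An event joins the most recent incident for its host when it arrives
    within `window` of that incident's latest event; otherwise it opens
    a new incident.
    """
    incidents = []
    latest_for_host = {}  # host -> most recent incident (a list of events)
    for event in sorted(events, key=lambda e: e.timestamp):
        current = latest_for_host.get(event.host)
        if current and event.timestamp - current[-1].timestamp <= window:
            current.append(event)
        else:
            current = [event]
            incidents.append(current)
            latest_for_host[event.host] = current
    return incidents


# Hypothetical sample alerts, for illustration only.
sample_events = [
    Event("db01", datetime(2019, 4, 8, 10, 0), "disk latency high"),
    Event("db01", datetime(2019, 4, 8, 10, 3), "query timeouts"),
    Event("web01", datetime(2019, 4, 8, 10, 4), "HTTP 500 rate rising"),
    Event("db01", datetime(2019, 4, 8, 10, 30), "disk latency high"),
]

incidents = correlate(sample_events)
for incident in incidents:
    print(incident[0].host, [e.message for e in incident])
```

Here the two db01 alerts three minutes apart collapse into one incident, so a technician triages three items instead of four. Real platforms correlate across hosts, topology, and message content as well; this sketch only shows the time-window idea.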

These solutions offer insights and automation to handle complexity and reduce time to resolution. One of the biggest benefits of AI for the help desk and overall IT support function is that it can reduce or remove the manual overhead associated with high-volume, low-value service desk activities. While chatbots and ticket creation/routing are the standard today, this is only the beginning. AI- and ML-driven tools will soon tap predictive analytics for better decision making in incident management, demand planning and more.

Evaluating the Business Case for AIOps

To illustrate the impact, we’ll focus on one area of opportunity: IT Service Management, i.e. the help desk. This is an emerging area that can have a great impact on internal support times and customer satisfaction. The top three metrics for a service desk are typically a) time to respond, b) time to resolve, and c) customer satisfaction.

In a Zendesk benchmark report, the average time to first response on an internal ticket is around 24 hours. This assumes the requester either does not have access to a self-service capability, or that self-service did not resolve their issue. In another study, JitBit found that the median response time to customer support tickets is around 7 hours. This is the internal support customer’s service desk experience, which affects both their productivity and their satisfaction.

The average first-level resolution rate is 74.3% (MetricNet). The median customer support ticket resolution time is 3 days 10 hours (~82 hours), with the top 5% at 17 hours (JitBit).

These numbers can reflect several different underlying issues, such as:

  1. Poor configuration management, environment complexity, or buggy software.
  2. A large number of users to support compared to the number of staff resources available.
  3. Lack of available technology to provide faster responses or self-service.

Note that the median number of support tickets that one technician can handle per day is around 21 (JitBit). The average number of desktop technicians per 1,000 seats ranges from 5.4 in the healthcare industry to 21.9 in the financial services industry (MetricNet).

This implies a direct correlation between number of tickets and staffing costs, thus cost per ticket. The average service desk spends 68.5% of its budget on staffing costs and only 9.3% on technology (MetricNet). Using numbers from HDI for North America:

  • Average cost per ticket: ~$109
  • Average cost per incident: ~$73
  • Average cost per service request: ~$173

This implies that a technician (i.e. beyond level 1) handling 21 requests per day has an internal cost of roughly $1,500 to $3,600 per day, depending on the mix of ticket types. Across North America in 2016, the average ticket cost was $15.56, with a low of $2.93 and a high of $46.69.
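
The daily cost range follows directly from the figures already quoted: 21 tickets per technician per day (JitBit), at roughly $73 per incident or $173 per service request (HDI). The short sketch below reproduces that arithmetic; the function name and the 14/7 split in the last example are illustrative assumptions, not figures from the cited reports.

```python
TICKETS_PER_DAY = 21            # median tickets per technician per day (JitBit)
COST_PER_INCIDENT = 73          # approximate USD per incident (HDI)
COST_PER_SERVICE_REQUEST = 173  # approximate USD per service request (HDI)


def daily_cost(incident_count, request_count):
    """Internal cost in USD of one technician's day for a given ticket mix."""
    return (incident_count * COST_PER_INCIDENT
            + request_count * COST_PER_SERVICE_REQUEST)


# Lower bound: every ticket is an incident.
all_incidents = daily_cost(TICKETS_PER_DAY, 0)   # 21 * 73  = 1533

# Upper bound: every ticket is a service request.
all_requests = daily_cost(0, TICKETS_PER_DAY)    # 21 * 173 = 3633

# Hypothetical mix: 14 incidents and 7 service requests.
mixed_day = daily_cost(14, 7)                    # 1022 + 1211 = 2233

print(all_incidents, all_requests, mixed_day)
```

The two bounds, $1,533 and $3,633, round to the $1,500 to $3,600 range given above; any real mix of incidents and service requests falls between them.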

The lower the support level at which an issue is resolved (e.g. Level 1 vs Level 2), and the faster the ticket is resolved, the lower the cost. Managed Service Providers (MSPs) should take note: this is quite often your value proposition.

This is where technology comes into play. Service desk chatbots (NLP) and automated request routing are just the beginning. AI- and ML-driven tools are already starting to use predictive analytics for better decision making in incident management, demand planning and more.

One of the challenges is that these solutions are very data dependent. Vendors are going to market with data-source-agnostic AIOps platforms. These products tend to be generic and cater to the broadest use cases. Today, most monitoring tools do not cut across the multiple data types required to extract useful insights in complex scenarios (e.g. network correlation and root-cause analysis).

  • Some vendors that have the key components tend to have a restricted set of data sources.
  • Some vendors with existing monitoring solutions limit data sources to their own monitoring products or extend to a limited partner ecosystem.
  • Some open-source projects enable users to assemble their own AIOps platforms by offering tools for data ingest, a big data platform, ML and a visualization layer. End users can mix and match the components from multiple providers.

Summary

Companies that use remote support and knowledge management tools have higher average first contact and first-level resolution rates than those that don’t. AIOps platform investments have almost always been justified based on their ability to decrease mean time to problem resolution. AI is bringing real value to the IT department, and the future has potential. In the next article, we’ll look at one of the new innovations in AIOps.

References:

“Market Guide for AIOps Platforms”, Gartner, 12 Nov 2018

“Average customer support metrics from 1000 companies”, JitBit, 4 Apr 2019

“How AI is Helping the Help Desk”, ComputerWorld, 8 Apr 2019

“Metric of the Month: Desktop Support Cost per Ticket”, HDI, 8 Oct 2017