Paper Notes -- AIOps: A Multivocal Literature Review

This review supplements and updates an earlier study, A Systematic Mapping Study in AIOps.

Besides academic papers, it also covers grey literature (e.g., blog posts, videos, and white papers), hence the name "multivocal".

Our work will complement the work performed by these authors adding also insights from grey literature as well as more recent works on the topic. 

 

What is AI?

More precisely, AI is a technological domain with core components such as Machine Learning (ML), Deep Learning, Natural Language Processing (NLP) platforms, predictive Application Programming Interfaces (APIs), and image and speech recognition tools [29]. 

What is AIOps?

Gartner's definition

"AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight.
AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies."  

Forrester's definition

“AIOps primarily focuses on applying machine learning algorithms to create self-learning and potentially self-healing applications and infrastructure.
A key to analytics, especially predictive analytics, is knowing what insights you’re after.” 

 

What the paper studies

It answers three research questions:

RQ1: How does the literature define AIOps?

RQ2: What are the reported benefits of AIOps? 

RQ3: What are the reported challenges of AIOps? 

 

This paper draws on a broader range of sources, collected from Google Scholar (http://scholar.google.com/) and Google Search (http://www.google.com/).

 

Benefits

AIOps is relatively young and far from a mature technology; even so, some potential benefits have already been reported.

Monitoring IT work. Powerful RCA (root cause analysis) and fast troubleshooting.
AIOps solutions monitor and analyze the activities performed in an IT environment (both hardware and software), e.g. processor use, application response times, API usage statistics, and memory loads [18, 61].
These analytics and ML capabilities allow AIOps to perform powerful root cause analysis that speeds up troubleshooting and the resolution of difficult and unusual problems [47, 59].
For example, if workload traffic exceeds a normal threshold by a certain percentage, the AIOps platform can add resources to the workload or migrate it to another system or environment, much as a human admin would [56]. 
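A minimal sketch of the threshold rule in the example above, assuming the "normal" traffic level is already known and the current system has a fixed replica limit. The `Workload` class, the `remediate` function, and the 30% default threshold are hypothetical illustrations, not taken from the paper; a real AIOps platform would learn the baseline instead of hard-coding it.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical model of a monitored workload (illustration only).
    name: str
    baseline_rps: float  # "normal" traffic level, assumed known or learned
    current_rps: float   # currently observed traffic
    replicas: int
    max_replicas: int    # capacity limit of the current system


def remediate(w: Workload, threshold_pct: float = 30.0) -> str:
    """If traffic exceeds the baseline by more than threshold_pct percent,
    add a replica; if the current system is already full, migrate."""
    excess_pct = (w.current_rps - w.baseline_rps) / w.baseline_rps * 100
    if excess_pct <= threshold_pct:
        return "no_action"
    if w.replicas < w.max_replicas:
        w.replicas += 1  # add resources to the workload
        return "scale_out"
    return "migrate"     # no capacity left: move the workload elsewhere


if __name__ == "__main__":
    w = Workload("checkout-api", baseline_rps=100, current_rps=150,
                 replicas=2, max_replicas=4)
    print(remediate(w), w.replicas)  # scale_out 3
```

Scaling out is tried before migrating because it is the cheaper, less disruptive action; the ordering is a design assumption of this sketch.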

Proactive IT work. Proactive operations, as opposed to reactive operations; this relies on prediction techniques.
AIOps reduces the operational burden on IT systems and facilities by using big data, machine learning, and other advanced analytics technologies to provide actionable, dynamic insights that boost IT operations [47, 58].
This means that AIOps platforms can provide predictive warnings that allow potential issues to be solved by IT teams before they lead to slow-downs or outages.
In fact, a survey of 6,000 global IT leaders about AIOps revealed that 74% of IT professionals want to use proactive monitoring and analytics tools [48].
However, 42% of them are still using monitoring and analytics tools reactively to detect and fix technological challenges and issues. 
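To make "predictive warnings" concrete, here is a minimal sketch that fits a least-squares linear trend to recent samples of a metric (e.g., hourly disk usage in percent) and raises a warning if the trend is projected to cross a limit within a given horizon. The function names, the linear model, and the hourly sampling are assumptions for illustration. Production systems typically use richer forecasting models.

```python
def hours_until_breach(samples, limit):
    """samples: list of (hour, value) pairs, oldest first.
    Fit value = slope * hour + intercept by least squares and return how
    many hours after the last sample the trend reaches `limit`, or None
    if the trend is flat or decreasing."""
    n = len(samples)
    if n < 2:
        return None
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None  # not trending toward the limit
    intercept = mean_y - slope * mean_x
    t_breach = (limit - intercept) / slope
    return max(0.0, t_breach - xs[-1])


def predictive_warning(samples, limit, horizon_hours):
    """True if the metric is projected to reach `limit` within the horizon."""
    eta = hours_until_breach(samples, limit)
    return eta is not None and eta <= horizon_hours


if __name__ == "__main__":
    disk_pct = [(0, 50), (1, 55), (2, 60), (3, 65)]  # hourly disk usage (%)
    print(hours_until_breach(disk_pct, 90))                   # 5.0
    print(predictive_warning(disk_pct, 90, horizon_hours=6))  # True
```

The warning fires five hours before the projected breach, which is the window in which an IT team could act before a slow-down or outage.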

 

Challenges

Low-quality data.
The performance of the AIOps highly depends on the quality of the data [53].
While major cloud providers capture terabytes and even petabytes of telemetry data every day/month today, there is still a shortage of representative and high-quality data for developing AIOps solutions [22].
The data is simply becoming too voluminous and complex for manual reporting and analysis.
In this scenario, current issues include noisy data, irregular or inadequate reporting frequencies, and even inconsistent naming conventions [51, 53].
Besides, many essential pieces of information are "unstructured" data of poor quality [53].
Therefore, a constant improvement of data quality and quantity is essential, taking into account that AIOps solutions are based on data [22]. 
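Two of the data issues above, inconsistent naming conventions and irregular reporting frequencies, can be checked mechanically. A minimal sketch, assuming metric names should be normalized to snake_case and each data source is expected to report at a fixed interval (in seconds). Both conventions and the function names are hypothetical.

```python
import re


def normalize_metric_name(name):
    """Map variants such as 'CPU.Usage', 'cpu-usage' and 'cpuUsage'
    to a single snake_case form ('cpu_usage')."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)          # unify separators
    return name.strip("_").lower()


def irregular_gaps(timestamps, expected_interval, tolerance=0.5):
    """Return (prev, next) timestamp pairs whose gap differs from the
    expected reporting interval by more than tolerance * expected_interval."""
    ts = sorted(timestamps)
    return [
        (a, b)
        for a, b in zip(ts, ts[1:])
        if abs((b - a) - expected_interval) > tolerance * expected_interval
    ]


if __name__ == "__main__":
    print({normalize_metric_name(n) for n in ["CPU.Usage", "cpu-usage", "cpuUsage"]})
    # {'cpu_usage'}
    print(irregular_gaps([0, 60, 120, 300, 360], expected_interval=60))
    # [(120, 300)]
```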

 

Identifying the use cases. Finding the scenarios where AIOps should be applied requires a deep understanding of the business itself.
Identifying AIOps use cases is the process of analyzing and identifying the challenges and opportunities across the IT operations environment [51].
It also includes building models to solve these problems and monitoring the performance of the developed models [55].
Companies believe using AI and ML-related features will increase the efficiency of current development within the organization [52].
However, without identifying the underlying issue, an AIOps implementation might not be effective [51].
This is because AIOps solutions require analytical thinking and an adequate understanding of the whole problem space, such as market benefits and constraints, development models, and system and process integration considerations [22].
Therefore, the organization should start examining underlying systems, applications, and processes from the top level and decide the integration of AIOps to have the greatest leverage [51]. 

 

Traditional engineering approach. There are no mature implementation practices yet, so putting AIOps into production is difficult.
Successful AIOps implementation requires significant engineering efforts [22].
Because AIOps is relatively young and far from a mature technology, only a limited number of AIOps engineers are available [22].
Therefore, instead of focusing on building new AIOps initiatives, organizations should reshape their existing approaches and processes for the new realities of digital business [54, 55].
These works indicate that traditional approaches do not work in dynamic, elastic environments.
However, ideal practices, principles, and design patterns have yet to be established in the industry [53]. 

 

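The paper is somewhat thin on substance. Its main value is that the appendix lists the key papers and documents it draws on, which makes it useful as an index.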

 

Another quick read: AIOps: Real-World Challenges and Research Innovations

Goals AIOps aims to achieve:

High service intelligence.  

High customer satisfaction.  

High engineering productivity. 

Practical difficulties AIOps faces

Gap in innovation methodologies. Difficulty of the mindset shift. Methodologies for innovation are lacking, and shifting mindsets takes time.

Engineering changes needed to support AIOps
AIOps-oriented engineering is still at a very early stage, and the best practice/principles/design patterns are not well established in the industry yet. 
The data quality and quantity available today do not serve the needs of AIOps solutions. 

 

Difficulty of building ML models for AIOps

The challenges of building supervised machine learning models for AIOps include: no clear ground-truth labels, or huge manual effort to obtain high-quality ones (labels are extremely imbalanced, too few, highly noisy, etc.) [6];
complex dependencies/relations among components/services [7]; complicated feature engineering due to the high complexity of cloud service behaviors; continuous model updates and online learning; and the risk of service interruptions caused by misbehaving ML models. 
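One of the label problems listed above, extreme class imbalance (e.g., very few incident samples among many normal ones), is often mitigated by weighting classes inversely to their frequency. The sketch below shows that common heuristic, the same formula scikit-learn uses for `class_weight="balanced"`: n_samples / (n_classes * class_count). The paper does not prescribe this mitigation, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so rare
    classes (e.g. incidents) contribute as much total weight as common ones."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


if __name__ == "__main__":
    labels = [0] * 98 + [1] * 2  # 98 normal samples, 2 incidents
    weights = balanced_class_weights(labels)
    print(weights[1])  # 25.0
```

Reweighting does not fix the other label problems (too few or noisy labels), which is why the following paragraph turns to unsupervised methods.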

In many AIOps scenarios, because labeled data is difficult to obtain, only unsupervised or semi-supervised machine learning models are feasible. 
The difficulty of building high-quality unsupervised models lies in the complexity of the internal logic of services and the huge volume of the telemetry data that needs to be analyzed. 
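As an illustration of an unsupervised approach that needs no labels, here is a minimal sketch of the modified z-score outlier test (based on the median and the median absolute deviation, MAD) applied to a single telemetry series such as request latencies. This is a classic technique swapped in for illustration, not the method from the paper. The 3.5 cutoff is the commonly cited default and is an assumption here.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices of points whose modified z-score
    0.6745 * |x - median| / MAD exceeds `threshold`.
    Only the series itself is needed; no labels."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # over half the points equal the median; score undefined
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99, 100, 500, 103]
    print(mad_anomalies(latency_ms))  # [6]
```

The median and MAD are used instead of the mean and standard deviation because a single large spike barely moves them, so the spike itself cannot mask its own anomaly score.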

 

 

 

 

posted on 2023-04-12 15:59  fxjwind