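I want to build a service that performs cluster analysis on business logs. The logs follow certain patterns, but the patterns cannot be enumerated, so I'd like a service that clusters the logs to make later analysis easier.
My current thinking: since this is clustering, I need to pick a way to map log text to feature vectors so that similarity can be measured (text embedding), plus a clustering algorithm.
What I'm stuck on is how to choose the text embedding.
I haven't done anything like this before. I've looked up some material recently, but maybe I'm searching the wrong way, because I haven't found an implementation or algorithm to borrow from.
I'm not sure my description is clear. If anyone has done related work, I'd appreciate some pointers~
If there's a problem with the approach itself, please point that out too~~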
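To make the two pieces concrete (a text-to-feature step and a clustering step), here is a minimal sketch. It assumes whitespace tokenization, bag-of-token counts as the feature vector, cosine similarity, and a greedy threshold clustering where each line joins the first cluster whose first member is similar enough. None of these choices come from the thread, and the threshold `0.6` and the sample log lines are made up for illustration.

```python
import math
from collections import Counter


def featurize(line):
    """Turn a log line into a bag-of-tokens count vector (assumed feature choice)."""
    return Counter(line.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(lines, threshold=0.6):
    """Greedy clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_vector, member_lines)
    for line in lines:
        vec = featurize(line)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(line)
                break
        else:
            # No existing cluster matched, so this line starts a new one.
            clusters.append((vec, [line]))
    return [members for _, members in clusters]


# Made-up sample logs.
logs = [
    "connect to db timeout after 30 ms",
    "connect to db timeout after 45 ms",
    "user 1001 login failed",
    "user 1002 login failed",
]
groups = cluster(logs)
```

The feature step and the clustering step are independent here, so either one could be swapped out (for example, a learned embedding in place of token counts) without touching the other.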
1
widewing 2018-06-05 17:48:10 +08:00 via Android
I want to build this too, bookmarking.
|
2
fffflyfish 2018-06-05 18:26:17 +08:00
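Tokenize the logs and train word2vec on the tokens, then sum the vectors of all tokens in a text to get a vector for that text, which you can use to compute similarity between texts.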
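A rough sketch of the summing step, assuming the word vectors already exist. The `WORD_VECTORS` table below is a hypothetical toy stand-in with made-up 3-dimensional values; in practice the vectors would come from a word2vec model trained on the tokenized logs.

```python
import math

# Hypothetical toy vectors, NOT real word2vec output.
WORD_VECTORS = {
    "connect": [1.0, 0.0, 0.0],
    "timeout": [0.9, 0.1, 0.0],
    "db": [0.8, 0.0, 0.2],
    "login": [0.0, 1.0, 0.0],
    "failed": [0.1, 0.9, 0.0],
}


def text_vector(tokens, vectors=WORD_VECTORS, dim=3):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


v1 = text_vector("connect db timeout".split())
v2 = text_vector("db connect timeout".split())
v3 = text_vector("login failed".split())
sim_same = cosine(v1, v2)  # same tokens in a different order
sim_diff = cosine(v1, v3)  # unrelated tokens
```

Because the token vectors are simply added, word order is lost: `v1` and `v2` end up as (numerically) the same vector.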
|
3
ipwx 2018-06-05 18:35:34 +08:00
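Word embeddings won't necessarily work for patterns: a pattern carries too little information, so word embeddings overfit easily.
Take a look at the DeepLog paper. I haven't tried it myself, but it seems quite powerful. |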
4
shiznet OP @ipwx I read the abstract of "DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning", and it doesn't seem to match my needs very well.
``` Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at various critical points to help debug system failures and perform root cause analysis. Such log data is universally available in nearly all computer systems. Log data is an important and valuable resource for understanding system status and performance issues; therefore, the various system logs are naturally excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution, and detect anomalies when log patterns deviate from the model trained from log data under normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that once an anomaly is detected, users can diagnose the detected anomaly and perform root cause analysis effectively. Extensive experimental evaluations over large log data have shown that DeepLog has outperformed other existing log-based anomaly detection methods based on traditional data mining methodologies. ``` |
9
ETiV 2018-06-05 22:49:44 +08:00
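I've been using Google Cloud Functions these past few days. It has an error summary page
where many errors of the same type are grouped together. Is that what the OP wants? I think it's fairly simple to implement: group by error stack. The OP could also consider adding output such as "current module, file, and current line number" to the logs; with those values you could probably do the grouping~ |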
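A small sketch of that grouping idea, assuming each log entry already carries its source location. The record layout (`module`, `file`, `line`, `msg`) and the sample entries are hypothetical, and this is not how Google's error page is actually implemented.

```python
from collections import defaultdict

# Hypothetical record layout and made-up sample entries.
records = [
    {"module": "order", "file": "pay.py", "line": 42, "msg": "pay failed for order 1"},
    {"module": "order", "file": "pay.py", "line": 42, "msg": "pay failed for order 7"},
    {"module": "user", "file": "auth.py", "line": 10, "msg": "token expired"},
]


def group_by_location(entries):
    """Group messages by the (module, file, line) that emitted them."""
    groups = defaultdict(list)
    for entry in entries:
        key = (entry["module"], entry["file"], entry["line"])
        groups[key].append(entry["msg"])
    return dict(groups)


by_location = group_by_location(records)
```

This only works when the location is available as a separate, reliable field on every entry.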
10
shiznet OP @ETiV
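Modules/files are independent, so those can be told apart. But a single module may emit different logs; for example, method A may print exception stacks in several places, and the information in each stack may differ slightly. The line number is a variable inside the log message, so it can't be used directly as an identifier. Still, I can follow this line of thinking: first group by module, then cluster further within each module. |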
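The two-stage idea (module first, then a finer grouping inside each module) could look roughly like the sketch below. It assumes the module is known for every entry and treats any run of digits in the message as a variable, replacing it with `<NUM>` before grouping; the `(module, message)` layout, the masking rule, and the sample entries are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed masking rule: any run of digits (e.g. a line number) is a variable.
NUMBER = re.compile(r"\d+")


def template_of(message):
    """Replace variable-looking parts of a message with a placeholder."""
    return NUMBER.sub("<NUM>", message)


def cluster_by_module(entries):
    """entries: iterable of (module, message) pairs (hypothetical layout)."""
    result = defaultdict(lambda: defaultdict(list))
    for module, message in entries:
        result[module][template_of(message)].append(message)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {module: dict(templates) for module, templates in result.items()}


# Made-up sample entries.
entries = [
    ("order", "NullPointerException at OrderService line 120"),
    ("order", "NullPointerException at OrderService line 87"),
    ("order", "timeout calling payment gateway"),
    ("user", "NullPointerException at UserService line 15"),
]
clusters = cluster_by_module(entries)
```

Exact matching on the masked template only absorbs differences in digits. Stacks that differ in other ways would still need a similarity-based step inside each module.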
11
shiznet OP |