Information Security - How can I train a machine learning system to identify vulnerabilities in code?

Apologies if this is less a "question" than a discussion. I have been thinking about this for a while.

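How could I train a machine learning system to identify (new) vulnerabilities in open source code bases? Or even in closed binaries? Is it possible at all?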

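Here is the approach I have come up with... I'm curious whether anyone is familiar with work along these lines, or has any thoughts on its feasibility.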

Requirements:

A CVE database with the following properties:

1) The source diff of the patch applied to fix the vulnerability, i.e. the "before/after" of the critical section of code

2) The bindiff of the binary before and after patching the vulnerability

Goal:

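Use code from previously identified vulnerabilities to train an ML system to recognize "vulnerable" code, then apply it to the critical sections of code in open source projects.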

It would look something like this...

Data collection phase:

1) Collect the before/after code of all previous vulnerabilities

2) Use the before/after code to identify the "critical section" that caused the vulnerability

3) Convert the "critical section" to its AST representation
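To make the data collection steps concrete, here is a minimal sketch. It assumes Python source and the standard `ast` and `difflib` modules purely for illustration; most CVE patches are in C or C++ and would need a different parser. The function names (`changed_lines`, `critical_functions`, `node_type_sequence`) and the sample snippets are hypothetical, and treating "the enclosing function of every changed line" as the critical section is just one possible heuristic.

```python
import ast
import difflib

def changed_lines(before: str, after: str) -> set[int]:
    """Return 1-based line numbers in `before` that the patch touched."""
    matcher = difflib.SequenceMatcher(None, before.splitlines(), after.splitlines())
    lines: set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            lines.update(range(i1 + 1, i2 + 1))
        elif tag == "insert":
            # Pure insertion: attribute it to the line just before the insertion point.
            lines.add(max(i1, 1))
    return lines

def critical_functions(before: str, after: str) -> list[ast.FunctionDef]:
    """Functions in the pre-patch code that contain at least one touched line."""
    touched = changed_lines(before, after)
    return [
        node for node in ast.walk(ast.parse(before))
        if isinstance(node, ast.FunctionDef)
        and any(node.lineno <= n <= node.end_lineno for n in touched)
    ]

def node_type_sequence(node: ast.AST) -> list[str]:
    """Flatten an AST into a breadth-first list of node-type names."""
    return [type(n).__name__ for n in ast.walk(node)]

# Hypothetical before/after pair: the patch adds a bounds check.
BEFORE = """def read_item(buf, index):
    return buf[index]

def helper():
    return 1
"""
AFTER = """def read_item(buf, index):
    if index < 0 or index >= len(buf):
        raise IndexError(index)
    return buf[index]

def helper():
    return 1
"""

funcs = critical_functions(BEFORE, AFTER)
print([f.name for f in funcs])
print(node_type_sequence(funcs[0]))
```

The node-type sequence discards identifiers and literals, so two structurally similar functions map to similar representations regardless of naming.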

Training phase:

1) Determine the best ML algorithms to use for comparing AST representations

2) Using labeled inputs of "vulnerable" and "safe" AST representations, train the ML system to recognize a "vulnerable" AST
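As a rough illustration of what training on labeled ASTs could look like, the sketch below turns each snippet into a bag of AST node types and classifies new code by cosine similarity to per-label centroids. This is a deliberately simple stand-in, not a recommendation for the "best" algorithm; the `CentroidClassifier` class, the feature choice, and the two training samples are all assumptions made for the example.

```python
import ast
import math
from collections import Counter

def ast_features(source: str) -> Counter:
    """Bag of AST node-type names for a code snippet."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class CentroidClassifier:
    """Toy nearest-centroid classifier over AST node-type counts."""

    def fit(self, samples: list[tuple[str, str]]) -> "CentroidClassifier":
        self.centroids: dict[str, Counter] = {}
        for source, label in samples:
            # Summed counts per label; cosine similarity ignores the scale.
            self.centroids.setdefault(label, Counter()).update(ast_features(source))
        return self

    def scores(self, source: str) -> dict[str, float]:
        feats = ast_features(source)
        return {label: cosine(feats, c) for label, c in self.centroids.items()}

    def predict(self, source: str) -> str:
        scores = self.scores(source)
        return max(scores, key=scores.get)

# Hypothetical labeled inputs: the pre-patch and post-patch versions of one function.
TRAINING = [
    ("def get(buf, i):\n    return buf[i]\n", "vulnerable"),
    ("def get(buf, i):\n    if 0 <= i < len(buf):\n        return buf[i]\n    raise IndexError(i)\n", "safe"),
]

model = CentroidClassifier().fit(TRAINING)
print(model.predict("def fetch(items, pos):\n    return items[pos]\n"))
```

With only one sample per label the model simply memorizes two shapes; a realistic system would need many labeled pairs and a representation that keeps more structure than node counts.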

New vulnerability identification phase:

1) Download open source code bases

2) Somehow prioritize which code to convert to AST

3) Convert code to AST and feed to ML system to determine likelihood of "vulnerability"

4) Apply some combination of static and manual analysis to verify the vulnerability

5) Use results as further feedback to train the ML system
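The identification loop above could be sketched as follows. The scoring function here (`unguarded_subscript_score`) is a hypothetical placeholder for the trained model, and the sample source, function names, and label strings are assumptions for illustration only. The point is the shape of the loop: score every function, rank, verify the top candidates by hand, and feed the verdicts back into the training set.

```python
import ast
from typing import Callable

def unguarded_subscript_score(func: ast.FunctionDef) -> float:
    """Placeholder for a model score: indexing operations, discounted by guards."""
    nodes = list(ast.walk(func))
    subscripts = sum(isinstance(n, ast.Subscript) for n in nodes)
    guards = sum(isinstance(n, (ast.If, ast.Try)) for n in nodes)
    return subscripts / (1 + guards)

def rank_functions(
    source: str, score: Callable[[ast.FunctionDef], float]
) -> list[tuple[float, str]]:
    """Score every function in `source` and return (score, name), highest first."""
    ranked = [
        (score(node), node.name)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
    ]
    return sorted(ranked, reverse=True)

def record_verdict(training_set: list[tuple[str, str]], snippet: str, confirmed: bool) -> None:
    """Step 5: store the outcome of manual verification as a new labeled sample."""
    training_set.append((snippet, "vulnerable" if confirmed else "safe"))

# Hypothetical code base with three functions.
SOURCE = """def copy_pair(dst, src, i):
    dst[i] = src[i]

def checked(src, i):
    if i < len(src):
        return src[i]
    return None

def constant():
    return 42
"""

ranked = rank_functions(SOURCE, unguarded_subscript_score)
for score, name in ranked:
    print(f"{score:.2f}  {name}")
```

The same ranking could also serve as the prioritization in step 2, or as a way to point a fuzzer at the highest-scoring functions first.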

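Again, I realize this isn't strictly a "question", but I hope it can spark some interesting discussion. It's an idea I have been turning over in my head, but most of it lies outside my area of expertise.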

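It certainly has plenty of challenges, chiefly false positives (for example, a doubly nested for loop with a dozen conditionals might look like a vulnerable AST, yet sit in a non-critical part of the code). But I think the central idea of training ML algorithms on existing vulnerabilities could lead to a very effective way of discovering new ones. At the very least, it could make tools such as fuzzers more efficient by steering them toward the critical parts of the code. Moreover, it need not apply only to open source code. One could also disassemble the vulnerable and patched binaries and compare their ASM instructions. In fact, that might even yield a higher signal than the AST approach.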
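For the binary variant, a minimal sketch of comparing ASM instructions might look like this. It assumes the disassembly has already been produced as plain text by some disassembler; the two listings below are hand-written x86-style examples, not the output of a real tool, and `mnemonics` and `added_instructions` are hypothetical helper names.

```python
import difflib

def mnemonics(listing: str) -> list[str]:
    """Reduce a text disassembly to its sequence of instruction mnemonics."""
    return [line.split()[0] for line in listing.splitlines() if line.strip()]

def added_instructions(vulnerable: str, patched: str) -> list[str]:
    """Mnemonics present in the patched listing but not aligned with the vulnerable one."""
    a, b = mnemonics(vulnerable), mnemonics(patched)
    added: list[str] = []
    for tag, _i1, _i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added

# Hand-written illustrative listings: the patch adds a bounds check before the load.
VULNERABLE_ASM = """
mov eax, dword ptr [rbp-4]
movzx eax, byte ptr [rdi+rax]
ret
"""
PATCHED_ASM = """
mov eax, dword ptr [rbp-4]
cmp eax, esi
jae fail
movzx eax, byte ptr [rdi+rax]
ret
"""

print(added_instructions(VULNERABLE_ASM, PATCHED_ASM))
```

Dropping operands keeps the comparison robust to register allocation and address changes between builds, at the cost of losing information that a real model might need.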