Proj. CDeepFuzz Paper Reading: Checker Bug Detection and Repair in Deep Learning Libraries

3. TensorGuard: A RAG-Based Multi-Agent Framework to Detect and Fix DL Checker Bugs

RAG Design

Purpose: retrieve relevant contextual information from a large corpus of code changes
Input: the root cause of the queried checker bug
Output: code change
Based on:
Sentence-transformers with all-MiniLM-L6-v2 as the embedding model, which maps the documents to a 384-dimensional dense vector space
Batch size: 50; vector database: chromadb
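
The indexing and retrieval flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_embed` is a hypothetical hashing stand-in for all-MiniLM-L6-v2 (it only reproduces the 384-dimensional output size), and the in-memory `VectorIndex` class is a hypothetical stand-in for chromadb. Only the batch size of 50 and the vector dimension come from these notes.

```python
import hashlib
import math

DIM = 384         # output size of all-MiniLM-L6-v2, per the notes
BATCH_SIZE = 50   # indexing batch size, per the notes


def toy_embed(text):
    """Hypothetical stand-in for the embedding model: hashed bag of tokens, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class VectorIndex:
    """Hypothetical in-memory stand-in for the vector database."""

    def __init__(self):
        self.docs = []
        self.vectors = []

    def add_in_batches(self, documents):
        """Embed and store documents BATCH_SIZE at a time; return the number of batches."""
        batches = 0
        for start in range(0, len(documents), BATCH_SIZE):
            batch = documents[start:start + BATCH_SIZE]
            self.docs.extend(batch)
            self.vectors.extend(toy_embed(doc) for doc in batch)
            batches += 1
        return batches

    def query(self, root_cause, k=1):
        """Return the k stored code changes most similar to the root-cause text."""
        q = toy_embed(root_cause)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = sorted(
            zip(self.docs, self.vectors),
            key=lambda pair: sum(a * b for a, b in zip(q, pair[1])),
            reverse=True,
        )
        return [doc for doc, _ in scored[:k]]
```

In a real setup the root-cause text produced by the root cause analysis agent would be the query, and the returned code change would be passed on as retrieved context.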

Checker Bug Detection Agent

Three prompting strategies: COT, Zero-Shot, and Few-Shot (with two randomly selected examples)

TABLE V: Prompt template for bug detection agent (COT).

“prompt”: You are an AI trained to detect bugs in a deep-learning
library based on commit messages and code changes. Your task
is to determine whether a given commit introduces a bug or not.
Follow the steps below to reason through the problem and arrive at
a conclusion.
1. Understand the commit message: Analyze the commit message
to understand the context and purpose of the code change.
{commit message}
2. Review the code change: Examine the deleted and
added lines of code to identify the modifications made.
{code removed}{code added}
3. Identify potential issues: Look for any missing, improper, or
insufficient checkers within the code change. Checkers may include
error handling, input validation, boundary checks, or other safety
mechanisms.
4. Analyze the impact: Consider the impact of the identified issues
on the functionality and reliability of the deep learning libraries.
5. Make a decision: Based on the above analysis, decide whether
the commit introduces a bug or not.
6. Output the conclusion: Generate a clear output of “YES” if the
commit introduces a bug, or “NO” if it does not.
“output”: {Decision}
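
Filling the template's placeholders and reading back the YES/NO decision can be sketched as below. This is a minimal sketch with assumptions: the abbreviated `COT_TEMPLATE` string, `build_cot_prompt`, and `parse_decision` are hypothetical names, and the parsing rule (take the last upper-case YES or NO in the reply) is one plausible choice that the notes do not specify.

```python
import re

# Abbreviated version of the COT template above; placeholder names are illustrative.
COT_TEMPLATE = (
    "You are an AI trained to detect bugs in a deep-learning library "
    "based on commit messages and code changes.\n"
    "1. Understand the commit message:\n{commit_message}\n"
    "2. Review the code change:\n{code_removed}\n{code_added}\n"
    "6. Output the conclusion: Generate a clear output of \"YES\" if the "
    "commit introduces a bug, or \"NO\" if it does not."
)


def build_cot_prompt(commit_message, code_removed, code_added):
    """Substitute the commit message and the removed/added lines into the template."""
    return COT_TEMPLATE.format(
        commit_message=commit_message,
        code_removed=code_removed,
        code_added=code_added,
    )


def parse_decision(reply):
    """Return True for YES, False for NO, None if the reply contains neither.

    Matching is case-sensitive so that an ordinary lower-case "no" inside the
    reasoning text is not mistaken for the final decision.
    """
    matches = re.findall(r"\b(YES|NO)\b", reply)
    if not matches:
        return None
    return matches[-1] == "YES"
```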

TABLE VI: Prompt template for bug detection agent (Zero Shot).

“prompt”: You are an AI trained to detect bugs in a deep-learning
library based on commit messages and code changes. Your task
is to determine whether a given commit introduces a bug or not.
Follow the steps below to reason through the problem and arrive at
a conclusion.
Commit message: {commit message}
Code change: {code removed}{code added}
“output”: {Decision}

TABLE VII: Prompt template for bug detection agent (Few Shot).

“prompt”: You are an AI trained to detect bugs in a deep-learning
library based on commit messages and code changes. Your task
is to determine whether a given commit introduces a bug or not.
Follow the steps below to reason through the problem and arrive at
a conclusion.
Example Checker Bug One:
Commit message: {commit message}
Code change: {code removed}{code added}
Example Checker Bug Two:
Commit message: {commit message}
Code change: {code removed}{code added}
Task:
Commit message: {commit message}
Code change: {code removed}{code added}
“output”: {Decision}
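
Assembling the few-shot prompt with two randomly chosen examples can be sketched as follows. This is a minimal sketch with assumptions: `build_few_shot_prompt`, `format_case`, the dictionary keys, and the `seed` parameter are hypothetical, and the notes do not say how the example pool is built or how the random draw is implemented.

```python
import random

HEADER = (
    "You are an AI trained to detect bugs in a deep-learning library based on "
    "commit messages and code changes. Your task is to determine whether a "
    "given commit introduces a bug or not."
)


def format_case(commit_message, code_removed, code_added):
    """Render one commit in the layout used by the template."""
    return f"Commit message: {commit_message}\nCode change: {code_removed}{code_added}"


def build_few_shot_prompt(task, example_pool, seed=None):
    """Draw two distinct examples at random and place them before the task."""
    rng = random.Random(seed)            # a fixed seed makes the draw reproducible
    shots = rng.sample(example_pool, 2)  # sampling without replacement
    parts = [HEADER]
    for label, shot in zip(("One", "Two"), shots):
        parts.append(f"Example Checker Bug {label}:\n{format_case(**shot)}")
    parts.append(f"Task:\n{format_case(**task)}")
    return "\n".join(parts)
```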

Root Cause Analysis Agent

TABLE VIII: Prompt template for root cause analysis agent.

“prompt”: Please describe the root cause of the bug based on the
following commit message:{commit message}
“output”: {Root causes}

Patch Generation Agent

TABLE IX: Prompt template for patch generation agent.

“prompt”: You are given a bug explanation and an external context
for fixing a checker bug. Please think step by step and generate a
patch to fix the bug in the code snippet. Please neglect any issues
related to the indentation in the code snippet. Fixing indentation
is not the goal of this task. If you think the given pattern can be
applied, generate the patch.
Example One: {code removed} {code added}
Example Two: {code removed} {code added}
Bug explanation: {bug explanation}
Retrieved context: {retrieved knowledge}
Code snippet: {code snippet}
“output”: {Think steps}{Patch}
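
How the three agents and the retrieval step might be chained can be sketched as below. This is a minimal sketch under assumptions: `run_pipeline`, the `llm` and `retrieve` callables, the dictionary keys, and the shortened detection prompt are all hypothetical, and the exact control flow between agents is inferred from the inputs and outputs listed in these notes rather than taken from the paper.

```python
def run_pipeline(commit, llm, retrieve):
    """Chain detection, root cause analysis, retrieval, and patch generation.

    `llm` maps a prompt string to a reply string; `retrieve` maps a root-cause
    string to retrieved context. Both are injected so the flow can be tested
    without a model or a vector database.
    """
    detect_prompt = (
        "Determine whether the commit introduces a bug.\n"
        f"Commit message: {commit['message']}\n"
        f"Code change: {commit['removed']}{commit['added']}"
    )
    if "YES" not in llm(detect_prompt):
        return None  # judged clean, so no patch is generated

    root_cause = llm(
        "Please describe the root cause of the bug based on the "
        "following commit message:" + commit["message"]
    )

    # The root cause is the retrieval query; the result is external context.
    context = retrieve(root_cause)

    patch_prompt = (
        "You are given a bug explanation and an external context for fixing "
        "a checker bug. Please think step by step and generate a patch.\n"
        f"Bug explanation: {root_cause}\n"
        f"Retrieved context: {context}\n"
        f"Code snippet: {commit['snippet']}"
    )
    return llm(patch_prompt)
```

Passing the model and the retriever in as plain callables keeps the sketch independent of any particular LLM client or vector store.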

Data for RAG and TensorGuard Evaluation

RAG data (retrieval corpus): all commits, not only checker-related ones; 1.3M code changes in total
- 61,453 commits for PyTorch and 150,352 commits for TensorFlow
- 391,571 code changes for PyTorch and 920,108 code changes for TensorFlow
Test Dataset:
- Checker-bug-related commits of PyTorch and TensorFlow from January 1, 2024 to July 20, 2024; size criteria noted for the larger commits among them: more than 10 modified files, more than 15 modified lines of code
- From these, 92 buggy and 135 clean DL checker-related changes were selected.
Metrics for patch generation: Precision, Recall, F1 score, and the number of correctly generated patches
Model: GPT-3.5-turbo with temperature = 0; each experiment is run 5 times and the results are averaged
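
Computing Precision, Recall, and F1 and averaging them over repeated runs can be sketched as follows. This is a minimal sketch with assumptions: the function names and the label encoding (1 for buggy, 0 for clean) are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the notes.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = buggy, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_over_runs(y_true, runs):
    """Average each metric across repeated runs (the notes use 5 runs)."""
    scores = [precision_recall_f1(y_true, y_pred) for y_pred in runs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
```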