Original source: Beosin
In recent years, large language models such as GPT-4, Claude, and Gemini have developed strong code comprehension, enabling them to read smart contract languages like Solidity, Rust, and Go and to identify classic vulnerabilities with clear code signatures, such as reentrancy and integer overflow. This has prompted the industry to ask: can large models assist, or even replace, manual contract audits?
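Reentrancy is a good example of a vulnerability with a clear, pattern-recognizable signature. The following is a minimal Python simulation (not Solidity; class and function names are invented for illustration) of the classic send-before-update flaw, where an external call is made before the balance is zeroed:

```python
class Vault:
    """Toy model of a contract that pays out before updating state."""
    def __init__(self):
        self.balances = {}

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount

    def withdraw(self, who, on_receive):
        amount = self.balances.get(who, 0)
        if amount > 0:
            on_receive(amount)        # external call BEFORE the state update
            self.balances[who] = 0    # too late: the callback may re-enter

vault = Vault()
vault.deposit("attacker", 100)

stolen = []
def malicious_callback(amount):
    stolen.append(amount)
    if len(stolen) < 3:               # re-enter while the balance still reads 100
        vault.withdraw("attacker", malicious_callback)

vault.withdraw("attacker", malicious_callback)
assert sum(stolen) == 300             # drained three times the real balance
```

Because the pattern is purely structural (external call precedes the state write), it is exactly the kind of bug that model-based scanners detect reliably.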
Because general-purpose models lack understanding of project-specific business logic, their false positive rate is high on complex DeFi protocols, and they easily miss vulnerabilities that require understanding cross-contract interactions or economic models. The industry therefore proposed a "Skill" mechanism: on top of the general large model, inject a specialized knowledge base of smart contract security rules, detection patterns, and business context, so that during audits the model judges code against explicit criteria rather than relying on general ability alone.
Even with Skill enhancement, AI auditing has clear boundaries of applicability. It excels at scanning for known vulnerability patterns and checking code standards, but it still struggles with complex vulnerabilities that require a thorough understanding of overall protocol design, cross-contract interaction logic, or economic models. Such issues still require experienced audit experts, and in scenarios involving complex computational logic, formal verification must also be introduced to provide stronger guarantees. Against this backdrop, Beosin has constructed a triple audit model of Skill-enhanced AI baseline checks + manual deep audits + formal verification, with each layer focusing on different aspects and complementing the others.

1. Audit Capability Boundaries of General AI Models: Controlled Comparisons and Case Analysis
For test cases, this article selects two categories of contracts of differing complexity from a database of completed manual audits: simple contracts with relatively independent logic and clear functional boundaries, which represent the scenario where AI auditing tools have the most training data and the greatest theoretical advantage; and complex contracts involving multi-contract interactions, complex state machines, or cross-protocol dependencies, which are the high-risk scenarios usually raised when asking "Can AI replace manual audits?"
For comparison, we used the exact same codebase: the AI first ran an audit independently and generated a report, which we then aligned item by item with the manual audit report. The two reports were produced completely independently (auditors did not see the AI's results when writing theirs), avoiding mutual influence. Finally, we analyzed the results along four dimensions.

Case A · Standard Token Contract (BSC-USDT / BEP20USDT.sol)
In the first test group, we selected a standard BEP-20 token contract written in Solidity 0.5.16. Its logic is relatively independent, with clear functional boundaries and no cross-contract interactions; the main security risks are concentrated in common, known vulnerability patterns. In theory, this contract type is the most favorable scenario for AI audits: such standard token contracts are abundant in training data, and their patterned vulnerability characteristics are relatively obvious.

The AI generated 6 warnings in total (2 critical, 1 medium, 3 low/suggestion), a reasonably substantial output. The low and suggestion items were basically accurate, covering common code-standard issues such as the outdated Solidity version and how state variables are exposed, and offered some reference value. However, both "critical" items were misjudgments. The AI flagged the owner's minting power and centralized permissions as critical vulnerabilities, yet for centralized stablecoins such as USDT, owner minting power is expected design; risk assessment should holistically consider multisig control, permission governance mechanisms, and contract upgrade strategy. Whether such a permission structure is reasonable ultimately depends on the project's business model rather than the code itself, a layer of context the AI lacks, leaving it to judge by pattern matching alone.

This test case shows that while the AI can recognize permission structures, it cannot judge whether those permissions are reasonable in their business context. Flagging the owner's minting power of a USDT-type contract as a "critical vulnerability" is a typical misjudgment detached from actual business logic, and such false positives can interfere with a project team's assessment of real risks.
Case B · Complex Business Contract (IPC Protocol / 2025-02-recall)
The second test group used the IPC Protocol project from the publicly available reports on the Code4rena platform (report link: code4rena.com/reports/2025-02-recall). The project comprises several interdependent core components, including a Gateway, a SubnetActor, and Diamond proxy patterns, and its security relies heavily on a deep understanding of the overall protocol architecture and cross-component interaction logic, the kind of architecture where high-value attacks typically occur in the DeFi ecosystem. The AI audit results are shown below:

On the complex contract, the AI audit produced 3 critical and 6 medium alerts, a respectable output volume. However, a considerable proportion were judged false positives by the auditors: the AI made incorrect risk judgments on code snippets it read without context. Meanwhile, of the 9 high-severity vulnerabilities confirmed by auditors, the AI fully covered only 1; 2 were detected but with clearly underestimated severity (actually High, reported by the AI as Medium), and the remaining 6 were missed entirely. Of the 4 medium-severity vulnerabilities, the AI covered 1, while 3 were completely missed.
The common characteristic of these vulnerabilities is that they all depend on complete reasoning over the protocol's cross-component state transition paths, not on pattern matching within a single function. Take H-01 (signature replay) from the manual audit report: exploiting it requires understanding the design intent of multisig verification, how an attacker constructs a repeated signature set, and how that behavior circumvents the weight threshold. H-06 (reentrancy in the leave() function) is similar: the vulnerability exists only in the critical state during subnet bootstrap, requiring an understanding of the interactions among the staking flow, bootstrap trigger conditions, and the timing of external calls. No such deep logical vulnerabilities appeared anywhere in the AI's alert list.
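The failure mode behind an H-01-style signature replay can be sketched abstractly. In the hypothetical Python model below (function and validator names are invented, not taken from the IPC Protocol code), a weight check that sums signer weights without deduplicating signers lets one validator's signature, repeated, clear a two-thirds threshold:

```python
def approve_naive(signatures, weights, threshold):
    """Flawed: sums weight per signature without checking signer uniqueness."""
    total = sum(weights[signer] for signer, _sig in signatures)
    return total >= threshold

def approve_fixed(signatures, weights, threshold):
    """Fixed: each signer's weight may be counted at most once."""
    seen = set()
    total = 0
    for signer, _sig in signatures:
        if signer in seen:
            return False          # duplicate signer: reject outright
        seen.add(signer)
        total += weights[signer]
    return total >= threshold

weights = {"v1": 30, "v2": 30, "v3": 40}   # threshold of 67 requires ~2/3 of 100
replayed = [("v1", "sig1")] * 3            # one validator's signature, repeated

assert approve_naive(replayed, weights, 67) is True    # replay passes the threshold
assert approve_fixed(replayed, weights, 67) is False   # deduplication blocks it
```

The bug only becomes visible once you know the weight threshold's design intent, which is exactly the context a pattern-matching scan of the verification function alone does not have.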

This outcome indicates that in complex contract audits, the AI's capability is confined to local code pattern recognition; at the protocol level it misreads the overall business logic. When a vulnerability's trigger conditions span multiple contracts, states, and call layers, current AI reasoning capabilities remain inadequate.
Taken together, the two cases show that AI auditing is not without value: it contributes meaningful coverage of known vulnerability patterns, code-standard checks, and some independent perspectives. But its value boundary is very clear: it can serve as a baseline scan, not as a direct security conclusion. For complex protocols, relying solely on AI reports for security judgments would not only miss high-risk vulnerabilities but also consume substantial triage time on low-quality alerts. This is precisely why Beosin built a dedicated Skill knowledge base and introduced the triple audit model into its auditing process.
2. Dedicated Skill Knowledge Base: Engineering Path to Enhance AI Baseline Checks
To incorporate AI into the baseline-check stage of the audit process, the high false positive and false negative rates on real DeFi protocols must first be addressed. Whether the target is permission management, AMM liquidity mechanics, cross-chain bridge message validation, or liquidation logic in lending protocols, AI today can only match superficial code characteristics and struggles to judge whether a piece of code is problematic given its specific business scenario and attack-defense logic. The core solution is to inject the accumulated experience of audit experts into the AI's judgment process in a structured way, giving it a degree of business understanding.
It must be clear, however, that even with Skill enhancement, AI's positioning in audits does not change. For complex issues involving multi-contract interactions, economic model analysis, and novel attack techniques, manual audits remain irreplaceable. The role of Skill is to raise the quality of preliminary scans to a genuinely useful level within the range AI can handle (recognizing common vulnerability patterns, with limited understanding of business logic), delivering more valuable preliminary results to human auditors rather than a pile of ineffective alerts to sift through.
2.1 Extracting from Audit Practice: Mechanism for Building Skill Rules
Beosin's Skill knowledge base is distilled from more than 4,000 smart contract projects that have completed manual audits, summarized and validated extensively by audit experts. Every rule follows the same path from vulnerability to rule: after discovering a security issue in a real project, auditors fully reconstruct the attack path, analyze the root cause in depth, verify that the remediation is effective, and finally organize this whole body of attack-defense understanding into a rule item with contextual judgment conditions, which is added to the Skill library for subsequent audits.
Below is a sample rule from the Skill library, structured along the dimensions of vulnerability pattern, attack path, root cause, and remediation suggestion.
[Beosin-AMM_Skill-1] Detection of Adding Liquidity Bypassed by Transfer Order
Vulnerability Pattern: The contract infers that an add-liquidity operation is in progress by checking whether the Pair's WBNB balance exceeds its reserves (balanceOf >= reserve + required). This check assumes WBNB is transferred to the Pair before the token, but the Router's addLiquidityETH function always transfers the ERC-20 token before WETH, and the transfer order in addLiquidity is determined by the parameter order.
Attack Path: The attacker simply uses addLiquidityETH (where the token is always transferred first), or calls addLiquidity(Token, WBNB, ...) so the token reaches the Pair before WBNB. At check time, WBNB has not yet arrived, balanceOf == reserve, and the detection function returns false, completely bypassing the "no add liquidity" restriction.
Root Cause: A check based on a snapshot of the Pair's balance fundamentally cannot distinguish a swap from an add-liquidity operation; this is a structural design flaw, not an implementation bug.
Remediation Suggestion: Prohibit non-whitelisted addresses from transferring directly to the Pair and require all transactions to go through built-in contract functions, eliminating the inherent defect of balance-snapshot detection at the architectural level.
This rule is not a simple annotation of a single code pattern but a systematic treatment of a class of attacks: how the trigger conditions arise, how attackers bypass the check, at which stage the detection mechanism is structurally flawed, and at which level remediation must intervene.
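The detection flaw this rule describes can be reduced to a few lines. The following Python sketch (a deliberate simplification; the function name and numbers are invented, not Beosin's actual rule code) models the balance-snapshot check and shows how the transfer order bypasses it:

```python
def is_add_liquidity(wbnb_balance, wbnb_reserve, required_wbnb):
    """Flawed check from the rule: infers 'add liquidity' from a WBNB
    balance snapshot, assuming WBNB always reaches the Pair first."""
    return wbnb_balance >= wbnb_reserve + required_wbnb

reserve, required = 1_000, 50

# Expected ordering: WBNB arrives first, then the check runs -> detected.
assert is_add_liquidity(reserve + required, reserve, required) is True

# Bypass via addLiquidityETH ordering: the ERC-20 token arrives first,
# WBNB has not arrived yet, so balanceOf == reserve and the check fails.
assert is_add_liquidity(reserve, reserve, required) is False
```

No single-function pattern match can catch this: the flaw lies in an ordering assumption about a different contract (the Router), which is exactly the contextual knowledge the Skill rule encodes.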
2.2 Coverage of the Knowledge Base
Beosin has built a specialized Skill vulnerability library covering the mainstream Web3 technology stack, with major categories for Solidity, Rust, Motoko, FunC, Go, and ZK. As an internal core asset, its full content is not publicly available; the directory structure is as follows:

Within each specialized library, skills are managed by vulnerability type, and each rule carries an identifier, trigger conditions, a reconstructed attack path, contextual judgment logic, and remediation suggestions. The Skill library continues to iterate as new attack incidents emerge and audit cases accumulate, keeping it in sync with the real on-chain threat environment.
2.3 Comparison of Baseline Check Quality After Skill Intervention
To quantify the Skill library's actual impact on baseline scan quality, we ran both the general AI and the Skill-enhanced AI on the same codebases for the two test cases from Chapter 1 and compared the results in detail.
Comparison Results for Case A · Standard Token Contract (BEP-20):

Comparison Results for Case B · Complex Business Contract (IPC Protocol):

The comparison shows a noticeable improvement in detection quality for both contract types after introducing Skill. In the standard token contract scenario, business-context judgment entirely eliminated the critical false positives; in the complex business contract scenario, coverage of known vulnerability patterns rose from 11% to 44%, the false positive rate dropped from roughly 55% to about 30%, and severity judgments became significantly more accurate. Such a report can serve as a baseline check, helping project teams spot defects in the code early. Although these issues may not cause immediate financial loss, they play an important positive role in future maintenance and upgrades.
However, the data also clearly exposes the inherent boundaries of AI capability: even with Skill enhancement, coverage of high-severity vulnerabilities in the complex contract reached only 44%. Deep vulnerabilities that require reasoning over cross-contract state paths, economic incentive analysis, or specific sequencing conditions to trigger remain well beyond the reach of AI baseline scans. This is precisely why a complete manual auditing step is retained in the process even after introducing Skill enhancement.
2.4 White Papers as Audit Input: Consistency Verification Between Code Implementation and Design Intent
In addition to the vulnerability feature library, we have added an important capability to the audit process: leveraging the project’s white paper as additional input to allow AI to validate the consistency between code implementation and white paper design.
Specifically, prior to the start of code auditing, AI systematically parses the project's white paper, technical specifications, and requirement documentation to extract role permission models, core business processes, trust boundary definitions, and expected behavior constraints, forming a structured project context summary. Subsequently, throughout the code auditing process, AI continuously references this context for cross-comparison. This mechanism has yielded two valuable results in practical use:
First, for permission structures in the code that appear to harbor risks, if the white paper has already explicitly stated its design intent and constraints, AI will adjust its judgment accordingly, effectively reducing such false positives.
Second, when the code implementation deviates significantly from the white paper's commitments, for example a slippage protection mechanism promised in the documentation but absent from the code, or governance time-window constraints not correctly enforced, the AI raises a warning. Such code-documentation inconsistencies are easily overlooked in conventional code scans, yet they often conceal security pitfalls, and flagging them helps project teams avoid post-launch behavior that diverges from expectations.
3. Triple Audit Model: Collaboratively Building a Comprehensive Guarantee for Smart Contract Security
Once a smart contract is deployed on-chain, the cost of any vulnerability is often irreversible. Beosin uses manual deep audits + formal verification as the foundation of contract auditing, focused on discovering and reporting issues that could already lead to financial losses or abnormal logical behavior. On top of this, we introduced Skill-enhanced AI baseline checks, built on the dedicated Skill knowledge base, to help clients identify code defects that have not yet caused actual damage. Together these form Beosin's triple audit model of manual deep audits + formal verification + Skill-enhanced AI baseline checks, a more comprehensive security assurance system built through the layered collaboration of the three.
3.1 Manual Deep Audits and Formal Verification: Core Pillars of Security Assurance
The core advantage of manual auditing lies in deep understanding of overall protocol design and proactive analysis of potential risks from an attacker's perspective. Experienced audit experts conduct comprehensive protocol-level audits, including verification of cross-contract interaction logic, analysis of the attack surface around fund security, analysis of protocol behavior under extreme market conditions, and identification and assessment of novel attack methods. This protocol-level grasp of offense and defense depends heavily on long-term accumulation and hands-on experience in the Web3 ecosystem and cannot currently be accomplished by tools alone.
Building on this foundation, Beosin turns the conclusions of manual audits into quantifiable mathematical guarantees via an internal toolchain. For core business logic that audit experts confirm as the highest-risk critical paths, such as fund flows and price calculations, Beosin deeply integrates LLM-driven formal specification generation into its internal verification toolchain, forming a closed-loop engine of "AI specification generation → exhaustive formal verification → counterexample-driven refinement." The toolchain first uses Beosin's accumulated audit corpus as a knowledge base to model the attack surface of the expert-confirmed high-risk paths, helping generate an initial candidate set of formal invariants and safety property specifications; an automatic formal verification engine then exhaustively verifies the contract's complete state transition space.

When the engine finds a counterexample, the system distinguishes two scenarios. If the counterexample stems from a mismatch between the specification and the business semantics, its context is fed back to the AI module for specification refinement, driving the next iteration. If the counterexample corresponds to a genuinely exploitable path in the contract code, it is output directly as vulnerability evidence, with a complete attack-path reproduction for auditors to confirm and remediate. Both paths drive the closed loop toward convergence, until the target property is mathematically confirmed to hold for all possible inputs. Critical paths verified through this mechanism constitute the strongest deterministic line of defense in the entire contract security system, compressing the attack surface to a very narrow range.
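The exhaustive-verification-with-counterexamples idea can be illustrated with a toy bounded check in Python. The state space, transition function, and invariant below are invented for illustration and bear no relation to Beosin's actual toolchain; the point is only the loop structure of "verify everything, return the first violating path":

```python
from itertools import product

def check_invariant(transition, invariant, states, inputs):
    """Exhaustively explore every (state, input) pair; return the first
    counterexample violating the invariant, or None if it always holds."""
    for state, inp in product(states, inputs):
        nxt = transition(state, inp)
        if not invariant(nxt):
            return (state, inp, nxt)   # counterexample fed back to the loop
    return None

# Toy contract rule: a balance must never go negative.
def transition(balance, withdrawal):
    return balance - withdrawal        # buggy: no lower-bound check

invariant = lambda balance: balance >= 0
cex = check_invariant(transition, invariant, states=range(10), inputs=range(10))
assert cex == (0, 1, -1)               # smallest violating (state, input, result)
```

In the real pipeline the counterexample is then triaged: either the specification is refined (the invariant was wrong) or the transition is fixed (the code was wrong); here, clamping the withdrawal would make `check_invariant` return `None`.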
3.2 Enhanced AI Baseline Checks: Continuous Risk Alert Service for Developers
Beosin also offers the Skill-based enhanced AI baseline check as a standalone service. Unlike manual deep audits, which focus on identifying critical vulnerabilities, this service is positioned as a code health report for development teams. The AI baseline scan covers the entire contract codebase, systematically cataloguing issues that do not directly cause financial loss today but deserve developers' attention in subsequent maintenance and iteration: use of outdated dependency libraries, missing critical event declarations, state-variable exposure that deviates from best practice, and gas usage patterns that could be optimized further. Such issues are typically not exploitable under the current business logic, but as the protocol's functionality extends, the code is refactored, or external dependencies are updated, some of them may gradually evolve into real security hazards. The three layers each emphasize different aspects and progress layer by layer, jointly forming a comprehensive security assurance system for Web3 projects.