Can AI Earn $400,000 by Programming?


AI's potential to replace programmers is not as great as the hype suggests.

Author: Tan Zixin, Top Technology

Image source: Generated by Wujie AI

Large language models (LLMs) are changing the way software development is done, and whether AI can massively replace human programmers has become a hot topic in the industry.

In just two years, AI models have evolved from solving basic computer science problems to competing with human experts in international programming competitions. For example, OpenAI's o1 participated in the 2024 International Olympiad in Informatics (IOI) under the same conditions as human contestants and successfully won a gold medal, showcasing its strong programming potential.

At the same time, AI's pace of iteration is accelerating. On the code-generation benchmark SWE-Bench Verified, GPT-4o scored 33% in August 2024, but the new-generation o3 model has more than doubled that score, to 72%.

To better measure AI models' real-world software engineering capabilities, OpenAI today open-sourced a brand-new evaluation benchmark called SWE-Lancer, which for the first time links model performance to monetary value.

SWE-Lancer is a benchmark comprising over 1,400 freelance software engineering tasks from the Upwork platform, with real-world payouts totaling approximately $1 million. So how much can AI earn by programming?
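The payout-weighted scoring the article describes can be sketched in a few lines of Python. The task payouts and pass/fail flags below are hypothetical examples; in the real benchmark, a model "earns" a task's payout only if its solution passes all of that task's end-to-end tests.

```python
# Sketch of SWE-Lancer-style payout-weighted scoring (task data is
# hypothetical). A model earns a task's real-world payout only if
# its patch passes every end-to-end test for that task.
tasks = [
    {"payout": 250,   "tests_passed": True},   # small reliability fix
    {"payout": 1000,  "tests_passed": False},  # permissions bug
    {"payout": 16000, "tests_passed": False},  # cross-platform feature
]

earnings = sum(t["payout"] for t in tasks if t["tests_passed"])
total = sum(t["payout"] for t in tasks)
print(f"earned ${earnings} of ${total} "
      f"({earnings / total:.1%} earn rate)")
```

Because scoring is weighted by market-priced payouts, solving one large feature task can matter more than solving many small fixes.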

Features of the New Benchmark

The prices of SWE-Lancer benchmark tasks reflect real market value; the more difficult the task, the higher the reward.

It includes both independent engineering tasks and management tasks, in which the model must choose between competing technical implementation proposals. The benchmark is thus aimed not only at programmers but at entire development teams, including architects and managers.

Compared to previous software engineering testing benchmarks, SWE-Lancer has several advantages, such as:

  1. All 1,488 tasks represent real compensation paid by employers to freelance engineers, providing a natural, market-determined difficulty gradient, with rewards ranging from $250 to $32,000.

Among them, 35% of the tasks are valued at over $1,000, and 34% of the tasks are valued between $500 and $1,000. The Individual Contributor (IC) Software Engineering (SWE) tasks group contains 764 tasks with a total value of $414,775; the SWE Management tasks group contains 724 tasks with a total value of $585,225.

  2. Large-scale software engineering in the real world requires not only coding ability but also capable technical management. The benchmark uses real-world data to evaluate models acting as SWE "technical leads."

  3. It offers advanced full-stack engineering evaluation. SWE-Lancer reflects real-world software engineering because its tasks come from platforms with millions of real users.

The tasks involve mobile and web engineering development, interaction with APIs, browsers, and external applications, as well as validation and reproduction of complex problems.

For example, some tasks include spending $250 to improve reliability (fixing double-triggered API call issues), $1,000 to fix vulnerabilities (resolving permission discrepancies), and $16,000 to implement new features (adding in-app video playback support on web, iOS, Android, and desktop).

  4. Domain diversity. 74% of IC SWE tasks and 76% of SWE management tasks involve application logic, while 17% of IC SWE tasks and 18% of SWE management tasks involve UI/UX development.

In terms of task difficulty, the tasks selected by SWE-Lancer are very challenging, with tasks in the open-source dataset averaging 26 days to resolve on GitHub.

Additionally, OpenAI stated that the data collection is unbiased; they selected a representative sample of tasks from Upwork and hired 100 professional software engineers to write and validate end-to-end tests for all tasks.

AI Coding Earning Ability Comparison

Despite many tech leaders continuously claiming that AI models can replace "junior" engineers, whether companies can fully replace human software engineers with LLMs remains a big question mark.

The first evaluation results show that on the complete SWE-Lancer dataset, the frontier models tested so far earn far less than the potential total reward of $1 million.

Overall, all models perform better on SWE management tasks than on IC SWE tasks, and IC SWE tasks remain largely unsolved by AI models. The best-performing model tested so far is Claude 3.5 Sonnet, developed by OpenAI competitor Anthropic.

On IC SWE tasks, every model's pass@1 rate and earnings rate fall below 30%, while Claude 3.5 Sonnet, the best performer on SWE management tasks, scored 45% there.

Claude 3.5 Sonnet shows strong performance on both IC SWE and SWE management tasks, outperforming the second-best model, o1, by 9.7 percentage points on IC SWE tasks and by 3.4 percentage points on SWE management tasks.

In terms of earnings, the best-performing model, Claude 3.5 Sonnet, earned more than $400,000 in total on the complete dataset.

It is noteworthy that additional reasoning compute substantially boosts "AI earnings."

On IC SWE tasks, experiments with the o1 model showed that increasing test-time reasoning compute raised the pass@1 rate from 9.3% to 16.5%, earnings from $16,000 to $29,000, and the earnings rate from 6.8% to 12.1%.
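The two metrics quoted here, the single-attempt pass rate (pass@1) and the earnings rate (dollars earned over dollars available), can be sketched as follows. The payouts and results below are hypothetical examples, not the reported figures.

```python
# Sketch of the two headline SWE-Lancer metrics: pass@1 and the
# earnings rate. All numbers here are hypothetical illustrations.

def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks whose single attempt passed all tests."""
    return sum(results) / len(results)

def earnings_rate(payouts: list[int], results: list[bool]) -> float:
    """Dollars earned divided by total dollars at stake."""
    earned = sum(p for p, ok in zip(payouts, results) if ok)
    return earned / sum(payouts)

payouts = [250, 500, 1000, 16000]
results = [True, True, False, False]
print(pass_at_1(results))               # 0.5
print(earnings_rate(payouts, results))  # 750 / 17750, about 0.042
```

Note how the two metrics can diverge: solving half the tasks by count yields only about 4% of the money when the failures are the high-value tasks.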

The researchers concluded that while the best model, Claude 3.5 Sonnet, solved 26.2% of IC SWE problems, the majority of its remaining solutions still contained errors, and reliable deployment will require substantial improvement. Claude 3.5 Sonnet is followed by o1 and GPT-4o, with pass@1 rates on management tasks typically more than double those on IC SWE tasks.

This also means that even though the idea of AI agents replacing human software engineers is heavily promoted, companies still need to think twice. AI models can solve some "junior" coding problems, but they cannot replace "junior" software engineers, because they cannot understand the reasons behind certain code errors and go on to make further, compounding mistakes.

The current evaluation framework does not support multimodal input, and the researchers have not yet assessed "return on investment" (for example, comparing what a freelancer is paid to complete a task with the API cost of having a model attempt it), which will be a key focus for future improvements to the benchmark.

Be an "AI-Enhanced" Programmer

At present, AI still has a long way to go before it can truly replace human programmers, as developing a software engineering project is not just about generating code as required.

For example, programmers often encounter extremely complex, abstract, and vague client requirements, which require a deep understanding of various technical principles, business logic, and system architecture. When optimizing complex software architectures, human programmers can comprehensively consider factors such as future scalability, maintainability, and performance, which AI may find difficult to analyze comprehensively.

Moreover, programming is not just about implementing existing logic; it also requires a lot of creativity and innovative thinking. Programmers need to conceive new algorithms, design unique software interfaces and interaction methods, and such truly novel ideas and solutions are AI's weak point.

Programmers also need to communicate and collaborate with team members, clients, and other stakeholders, requiring an understanding of various needs and feasibility, clearly expressing their views, and working with others to complete projects. Additionally, human programmers possess the ability to continuously learn and adapt to new changes; they can quickly grasp new knowledge and skills and apply them to real projects, while a successful AI model requires various training tests.

The software development industry is also subject to various legal and regulatory constraints, such as intellectual property, data protection, and software licensing. Artificial intelligence may find it difficult to fully understand and comply with these legal requirements, potentially leading to legal risks or liability disputes.

In the long run, the potential for job replacement for programmers due to advancements in AI technology still exists, but in the short term, "AI-enhanced programmers" are the mainstream, and mastering the use of the latest AI tools is one of the core skills of excellent programmers.

