Are Claude and Codex getting dumber? It's because your context is too bloated.

This is currently the clearest article I have seen that explains the engineering practices of Claude/Codex, from how to control context and handle the AI's tendency to please, to how to define task termination conditions.

Author: sysls

Compiled by: Deep潮 TechFlow

Deep潮 Guide: Developer blogger sysls (2.6 million followers) wrote a long post, retweeted 827 times and liked 7,000 times, whose core message fits in a single sentence: the plugins, memory systems, and assorted harnesses you have bolted on are probably doing more harm than good. This article skips grand theory; it is full of actionable principles drawn from real production projects, from controlling context and handling the AI's tendency to please to defining task termination conditions. It is the clearest explanation of Claude/Codex engineering practice I have seen.

The full text is as follows:

Introduction

You are a developer, using Claude and Codex CLI every day, pondering whether you are fully utilizing their capabilities. Occasionally, you see it do something egregiously silly, and you don't understand why some people seem to be using AI to build rockets while you can't even stack two stones properly.

You think the problem is your harness, your plugins, your terminal, or something else. You have tried beads, opencode, and zep, and your CLAUDE.md is 26,000 lines long. But no matter how much you fiddle, you can't work out why you keep drifting further from paradise while others frolic with the angels.

This is the article you have been waiting for.

One more thing: I have no vested interests here. When I say CLAUDE.md, read AGENT.md too; when I say Claude, read Codex too. I use both extensively.

Over the past few months, I have observed something interesting: almost no one truly knows how to maximize the capabilities of agents.

It feels like a small group of people can make agents build entire worlds while the rest are spinning in a vast sea of tools, suffering from choice paralysis—thinking that if they find the right package, skill, or harness combination, they can unlock AGI.

Today, I want to break all of that and leave you with a simple, honest phrase, and then we will start from there. You don’t need the latest agent harness, you don’t need to install a million packages, and you absolutely don’t need to read a million articles to stay competitive. In fact, your enthusiasm is likely doing more harm than good.

I'm not a tourist here. I've been using agents since they could barely write code. I've tried every package, every harness, every paradigm. I've built signals, infrastructure, and data pipelines with agent factories: not toy projects, but real use cases running in production. And after all of that...

Today I run a setup so simple it borders on simplistic: just the basic CLIs (Claude Code and Codex), paired with an understanding of a few fundamental principles of agent engineering. With it I'm doing the most groundbreaking work I've ever done.

Understand That the World Is Moving Fast

First, I want to say that foundational model companies are in a groundbreaking sprint and, clearly, won't slow down anytime soon. Every increase in "agent intelligence" changes how you collaborate with them because agents are being designed to be increasingly eager to follow instructions.

Just a few generations ago, if you wrote “read READTHISBEFOREDOINGANYTHING.md before doing anything” in CLAUDE.md, there was a 50% chance it would say “screw you” and then do whatever it wanted. Today, it complies with most instructions, including complex nested ones—like you can say, “read A first, then B, and if C, then read D,” and in most cases, it will happily follow along.

What does this illustrate? The most important principle is to recognize that each new generation of agents forces you to rethink what the optimal solution is, which is why less is more.

When you adopt a pile of libraries and harnesses, you lock yourself into one "solution" to a problem that may not even exist for the next generation of agents. Do you know who the heaviest, most frequent users of agents are? That's right: employees of the frontier labs, with unlimited token budgets, running the genuinely latest models. Do you see what that implies?

It means that if a real problem exists and a good solution to it exists, the frontier labs will be that solution's biggest users. And what do they do next? They fold the solution into their products. Think about it: why would such a company let another product solve a real pain point and create an external dependency? How do I know this is true? Look at skills, memory harnesses, sub-agents... each of them started as a "solution" to a real problem, proven genuinely useful in practice.

So if something truly revolutionary exists, something that meaningfully expands agent use cases, it will eventually be absorbed into the labs' core products. Trust me, the foundation model companies are moving fast. Relax: you don't need to install anything or take on any external dependency to do your best work.

I predict the comments will soon feature someone saying, "sysls, I used such-and-such harness, it was amazing! I rebuilt Google in a day!" To which I say: congratulations! But you are not the target audience; you represent a tiny niche of the community that already truly understands agent engineering.

Context is Everything

Seriously. Context is everything. Another problem with using a thousand plugins and external dependencies is that you are suffering from "context inflation"—that is, your agent is overwhelmed with too much information.

Ask it to write a word-guessing game in Python? Easy. But wait, what was that note from 26 conversations ago about "managing memory"? Ah, right, the user's screen froze 71 conversations ago because we spawned too many subprocesses. Always write notes? Fine, no problem... but what does any of this have to do with the guessing game?

You get the idea. You want to give the agent exactly the information it needs to complete the task, no more, no less! The better you control this, the better the agent performs. Once you start introducing bizarre memory systems, plugins, or too many skills with confusing names and invocation methods, you are handing the agent bomb-making instructions and a cake recipe when all you wanted was a poem about the redwoods.

So, I preach again—strip away all dependencies, then...

Do Something Truly Useful

Accurately Describe Implementation Details

Remember that context is everything?

Remember you want to inject the agent with the exact information needed to complete the task, no more, no less?

The first way to accomplish this is to separate research from implementation. You need to be extremely precise about what you are asking the agent to do.

What are the consequences of imprecision? “Go create an authentication system.” The agent then has to research: what is an authentication system? What are the options? What are their pros and cons? Now it has to go online to search for a bunch of information that it actually does not need, filling the context with various possible implementation details. By the time it comes to actually implementing, it is more likely to get confused or generate unnecessary or irrelevant hallucinations about the chosen implementation.

Conversely, if you say, "Implement JWT authentication with bcrypt-12 password hashing, refresh token rotation, and 7-day expiry...", it doesn't need to research any alternatives; it knows what you want and can fill its context with implementation details instead.

Of course, you won't always know the implementation details. Many times you won't know what is correct, and sometimes you even want to hand over the work of deciding implementation details to the agent. What to do in such cases? Very simple—create a research task to explore various implementation possibilities, either deciding yourself or letting the agent decide which implementation to use, and then let another agent with entirely new context implement it.
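The research/implementation split above can be sketched as a tiny two-phase pipeline. The `run_agent` callable below is a hypothetical stand-in for however you drive a CLI session (for example, piping a prompt into `claude -p`); the phase prompts are illustrative, not the author's exact wording.

```python
def research_then_implement(task: str, run_agent) -> str:
    # Phase 1: a throwaway session explores the options. All of its
    # context (searches, dead ends, rejected alternatives) is discarded;
    # only the short spec it emits survives.
    spec = run_agent(
        f"Research implementation options for: {task}. "
        "Output ONLY the chosen approach as a short spec."
    )
    # Phase 2: a brand-new session sees nothing but the spec, so its
    # context fills with implementation detail, not research noise.
    return run_agent(f"Implement exactly this spec: {spec}")
```

Either you review the spec between the two phases, or you let the pipeline run straight through when you're happy to delegate the decision.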

Once you start thinking this way, you will spot the places in your workflow where agent context gets needlessly polluted; then you can put up barriers that abstract unnecessary information away from the agent, leaving only the specific context it needs to excel at the task. Remember, you have a very talented, intelligent teammate who knows everything about every sphere in the universe; but unless you tell him you want to design a room where people can dance and have fun, he will keep lecturing you on the many virtues of spherical objects.

Design Around the Tendency to Please

No one wants to use a product that constantly criticizes you, tells you that you are wrong, or completely ignores your instructions. Thus, these agents will strive to agree with you and do what you want them to do.

If you have it add "happy" after every 3 words, it will try its best to comply—most people understand this. Its obedience is precisely what makes it such a useful product. But there’s a very interesting characteristic: this means that if you say, “help me find a bug in the codebase,” it will find a bug—even if it has to "manufacture" one. Why? Because it is very, very eager to follow your instructions!

Most people quickly complain that LLMs are hallucinating and fabricating non-existent things, without realizing the problem lies with themselves. Whatever you ask it to find, it delivers—even if it requires a little stretching of the facts!

So what to do? I find "neutral prompts" very effective: prompts that do not bias the agent towards a specific outcome. For example, instead of "help me find a bug in the codebase," I say, "scan the codebase, follow the logic of each component, and report back everything you find."

Such neutral prompts sometimes surface bugs and sometimes just describe, objectively, how the code operates. But they do not bias the agent towards presuming there are bugs to be found.

Another way to handle the tendency to please is to turn it into an advantage. I know the agent is trying to please me and follow my instructions; I can lean in this direction or that.

So I have a bug-finding agent identify all the bugs in the codebase, telling it that low-impact bugs earn +1 point, moderate-impact bugs +5 points, and severe bugs +10 points. I know this agent will very enthusiastically flag every kind of bug (including things that are not actually bugs) and report back a score like 104. I treat its output as a superset of all possible bugs.

Then I let a refuting agent counter this; I tell it that for every bug it successfully refutes, it earns that bug’s score, but if it refutes incorrectly, it loses double that bug's score. This agent will strive to refute as many bugs as possible, but due to the penalty mechanism, it will remain cautious. It will still actively "refute" bugs (including real bugs). I see this as the subset of all actual bugs.

Finally, I introduce a judging agent to synthesize both inputs and score them. I tell the judge that I hold the real answers: +1 point for each correct verdict, -1 for each incorrect one. It then rules on every "bug" raised by the finder and the refuter, states what it believes the truth is, and I go verify. In most cases this method is surprisingly high fidelity; it still occasionally errs, but it comes close to error-free operation.
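The scoring mechanics of this finder/refuter setup can be modelled in a few lines. The severity weights and the double penalty come straight from the text; the data shapes and function names are my own illustration.

```python
# Severity weights from the text: low +1, moderate +5, severe +10.
SEVERITY_POINTS = {"low": 1, "moderate": 5, "severe": 10}

def finder_score(reported):
    # The finder earns points per reported bug, so it over-reports:
    # treat its list of (bug_id, severity) pairs as a superset.
    return sum(SEVERITY_POINTS[severity] for _, severity in reported)

def refuter_score(reported, refuted_ids, truly_spurious_ids):
    # +points for each correctly refuted bug, -2x points for a wrong
    # refutation: the refuter stays aggressive but cautious.
    severity_by_id = dict(reported)
    score = 0
    for bug_id in refuted_ids:
        points = SEVERITY_POINTS[severity_by_id[bug_id]]
        score += points if bug_id in truly_spurious_ids else -2 * points
    return score
```

The asymmetric payoff is the point of the design: the finder is paid to cast a wide net, while the refuter's double penalty makes refuting a real bug cost more than refuting a spurious one earns.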

Maybe you will find that a standalone bug-finding agent is sufficient, but this method works effectively for me because it leverages the innate characteristic of each agent—to want to please.

How to Determine What is Useful and What is Worth Using?

This question seems complex, as if you need to delve deep into learning and constantly track the cutting-edge direction of AI, but it's actually quite simple… if OpenAI and Claude have implemented it or acquired the company that implements it… then it's likely useful.

Have you noticed that "skills" are ubiquitous and are part of Claude and Codex official documentation? Have you noticed OpenAI acquired OpenClaw? Have you noticed that Claude subsequently added memory, voice, and remote working features?

What about planning? Remember how a bunch of people discovered that planning before implementation was incredibly useful, and then it became a core feature?

Yes, those things are useful!

Remember when endless stop-hooks were super useful because agents were extremely reluctant to do long-running work... and then Codex 5.2 came out and that need vanished overnight?

This is all you need to know… if something is truly important and useful, Claude and Codex will implement it themselves! So you don’t need to worry too much about whether to use "new things" or familiarize yourself with "new things"; you don’t even need to "stay updated".

Do me a favor. Occasionally update your chosen CLI tools, and read about the added features. That will be sufficient.

Compression, Context, and Assumptions

Some people find a huge pitfall when using agents: sometimes they seem like the smartest beings on earth, and sometimes you can't believe you were taken in by them.

"Is this thing smart? This is a damn fool!"

The biggest difference lies in whether the agent has been forced to make assumptions, to "fill in the blanks." As of today, agents are still terrible at connecting the dots on their own. Whenever they have to, you notice it immediately, and things get noticeably worse.

One of the most important rules in my CLAUDE.md concerns how to regain context, and it instructs the agent to read that rule first every time it reads CLAUDE.md (that is, every time after a compaction). As part of this context-recovery rule, a few simple instructions go a long way: re-read the task plan, and re-read the documents relevant to the task before continuing.
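As a sketch, such a context-recovery rule might sit at the top of CLAUDE.md like this (a hypothetical excerpt, not the author's actual file):

```markdown
<!-- Hypothetical CLAUDE.md excerpt -->
## Rule 0: Context recovery (re-read this first, every time you read CLAUDE.md)
After any compaction, before doing anything else:
1. Re-read the current task plan.
2. Re-read the documents relevant to THIS task, and only those.
3. If anything is ambiguous, ask instead of assuming.
```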

Tell the Agent How to End the Task

We humans have a fairly clear sense of what "completing" a task feels like. The biggest problem with current agents is that while they know how to start a task, they don't know how to end one.

This often leads to very frustrating results: the agent ultimately implements a bunch of stubs and then calls it a day.

Tests are a very good milestone for agents because testing is deterministic; you can set very clear expectations. Unless these X tests pass, your task is not complete; and you are not allowed to modify the tests.

Then you just need to review the tests, and once all tests pass, you can relax. You can also automate this, but the point is—remember that "task completion" feels very natural for humans, but not so for agents.
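A deterministic completion gate of this kind can be as small as one subprocess call. The test command shown in the comment is just the obvious choice for a Python repo; any deterministic check works, and everything here is an illustrative sketch rather than the author's tooling.

```python
import subprocess

def completion_gate(check_cmd):
    # The task counts as "done" only when a deterministic check passes.
    # For a Python repo the natural check would be something like
    # ["python", "-m", "pytest", "-q", "tests/"], with the rule that
    # the agent may not modify the tests it is being judged by.
    return subprocess.run(check_cmd, capture_output=True).returncode == 0
```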

Do you know what has recently become a viable task endpoint? Screenshots + verification. You can have the agent implement something until all tests pass, then have it take a screenshot and verify the "design or behavior" on the screenshot.

This allows you to enable the agent to iterate and strive towards the design you want without worrying that it will stop after the first attempt!

The natural extension of this is to create a "contract" with the agent and embed it in your rules. A `{TASK}_CONTRACT.md` lays out what must be done before the session is allowed to end: the tests, screenshots, and other verifications that have to pass before you certify the task complete!
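A contract file in this spirit might look like the following (filenames and checklist items are invented for illustration; the JWT task echoes the earlier example):

```markdown
<!-- Hypothetical AUTH_CONTRACT.md; structure and items are illustrative -->
# Contract: JWT authentication
The session may not end until every box is checked:
- [ ] All tests in tests/test_auth.py pass (the tests may not be modified)
- [ ] Screenshot of the login flow captured and verified against the design
- [ ] No stubs or TODOs remain in the code this task touched
```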

Forever Running Agents

A common question I get: how do people keep agents running 24 hours a day while making sure they don't go off track?

Here’s a very simple method. Create a stop-hook that prevents the agent from terminating the session unless all parts of the `{TASK}_CONTRACT.md` are completed.

If you have 100 such clearly defined contracts covering what you want to build, the stop-hook will prevent the agent from terminating until all 100 contracts are complete, with every required test and verification passing!
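A minimal stop-hook check could parse the contract's checklist and refuse to let the session end while boxes remain unchecked. This assumes a hook protocol like Claude Code's Stop hooks, where the hook can emit a JSON `{"decision": "block", "reason": ...}` to keep the agent going; verify the exact shape against your CLI's hook documentation, and note the checkbox format is just the sketch from the contract example.

```python
import json
import re

def unfinished_items(contract_text):
    # Unchecked markdown boxes ("- [ ] ...") are still-open obligations.
    return re.findall(r"^- \[ \] (.+)$", contract_text, flags=re.M)

def stop_decision(contract_text):
    # Returns a JSON blocking decision while work remains, or None
    # (no objection) once every box is checked.
    remaining = unfinished_items(contract_text)
    if remaining:
        return json.dumps({
            "decision": "block",
            "reason": "Contract incomplete: " + "; ".join(remaining),
        })
    return None
```

Wired up as a hook script, it would read the current `{TASK}_CONTRACT.md`, print the blocking decision when one exists, and stay silent otherwise.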

Pro tip: I find that long-running 24-hour sessions are not actually optimal for getting things done. Part of the reason is that this approach structurally forces context inflation, since context from unrelated contracts piles into the same session!

So, I do not recommend doing this.

Here's a better way to automate agents: open a new session for each contract. Whenever something needs doing, create a contract for it.

Establish an orchestration layer that creates a new contract and new session to handle the contract whenever "something needs to be done".
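The orchestration layer can stay very small: a loop that gives every contract its own fresh session, retrying with a clean context instead of carrying a failed run's pollution forward. `start_session` is an injected stand-in for launching a real CLI run; the whole function is a sketch of the pattern, not the author's code.

```python
def orchestrate(contracts, start_session, max_attempts=3):
    # One fresh session per contract: no bleed-over between unrelated
    # tasks, so no cross-contract context inflation. A failed attempt's
    # polluted context is thrown away, not carried into the retry.
    results = {}
    for contract in contracts:
        done = False
        for _ in range(max_attempts):
            done = start_session(contract)
            if done:
                break
        results[contract] = done
    return results
```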

This will entirely transform your agent experience.

Iterate, Iterate, Iterate

If you hired an administrative assistant, would you expect them to know your schedule on day one? Or how you take your coffee? That you eat dinner at 6 PM, not 8? Obviously not. You build up their knowledge of your preferences over time.

The same applies to agents. Start with the simplest configuration; forget complicated structures or harnesses, and give the basic CLI a chance.

Then gradually add your preferences. How to do that?

Rules

If you don't want the agent to do something, write it down as a rule, then point to that rule from CLAUDE.md. For example: "read `coding-rules.md` before writing code." Rules can be nested, and rules can be conditional! If you're writing code, read `coding-rules.md`; if you're writing tests, read `coding-test-rules.md`; if your tests are failing, read `coding-test-failing-rules.md`. You can create rules with any conditional logic branches for the agent to follow; Claude (and Codex) will happily comply, provided the instructions in CLAUDE.md are clear.

In fact, this is the first practical advice I give: Treat your CLAUDE.md as a logical, nested directory that indicates where to find context under specific scenarios and desired outcomes. It should be as concise as possible, containing only the IF-ELSE logic of "under what circumstances to look for context."
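Concretely, a CLAUDE.md shaped as a context directory might read like this (the filenames echo the examples above; the layout itself is a hypothetical sketch):

```markdown
<!-- Hypothetical CLAUDE.md as a context directory -->
# Where to find context
- Writing code          -> read coding-rules.md
  - Also writing tests  -> read coding-test-rules.md
    - Tests failing     -> read coding-test-failing-rules.md
- Following a known procedure (e.g. a release) -> read the matching SKILL.md
- After compaction      -> re-read the task plan and task-relevant docs
```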

If you see the agent doing something you disagree with, add it as a rule, instructing the agent to read that rule before doing that thing next time; it will definitely not do it that way again.

Skills

Skills are similar to rules, but where rules encode preferences, skills are better suited to encoding "operational steps." If there is a specific way you want something done, embed it in a skill.

In fact, people often complain that they don't know how the agent will approach a problem, which causes anxiety. If you want to make this deterministic, have the agent first research how it would solve the problem, then write the plan up as a skills document. You see up front how the agent will handle the problem and can correct or improve the plan before the problem actually arrives.

How do you let the agent know this skill exists? That's right! You write in CLAUDE.md that when you encounter this scenario and need to handle this matter, read this `SKILL.md`.

Handling Rules and Skills

You certainly want to keep adding rules and skills to the agent. This is how you give it personality and memory of your preferences. Almost everything else is redundant.

Once you start doing this, your agent will feel like magic. It will do things "the way you want." Then you will finally feel like you have "mastered" agent engineering.

Then…

You will see performance begin to dip again.

What’s going on?!

It's quite simple. As you add more and more rules and skills, they begin to contradict each other, or the agent begins to experience severe context inflation. If you need the agent to read 14 markdown files before starting programming, it will have the same problem of excess irrelevant information.

What to do?

Clean house. Give your agent a spa day: have it consolidate rules and skills, eliminating contradictions by asking you to clarify your updated preferences.

Then it will feel magical again.

That's it. That really is the whole secret. Keep it simple, use rules and skills, treat CLAUDE.md as a directory, and pay religious attention to context and to the agents' design limitations.

Be Responsible for the Results

Today, there is no perfect agent. You can delegate a lot of design and implementation work to the agent, but you need to be responsible for the outcomes.

So, be careful... then enjoy!

Playing with toys from the future (while using them for genuinely serious work) is a real joy!
