
It's not a one off issue - it has happened to me a few times. It has once even force pushed to github, which doesn't allow branch protection for private personal projects. Here's an example.

1) claude will stash (despite clear instructions never to do so).

2) claude will use sed to bulk replace (despite clear instructions never to do so). sed replacements make a mess and touch far too many files.

3) claude restores the stash. Finds a lot of conflicts. Nothing runs.

4) claude decides it can't fix the problem and does a reset hard.

I have this right at the top of my CLAUDE.md and it makes things better, but unlike codex, claude doesn't follow it to the letter. However, it has become a lot better now.

NEVER USE sed TO BULK REPLACE.

*NEVER USE FORCE PUSH OR DESTRUCTIVE GIT OPERATIONS*: `git push --force`, `git push --force-with-lease`, `git reset --hard`, `git clean -fd`, or any other destructive git operations are ABSOLUTELY FORBIDDEN. Use `git revert` to undo changes instead.




When will you all learn that merely "telling" an LLM not to do something won't deterministically prevent it from doing that thing? If you truly want it to never use those commands, you better be prepared to sandbox it to the point where it is completely unable to do the things you're trying to stop.

Even worse, explicitly telling it not to do something makes it more likely to do it. It's not intelligent. It's a probability machine writ large. If you say "don't git push --force", that command is now part of the context window, dramatically raising the probability of it being "thought" about and of it appearing in the output.

Like you say, the only way to stop it from doing something is to make it impossible for it to do so. Shove it in a container. Build LLM safe wrappers around the tools you want it to be able to run so that when it runs e.g. `git`, it can only do operations you've already decided are fine.
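A minimal sketch of what such a wrapper around `git` might look like, assuming it is installed ahead of the real binary on the agent's PATH. The path in `REAL_GIT`, the allow-list, and the forbidden flags are all illustrative assumptions, and the policy is not exhaustive:

```python
import os
import sys

REAL_GIT = "/usr/bin/git"  # assumption: location of the real git binary
# Hypothetical policy: only these subcommands may run at all.
ALLOWED_SUBCOMMANDS = {"status", "diff", "log", "show", "add", "commit"}
FORBIDDEN_FLAGS = {"--force", "-f", "--force-with-lease", "--hard"}


def check(args):
    """Return (allowed, reason) for a git argument list, excluding argv[0]."""
    if not args:
        return False, "no subcommand given"
    subcommand = args[0]
    if subcommand.startswith("-"):
        # Global options such as -c or -C can change behaviour; refuse them outright.
        return False, f"global option {subcommand!r} is not allowed"
    if subcommand not in ALLOWED_SUBCOMMANDS:
        return False, f"subcommand {subcommand!r} is not on the allow-list"
    for arg in args[1:]:
        # Compare the part before "=" so --force-with-lease=origin/main is caught too.
        if arg.split("=", 1)[0] in FORBIDDEN_FLAGS:
            return False, f"flag {arg!r} is forbidden"
    return True, "ok"


def main(argv):
    """Refuse disallowed invocations, otherwise hand over to the real git."""
    allowed, reason = check(argv)
    if not allowed:
        print(f"git wrapper: refused ({reason})", file=sys.stderr)
        return 1
    # Replace this process with the real git; does not return on success.
    os.execv(REAL_GIT, [REAL_GIT, *argv])
```

To use it as the `git` the agent sees, the file would end with `sys.exit(main(sys.argv[1:]))`. A wrapper like this only holds if the agent cannot reach the real binary by another path, which is why it belongs inside a container and not in place of one.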


Even even worse, angry all-caps shouting will make it more stupid, because it pushes you into a significantly stupider vector subspace full of angry all-caps shouting. The only thing that can possibly save you then is if you land in the even tinier Film Crit Hulk sub-subspace.

I touch on this a bit in the piece I wrote for normies, it helped a lot of people I know understand the tech a bit better.


Is this true for anything beyond the simplest LLM architectures? It seems like as soon as you introduce something like CoT this is no longer the case, at least in terms of mechanism, if not outcome.

This is true for prohibitions but claude.md works really well as positive documentation. I run custom mcp servers and documenting what each tool does and when to use it made claude pick the right ones way more reliably. Totally different outcome than a list of NEVER DO THIS rules though, for that you definitely need hooks or sandboxing.

Yes, but this is probabilistic. Skills, documentation, etc. help by giving it the information it needs. You are then in the more correct probability distribution. Fine for docs, tips, etc., but not good enough for mandatory things.

"more reliably" is still not "reliably".

The phrase "don't give them ideas" comes to mind.

Feels like a lot of people are still treating these tools like “smart scripts” instead of systems with failure modes.

Telling it not to do something is basically just nudging probabilities. If the action is available, it’s always somewhere in the distribution.

Which is why the boundary has to be outside the model, not inside the prompt.


Agree completely. The middle ground between "please don't" and full sandboxing: run a validation script between agent steps. The agent writes code, a regex check catches banned patterns, the agent has to fix them before it can proceed. Sandboxing controls what the agent can do. Output validation controls what it gets to keep. Both are more reliable than prompt instructions.
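A rough sketch of that kind of check between agent steps, assuming the harness can hand the agent's output to a function and feed the report back to it. The rule names and patterns here are made up for illustration; a real list would be project-specific:

```python
import re

# Hypothetical banned patterns, keyed by a human-readable rule name.
BANNED_PATTERNS = {
    "debug print": re.compile(r"^\s*print\("),
    "bare except": re.compile(r"^\s*except\s*:"),
    "TODO marker": re.compile(r"\bTODO\b"),
}


def find_violations(source):
    """Return (line_number, rule_name, line_text) for every banned pattern hit."""
    violations = []
    for number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in BANNED_PATTERNS.items():
            if pattern.search(line):
                violations.append((number, name, line.strip()))
    return violations


def format_report(violations):
    """Build the text handed back to the agent so it can fix what it wrote."""
    return "\n".join(f"line {n}: {name}: {text}" for n, name, text in violations)
```

The harness would refuse to advance to the next step while `find_violations` returns anything, so the rule is enforced by the loop and not by the prompt.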

That’s right, because we’re not developers anymore— we orchestrate writhing piles of insane noobs that generally know how to code, but have absolutely no instinct or common sense. This is because it’s cheaper per pile of excreted code while this is all being heavily subsidized. This is the future and anyone not enthusiastically onboard is utterly foolish.

My point is exactly that you need safeguards. (I have VMs per project, reduced command availability etc). But those details are orthogonal to this discussion.

However "Telling" has made it better, and generally the model itself has become better. Also, I've never faced a similar issue in Codex.


> sandbox it to the point where it is completely unable to do the things you're trying to stop

Why are permissions for these "agents" on a default allow model anyway?


What do you mean? By default, Claude asks for permission for every file read, every edit, every command. It gets exhausting, so many people run it with `--dangerously-skip-permissions`.

It does not ask for permission for every file read, only those outside the project and not explicitly allowed. You can bypass project edit permission requests with “allow edits”, no need for “dangerously skip permissions”. Bash commands are harder, but you can allow-list them up to a point.

> so many people run it with `--dangerously-skip-permissions`

It's on the people then, not the "agent". But why doesn't Claude come with a decent allow list, or at least remember what the user allows, so the spam is reduced?


You have the option to "always allow command `x.*`", but even then. The more control you hand over to these things, the more powerful and useful (and dangerous) they become. It's a real dilemma and yet to be solved.
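To illustrate why command allow-lists only work "up to a point", here is a small sketch comparing a naive prefix match with a stricter check. This is not how Claude Code matches commands; the prefixes and function names are hypothetical:

```python
import shlex

# Hypothetical allow-list of command prefixes, as argument lists.
ALLOWED_PREFIXES = [["git", "status"], ["git", "diff"], ["ls"]]
# Characters that chain, redirect, or substitute commands in a shell.
SHELL_METACHARACTERS = set(";&|<>()`$\n")


def naive_allows(command):
    """Prefix match on the raw string: easy to write, easy to fool."""
    return any(command.startswith(" ".join(prefix)) for prefix in ALLOWED_PREFIXES)


def stricter_allows(command):
    """Refuse any shell metacharacter, then compare whole tokens."""
    if SHELL_METACHARACTERS & set(command):
        return False  # conservative: also rejects harmless quoted uses
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
```

The naive version accepts `git status && git push --force` and even `lsof`, while the stricter one rejects both. Even then, an allowed command can still carry a harmful argument, so this narrows the problem without solving it.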

I use a wrapper script for git in my PATH for claude, but as you correctly said, I can't be sure claude will never start a new zsh with a different PATH....

Why do you expect that a weighted random text generator will ever behave in a predictable way?

How can people be so naive as to run something like Claude anywhere other than in a strictly locked down sandbox that has no access to anything but the single git repo they are working on (and certainly no creds to push code)?

This is absolutely insane behavior that you would give Claude access to your GitHub creds. What happens when it sees a prompt injection attack somewhere and exfiltrates all of your creds or wipes out all of your repos?

I can't believe how far people have fallen for this "AI" mania. You are giving a stochastic model that is easily misdirected the keys to all of your productive work.

I can understand the appeal to a degree, that it can seem to do useful work sometimes.

But even so, you can't trust it with anything. Not running it in a locked-down container that has no access to anything but a Git repo, with all important history stored elsewhere, seems crazy.

Shouting harder and harder at the statistical model might give you a higher probability of avoiding the bad behavior, but no guarantee; actually lock down your random text generator properly if you want to avoid it causing you problems.

And of course, given that you've seen how hard it is to get it to follow these instructions properly, you are reviewing every line of output code thoroughly, right? Because you can't trust that either.


> How can people be so naive as to run something like Claude anywhere other than in a strictly locked down sandbox that has no access to anything but the single git repo they are working on (and certainly no creds to push code)?

> This is absolutely insane behavior that you would give Claude access to your GitHub creds. What happens when it sees a prompt injection attack somewhere and exfiltrates all of your creds or wipes out all of your repos?

I don’t understand why people are so chill about doing this. I have AI running on a dedicated machine which has absolutely no access to any of my own accounts/data. I want that stuff hardware isolated. The AI pushes up work to a self-hosted Gitea instance using a low-permission account. This setup is also nice because I can determine provenance of changes easily.


> How can people be so naive as to run something like Claude anywhere other than in a strictly locked down sandbox that has no access to anything but the single git repo they are working on (and certainly no creds to push code)?

Because it’s insanely useful when you give it access, that’s why. They can do way more tasks than just write code. They can make changes to the system, set up and configure routers and network gear, probe all the IoT devices in the network, set up DNS, you name it: anything that is text or has a CLI is fair game.

The models absolutely make catastrophic fuckups though and that is why we’ll have to both better train the models and put non-annoying safeguards in front of them.

Running them in isolated computers that are fully air gapped, require approval for all reads and writes, and can only operate inside directories named after colors of the rainbow is not a useful suggestion. I want my cake and I want to eat it too. It’s far too useful to give these tools some real access.

It doesn’t make me naive or stupid to hand the keys over to the robot. I know full well what I’m getting myself into and the possible consequences of my actions. And I have been burned but I keep coming back because these tools keep getting better and they keep doing more and more useful things for me. I’m an early adopter for sure…


Well, one of the other reasons I suggest running it in a strictly limited container is that you can then run it in yolo mode.

In fact, I use the pi agent, which doesn't have command sandboxing; it's always in yolo mode. I just run it in a container, so I get the benefit of not having to confirm every command while strictly controlling what I share with it from the beginning of the session.


The answer is that for these people most of the time it looks predictable so they start to trust it

The tool is so good at mimicking that even smart people start to believe it


Claude Code hooks are deterministic; the agent can’t bypass them [1].

For example, you can force a linter or the tests to run.

Claude Code defaults to running in a sandbox on macOS and Linux. Claude Cowork runs in a Linux VM.

[1]: https://code.claude.com/docs/en/hooks-guide


> How can people be so naive as to run something like Claude anywhere other than in a strictly locked down sandbox that has no access to anything but the single git repo they are working on (and certainly no creds to push code)?

Because it is much easier to do and failure rate is quite low.

(not saying that it is a good idea)


Trust issues start at home.

If you can't trust yourself, you will never be able to trust anyone else.

If you believe the AI is out to get you, that's certainly the reality you will manifest.


> It has once even force pushed to github, which doesn't allow branch protection for private personal projects.

This is only restricted for *fully free* accounts; the feature requires at least a paid Pro account. That starts around $4 USD/month, which sounds worth it to prevent lost work from a runaway tool.


I was on one till recently, maybe I still am. But does it work for orgs? I put some projects under orgs when they become more than a few projects.

That's a fee for not running a local git proxy with permissions enforcement that holds onto the GitHub credentials in place of Claude.
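The enforcement half of such a proxy might be a `pre-receive` hook on a local bare repository that the agent pushes to instead of GitHub. A sketch under that assumption; forwarding accepted pushes upstream with the real credentials is left out, and the function names are hypothetical:

```python
import subprocess


def is_zero(sha):
    """git uses an all-zero object id to mean "this ref does not exist"."""
    return set(sha) == {"0"}


def is_ancestor(old, new):
    """True when `old` is reachable from `new`, i.e. the update is a fast-forward."""
    result = subprocess.run(["git", "merge-base", "--is-ancestor", old, new])
    return result.returncode == 0


def rejected_updates(lines, ancestor_check=is_ancestor):
    """Return refs whose update would delete a ref or rewrite its history.

    Each line uses the pre-receive format: "<old-sha> <new-sha> <ref-name>".
    """
    rejected = []
    for line in lines:
        old, new, ref = line.split()
        if is_zero(old):
            continue  # brand-new ref, nothing to overwrite
        if is_zero(new) or not ancestor_check(old, new):
            rejected.append(ref)
    return rejected
```

Run as the hook itself, the script would pass `sys.stdin` to `rejected_updates` and exit non-zero when the list is not empty, which makes git refuse the whole push.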

Do you know of a good ready-made implementation of such a proxy? I’ve been looking for one.

GitHub is also a worry in terms of exfiltration. You can’t block pushes to public repos unless you are using GitHub Enterprise Managed Users afaict.


Or putting the code and .git in a sandbox without the credentials

Reinforcing an avoidance tactic is nowhere near as effective as doing that PLUS enforcing a positive tactic. People with loads of 'DONT', 'STOP', etc. in their instructions have no clue what they're doing.

In your own example you have all this huge emphasis on the negatives, and then the positive is a tiny un-emphasized afterthought.


I think you're generally correct, but certainly not definitively, and I worry the advice and tone aren't helpful in this instance with an outcome of this magnitude.

(more loosely: I'm a big proponent of this too, but it's a helluva hot take; how one positively frames "don't blow away the effing repo" isn't intuitive at all)


The trick is to explain why something is important, not just to emphasize it. For instance:

"As an LLM, when Claude uses 'sed', it can quickly and easily break files that are difficult for the user to fix. Claude must be aware that an LLM's actions seem effortless to it, but to the user they can mean hours of work getting things back in order."


Claude tends to disregard "NEVER do X" quite often, but funnily enough, if you tell it "Always ask me to confirm before doing X", it never fails to ask you. And you can deny it every time.

If it disregards "NEVER do" instructions, why would it honor your denial when it asks?

You mean like in this example? https://web.archive.org/web/20260313042512/https://gist.gith...

There is never a guarantee with GenAI. If you need to be sure, sandbox it.


There are plenty of examples in the RL training showing it how and when to prompt the human for help or additional information. This is even a common tool in the "plan" mode of many harnesses.

Conversely, it's much harder to represent a lack of doing something


Because it’s just fancy auto-complete.

This is why I use yoloAI (https://github.com/kstenerud/yoloai).

    $ yoloai new bugfix . -a --network-isolated --agent claude
Now I have a claude code session that only has a COPY of my work dir, and can't reach anything over the network except the Claude API server.

Now I interact with the agent, and when it's done:

    $ yoloai diff bugfix
    diff --git a/b64.go b/b64.go
    index cfc5549..253c919 100644
    --- a/b64.go
    +++ b/b64.go
    @@ -39,7 +39,7 @@ func Encode(data []byte) string {
        val |= uint(data[i+2])
       }

    -  out[j] = alphabet[(val>>18)&0x3E]
    +  out[j] = alphabet[(val>>18)&0x3F]
       out[j+1] = alphabet[(val>>12)&0x3F]

       remaining := n - i
Looks good, let's apply it:

    $ yoloai apply bugfix
    Target: /home/ks/tmp/b64

    Commits to apply (1):
      9db260b33bcd Fix bit mask in base64 encoding

    Apply to /home/ks/tmp/b64? [y/N] y
    1 commit(s) applied to /home/ks/tmp/b64
Now the commit claude made inside the sandbox has been applied to my workdir:

    $ git log
    commit 5b0fc3a237efe8bbc9a9e1a05f9ce45d37d38bfa (HEAD -> main)
    Author: Karl Stenerud <kstenerud@gmail.com>
    Date:   Mon Mar 30 05:28:21 2026 +0000

        Fix bit mask in base64 encoding

        Corrected the bit mask for the first character extraction from 0x3E to 0x3F to properly extract all 6 bits.

        Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

    commit 31e12b62b0c3179f3399521d7c4326a8f6130721 (tag: init)
The important thing here is that Claude was not able to reach anything on the network except its own API, and nothing it did ever touched my work dir until I was happy with the changes and applied them.

It also doesn't get access to my credentials, so it couldn't push even if it did have network access.


> which doesn't allow branch protection for private personal projects.

Time for a personal Forgejo instance? Mine has been running great for more than a year. Faster than GitHub even.


I don't understand how people in this day and age have not learned what the pink elephant problem is.

If you tell AI not to do something, you make it incomprehensibly more likely it will happen.

Use affirming language. Why do you think negative prompts don't exist in diffusion anymore?


I've recently implemented hooks that make it impossible for Claude to use tools that I don't want it to use. You could consider setting up a tool that errors if it makes an unsafe use of sed (or any use of sed if there are safer tools).

Even just last week I auto approved a plan and it even wrote the commit message for me (with @ClaudeCode signed off) which I am grateful my manager did not see.

Claude does not know my github ssh key. I'll do the push myself, thank you. Always good to keep around one or two really important things it can't do.

Like for humans, teaching the good way to do things works better than forbidding a few bad behaviours.

Maybe stop using the CLAUDE.md to prevent it from running tools you don't want it to and just set up a PreToolUse hook that blocks any command you don't want.

It's trivial to set up and you could literally ask claude to do it for you and never have any of these issues ever again.
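A sketch of what such a hook script could look like. It assumes, per my reading of the hooks guide linked upthread, that the hook receives a JSON event on stdin with `tool_name` and `tool_input.command`, and that exit code 2 blocks the call and passes stderr back to the model; check those details against the docs. The deny-list itself is hypothetical:

```python
import json
import re
import sys

# Hypothetical deny-list: (pattern, label shown to the model when blocked).
BLOCKED = [
    (re.compile(r"\bgit\s+push\b.*(--force|\s-f\b)", re.DOTALL), "force push"),
    (re.compile(r"\bgit\s+reset\b.*--hard", re.DOTALL), "git reset --hard"),
    (re.compile(r"\bgit\s+clean\b"), "git clean"),
    (re.compile(r"\bgit\s+stash\b"), "git stash"),
    (re.compile(r"\bsed\b.*\s(-i|--in-place)", re.DOTALL), "in-place sed"),
]


def blocked_reason(event):
    """Return a label if the event is a Bash call matching the deny-list, else None."""
    if event.get("tool_name") != "Bash":
        return None
    command = event.get("tool_input", {}).get("command", "")
    for pattern, label in BLOCKED:
        if pattern.search(command):
            return label
    return None


def main(stdin=sys.stdin):
    """Read one hook event and return the exit code the hook should use."""
    reason = blocked_reason(json.load(stdin))
    if reason:
        print(f"Blocked by hook: {reason}", file=sys.stderr)
        return 2  # assumed to mean "block this tool call"
    return 0
```

As a script, the file would end with `sys.exit(main())` and be registered as a PreToolUse hook for the Bash tool in Claude Code's settings. Regex matching on a command string can be evaded (aliases, scripts written to disk and then executed), so this is a guard rail on top of a sandbox, not a replacement for one.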

Any and all "I don't want it to ever run this command" issues are just skill issues.


How does that stop Claude from removing the hook and then running the command anyway?

That's nothing like the issue of the main topic

"DO NOT, EVER, UNDER ANY CIRCUMSTANCES, think of an elephant"


