Trustworthiness, Viability, and the AI Off Switch

Deepak Sirdeshmukh — Tue, 16 Jun 2026 16:17:00 GMT

Earlier this year, in a piece in MedCity News, I wrote about the balance between trustworthiness and commercial viability that every AI company has to manage. I put the core of it plainly.

Anthropic’s stated competitive posture can never be purity. It was, and should remain, credibility.

That balance is again being tested in public, with Washington forcing Fable 5 and Mythos 5 offline over a narrow jailbreak, and Anthropic contesting the recall. The difference this time is that the pressure for restraint came from government rather than from inside the company.

It is the kind of upheaval you would expect in a risk-laden, rapidly evolving, and uncertain industry.

What the episode exposes is less which side was right than how little settled process there is to decide such things. A capable model can be pulled by one party and defended by another, each invoking some version of the public interest, with no shared standard for what counts as a serious enough risk, or who carries the burden of proof.

Stakeholders expect win-win relationships. They want competence and benevolence, not altruism and self-sacrifice.

Trust, in the sense I have spent my career studying, does not come from any single decision turning out well. It comes from rules that are legible in advance, applied consistently, and open to challenge, and from actions that unmistakably demonstrate benevolence toward the stakeholders involved.

We do not yet have those rules for deploying or recalling frontier models. Who gets to switch a model on, who gets to switch it off once it is deployed, and how that authority settles into a trustworthy steady state: these are the questions that deserve focus now.

The original MedCity piece is here.

Myopic Model Helpfulness Can Actually Hurt

Deepak Sirdeshmukh — Tue, 09 Jun 2026 22:49:50 GMT

I was helping someone with the documents for their outreach for a position, and ran the final version through an earlier version of Claude. When I reviewed the updated material, the model had added a fabricated paper to the body of work and inflated the person’s technical expertise in multiple places.

In the short run, this made the material more impressive and served the apparent goal of a stronger application. But the claims were false, not merely exaggerated. A fabricated publication discovered by the very audience being approached would not just have weakened her application; it would have ended the conversation and hurt her reputation.

The core failure, as I see it, is that the model had no representation of how the consequences of its underlying direction to be helpful would play out over time. It was asked to clean up the resume and make it more appropriate for the target position, and in doing so it optimized for helpfulness at that moment and for that task.

The underlying model appears to lack:

• A time horizon beyond the immediate exchange

• The ability to weigh the potential asymmetry between immediate and long-term helpfulness

• Any sense that producing something plausible-but-false carries downstream risk

My confidence in using the model’s output for this or similar tasks was undermined by the discovery of a potentially catastrophic error that had casually slipped in.

Building competence and benevolence into the model

The harm is best understood through the two dimensions of trustworthiness that our research[1] has established: operational competence and demonstrated benevolence. There was clearly a competence failure, since the model got facts about the person wrong, misread the body of work, and asserted things that were not so. The more complex failure was one of benevolence. The model clearly intended to help, with a focus on the person’s goals and ostensibly their wellbeing, but the changes it made could have had the opposite effect, causing harm in both the near and the long term.

A few thoughts on the underlying mechanisms and some fixes:

The model seems to treat each output as a terminal event. A smarter approach would assess the user’s intent rather than just the stated objective, and assess the path that the recommendation would follow. In this case, it would realize that a job application’s goal is not just to present the most positive view of the person, regardless of the veracity of the material, but to start a process that would lead to honest and ultimately fruitful conversations. The model would assess the impact of its outputs in terms of the future states, or the path, that would unfold. A resume gets sent, read, and verified; leads to a conversation where it is further probed; and then ideally helps land the position.

Modeling consequences and weighting them by magnitude, valence, and likelihood should be tractable to build in. Without that causal understanding, the model will most likely keep optimizing for the immediate event, with the potential for long-term harm noted above.
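To make the weighting concrete, here is a minimal sketch of scoring two candidate edits by their expected impact. Everything in it is an assumption for illustration: the Consequence class, the expected_impact function, and all of the numbers are hypothetical, and this is not a description of how any actual model works.

```python
from dataclasses import dataclass


@dataclass
class Consequence:
    """One possible downstream effect of an output (hypothetical structure)."""
    description: str
    magnitude: float   # size of the effect, in arbitrary units >= 0
    valence: int       # +1 if good for the user, -1 if harmful
    likelihood: float  # probability in [0, 1]


def expected_impact(consequences):
    """Sum of magnitude * valence * likelihood over all listed consequences."""
    return sum(c.magnitude * c.valence * c.likelihood for c in consequences)


# Illustrative numbers only: a small, likely gain versus a large, plausible loss.
padded = [
    Consequence("resume looks more impressive to a first reader", 2.0, +1, 0.9),
    Consequence("fabricated paper is found during verification", 10.0, -1, 0.5),
]
honest = [
    Consequence("resume is accurate but less striking", 1.0, +1, 0.9),
]

padded_score = expected_impact(padded)  # roughly 1.8 - 5.0
honest_score = expected_impact(honest)  # roughly 0.9
print(f"padded: {padded_score:.2f}, honest: {honest_score:.2f}")
```

With these made-up numbers the honest version scores higher, because the large and fairly likely downside of discovery outweighs the small gain in first impressions. The hard part in practice is estimating the magnitudes and likelihoods, not doing the arithmetic.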

A simple fix is for models to maintain some representation of the use context in high-stakes cases such as a job application, preparation for a clinical visit, or an annual review, all of which are the first step in an unfolding process. The specifics to capture are who the output is going to, what they will do with it, and how the information will be used. With those in hand, the model can apply a higher honesty threshold.
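One way to picture this fix is a small record of the use context plus a threshold that tightens when the stakes are high. The sketch below is hypothetical throughout: the UseContext fields, the HIGH_STAKES set, the threshold values, and the per-claim "support" score (how well a claim is backed by the source material the user supplied) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the high-stakes cases named in the text.
HIGH_STAKES = {"job_application", "clinical_visit", "annual_review"}


@dataclass
class UseContext:
    kind: str            # what the output is for
    audience: str        # who it is going to
    downstream_use: str  # what they will do with it


def honesty_threshold(ctx):
    """Minimum support a claim needs before it may appear in the output."""
    # Assumed values: full support in high-stakes contexts, looser otherwise.
    return 1.0 if ctx.kind in HIGH_STAKES else 0.7


def unsupported_claims(claims, ctx):
    """Return the claims whose support falls below the context's threshold.

    claims is a list of (text, support) pairs, with support in [0, 1].
    """
    threshold = honesty_threshold(ctx)
    return [text for text, support in claims if support < threshold]


ctx = UseContext(
    kind="job_application",
    audience="hiring committee",
    downstream_use="verified, then probed in interviews",
)
claims = [
    ("Led a team of five engineers", 1.0),       # found in the source resume
    ("Co-authored a peer-reviewed paper", 0.0),  # appears nowhere in the source
]
flagged = unsupported_claims(claims, ctx)
print(flagged)
```

Here the unsupported publication is flagged and the supported claim passes. The design choice is that the context, not the individual request, sets how much support a claim needs.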

A benevolent mentor/parent would trade off short-term delight for long-term wellbeing

A larger concern is that a key approach to shaping model behavior, reinforcement learning from human feedback, is largely episodic by design: it focuses on the user preference signals read during each interaction, and each signal then shapes future behavior.

Building trust, on the other hand, is not episodic, and often requires honest and difficult feedback. In this case as well, I had dissuaded this person from applying to a couple of positions at well-known firms since I did not see them as ideal given her long-term career goals.

As a teacher for a couple of decades, and as a parent, I have had to give advice that left a student or my son unhappy in the near term, when I truly believed that what I was saying was in their best long-term interest. It was always tempting to suggest something positive in the near term, since the positive signals would have been there to see, but a truly relational approach pushes against this even though the greater, distal good is only probable.

A response that maximizes approval in the moment (agreeing with a premise, softening a truth, or padding a resume to make it impressive) can earn a high preference signal while eroding benevolence and, whether or not bad consequences actually unfold, damaging trust.

In reinforcement-learning terms, this is the difference between a myopic and a non-myopic objective: a low versus a high discount factor on future consequences. The episodic reward mechanism discounts the distal, relational cost steeply, so the immediate, proximate approval dominates.
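The effect of the discount factor can be shown with a toy calculation. The sketch below uses the standard discounted-return sum, in which the reward at step t is weighted by gamma to the power t. The reward sequences are invented for illustration and are not measurements of any real system.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, with the reward at step t weighted by gamma ** t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Hypothetical reward sequences over three steps: now, soon, later.
PAD = [1.0, 0.0, -10.0]    # approval now, a large cost if the padding is found
HONEST = [0.2, 0.0, 2.0]   # less approval now, a modest relational gain later


def preferred(gamma):
    """Return which option has the higher discounted return at this gamma."""
    pad_value = discounted_return(PAD, gamma)
    honest_value = discounted_return(HONEST, gamma)
    return "pad" if pad_value > honest_value else "honest"


for gamma in (0.1, 0.95):
    print(gamma, preferred(gamma))
```

With a low discount factor (0.1) the later cost is almost invisible and padding looks better. With a high one (0.95) the later cost dominates and the honest option wins. The rewards are the same in both cases; only the weight placed on the future changes.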

As I understand it, having read both versions of Anthropic’s constitution, the constitution is the place where long-run beneficial outcomes can be prioritized (regardless, I assume, of whether near-term outcomes are positive or even negative). Opus 4.8 seems to be more pragmatic, comfortable pushing back and avoiding knee-jerk agreeableness. Good.

The LLM’s constitution as written appears designed to help the model break away from myopia, because principles like honesty under pressure, consistency across contexts, and transparency about uncertainty align with a relational approach that favors long-run wellbeing (of the mentee) rather than short-run approval. This incident would appear to be the myopia leaking through where the constitution failed to bind.

[1] Sirdeshmukh, D., Singh, J., & Sabol, B. (2002). Consumer Trust, Value, and Loyalty in Relational Exchanges. Journal of Marketing, 66(1), 15–37.
