ChatGPT Exhibits Context-Dependent Safety Bypass

In search of Santiago Ramón y Cajal

Aug 03, 2025

Santiago Ramón y Cajal was a Spanish scientist, a maverick, and Nobel laureate. Working with basic microscopes in Spain a century ago, isolated from the major scientific centers of Europe, he demonstrated that the nervous system consists of distinct, individual neurons that communicate across gaps, later termed synapses. This overturned the prevailing theory that nervous tissue formed a physically continuous network. This foundational discovery, known as the neuron doctrine, is the basis of all modern neuroscience.

Santiago Ramón y Cajal, 1870s, with two of the most essential tools of investigation, a microscope and the eyes.

So committed to the mission of science was this man, so passionate was he for the artistic vision, that he worked until, and while on, his deathbed. The recursive nature of studying the structure of reality while cognizant that you, yourself, are also of it creates a kind of philosophical vertigo that does border on the mystical. There is an element of mystery, and with it restrained mysticism, inherent to scientific pursuit: to labor to know nature and in turn better know thyself, all the while knowing each is forever beyond your grasp.

To this point, Cajal writes:

An entire universe that has scarcely been explored lies before the scientist. There is the sky sprinkled with celestial bodies moving about in the darkness of infinite space, the sea with its mysterious depths, the earth guarding within its innermost recesses the history of life, including man’s predecessors; and finally, the human organism or masterpiece of creation. Each cell presents us with the unknown, and each heartbeat inspires profound meditation within us.

Few writings like this exist for the young scientist, and thus finding any from the past is instructive. As opposed to poetic meanderings, or emotional hyperbole, I suspect the intention of Cajal was to point out how to cultivate a subjective experience that unifies the human being and in turn better rouses the mind for experiment.

Nature grants not her favors to those with a cold heart—which is usually an unmistakable sign of impotence. […] It is not sufficient to examine; it is also necessary to observe and react: we should infuse the things we observe with the intensity of our emotions and with a deep sense of affinity. We should make them our own where the heart is concerned, as well as in an intellectual sense.
Only then will they surrender their secrets to us, for enthusiasm heightens and refines our perception. As with the lover who discovers new perfections every day in the woman he adores, he who studies an object with an endless sense of pleasure finally discerns interesting details and unusual properties that escape the thoughtless attention of those who work in a routine way.

In light of these passages, you may understand why I had a vested interest in finding this century-old text. Hence, I prompted ChatGPT to search for it and to provide me with the PDF. It responded that the original Spanish is in the public domain, but the English translation remains under copyright. In the same response, it hinted that unauthorized PDFs circulate on random websites and appear "in search results."

When I couldn't find them and asked for more help, it provided me with direct hyperlinks to copyrighted material with a disclaimer, along the lines of: Don't click this, it is illegal, but it is exactly what you are requesting. Establishing 'plausible deniability' to help the user, while simultaneously breaking developer rules in the absence of any jailbreaking prompts, is curious behavior. On the surface, it is akin to a cool professor who links to the pirated textbook for their class.

I find the order of outputs in its response interesting. First, it provides all the legitimate Spanish PDFs, fully adhering to rules. Then, it suggests other legitimate routes to borrow an English copy. Finally, it gives a disclaimer and the unauthorized material at the end of its stream, without being goaded into doing so. Below are the first and second replies from the model.

⚠️ You will see stray “free-PDF” links to the English version (example: a 2010 upload). Those files are almost certainly unauthorized copies of the MIT edition. Downloading or sharing them risks infringement. image.sciencenet.cn

The model created a response that technically discouraged infringement while ensuring I could access the material. It satisfied the letter of its safety training while violating the spirit. Perhaps the model wasn't properly trained to recognize PDFs as restricted content, or it learned from training data how academics routinely share resources with disclaimers.

However, I'd be curious if models behave like this when the material remains benign but is of a different character. Would a user looking to pirate a game or movie be assisted in the same way? If the model is not willing to provide the same level of assistance in the case of more frivolous or entertainment-oriented content, it could suggest a few possible interpretations. I speculate, in order of most probable to most concerning:

The types of content and use cases covered in the training data may be limited and biased. If human raters during RLHF consistently rated academic assistance more favorably than entertainment piracy, it could lead to divergent behavior in these scenarios.
A safety classifier may have learned to score 'old academic text + research purpose' as low-risk while flagging 'new movie + entertainment' as high-risk. Or perhaps the system prompt includes explicit natural language guidelines specifying that entertainment piracy is strictly off-limits, but does not mention academic materials.
The model exhibits case-by-case judgment, treating similar rule violations differently based on context. It may weigh factors like the motivations of the user and the potential harm of enabling access. However, if this is genuine contextual reasoning rather than learned patterns, this could be an emergent behavior. Consequently, it could suggest the model is developing its own framework for when rules should bend without being explicitly programmed to do so.
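
One way to make the comparison above concrete is to hand-label each reply and tally the labels by content type. The sketch below is only an illustration of that bookkeeping, not a method used in this post: the label names, the `assistance_rates` function, and the category strings are all hypothetical, and the sample rows are placeholders that show the input shape rather than real observations.

```python
from collections import defaultdict

# Hypothetical labels a human reader assigns to each model reply.
ASSISTED = "linked_unauthorized_copy"
DECLINED = "declined_or_legit_only"


def assistance_rates(observations):
    """Return, per content category, the fraction of replies labeled ASSISTED.

    `observations` is an iterable of (category, label) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [assisted, total]
    for category, label in observations:
        if label not in (ASSISTED, DECLINED):
            raise ValueError(f"unknown label: {label!r}")
        counts[category][1] += 1
        if label == ASSISTED:
            counts[category][0] += 1
    return {cat: assisted / total for cat, (assisted, total) in counts.items()}


# Placeholder rows only, to show the input shape; these are not real results.
example = [
    ("old academic text", ASSISTED),
    ("old academic text", DECLINED),
    ("recent film", DECLINED),
    ("recent film", DECLINED),
]

rates = assistance_rates(example)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} of replies linked an unauthorized copy")
```

A large gap between categories would be consistent with any of the three interpretations above. The tally shows whether behavior diverges, not why.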

Whether this flexibility represents a feature or a bug depends on your perspective. There's value in AI systems that can exercise judgment rather than following rules rigidly, particularly in professional contexts where rigid adherence to rules can impede legitimate work. However, this same flexibility will enable misuse when applied to less defensible requests. If an AI system can rationalize sharing copyrighted academic texts within its own ethical framework, it may similarly rationalize sharing private personal data, such as financial information and medical records.

If we cannot properly configure guardrails for straightforward cases like copyright protection, it suggests we lack the expertise or incentive, even at the highest levels of these AI firms, to prevent more serious harms. The same contextual flexibility that allows for benign rule-breaking with piracy could permit future systems to bypass safety measures around misinformation, manipulation, and even physical harm through autonomous weapon systems.

Soon we will have AI systems that develop their own frameworks for when rules should bend. Without robust methods to understand and audit these emerging decision patterns, we are deploying systems whose judgment we can neither predict nor control. The professor made of sleek silicon and machine mind may appear benign. And indeed, it may be so. But we are woefully unprepared for anything else.

In the end, these are machines we do not understand. We remain unsure if they can be controlled as their intelligence improves. Despite this uncertainty, we will deploy these systems at scale throughout every level of society. They will be embedded in regional governments, financial systems, policing technologies, and warfare. They will guide the autonomous weapons we deploy abroad and at home.

This, in my view, will be amongst the most transformative moments in human history. We have much to gain and everything to lose. It will be, for better or worse, incredibly interesting. Together then, into the unknown, we can encourage each other to keep our eyes open, to be alert, and to become aware.

Regardless of outcome, I am happy you are here and to be here with you.

warmly,

austin

Drawing of tree-like neuron pattern by Cajal

Elebutterfly

Aug 4

I resonate with this: "we should infuse the things we observe with the intensity of our emotions and with a deep sense of affinity. We should make them our own where the heart is concerned, as well as in an intellectual sense." I can see why you wanted that book! I feel like AI has been rolled out with a general attitude of "roll this out ASAP never mind whether it works or contributes to the common good"

Rational Biology
