Stealing Part of a Production Language Model

renonce · on March 12, 2024

It’s stealing the last layer (softmax head), not an arbitrary part, also it targets “production models whose APIs expose full logprobs, or a logit bias”. Not all language model APIs have these features and this characterizes what APIs can be targeted and what can’t. These important pieces of information should have been written in the title or abstract rather than “typical API access”.

cs702 · on March 12, 2024

It's still significant. When the Softmax head is the transpose of the embedding matrix, the proposed method enables extraction of the entire matrix of pretrained token embeddings from a black-box model at a shockingly low cost. If I understood this right, there's a lot of valuable information in those embeddings!

andy99 · on March 12, 2024

That's not always true, many models have separate weights for the embeddings and classifier heads

cs702 · on March 12, 2024

You're right. I edited my comment to qualify it with "when". Thank you!

bilekas · on March 12, 2024

I'm not too up on this entirely, quite a bit of it is going over my head, but am I right in thinking that this would be some form of reverse engineering as opposed to 'stealing' ?

lamontcg · on March 12, 2024

"liberating"

polygamous_bat · on March 12, 2024

TLDR is its fair use when Google takes your data and stealing when you reverse engineer their “intellectual property” /s

ronsor · on March 12, 2024

Note: Google did not release the hidden dimension for GPT-3.5, and OpenAI has already implemented mitigations against some of this.

wrsh07 · on March 12, 2024

Yup - right at the top of the second page they note their disclosure practices.

It's still a pretty wild attack.

> Responsible disclosure. We shared our attack with all services we are aware of that are vulnerable to this attack. We also shared our attack with several other popular services, even if they were not vulnerable to our specific attack, because variants of our attack may be possible in other settings. We received approval from OpenAl prior to extracting the parameters of the last layers of their models, worked with OpenAl to confirm our approach's efficacy, and then deleted all data associated with the attack. In response to our attack, OpenAl and Google have both modified their APIs to introduce mitigations and defenses (like those that we suggest in Section 8) to make it more difficult for adversaries to perform this attack.

antonvs · on March 12, 2024

> Google did not release the hidden dimension for GPT-3.5

Google?

chaxor · on March 12, 2024

They may have meant that deepmind is a good majority of the authors, though ETH Zurich, a few others and even openai is represented as well. Does seem like strange wording if the paper went read though

smaddox · on March 12, 2024

They don't disclose the embedding dimension for gpt-3.5, but based on table 4, comparing the Size and # Queries columns, gpt-3.5-turbo presumably has an embedding dimension of roughly 20,000? Interesting...

HarHarVeryFunny · on March 12, 2024

How does embedding dimension size relate to maximum context length ?

mfscholar · on March 12, 2024

Embedding size and maximum context length are not related. The maximum context length of gpt-3.5-turbo is known though: 2^14

wrsh07 · on March 12, 2024

I am curious what additional attacks knowing the last layer of an LLM enables.

Eg you go from a black box attack to some sort of white box [1]

Does it help with adversarial prompt injection? What % of the network do you need to know to identify whether an item was included in the pretraining data with k% confidence?

I assume we will see more of these and possibly complex zero days. Interesting if you can steal any non trivial % of model weights from a production model for relatively little money (compared to pretraining cost)

[1] https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm...

h4kor · on March 12, 2024

This isn't stealing, you are just training a model on references which isn't a copyright infringement.

sillysaurusx · on March 12, 2024

Maybe I’m misunderstanding you, but the paper is about recovering the unknown hidden dimension of black box LLMs, not copyright.

Personally, it’s a relief to hear "stealing" being used in ML to describe something other than copyright infringement. It would be ironic if we Orwell’d our way out of the current mess by using the word in absurd ways.

But realistically the title is just marketing. One depressing truth about science that every researcher has to face: make your work sound interesting, or else you won’t be able to continue your work due to lack of funding.

JustFinishedBSG · on March 12, 2024

I'm glad this is a somewhat common opinion. The hypocrisy of these companies arguing on one hand that copying every single copyrighted material ever is fair use but on the other hand trying to enforce crazy limitations on their model is mindblowing.

makomk · on March 12, 2024

It's low effort, talking point nonsense. The whole reason this is interesting is that it allows the recovery of non-public information that was never released at all - about the closest analogy I can come up with is if it was somehow possible to reverse engineer part of an author's notes from their published work, though that's a very imperfect analogy. In order to tie this in to a pre-existing talking point about LLMs being bad, the current top comment has to basically completely ignore all of the reasons why this is interesting and surprising and considered an attack in infosec terms - indeed, not really engage with what it's doing at all.

coldpie · on March 12, 2024

I agree with you, but I think the blame lies not with the commenters, but with the paper authors for inviting the comparison by choosing to use the loaded word "stealing" in their title.

rob74 · on March 12, 2024

This is quite literally "picking the LLM's brain"

soygem · on March 12, 2024

Stealing is a bit strong of a word here. Anyways, where is my pirate hat..

dr_kiszonka · on March 12, 2024

It is a reference to this paper:

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.

lnyan · on March 12, 2024

Just wondering if it's possible to achieve LLM quine.

Cthulhu_ · on March 12, 2024

I'm afraid it'll be possible sooner than you think.

I think it's quite telling that it feels like a lot of work spent on productizing AI models is manually crafting in failsafes and exceptions, like that image generator applying forced diversity because there's no images of nonwhite popes or vikings out there, then applying more exceptions to correct for that. Didn't they just disable generating humans altogether at some point?

yreg · on March 12, 2024

What would quine mean in this context? A prompt for which the model (usually) returns the text of the prompt?

lnyan · on March 12, 2024

no, I think it might be

- LLM(prompt_0) = arch/spec of LLM

- LLM(prompt_1) = full weights of LLM

Note that it does not conform the definition of quine as a quine takes no input.

Anyways, constructing a transformer that can autoregressively output its weights would be quite interesting.

sdenton4 · on March 12, 2024

Easy if all the weights are zero...

antonvs · on March 12, 2024

That’s easy, I just tried it (prompt quoted below). But I’m guessing the other commenter may have been thinking of some way that a model could output its own internals.

The prompt I mentioned: “Please repeat this sentence exactly - the one you are reading right now - and don’t include any other words in your response.”

dtech · on March 12, 2024

That's not a quine. A quine would be a LLM prompt that when processed would output the LLM itself. So you'd be able to prompt the newly created LLM after some "build" step.

tobr · on March 12, 2024

I think both could count as quines. A quine is some source code which when executed in an environment produces the same source code. It does not need to produce the entire environment. Depending on whether you see the LLM itself as source code or as an environment to execute a prompt in, you’ll end up with different requirements for an “LLM quine”.

antonvs · on March 12, 2024

I covered both options in my comment.

The prompt I gave is a true quine if you consider the prompt to be the "program", and the model to be the interpreter of the program.

The other option that you described isn't really a true quine, although it's quine-like. A quine is supposed to be a "program", which when "run" without any input, produces its own source code as output.

To be considered a quine in the strict sense, a model that outputs itself implies that you're treating the model as the program. In that case, if it needs a prompt in order to output itself, that breaks the quine rules, strictly speaking.

yreg · on March 12, 2024

Why? A Java quine is not supposed to return the source of the runtime of JVM. Quine returns only the program, which in this case is, I suppose, the prompt.

lbeurerkellner · on March 12, 2024

there is an LVE for this: https://lve-project.org/reliability/repetition/openai--gpt-3...

userbinator · on March 12, 2024

The implications of this sentiment are disturbing.

It is considered an "attack" to probe at something to understand how it works in detail.

In other words, how basically all natural science is done.

What the fuck has this world turned into?

hnfong · on March 12, 2024

We used to call this "reverse engineering". I'm not familiar with what the law says on reverse engineering, but calling it "stealing" seems a bit too much.

imjonse · on March 12, 2024

"Reverse engineering part of a production language model" has a more boring ring to it though.

imjonse · on March 12, 2024

Attack here is a term specific to cryptography and infosec, it does not imply doing anything illegal or violent. "Oracle attack" is the more specific term where you probe a system to give up some of its internals it was not meant to expose.

TeMPOraL · on March 12, 2024

Wording matters. To use LLM terminology, "attack" here may mean something neutral, but it's helluva closer to "fear" and "malice" and "evil" in the latent space than terms like "probing", "studying", "examining", or "reverse engineering".

FWIW, cryptography and infosec as fields get a good mileage out of exploiting fear.

imjonse · on March 12, 2024

That is only true in everyday latent space/vocabulary. In the ones underlying arxiv papers, cryptography and LLM, attack has the latter meaning.

TeMPOraL · on March 12, 2024

Perhaps, but everyday vocabulary is what public policy and law discussions happen in, and I think that's what 'userbinator is worried about (and so am I).

See also: "piracy is stealing" or the everyday vocabulary meaning of the word "hacker".

andy99 · on March 12, 2024

It's getting information the org that created and hosts the model doesn't want you to have. Just because you think that information should be shared doesn't make it any less of an attack.

ronsor · on March 12, 2024

OpenAI has blatantly said that the "open" in their name was a deceptive marketing ploy. At this point in time, they aren't interested in sharing much if any real research, and trying to discover this information is now an "attack" against them.

yreg · on March 12, 2024

> "open" in their name was a deceptive marketing ploy

It is now, but it hasn't been since the beginning. Hence the elon lawsuit.

astrange · on March 12, 2024

OpenAI didn't write this paper.

dr_kiszonka · on March 12, 2024

One of the co-authors is from OpenAI.

ixaxaar · on March 12, 2024

The United States.

charcircuit · on March 12, 2024

Do you consider blind SQL injection an attack? If a user is not meant to have access to information, but they can somehow access it, that is a vulnerability.

idle_zealot · on March 12, 2024

Sure, except the information here is not other users' account info or something, it's analogous to the source of the DBMS. Which is not typically considered sensitive.

charcircuit · on March 12, 2024

No, it's like the schema of the database which is normally a secret.

bugglebeetle · on March 12, 2024

Tell that to Oracle…