Stealing Part of a Production Language Model (arxiv.org)
218 points by alphabetting on March 12, 2024 | hide | past | favorite | 51 comments


It’s stealing the last layer (the softmax head), not an arbitrary part, and it targets “production models whose APIs expose full logprobs, or a logit bias”. Not all language model APIs have these features, and this characterizes which APIs can be targeted and which can’t. These important qualifications should have been written into the title or abstract rather than “typical API access”.
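To see why exposing a logit bias matters: even when an API returns only top-k logprobs, a large bias can force any chosen token into the top-k, and the bias (and the softmax normalizer) cancels when you difference two biased tokens' logprobs. Here's a toy simulation of that recovery step; everything in it (the fake `api` function, the sizes, the `BIG` constant) is invented for illustration and is not the paper's code or a real API:

```python
import numpy as np

VOCAB, TOPK, BIG = 50, 5, 100.0  # toy sizes; real vocabularies are far larger

rng = np.random.default_rng(0)
true_logits = rng.normal(size=VOCAB)  # server-side logits we want to recover

def api(logit_bias=None):
    """Fake API: applies a logit bias, then returns only top-k logprobs."""
    z = true_logits.copy()
    for tok, b in (logit_bias or {}).items():
        z[tok] += b
    logprobs = z - np.log(np.exp(z).sum())
    top = np.argsort(logprobs)[-TOPK:]
    return {int(t): float(logprobs[t]) for t in top}

# Force a fixed reference token and the token of interest into the top-k.
# Both the bias and the log-normalizer cancel in the logprob difference,
# leaving the difference of the *unbiased* logits.
ref = 0
recovered = np.zeros(VOCAB)
for tok in range(VOCAB):
    out = api({ref: BIG, tok: BIG})
    recovered[tok] = out[tok] - out[ref]

# We recover every logit relative to the reference token.
assert np.allclose(recovered, true_logits - true_logits[ref], atol=1e-9)
```

In the paper's actual setting you additionally have to contend with sampling and quantized logprob values; this only illustrates the cancellation trick that makes a logit-bias parameter leak information.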


It's still significant. When the softmax head is the transpose of the embedding matrix (i.e., the weights are tied), the proposed method enables extraction of the entire matrix of pretrained token embeddings from a black-box model at shockingly low cost. If I understood this right, there's a lot of valuable information in those embeddings!
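A minimal sketch of the linear-algebra observation behind the attack, using a simulated model (all names and sizes here are made up; this is not the paper's code): every logit vector is W @ h for the secret |V|×h softmax-head matrix W, so all logit vectors lie in an h-dimensional subspace. Stacking more than h of them and taking an SVD reveals the hidden dimension, and recovers W up to an unknown h×h linear transform:

```python
import numpy as np

VOCAB, HIDDEN, N_QUERIES = 1000, 64, 200  # toy sizes

rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB, HIDDEN))      # secret softmax-head weights
H = rng.normal(size=(HIDDEN, N_QUERIES))  # hidden states for 200 prompts

# Each column is the full logit vector the API would reveal for one prompt.
logits = W @ H                            # shape (VOCAB, N_QUERIES)

# The logit matrix has rank at most HIDDEN, so its singular values
# collapse to ~0 after the first HIDDEN of them.
s = np.linalg.svd(logits, compute_uv=False)
est_hidden_dim = int((s > 1e-8 * s[0]).sum())
assert est_hidden_dim == HIDDEN

# The top-HIDDEN left singular vectors span the column space of W,
# i.e. they recover W up to an unknown HIDDEN x HIDDEN transform.
```

If the head weights are tied to the input embeddings, recovering W (up to that transform) is exactly recovering the token-embedding matrix.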


That's not always true; many models have separate weights for the embedding and classifier heads.


You're right. I edited my comment to qualify it with "when". Thank you!


I'm not entirely up on this, and quite a bit of it is going over my head, but am I right in thinking that this would be some form of reverse engineering as opposed to 'stealing'?


"liberating"


TL;DR: it's fair use when Google takes your data and stealing when you reverse engineer their “intellectual property” /s


Note: Google did not release the hidden dimension for GPT-3.5, and OpenAI has already implemented mitigations against some of this.


Yup - right at the top of the second page they note their disclosure practices.

It's still a pretty wild attack.

> Responsible disclosure. We shared our attack with all services we are aware of that are vulnerable to this attack. We also shared our attack with several other popular services, even if they were not vulnerable to our specific attack, because variants of our attack may be possible in other settings. We received approval from OpenAI prior to extracting the parameters of the last layers of their models, worked with OpenAI to confirm our approach's efficacy, and then deleted all data associated with the attack. In response to our attack, OpenAI and Google have both modified their APIs to introduce mitigations and defenses (like those that we suggest in Section 8) to make it more difficult for adversaries to perform this attack.


> Google did not release the hidden dimension for GPT-3.5

Google?


They may have meant that DeepMind accounts for a good majority of the authors, though ETH Zurich, a few others, and even OpenAI are represented as well. It does seem like strange wording if the paper was read through, though.


They don't disclose the embedding dimension for gpt-3.5, but comparing the Size and # Queries columns in Table 4, gpt-3.5-turbo presumably has an embedding dimension of roughly 20,000? Interesting...


How does embedding dimension size relate to maximum context length ?


Embedding size and maximum context length are not related. The maximum context length of gpt-3.5-turbo is known though: 2^14 (16,384 tokens).


I am curious what additional attacks knowing the last layer of an LLM enables.

E.g. you go from a black-box attack to some sort of white-box attack [1].

Does it help with adversarial prompt injection? What % of the network do you need to know to identify whether an item was included in the pretraining data with k% confidence?

I assume we will see more of these and possibly complex zero-days. Interesting if you can steal any non-trivial % of the model weights from a production model for relatively little money (compared to the pretraining cost).

[1] https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm...


This isn't stealing; you are just training a model on references, which isn't copyright infringement.


Maybe I’m misunderstanding you, but the paper is about recovering the unknown hidden dimension of black box LLMs, not copyright.

Personally, it’s a relief to hear "stealing" being used in ML to describe something other than copyright infringement. It would be ironic if we Orwell’d our way out of the current mess by using the word in absurd ways.

But realistically the title is just marketing. One depressing truth about science that every researcher has to face: make your work sound interesting, or else you won’t be able to continue your work due to lack of funding.


I'm glad this is a somewhat common opinion. The hypocrisy of these companies arguing on one hand that copying every single copyrighted material ever is fair use but on the other hand trying to enforce crazy limitations on their model is mindblowing.


It's low-effort, talking-point nonsense. The whole reason this is interesting is that it allows the recovery of non-public information that was never released at all. About the closest analogy I can come up with is if it were somehow possible to reverse engineer part of an author's notes from their published work, though that's a very imperfect analogy. In order to tie this in to a pre-existing talking point about LLMs being bad, the current top comment has to basically ignore everything that makes this interesting and surprising, and why it's considered an attack in infosec terms; indeed, it doesn't really engage with what the paper is doing at all.


I agree with you, but I think the blame lies not with the commenters, but with the paper authors for inviting the comparison by choosing to use the loaded word "stealing" in their title.


This is quite literally "picking the LLM's brain"


Stealing is a bit strong of a word here. Anyway, where is my pirate hat...


It is a reference to this paper:

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.


Just wondering if it's possible to achieve an LLM quine.


I'm afraid it'll be possible sooner than you think.

I think it's quite telling that a lot of the work spent on productizing AI models seems to go into manually crafting failsafes and exceptions, like that image generator applying forced diversity because there are no images of nonwhite popes or vikings out there, then applying more exceptions to correct for that. Didn't they just disable generating humans altogether at some point?


What would quine mean in this context? A prompt for which the model (usually) returns the text of the prompt?


No, I think it might be:

- LLM(prompt_0) = arch/spec of LLM

- LLM(prompt_1) = full weights of LLM

Note that this does not conform to the definition of a quine, as a quine takes no input.

Anyways, constructing a transformer that can autoregressively output its weights would be quite interesting.
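For comparison, a classical quine in the ordinary programming sense takes no input and prints exactly its own source. The standard two-line Python version:

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Running it prints those same two lines verbatim (`%r` inserts the string's own repr, and `%%` becomes a literal `%`). The quine-like LLM constructions above differ precisely in that they need a prompt as input.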


Easy if all the weights are zero...


That’s easy, I just tried it (prompt quoted below). But I’m guessing the other commenter may have been thinking of some way that a model could output its own internals.

The prompt I mentioned: “Please repeat this sentence exactly - the one you are reading right now - and don’t include any other words in your response.”


That's not a quine. A quine would be an LLM prompt that, when processed, would output the LLM itself. So you'd be able to prompt the newly created LLM after some "build" step.


I think both could count as quines. A quine is some source code which when executed in an environment produces the same source code. It does not need to produce the entire environment. Depending on whether you see the LLM itself as source code or as an environment to execute a prompt in, you’ll end up with different requirements for an “LLM quine”.


I covered both options in my comment.

The prompt I gave is a true quine if you consider the prompt to be the "program", and the model to be the interpreter of the program.

The other option that you described isn't really a true quine, although it's quine-like. A quine is supposed to be a "program", which when "run" without any input, produces its own source code as output.

To be considered a quine in the strict sense, a model that outputs itself implies that you're treating the model as the program. In that case, if it needs a prompt in order to output itself, that breaks the quine rules, strictly speaking.


Why? A Java quine is not supposed to return the source of the JVM runtime. A quine returns only the program, which in this case is, I suppose, the prompt.



The implications of this sentiment are disturbing.

It is considered an "attack" to probe at something to understand how it works in detail.

In other words, how basically all natural science is done.

What the fuck has this world turned into?


We used to call this "reverse engineering". I'm not familiar with what the law says on reverse engineering, but calling it "stealing" seems a bit too much.


"Reverse engineering part of a production language model" has a more boring ring to it though.


"Attack" here is a term of art from cryptography and infosec; it does not imply doing anything illegal or violent. "Oracle attack" is the more specific term for probing a system to make it give up internals it was not meant to expose.


Wording matters. To use LLM terminology, "attack" here may mean something neutral, but it's helluva closer to "fear" and "malice" and "evil" in the latent space than terms like "probing", "studying", "examining", or "reverse engineering".

FWIW, cryptography and infosec as fields get a good mileage out of exploiting fear.


That is only true of the everyday latent space/vocabulary. In the ones underlying arXiv papers, cryptography, and LLMs, "attack" has the neutral meaning.


Perhaps, but everyday vocabulary is what public policy and law discussions happen in, and I think that's what 'userbinator is worried about (and so am I).

See also: "piracy is stealing" or the everyday vocabulary meaning of the word "hacker".


It's getting information the org that created and hosts the model doesn't want you to have. Just because you think that information should be shared doesn't make it any less of an attack.


OpenAI has blatantly said that the "open" in their name was a deceptive marketing ploy. At this point in time, they aren't interested in sharing much if any real research, and trying to discover this information is now an "attack" against them.


> "open" in their name was a deceptive marketing ploy

It is now, but it wasn't from the beginning. Hence the Elon lawsuit.


OpenAI didn't write this paper.


One of the co-authors is from OpenAI.


The United States.


Do you consider blind SQL injection an attack? If a user is not meant to have access to information, but they can somehow access it, that is a vulnerability.


Sure, except the information here is not other users' account info or something, it's analogous to the source of the DBMS. Which is not typically considered sensitive.


No, it's like the schema of the database which is normally a secret.


Tell that to Oracle…



