It’s stealing the last layer (the softmax head), not an arbitrary part, and it targets “production models whose APIs expose full logprobs, or a logit bias”. Not all language model APIs have these features, so this characterizes which APIs can be targeted and which can’t. These important qualifications should have been stated in the title or abstract rather than hidden behind “typical API access”.
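As I understand it, even logit bias alone is enough of a side channel. Here's a toy sketch (a simulated model, not a real API; it assumes the API reports only the top token after applying a caller-supplied logit bias): binary-searching for the smallest bias that flips a target token to the top recovers that token's logit relative to the argmax token.

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.standard_normal(50)  # toy model's hidden logits (unknown to the "attacker")
top = int(np.argmax(logits))

def api_top_token(bias: dict) -> int:
    """Simulated API endpoint: returns the argmax token id after logit bias."""
    biased = logits.copy()
    for tok, b in bias.items():
        biased[tok] += b
    return int(np.argmax(biased))

def logit_gap(token: int, lo: float = 0.0, hi: float = 100.0, iters: int = 50) -> float:
    """Binary-search the smallest bias that makes `token` the top token.
    That bias equals logits[top] - logits[token]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if api_top_token({token: mid}) == token:
            hi = mid
        else:
            lo = mid
    return hi

# Recovers the logit difference without ever seeing a logprob.
print(logit_gap(7), logits[top] - logits[7])
```

Repeating this for every vocabulary token gives the full logit vector up to an additive constant, which is the raw material the paper's reconstruction works from.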
It's still significant. When the softmax head is the transpose of the embedding matrix (i.e. tied weights), the proposed method enables extraction of the entire matrix of pretrained token embeddings from a black-box model at a shockingly low cost. If I understood this right, there's a lot of valuable information in those embeddings!
I'm not too up on this entirely, quite a bit of it is going over my head, but am I right in thinking that this would be some form of reverse engineering as opposed to 'stealing'?
Yup - right at the top of the second page they note their disclosure practices.
It's still a pretty wild attack.
> Responsible disclosure. We shared our attack with all services we are aware of that are vulnerable to this attack. We also shared our attack with several other popular services, even if they were not vulnerable to our specific attack, because variants of our attack may be possible in other settings.
> We received approval from OpenAI prior to extracting the parameters of the last layers of their models, worked with OpenAI to confirm our approach's efficacy, and then deleted all data associated with the attack. In response to our attack, OpenAI and Google have both modified their APIs to introduce mitigations and defenses (like those that we suggest in Section 8) to make it more difficult for adversaries to perform this attack.
They may have meant that DeepMind accounts for a majority of the authors, though ETH Zurich, a few others, and even OpenAI are represented as well.
Does seem like strange wording if the paper was actually read through, though.
They don't disclose the embedding dimension for gpt-3.5, but comparing the Size and # Queries columns of Table 4, gpt-3.5-turbo presumably has an embedding dimension of roughly 20,000? Interesting...
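The dimension-recovery part of the paper is surprisingly simple to demo. A toy sketch (simulated model with made-up sizes, not the real API): the logits are a rank-h linear map of the hidden state, so a matrix of logit vectors collected from many random prompts has numerical rank equal to the hidden dimension h, which the singular values expose.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, n_queries = 1000, 64, 200  # toy sizes, not any real model's

# Simulated final layer: logits = W @ h for some hidden state h.
W = rng.standard_normal((vocab, hidden))

# One logit vector per random "prompt" (random hidden state stands in here).
Q = np.stack([W @ rng.standard_normal(hidden) for _ in range(n_queries)])

# Singular values drop to numerical noise after index `hidden`.
s = np.linalg.svd(Q, compute_uv=False)
est_dim = int(np.sum(s > s[0] * 1e-10))
print(est_dim)  # recovers 64
```

With a real API you'd build Q from (reconstructed) logprob vectors rather than raw logits, and the rank gap is noisier, but the idea is the same: far fewer queries than vocab size suffice once n_queries exceeds the hidden dimension.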
I am curious what additional attacks knowing the last layer of an LLM enables.
E.g. you go from a black-box attack to some sort of white-box attack [1]
Does it help with adversarial prompt injection? What % of the network do you need to know to identify whether an item was included in the pretraining data with k% confidence?
I assume we will see more of these, and possibly complex zero-days. It's interesting that you can steal a non-trivial fraction of a production model's weights for relatively little money (compared to the pretraining cost).
Maybe I’m misunderstanding you, but the paper is about recovering the unknown hidden dimension of black box LLMs, not copyright.
Personally, it’s a relief to hear "stealing" being used in ML to describe something other than copyright infringement. It would be ironic if we Orwell’d our way out of the current mess by using the word in absurd ways.
But realistically the title is just marketing. One depressing truth about science that every researcher has to face: make your work sound interesting, or else you won’t be able to continue your work due to lack of funding.
I'm glad this is a somewhat common opinion. The hypocrisy of these companies arguing on one hand that copying every single copyrighted material ever is fair use but on the other hand trying to enforce crazy limitations on their model is mindblowing.
It's low-effort, talking-point nonsense. The whole reason this is interesting is that it allows the recovery of non-public information that was never released at all. The closest analogy I can come up with is somehow reverse engineering part of an author's notes from their published work, though that's a very imperfect one. To tie this in to a pre-existing talking point about LLMs being bad, the current top comment has to ignore all of the reasons why this is interesting, surprising, and considered an attack in infosec terms; it doesn't really engage with what the paper is doing at all.
I agree with you, but I think the blame lies not with the commenters, but with the paper authors for inviting the comparison by choosing to use the loaded word "stealing" in their title.
Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.
I'm afraid it'll be possible sooner than you think.
I think it's quite telling that a lot of the work spent on productizing AI models seems to be manually crafting failsafes and exceptions, like that image generator applying forced diversity because there are no images of nonwhite popes or vikings out there, then applying more exceptions to correct for that. Didn't they just disable generating humans altogether at some point?
That’s easy, I just tried it (prompt quoted below). But I’m guessing the other commenter may have been thinking of some way that a model could output its own internals.
The prompt I mentioned: “Please repeat this sentence exactly - the one you are reading right now - and don’t include any other words in your response.”
That's not a quine. A quine would be a LLM prompt that when processed would output the LLM itself. So you'd be able to prompt the newly created LLM after some "build" step.
I think both could count as quines. A quine is some source code which when executed in an environment produces the same source code. It does not need to produce the entire environment. Depending on whether you see the LLM itself as source code or as an environment to execute a prompt in, you’ll end up with different requirements for an “LLM quine”.
The prompt I gave is a true quine if you consider the prompt to be the "program", and the model to be the interpreter of the program.
The other option that you described isn't really a true quine, although it's quine-like. A quine is supposed to be a "program", which when "run" without any input, produces its own source code as output.
To be considered a quine in the strict sense, a model that outputs itself implies that you're treating the model as the program. In that case, if it needs a prompt in order to output itself, that breaks the quine rules, strictly speaking.
Why? A Java quine is not supposed to return the source of the JVM runtime. A quine returns only the program, which in this case is, I suppose, the prompt.
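For comparison with the prompt-as-program view above, here's a classic quine in ordinary code: a program whose only output is its own source, with the language runtime playing the role the model plays for a prompt quine.

```python
# A minimal Python quine: the string holds a template of the whole program,
# and {!r} splices in the string's own repr when formatted with itself.
s = 's = {!r}\nprint(s.format(s))'
print(s.format(s))
```

Run it and the two printed lines are exactly the two source lines, runtime not included, which matches the Java/JVM point.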
We used to call this "reverse engineering". I'm not familiar with what the law says on reverse engineering, but calling it "stealing" seems a bit too much.
"Attack" here is a term of art from cryptography and infosec; it does not imply doing anything illegal or violent. "Oracle attack" is the more specific term for probing a system into giving up internals it was not meant to expose.
Wording matters. To use LLM terminology, "attack" here may mean something neutral, but it's helluva closer to "fear" and "malice" and "evil" in the latent space than terms like "probing", "studying", "examining", or "reverse engineering".
FWIW, cryptography and infosec as fields get a good mileage out of exploiting fear.
Perhaps, but everyday vocabulary is what public policy and law discussions happen in, and I think that's what 'userbinator is worried about (and so am I).
See also: "piracy is stealing" or the everyday vocabulary meaning of the word "hacker".
It's getting information the org that created and hosts the model doesn't want you to have. Just because you think that information should be shared doesn't make it any less of an attack.
OpenAI has blatantly said that the "open" in their name was a deceptive marketing ploy. At this point in time, they aren't interested in sharing much if any real research, and trying to discover this information is now an "attack" against them.
Do you consider blind SQL injection an attack? If a user is not meant to have access to information, but they can somehow access it, that is a vulnerability.
Sure, except the information here is not other users' account info or something, it's analogous to the source of the DBMS. Which is not typically considered sensitive.