6 Comments

I appreciate you laying out your thought process on this! The part I’m unsure of (and would love more detail on how you view it) is that when you project into the future, you say “Chat LLMs are released in their raw-weight form with moderately heavy filters.”

Are you making any assumptions about the cost/difficulty of further fine-tuning (LoRA or otherwise) on those LLMs? I believe it is relatively easy to undo any chat/safety RLHF/RLAIF for 1-5% of the cost of pretraining, and I’m not sure if/how that factors in for you. My guess at your position: “if you further train, then you are responsible for safety, including any decision to release your LoRA/weight-XOR/etc.”, such that the goal is only to prevent harm from someone who is just doing inference at the next larger tiers of models?
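
For concreteness on how cheap that further fine-tuning is: a minimal sketch of attaching a LoRA adapter with Hugging Face's peft library (the model name, hyperparameters, and training setup are placeholders, not anything from the post):

```python
# Sketch: further fine-tuning an open-weight chat model with a LoRA adapter.
# Only the small low-rank adapter matrices are trained; the base weights stay
# frozen, which is why this costs a tiny fraction of pretraining.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

lora_config = LoraConfig(
    r=8,                                  # adapter rank (placeholder value)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# ...run any standard causal-LM training loop on a small dataset, then
# model.save_pretrained("my-adapter") writes out only the adapter weights.
```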

author

Yeah, I haven't weighed this enough. I think research on undoing RLHF has started in earnest with Llama 2, and we'll know a lot more about what works and what doesn't in that regard.

If it's as easy as a LoRA adapter undoing it (and I think you can append multiple adapters), that would be particularly wild.
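
(Roughly what "appending multiple" looks like with peft; the adapter paths and names here are hypothetical:)

```python
# Sketch: stacking multiple LoRA adapters on one frozen base model with peft.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "path/to/first-adapter", adapter_name="first")
model.load_adapter("path/to/second-adapter", adapter_name="second")

# Choose which adapter is active at inference time.
model.set_adapter("second")
```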

Further training should be allowed. Distribution is what matters to me. Individuals can do weird stuff in their world.

Ah yeah, I can totally understand and agree with the "do weird stuff in their world" argument on current-day models. I was assuming you were extrapolating into the future, where LLMs are "more than tools" with "unfathomable power to extract information and maybe perform actions", where simple scaffolding lets LLMs enact that power on others in the real world.

If you're imagining pretrained models being too unsafe to release the weights for in that future powerful world... then I'm curious how you draw your line such that RLHF-ed models are acceptable. (Since I would assume any model has at least one jailbreak to be found post-release that can then never be fixed if it's open-weights, or prompt-tuned embeddings that are entirely OOD and bypass the RLHF training, or that further cheap SL-ing / RL-ing can undo the RLHF safety training, etc.)

A few arguments I've seen from people who take more hardcore lines on open-weights:

* some don't believe models will ever get that powerful/dangerous

* some don't accept (or internalize) that offense is easier than defense (in bio, cyber, etc) and so believe we should give the power to everyone rather than limit it to a few

* some believe the power can be safety-gated in a way that survives further training in the opposite direction

Obviously a lot can change pending further research and results on all of these ideas... :)

author

Seems like the option of "undoing RLHF" was a big missing part of my argument.

This is a good list. I think open source is the closest thing we can have to democratic values in AI (even though it'll never be remotely close to that).

I assume you saw https://arxiv.org/abs/2310.03693 which seems relevant to our question about the (unfortunate) ease of accidentally (never mind intentionally) undoing RLHF with simple SL on a few examples. As you said, I’m sure there’ll be more!

Anyway, count me in as a reader who’d be interested in reading more about how your thoughts on this evolve, where the tradeoffs and your lines are, and how the open-weights community is or isn’t well situated to tackle the challenges here.

author

I saw it, but didn't skim it until now.

Unfortunately, yeah that paper looks legit.

All safety measures / requirements need to apply at the point of serving the model. The weights themselves do very little to keep it safe.
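
(As a rough illustration of what "safety at serving time" could mean — the moderation check here is a hypothetical stand-in for whatever filter a provider runs around the model:)

```python
# Hypothetical sketch: the safety check lives in the serving code, not in the
# weights. `moderation_check` stands in for any external filter/classifier.
def generate_with_serving_filter(model, tokenizer, moderation_check, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if moderation_check(text):  # True means the output should be blocked
        return "Sorry, I can't help with that."
    return text
```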
