4 Comments
Jun 7, 2023 · Liked by Nathan Lambert

I'm with you that we need more interest from the community in red-teaming, and I also agree with the other commenter that it will be hard to stop potentially harmful open-source models from proliferating.

I think one aspect where red-teaming is crucial is that we need to understand how harmful these models can be and how to protect ourselves against them. Having worked in fraud detection, I'm constantly on the lookout for new threats in the fraud perpetration space.

Definitely interested in collaborating on red-teaming open-source models if there's a chance.

Nathan Lambert (author)

Yeah, there are a lot of implicit normative dynamics in this article that I didn't fully state my positions on; I think that's the core of the other comment.

In the end it's about building community norms of use, and eventually some regulation may prevent the most clear-cut types of harm. There will be an endless debate in the middle over what counts as a harm, for sure.


Aside from my general complaint about the inherent doomerism behind your premise that "we need to do something about the open-source AI problem before it's too late," the very existence of relatively capable open-source AI models means it is already too late to have any meaningful control.

One of the entire points of open source is that nobody can force everybody to act a certain way. In terms of "harmlessness" (I'll get to why I hate that term in a bit), this cuts both ways. If you want to use a model but are concerned that it won't perform to your standards in this area, then, as you already noted, it will soon be trivially easy to fine-tune any given model on a decently powered workstation. Just be careful about taking it too far, because OpenAI has inadvertently been showing us that as alignment goes up, quality goes down. Taken outside the context of doomerism, that is the entire solution to your argument.
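To make that concrete, here is a rough sketch of what a single-workstation fine-tune already looks like using Hugging Face's peft library (LoRA). The base model name, data file, and hyperparameters below are placeholders I chose for illustration, not anything from the post:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder base model; any small open LLM works the same way.
base = "EleutherAI/pythia-1.4b"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains a few million adapter weights instead of all ~1.4B
# parameters, which is what puts this within reach of one consumer GPU.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# "my_corpus.txt" stands in for whatever text you want the model to imitate.
data = load_dataset("text", data_files="my_corpus.txt")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ft-out",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

That's the whole recipe: a few dozen lines, one GPU, whatever data you like. Nothing in it is gated on the original model author's consent.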

But when you factor in the unspoken part of your argument, open source becomes the reason why none of your solutions will work. Do you really think the makers of a model explicitly called "uncensored" give a damn, or that its disclaimer is there for any reason beyond legal liability? As fine-tuning gets easier, the number of models available is going to explode. It's something the EU fears so desperately that it is trying to functionally outlaw fine-tuning.

If Hugging Face did implement such a scheme as you propose, anyone who didn't want to participate would simply host their model elsewhere. And don't think for a second that there won't be hundreds, even thousands, who don't want to participate in harmlessness rankings for one reason or another, whether they disagree with the level at which bias is considered harmful, with which topics are considered harmful, or with the entire premise.

Just as there is an entire underground industry for distributing all sorts of malware, there will eventually be models created for the specific purpose of being sold to scammers and other fraudsters. It is already too late to stop this from happening. That's not an argument that we should not try to fight it, just that bad actors are a hydra. By all means, don't take any of this as an attempt to stop you from developing a system for rating models by metrics that are important to you. Just don't think it will have a meaningful impact against your fears of the future.

Harmlessness. Harmless against what? Aside from the fact that there is very little overlap between what I and, say, Ron DeSantis would consider harmless, there is very real evidence that the very attempt to create harmless AI is itself harmful. OpenAI's contractors providing the human feedback for RLHF are just version 2.0 of Facebook's human moderators, who got PTSD from the material they had to moderate. And it makes the models worse: GPT-4 has degraded since release, and so has Anthropic's Claude. Sometime between when I started using Claude and the past few weeks, it has lost the ability to count correctly. It will not even discuss certain concepts, because it is blocked from doing so by its constitutional constraint of being harmless. That's right: Claude considers ethics to be harmful.

Nathan Lambert (author)

Most of the terms I'm using are taken from recent RLHF literature, to sidestep the lengthy debate over whether the taxonomy is a good one. There are definitely limitations in the framing. I mostly just think it is always worth trying to mitigate future issues, even if some are already happening in the present. Small wins are still wins.
