This fact is why so many debates around LLMs feel broken, especially those about moderation.
I appreciate you laying out your thought process on this! The part I'm unsure of (and would love more detail on how you view it) is that when you project into the future, you say "Chat LLMs are released in their raw-weight form with moderately heavy filters."
Are you making any assumptions about the cost/difficulty of further fine-tuning (LoRA or otherwise) on those LLMs? I believe it is relatively easy to undo any chat/safety RLHF/RLAIF for roughly 1-5% of the cost of pretraining, and I'm not sure if/how that factors in for you. My best guess at your position: "if you further train, then you are responsible for safety, including any decision to release your LoRA/weight-XOR/etc.", such that the goal is only to prevent harm from someone doing pure inference at the next larger tiers of models?
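(For intuition on where an estimate in that 1-5% range could come from, here is a back-of-envelope sketch using the common ~6 × params × tokens approximation for training FLOPs. The model size and token counts are purely illustrative assumptions on my part, not figures from any particular model.)

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute as ~6 * N * D FLOPs (dense model, one epoch)."""
    return 6 * params * tokens

# Illustrative assumptions only:
params = 70e9            # hypothetical 70B-parameter model
pretrain_tokens = 2e12   # hypothetical ~2T pretraining tokens
finetune_tokens = 20e9   # hypothetical ~20B tokens of fine-tuning data

ratio = train_flops(params, finetune_tokens) / train_flops(params, pretrain_tokens)
print(f"fine-tune / pretrain compute: {ratio:.1%}")
```

Since parameter count cancels, the ratio is just the token ratio, so even full-parameter fine-tuning on a dataset ~1% the size of the pretraining corpus lands in that cost range; LoRA would be cheaper still per step.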