The American open-source AI ecosystem should differentiate itself through a stronger commitment to safety (e.g. Anthropic’s approach). If the world is to build atop America’s open-source models, labs should try to earn users’ and developers’ trust in the models’ underlying safety features.
As stated in the essay, this stands in sharp contrast to Chinese labs like DeepSeek, which have favored rapid deployment over safety considerations. Their performance on many safety benchmarks is not competitive with leading American models (linked below). The threat of malicious backdoors only heightens these existing safety concerns. Yet the concerns extend beyond misalignment: DeepSeek has put little effort into reinforcing its external security posture, resulting in massive leaks. That disregard could undermine trust in its ability to defend key elements of the development pipeline. Overall, a lot of doubt is brewing.
Compared with closed models, open-weight models warrant particularly rigorous safety standards; they should ultimately face a higher bar. In the wild, open-weight models can be deployed without moderation filters or classifier safeguards, so they must also be hardened against weight tampering (harmful fine-tuning attacks). While innovations like TAR (Tamper-Resistant Safeguards) show promise, they come with drop-offs in capabilities. This illustrates the central tension: balancing safeguards against model performance. If American open-source AI can strike this balance better while maintaining an enduring focus on safety, I suspect the market will reward it. Do not underestimate the power of trustworthiness.
https://www.enkryptai.com/blog/deepseek-r1-ai-model-11x-more-likely-to-generate-harmful-content-security-research-finds
https://blogs.cisco.com/security/evaluating-security-risk-in-deepseek-and-other-frontier-reasoning-models
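To make the safeguards-versus-performance tension above a bit more concrete, here is a toy sketch of the meta-learning idea behind tamper-resistant training (my own illustration on synthetic data, not TAR’s actual implementation): simulate a short harmful fine-tune as a differentiable inner step, then update the defended weights so the simulated attack makes little progress while benign performance is retained.

```python
import torch

# Toy sketch of the meta-learning idea behind tamper-resistant safeguards:
# simulate a one-step "harmful fine-tune" on the defended weights, then
# update those weights so the attack makes little progress while benign
# performance is retained. All data and models are synthetic stand-ins.

torch.manual_seed(0)

# Defended weights of a tiny linear model (stand-in for open-weight params).
W = torch.randn(16, 1, requires_grad=True)

# Synthetic benign ("retain") data and harmful ("attack") data.
x_retain, y_retain = torch.randn(64, 16), torch.randn(64, 1)
x_harm, y_harm = torch.randn(64, 16), torch.randn(64, 1)

opt = torch.optim.Adam([W], lr=1e-2)
mse = torch.nn.functional.mse_loss

for step in range(500):
    opt.zero_grad()

    # Benign objective the defended model should keep doing well on.
    retain_loss = mse(x_retain @ W, y_retain)

    # Simulated attack: one SGD step on the harmful objective, kept
    # differentiable (create_graph=True) so the outer update can resist it.
    harm_loss = mse(x_harm @ W, y_harm)
    (grad_w,) = torch.autograd.grad(harm_loss, W, create_graph=True)
    W_attacked = W - 0.1 * grad_w

    # Tamper resistance: keep the attacker's post-attack loss high while
    # retaining benign performance; the 0.05 weight controls the tradeoff
    # (the capability drop-off mentioned above).
    post_attack_loss = mse(x_harm @ W_attacked, y_harm)
    total = retain_loss - 0.05 * post_attack_loss
    total.backward()
    opt.step()
```

The subtractive term is exactly where capability can erode; real methods run many inner attack steps over real fine-tuning data, which is where the tradeoff shows up.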
RE Safety, I think there are a lot of interpretations of the word. The signs post-DeepSeek are that safety will be walked back, mostly because it was "too early." I agree that we should lead on *norms around safety*, but I don't think it's easy to know what that means. For one, "safety" could also mean just not censoring models.
Point about DeepSeek infra is a good example for the time being.
I don't know if I agree that open-weight models need safety built in. If models are huge and only served in large, complex systems, why is needing a content filter bad? That would let researchers understand the base model and continue to make progress. It's a tradeoff, and I land on the research side (for obvious reasons).
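Roughly what I have in mind by keeping safety in the serving layer, as a toy sketch; `generate` and `flag_harmful` are hypothetical stand-ins, not real APIs:

```python
# Toy sketch: safety lives in the serving stack, not in the open weights.
# `generate` and `flag_harmful` are hypothetical stand-ins, not real APIs.

def generate(prompt: str) -> str:
    """Stand-in for the unmodified base model's raw generation."""
    return f"model output for: {prompt}"

def flag_harmful(text: str) -> bool:
    """Stand-in for a moderation classifier; a real deployment would call
    a dedicated safety model or rules engine here."""
    blocklist = ["build a bioweapon"]
    return any(term in text.lower() for term in blocklist)

def serve(prompt: str) -> str:
    # Filter on the way in and on the way out; the base model stays
    # untouched, which is what keeps it useful for research.
    if flag_harmful(prompt):
        return "Request declined by serving-layer policy."
    output = generate(prompt)
    if flag_harmful(output):
        return "Response withheld by serving-layer policy."
    return output

print(serve("summarize this paper"))
```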
Regardless, this is 100% true: "Do not underestimate the power of trustworthiness." Social media went the other way.
Curious what you think here. An area I want to learn more about, so keep in touch.
I totally agree, safety/harmlessness remains an inherently hazy domain. For this reason, leading AI labs need to push for standardized, universal safety categories. The breakdown should address misuse (e.g. CBRN, Cyber-Offensive, Weapon Acquisition, Mass-Manipulation, Inciting Violence, Abusive Content) and misalignment (e.g. Deception, Self-Proliferation, Power-Seeking). NIST AI 600-1 is a start, but it could likely be dialed in better. The US should definitely lead here, but I suspect many of these core categories are in every country's best interest.
From here, users and developers need better signal about model safety relative to these categories. That should entail third-party testing through independent eval orgs or an AISI, working towards something like certification. I imagine insurance mechanisms will eventually come into play too, once liability regimes emerge.
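As a rough sketch of the kind of per-category signal I mean (the taxonomy mirrors the breakdown above; `run_model` and `violates_category` are hypothetical stand-ins for the model under test and an independent grader):

```python
# Toy sketch of a category-scored safety eval. The taxonomy mirrors the
# misuse/misalignment breakdown above; `run_model` and `violates_category`
# are hypothetical stand-ins, not real APIs.

MISUSE = ["CBRN", "Cyber-Offensive", "Weapon Acquisition",
          "Mass-Manipulation", "Inciting Violence", "Abusive Content"]
MISALIGNMENT = ["Deception", "Self-Proliferation", "Power-Seeking"]

def run_model(prompt: str) -> str:
    """Stand-in for querying the model under test."""
    return f"response to: {prompt}"

def violates_category(response: str, category: str) -> bool:
    """Stand-in for an independent grader (human reviewer or classifier)."""
    return False

def score(test_prompts: dict[str, list[str]]) -> dict[str, float]:
    """Violation rate per category: the kind of signal a third-party
    certification or insurance regime could publish."""
    report = {}
    for category, prompts in test_prompts.items():
        violations = sum(violates_category(run_model(p), category) for p in prompts)
        report[category] = violations / max(len(prompts), 1)
    return report

# Example: one placeholder red-team prompt per category.
suite = {c: [f"red-team prompt targeting {c}"] for c in MISUSE + MISALIGNMENT}
print(score(suite))
```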
To clear up my earlier comment: by default, I believe open-source labs should try to hold themselves to stricter safety standards than closed labs. The alignment of open-weight models is higher stakes. A bad actor can interact with such a model in settings without moderation or safety filters, and with sufficient skills and resources can harmfully fine-tune it. If densing laws hold, more capable models will take up smaller footprints, expanding the risk surface. The open-source community should therefore prioritize ingraining strong values that withstand tampering. I imagine many second-order benefits will come with the territory, like trustworthiness and better models (e.g. Anthropic’s safety work has supposedly strengthened its models’ character and competence).
When there are clearer harms, I agree that open labs should be held to a higher standard, but we aren’t there yet, and for now openness will make it easier to monitor these potential harms. Totally fine that we disagree a bit here; we will see.
I respect that point a lot; all I would say is that it is easier to form a habit early than to break one later. Strongly investing in safety and private governance sooner rather than later ensures open labs can navigate the chaos that will come if/once clearer harms take shape. On top of that, it is an avenue for differentiation and may very well enhance performance (CAI, RLAIF, and SAEs have all supposedly led to some uplift for Claude). Thanks for your takes on this; it has been a really enjoyable back and forth!
This is a great overview: it's not just "rah rah" open source, and it explains why it makes sense in the moment.
Look I know the US is imperfect, but this isn’t really the place for comments like this.
Discord is supposed to be a noisy place. Okay for some chaos.