Towards American Truly Open Models: The ATOM Project
Rebranding the American DeepSeek Project into something more lasting. From an idea to a coalition with real impact.
I’m very excited to share a substantial project on invigorating investment in open language models and AI research in the U.S. The ATOM (American Truly Open Models) Project is the mature evolution of my original “American DeepSeek Project,” and I hope it can serve as a turning point in the current trajectory of losing open-model relevance vis-à-vis China, and even the rest of the world.
I’ve included the full text below, but I encourage you to visit the website for the full version with added visuals, data, and a place to sign your support. This is a community movement, rather than me fundraising, starting an organization, or anything like that.
If you can help get the word out and/or sign your support, I’d greatly appreciate it.
(Or watch a 5-minute overview on YouTube.)
The ATOM Project: Towards fully open models for US research & industry
Reinvigorating AI research in the U.S. by building leading, open models at home
America's AI leadership was built by being the global hub and leading producer of open AI research – research that led directly to innovations like the Transformer architecture, ChatGPT, and the latest advances in reasoning models and agents. America is poised to lose this leadership to China in a period of geopolitical uncertainty and rising tensions between the two nations. America's best AI models have become more closed and restricted, while Chinese models have become more open, capturing substantial market share from businesses and researchers in the U.S. and abroad.
Open language models are becoming the foundation of AI research and the most important tool in securing this leadership. America has lost its lead in open models – both in performance and adoption – and is on pace to fall further behind. The United States must lead AI research globally, and we must invest in making the tools our researchers need to do their job here in America: a suite of leading, open foundation models that can re-establish the strength of the research ecosystem.
Recommendation: To regain global leadership in open source AI, America needs to maintain at least one lab focused on training open models with 10,000+ leading-edge GPUs. The PRC currently has at least five labs producing and releasing open models at or beyond the capabilities of the best U.S. open model. Regaining open source leadership is necessary to drive research into fundamental AI advances, to maximize U.S. AI market share, and to secure the U.S. AI stack.
Overview
Open language model weights and data are the core currency of recent AI research – these are the artifacts that people use to develop new architectures, training paradigms, or tools that will lead to the next breakthroughs in AI, rivals to the Transformer or inference-time scaling. These research advances provide continued progress on existing products or form the basis for new technology companies. At the same time, open language models create the potential for a broader suite of AI offerings by allowing anyone to build and modify AI as they see fit, without their data being sent through the cloud to a few closed model providers.
Open language models are crucial for long-term competition within American industry. Today, substantial innovation is happening inside large, closed AI laboratories, but these groups can only cover so many of the potential ideas. These companies spend the vast majority of their resources on the next model they need to train, whereas the broader, open research community focuses on innovations that will be transformative in 2, 5, 10, or more years. The most progress in building useful, intelligent AI systems will come when the most people can participate in improving today's state of the art, rather than a select few at certain companies.
The open AI ecosystem (regarding the models, not to be confused with the company OpenAI) has historically been defined by many parties participating. The United States emerged as a hub of the deep learning revolution via close collaboration between leading technology companies and academic institutions. Following ChatGPT, there have been countless contributions from around the globe. This distribution of impact on research has been collapsing towards clear Chinese leadership due to China's commitment to open innovation, while a large proportion of leading scientists working in the United States have joined closed research organizations.
The playbook that led Google to invent and share the Transformer – the defining language model architecture from which all leading models such as ChatGPT, Gemini, and Claude are derived – is now the standard mode of operation for Chinese companies, but it is increasingly neglected by American companies.
The impact of China’s models and research is growing because the institutions focused on open models have access to substantial compute resources for training – e.g., some have formed close relationships between leading AI training laboratories and academic institutions. Until the United States and its partners directly invest in training more, higher-performance open models and sharing the processes to do so, their pace of progress in AI research will lag behind.
To train open models at the frontier of performance, a developer currently needs a high concentration of capital and talent. We estimate that to lead in open model development, the United States needs to invest in multiple clusters of 10,000+ H100-level GPUs to create an ecosystem of fully open language models designed to enable a resurgence in Western AI research. Stacking large investments like this into a few focused efforts will help them learn from each other and make progress across a range of challenges quickly and robustly. Splitting such an investment in AI training into smaller, widespread projects will not be sufficient to build leading models, due to a lack of compute concentration. Along the way, we need to build models of various sizes that can enable applications of AI at every scale, from local or edge devices all the way to high-performance cloud computing.
Open models as the engine for AI research and development
America's AI leadership was built by tens of thousands of our best and brightest students, academics, and researchers. This process occurred over decades, but it is faltering at a crucial transition point to the new, language modeling era of AI research. Since the release of ChatGPT, open language models and computational resources have been the most important table stakes for doing relevant and impactful research. High-quality open models and their accompanying technical reports quickly accrue thousands of citations and accolades such as best paper awards, and become the focus of large swaths of students. These act as foundational currencies of AI research and are crucial, achievable artifacts for the long-term American AI ecosystem.
While many direct consumers of open models are academics, this community is far from the only group that will benefit immensely from a new wave of American open models. The low cost, flexibility, and customizability of open models make them ideal for many use cases, including many of the ways that AI stands to advance and transform businesses large and small.
If the United States does not create its own leading open models, the focus of American researchers and businesses will continue to shift abroad. The benefits of openly sharing a technology accrue to the builder in mindshare and other subtle soft power dynamics seen throughout the history of open source software. Today, these benefits are accruing elsewhere due to the intentional support of open models by many Chinese organizations. The gap in performance and adoption will only grow as the American ecosystem sees strong open models as something that is nice to have, or an afterthought, rather than a key long-term priority.
China is adopting the playbook for open innovation of language models that the United States used to create its current AI leadership, yielding rapid innovation, international adoption, and research interest. The collapse of American dominance in AI research is driven not only by the remarkable quality of the Chinese ecosystem, but also by the commitment of China to these very same Open Model Principles – the principles that American scientists used to start this AI revolution. This is further reflected in a consistent trend of Chinese open models being released with more permissive terms of use than their American counterparts.
The many leading closed research institutions in the United States are still creating world-class models – and the work they do is extraordinary. This collapse is not their fault, but closed labs make closed research, and the acceleration of AI was built on open collaboration with world-class American models as the key tool.
As researchers, our focus is on leading the research and development of the core technology defining the future, but there is also a growing list of urgent security and policy concerns facing our nation around the lack of strong open models. To start, adoption of open models from the PRC in the U.S. and among our allies has been slow in some sectors due to worries about backdoors or poor security in generated code. Similarly, there is concern over the outputs of these Chinese models being censored or inconsistent with everyday American values of freedom, equality, and independence. There are even parallels between how the PRC’s national AI champions are increasingly racing to release cheap, open AI models and the PRC’s historical practice of dumping state-subsidized, below-cost exports to undermine American competitors. With the dynamic and rapid evolution of this technology, we need to get ahead of these issues before stronger habits, cost disadvantages, or other incentives reduce the practicality of adopting American open models.
America's lost lead in open model performance
On countless benchmarks, the leading American models have fallen behind counterparts from Chinese companies. In July 2024, American models – in the form of Llama 3 – led in performance over any openly available Chinese models. Since then, a growing number of Chinese open model providers have surpassed the leading American open models and widened the performance gap.
The leading American open models are Meta's Llama and Google's Gemma models. The Chinese open models from DeepSeek and Alibaba's Qwen have traded positions at the frontier of capabilities, ahead of their American counterparts. And the Chinese ecosystem is expanding rapidly, with new players such as Moonshot AI (Kimi), Zhipu AI, and Tencent close behind.
We consider two popular, public, aggregate benchmarks to demonstrate the state of China’s current open model dominance: crowdsourced rankings from LMArena, and comprehensive intelligence rankings from ArtificialAnalysis, which blend a variety of capability benchmarks. The pace of progress on these Pareto frontiers is only part of the equation. Beyond leading outright, the top 10 open models on LMArena are all created by Chinese organizations, and the top 3 open models in the ArtificialAnalysis rankings are of Chinese origin as of publishing on August 4th, 2025.
The isolation of Meta's Llama
Meta CEO Mark Zuckerberg has been one of the few clear advocates for the long-term imperative of America building open models. Since the release of ChatGPT, this has manifested in Meta's Llama series of models – long the definitional open models that served as the basis for research and product development in 2023 and 2024. This basis for research is established by releasing a suite of strong models across a variety of sizes. The original LLaMA family came with models of 7, 13, 33, and 65B parameters, which quickly became defaults in the research community thanks to the convenient fact that they fit on certain popular GPUs for finetuning or inference.
For a first instance showcasing the gap in adoption, consider the Qwen 1.5 family of 8 models, released after the Llama 2 family of four comparably sized models from the summer of 2023. An analysis of cumulative model downloads shows the Llama 2 models being downloaded about six times as much as the early Qwen models (60M versus 10M total downloads, with half as many models), highlighting the original state of play in the open ecosystem – a large lead for American models.
Llama 3 continued this trend with a series of models across 2024. Pieces of the Llama 3 family (and its variants, Llama 3.1 and 3.2) are some of the most popular models ever in the history of HuggingFace, the leading distributor of open models. At the same time, the newer Qwen models from Alibaba – this time the Qwen 2.5 suite of 2024 – showed substantially closer adoption numbers to Meta’s Llamas: a lead of only 20 million cumulative downloads for Llama 3 over the Qwen 2.5 suite, with both crossing 120M total downloads.
Llama’s lead was built on a combination of strong performance and existing distribution channels. This success came in spite of a restrictive license – the contract between the open artifact’s creator and the downstream user – that can require nuanced legal consideration of whether a particular use case is compliant. Meanwhile, Qwen and other Chinese models have adopted simpler licenses drawing on historical practices in open-source software (OSS), removing another barrier to uptake of their models.
Meta has effectively been running this race alone, and as language models became established as a core technology, competition arrived. Between the last releases of Llama 3 and the arrival of Llama 4, the landscape of open models changed substantially with the arrival of DeepSeek’s permissively licensed frontier models, DeepSeek V3 and DeepSeek R1. Meta was now expected to compete single-handedly with Qwen, which builds large families of models strong at every size, and with DeepSeek, which releases open frontier models. Both types of releases are crucial to the health of the ecosystem, but they require slightly different foci to get right.
China today has five amazing open labs, a number which is growing, while America has Meta as its open models champion. We are running Meta in a race against five Chinese runners, and then complaining when it doesn't win every race. Our problem is not that Llama 4 is not state-of-the-art; our problem is running a solo athlete against a team with an ecosystem built to support its growth.
Chinese open models are taking the all-time lead in adoption
The available data showcasing adoption of open language models – how often models are downloaded and how often base models are modified for new uses – shows that China has taken the lead in recent adoption and will soon take the lead in all-time adoption.
We collected historical, daily download data from 6 of the leading open model providers across the world – Meta, Google, Mistral AI, Microsoft, Alibaba Qwen, and DeepSeek AI. Grouping by region, we can see America’s early lead with Llama, Europe’s surge with Mistral’s early viral releases almost surpassing the U.S. in April of 2024, and a consistent acceleration from the Chinese providers until they surpassed the U.S. this summer. As of August 2025, the leading U.S. and Chinese models both have around 300M total downloads on HuggingFace, with the Chinese rate of growth notably higher. The growth rate for European models has remained lower, with their cumulative downloads reaching around 100M today.
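For readers who want to reproduce a rough version of this comparison, a minimal sketch along these lines can aggregate current download counts by provider, assuming the `huggingface_hub` Python library (the organization lists are illustrative, and the Hub reports rolling 30-day downloads, so recovering the historical daily series requires logging snapshots like this over time):

```python
# Sketch: aggregate rolling 30-day download counts by region from the
# Hugging Face Hub. Organization names are illustrative; a real analysis
# would also filter out non-language models under broad orgs like "google".
from huggingface_hub import HfApi

api = HfApi()
providers = {
    "US": ["meta-llama", "google", "microsoft"],
    "China": ["Qwen", "deepseek-ai"],
    "Europe": ["mistralai"],
}

for region, orgs in providers.items():
    total = 0
    for org in orgs:
        for model in api.list_models(author=org):
            total += model.downloads or 0  # downloads is a 30-day rolling count
    print(f"{region}: ~{total / 1e6:.1f}M downloads in the last 30 days")
```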
An important benefit of open models is the ability to finetune them – a process that adapts a given model to a specific purpose. This process is at the heart of academic research and important for businesses shaping a given model to their individual needs. While there are more cumulative derivatives of American models at the moment, Chinese models are gaining momentum, especially this year.
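As a concrete illustration of how lightweight this adaptation can be, here is a minimal parameter-efficient finetuning (LoRA) setup, assuming the `transformers` and `peft` libraries; the model name is a placeholder for any open base model:

```python
# Sketch: wrap an open base model with LoRA adapters so that finetuning
# updates well under 1% of the weights, making adaptation feasible on a
# single GPU.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # placeholder

# LoRA injects small low-rank matrices into the attention projections and
# trains only those, leaving the base weights frozen.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # e.g., only a few million trainable params

# From here, a standard Trainer loop over a task-specific dataset updates
# only the adapter weights to specialize the model.
```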
Early in 2024, Chinese models accounted for 10-30% of the new finetuned models appearing on HuggingFace. Today, derivatives of Alibaba’s Qwen models account for more than 40% of the language models appearing on HuggingFace month over month (the overall picture is quite similar to the downloads data) – and that is just one of China’s leading open model laboratories. Meta’s share of derivatives with the Llama models has dropped from a peak of nearly 50% in the fall of 2024 down to only 15% today. With far fewer open model options appearing from the U.S. or Europe, the proportion of Chinese models in the AI ecosystem is expected to continue to rise.
What the ecosystem needs
We can fix this. America has the talent, compute, and capital to lead open model development – we just need to bring them together in the right places.
The tone for change is well represented by the White House's recent AI Action Plan, which paints a much clearer vision in which the benefits of innovation and global adoption far outweigh the current, measured risks. This represents an inflection point in the perception of open models, especially in the United States, but we still have a long way to go to support this vision with artifacts and actions.
The United States has a thriving AI research community, but it is missing models that it created itself and understands completely, which are needed for clear and rapid progress. For example, the area of research with the most excitement following recent reasoning models is reinforcement learning with verifiable rewards (RLVR). This research has largely been performed on Alibaba's Qwen models from China, due to their strong performance across math, code, and STEM benchmarks.
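For readers unfamiliar with RLVR, the core idea is that the reward comes from a programmatic check of the model's answer rather than from a learned reward model. A minimal sketch of such a verifier is below; the answer-extraction logic is an illustrative stand-in, not any lab's actual code:

```python
# Sketch: a "verifiable reward" for math problems. The reward is 1.0 when the
# model's final boxed answer matches the reference answer, else 0.0.
import re

def math_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer == ground_truth.strip() else 0.0

# This scalar reward then plugs into a policy-gradient update (e.g., PPO or
# GRPO) over sampled completions from the model being trained.
print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```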
There are two categories of truly open models that we need in order to lead on all metrics of open models, defined by how AI is studied and used. Both are essential, and they complement each other and the rest of a leading AI ecosystem. The best outcome is when these are accompanied by training data, intermediate checkpoints, base models, training code, and permissive licenses accepted as standards for free use by the AI community. Models with everything released – currently less common across the industry – are known as “open source models,” a term that clearly notes the benefits that come with more knowledge of how they were built.
First, we need leading open models at the frontier of performance. These should be the best models in the world, and they can be complementary to offerings from the leading closed AI models built in America, offering lower costs and more modifiability. The fundamental insight driving the recent rapid buildout of AI training infrastructure is the idea of scaling laws – and this applies to open and closed models alike. The ballpark of scale needed to reach the leading edge of performance today is 200 to 600+ billion parameters with a mixture-of-experts (MoE) architecture – a size range used by all the leading open models from the U.S. and China in 2025 that challenge the best closed models on intelligence benchmarks.
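For intuition on why scale is non-negotiable, one published form of these scaling laws – the Chinchilla fit of Hoffmann et al. (2022) – predicts pretraining loss directly from model and data scale, with loss falling only as both grow:

```latex
% Chinchilla-style scaling law: expected pretraining loss as a function of
% parameter count N and training tokens D, where E, A, B, \alpha, \beta are
% empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```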
Alongside these leading models, we need a family of related models across a variety of sizes to allow every application and direction of study to be addressed. This is a standard adopted by leading open model suites from the U.S. and China alike. Only the most challenging tasks need the largest models, and for the rest of the tasks facing AI, there need to be tools to understand the minimum model size that solves a given task. A distribution of model sizes – from those that can run on your iPhone to those assisting with the hardest intellectual work, and everything in between – creates maximum opportunity to advance and integrate AI broadly.
The entry point to train models across this size distribution is a cluster of compute on the order of 10,000+ leading GPUs. It is standard for top models to be trained by small teams of fifty to a few hundred people. A famous number on the cost of training frontier AI models from earlier this year was the often-quoted $5 million figure for DeepSeek V3 – this is misleading about what it actually takes to develop these models, and the authors of the DeepSeek technical report acknowledged as much. 10,000 GPUs provide an entry point for rapid iteration concurrent with large-scale training.
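A back-of-envelope calculation, using approximate figures from the public DeepSeek-V3 technical report, shows both where the $5 million number comes from and what it leaves out:

```python
# Rough arithmetic on the oft-quoted "$5M" DeepSeek V3 figure. Inputs are
# approximate values from the public technical report; the $2/GPU-hour rate
# is the report's own assumed rental price.
active_params = 37e9   # activated parameters per token (671B total, MoE)
tokens = 14.8e12       # pretraining tokens
train_flops = 6 * active_params * tokens  # standard 6*N*D compute estimate
print(f"~{train_flops:.1e} training FLOPs")  # ~3.3e+24

gpu_hours = 2.788e6    # reported H800 GPU-hours for the final pretraining run
print(f"~${gpu_hours * 2.0 / 1e6:.1f}M")     # ~$5.6M, final run only
# Excluded: salaries, failed runs, ablations, data pipelines, and owning the
# cluster itself - which is why a standing 10,000+ GPU fleet is the real
# entry ticket.
```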
America should target having multiple centers producing excellent open models. This serves to de-risk progress on training these models, given the urgency of the mission, but will also allow for a more diverse set of artifacts and for the research groups to learn from each other without first making the training organizations so large that progress is slowed.
There are many avenues to obtain and allocate these resources across multiple stakeholders. We need to engage private companies, philanthropic institutions, and government agencies. Programs such as the National AI Research Resource (NAIRR) are important for broadening access to resources related to AI research – including compute, data, software, and models – but these ecosystem-wide solutions are not enough to create breakthrough models the way China is doing with concentrated bets. We need immediate, targeted interventions that can deliver frontier open models within 6-12 months, not years.
As many organizations around the world create strong AI models, it is becoming clearer that with the right compute and talent, strong models follow. The formula we must follow is to deliver these resources with the directive to release the models openly; then we can solidify American AI leadership. Every stakeholder – from tech giants to philanthropies to federal agencies to researchers and engineers – must ask themselves: are we funding or participating in the future of AI research, or are we ceding it to competitors who understand that open models are the foundation of AI supremacy?