The koan of an open-source LLM
A proposal for a new definition of an “open-source” LLM and why no definition will ever just work.
The term open-source software emerged in the 1990s to distinguish software that was merely free to use (freeware) from software released with a particular set of cultural values. At the time, the internet was a niche subject dominated by a passionate few. Today, the term is being adapted for AI, but AI is already a household term and is being used by companies across the Fortune 500.
A core piece of the definition of open-source software was the decision to avoid any usage-based restriction in the governing document called a license. If a license said the software could not be used for some morally grey purpose, it was not open-source. The biggest difference today is the obvious cultural change in the politics and social status of AI in the 2020s versus software in the 1990s. If open-source had not been defined until today, it would likely look very different.
From chatting with a few people who were in the room when the accepted definition of open-source was written, such as Brian Behlendorf of the Linux Foundation and Mitchell Baker of the Mozilla Foundation, it is clear there were ideological battles behind so much as a single word in the definition. The open-source software definition includes the following criteria (each with more explanation in the original):
Free Redistribution,
Source Code,
Derived Works,
Integrity of The Author’s Source Code,
No Discrimination Against Persons or Groups,
No Discrimination Against Fields of Endeavor,
Distribution of License,
License Must Not Be Specific to a Product,
License Must Not Restrict Other Software,
License Must Be Technology-Neutral.
Each of these principles was debated, and likely even bigger fights unfolded over the terms that were omitted. Such clarity may not exist for modern machine learning systems. For much of this post, I will focus on the definitions and landscape of what we currently bucket under “Open LLMs”. There’s plenty more work to be done on what constitutes open-source machine learning or AI generally, but that discussion will be increasingly messy.
With large language models (LLMs), we are starting to uncover clear patterns in how they are released and used downstream. The current colloquial term for an LLM with weights available is an Open LLM, short for open-weights LLM, or, in the language of the executive order, a widely accessible large model, or something similar.
This term is shortened from open-source LLM, which everyone (myself included) misused badly up to and through Llama 2. Meta released Llama 2 as an “open-source LLM” without consulting the open-source governing bodies. I copied their messaging, got correctly called out by Eleuther’s Stella Biderman, and usage in the technical community has been improving since. Unfortunately, to policymakers, the difference between an open-source LLM and an open (weights) LLM is not nearly clear enough. An open-source LLM was assumed to be one with every part of the stack accessible and without any usage restrictions. The subtle difference in naming did not remotely reflect the extreme difference in the methodology of the release.
After trying to make the term “Open LLM” work for the last six months or so, it’s pretty clear that we need a change if we want to mobilize broader support for open AI.
A new naming scheme for open LLMs
There’s an emerging split in the space of open LLM providers. First are the likes of Meta (and Mistral) with Llama-style base models. Second are the likes of Eleuther and AI2 with Pythia and OLMo. The Llama-like models are base models released with limited information, primarily intended for applications and building a broader economy. OLMo and Pythia represent attempts to be radically open in order to foster long-term scientific understanding and an inclusive ecosystem around a powerful technology. Currently, models like Llama, Mistral, Pythia, and OLMo all fall under the same term of Open LLMs, which is quickly becoming a big, messy category.
My proposal is to move from one binary of open vs. closed LLMs to two binaries (which really results in three options). By differentiating models that open just the base model weights from those that open the entire training stack, we can have the following model types:
Openly Trained Models (OLMo, Pythia, etc.) — those with data, training code, and weights available without usage restrictions.
Permissible Usage Models (Llama**, Mistral, Gemma, etc.) — those with base model weights and inference code available for easy fine-tuning and distribution.
Closed LLMs — everything from GPT-4 to a random set of fine-tuned weights without much information.
The first category obviously includes the second, but it’s striking how cleanly the framework fits what we have been seeing to date. I don’t expect this to be a final answer, but it provides clarity on the type of answer that would be helpful for informing policy on the landscape of open language models. A central issue would be balancing the fact that the industry leaders of the Open LLM movement would almost certainly be aggrieved to get a down-ranking from the top of the openness leaderboard.
The Open Source Initiative (OSI), a non-profit tasked with credentialing various licenses as open-source software or not, is working on a definition of open-source AI broadly. I’m not sure most practitioners in the field of AI will agree with it (I have seen an early version of the updated v0.6), but it’s an important perspective reflecting the broader history of the internet relative to the narrow framing of LLMs. [Update, thanks to Stefano at OSI] ** The license used by Llama, particularly around economic redistribution, is squarely at odds with the original open-source principles. Hence, more variations are needed. Some model releases should simply be announced as not open source, but that would remove the PR bump associated with them.
At the end of the day, defining an open-source AI license is a multi-institutional and cultural process. With my above definition, there are tons of corner cases on the license beyond the Llama 700-million-users clause. What happens if the training dataset is gated — is it still an Openly Trained Model? To solve this, I see a possibility like that of Creative Commons licenses, where modifications are denoted with short acronyms. These licenses include things like CC BY-NC-ND:
BY: Credit must be given to the creator.
NC: Only noncommercial uses of the work are permitted.
ND: No derivatives or adaptations of the work are permitted.
For AI, we will definitely have things like that. Many companies are creating custom licenses with clauses like no training on outputs (N.O.) or no use for Waifus (N.W.). If we can agree on base terms to then append restrictions to, it’ll be much easier than each organization starting from an entirely new license. The difference between the Llama 2 license and the Apache 2.0 license that Mistral now uses is astronomical (more on this below), but with the above proposal we can at least start on common ground.
The impossibility of this — what makes it a koan — is the cultural weight leading teams have over the definition. If the definition of open-source AI doesn’t fit what Meta wants it to be, adoption will be slowed or stopped. Meta did eventually stop saying its Llama models are open-source, just calling them open models, but that delay will become even longer if we collectively assign companies a terminology they don’t agree with. However, without rich organizations like Meta participating in the open LLM game, the ecosystem may be financially impossible to support in its current form.
Pivot points and politics
In the 1990s, the U.S. government waged a so-called “Crypto War” over encryption standards in consumer technologies. By classifying state-of-the-art encryption as a military technology, it limited consumers to weaker encryption and therefore weaker online security. Compared to AI today, this shows that regulation may not be our primary point of concern around the future of the open LLM ecosystem — it may be other federal actors. While regulation is slow, there are plenty of other levers that can be pulled within the government against a variety of open LLMs if agencies are presented with misleading facts or stories.
Support for open-source software will always be exposed to threats of this kind. Open-source movements are intentionally messy and multi-stakeholder. This complexity is both the strength and the weakness of open ML today — it cannot be extricated from the discussion. Open LLMs are not a topic solely defined by Meta, no matter how much that would serve them.
Recently, there have been many important events governing the future viability of open-source AI, wherever the definition lands.
Claude 3, arms race, commoditization, and national security
On Monday, March 4th, Anthropic released their set of Claude 3 models, and the internet generally decided that they’re the new standard, with a vibes-based dethroning of GPT-4. In practice, this means they’re probably roughly equal, because everyone loves a new thing. We’ve reached the point where those most concerned about existential risk are definitively contributing to arms-race behavior across the industry. The pace of progress is extremely high, which means any arguments for saturation look increasingly nonsensical. Saturation in technology looks like incremental iPhone releases, not a clear new winner in the space from a different company each month.
With a dramatic pace of progress comes an increased risk of non-regulation-based political interventions on the ecosystem, as discussed above with encryption. I consider it a reasonably possible, but not likely, outcome that the current leaders become entrenched by federal action responding to what they created, which would inevitably hurt open ML.
The largest factor openness may have on its side is the difficulty of using API-based software in critical national security infrastructure. While the technical reality is more nuanced, many secure government processes rely on something close to an air-gapped network, where accessing external software is extremely challenging. In this world, though, you can put the weights from HuggingFace on a thumb drive and carry them across the gap.
The secondary benefit openness can take from this news is the potential for commoditization. Given that Anthropic, Google, and OpenAI have all made models of a similar caliber, it is increasingly likely that someone in the open will eventually figure it out too.
Doomers debunking bio risks of LLMs themselves
Over the last few months, the wind has gone out of the sails of the bio-risk arguments against openly available LLMs, thanks to two studies from institutions that predominantly supported this theory concluding that there’s “practically no risk.” The two pieces are:
From RAND: Current Artificial Intelligence Does Not Meaningfully Increase Risk of a Biological Weapons Attack doesn’t need much more description.
From OpenAI: Building an early warning system for LLM-aided biological threat creation, hides a clear message: “In an evaluation involving both biology experts and students, we found that GPT-4 provides at most a mild uplift in biological threat creation accuracy.”
From what I’m hearing, these articles have meaningfully shifted the conversation in DC away from this area, which was a huge risk to the perception of open language models. There is still a lot of work to be done to change the core axis of the conversation away from the assumption that “Open AI is dangerous AI.” Along most axes, being open provides opportunities for both safety and risk — which way the potential lands depends on the evolution of the technology.
Mistral’s perceived reversal and the EU
Not all the politics for open ML companies are good these days. The largest piece of AI regulation passed to date, the EU AI Act, was meaningfully influenced by the rise of Mistral AI and its potential as a counterweight to a largely American technology ecosystem. In short, Mistral pushed the law towards a carveout protecting companies that release weights openly, which had been its key “strategy” until the recent news with Microsoft.
In a flurry of news last week, Mistral AI announced a $15 million investment (likely for compute credits) from Microsoft, along with a new Large model via an API and a chat product. As part of this change, its website underwent a substantial overhaul that muddled a lot of its language around open-sourcing and open weights, immediately drawing backlash on Twitter. This backlash, over the AI peanuts of a few million dollars, was nothing compared to the emerging attention from European policymakers.
While critics are pushing for an antitrust investigation, the unreported discussions and vibes are far worse. In short, some in the EU feel betrayed by this action. Differentiating the portion caused by the Microsoft relationship from the portion caused by walking back on openness principles is not easy. Mistral is essentially positioning itself as a mirror to OpenAI, playing initial open models to gain momentum. It’s hard to see how Mistral could continue to exist in this form without specific support from the EU to try and broaden the technology landscape. When the first report comes out on corporate usage of the Mistral API relative to the likes of OpenAI, I’m sure this picture will be much clearer. Economies of scale have dominated and will continue to dominate in the growth era of AI we are in.
The coalition of open ML these days is small, so any changes to the perception and incentives of one or two players will reconfigure things very rapidly.
Messy points: Transparency, safety, and copyright
This section is mostly about checking in on long-standing and slow-burning questions facing the AI field, and how openness is affecting them. These are the discussion items that’ll play out on the political battlefield.
The muddling of transparency
The Foundation Model Transparency Index (FMTI) is a project released by the Stanford Center for Research on Foundation Models last year that showcases how hard it is to progress a narrative on open ML these days. To me and others at EleutherAI (and surely many silent community members), this work deeply muddles the narrative of what transparency is for language models. Transparency is a central value I believe in for the benefits of open ML models, so seeing it distorted means more work for me to reach any policy goals. The critique we issued centered on the point that we expected the Index to be gamed, in addition to raising many philosophical issues.
Today, we’ve seen this gaming happen with the February release of CroissantLLM (a bilingual French-English model), which touted its transparency score. This is the first of the signs of pain from the FMTI, with more to come out slowly.
I see the FMTI mostly as the result of a scientific process that designed methods, used them to get results, and then skipped the scientific step of updating the methods when the results did not reflect what the methods were meant to measure. This constraint on re-running results is common in other fields like medicine, where you can only test certain hypotheses based on your experiment design, and running more tests reduces the statistical power of subsequent hypotheses (I think). The field of AI is very different. Transparency is a central topic, so it was obvious that doing anything in this space at a well-regarded institution like Stanford would result in perceived success from outsiders. For this reason, I understand the mistake, as I could see something similar happening to me in a few years. I just wish more work had been done to ground the Index in what most in the open ML community actually mean by openness and transparency.
I don’t like to be in the dunking business; we simply need people in AI to be self-critical and reflective about how their work will shift the narratives we care about. I’ve had continued discussions with some of the authors of the FMTI, and they’re doing plenty of other work that is crucial to the beneficial rollout of AI into society. This one piece of work doesn’t represent everything they’ve done.
The muddling of “safety”
Since the regulatory-capture push in Washington by players like Sam Altman around the existential risks of powerful AI systems, under the name of AI Safety, the term safety has been overworked. The same term is now being forced into discussions of things like Gemini’s bias problems, rather than meaningful and specific harms caused to individuals. This sort of distraction, the resulting lukewarm corporate statements, and cases like the Sam Altman saga at OpenAI have substantially shifted the political momentum away from closed systems. While not entirely about open versus closed AI, this meme captures the sentiment that the notion of safety is garnering across the AI world.
As long as this is the case, there’s a large opportunity window for open LLMs, which for a long time were consistently bashed as “unsafe.” The notion of unsafe now applies at every point on the open-versus-closed distribution, rather than just to hypothetical risks of open models. Google seemingly got the worst of the backlash because they didn’t fix the issues after seeing DALL-E 2 or Llama 2 chat go through similar problems.
The muddling of licenses and copyright
The messy point where I still think open models may have the most exposure is the joint discussion of licenses, copyright, and terms of use. We are still having the most basic discussion of whether licenses or copyright will even apply, which is very hard to do when the field is moving so fast. The license question is being handled primarily by the open-source software community, as discussed above, and the issue of copyright is slowly propagating through the American judicial system.
Licenses for open-weight language models have become extremely complicated. For this section, I’m just going to focus on the “license” given to the weights. Mistral plays an underrated role by using the Apache 2.0 license, which is free of almost any restriction. The other models carry terms that are normally not followed, which could bite users later.
One of the most important drivers of innovation in model fine-tuning these days is synthetic data — using one LLM to generate outputs that are then used to train another LLM. The problem is that most models with custom licenses actually restrict this. For example, even Llama 2 has a term similar to OpenAI’s or Anthropic’s terms of service:
You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).
There’s a reason almost every company has included a provision that you cannot train foundation models on the outputs of their model — synthetic data from powerful LLMs is likely the strongest flywheel for creating LLMs more powerful than the ones we have right now. For most companies, this makes sense, as there is a moat advantage to having the best model. For Meta, the models are open to enable more content and usage of their services, so the ecosystem having better AI faster would support this. In the future, I expect Meta to remove this provision, if their open AI strategy is actually what Zuckerberg stated in a recent earnings report.
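To make the flywheel concrete, here is a minimal sketch of the pattern these clauses target: prompt a “teacher” model and save its completions as training examples for a “student” model. The model id and prompts are placeholders I chose for illustration, not a recommendation — whether this is permitted depends entirely on the teacher model’s license or terms of service.

```python
# Minimal sketch of the synthetic data flywheel: generate completions with a
# "teacher" LLM and store them as supervised fine-tuning data for a "student".
# The model id and prompts below are placeholders for illustration only; check
# the teacher model's license before training anything on its outputs.
import json
from transformers import pipeline

teacher = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompts = [
    "Explain the difference between an open-weights and an open-source LLM.",
    "Summarize the Apache 2.0 license in two sentences.",
]

with open("synthetic_sft_data.jsonl", "w") as f:
    for prompt in prompts:
        # The text-generation pipeline returns the prompt plus the completion.
        generated = teacher(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
        completion = generated[len(prompt):].strip()
        # Each prompt/completion pair becomes one fine-tuning example.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

Scaled up to millions of prompts and filtered for quality, this is exactly the kind of pipeline that custom license terms like the one quoted above try to foreclose.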
[Update, thanks Kent K.] It’s important to remember that a license or terms of service only applies to the user of a model, so if I were to generate data with Llama 2 and upload it to the Hub, you’re not technically bound by those terms. This is why Mistral’s Apache 2.0 license can be seen as a loophole around OpenAI’s terms of service, if they indeed trained on OpenAI outputs (a rumor, not confirmed). I refer to this as a “license loophole” that is far from being addressed in our current system.
Gemma’s license from Google DeepMind is generally accepted to be better, with weird terms only around waifus and updating the model, but it’s not good enough to use for many of the synthetic data tasks enabled by Llama 2 70B or Mixtral. If a ruling were to come down in this space, the open fine-tuning community could have to make a left turn with no notice.
Closely related in my mind, probably because I rarely go deep on legal issues these days, is the notion of copyright. We’ve seen this system put to work against OpenAI, but the same could happen to many open models. More analysis could potentially even be done given access to the raw weights.
There are many emerging discussions here around the copyright of outputs or the models themselves. A growing argument is that weights and outputs are just information, so they would NOT get the protections that copyright confers on creative work.
Last year there was a prominent case where the U.S. Copyright Office rejected copyright on a graphic novel made substantially with Midjourney via the Discord interface. The general idea is that once you grant copyright to AI-generated media, copyright is destroyed; if AI material were eligible for copyright, the sheer volume would make the entire copyright system impossible to administer.
The last resort in situations where copyright does not apply is commercial terms of service. Terms of service are essentially enterprise agreements and will never affect individuals. These documents, which are open to broad interpretation by the selling party (e.g., OpenAI), could pan out to be extremely important in the open ecosystem if they are used to set back Meta or Mistral.
Vibes points and next steps
An important principle of the early days of open-source software was the somewhat intangible idea of integrating freedom into our technologies. The benefits listed for open LLMs are much more muddled than the narratives of risk presented by Doomers and their quiet allies. Open systems will always have a narrative disadvantage due to the compelling storytelling nature of fear, but we can close the gap by concentrating on a few strong narratives and stories of the potential of open language models of all types and sizes.
These stories are how we concentrate the vibes. A lot of smart and calibrated people agree that E/Acc is not the position that will mobilize the requisite support among policymakers. It’s easy to see why some of the messaging from E/Acc may be received with skepticism about the tools they represent.
At the same time, the value built by core open-source software stacks has been forgotten and underappreciated for years. One of the most common software stacks, called the LAMP stack after Linux, Apache, MySQL, and PHP, is built entirely on open-source projects. Many modern technology companies would look entirely different without this stack, yet it is often questioned whether open-source tools actually deliver meaningful business value. While the same may still happen for LLMs, it’s important to know that we may be protecting a quiet, yet central, asset rather than something flashy.
This sort of battle is attracting the most vibes-centric of competitors, for better or worse. Last week, Elon Musk filed a lawsuit against OpenAI saying they violated corporate law by not being Open Enough. This lawsuit is almost surely just smoke and mirrors, but it goes to show how far the issue has permeated corporate America. My favorite part is that in order to make a ruling, a jury would have to say whether or not OpenAI has achieved AGI, a determination previously reserved for their boardroom. At the end of the day, this action from Elon can be summarized by this meme. Open source intentionally makes it so one group cannot rule all of the resources we are building.
With all of this in mind, it’s time to move the discussion from the marginal risks of openness to the marginal benefits. Recently, there was a paper from the Stanford Center for Research on Foundation Models that downplayed the most documented risks of open foundation models (except nonconsensual deepfakes and CSAM) and proposed a framework for understanding the marginal risks of releasing open foundation models.
This paper is an important contribution, but it so obviously calls for a paper on the benefits of open-source models. These example benefits and the narratives that carry them will be the bedrock of political momentum throughout 2024.
At this point, pretty much every major tech company interested in AI has released an “open” LLM at the 7-billion-parameter or smaller scale. For those still reading this, it’s a good time to step up and participate in the discussion. Yann said he feels alone in advocating for this position, so help a guy out.
It feels like we are at the beginning of conversations that’ll be remembered as “when we defined open-source AI,” so let’s keep going. For more on the event, see this blog post from Mozilla on the Columbia Convening on Openness and AI.
Thanks to many old friends and new friends in the space that I discussed these ideas with at the Columbia Convening on Open Source AI this last week, including Irene Solaiman, Yacine Jernite, Aviya Skowron, Stella Biderman, Kevin Klyman, Sayash Kapoor (of AI Snake Oil), and others. I’ve been recommended this blog that covers the issues of open ML: Open(ish) Machine Learning News.
Audio of this post will be available later today on podcast players, for when you’re on the go, and YouTube, which I think is a better experience with the normal use of figures.
Looking for more content? Check out my podcast with Tom Gilbert, The Retort.
Newsletter stuff
Elsewhere from me
We released OLMo Instruct! A long way to go, but hey, it’s open.
Models, datasets, and other tools
StarCoder 2 is here! Yay, more open coding models.
StableLM 2 report — one of the better base model reports recently.
Gemma fine-tunes are finally starting to come, thanks H4. See discussion on it being “unfine-tuneable.”
Merlinite 7b from IBM takes an interesting approach on synthetic data for fine-tuning.
Moondream2 small/local VLM.
Links
A cool new project emerged on getting started with building and understanding foundation models — the Foundation Model Cheatsheet.
The author of MMLU, Dan Hendrycks, wrote a cool blog post on “how to design evaluations.”
Promising signs for KTO in the OrcaMath paper.
A teammate at AI2 did an interview about many details in OLMo on TWIML.
Housekeeping
An invite to the paid subscriber-only Discord server is in email footers.
Interconnects referrals: You’ll accumulate a free paid sub if you use a referral link from the Interconnects Leaderboard.
Student discounts: Want a large student discount on the paid tier? Go to the About page.