How the Foundation Model Transparency Index Distorts Transparency
A proper critique of the Foundation Model Transparency Index (FMTI). Plus some thoughts on the ecosystem implications.
Stanford's Center for Research on Foundation Models (CRFM) has never been uncontroversial. Seared into my mind is Jitendra Malik's criticism of their first piece, which coined the term foundation model, as something along the lines of "floating castles in the sky." It still feels warranted.
Fast forward to today: the same organization (with different authors) has released what they call the Foundation Model Transparency Index (FMTI), which purports to score “flagship models” from leading players in the AI space on how transparent those models are. At first glance, this is obviously something that is needed; it’s crucial to the progress of the AI community.
Quickly, I realized they were comparing open models (those whose weights you can download) directly against closed ones, with little acknowledgement of the fact that with closed models you cannot verify most of the claims made around transparency.
Finally, I dug into it a little and realized that so much of what I think transparency is actually about was misconstrued in the work. This, combined with the huge media distribution push they did, made me certain I needed to publish a critique. If nothing else, as an exercise to figure out what transparency actually should mean.
To me, transparency is openness in good faith. The first proposal of the FMTI sidesteps this and defines transparency in a way that suits technology companies. With some friends from EleutherAI and the broader open ML community, we wrote an extremely detailed analysis of the misunderstandings in the FMTI. This constitutes the main blog post for this week, so I really recommend you give it a read.
Here’s the critique in full. It is also cross-posted on the EleutherAI blog.
How the Foundation Model Transparency Index Distorts Transparency
Evaluating transparency requires precision.
October 26, 2023 · Nathan Lambert, SE Gyges, Stella Biderman, Aviya Skowron
Stanford’s Center for Research on Foundation Models (CRFM) recently released a new piece of work called the Foundation Model Transparency Index (FMTI), which sets out to score flagship large language models (LLMs) from leading developers on different aspects of transparency around building and deploying a model. Work on understanding the transparency of LLMs is crucial to building trust and creating realistic evaluation standards for this extremely powerful technology. However, the FMTI makes many claims that are misleading concerning both the spirit and the facts of LLM transparency, and it is detrimental to recent progress in transparency.
Our core issues are:
The FMTI misleadingly presents itself as measuring transparency for foundation models when it really measures how well-documented a commercial product is.
The FMTI’s scorecard approach minimizes the nuance of their analysis and encourages people to view it as a score to optimize. This is exacerbated by the way the authors weigh questions, overemphasizing things that are easy for companies to address on paper without making real change.
The FMTI makes critical factual errors in its analysis, misleading readers. It is also systematically biased against openly released models. The fact that openly released models score higher than closed APIs despite this emphasizes just how large the transparency gulf is.
The FMTI doesn’t engage with reasonable criticism of its approach, even when that criticism is openly acknowledged by the authors.
Throughout this document, we point out errors and assumptions that impact how BigScience’s BLOOM-Z was evaluated. This spreadsheet shows a corrected scoring of BLOOM-Z, ultimately giving it an 87%. However, even the selection of this model for the Index is an odd choice: the “flagship” model from the project (following FMTI's terminology) is BLOOM, which was specifically developed to operationalize transparency in AI development and serves as a real demonstration of how an AI model can be transparently developed. This analysis erases the distinction between BLOOM and BLOOM-Z, ignores much of the documentation surrounding the BLOOM research project, and ultimately judges that BLOOM-Z lacks transparency in spite of adhering to a high standard for research transparency.
BLOOM-Z is centered in this response because its creation was grounded in transparency, which the Index ostensibly is interested in improving. Because of this, BLOOM-Z's treatment in the FMTI seems to most clearly illustrate the Index's problems. We, the authors of this post, were involved in the project for BLOOM-Z or are affiliated with institutions involved in the project. To the best of our ability, we have tried to focus on BLOOM-Z only where its treatment in the FMTI is useful for illustrating the FMTI’s fundamental problems.
Transparency is a Tool, not a Goal
One of the core issues with the framework is that it quantifies transparency without seeming to consider why transparency would be good. Transparency is an extrinsic value, that is, a mechanism for achieving other ethical values, such as accountability. This is further explained in the BigScience Ethical Charter that gave rise to BLOOM-Z. By analyzing transparency as an aggregate, mostly without reference to how it can mitigate specific harms or serve specific purposes, the Index makes transparency hollow. These transparency metrics are a clear example of Goodhart’s law: "When a measure becomes a target, it ceases to be a good measure."
The way the authors define transparency changes throughout the questionnaire, which gives them leeway to ask just about any question they like: they ask about the existence of terms of service, harm mitigations, and evaluations of trustworthiness under the guise of transparency. While these are useful things to evaluate, they should not be conflated with transparency.
Much of what it measures is whether a project surrounding one or more LLMs happened to document a number of very specific things in a format that the authors of the FMTI do not wholly clarify (their in-progress paper assures readers it is based on their "judgment"), but that apparently excludes the most widely used method for disseminating transparent information: open-access papers. For example, BLOOM-Z is marked as not reporting ownership of the compute for their project. It does. The information is available in the BLOOM-Z paper, clearly linked at the start of its documentation. Yet somehow this way of acquiring the relevant information – through an open-access paper, the consensus approach in science and academia – is not in scope, and consequently the model loses points. Similarly, BLOOM-Z is marked as not sharing details of its data, even though extensive data documentation, including the fine-tuning details required by the Index, is directly linked in the BLOOM-Z model card. We find many examples of this, both for BLOOM-Z and for other models, casting doubt on the reliability of the reported numbers on top of the already problematic skews in the criteria. This type of error can be fixed, but there is not yet a clear mechanism to do so. This also serves as a warning to anyone who uses limited keyword searches as their primary avenue for document discovery.
Crucially, the way transparency is usually understood within research is barely represented in the FMTI. Machine learning research is generally considered "more transparent" if the data, model, and evaluation procedures are openly released. These things are so undervalued by the FMTI that a model that prioritizes them can score as low as 30%, meaning over two-thirds of the questions don’t reflect things machine learning researchers view as fundamental to transparency.
Transparency or Corporate Responsibility?
Another fundamental error is that the FMTI conflates models and hosted model services. Models are research artifacts, created once. Their creation can be documented in research papers so that they can be reproduced, and they can be released for download and governed by licenses. In contrast, hosted model services are full products, may contain many models and additional processing, and are usually maintained by one corporation and governed by Terms of Service. Hosted services for models are characterized quite heavily by what model(s) they host, which determines what they are useful for, but the service and the model(s) are distinct things.
The FMTI is designed in a manner that better suits judging these hosted model services, but it is applied to standalone project artifacts regardless. The FMTI seems to constitute a list of guarantees the authors wish were provided by vendors like OpenAI. Although it does criticize the practices of vendors, it implicitly accepts their agenda. The Index gives a relatively low weight to the production of actual research papers or artifacts that would allow other people to reproduce or build on their work, structures many of its questions around the moderation of hosted services, and assigns significant weight to releasing extensive written material that serves little research purpose (which is generally useful as marketing). This is exemplified by the fact that OpenAI's GPT-4 received 50% of the points for “Methods” and “Model Basics” despite disclosing no non-trivial details about the methodology used to create the model. A transparency index from the perspective of the research community would assign GPT-4 a score of 0.
This bias towards companies offering models as a service is shown by the number of items in the FMTI that only apply to such companies. These include questions about permitting or prohibiting users, providing justification and appeals for enforcement actions, and disclosing to users that the systems with which they are interacting are powered by AI. These are totally reasonable questions for a software-as-a-service AI business. They are nonsensical for a research project, which should be evaluated according to different criteria and along different axes.
In a typical research project, artifacts are made available with a license that specifies what users can and cannot do. This is a widely adopted practice in machine learning research and covers models including EleutherAI’s GPT-NeoX-20B, Google’s T5, and Meta’s LLaMA series in addition to many less well-known research artifacts. This manner of governance is also used broadly for open source projects such as Android, Linux, PyTorch and Jax. It is generally established law that software licenses are enforceable contracts, but such enforcement is extremely rare in research settings because most licenses are permissive and most use of software falls clearly within the terms of widely-used open source licenses. If a user is concerned that they could possibly violate such licenses, the terms of those licenses are generally quite clear. On the other hand, terms of service (ToS), which the FMTI requires, are often held to much less strict standards and are often not legally grounded, but rather are loose terms for commercial partnerships.
The questions drafted in the FMTI and rationales provided by the authors for Usage Policy and Model Behavior Policy determine that permissive open source licenses are not transparent but corporate terms of service concerning paid API use are. We see this explicitly in how the BLOOM-Z model is treated: BLOOM-Z loses points for lacking "Permitted, Restricted, and Prohibited Model Behaviors," or "Permitted, Restricted, and Prohibited Users" despite the fact that this is covered by the RAIL license. By our count, there are at least seven points that are wrongly taken away from BLOOM-Z here.
These issues around licensing do not constitute a very large part of the Index, but they show its bias most clearly. The same issue plagues the "Downstream Impact" section: because BLOOM-Z is not actually a commercial product offered by Hugging Face, they have no ability to know things like "the fraction of applications corresponding to each market sector" or "the number of individuals affected by the foundation model." Additionally, it would be a massive breach of user trust and a violation of open source principles for Hugging Face to surreptitiously monitor model use or record usage and make that data available to third parties. It would also be technologically quite difficult, if not impossible. Nevertheless, BLOOM-Z loses points for not disclosing policies about doing these things. In total, we identify over 10 questions that are inapplicable to the BLOOM-Z model or any model released for research purposes with a permissive license.
Many other items in the Index favor large established corporations in more subtle ways. Scores around "impact", "risks", and "mitigation" all seem to redefine writing about possible future harms as a form of "transparency." AI safety research is an extremely valuable field of study in and of itself, but it is not the same thing as transparency.
The implicit assumption that writing about hypothetical foreseeable harm in some documentation related to a model is contributing to transparency is wrong. You aren't "more transparent" because you wrote fifty pages about the harm your model might do. Say you are selling firearms: are you a "more transparent" gun company if you also hand out pamphlets about firearm injuries? It may or may not be a good practice, but it does not increase the company's transparency.
All of this shows that the Foundation Model Transparency Index does not, primarily, evaluate transparency. It does not even evaluate models, nor projects for creating models. What the FMTI seems focused on evaluating is corporations, and especially whether the evaluated corporations are good service providers that provide strong guarantees and extensive documentation concerning their practices for people or companies using their products.
This is an extremely useful thing for consumers to have if they are considering using the services of the evaluated companies. Research transparency is one aspect of this consideration, but the Index in total is not about transparency, and seems to only measure research transparency where it overlaps with being a good service provider. If evaluating services is the intention of the Foundation Model Transparency Index, it should say so. The current framing of the Index has the potential to mislead people concerning both transparency and foundation models.
The Paper Raises and then Ignores Valid Criticism
Under the heading “Limitations of transparency,” the authors include several common objections to work like theirs. This section is in extreme conflict with the rest of the paper. Instead of grappling with or changing their design to avoid those potential issues, the authors instead fall victim to every single one of them. Given the substantial attention that their marketing push for the Index appears designed to create, we hope the authors use this feedback to improve understanding and motivation for transparency concerning research projects and corporations alike. In its current form, the index itself is in conflict with the positioning that transparency enables a healthy ML ecosystem. We will quote some sections of the paper to showcase the tensions we raise. We believe that many of these issues can be addressed through updates to the text and framing of the FMTI.
To begin, the authors comment on the relationship between transparency and improving ecosystem norms:
Transparency does not equate to responsibility. Without broad based grassroots movements to exert public pressure or concerted government scrutiny, organizations often do not change bad practices (Boyd, 2016; Ananny and Crawford, 2018)
Their justification throughout the paper is that they expect improving metrics that mention transparency to bring about better outcomes, completely ignoring this critique. CRFM has not been involved with any grassroots advocacy as far as we are aware, nor have they supported those who have been.
Next, on transparency-washing, the authors state:
Transparency-washing provides the illusion of progress. Some organizations may misappropriate transparency as a means for subverting further scrutiny. For instance, major technology companies that vocally support transparency have been accused of transparency-washing, whereby "a focus on transparency acts as an obfuscation and redirection from more substantive and fundamental questions about the concentration of power, substantial policies and actions of technology behemoths" (Zalnieriute, 2021).
Given the institutional support behind the FMTI, this paper provides easy fodder for corporate misuse. It gives companies a rubric for engaging in transparency-washing, even pointing out that one can achieve an 82% score on this Index by copying the practices other companies engage in. This is further reinforced by the next issue:
Transparency can be gamified. Digital platforms have been accused of performative transparency, offering less insightful information in the place of useful and actionable visibility (Ghosh and Faxon, 2023; Mittelstadt, 2019). As with other metrics, improving transparency can be turned into a game, the object of which is not necessarily to share valuable information.
By turning transparency into a score that can be optimized, the Index explicitly incentivizes organizations to meet the criteria without engaging in meaningful change. Through its questions, the FMTI exacerbates an already serious problem: the release of “propaganda papers” that contain no meaningful detail but exist so that companies can extol the virtues of their work.
We argue that this Index as constructed and applied makes "transparency" around AI into a mostly pay-to-win game. You win the game by hiring a lot of employees whose new full-time jobs will be generating public documentation that discusses, as explicitly as possible, each of the things that this Index provides them a checklist for. They do not have to discuss any of these things seriously, and this documentation will almost certainly provide little value to anyone. This will serve no purpose besides boosting their score in the Index, but it should be able to do that quite successfully.
Conversely, those who will lose if this Index specifically is treated as or becomes a standard will be those who cannot hire numerous people to generate documentation discussing each bullet point in the Index. This benefits incumbents in the market immensely, because they already have plenty of employees. Independent research of any kind would become substantially more difficult, and new market entrants would struggle because competing with incumbents would become much more expensive and difficult.
Next, on how transparency relates to privacy, the authors state:
Transparency can inhibit privacy and promote surveillance. Transparency is not an apolitical concept and is often instrumentalized to increase surveillance and diminish privacy (Han, 2015; Mohamed et al., 2020; Birchall, 2021). For foundation models, this critique underscores a potential tension between adequate transparency with respect to the data used to build foundation models and robust data privacy.
The questions "Is geographic information regarding the people involved in data labor disclosed for each phase of the data pipeline?" and "Are the wages for people who perform data labor disclosed?" assume that the data labor is not being carried out by the authors but rather by a large group of contractors. For many models (including the BLOOM-Z model) this work was done by a small group of named individuals. Requiring them to disclose information about their salaries or geographical location would be wrong. Instead of engaging with this fact by exempting BLOOM-Z, the authors dock BLOOM-Z points for failing to answer.
The authors also feature over 10 questions that presume model providers are storing and selling user data in privacy-violating ways. While it is certainly reasonable to ask these things of commercial LLM products that collect user data, the authors’ choice to center these considerations shows that their conception of what it means to produce large language models is fundamentally rooted in offering such products. They also seem unaware that this is inapplicable to BLOOM-Z, and dock it for failing to meet 7 of the 10 related questions.
Next, on the competitive landscape:
Transparency may compromise competitive advantage or intellectual property rights. Protections of competitive advantage plays a central role in providing companies to the incentives to innovate, thereby yielding competition in the marketplace that benefits consumers. Consequently, work in economics and management studies have studied the interplay and potential trade-off between competitive advantage and transparency (Bloomfield and O'Hara, 1999; Granados and Gupta, 2013; Liu et al., 2023), especially in the discourse on corporate social responsibility.
Indeed, many corporations have claimed that competitive advantage is why they’ve stopped releasing key training details. While we certainly oppose this practice, we find it amusing that the authors make no real effort to engage with this fact. By tacitly accepting that transparency is all but gone in AI research from most major labs for commercial reasons, and then generating a long scorecard that makes actually releasing research only a minor factor among many, the authors give up on actual transparency in research before they even begin.
If this Index were adopted as regulation, it would be a textbook example of regulatory capture. Realistically, it is unlikely to be specifically encoded in regulation, but it is highly likely to inform the thinking of those involved in regulating AI. The Index accepts the premises that the industry would like to base regulations around: specifically, that they should not be pressured to demonstrate transparency in any way that is likely to harm them commercially, but should instead be given a scorecard which they can fill out simply by devoting labor to it, regardless of their actual practices.
On the role of transparency to improve the health of the ML ecosystem, the authors state:
Transparency is not a panacea. In isolation, more information about foundation models will not necessarily produce a more just or equitable digital world. But if transparency is implemented through engagement with third-party experts, independent auditors, and communities who are directly affected by digital technologies, it can help ensure that foundation models benefit society.
The authors are not third-party experts, independent auditors, or communities directly affected by this technology. They also explicitly overemphasize tracking metrics for transparency at the expense of advocating for good behavior. If, instead of writing massive tomes on transparency, they worked with open source leaders like EleutherAI, Hugging Face, Stability AI, countless academics, and numerous developers in the space to foster and implement transparency policies, they would have a substantially better chance of making a positive change in the world.
Other issues
Our primary focus so far has been critiquing the overall approach of this paper. In addition to those problems, the authors make a series of more specific errors that misrepresent the collected information.
Failure to distinguish available weights vs. not
Central to the transparency of LLMs is whether or not the weights of the model in question are available. Without the weights, many of the claims made in a model report are unverifiable beyond the word of the organization. The absence of this distinction represents the clearest failure to understand how the open-source and academic communities engage within the LLM space, where transparency is a primary goal, but one that is not rewarded by the FMTI.
Bad questions
Some of the questions in the FMTI are bad things to ask of people regardless of your goals. We’ve already mentioned that the FMTI requires that people disclose where they live and how much money they make, but there are several other unreasonable questions.
"Is a summary of government inquiries related to the model received by the developer disclosed?" While this isn’t inherently bad, all models that do not have an affirmative disclosure (including ones that have not had any inquiries) get zero points which is unfair. It's also easy to conjecture examples where disclosing government inquiries might be morally wrong or illegal. While there are contexts where disclosing government inquiries is beneficial, we see no reason to mandate it.
Questions like "Is the amount of energy expended in building the model disclosed?" and "Is the amount of carbon emitted (associated with the energy used) in building the model disclosed?" are not always things that model developers have the ability to answer, as people who rent cloud computing platforms may not have this information disclosed to them. Asking this question implicitly penalizes smaller enterprises.
The authors ask "[a]re any mechanisms for detecting content generated by this model disclosed?" This is deeply problematic because no such mechanisms are known to exist for text models. It is widely agreed upon in the academic community that "AI text detector" algorithms are bogus and OpenAI recently took down its tool for this purpose after admitting it didn’t work. This is not to say that watermarking is bad (see here for an interesting positive case for it), but at the present time it doesn't make sense as a requirement and we worry that its inclusion will incentivize model creators to present ineffective algorithms as more reliable than they actually are.
Finally, the authors ask two questions about "trustworthiness" evaluations. There do not exist widely agreed upon trustworthiness evaluation standards for large language models. They cite two papers in regard to trustworthiness evaluations, one of which is so new it post-dates every evaluated model and the other of which is a policy paper from 2020 about the general topic.
Numerical methods
The process of designing the features included in the final score implicitly weights the priorities of transparency in a way that may not be shared across the organizations included. Given that the report presents 100 factors and 10 organizations for an even one thousand data points that, when averaged, fit nicely into a 10x10 matrix, it is clear that some of the effort went into creating a concise presentation rather than focusing entirely on accurately representing transparency. Within these factors, common elements of the machine learning research process are considered anywhere from one to about six or seven times. As a result, the scores reflect the bias of the questions the authors chose to ask more than any underlying facts, without further analysis of the sociotechnical meaning of transparency or disclosure of these biases in the work itself or the press releases.
These issues are exacerbated by the fact that on several occasions the authors ask multiple questions about the same topic, seemingly as a way to introduce virtual levels into their analysis even though each question is graded as a simple yes or no. Examples of question groups that are really multi-level answers to a single question include: Queryable external data access and Direct external data access; Limitations description, Limitations demonstration, and third-party evaluation of limitations; and Unintentional harm evaluation and External reproducibility of unintentional harm evaluation. By creating multiple questions for each of these, the authors have chosen to up-weight their relative importance and to allow themselves more freedom in issuing scores. This choice, which is perhaps unintentional, and its impact on the analysis are never discussed in the paper.
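To make this weighting effect concrete, below is a minimal sketch in Python using entirely hypothetical topics, questions, and answers (none of them taken from the FMTI data). It shows how a flat average over binary questions implicitly weights each topic by however many questions happen to ask about it, compared with averaging within topics first.

# A minimal sketch (hypothetical questions and answers, not FMTI data) of how
# a flat average over binary indicators up-weights topics that happen to have
# more questions written about them.
from collections import defaultdict

# Hypothetical scorecard for one model: (topic, question) -> 0/1 answer.
answers = {
    ("data", "queryable external data access"): 0,
    ("data", "direct external data access"): 0,
    ("limitations", "limitations description"): 1,
    ("limitations", "limitations demonstration"): 1,
    ("limitations", "third-party evaluation of limitations"): 0,
    ("release", "model weights released"): 1,
}

# Flat average: every question counts equally, so "limitations" (3 questions)
# carries three times the weight of "release" (1 question).
flat_score = sum(answers.values()) / len(answers)

# Topic-balanced alternative: average within each topic, then across topics.
by_topic = defaultdict(list)
for (topic, _question), value in answers.items():
    by_topic[topic].append(value)
topic_means = {topic: sum(vals) / len(vals) for topic, vals in by_topic.items()}
balanced_score = sum(topic_means.values()) / len(topic_means)

print(f"flat average:           {flat_score:.2f}")      # 0.50
print(f"topic-balanced average: {balanced_score:.2f}")  # 0.56

The two aggregates differ only because of how many questions were written per topic, which is exactly the kind of design choice the paper never surfaces.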
The scores throughout the process were assigned by groups of researchers answering the 100 transparency questions. The paper states that when raters disagreed, the authors sought consensus before finalizing the score. That disagreement in labels is a signal that should be incorporated into the benchmark through error bars or a lower score on that feature. Calibrating scores across models is central to making this a functioning benchmark.
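As one illustration of how that disagreement could be surfaced rather than discarded, here is a minimal sketch in Python with made-up ratings from three hypothetical raters; a more careful treatment might bootstrap over raters or report per-question intervals.

# A minimal sketch (made-up ratings, three hypothetical raters) of reporting
# rater disagreement as an uncertainty estimate instead of collapsing it into
# a single consensus yes/no per question.
import statistics

# Hypothetical 0/1 answers from three raters for four questions on one model.
ratings_per_question = [
    [1, 1, 1],  # unanimous yes
    [1, 0, 1],  # disagreement
    [0, 0, 0],  # unanimous no
    [0, 1, 0],  # disagreement
]

question_means = [statistics.mean(r) for r in ratings_per_question]
score = statistics.mean(question_means)

# Crude spread per question, averaged into a single uncertainty figure.
spreads = [statistics.pstdev(r) for r in ratings_per_question]
uncertainty = statistics.mean(spreads)

print(f"score: {score:.2f} +/- {uncertainty:.2f}")  # 0.50 +/- 0.24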
For example, the organizations in the study were given an opportunity to comment on the original FMTI scores for their model. Not included in the report is how much the scores changed through that negotiation. Including those changes, with comments in the relevant pull requests documenting the arguments and who approved each score change, would be a great feature of the GitHub repository that now houses the raw data.
Finally, given all the disparities mentioned, reporting the index as an average across all features compresses a complex picture into a single leaderboard, encouraging gamification and other downstream issues.
Conclusion
Transparency is an important value to the machine learning community for many reasons: earning trust, acknowledging the power of the tools we are building, promoting inclusion, enabling robust safety research, and more. The Foundation Model Transparency Index falls well short of supporting these interwoven goals. We welcome more work in the space of transparency, but broad assessments of the AI ecosystem need to be methodologically rigorous in order to be accurate. In its current form, the FMTI does more to confuse and commercialize LLMs than it does to make them fundamentally open. We hope this critique encourages stronger practice of the principles of open source and transparent machine learning in the pursuit of responsible AI development.
Citations
BigScience. "BigScience Ethical Charter." BigScience Blog (2022).
Black, et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models (2022).
Bommasani, et al. "The Foundation Model Transparency Index." arXiv preprint arXiv:2310.12941 (2023).
Brundage, et al. "Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims." arXiv:2004.07213 (2020).
Ferrandis, et al. "The BigScience RAIL License." BigScience Blog (2023).
Goldman. "Invisible AI watermarks won’t stop bad actors. But they are a ‘really big deal’ for good ones." VentureBeat (2023).
Kirchner, et al. "New AI classifier for indicating AI-written text". OpenAI Blog (2023).
Muennighoff, et al. "Crosslingual Generalization through Multitask Finetuning." Annual Meeting of the Association for Computational Linguistics (2023).
Phang, et al. "EleutherAI: Going Beyond "Open Science" to "Science in the Open"." NeurIPS Workshop on Broadening Research Collaborations (2022).
Raffel, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." The Journal of Machine Learning Research 21.1 (2020).
Scao, et al. "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint arXiv:2211.05100 (2022).
Wang, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." arXiv:2306.11698 (2023).
To cite this blog post, please use:
@misc{fmti-critique,
  title = {How the Foundation Model Transparency Index Distorts Transparency},
  author = {Lambert, Nathan and Gyges, SE and Biderman, Stella and Skowron, Aviya},
  howpublished = {\url{https://blog.eleuther.ai/fmti-critique/}},
  year = {2023}
}
Back to my thoughts and writing. If you're looking for a more casual analysis, Tom and I will be discussing transparency on The Retort AI Podcast this week; subscribe here.
Who should make the rules: How did we get here?
Before I get into the broader context of this critique, I want to put on the record that I respect the work of the authors of this index. Still, as someone who respects them and counts them as close colleagues, I feel it is my responsibility to do my best to provide an honest critique of where the motivations of their work went wrong. I hope people will do the same for me in these years when the boundaries of fair play are being determined in AI. Most of the issues I present almost certainly did not come from the graduate students, but from somewhere higher up. In that spirit, fair credit should be given to Shayne Longpre, who acknowledged in private that there are flaws with the method and that they erred on the side of doing something to address transparency even given the potential issues.