Alignment-as-a-service: Scale AI vs. the new guys
Scale’s making over $750 million per year selling data for RLHF, who’s coming to take it?
Scale AI, the startup once known for various self-driving and military data-labeling contracts, is closing in on a billion dollars of annualized revenue from its data services for techniques like reinforcement learning from human feedback (RLHF). I’d heard plenty of rumors about this, including that they work with every major lab from Meta to OpenAI, but seeing some of it in public reporting feels different.
Quoting from Forbes ($):
Scale's defense arm still pales in comparison to its commercial business, which pushed the company's annualized revenue run rate in late 2023 to $750 million, according to two people familiar with the matter — up from about $250 million in early 2022. That growth was spurred by skyrocketing demand from AI companies for reinforcement learning with human feedback (RLHF), a data labeling technique for large language models.
We don’t have good ways to get expert data for the domains we care about, or the additional data needed for robustness, and Scale has been the place curating most of that, for a moderate fee.
Scale’s revenue mirrors the general industry sentiment towards RLHF over the last few years. I’m curious what the $250 million in early 2022 was for, but I can assure you that lots of data budgets across the ML industry got reallocated towards RLHF. Doing RLHF well and making a truly sticky model takes a big investment.
These revenue numbers aren’t far off from what OpenAI is seeing, and they come from a different direction (and a much, much lower valuation): paid services for building and maintaining models rather than serving models. While the short arguments on foundation model providers have been studied at length — they focus on the commoditization of the services and races to the bottom — it’s useful to study Scale, figure out where they’re making their money, and understand where the upstarts will try to jump in.
The question is: Does any moat apply to supplying companies that are hungry for data for RLHF? The most public example of what people are paying for is in the Llama 2 technical report (which could change a lot in Llama 3 because traction on Llama 2 chat was extremely low). Below is a table showing that Meta bought over a million comparisons for their models, where similar open-sourced datasets barely crack 100k samples:
On the Latent Space emergency pod for Llama 2, we estimated that the price of this reported data alone was on the order of $6-10 million (by extrapolating from known public contracts on a price-per-comparison basis). This is one of many such experiments a company like Meta would be running. They’d also have experiments for the other flagship models in the GenAI org, like text-to-image and text-to-video, which arguably can benefit even more from RLHF.
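The extrapolation itself is just arithmetic. A minimal sketch, assuming roughly 1.4 million comparisons (the ballpark the Llama 2 report lists for Meta’s own preference data) and a hypothetical $4-7 paid per labeled comparison — both illustrative assumptions, not known contract terms:

```python
# Back-of-envelope estimate of preference-data cost on a price-per-comparison
# basis. All numbers are assumptions for illustration: ~1.4M comparisons and
# a hypothetical $4-7 per labeled comparison pair.
num_comparisons = 1_400_000
price_per_comparison = (4.0, 7.0)  # hypothetical low/high $ per comparison

low = num_comparisons * price_per_comparison[0]
high = num_comparisons * price_per_comparison[1]
print(f"Estimated data cost: ${low / 1e6:.1f}M - ${high / 1e6:.1f}M")
```

With these assumed prices the range lands near the $6-10 million figure from the pod; even wide uncertainty on the per-comparison price keeps the total in the single-digit millions.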
Returning to the leaked $750 million per year of revenue, that would come primarily from a few large-scale contracts like what Meta is doing. Scale is the provider managing tons of humans creating this data. While there are ethical challenges to doing this well, which Scale doesn’t have the best record on, the question I’m trying to answer is: How good is the underlying business here?
As long as the business is more about managing large swaths of humans and curating datasets over and over again, it doesn’t look like a tech company with increasing profit on fixed capital costs. How much can Scale AI increase its margins by employing more and more humans? If I were in the room buying data, one of my first demands would be “We don’t want any of our prompts sold to other customers,” which cuts off the most obvious route to reusing that fixed cost.
Second, we don’t know how long the data services will be needed. If the base models get better, will we still need RLHF and various fine-tuning techniques? From my experience, I expect the need for expert data to remain, but much of the “generic alignment data” will eventually be open-source and reusable. Or, at an even more obvious level, how long will all the big tech companies keep wanting to train their own LLMs? Eventually, a few companies will win, consolidation will hit, and Scale could lose some lucrative contracts.
While it’s easy to paint a grim picture here, all AI startups are in this boat. What happens to OpenAI when the open models well and truly catch up on the use cases most folks care about?
The competition with humans-in-the-loop
At some point last year, RLHF looked like a limited market for safety-conscious training. Now, it’s clear that the RLHF point of view is central to delivering consistent value across various capabilities (e.g. math or domain-specific tasks). It’s funny to frame Scale AI as the incumbent when it’s still very much a startup, but most companies founded in the pre-ChatGPT era feel extremely old next to the new crop founded in the last year or two.
While most companies are documented as Scale customers, it’s important to know that multiple research papers, like the InstructGPT paper and Anthropic’s RLHF work, mention using in-house annotators. Some of the early adopters of RLHF have already moved elsewhere or to their own stacks. The smaller alternatives to Scale AI for human data include companies like Surge AI, Invisible, Prolific, and Toloka AI (all of which I’ve crossed paths with professionally or personally in the last 15 months), plus the many others I surely missed.
Surge AI has captured at least part of Anthropic’s business, with a few flashy editorials highlighting researchers’ opinions, such as this one from Ethan Perez:
With Surge, the workflow for collecting human data now looks closer to “launching a job on a cluster” which is wild to me.
It’s not clear from public data what performance gains come from changing providers. Deciding on a data labeling service is a messy process, heavily influenced by normal sales tactics. It’ll be obvious in a few years who’s doing it best. Hint: I bet a lot of them are using big LLM plus human-in-the-loop workflows.
Regardless, having touch points with model performance is the most important thing for these businesses: moving from chucking data over the wall to being meaningfully involved in evaluation and training discussions. Models are still where people make decisions.
Scaling Alignment-as-a-Service via AI feedback
Alignment-as-a-service (AaaS, rhymes with Haas) is the idea of paying a recurring service fee to either enable or monitor LLM endpoints. This could be everything from discovering use cases that users aren’t happy with to providing new training data to fix problems.
I’ve thought for a long time about how AaaS could make a viable startup. To deliver outsized returns, any technology company needs large returns over its fixed costs. Providing RLHF as a service would work best when fixed costs like data or compute can offer long-term value. There are many open questions in figuring this out, with varying levels of risk. For now, I see model-management tools (e.g. continual training) as a higher-margin product than data services, but they’re less viable / stable with current technology.
The most basic approach is to simply try and scoop someone like Scale AI at their own game. If I go to Scale and buy $20 million in data and then repurpose that data over and over to serve the training needs of more customers, I could have a viable long-term business. The primary challenge to this argument is the need for on-policy data, i.e. data generated by the models you are training, which a fixed dataset definitionally would not have. Even though you can regenerate completions to a prompt, that doesn’t materialize in the chosen-rejected pairs needed for RLHF.
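To make the on-policy point concrete, here is a minimal sketch; the `PreferencePair` schema and `policy_id` field are hypothetical illustrations, not any specific library’s format:

```python
# Sketch of why a purchased, fixed preference dataset goes "off-policy".
# A preference record pairs one prompt with two completions plus a label;
# the schema here is illustrative only.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str     # completion the human labeler preferred
    rejected: str   # completion the human labeler rejected
    policy_id: str  # which model generated the two completions


def is_on_policy(pair: PreferencePair, current_model: str) -> bool:
    # RLHF works best when completions come from the model being trained;
    # a dataset bought once is frozen at the policy that generated it.
    return pair.policy_id == current_model


pair = PreferencePair("Explain RLHF.", "A: ...", "B: ...", "llama-2-base")
print(is_on_policy(pair, "my-new-model"))  # the reused dataset fails this check
```

Regenerating completions with your own model gives fresh `chosen`/`rejected` candidates, but the human preference labels attached to the old pairs don’t transfer, which is the core of the reuse problem.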
The biggest slice of the pie this sort of company can take is by solving the qualitative pitfalls John Schulman describes in his ICML talk on proxy objectives. Essentially, John describes how ChatGPT is playing whack-a-mole with capabilities via RLHF¹. They use RLHF to improve one part of the user distribution, but that causes another problem to emerge. Technically, this probably looks like either continually training a fine-tuned model on slightly more data reflecting new use cases (which seems hard due to stability concerns with out-of-distribution data in RL) or redoing the alignment phase of model training with new instruction and preference data that encompasses the new behavior.
However you slice it, the need to consistently monitor the behavior of a deployed model and update its capabilities with a fast turnaround represents an extremely high-value business. Companies offering alignment as a service can start by offering fine-tuning help to companies wanting to play with the hottest new open-source model, then proceed down the stack into compute access, deployment, and other hot real estate for ML startups.
This all seems well and good, but a question looms over this startup design, Scale AI, and their peers alike: What happens when synthetic data gets even better? The primary advantage I see in human-in-the-loop datasets right now is sufficient diversity. When using LLMs to generate prompts, it is very tricky to get a truly diverse dataset. Can the ideas in synthetic data go far enough that we can order a dataset covering all of our needs with the click of a button? I suspect that would be at least 100 times cheaper than what Scale AI et al. offer right now.
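One crude way to see the diversity problem is to measure how much generated prompts repeat each other, e.g. the fraction of unique word trigrams across a prompt set. This is a hypothetical illustration, not a method from the post:

```python
# Toy diversity metric for a set of prompts: fraction of unique word
# trigrams. Near 1.0 means prompts rarely share phrasing; lower values
# signal the templated repetition common in naive LLM-generated prompts.
def trigram_diversity(prompts: list[str]) -> float:
    trigrams = []
    for p in prompts:
        words = p.lower().split()
        trigrams.extend(zip(words, words[1:], words[2:]))
    return len(set(trigrams)) / max(len(trigrams), 1)


human = [
    "Write a haiku about rain.",
    "Debug this segfault in my C code.",
    "Plan a vegan dinner party.",
]
synthetic = [
    "Write a story about a dog.",
    "Write a story about a cat.",
    "Write a story about a bird.",
]
print(trigram_diversity(human), trigram_diversity(synthetic))
```

In practice people use embedding-based similarity rather than n-grams, but the gap the toy metric shows (varied human prompts vs. templated generations) is the gap synthetic data pipelines need to close.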
If I were trying to do this, I’d try to get the best internal critic and reward models. A solid critic model lets one do things like constitutional AI and build datasets of revisions; a good reward model lets one supercharge a model with basic data selection methods. I suspect these types of models also enable easier next-generation evaluation tools.
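As a sketch of what reward-model data selection looks like (best-of-n, sometimes called rejection sampling), with a toy stand-in scorer rather than a real trained reward model:

```python
# Best-of-n data selection: score candidate completions with a reward model
# and keep only the best one per prompt for fine-tuning.
def reward_model(prompt: str, completion: str) -> float:
    # Toy stand-in scorer (illustrative only): in practice this would be a
    # trained model returning a scalar preference score.
    on_topic = prompt.split()[0].lower() in completion.lower()
    return len(completion) + (10.0 if on_topic else 0.0)


def select_best(prompt: str, completions: list[str]) -> str:
    # Keep the highest-reward completion; the kept (prompt, completion)
    # pairs form the supervised fine-tuning dataset.
    return max(completions, key=lambda c: reward_model(prompt, c))


samples = ["Sure.", "RLHF tunes a model against human preference labels."]
best = select_best("Explain RLHF briefly.", samples)
print(best)
```

The same reward model can rank entire purchased datasets, which is one way a data-services company could add margin on top of raw labeling.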
Regardless of the specifics of who wins in this space, and I expect there to be a couple of winners, it’s great to see a couple of viable business models in the AI landscape. It’s good for the field.
The potential downfall of all of these alignment companies is base models getting so good that we no longer need RLHF. The longer RLHF remains a crucial leg of the process, the less likely this gets (thanks to basic sunk-cost fallacies among the decision-makers driving billion-dollar investments). Regardless, it’s important to consider what would make your business well and truly go to zero. I suspect that even if RLHF is no longer needed for safety and capabilities, the market for personalization would still drive substantial value, but maybe not $1-billion-in-revenue kind of value.
One of my friends is founding a startup in this alignment-as-a-service space and he’s planning on coming on for an interview soon, so stay subscribed to get that update!
Looking for more content? Check out my podcast with Tom Gilbert, The Retort.
Elsewhere from me
On episode 18 of The Retort, we discussed the OLMo models from last week and the underbelly of the internet driving AI (hint: AI Waifus).
WANDB logs for OLMo are out — see all the details in the LLM training runs!
Models, datasets, and other tools
A first attempt at open Constitutional AI was done by my former team at HuggingFace!
A small, very strong model from OpenBMB (the creators of the UltraFeedback dataset I talk about a lot).
A friend of the pod is hiring folks similar to myself to do China-US AI research.
I enjoyed this accessible talk from Sasha Rush (another friend of the pod) on LLMs in 5 formulas. It’s good to learn basics of perplexity and scaling, for example.
An invite to the paid subscriber-only Discord server is in email footers.
Interconnects referrals: You’ll accumulate a free paid sub if you use a referral link from the Interconnects Leaderboard.
Student discounts: Want a large student discount? Go to the About page.
¹ Highlighted by the recent laziness drama?