How to cultivate a high-signal AI feed
Basic tips on how to assess inbound ML content and cultivate your news feed.
This post is in large part the result of a recent appearance of mine on the Practically Intelligent podcast with Akshay Bhushan and Sinan Ozdemir; check out the episode here.
It’s a basic fact of life these days: you need to be intentional about your media diet. The need has only been exacerbated by ML becoming culture-war-y with the Gemini “woke” debate, which I’m taking as an opportunity to mute people spewing nonsense. These people were probably spewing nonsense before the Gemini debacle, but it wasn’t obvious enough for me to remove them from my diet. The information curation process is worth the effort. Interconnects is largely a reflection of the steps I’ve taken to focus my energy, with the benefit of my writing as a filter.
Ever since the emergence of ChatGPT, all of us trying to keep up with the happenings in ML have been on the back foot. The core assumption when consuming ML content should be that you are behind: you are not likely to know everything, you are likely to make mistakes, the information you’re reading is likely to have mistakes, and so on.
In this post, I’ll outline some of my simple rules of thumb for deciding whether or not to engage with a piece of ML content. With so much happening, saving time by avoiding the things that turn out to be misleading is just as important as spending time reading the most important developments.
Here, roughly in order, are some things to make sure you check in on when evaluating ML content:
1. Model access and demos are king of credibility
If someone is making it free for you to evaluate and benefit from their work, you know it’s way more likely to be the real deal. Talking to models is also the easiest way to know if the claims generalize. When I wrote about the Gemma models last week, a reader shared their use cases of the model, and some of them conflicted with my reading (even from reputable sources) and my own experience. Some were surprising to me, but it was a classic example of why everyone betting on and building around these models needs to use them regularly to constantly update their worldview.
It’s now common practice for people to “launch” a model just by providing examples of its usage and its scores, rather than enabling you to try it yourself. This is the opposite of Mistral’s vibes-based releases (at least until their Large model just this week), where the team is so confident in its work that it says nothing and simply gives everyone access with no advance notice. As for the latest Mistral news about the Large model, I’m largely ignoring it because they’re not releasing it openly. If the weights were open, it would probably be yet another emergency blog post.
2. Focus your feed on depth or breadth
Too many people today are trying to know everything about every model. You need to know which parts of your information diet yield leverage and focus on those. For me, that means depth: coverage of everything in the open fine-tuning space, with a focus on models that are shipped, plus some extra coverage of closed-model fine-tuning. For my day job, I largely ignore all things pretraining, GPU optimization, inference, etc. It’s okay to still find them interesting, but don’t waste time going deep on them. Someone like a venture capitalist will likely have much more breadth than me, and the open-versus-closed debate is not as important to them.
3. Examples of using the model normally show it’s usable, shockingly
You’d be surprised how many ML projects these days don’t work out of the box to replicate their claims. Examples of model use look like people including inference code, support in related libraries like vLLM or other open-source projects, or even specific examples of the model working. Essentially, they’re encouraging you to use the artifact, but aren’t footing the bill to make it generally accessible (which not everyone can afford). Extra points if they describe what scale of compute is needed to run the various aspects of the project.
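As a rough illustration of what “including inference code” looks like in practice, here is the kind of minimal snippet I like to see in a release README. This is a sketch, not any particular project’s code, and the model ID is a placeholder rather than a real checkpoint:

```python
# A minimal "try it yourself" snippet of the kind a good release ships.
# The model ID below is a placeholder; swap in the actual checkpoint from the release.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="example-org/released-model-7b",  # hypothetical model ID
    device_map="auto",
)

prompt = "Explain the difference between a reward model and a preference dataset."
result = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```

If a release can’t offer even a few lines like this, that tells you something about how much the authors expect you to actually run it.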
For projects with open code, especially on the research side, the quality of the code is normally extremely easy to assess. Do they have a normal package structure, some tests, and an active GitHub repo? Great. Do they just have one commit labeled “release,” with lots of stray comments and scripts? That’ll be harder to build on. Many research projects release the code just so they can say they did, which normally means the results could be harder to build on.
4. Leaderboards as the single leading claim are often anti-signal
Most people already know that both leaderboards like the Open LLM Leaderboard and popular evaluations like MMLU and HellaSwag are often highlighted in a marketing-centric way. The more nuanced point to remember here is the difference between these claims appearing alongside other materials versus standing on their own. A new research project shares a new method and its models land atop some leaderboard? Great. A random model shows up and all its creators share is high leaderboard scores? You can probably ignore it, unless they’re using a relatively unproven method, like model merging is these days. Even then, it’s not likely to be game-changing until tons of models are released with the method, so waiting to react is a good strategy.
5. Basic deep learning conceptual checks will often save you
While we’re here, there are also a few principles of deep learning that have yet to be violated and are worth checking against any piece of work. All of them are based on the idea that there is no free lunch.
Smaller cannot be better at everything. Scaling still matters. I got a lot of joy out of Francois Chollet's outing of Phi 2 as not being a savior model. The same goes in my field for techniques like Less is More for Alignment (LIMA). When you go small, you’re normally overfitting. Small models have a use, but if you hear claims of generality combined with scaling down, you’re probably being misled.
Simple ideas tend to win out. Be wary of complexity. Simple ideas more often come with code and are long-lasting. The Direct Preference Optimization paper has more usage than all the other, somewhat complex RLHF algorithms proposed in the last year combined. This also goes for things that are less popular: the simple ideas people sleep on may be an opportunity.
Put eval scores in context. If an eval dataset is multiple choice out of 4 options and therefore yields ~25% correctness by guessing, is a reported 30% score actually good? If evaluation scores are close to this chance threshold, it’s normally not worth caring at all.
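To make that arithmetic concrete, here is a rough sketch of the mental adjustment; the numbers are just the hypothetical 30%-on-four-choices case above:

```python
# Rough sanity check: how much of a multiple-choice score is above random guessing?
def above_chance(score: float, num_choices: int) -> float:
    """Fraction of the headroom above random guessing the model actually captured."""
    chance = 1.0 / num_choices
    return (score - chance) / (1.0 - chance)

# A reported 30% on a 4-way multiple-choice eval:
print(above_chance(0.30, 4))  # ~0.067, i.e. only ~7% of the available headroom
```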
6. If it’s not even remotely reproducible or verifiable, it’s not science
Especially in 2023, there were so many papers released just studying and/or slightly improving the behavior of closed models like GPT-4 via prompting or other analyses. These papers, at best, will report exactly the model version they used, but those model versions still change under the hood whenever OpenAI realizes it needs to update a safety or other liability filter. Any paper that relies on such a black box does not follow the same scientific standards the rest of the field uses.
This isn’t to say these results are always worthless; they can provide useful intuitions, but the measurement behind their claims is often oversold. Directionally, the work is probably accurate, but the only people who can actually measure the results fully are the closed model providers themselves. Scientific and reproducible results are still a foundation of much progress in AI, so it’s worth paying more attention to them.
7. Don’t over-index on Twitter & know incentives
Twitter is great — it’s where I first see news of most of the new things happening in ML. That does not mean that Twitter is where you should formulate all of your beliefs on what is happening. ML Twitter these days is a substantially slimmed-down community compared to the old days. Many smart people have just left. This leaves us with fewer perspectives and an algorithm that more strongly rewards groupthink. Most of the time, the group is right, but sometimes the group will be wrong and very set in their ways. This is what you avoid by not over-indexing on Twitter.
The most recent example of this is the blowback to the Gemini happenings here (paid section), where there’s obviously a major issue with the model, but everyone is overgeneralizing it into Gemini’s demise. Other companies have gone through this bias challenge before in different ways.
A fact that a lot of people still don’t know: paper Tweeters like AK very regularly share work when asked for it — it’s not always as organic of a process as people may think. Yes, these folks play an important editorial role by reading every title on Arxiv, but it is worth doing sanity checks to make sure the information flow you expect matches reality.
For one source of diversification: all the Substacks I recommend in the ML space are kept here. It’s focused on smaller outlets that I either consume myself or know to be a clear complement to my work.
The incentives of Twitter are very different from those of Substack. This is why a lot of people read this blog: I’m a largely neutral entity. I benefit from the entirety of the AI ecosystem progressing and improving more than I do from any one model or product. On Twitter, many creators are literally paid for clicks (I make small beans, like $15-20 per month), but generally the incentive is views. Know the incentives of whoever you are reading.1
8. Data sharing, licenses, communication clarity, and small things add up
When looking at a release, there’s a long tail of little things that can sway me into deciding a project is probably legit. They all take the form of small things that a vocal few will care about rather than the majority of users. An organization taking the time to support the long-tail audience of its work is much more likely to be invested in that work long-term.
9. Research papers, technical reports, blog posts, and Tweets all serve different purposes
From the paper on down, a project that can produce one of these can usually produce all the ones below it, and the more information a team discloses, the more excited you should be. Someone drops a model with just leaderboard scores? I usually pass. Someone drops a model with a technical report showing how they built it? That’s promising. Someone drops a model with a research paper comparing their work with other existing training methods and models? That’s the one to invest your time in.
Similar to the previous point, basic paper quality still tells you a lot. If a paper has a consistent figure style, clean organization, and high-quality writing when you jump to a random section, it’s more likely to be useful and compelling. Of course, there are people who try to flag-plant an idea without quality artifacts around it, but it’s one of those averages games.
When parsing a paper, results that are overly specific will probably be lost in the noise of progress over time. This is an advanced point, but with time you can learn to absorb only what matters in the big picture from a paper. This reduces most papers to being confidence adjustments on a worldview (as many results repeat) rather than true breakthroughs like DPO. One thing I’m seeing these days is how RLHF can improve code and math reasoning in many ways, but the claims any one paper makes about its specific method are likely to be lost.
10. Socialize your information and build relationships
This maybe should be higher up, but all information extraction should not be a one-way street in your brain. You need ways to test your worldview and your feed. Talk with folks about what you’re seeing, share your takes, and update your priors. This is honestly the best way to make keeping up more fun than exhausting. Scientific progress comes from the collective more than the individual.
The same goes for who you’re receiving content from. It’s much easier to process information coming from consistent sources. Figure out the modalities that work for you. This helps you normalize by publisher intent too — the process of knowing what you don’t know. For every new person you start following, you need to go into it understanding that you don’t know their worldview. Snoop around and figure out what someone may gain by spinning the news a certain way.
Recent work: data selection for RLHF survey
I recently helped with a short-ish section on preference fine-tuning data selection for a larger survey paper. As I’ve been saying, we have a lot to learn about RLHF data. Right now, we use:
1. Manual filtering (e.g. LIMA): very specific rules for better performance.
2. Model-based evaluation (UltraFeedback/Nectar): ask GPT-4 to filter out bad datapoints. Lots of RLAIF datasets have this step.
3. Reward model re-weighting (tons of papers you haven’t heard of): the many ways the field has evolved rejection-sampling-like ideas (lots of work in code, reasoning, and other areas here I wasn’t expecting). A minimal sketch of this flavor is below.
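For the third category, here is a hedged sketch of the best-of-n / rejection sampling idea: sample several completions per prompt, score them with a reward model, and keep only the highest-scoring one as fine-tuning data. Both model IDs are placeholders, and real pipelines add chat templates, batching, and score normalization.

```python
# Minimal sketch of rejection sampling for data selection:
# generate n candidate completions per prompt, score with a reward model, keep the best.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

POLICY_NAME = "example-org/sft-model"     # hypothetical fine-tuned policy
REWARD_NAME = "example-org/reward-model"  # hypothetical reward model (single-logit classifier)

policy_tok = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME, device_map="auto")
reward_tok = AutoTokenizer.from_pretrained(REWARD_NAME)
reward = AutoModelForSequenceClassification.from_pretrained(REWARD_NAME, device_map="auto")


def best_of_n(prompt: str, n: int = 8) -> str:
    """Generate n completions and keep the one the reward model scores highest."""
    inputs = policy_tok(prompt, return_tensors="pt").to(policy.device)
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=256,
        num_return_sequences=n,
    )
    # Strip the prompt tokens so only the completions are decoded and scored.
    completions = policy_tok.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

    scores = []
    for completion in completions:
        reward_inputs = reward_tok(
            prompt, completion, return_tensors="pt", truncation=True
        ).to(reward.device)
        with torch.no_grad():
            scores.append(reward(**reward_inputs).logits[0, 0].item())

    return completions[scores.index(max(scores))]
```

Most of the interesting choices live in how the scores get used, keeping only the top sample, thresholding, or re-weighting examples by score, which is where the many papers in this bucket differ.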
If you compare this to other sections like pretraining, it's obvious how much headroom we have. Check out the paper here.
Audio of this post will be available later today on podcast players, for when you’re on the go, and YouTube, which I think is a better experience with the normal use of figures.
Looking for more content? Check out my podcast with Tom Gilbert, The Retort.
Newsletter stuff
Elsewhere from me
On The Retort ep 21, Tom and I vented about a crazy week of releases from OpenAI, Google, and the lot.
Models, datasets, and other tools
The Cosmo dataset from HF, built by rephrasing wiki pages, is likely the best we can do as a substitute for Phi’s dataset.
Phind model on code looks interesting, but is closed source. I’m excited by more model providers focusing on one task.
In model merging land, there was a modified Mistral 7B model that shows promise.
This video/image to game model from DeepMind is awesome.
Argilla released another dataset of preferences on Hermes, but I don’t think the RM used to select chosen and rejected will be good enough to make this a sticky dataset.
Links
Mozilla wrote a whitepaper on openness in AI and policy, which is opinionated and entertaining.
This Mamba blog post looks like a good complement to my writing; it’ll show you more of the low-level architecture structures.
The one-year reflection from the Latent Space pod is a great read on how to build a content brand on the internet in AI. This sort of thing is what makes a content producer great:
Swyx and I cancelled multiple episodes after recording them because we didn't think the quality was high enough.
Housekeeping
An invite to the paid subscriber-only Discord server is in email footers.
Interconnects referrals: You’ll accumulate a free paid sub if you use a referral link from the Interconnects Leaderboard.
Student discounts: Want a large student discount on the paid subscription? Go to the About page.
Credit to Jaan for nudging me to add this back in https://twitter.com/thejaan.