Artifacts Log 2: Gemma 2, more Chinese LLMs, high quality datasets, and domain-specific training
A roundup of interesting open ML models in June (and July) 2024.
Previous Issues: #1 - May 2024
There are no signs of open models slowing down. Tons of models. Tons of topics. The biggest stories are Nemotron 340B from Nvidia, which I discussed at length in my recent post on synthetic data, and Gemma 2 from Google, which I haven’t covered directly until now. Gemma 2 is a very serious model that beats Llama 3 Instruct on ChatBotArena. The technical report has a lot of pointers to novel techniques but not a lot of answers for how others could do this too.
The open model ecosystem is clearly healthy.
Given the number of models, I’ve broken them down by category. Models at the top of each list are the most interesting, and some models are filtered out to keep the issue a reasonable length.
General-use text models
gemma-2-27b by google: This is a serious model. I could write a speculative post about each of the sections in the report. In summary, it evaluates well on ChatBotArena, is trained on LMSYS data, is distilled similarly to Gemini (probably, as discussed in my recent post), uses model merging during fine-tuning, uses an order-of-magnitude-larger reward model for RLHF (>100B parameters), uses synthetic and human data, and is a reasonable size for inference on one 80GB-memory GPU. Read more in the technical report here.
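The report is light on exact recipes, but the core mechanics of token-level distillation are easy to sketch. Below is a minimal, illustrative loss function (my guess at the general pattern, not Google’s implementation) where a student matches a teacher’s next-token distribution via KL divergence; the temperature is a placeholder.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions.

    Both logits tensors are [batch, seq_len, vocab] from models sharing a
    tokenizer. Purely illustrative -- not the Gemma 2 recipe.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * (temperature ** 2)
```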
Otherwise, I seriously expect future Gemma models to replace a lot of Llama models in workflows. Google shows every intention of putting a lot of weight behind these, which is fantastic to see. Hopefully it can continue.
For more on Gemma 2, see this post from HuggingFace.
Qwen2-72B-Instruct by Qwen: Another very strong and recent open model. The instruct version came in at around the same level as Command R Plus, but it is the top open-weight Chinese model on LMSYS. Two API models, Yi-Large and GLM-4-0520, are still ahead of it (but we don’t know what they are).
DeepSeek-V2-Lite by deepseek-ai: Another great chat model from Chinese open model contributors. Consistently, the 01-ai, DeepSeek, and Qwen teams are shipping great models. This DeepSeek model has “16B total params, 2.4B active params” and is trained on 5.7 trillion tokens. This is a great size for many people to play with.
K2 by LLM360: A 65B “fully open-source” model. This model reaches similar performance to Llama 2 70B and uses less compute (only 1.4 trillion tokens).
I’ve added these models and some of their recent peers to the MMLU vs. training compute plot. Models are continuing to climb the compute-efficiency frontier (especially when you compare them to models like Llama 2 and Falcon 180B that are recent memories).
neo_7b by m-a-p: Another open-source model (at least they include the data; I haven’t looked at the code). It’s great for OLMo to have more competition and peers to learn from.
Mistral-7B-Instruct-v0.3 by mistralai: Mistral is still improving their small models while we’re waiting to see what their strategy update is with the likes of Llama 3 and Gemma 2 out there.
openchat-3.6-8b-20240522 by openchat: These openchat models are really popular with researchers doing RLHF. They are strong base models to do continued RLHF or reward modeling on, and here’s the latest version!
Phi-3-medium-4k-instruct, Phi-3-small-8k-instruct, and the rest of the Phi family by microsoft: We knew these models were coming, but they’re solid for trying out tasks like data filtering, local fine-tuning, and more.
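If you want to poke at one of these locally, the standard transformers loading path is all you need. A minimal sketch (the model ID matches the card above; the prompt and generation settings are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-medium-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Why does data filtering matter for pretraining?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```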
Reward models
llemma-7b-prm-metamath-level-1to3-hf by ScalableMath: While process reward models (PRMs) — reward models that score each step in a reasoning chain — are documented by OpenAI as being really helpful for model reasoning capabilities (more in my Q* post), there are almost none on HuggingFace. I was scraping for them and found that this one organization has a couple!
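For anyone who hasn’t used one, the interface is simple: split a solution into steps and score each prefix. The sketch below is hypothetical (I haven’t run this particular checkpoint), with `score_prefix` standing in for whatever scoring head a given PRM exposes.

```python
def score_solution(prompt: str, steps: list[str], score_prefix) -> list[float]:
    """Score each reasoning step, conditioned on the problem and prior steps.

    `score_prefix` is a placeholder for a PRM's scoring function; this is not
    the documented interface of the ScalableMath checkpoint.
    """
    scores, prefix = [], prompt
    for step in steps:
        prefix = prefix + "\n" + step
        scores.append(score_prefix(prefix))  # e.g., P(this step is correct)
    return scores

# Solutions are often reranked by the minimum (or product) of their step scores.
```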
GRM-llama3-8B-distill by Ray2333: This model comes from a new paper that adds some language model loss functions (DPO loss, reference-free DPO, and SFT, like InstructGPT) to reward model training for RLHF. It shows strong results on RewardBench and downstream RLHF performance. This is close to what I've heard from some industry labs regarding RM training, so I’m happy to see this.
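The core idea is easy to write down: keep the standard pairwise (Bradley–Terry) reward loss and add a weighted language-modeling term on the chosen responses so the model doesn’t drift from generating good text. A rough sketch of such a combined objective (the exact regularizer and weighting differ across the paper’s variants; this is not their code):

```python
import torch.nn.functional as F

def rm_loss_with_lm_reg(reward_chosen, reward_rejected, chosen_logprobs, alpha=0.1):
    """Pairwise reward loss plus an SFT-style regularizer on chosen responses.

    reward_chosen, reward_rejected: scalar rewards per pair, shape [batch].
    chosen_logprobs: summed token log-probs of the chosen response from the
        model's LM head, shape [batch]. Illustrative only.
    """
    bt_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()  # Bradley-Terry
    sft_loss = -chosen_logprobs.mean()  # keep the chosen text likely under the LM head
    return bt_loss + alpha * sft_loss
```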
Datasets
HelpSteer2 by nvidia: It’s rare that we get access to a dataset created by one of the big data labelling labs (they push pretty hard against open-sourcing in my experience, in order to protect their business model). This dataset, and particularly the accompanying paper, is a dense resource filled with insights on how state-of-the-art fine-tuning may actually work in industry labs.
fineweb-edu by HuggingFaceFW: This is the “high-quality” split of the recent well-received pretraining corpus from HuggingFace. The split was created by training a classifier on Llama 3 70B annotations to identify educational-style content. This type of filtering is on a fast track to being used everywhere (along with distillation from a bigger model during training).
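Mechanically, this kind of filtering is just scoring every document with a small classifier and keeping what clears a threshold. A toy sketch of the pattern (the scoring function, threshold, config, and field names here are placeholders, not the FineWeb-Edu pipeline):

```python
from datasets import load_dataset

def educational_score(text: str) -> float:
    """Stand-in for a small classifier trained on LLM quality annotations."""
    return 0.0  # a real pipeline loads a trained classification head here

def keep_educational(example, threshold=3.0):
    return educational_score(example["text"]) >= threshold

# Config and field names are assumptions about the public FineWeb release.
raw = load_dataset("HuggingFaceFW/fineweb", "sample-10BT", split="train", streaming=True)
filtered = raw.filter(keep_educational)
```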
Domain-specific fine-tunes
aya-23-35B by CohereForAI: Cohere updated their original Aya model, covering fewer languages and using their own base model (Command R, while the original model was trained on top of T5).
TowerBase-7B-v0.1 by Unbabel: A multilingual continued training of Llama 2 7B; importantly, it “maintains the performance” on English tasks. This is a domain I expect to expand.
Llama3-8B-Chinese-Chat by shenzhi-wang: A Chinese focused Llama 3.
Swallow-70b-instruct-v0.1 by tokyotech-llm: A Japanese focused Llama 2 model.
internlm2-math-plus-mixtral8x22b by internlm: Next model in the popular series of math models.
scitulu-70b by allenai: A Llama 2 fine-tune designed to specialize in scientific information extraction and processing tasks. Built on top of our Tulu 2 work!
DeepSeek-Coder-V2-Instruct by deepseek-ai: A super popular new coding model. Evals of coding-specific models like this are tending to match or pass the API-based general models. I haven’t given them a shot yet.
Visual language models (VLMs)
Llama-3-8B-Dragonfly-v1 by togethercomputer and MiniCPM-Llama3-V-2_5 by openbmb: Two new late-fusion VLMs built on the Llama 3 8B backbone. Please reach out if you have experience with these.
Phi-3-vision-128k-instruct by microsoft: Reminder that Phi had a vision version!
Other models I flagged
CommonCanvas-XL-C by common-canvas: A text-to-image model with better data traceability. From the model card: “The goal is to produce a model that is competitive with Stable Diffusion 2, but to do so using an easily accessible dataset of known provenance. Doing so makes replicating the model significantly easier, and provides a clearer mechanism for applying training-data attribution techniques.”
Skywork-MoE-Base by Skywork: Another MoE model.
mamba2-2.7b by state-spaces: Mamba v2!
Zamba-7B-v1 by Zyphra: A hybrid model (like StripedHyena) with Mamba and Transformer blocks.
Yuan2-M32-hf by IEITYuan: Another MoE model.
glm-4-9b-chat by THUDM: A really popular Chinese chat model, though I couldn’t parse much about it from r/LocalLLaMA.
esm3-sm-open-v1 by EvolutionaryScale: A giant model for protein prediction from a new high-valuation startup.
Hermes-2-Theta-Llama-3-70B by NousResearch: A general chat model from one of the normal fine-tuning groups!
Links
This commencement speech from Grant Sanderson of 3Blue1Brown fame was one of the best I’ve ever watched. It nails a lot about how to navigate a career and early life.
I enjoyed this article on “The importance of stupidity in scientific research.” Too much of modern ML is about grinding.
Elsewhere from me
I was on a couple podcasts recently.
In late May I was on the ChinaTalk podcast to discuss the GPT-4o launch, DC vibes on open source, China and open source, etc.
In June I was on SuperDataScience to cover recent happenings in the space of RLHF. Covering normal topics: DPO, personalization, robotic foundation models, etc.
References: (2024 artifacts, 2023 artifacts, MMLU vs training compute model)
Keep sending me models (and datasets)!