New open source AI leader Reflection 70B’s performance questioned, accused of ‘fraud’

[Image: AI comic-book-style illustration of an angry crowd pointing accusingly at a dark-haired man holding a small robot. Credit: VentureBeat made with OpenAI ChatGPT]



It took just one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.

Reflection 70B, a variant of Meta’s Llama 3.1 open source large language model (LLM) — or wait, was it a variant of the older Llama 3? — was trained and released by small New York startup HyperWrite (formerly OthersideAI) and boasted impressive, leading results on third-party benchmarks. It is now being aggressively questioned, as other third-party evaluators have failed to reproduce some of those performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024, as “the world’s top open-source model.”

In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview over X direct messages with VentureBeat, Shumer explained that the new LLM used “Reflection Tuning,” a previously documented technique developed by researchers outside the company in which an LLM checks the correctness of — or “reflects” on — its own generated responses before outputting them to the user, improving accuracy on a number of tasks in writing, math, and other domains.
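To make the idea concrete: the sketch below is not HyperWrite’s actual training recipe — Reflection 70B reportedly fine-tunes the reflection step into a single model response — but the check-then-output flow can be approximated with a simple multi-pass prompting loop. Everything here, including the `generate` stub and the prompt wording, is a hypothetical illustration rather than code from the project:

```python
# Minimal sketch of a reflection-style generation loop: draft an answer,
# have the model critique its own draft, then emit a revised final answer.
# `generate` is a hypothetical placeholder for any chat-completion call.

def generate(prompt: str) -> str:
    """Placeholder for a single LLM inference call; wire up a real endpoint."""
    raise NotImplementedError

def reflect_and_answer(question: str) -> str:
    # Pass 1: produce an initial draft answer.
    draft = generate(f"Question: {question}\nAnswer step by step.")

    # Pass 2: the model reviews its own draft for mistakes.
    critique = generate(
        "Review this answer for factual or logical errors and list any you find.\n"
        f"Question: {question}\nAnswer: {draft}"
    )

    # Pass 3: produce a revised answer informed by the self-critique.
    return generate(
        "Rewrite the answer, fixing the problems noted in the critique.\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}"
    )
```

The distinction matters for benchmarking: a fine-tuned model performs the critique-and-revise step inside one generation, rather than through chained API calls as above.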


However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” — referencing the commonly used Massive Multitask Language Understanding (MMLU) benchmark — “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” a major discrepancy with HyperWrite/Shumer’s originally posted results.
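For context on what that disputed number measures: MMLU is a multiple-choice benchmark spanning 57 subjects, and a model’s score is simply its accuracy across the test items. A toy sketch of the scoring loop, where the `ask_model` stub and the single example item are hypothetical placeholders:

```python
# Toy illustration of how an MMLU-style score is computed: each item is a
# four-way multiple-choice question, and the reported number is the fraction
# answered correctly. `ask_model` is a hypothetical stand-in for real inference.

ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    # ...the real benchmark has roughly 14,000 test items across 57 subjects
]

def ask_model(question: str, choices: list[str]) -> int:
    """Placeholder: return the index of the option the model selects."""
    raise NotImplementedError

def mmlu_accuracy(items: list[dict]) -> float:
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)
```

Fixed items and a deterministic scoring rule are what make the kind of independent re-testing Artificial Analysis performed possible in the first place.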

On X that same day, Shumer stated that Reflection 70B’s weights — the settings of the open source model — had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue could have degraded its performance compared to HyperWrite’s “internal API” version.
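A corrupted upload of that kind is, in principle, easy to detect: hash each local weight shard and compare it against the same file downloaded back from Hugging Face. A minimal sketch, with a placeholder repo ID and shard filename standing in for the real ones:

```python
import hashlib
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so multi-GB shards needn't fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical names; substitute the actual repo and shard filenames.
REPO_ID = "example-org/example-model"
SHARD = "model-00001-of-00002.safetensors"

local = sha256_of(Path("local_checkpoint") / SHARD)
remote = sha256_of(Path(hf_hub_download(repo_id=REPO_ID, filename=SHARD)))

print("weights match" if local == remote else "MISMATCH: corrupted upload?")
```

If the hashes differ, the uploaded shard genuinely does not match the local checkpoint; if they match, an upload error cannot explain a performance gap.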

On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”

The organization detailed two key concerns that seriously call HyperWrite and Shumer’s initial performance claims into question:

  • “We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.”
  • “We are not clear why the model weights of the version we tested would not be released yet.”

The organization added: “As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”

All the while, users on various machine learning and AI Reddit communities, or subreddits, have also called Reflection 70B’s stated performance and origins into question. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.

This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.

Others accuse the model of actually being a “wrapper,” or application built atop proprietary/closed-source rival Anthropic’s Claude 3.

However, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.

Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.

For 48 hours, the AI research community waited with bated breath for Shumer’s response and updated model weights on Hugging Face.

The CEO finally broke his silence about the debacle on the evening of Tuesday, September 10 around 6 pm ET — without providing corrected model weights — writing in a post on X:

“I got ahead of myself when I announced this project, and I am sorry. That was not my intention. I made a decision to ship this new approach based on the information that we had at the moment.

I know that many of you are excited about the potential for this and are now skeptical. Nobody is more excited about the potential for this approach than I am. For the moment, we have a team working tirelessly to understand what happened and will determine how to proceed once we get to the bottom of it. Once we have all of the facts, we will continue to be transparent with the community about what happened and next steps.”

Shumer also linked to another X post by Sahil Chaudhary, founder of Glaive AI, the platform Shumer previously claimed was used to generate synthetic data to train Reflection 70B.

Intriguingly, Chaudhary’s post states that the instances of Reflection 70B claiming to be a variant of Anthropic’s Claude also remain a mystery to him. He also admits that “the benchmark scores I shared with Matt haven’t been reproducible so far.” Read his full post below:

I want to address the confusion and valid criticisms that this has caused in the community. I am currently investigating what happened that led to this and will share a transparent summary as soon as possible. There are two areas I’d like to address, which I am investigating:

– First, I want to be clear that at no point was I running any models from other providers as the API that was being served on my compute — I’m working on providing evidence of this and understanding why people saw model behaviour such as using a different tokenizer, or completely skipping words like “Claude”.

– Second, the benchmark scores I shared with Matt haven’t been reproducible so far. I am working to understand why this is and if the original scores I reported were accurate or a result of contamination / misconfiguration. I have a lot of work to do on both of these and am working on a full postmortem that I will share with the community. I’m sorry for the confusion this has caused and know that I’ve let the community down and lost trust. I still believe in the potential of the approach. My focus is on rebuilding trust through increased transparency. I’ll have more to share soon.

For now, the mystery — and skepticism of the open source AI community and this publication — remains.