New open source AI leader Reflection 70B’s performance questioned, accused of ‘fraud’

[Image: AI comic-book-style illustration of an angry crowd pointing accusingly at a dark-haired man holding a small robot. Credit: VentureBeat made with OpenAI ChatGPT]



It took just one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.

Reflection 70B, a variant of Meta’s Llama 3.1 open source large language model (LLM) — or wait, was it a variant of the older Llama 3? — was trained and released by small New York startup HyperWrite (formerly OthersideAI) and boasted impressive, leading results on third-party benchmarks. It is now being aggressively questioned, as other third-party evaluators have failed to reproduce some of those performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024, as “the world’s top open-source model.”

In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview over X direct messages with VentureBeat, Shumer explained that the new LLM used “Reflection Tuning,” a previously documented technique developed by researchers outside the company in which an LLM checks the correctness of — or “reflects” on — its own generated responses before outputting them to the user, improving accuracy on a number of tasks in writing, math, and other domains.
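To make the idea concrete: the sketch below is not HyperWrite’s actual training recipe — Reflection 70B reportedly fine-tunes the reflection step into a single model response — but the check-then-output flow can be approximated with a simple multi-pass prompting loop. Everything here, including the `generate` stub and the prompt wording, is a hypothetical illustration rather than code from the project:

```python
# Minimal sketch of a reflection-style generation loop: draft an answer,
# have the model critique its own draft, then emit a revised final answer.
# `generate` is a hypothetical placeholder for any chat-completion call.

def generate(prompt: str) -> str:
    """Placeholder for a single LLM inference call; wire up a real endpoint."""
    raise NotImplementedError

def reflect_and_answer(question: str) -> str:
    # Pass 1: produce an initial draft answer.
    draft = generate(f"Question: {question}\nAnswer step by step.")

    # Pass 2: the model reviews its own draft for mistakes.
    critique = generate(
        "Review this answer for factual or logical errors and list any you find.\n"
        f"Question: {question}\nAnswer: {draft}"
    )

    # Pass 3: produce a revised answer informed by the self-critique.
    return generate(
        "Rewrite the answer, fixing the problems noted in the critique.\n"
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}"
    )
```

The distinction matters for benchmarking: a fine-tuned model performs the critique-and-revise step inside one generation, rather than through chained API calls as above.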


However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” — referencing the commonly used Massive Multitask Language Understanding (MMLU) benchmark — “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” a major discrepancy with HyperWrite/Shumer’s originally posted results.
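For context on what that disputed number measures: MMLU is a multiple-choice benchmark spanning 57 subjects, and a model’s score is simply its accuracy across the test items. A toy sketch of the scoring loop, where the `ask_model` stub and the single example item are hypothetical placeholders:

```python
# Toy illustration of how an MMLU-style score is computed: each item is a
# four-way multiple-choice question, and the reported number is the fraction
# answered correctly. `ask_model` is a hypothetical stand-in for real inference.

ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1},
    # ...the real benchmark has roughly 14,000 test items across 57 subjects
]

def ask_model(question: str, choices: list[str]) -> int:
    """Placeholder: return the index of the option the model selects."""
    raise NotImplementedError

def mmlu_accuracy(items: list[dict]) -> float:
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)
```

Fixed items and a deterministic scoring rule are what make the kind of independent re-testing Artificial Analysis performed possible in the first place.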

On X that same day, Shumer stated that Reflection 70B’s weights — the settings of the open source model — had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue could have degraded its performance compared to HyperWrite’s “internal API” version.
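A corrupted upload of that kind is, in principle, easy to detect: hash each local weight shard and compare it against the same file downloaded back from Hugging Face. A minimal sketch, with a placeholder repo ID and shard filename standing in for the real ones:

```python
import hashlib
from pathlib import Path

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so multi-GB shards needn't fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical names; substitute the actual repo and shard filenames.
REPO_ID = "example-org/example-model"
SHARD = "model-00001-of-00002.safetensors"

local = sha256_of(Path("local_checkpoint") / SHARD)
remote = sha256_of(Path(hf_hub_download(repo_id=REPO_ID, filename=SHARD)))

print("weights match" if local == remote else "MISMATCH: corrupted upload?")
```

If the hashes differ, the uploaded shard genuinely does not match the local checkpoint; if they match, an upload error cannot explain a performance gap.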

On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”

The organization detailed two key concerns that seriously call HyperWrite and Shumer’s initial performance claims into question:

  • “We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.”
  • “We are not clear why the model weights of the version we tested would not be released yet.”

The organization added: “As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”

All the while, users on various machine learning and AI Reddit communities, or subreddits, have also called Reflection 70B’s stated performance and origins into question. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.

This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.

Others accuse the model of actually being a “wrapper,” or application built atop proprietary/closed-source rival Anthropic’s Claude 3.

However, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.

Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.

For 48 hours, the AI research community waited with bated breath for Shumer’s response and updated model weights on Hugging Face.

The CEO finally broke his silence about the debacle on the evening of Tuesday, September 10 around 6 pm ET — without providing corrected model weights — writing in a post on X:

“I got ahead of myself when I announced this project, and I am sorry. That was not my intention. I made a decision to ship this new approach based on the information that we had at the moment.

I know that many of you are excited about the potential for this and are now skeptical. Nobody is more excited about the potential for this approach than I am. For the moment, we have a team working tirelessly to understand what happened and will determine how to proceed once we get to the bottom of it. Once we have all of the facts, we will continue to be transparent with the community about what happened and next steps.”

Shumer also linked to another X post by Sahil Chaudhary, founder of Glaive AI, the platform Shumer previously claimed was used to generate synthetic data to train Reflection 70B.

Intriguingly, Chaudhary’s post states that the instances of Reflection 70B claiming to be a variant of Anthropic’s Claude also remain a mystery to him. He also admits that “the benchmark scores I shared with Matt haven’t been reproducible so far.” Read his full post below:

I want to address the confusion and valid criticisms that this has caused in the community. I am currently investigating what happened that led to this and will share a transparent summary as soon as possible. There are two areas I’d like to address, which I am investigating:

– First, I want to be clear that at no point was I running any models from other providers as the API that was being served on my compute — I’m working on providing evidence of this and understanding why people saw model behaviour such as using a different tokenizer, or completely skipping words like “Claude”.

– Second, the benchmark scores I shared with Matt haven’t been reproducible so far. I am working to understand why this is and if the original scores I reported were accurate or a result of contamination / misconfiguration. I have a lot of work to do on both of these and am working on a full postmortem that I will share with the community. I’m sorry for the confusion this has caused and know that I’ve let the community down and lost trust. I still believe in the potential of the approach. My focus is on rebuilding trust through increased transparency. I’ll have more to share soon.

For now, the mystery — and skepticism of the open source AI community and this publication — remains.