
Reflection 70B’s performance questioned, accused of ‘fraud’




It took only one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.

Reflection 70B, a variant of Meta’s Llama 3.1 open source large language model (LLM) — or wait, was it a variant of the older Llama 3? — that had been trained and released by small New York startup HyperWrite (formerly OthersideAI) and boasted impressive, leading benchmarks on third-party tests, has now been aggressively questioned as other third-party evaluators have failed to reproduce some of said performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024 as “the world’s top open-source model.”

In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview over X Direct Messages with VentureBeat, Shumer explained more about how the new LLM used “Reflection Tuning,” a previously documented technique developed by other researchers outside the company that sees LLMs check the correctness of, or “reflect” on, their own generated responses before outputting them to users, improving accuracy on a number of tasks in writing, math, and other domains.

However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own evaluation on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” — referencing the commonly used Massive Multitask Language Understanding (MMLU) benchmark — “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” showing a major discrepancy with HyperWrite/Shumer’s initially posted results.

On X that same day, Shumer stated that Reflection 70B’s weights — the settings of the open source model — had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue may have resulted in worse quality performance compared to HyperWrite’s “internal API” version.

On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”

The organization detailed two key questions that seriously call into question HyperWrite and Shumer’s initial performance claims, specifically:

  • “We are not clear on why a version would be published which is not the version we tested via Reflection’s private API.
  • We are not clear why the model weights of the version we tested would not be released yet.

As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”

All the while, users on various machine learning and AI Reddit communities, or subreddits, have also called into question Reflection 70B’s stated performance and origins. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.

This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.

Others accuse the model of actually being a “wrapper,” or tool built atop, proprietary/closed-source rival Anthropic’s Claude 3.

Still, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.

Regardless, the model’s rollout, lofty claims, and now criticism show how quickly the AI hype cycle can come crashing down.

For now, the AI research community waits with bated breath for Shumer’s response and updated model weights on Hugging Face. VentureBeat has also reached out to Shumer for a direct response to these allegations of fraud and will update when we hear back.
