Pharmaceutical giant GSK is pushing the boundaries of what generative AI can achieve in healthcare areas like scientific literature review, genomic analysis and drug discovery. But it faces a persistent problem: hallucinations, in which AI models generate incorrect or fabricated information. Errors in healthcare are not merely inconvenient; they can have life-altering consequences. Here’s how GSK is tackling the problem.
The hallucination problem in generative healthcare AI
Most efforts to reduce hallucinations focus on training time, when a large language model (LLM) is learning from data. GSK instead mitigates hallucinations at inference time, when a model is actually being used in a real application. Strategies here include self-reflection mechanisms, multi-model sampling and iterative output evaluation. According to Kim Branson, SVP of AI and machine learning (ML) at GSK, these techniques help ensure that agents are “robust and reliable,” while enabling scientists to generate actionable insights more quickly. “We’re all about increasing the iteration cycles at GSK — how we think faster,” he said.
Leveraging test-time compute scaling
Improving a generative AI application’s performance at inference time, also referred to as test time, is mostly done by increasing the computational resources a model uses while working out the answer to a problem. This allows more complex operations, such as iterative output refinement or multi-model aggregation, which are critical for reducing hallucinations and improving model performance.
Branson emphasized the transformative role of scaling this phase of test-time compute in GSK’s AI efforts, noting that by using strategies like self-reflection and ensemble modeling, GSK can leverage these additional compute cycles to produce results that are not only quicker, but more accurate and reliable.
In fact, this is a broader industry trend, not only in healthcare but in other verticals too. “You’re seeing this war happening with how much I can serve, my cost per token and time per token,” said Branson. “That allows people to bring these different algorithmic strategies which were before not technically feasible, and that also will drive the kind of deployment and adoption of agents.”
Strategies for reducing hallucinations
To tackle hallucinations in healthcare gen AI apps, GSK employs two main strategies that require additional computational resources during inference.
Self-reflection and iterative output review
One core technique is self-reflection, where LLMs critique or edit their own responses to improve quality. The model “thinks step by step,” analyzing its initial output, pinpointing weaknesses and revising answers as needed. GSK’s literature search tool exemplifies this: It collects data from internal repositories and an LLM’s memory, then re-evaluates its findings through self-criticism to uncover inconsistencies.
This iterative process results in clearer, more detailed final answers. Branson underscored the value of self-criticism, saying: “If you can only afford to do one thing, do that.” Refining its own logic before delivering results allows the system to produce insights that align with healthcare’s strict standards.
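The article does not publish GSK’s implementation, but the draft-critique-revise loop it describes can be sketched in a few lines of Python. Here, `call_llm` is a hypothetical stand-in for any LLM API; its toy behavior (flagging a draft that lacks a citation marker) exists only to make the sketch runnable.

```python
# Sketch of an inference-time self-reflection loop. `call_llm` is a
# hypothetical stand-in for a real LLM API client; its deterministic
# toy behavior flags drafts that lack a "[source]" citation marker.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    if prompt.startswith("CRITIQUE:"):
        return "OK" if "[source]" in prompt else "Claim lacks a citation."
    if prompt.startswith("REVISE:"):
        # Append a citation to the draft portion of the prompt.
        return prompt.split("DRAFT:", 1)[1].strip() + " [source]"
    return "Aspirin inhibits COX enzymes."  # initial draft answer

def self_reflect(question: str, max_rounds: int = 3) -> str:
    """Draft an answer, then repeatedly critique and revise it."""
    draft = call_llm(question)
    for _ in range(max_rounds):
        critique = call_llm(f"CRITIQUE: {question} DRAFT: {draft}")
        if critique == "OK":   # the model finds no weaknesses: stop
            break
        draft = call_llm(f"REVISE: fix '{critique}' DRAFT: {draft}")
    return draft

print(self_reflect("What does aspirin do?"))
```

Each extra round of critique costs another model call, which is exactly the inference-time compute trade-off the article describes.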
Multi-model sampling
GSK’s second strategy relies on multiple LLMs or different configurations of a single model to cross-verify outputs. In practice, the system might run the same query at various temperature settings to generate diverse answers, employ fine-tuned versions of the same model specializing in particular domains or call on entirely separate models trained on distinct datasets.
Comparing and contrasting these outputs helps confirm the most consistent or convergent conclusions. “You can get that effect of having different orthogonal ways to come to the same conclusion,” said Branson. Although this approach requires more computational power, it reduces hallucinations and boosts confidence in the final answer — an essential benefit in high-stakes healthcare environments.
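A minimal sketch of this cross-verification idea, under the assumption that each sampler is a different model or a different temperature setting of the same model (the lambdas below are toy stand-ins, not real models), is a consensus vote over the sampled answers:

```python
# Sketch of multi-model sampling with a consensus vote. The samplers
# are hypothetical stand-ins for different models or for one model
# queried at different temperature settings.
from collections import Counter

def sample_answers(question, samplers):
    """Query each model/configuration and collect the answers."""
    return [sampler(question) for sampler in samplers]

def consensus(answers):
    """Return the most frequent answer and its agreement share."""
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)

# Toy samplers: one model at two temperatures plus a separately
# trained model; two of the three converge on the same answer.
samplers = [
    lambda q: "TP53",   # model A, temperature 0.2
    lambda q: "TP53",   # model A, temperature 0.8
    lambda q: "BRCA1",  # model B, distinct training data
]

answer, agreement = consensus(sample_answers("example genomics query", samplers))
print(answer, agreement)  # consensus answer plus its agreement share
```

A low agreement share is a useful signal in itself: rather than returning a contested answer, a production system could flag it for human review.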
The inference wars
GSK’s strategies depend on infrastructure that can handle significantly heavier computational loads. In what Branson calls “inference wars,” AI infrastructure companies — such as Cerebras, Groq and SambaNova — compete to deliver hardware breakthroughs that enhance token throughput, lower latency and reduce costs per token.
Specialized chips and architectures enable complex inferencing routines, including multi-model sampling and iterative self-reflection, at scale. Cerebras’ technology, for example, processes thousands of tokens per second, allowing advanced techniques to work in real-world scenarios. “You’re seeing the results of these innovations directly impacting how we can deploy generative models effectively in healthcare,” Branson noted.
This week, in a partnership with Mayo Clinic and Microsoft, Cerebras announced a genomic foundation model that predicts the best medical treatments for people with rheumatoid arthritis using the efficiencies found in its custom silicon.
When hardware keeps pace with software demands, solutions emerge to maintain accuracy and efficiency.
Challenges remain
Even with these advancements, scaling compute resources presents obstacles. Longer inference times can slow workflows, especially if clinicians or researchers need prompt results. This is where the advanced silicon comes in. Higher compute usage also drives up costs, requiring careful resource management. Nonetheless, GSK considers these trade-offs necessary for stronger reliability and richer functionality.
“As we enable more tools in the agent ecosystem, the system becomes more useful for people, and you end up with increased compute usage,” Branson noted. Balancing performance, costs and system capabilities allows GSK to maintain a practical yet forward-looking strategy.
What’s next?
GSK plans to keep refining its AI-driven healthcare solutions with test-time compute scaling as a top priority. The combination of self-reflection, multi-model sampling and robust infrastructure helps to ensure that generative models meet the rigorous demands of clinical environments.
This approach also serves as a road map for other organizations, illustrating how to reconcile accuracy, efficiency and scalability. Maintaining a leading edge in compute innovations and sophisticated inference techniques not only addresses current challenges, but also lays the groundwork for breakthroughs in drug discovery, patient care and beyond.
This is part of our Healthcare and Gen AI feature series.