The AI Drug Discovery Audit: What Recursion's Platform Bet Means for Every Pharma R&D Leader Committing Budget to an Algorithm's Hypothesis
Recursion Pharmaceuticals, Insilico Medicine, and Schrödinger operate AI drug discovery platforms under nine-figure agreements with Roche, Bayer, Sanofi, and AstraZeneca. The first AI-designed drug candidate completed a Phase IIa trial in 2024. FDA reviewers are asking about AI methodology in pre-IND meetings, yet no documentation standard exists. Three governance questions remain unresolved for pharma R&D leaders: what validation your organization requires before an AI confidence score earns a Phase II budget commitment; what your IND needs to document about how the candidate was identified; and what happens to your model access and data rights if the platform partner is acquired, as Exscientia was by Recursion in 2024.
Background
The AI drug discovery market has crossed from speculative to operational. Recursion Pharmaceuticals, which acquired Exscientia in 2024 and with it Sanofi's $1.2B agreement and Exscientia's candidate pipeline, operates one of the largest AI drug discovery platforms in production, with commitments from Roche ($150M) and Bayer ($100M+) to access the platform for novel target identification and compound optimization. Insilico Medicine, operating out of Hong Kong and the United States, has taken an AI-designed drug candidate through a Phase IIa trial, a milestone the industry has been working toward since deep learning was first applied to molecular biology in the early 2010s. Schrödinger combines physics-based simulation with machine learning in computational chemistry for clients across pharma and biotech. BenevolentAI, which separately partners with AstraZeneca, used AI analysis of biological pathways to flag Eli Lilly's baricitinib as a COVID-19 treatment candidate in early 2020; clinical trials later confirmed the prediction, and FDA approved baricitinib for hospitalized COVID-19 patients in 2022. The programs are running, the checks have been written, and the candidates are in the clinic.
What these platforms actually do is narrower, and more honest, than the commercial narrative suggests. Recursion's core approach is high-content phenotypic screening at scale: the platform treats cells with compounds, photographs the results at high resolution, and uses convolutional neural networks to identify cellular phenotype patterns associated with biological effects of interest. The system processes millions of experimental readouts per week, far more than any human team could analyze through manual inspection. Insilico Medicine combines generative chemistry (AI systems that design novel molecular structures satisfying target binding requirements) with biological pathway analysis that draws on published literature and proprietary experimental datasets. Schrödinger pairs physics-based free energy perturbation calculations with ML to predict binding affinity accurately enough to reduce the number of synthesis cycles needed to optimize a lead compound. In each case, the machine learning component is pattern recognition against training data: the AI predicts which compounds will bind which targets with which biological effects, based on statistical patterns in the data it was trained on. None of these platforms run the biology. They predict; wet lab biology confirms or refutes.
The training data architecture is where the governance gap begins. AI drug discovery models are trained primarily on public databases — ChEMBL (biological activity data for drug-like molecules), PubChem, the Protein Data Bank, published literature ingested through PubMed — combined with each platform's proprietary experimental datasets. Those public databases carry systematic biases that the AI inherits without warning its users. Publication bias is the most significant: negative results are underrepresented in published literature by a factor of three to four relative to positive results. A model trained on published literature learns that compounds of certain structural classes tend to show activity against certain target classes — because the published literature over-represents the experiments that worked. The compounds that didn't work, the target hypotheses that failed in internal screening without reaching publication, and the assay conditions that produced false positives are systematically absent from the training signal. The AI's confidence score reflects pattern matching against this biased dataset. Its estimate of how likely a given target-compound interaction is to show activity in a wet lab assay is calibrated against a literature record that systematically overstates historical success rates.
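The size of that distortion is easy to estimate under a simplifying assumption. Suppose positive results reach publication k times as often as negative ones (the three-to-four figure above) and the true hit rate for a target class is p. The hit rate visible in the published record is then pk / (pk + 1 - p). The Python sketch below computes it; the function name and the 10% example rate are illustrative assumptions, and real publication behavior is messier than a single ratio.

```python
def observed_hit_rate(true_rate: float, publication_ratio: float) -> float:
    """Hit rate visible in the published record when positive results are
    published `publication_ratio` times as often as negative results.

    Simplified model: each positive is published with probability q and each
    negative with probability q / publication_ratio; q cancels out.
    """
    if not 0.0 <= true_rate <= 1.0:
        raise ValueError("true_rate must be in [0, 1]")
    if publication_ratio <= 0:
        raise ValueError("publication_ratio must be positive")
    published_pos = true_rate * publication_ratio  # relative weight of positives
    published_neg = 1.0 - true_rate                # relative weight of negatives
    return published_pos / (published_pos + published_neg)


if __name__ == "__main__":
    # Hypothetical true hit rate of 10% for a target class.
    for k in (1.0, 3.0, 4.0):
        print(f"k={k:.0f}: apparent hit rate {observed_hit_rate(0.10, k):.1%}")
```

Under these assumptions a true 10% hit rate appears as 25% in the published record at k = 3 and about 31% at k = 4, which is the kind of inflation a confidence score trained on that record can inherit.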
Rare disease targets compound the limitation. Training data availability scales with historical research attention: common disease targets with decades of published literature have rich training signals, while rare disease targets with sparse published data have weak ones. An AI platform generating confidence scores for rare disease targets is working from a thinner training signal than for common oncology or cardiovascular targets, but the confidence score presentation may not visibly reflect that difference. R&D leaders deploying AI drug discovery for rare disease programs, which represent a meaningful portion of current pharma pipeline investment given orphan drug economics, are making Phase II commitments based on confidence scores calibrated against datasets that may have a tenth of the depth available for well-studied targets.
The FDA documentation question is arriving whether or not pharma R&D is ready for it. FDA's May 2023 discussion paper "Using Artificial Intelligence and Machine Learning in the Development of Drug and Biological Products" signaled that AI/ML used during drug development falls within FDA's regulatory interest and that sponsors should be prepared to describe AI methodology in submissions. The paper was more specific about post-approval model changes and manufacturing process AI than about pre-IND candidate generation, but the direction was unambiguous, and FDA reviewers at pre-IND meetings are already asking about it. The questions are consistent across programs: What data was the model trained on? What validation was performed on the model's predictions before the candidate was advanced? How does the sponsor distinguish AI-generated hypotheses from mechanistic biology in the IND rationale? These questions have no standardized answer format because there is not yet a standardized requirement. The first sponsors to submit comprehensive, credible answers will set the informal industry standard for what "sufficient" AI methodology documentation looks like. Sponsors without prepared answers risk information requests and clinical hold queries on timelines they did not plan for.
Decision Required
Your AI drug discovery partner has delivered a candidate list with confidence scores. Your R&D committee is allocating Phase II budget. The decision before you has three governance questions embedded in it that the Phase II allocation process is not designed to surface.
What validation does your organization require between an AI confidence score and a Phase II budget authorization? If the answer is "standard biology" — the same validation protocol applied to traditionally identified candidates — the question is whether that biology was designed to challenge the AI's hypothesis or to confirm it. A validation program designed to confirm a hypothesis generates confirmatory data. A validation program designed to challenge a hypothesis — to identify the conditions under which the AI's prediction is wrong — is structurally different in design, duration, and cost. Most R&D validation protocols were not designed for the challenge-oriented posture that AI-generated hypotheses warrant, because traditionally identified candidates have a mechanistic rationale that pharmacologists can evaluate before the validation biology starts. AI confidence scores do not come with that rationale attached.
What does your IND say about how the candidate was identified? FDA reviewers are asking. Your regulatory team needs a documented answer before the pre-IND meeting, not after the IND submission generates a clinical hold query. The answer needs to describe the training data, the model architecture (at the level of mechanism, not proprietary detail), the validation experiments that tested the model's predictions before the candidate was selected, and the scientific reasoning that links the AI output to the biological hypothesis the clinical program is testing. This documentation does not exist in a standardized format anywhere in the industry. Your regulatory team is building it from scratch unless another team in your organization built it first — and that team's protocol needs to exist before the IND is submitted, not after the question arrives.
What are your rights to the model and your data if the platform partnership changes? Exscientia was acquired by Recursion in 2024. Sanofi's $1.2B agreement — the largest single AI drug discovery contract signed at that point — is now governed under Recursion's terms, Recursion's platform architecture, and Recursion's business priorities. Pharma companies that built their AI drug discovery strategy around Exscientia's specific platform capabilities found those capabilities absorbed into a different company without a formal notice period or renegotiation right. If your AI partnership agreement does not specify what happens to your data access, your candidate documentation, your model export rights, and your ongoing support obligations in the event of a platform acquisition, restructuring, or exit from the market, you are exposed to the same scenario with a different platform partner.
Options
The first option is to keep current practice, which is where most pharma organizations with AI drug discovery partnerships sit today: the AI identifies candidates, standard biology validates them, and candidates meeting existing Phase II criteria advance. The governance gap: existing Phase II entry criteria were designed to evaluate mechanistically understood candidates whose biological rationale was established before validation began. AI-generated candidates may clear those criteria on pharmacological metrics while leaving the underlying biological hypothesis, the AI's specific claim about mechanism, unvalidated. Phase II failure then raises a retrospective question: was the biology wrong, or was the AI's reasoning wrong? Without a protocol that specifically tested the AI's hypothesis, that question cannot be answered. The FDA documentation gap also remains open: if a reviewer asks what validation was performed on the AI's predictions, the only answer is that standard biology was applied without a protocol designed to interrogate the model's reasoning.
The second option adds hypothesis requirements ahead of Phase II. Before any AI-generated candidate enters Phase II evaluation, require the platform partner to provide: the specific biological pathway hypotheses the confidence score represents, the training data categories that generated the confidence estimate, the validation experiments the platform's internal team performed on the prediction, and a documented assessment of the training data coverage for this target class. Internally, require your pharmacologists to produce an independent mechanistic hypothesis before the validation biology begins, one that the AI's prediction can be tested against. This approach adds overhead to the Phase II evaluation timeline; it also produces the FDA documentation your regulatory team will need and the challenge biology that distinguishes a validation program from a confirmation program. It suits organizations with enough pipeline volume that Phase II allocation decisions carry material financial consequences and enough regulatory exposure that pre-IND preparation is worth the investment.
The third option is to build internal AI capability. Merck, Pfizer, Novartis, and AstraZeneca have all invested in internal AI drug discovery teams alongside their external platform partnerships. Building internal capability addresses partner acquisition risk, the training data transparency problem, and the FDA documentation challenge: if your team built the model, you can document it. The investment is substantial: computational biology teams with ML capability, proprietary assay data at scale, infrastructure for model training and deployment, and the management overhead to integrate AI output into the R&D process. For organizations with R&D pipelines large enough to amortize the fixed cost, and with the in-house biological data assets that make proprietary training viable, this is the governance-complete path. For most pharma organizations outside the largest five or six global players, the build-versus-buy economics do not favor internal platform development over platform partnership.
The fourth option is a hybrid: maintain platform partnerships for the screening throughput advantage while building a small internal team of three to five computational biologists with ML literacy whose function is to review AI-generated candidates before they enter the Phase II allocation process. This team reviews the platform's confidence score rationale, designs the challenge validation biology, maintains the FDA documentation protocol, and monitors training data coverage for your pipeline's target areas. The team does not need to replicate the platform's screening capability; it needs to evaluate the platform's outputs with enough depth to distinguish well-supported predictions from extrapolations beyond the training data. This is the governance architecture that balances external platform economics with internal oversight, and it is the structure most likely to produce credible FDA documentation without requiring a full internal capability build.
Recommendation
Develop and document an AI hypothesis validation protocol before the next Phase II allocation meeting that involves an AI-generated candidate. This is not about slowing down the AI drug discovery program. It is about building the governance architecture around a program that is already running without one — before a Phase II failure or a pre-IND clinical hold generates the retrospective question about what the process was.
The protocol has three components. The first is a mechanistic hypothesis requirement: before any AI-generated candidate enters Phase II evaluation, your pharmacology team must produce an independent biological hypothesis that the AI's prediction can be tested against. Not a restatement of the confidence score — an independent mechanistic claim about the target biology, the patient population, and the expected therapeutic mechanism that your pharmacologists are willing to defend regardless of what the AI says. This forces the validation biology to test something beyond the AI's pattern match and produces the scientific rationale your regulatory team needs for the IND. The exercise also identifies candidates where the AI confidence score has no supportable mechanistic hypothesis — which is information your R&D committee should have before, not after, Phase II commitment.
The second component is a training data coverage assessment for each candidate's target class. Before Phase II allocation, require your platform partner to document: what public databases and proprietary datasets the confidence score draws on for this target class, what the publication density is for this target in the training data, and whether there are known gaps in assay coverage for this target (common in rare disease targets and novel biology). This documentation should be part of the platform partnership SLA — the partner already holds this information; requiring them to surface it for candidates advancing to Phase II is a contract term, not a technical impossibility. If your current platform agreement does not include this provision, it should be the first amendment negotiated at the next renewal.
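Once the partner supplies per-target numbers, the coverage assessment can be made mechanical. The sketch below assumes a hypothetical schema (training record count, publication count, known assay gaps) and flags candidates whose depth falls below a chosen fraction of the median well-studied target. The 10% default echoes the one-tenth depth concern for rare disease targets, but it is an illustrative threshold, not an industry standard.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class TargetCoverage:
    """Coverage figures the partner documents per target (hypothetical schema)."""
    target: str
    activity_records: int   # training records behind the confidence score
    publications: int       # publication density for the target
    known_assay_gaps: list[str] = field(default_factory=list)


def flag_thin_coverage(candidates, reference, min_ratio=0.1):
    """Flag candidates whose training records or publications fall below
    `min_ratio` of the median well-studied reference target, or that carry
    documented assay gaps. Returns (target, record_ratio, reasons) tuples."""
    ref_records = median(r.activity_records for r in reference)
    ref_pubs = median(r.publications for r in reference)
    if ref_records <= 0 or ref_pubs <= 0:
        raise ValueError("reference targets must have positive coverage")
    flagged = []
    for c in candidates:
        record_ratio = c.activity_records / ref_records
        pub_ratio = c.publications / ref_pubs
        reasons = []
        if record_ratio < min_ratio:
            reasons.append(f"training records at {record_ratio:.0%} of reference median")
        if pub_ratio < min_ratio:
            reasons.append(f"publications at {pub_ratio:.0%} of reference median")
        if c.known_assay_gaps:
            reasons.append("assay gaps: " + ", ".join(c.known_assay_gaps))
        if reasons:
            flagged.append((c.target, record_ratio, reasons))
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only, not real coverage data.
    reference = [
        TargetCoverage("well-studied-A", 20000, 900),
        TargetCoverage("well-studied-B", 50000, 2000),
        TargetCoverage("well-studied-C", 35000, 1500),
    ]
    candidates = [
        TargetCoverage("rare-disease-X", 2100, 12, ["tissue-relevant cell model"]),
        TargetCoverage("oncology-Y", 30000, 1100),
    ]
    for target, ratio, reasons in flag_thin_coverage(candidates, reference):
        print(target, "; ".join(reasons))
```

Comparing against a median of well-studied targets, rather than an absolute count, keeps the check meaningful across therapeutic areas whose data volumes differ by orders of magnitude.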
The third component is FDA documentation readiness. Assign a regulatory lead to each AI-generated candidate entering Phase II evaluation whose specific responsibility is preparing the AI methodology section for the pre-IND package. The section needs to cover: the platform's mechanism and training data (at the level of mechanism, without requiring proprietary model disclosure), the validation experiments performed on the prediction before candidate selection, the mechanistic hypothesis your team has independently developed, and the challenge biology design that tested that hypothesis. This is not a new regulatory submission requirement — it is the documentation that allows your regulatory team to answer the questions FDA reviewers are already asking at pre-IND meetings, before those questions generate a clinical hold on the IND.
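The three components can be tracked as a simple gate in front of the Phase II allocation meeting. The sketch below is one hypothetical way to encode them: the class, field, and section names are assumptions chosen to mirror the three components, and a real version would live in whatever portfolio system your R&D committee already uses.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical keys mirroring the four elements of the pre-IND AI methodology section.
REQUIRED_METHODS_SECTIONS = (
    "platform_mechanism_and_training_data",
    "pre_selection_validation_experiments",
    "independent_mechanistic_hypothesis",
    "challenge_biology_design",
)


@dataclass
class CandidateDossier:
    candidate_id: str
    mechanistic_hypothesis: str = ""            # written by internal pharmacology
    coverage_assessment_received: bool = False  # partner's training data coverage documentation
    regulatory_lead: Optional[str] = None       # owner of the AI methodology section
    methods_sections_drafted: set[str] = field(default_factory=set)


def phase2_gate_gaps(d: CandidateDossier) -> list[str]:
    """Return unmet protocol requirements; an empty list means the candidate
    can go to the Phase II allocation meeting."""
    gaps = []
    if not d.mechanistic_hypothesis.strip():
        gaps.append("no independent mechanistic hypothesis")
    if not d.coverage_assessment_received:
        gaps.append("no training data coverage assessment from partner")
    if not d.regulatory_lead:
        gaps.append("no regulatory lead assigned")
    missing = [s for s in REQUIRED_METHODS_SECTIONS if s not in d.methods_sections_drafted]
    if missing:
        gaps.append("AI methodology section incomplete: " + ", ".join(missing))
    return gaps


if __name__ == "__main__":
    dossier = CandidateDossier(
        candidate_id="AI-CAND-017",  # hypothetical identifier
        mechanistic_hypothesis="Inhibiting target T reduces fibrotic signaling in lung tissue",
        coverage_assessment_received=True,
        methods_sections_drafted={"platform_mechanism_and_training_data"},
    )
    for gap in phase2_gate_gaps(dossier):
        print("BLOCKING:", gap)
```

The gate only checks that each artifact exists; judging whether the mechanistic hypothesis or the challenge biology is any good remains a scientific review, not a checklist.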
On partner risk: before the next renewal, review your platform partnership agreements and identify whether each one specifies data access rights, model documentation entitlements, and support continuity obligations in the event of platform acquisition, restructuring, or wind-down. Exscientia's partners had no advance warning of its acquisition by Recursion. The pharma companies with the most favorable post-acquisition positions were those whose agreements included explicit data portability clauses and ongoing support obligations that survived a change of control. If your current agreements do not contain these provisions, the renewal negotiation is the time to add them, not after your platform partner has been acquired and the counterparty across the table is a different company with different priorities.
Risks
The public databases that anchor most AI drug discovery training sets, such as ChEMBL, PubMed, and PubChem, over-represent positive results because of publication bias: journals publish successful experiments at three to four times the rate of negative results. A model trained on this data learns that compounds of successful structural classes work against well-studied targets, because the training data systematically omits the experiments that failed and never reached publication. The confidence score for a given compound-target pair reflects pattern matching against this biased record. For target classes with decades of published research, the bias may average out across the training signal. For novel targets, rare disease biology, or molecular mechanisms that have not been heavily published, the AI's confidence estimate is calibrated against a thinner and more biased training signal than the interface communicates. R&D leaders who evaluate AI confidence scores without understanding the training data coverage for their specific target area are trusting a number that does not mean what it appears to mean.
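One way to learn what a partner's confidence score means for your own programs is to check its calibration against your own wet-lab results. The sketch below is a standard binned calibration check (a reliability table plus expected calibration error), not any platform's documented method. It assumes you hold pairs of (partner confidence score rescaled to [0, 1], whether your assay confirmed activity); the bin count is arbitrary, and with only a few dozen compounds per bin the observed rates will be noisy.

```python
def calibration_table(scores, outcomes, n_bins=5):
    """Bin confidence scores in [0, 1] and compare mean predicted confidence
    with the observed hit rate in each bin. Also returns the expected
    calibration error: the count-weighted mean absolute gap across bins."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for s, hit in zip(scores, outcomes):
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the top bin
        bins[idx].append((s, 1.0 if hit else 0.0))
    rows, ece = [], 0.0
    for i, b in enumerate(bins):
        if not b:
            continue  # skip empty bins
        mean_conf = sum(s for s, _ in b) / len(b)
        hit_rate = sum(h for _, h in b) / len(b)
        ece += len(b) / len(scores) * abs(mean_conf - hit_rate)
        rows.append((i / n_bins, (i + 1) / n_bins, len(b), mean_conf, hit_rate))
    return rows, ece


if __name__ == "__main__":
    # Toy data: partner scores and whether your own assay confirmed activity.
    scores = [0.9, 0.85, 0.95, 0.8, 0.3, 0.2]
    confirmed = [True, False, False, False, False, False]
    rows, ece = calibration_table(scores, confirmed)
    for lo, hi, n, conf, rate in rows:
        print(f"{lo:.1f}-{hi:.1f}: n={n} mean confidence {conf:.2f}, observed hit rate {rate:.2f}")
    print(f"expected calibration error: {ece:.3f}")
```

A large gap in the high-confidence bins, as in this toy data, is the pattern to raise with the partner when reviewing training data coverage for that target class.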
FDA's May 2023 discussion paper addressed AI/ML in drug development primarily through the lens of post-approval model changes and manufacturing process AI. Pre-IND documentation requirements for AI-generated drug candidate identification are not yet established in formal guidance. The absence of formal requirements does not mean reviewers aren't asking: they are, consistently, at pre-IND meetings across multiple therapeutic areas. Every sponsor submitting an IND for an AI-designed candidate is currently developing its documentation framework without a validated template or regulatory precedent to reference. If the documentation is inadequate or absent, the likely outcome is an information request or clinical hold at the IND stage, adding months to the program timeline at a point where delay is expensive. Organizations that build their AI documentation framework now, before it is a regulatory requirement, should have an advantage in IND review timelines when formal guidance arrives.
The AI drug discovery platform market is early-stage: Recursion acquired Exscientia, BenevolentAI has gone through significant restructuring, Insilico Medicine is private and geopolitically complex given its China operations. Pharma organizations that have built their AI drug discovery strategy around a specific platform's capabilities, data access, and model architecture are exposed to the scenario where that platform is acquired, restructured, or exits the market under different terms. The contractual protections that matter at this event: data portability clauses that specify format and timeline for receiving your experimental data and candidate documentation at contract termination; model documentation requirements that give your computational team enough detail to reconstruct the reasoning behind key predictions; and change-of-control provisions that trigger renegotiation rights or termination options if the platform partner is acquired. Standard software licensing terms do not cover any of these; AI drug discovery partnership agreements need bespoke language.
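Legal review of those three protections can be recorded in a structured way so that gaps surface before renewal rather than after an acquisition. The sketch below uses a hypothetical summary of agreement terms (fields filled in by counsel, with None meaning the agreement is silent); it checks only whether each protection is present and says nothing about whether the negotiated terms are adequate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartnershipTerms:
    """Hypothetical summary of an AI partnership agreement; None or False
    means the agreement does not specify the term."""
    partner: str
    export_format: Optional[str] = None          # format for returning data and candidate documentation
    export_deadline_days: Optional[int] = None   # delivery timeline at termination
    model_documentation: bool = False            # detail sufficient to reconstruct key predictions
    change_of_control_renegotiation: bool = False
    change_of_control_termination: bool = False


def missing_protections(t: PartnershipTerms) -> list[str]:
    """List which of the three partner-risk protections the agreement lacks."""
    gaps = []
    if t.export_format is None or t.export_deadline_days is None:
        gaps.append("data portability: format and timeline not both specified")
    if not t.model_documentation:
        gaps.append("model documentation entitlement missing")
    if not (t.change_of_control_renegotiation or t.change_of_control_termination):
        gaps.append("no change-of-control renegotiation or termination right")
    return gaps


if __name__ == "__main__":
    terms = PartnershipTerms(partner="ExamplePlatform",  # hypothetical partner
                             export_format="SDF and CSV", model_documentation=True)
    for gap in missing_protections(terms):
        print(f"{terms.partner}: {gap}")
```

Treating portability as satisfied only when both format and timeline are written down reflects the point above that a vague data-return promise is not a usable clause.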
The validation biology run on AI-generated candidates is often designed by teams whose professional incentive is to advance a promising candidate, not to falsify the AI's hypothesis. The biology confirms that the compound has the expected activity in the expected assay system under the expected conditions, which is what the AI predicted. It does not specifically test the conditions under which the AI's prediction would be wrong: different patient biology, different disease contexts, compound activity in the specific tissue compartment where the therapeutic mechanism must operate. This is not a failure of individual scientists; it is a structural design problem in how AI-generated hypotheses are validated. The confirmation bias is systematic. Phase II failures in AI-generated programs often stem not from biology that was wrong in the assay system, but from an assay system that never challenged the AI's assumptions about which biology matters in patients.
AI drug discovery platforms are optimized to generate high-confidence predictions for compound-target interactions. They are not optimized to predict clinical outcomes in heterogeneous patient populations with real-world comorbidities, polypharmacy, and treatment histories that differ from clinical trial design assumptions. Phase II failure in AI-discovered programs frequently occurs not because the AI was wrong about the target-compound biology, but because the clinical trial design assumptions about patient population, biomarker selection, or dosing regimen did not match the real-world conditions in which the drug must work. The AI predicted target engagement correctly; the trial failed because target engagement alone was insufficient for clinical efficacy in the enrolled population. This is the same failure mode as traditional drug development — but the AI confidence score tends to generate higher internal conviction about the biological hypothesis, which makes it more likely that disconfirming clinical signals are rationalized rather than recognized as early stopping signals.
Questions Your Team Should Be Answering
These are the questions that distinguish organizations that get this right from those that do not. If your team cannot answer them, that is your first deliverable.
1. Does your organization have a documented protocol specifying what validation is required between an AI confidence score and a Phase II budget authorization, and does that protocol require an independent mechanistic hypothesis from your pharmacology team before the validation biology begins?
2. What does your current IND template include about AI methodology when a candidate was identified through an AI drug discovery platform, and has your regulatory team prepared that documentation before a pre-IND meeting rather than after an IND clinical hold query?
3. Do your AI drug discovery partnership agreements include data portability clauses, model documentation entitlements, and change-of-control provisions, and when were those agreements last reviewed against the risk that the platform partner is acquired or restructured?
4. What is the training data coverage depth for the specific target classes in your current AI-assisted pipeline, and has your platform partner documented known gaps in assay coverage or publication density that would affect confidence score calibration for your therapeutic areas?
5. How does your validation biology protocol for AI-generated candidates differ from your validation protocol for traditionally identified candidates? Specifically, does it include challenge experiments designed to falsify the AI's specific hypothesis, or does it apply the same confirmation-oriented assay battery?
6. If your AI drug discovery platform partner were acquired tomorrow, what would your organization receive in terms of data export, model documentation, candidate pipeline documentation, and ongoing support, and is any of that specified in writing in your current partnership agreement?