LLMs and Models Specialized for Science and Research

[Figure: node diagram of AI science models by domain (biology, chemistry, physics, materials science). Caption: Science AI is fragmenting by domain; each field now has specialized systems built on domain-specific training data.]

Why general LLMs aren't enough for science

General-purpose language models are trained on text. Most of what makes science hard isn't in text — it's in experimental data, molecular structures, genomic sequences, simulation outputs, and physical laws that are expressed as equations rather than sentences. A model that reasons fluently about published papers is not the same as a model trained on PDB structures or crystallographic databases. The specialized science AI systems emerging in 2026 combine language model reasoning with domain-specific training data and, in many cases, with formal verification or physical simulation as a grounding mechanism. The results are qualitatively different from asking a general LLM to summarize a paper.

Biology and drug discovery

FutureHouse's Robin is one of the most striking examples of autonomous scientific AI in operation. Robin is a multi-agent system built for experimental biology: it autonomously generates hypotheses, designs experiments, analyzes data, and produces research outputs. In a documented case, Robin identified ripasudil — an existing glaucoma drug — as a promising therapy for dry age-related macular degeneration. The entire process from concept to publishable output took 2.5 months. FutureHouse estimates the cognitive labor for that discovery cycle, if done by humans, would have required 872–937 person-hours; Robin completed the equivalent work in under two hours of researcher time.

Google DeepMind's Co-Scientist is a general-purpose multi-agent research system built on Gemini 2.0. It assists with hypothesis generation, experimental design, and data analysis in many scientific domains. Where Robin specializes in biology and drug discovery, Co-Scientist is designed as a broadly applicable research collaborator. Both represent a shift from AI as a literature search tool toward AI as an active participant in the scientific process.

Structural biology: AlphaFold 3

AlphaFold 3, released by Google DeepMind and Isomorphic Labs, extended the protein-structure prediction breakthrough of its predecessors to nearly all of life's molecules: proteins, DNA, RNA, ligands, ions, and modified residues — and critically, the interactions between them. About 40% of new structures deposited into the Protein Data Bank between 2024 and 2025 were obtained using AlphaFold. Antibody-antigen modeling has seen particular impact: AlphaFold 3 captures the geometry of immune recognition with enough accuracy to meaningfully accelerate vaccine and therapeutic antibody development.

AlphaFold 3's main weakness is RNA. RNA's conformational flexibility and context-dependent folding remain harder to predict than protein structure, and evaluations through 2026 show mixed performance on RNA-specific tasks. For proteins and protein-ligand interactions, AlphaFold 3 is the clear standard; for RNA structure prediction, specialized systems remain more reliable.

Materials science: 2.2 million new structures

Google DeepMind's GNoME system has autonomously discovered 2.2 million new crystalline structures and identified 380,000 stable materials likely to be synthesizable. Of those, 736 have already been created and confirmed in laboratory experiments by independent researchers, in a domain where human discovery had been measured in dozens per year.
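To put the GNoME figures in proportion, the short calculation below treats them as a funnel from proposed structures to lab-confirmed materials. It uses only the three counts quoted above; the variable names are illustrative, and the shares are simple ratios rather than anything reported by the GNoME team.

```python
# Counts quoted in the text above.
candidates = 2_200_000   # new crystalline structures proposed
stable = 380_000         # predicted stable and likely synthesizable
lab_confirmed = 736      # independently created and confirmed in laboratories

# Share surviving each stage of the funnel.
stable_share = stable / candidates
confirmed_share = lab_confirmed / stable

print(f"stable share of all candidates:    {stable_share:.1%}")
print(f"lab-confirmed share of stable set: {confirmed_share:.2%}")
```

Roughly one candidate in six is predicted stable, and only about 0.2% of the stable set has been made in a lab so far. The experimental side, not the prediction side, is where the backlog sits.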

Chemistry and simulation

A class of systems now autonomously generates scientific simulation code, runs computational chemistry experiments, optimizes models, and verifies results. ERA and kUPS are among the systems in this category, operating across fluid dynamics, chemistry, and biological simulation. These aren't language models writing chemistry papers — they're AI systems running actual computational experiments, checking results against physical constraints, and iterating. The interface between AI and simulation is where some of the most practically impactful science AI work is happening in 2026, even if it's less visible than the headline protein-folding breakthroughs.
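The run, check, iterate pattern can be shown with a toy example. The sketch below is not how ERA or kUPS work internally, since the text does not describe their internals. It is a minimal illustration, assuming a unit harmonic oscillator, an explicit Euler integrator, and energy conservation as the physical constraint. The function names and the 1% tolerance are hypothetical choices.

```python
def simulate_energy_drift(dt, total_time=10.0):
    """Integrate x'' = -x with explicit Euler and return the relative energy drift."""
    x, v = 1.0, 0.0
    e0 = 0.5 * (x * x + v * v)
    for _ in range(round(total_time / dt)):
        # Explicit Euler step; this scheme gains energy on every step.
        x, v = x + dt * v, v - dt * x
    e1 = 0.5 * (x * x + v * v)
    return abs(e1 - e0) / e0


def run_until_constraint_holds(dt=0.1, tolerance=0.01, max_rounds=12):
    """Rerun the simulation with a smaller step until energy is conserved within tolerance."""
    history = []
    for _ in range(max_rounds):
        drift = simulate_energy_drift(dt)
        history.append((dt, drift))
        if drift <= tolerance:      # physical constraint satisfied: accept the run
            return dt, drift, history
        dt /= 2                     # constraint violated: refine and try again
    raise RuntimeError("no step size satisfied the energy constraint")


dt, drift, history = run_until_constraint_holds()
for step, err in history:
    print(f"dt={step:.6f}  relative energy drift={err:.4f}")
```

The first run violates energy conservation badly, and the loop rejects it without any human looking at the output. A result is kept only once it passes a check the system cannot talk its way around.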

Astronomy: the first foundation model for the sky

AION-1 is the first astronomy-specific foundation model, trained on over 200 million celestial objects drawn from five major sky surveys. It follows the standard foundation model recipe: pretraining on massive domain-specific data, then fine-tuning for downstream tasks. Applications include anomaly detection in survey data, classification of transient events, and rapid hypothesis generation about newly observed phenomena. Astronomy generates data faster than humans can analyze it; a foundation model trained on the full sky survey corpus can identify patterns that would take years to surface through human review.

The state of AI as a research partner

Natural sciences saw approximately 80,150 AI-assisted publications in 2025, up 26% from 2024. AI now accounts for 5.8–8.8% of scientific research output depending on the field. The pattern across biology, chemistry, materials science, and astronomy is consistent: AI systems are most transformative not when they replace human judgment about what to study, but when they compress the time between hypothesis and experimental result. The bottleneck in science is rarely a shortage of ideas — it's the cost and time of testing them. Systems like Robin, Co-Scientist, and GNoME are attacking that bottleneck directly.
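The growth figure implies a 2024 baseline that the text does not state. The calculation below backs it out from the 2025 count and the 26% increase. Both inputs are approximate, so the result is an estimate rather than a reported number.

```python
count_2025 = 80_150   # AI-assisted natural science publications in 2025 (from the text)
growth = 0.26         # year-over-year increase (from the text)

# 2025 = 2024 * (1 + growth), so divide to recover the implied 2024 baseline.
count_2024 = count_2025 / (1 + growth)
added = count_2025 - count_2024

print(f"implied 2024 count:      {count_2024:,.0f}")
print(f"additional papers, 2025: {added:,.0f}")
```

That works out to roughly 63,600 publications in 2024 and about 16,500 more in 2025.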

Limitations worth naming

Hallucination remains a risk in science AI applications, and the consequences of a confident wrong answer are higher than in most language model contexts. The standard mitigation is grounding: pairing language model reasoning with verifiable physical simulation, formal proof checking, or experimental validation. Systems that operate without this grounding — generating chemistry insights purely from training data without a simulation backend — have produced errors that look authoritative and are hard to detect without domain expertise. The systems doing the most credible work in 2026 are the ones that don't trust their own outputs without checking them.
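A minimal sketch of grounding is a gate that accepts a model's proposal only if an independent, deterministic check agrees. Here the check is atom balance on a chemical reaction. The list of proposals stands in for model output and is hypothetical, as are the function names. The parser handles plain formulas only, with no parentheses, charges, or hydrates.

```python
import re
from collections import Counter

# One term: optional integer coefficient, then a formula made of element symbols and counts.
TERM = re.compile(r"\s*(\d*)\s*((?:[A-Z][a-z]?\d*)+)\s*")


def atom_counts(side):
    """Count atoms on one side of a reaction, e.g. '2 H2 + O2'."""
    totals = Counter()
    for term in side.split("+"):
        match = TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = int(match.group(1) or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            totals[element] += coefficient * int(count or 1)
    return totals


def is_balanced(reaction):
    """Independent check: every element must appear equally on both sides."""
    left, right = reaction.split("->")
    return atom_counts(left) == atom_counts(right)


def accept_if_verified(proposals):
    """Keep only the proposals that pass the independent check."""
    return [p for p in proposals if is_balanced(p)]


# Hypothetical model output: fluent-looking reactions, one of them wrong.
proposals = [
    "2 H2 + O2 -> 2 H2O",
    "H2 + O2 -> H2O",
    "CH4 + 2 O2 -> CO2 + 2 H2O",
]
accepted = accept_if_verified(proposals)
print(accepted)
```

The unbalanced reaction reads as plausibly as the other two, which is why the filter has to sit outside the model. Real systems replace the atom-balance check with simulation, proof checking, or experiment, but the shape of the gate is the same.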