Research

NAVER Cloud Digital Healthcare Lab

Drug–target interaction reasoning with LLMs

AI Research Intern, June–August 2025. A biomedical foundation-model landscape survey and an LLM fine-tuning project with a partner pharmaceutical company.

Gemma-3-27b-it · LoRA · TRL + PEFT · BindingDB · PLIP

Foundation Model Landscape

The first three weeks were a survey of biomedical foundation models across four modality groups: text (biomedical LLMs), vision (radiology and pathology), science (genomics, transcriptomics, proteomics, cheminformatics, systems biology), and multimodal. Each entry was characterized along architecture, pretraining objective, training data, representative benchmarks, and licensing. The survey framed the landscape around the Generalist Medical AI (GMAI) vision and its four application directions: personalized treatment, digital twin, synthetic biology, and drug discovery.

The landscape-level read: current biomedical foundation models are strong on single-modality pattern recognition, information extraction, and summarization, and weak on cross-modal integration, long-context reasoning, clinical uncertainty, and generalization across sites. Most of these weaknesses trace to limits in training-data availability and benchmark design.

Drug–Target Interaction

The partner team wanted a model that, given a small-molecule ligand and a target protein, could describe the non-covalent contacts driving binding — the structural rationale a medicinal chemist would write next to an assay readout. The original framing asked for free-text “binding strengths and weaknesses”; that was dropped because the output was under-specified, labeled data was scarce, and evaluation was not objective.

An intermediate framing asked the model to produce a full chain-of-thought: identify functional groups, identify binding-site residues, list interactions, and predict IC50. Medicinal-chemistry review of those outputs reported that interaction types and functional-group assignments were mostly correct, while IC50 values were systematically overestimated. Scope was cut down to the structured interaction list only; IC50 regression stayed off the table.

The final task: given SMILES, an amino-acid sequence, and a potential binding-site residue list, produce an ordered list of tuples (interaction_type, ligand_functional_group, protein_residue). Interaction types are drawn from a fixed set of seven (hydrogen bond with ligand as donor or as acceptor, salt bridge, hydrophobic, pi–pi stacking, pi–cation, halogen bond). Functional groups come from a 21-element vocabulary that took work to settle on: RDKit emits a much larger, overlapping set of descriptors, an initial attempt at a hierarchical sub-ontology from ChEBI was too granular to train on, and the final 21-class mapping was worked out with a chemistry PhD on the partner team. Ground-truth interactions come from PLIP run over protein–ligand complexes in BindingDB.
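A minimal sketch of what validating that output format might look like. The seven interaction types below are the ones named in the text (under assumed identifier spellings); the residue pattern and the sample functional groups are illustrative — the real 21-class vocabulary is not reproduced here.

```python
# Sketch: well-formedness check for predicted interaction tuples.
import re

# The seven interaction types from the task definition (names assumed).
INTERACTION_TYPES = {
    "hbond_ligand_donor", "hbond_ligand_acceptor", "salt_bridge",
    "hydrophobic", "pi_pi_stacking", "pi_cation", "halogen_bond",
}

# Residues serialized as three-letter code + sequence position, e.g. SER46.
RESIDUE_RE = re.compile(
    r"^(ALA|ARG|ASN|ASP|CYS|GLN|GLU|GLY|HIS|ILE|LEU|LYS|MET|PHE|PRO|SER|THR|TRP|TYR|VAL)\d+$"
)

def validate_tuples(tuples, functional_groups):
    """Keep only (interaction_type, functional_group, residue) tuples that
    are well-formed against the fixed vocabularies."""
    return [
        t for t in tuples
        if t[0] in INTERACTION_TYPES
        and t[1] in functional_groups
        and RESIDUE_RE.match(t[2])
    ]
```

Because every slot is drawn from a closed vocabulary, a malformed tuple can be rejected before scoring rather than silently counted as a miss.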

The task formulation: three text-serialized inputs (SMILES, protein amino-acid sequence, potential binding-site residues) pass through Gemma-3-27b-it with a LoRA adapter, producing an ordered list of interaction tuples. Every element of every tuple is a token the model has to produce exactly for the interaction to count against the ground truth.

Probing the base model

Before fine-tuning, several lightweight probes mapped what off-the-shelf LLMs could already do on the task’s sub-skills. The SMILES side was the more encouraging half. A/B probes — given two molecules, pick the one containing a specified functional group — and yes/no existence probes both showed TxGemma-27B and Gemma-3-27b-it reliably identifying most of the 21 functional groups, with weaker performance on the more overlapping categories (hydrophobic, hydrogen-bond acceptor). The models recognize local SMILES substructures, though A/B order sensitivity on some groups suggests the recognition is closer to pattern matching than to structural parsing.
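The order-sensitivity check can be sketched as follows. The prompt wording is hypothetical (the actual probe prompts aren't reproduced here); the point is the control: each (A, B) pair is also asked as (B, A), and a model that genuinely parses the structures should flip its answer with the order.

```python
# Illustrative A/B functional-group probe with an order-swap control.

def ab_prompt(smiles_a, smiles_b, group):
    # Hypothetical prompt wording for the two-option probe.
    return (
        f"Which molecule contains a {group} group?\n"
        f"A: {smiles_a}\nB: {smiles_b}\n"
        "Answer with A or B only."
    )

def order_consistent(answer_ab, answer_ba):
    """True when the answer to the swapped prompt names the same molecule."""
    flip = {"A": "B", "B": "A"}
    return answer_ba == flip[answer_ab]
```

A model that answers "A" both before and after the swap is keying on position, not structure — the signature that pushed the base-model choice away from TxGemma.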

IC50 prediction exposed the harder half. With TxGemma’s official regression prompt, the model’s outputs on held-out protein–ligand pairs clustered around a small number of integer values across the full pIC50 range. Reformulating the same task as five-bucket classification, both with raw and with normalized boundaries, produced accuracy only marginally above random, with predictions still collapsing onto a narrow band of classes. The model is reproducing its training distribution rather than resolving the specific ligand–target pair.
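For concreteness, the five-bucket reformulation can be sketched like this: IC50 in nM maps to pIC50 = 9 − log10(IC50), which is then binned. The cut points below are illustrative, not the boundaries used in the probe.

```python
# Sketch: IC50 regression recast as five-bucket pIC50 classification.
import bisect
import math

BOUNDARIES = [5.0, 6.0, 7.0, 8.0]  # illustrative pIC50 cut points -> 5 classes

def pic50(ic50_nm):
    """pIC50 = -log10(IC50 in M) = 9 - log10(IC50 in nM)."""
    return 9.0 - math.log10(ic50_nm)

def bucket(ic50_nm):
    """Class index 0..4, from weak to potent binding."""
    return bisect.bisect_right(BOUNDARIES, pic50(ic50_nm))
```

A model whose predictions collapse onto one or two of these five classes regardless of the input pair is reproducing a marginal distribution, which is exactly what the probe observed.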

Protein-sequence comprehension was the weakest link. In chat-style probes, both models struggled to associate residue identities with their chemical properties when asked to reason over amino-acid sequences. The structured interaction task assumes the model can resolve a residue position, say SER46, back to a hydrogen-bond donor on a hydroxyl side chain. That assumption is not free.
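The lookup the task implicitly demands is small but not free for a text-only model. A minimal table, with partial and illustrative coverage, makes the step explicit:

```python
# Sketch: resolving a residue token like "SER46" to its side-chain
# chemistry. Coverage here is partial and illustrative.

SIDE_CHAIN = {
    "SER": ("hydroxyl", {"hbond_donor", "hbond_acceptor"}),
    "THR": ("hydroxyl", {"hbond_donor", "hbond_acceptor"}),
    "ASP": ("carboxylate", {"hbond_acceptor", "salt_bridge_anion"}),
    "GLU": ("carboxylate", {"hbond_acceptor", "salt_bridge_anion"}),
    "LYS": ("primary_amine", {"hbond_donor", "salt_bridge_cation"}),
    "PHE": ("phenyl", {"pi_stacking", "hydrophobic"}),
}

def residue_chemistry(residue):
    """'SER46' -> ('SER', 46, 'hydroxyl', {interaction roles})."""
    name, pos = residue[:3], int(residue[3:])
    group, roles = SIDE_CHAIN[name]
    return name, pos, group, roles
```

For a program this is a dictionary lookup; for an LLM reasoning over a raw amino-acid sequence, the probes showed it is a failure mode.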

Model selection and prompting

Two decisions followed from the probes: which base model to fine-tune, and what prompt format to fine-tune against.

TxGemma-27b-chat was the first candidate because it was specifically instruction-tuned on therapeutic-domain data. Two observations pushed toward Gemma-3-27b-it instead. On the A/B functional-group probes, TxGemma’s accuracy shifted noticeably when the two options were swapped — that order sensitivity suggested the model was not robustly parsing the SMILES. And on a direct 1000-pair comparison at the target task, asking both models to generate the interaction list with max_new_tokens=512, Gemma-3 produced more consistent tuple structures and better-formed JSON. Since neither model was clearly stronger on the underlying chemistry and Gemma-3 was the more stable generator, it became the fine-tuning base.

The CoT prompt that produced the IC50 scope cut above was the first of three iterations. The second added the potential binding-site residue list to the input and made the reasoning steps explicit: compound analysis → binding-site analysis → interaction list. Format stabilized, but the interaction step fell back on heuristic rules like “hydroxyl → SER/THR/ASN/GLN” — pattern matching on residue identities rather than structural reasoning about the pocket. The third iteration pinned the output to a 2-shot JSON schema specifying the seven interaction types and 21 functional-group classes; that schema became the fine-tuning target.

Fine-tuning

With Gemma-3-27b-it as the base, LoRA (TRL + PEFT) was applied to the attention (q/k/v/o) and MLP (gate/up/down) projections of every transformer block, with rank swept across 16, 64, and 96. Training used SFT with assistant-only loss, bf16, batch size 1 with gradient accumulation 16, a cosine schedule at learning rate 5e-5, two epochs, and a 9:1 train/eval split over PLIP-derived interaction tuples.

config.yml — LoRA targets and staged fine-tuning setup

lora:
  r: 96
  alpha: 96
  dropout: 0.05
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
 
stages:
  stage_1:
    data_path: stage1_functional_groups.jsonl
  stage_2:
    data_path: stage2_binding_sites.jsonl
    init_adapter_from: checkpoints/stage_1/final_model
  stage_3:
    data_path: stage3_merged.jsonl
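One way to read the `init_adapter_from` field: each stage either warm-starts its LoRA adapter from an earlier checkpoint or begins fresh on the base model. A small sketch of that resolution logic, with only the config keys mirroring `config.yml` (the loader itself is hypothetical):

```python
# Sketch: resolving per-stage adapter initialization from the staged config.

STAGES = {
    "stage_1": {"data_path": "stage1_functional_groups.jsonl"},
    "stage_2": {"data_path": "stage2_binding_sites.jsonl",
                "init_adapter_from": "checkpoints/stage_1/final_model"},
    "stage_3": {"data_path": "stage3_merged.jsonl"},
}

def adapter_init(stage):
    """Checkpoint a stage warm-starts from, or None for a fresh adapter
    on the base model."""
    return STAGES[stage].get("init_adapter_from")
```

Note that `stage_3` in the config shown carries no `init_adapter_from` — matching the final decision to launch the joint task directly from the base model rather than from the curriculum chain.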

The training script was designed to run a three-stage curriculum: Stage 1 teaches SMILES → functional groups, Stage 2 teaches protein → binding-site residues, and Stage 3 folds both into the joint interaction-list task, with each stage initializing its LoRA adapter from the previous stage's checkpoint. The intent was to decompose the structural-reasoning load across sub-skills the probes had already shown the base model could partly do. In practice, evaluating a Stage-3 run initialized from the Stage-1→2 adapter chain against the same Stage-3 run launched from a freshly initialized adapter on the base model showed the curriculum did not help; the base-initialized run was slightly stronger on held-out interaction tuples. Final training ran directly on the Stage-3 data.

Evaluation graded predictions against PLIP ground truth with four F1 scores: exact-match on the full tuple, plus component-level scores treating predicted interaction types, functional groups, and residues independently. Exact match is the strict metric; the component scores show which parts of the tuple the model is learning — often the interaction type and the functional group land correctly and the residue does not.
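The four scores can be sketched as micro-F1 over multisets — one pass on full tuples for exact match, then one per slot. Multiset counting is an assumption here; the actual matching rules may differ in detail.

```python
# Sketch: exact-match F1 on full tuples plus three component-level F1s.
from collections import Counter

def f1(pred, gold):
    """Micro F1 between two multisets of hashable items."""
    p, g = Counter(pred), Counter(gold)
    tp = sum((p & g).values())  # per-item overlap count
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(p.values()), tp / sum(g.values())
    return 2 * precision * recall / (precision + recall)

def score(pred_tuples, gold_tuples):
    return {
        "exact": f1(pred_tuples, gold_tuples),
        "interaction_type": f1([t[0] for t in pred_tuples],
                               [t[0] for t in gold_tuples]),
        "functional_group": f1([t[1] for t in pred_tuples],
                               [t[1] for t in gold_tuples]),
        "residue": f1([t[2] for t in pred_tuples],
                      [t[2] for t in gold_tuples]),
    }
```

The spread between the component scores and the exact score is the diagnostic: a prediction that gets the interaction type and functional group right but the residue wrong scores 1.0 on two components and 0.0 on exact match.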

Takeaway

Framing a structural-chemistry problem as next-token prediction on text-serialized inputs gets the output format and local chemistry mostly right. The fine-tuned model produces the schema correctly, uses the interaction and functional-group vocabularies, and recovers hydrophobic contacts and hydrogen bonds when the binding pocket is small and the interactions are local. Within this setup, it does not learn to reason about the geometry that selects one residue over another: tuples that require ranking across many candidate residues (pi-stacking onto an aromatic rim, salt bridges on acidic side chains) are the brittle ones, and the output frequently reproduces the same interaction across two ligands that happen to share a functional group.

The end-of-internship read was that the task likely needs dedicated encoders for the protein and the ligand — sequence or 3D-structure encoders whose embeddings already carry chemistry — aligned into an LLM’s text space, rather than asking the LLM to learn chemistry from characters. GPU and time budgets meant that direction was not prototyped during the internship.

Reference

“Junho conducted comprehensive research on large language models and transformer-based foundation models in biomedical informatics, reviewing recent findings in the field. He collaborated closely with a partnering pharmaceutical company, reproducing relevant literature and successfully defining AI modeling problems tailored to their specific research challenges.”

“While limited time and GPU resources prevented novel research breakthroughs, Junho’s exceptional learning curve and meticulous work style demonstrate his potential for remarkable achievements given adequate experience and resources.”

— Haanju Yoo, Director, Applied AI Group, NAVER

Full recommendation →