Scientific Informatics: Live Integrations and Code
This demo uses Kymera Therapeutics, a Watertown, Massachusetts biotechnology company, as the informatics case study. The page doubles as a slice of data engineering (live SQL, REST ingestion, caching, and operational concerns), AI engineering (prompts and toolbars built from the payloads the APIs return), and full stack development (this Next.js surface, server-side fetches, and interactive panels). Architecturally: batch ETL (illustrated in code samples) builds curated ground truth; deterministic SQL and REST on this page give exact, repeatable answers; agentic chat at /api/chat composes the same allowlisted AACT presets and registry tools when questions are messy or multi-step. All data is fetched live from ChEMBL, UniProt, AACT PostgreSQL, and PubMed. Code samples cover Python, R/Shiny, SQL, Airflow, Docker, and AWS.
Agentic Chat Pipeline
Same three-layer story as this page: ETL builds trust, SQL and REST return exact slices, /api/chat reasons across them with allowlisted AACT tools plus specialists (not ad-hoc SQL)
Agentic chat request path
Next.js route handler, Vercel AI SDK streamText on the lead and generateText inside specialists. Deterministic first: the lead can call queryAactPreset (same allowlisted SQL as the AACT panel below), then consultSpecialist or consultSpecialistsParallel for registry and literature work. Optional intent (Facts and SQL vs Explore), answer trace in the UI, versioned prompts in metadata and audit logs.
Specialists use scoped tools only (search, detail, PubMed, compare). The lead never runs ad-hoc SQL; AACT access matches fixed presets. Structured tool output feeds the streamed reply; users can expand how this answer was built for tool names and prompt version.
Reference: docs/architecture-agentic.md in the repository describes scope (portfolio demo vs regulated posture), guardrails, and roadmap.
Skills Coverage
Scientific informatics mapped to data engineering, AI engineering, and full stack delivery, tied to the Kymera case study (Watertown, MA)
Languages
Pipelines, APIs, automation, ETL
Interactive scientific dashboards
AACT live DB, DMTA schemas, views
CLI automation, cron, scripting
Scientific Platforms
Data model + REST API integration
Assay data pipeline patterns
DMTA design data flows
Live bioactivity integration (this page)
Live protein data (this page)
REST + AACT PostgreSQL
Infrastructure
CDK IaC, Fargate, lifecycle rules
Containerized R + Python stack
DMTA workflow orchestration DAG
Automated test + deploy
Data Engineering
DMTA compound/assay/results schema
Validated, paginated, tested
Outlier detection, curation flags
UniProt integration, protein data
Scoped at RCH + Vertex + Pfizer
Full Stack & AI Engineering
This page: layouts, interactive panels, route handlers
Typed UI and server integration
Toolbar prompts built from ChEMBL, UniProt, and AACT payloads
RSC + caching: fresh scientific data without exposing keys in the browser
Target Proteins
STAT6, IRAK4, and IRF5 from UniProt REST API
Signal transducer and activator of transcription 6
Carries out a dual function: signal transduction and activation of transcription. Involved in IL4/interleukin-4- and IL3/interleukin-3-mediated signaling
Interleukin-1 receptor-associated kinase 4
Serine/threonine-protein kinase that plays a critical role in initiating innate immune response against foreign pathogens. Involved in Toll-like receptor (TLR) and IL-1R signaling pathways (PubMed:17878374). Is rapidly recruited by MYD88 to the receptor-signaling complex upon TLR activation to form the Myddosome together with IRAK2. Phosphorylates initially IRAK1, thus stimulating the kinase activ
Interferon regulatory factor 5
Transcription factor that plays a critical role in innate immunity by activating expression of type I interferon (IFN) IFNA and INFB and inflammatory cytokines downstream of endolysosomal toll-like receptors TLR7, TLR8 and TLR9 (PubMed:11303025, PubMed:15695821, PubMed:22412986, PubMed:25326418, PubMed:32433612). Regulates the transcription of type I IFN genes (IFN-alpha and IFN-beta) and IFN-stim
https://rest.uniprot.org/uniprotkb/{P42226,Q9NWZ3,Q13568}.jsonNo key required, 24h cacheChEMBL Bioactivity Data
31,876 total measurements across STAT6, IRAK4, IRF5
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL487451 | IC50 | 0.7 | 9.15 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL593171 | IC50 | 0.7 | 9.15 | Inhibition of STAT6 in IL4-stimulated human FW4 re… | 2009 |
| CHEMBL470963 | IC50 | 1.4 | 8.85 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL488814 | IC50 | 1.6 | 8.80 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL519249 | IC50 | 1.8 | 8.74 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| CHEMBL512874 | IC50 | 1.8 | 8.74 | Inhibition of STAT6 activation in FW4 reporter cel… | 2008 |
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL4443947 | IC50 | 0.022 | 10.66 | Binding affinity to human IRK4 using myelin basic … | 2019 |
| CHEMBL4556091 | IC50 | 0.026 | 10.59 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL4566431 | IC50 | 0.078 | 10.11 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL4545898 | IC50 | 0.081 | 10.09 | Inhibition of IRAK4 in human whole blood assessed … | 2020 |
| CHEMBL6030179 | IC50 | 0.1 | 10.00 | IRAK4 Enzymatic DELFIA Assay, Protocol A: This is … | 2023 |
| CHEMBL4066705 | IC50 | 0.1 | 10.00 | Inhibition of full-length IRAK4 (unknown origin) i… | 2024 |
| Molecule | Type | Value (nM) | pChEMBL | Assay | Year |
|---|---|---|---|---|---|
| CHEMBL463914 | IC50 | 3,400 | 5.47 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL3960997 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL463914 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL1014 | IC50 | 5,600 | 5.25 | Inhibition of human DNA polymerase kappa (19 to 52… | 2016 |
| CHEMBL1917196 | IC50 | 6,800 | 5.17 | Inhibition of C-terminal His6-tagged human DNA pol… | 2011 |
| CHEMBL1917198 | IC50 | 8,100 | 5.09 | Inhibition of C-terminal His6-tagged human DNA pol… | 2011 |
https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5401&assay_type=B&pchembl_value__isnull=false&order_by=-pchembl_valueAACT PostgreSQL
Deterministic slice: presets (degrader, glue, kinase, combined) power this panel and queryAactPreset in /api/chat. Flexible queries (queryAactFlexible) add target, sponsor, drug, condition, and phase filters.
Bifunctional molecules (PROTACs) that recruit E3 ligases to ubiquitinate and degrade target proteins
SELECT DISTINCT ON (s.nct_id)
s.nct_id, s.brief_title, s.overall_status, s.phase,
sp.name AS sponsor, s.enrollment, s.start_date
FROM studies s
JOIN interventions i ON i.nct_id = s.nct_id
JOIN sponsors sp ON sp.nct_id = s.nct_id AND sp.lead_or_collaborator = 'lead'
WHERE (
LOWER(i.name) LIKE '%protac%'
OR LOWER(i.name) LIKE '%degrader%'
OR LOWER(s.brief_title) LIKE '%degrader%'
)
AND s.overall_status IN ('RECRUITING', 'ACTIVE_NOT_RECRUITING', 'NOT_YET_RECRUITING')
ORDER BY s.nct_id, s.start_date DESC NULLS LAST
LIMIT 10;Active trials · PROTACs / Degraders
By trial phase
Top sponsors
DMTA Cycle: Design, Make, Test, Analyze
Tools, data inputs, and outputs at each stage of drug discovery
Tools
Data in
Target structures, docking scores, ADMET predictions
Output
Virtual compound library (SMILES + predicted properties)
Tools
Data in
Synthesis routes, reaction yields, batch purity (HPLC/NMR)
Output
Registered compound batches with analytical certificates
Tools
Data in
IC50, DC50, Dmax, selectivity panels, DMPK assays
Output
Validated bioactivity dataset loaded to research DB
Tools
Data in
SAR trends, multiparameter optimization, statistical models
Output
Go/no-go decisions, next design hypotheses
Integration Architecture
Sources to ingestion to storage to apps; agentic chat sits beside dashboards as a separate path for messy, multi-step questions
Data Sources
Ingestion & Processing
Storage & Platform
Applications
Code Samples
Python ETL, R Shiny app, PostgreSQL DMTA schema, Airflow DAG, Docker and AWS CDK
ETL pipeline that fetches compound bioactivity data from ChEMBL, validates records, flags outliers and duplicates, and writes a curated Parquet file. Mirrors the CDD Vault / D360 ingestion pattern.
#!/usr/bin/env python3
"""
ChEMBL → Scientific Database ETL Pipeline
Fetches bioactivity data for Kymera target proteins,
validates and loads into research data store.
Pattern mirrors CDD Vault / D360 integration workflows.
"""
import numpy as np
import pandas as pd
import requests
import logging
from dataclasses import asdict, dataclass, field
from typing import Iterator
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"
# Kymera priority targets
TARGETS = {
"STAT6": "CHEMBL5401",
"IRAK4": "CHEMBL3778",
"IRF5": "CHEMBL5365",
}
@dataclass
class ActivityRecord:
target: str
molecule_id: str
smiles: str | None
assay_type: str
standard_type: str # IC50, Ki, Kd, etc.
value_nm: float | None
pchembl: float | None
assay_description: str
journal: str | None
year: int | None
flags: list[str] = field(default_factory=list)
def fetch_activities(target_name: str, chembl_id: str) -> Iterator[dict]:
"""Paginate ChEMBL activity endpoint for a given target."""
offset, limit = 0, 100
while True:
url = (
f"{CHEMBL_BASE}/activity.json"
f"?target_chembl_id={chembl_id}"
f"&assay_type=B&pchembl_value__isnull=false"
f"&limit={limit}&offset={offset}"
)
r = requests.get(url, timeout=30)
r.raise_for_status()
data = r.json()
rows = data.get("activities", [])
log.info(f"{target_name}: fetched {len(rows)} records (offset={offset})")
yield from rows
if not data["page_meta"]["next"]:
break
offset += limit
def parse_record(target: str, raw: dict) -> ActivityRecord:
"""Parse and validate a raw ChEMBL activity dict."""
try:
value = float(raw["standard_value"]) if raw.get("standard_value") else None
except (ValueError, TypeError):
value = None
try:
pchembl = float(raw["pchembl_value"]) if raw.get("pchembl_value") else None
except (ValueError, TypeError):
pchembl = None
flags = []
if raw.get("potential_duplicate"):
flags.append("POTENTIAL_DUPLICATE")
if value is not None and value < 0:
flags.append("NEGATIVE_VALUE")
if raw.get("data_validity_comment"):
flags.append(f"DVC:{raw['data_validity_comment']}")
return ActivityRecord(
target=target,
molecule_id=raw.get("molecule_chembl_id", ""),
smiles=raw.get("canonical_smiles"),
assay_type=raw.get("assay_type", ""),
standard_type=raw.get("standard_type", ""),
value_nm=value,
pchembl=pchembl,
assay_description=raw.get("assay_description", ""),
journal=raw.get("document_journal"),
year=raw.get("document_year"),
flags=flags,
)
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
"""
Curate and validate bioactivity data:
- Remove duplicates and flagged records
- Filter to accepted activity types
- Enforce value range (1 pM – 100 µM = 0.001–100000 nM)
"""
initial = len(df)
df = df[~df["flags"].apply(lambda f: "POTENTIAL_DUPLICATE" in f)]
df = df[df["standard_type"].isin(["IC50", "Ki", "Kd", "EC50", "GI50"])]
df = df[df["value_nm"].between(0.001, 100_000)]
df = df.dropna(subset=["smiles", "value_nm"])
log.info(f"Validation: {initial} → {len(df)} records ({initial - len(df)} removed)")
return df
def run_pipeline(output_path: str = "kymera_bioactivity.parquet") -> pd.DataFrame:
"""
Full ETL: Extract → Transform → Validate → Load
Output: curated Parquet file ready for downstream analysis
or CDD Vault / D360 import.
"""
all_records = []
for target_name, chembl_id in TARGETS.items():
for raw in fetch_activities(target_name, chembl_id):
all_records.append(asdict(parse_record(target_name, raw)))
df = pd.DataFrame(all_records)
df = validate_dataframe(df)
df["pIC50"] = df["pchembl"].fillna(9 - np.log10(df["value_nm"]))
df.to_parquet(output_path, index=False)
log.info(f"Loaded {len(df)} curated records → {output_path}")
return df
if __name__ == "__main__":
df = run_pipeline()
print(df.groupby("target")[["value_nm", "pchembl"]].describe().round(2))On this page
What these numbers mean
Bioactivity measurements
IC50, Ki, and Kd values for STAT6 (224), IRAK4 (31,578), and IRF5 (74) from published assays and patents. A higher count means more published evidence for a target, which affects how tractable it is.
Active degrader / PROTAC trials
Queried in real time from the AACT PostgreSQL replica of ClinicalTrials.gov. Filters on PROTAC and degrader in intervention names and titles, RECRUITING or ACTIVE status only. Shows how much the field has moved from preclinical into the clinic.
Swiss-Prot curated entries for STAT6, IRAK4, IRF5 from UniProt. Includes sequence, function, and subcellular location.
Python, R/Shiny, SQL, Docker, AWS CDK, plus this Next.js UI. Covers data engineering, AI-assisted cards from live payloads, and full stack delivery for the Kymera informatics case study.