Portfolio by Johnny RiceInformatics demo
ClariTrial
ClariTrial/Scientific Informatics Demo
Informatics DemoKymera case study · Watertown, MAAll data live

Scientific Informatics: Live Integrations and Code

This demo uses Kymera Therapeutics, a Watertown, Massachusetts biotechnology company, as the informatics case study. The page doubles as a slice of data engineering (live SQL, REST ingestion, caching, and operational concerns), AI engineering (prompts and toolbars built from the payloads the APIs return), and full stack development (this Next.js surface, server-side fetches, and interactive panels). Architecturally: batch ETL (illustrated in code samples) builds curated ground truth; deterministic SQL and REST on this page give exact, repeatable answers; agentic chat at /api/chat composes the same allowlisted AACT presets and registry tools when questions are messy or multi-step. All data is fetched live from ChEMBL, UniProt, AACT PostgreSQL, and PubMed. Code samples cover Python, R/Shiny, SQL, Airflow, Docker, and AWS.

Ask AI about this demo
ChEMBL activities
31,876
STAT6, IRAK4, IRF5
UniProt proteins
3
Swiss-Prot reviewed
TPD trials (live SQL)
10+
AACT PostgreSQL
Code samples
5
Python, R, SQL, Docker, AWS + TS/Next

Agentic Chat Pipeline

Same three-layer story as this page: ETL builds trust, SQL and REST return exact slices, /api/chat reasons across them with allowlisted AACT tools plus specialists (not ad-hoc SQL)

Agentic chat request path

Next.js route handler, Vercel AI SDK streamText on the lead and generateText inside specialists. Deterministic first: the lead can call queryAactPreset (same allowlisted SQL as the AACT panel below), then consultSpecialist or consultSpecialistsParallel for registry and literature work. Optional intent (Facts and SQL vs Explore), answer trace in the UI, versioned prompts in metadata and audit logs.

POST/api/chat
Lead:Claude Sonnet 4.6orGPT-4o
Lead tools:queryAactPreset·consultSpecialist·consultSpecialistsParallel
Subagent:Claude Haiku 4.5orGPT-4o-mini
Trial discoveryTrial deep diveEvidence and literatureTrial comparison

Specialists use scoped tools only (search, detail, PubMed, compare). The lead never runs ad-hoc SQL; AACT access matches fixed presets. Structured tool output feeds the streamed reply; users can expand how this answer was built for tool names and prompt version.

Reference: docs/architecture-agentic.md in the repository describes scope (portfolio demo vs regulated posture), guardrails, and roadmap.

Open live chat

Skills Coverage

Scientific informatics mapped to data engineering, AI engineering, and full stack delivery, tied to the Kymera case study (Watertown, MA)

Languages

Python

Pipelines, APIs, automation, ETL

R / R Shiny

Interactive scientific dashboards

SQL (PostgreSQL)

AACT live DB, DMTA schemas, views

Linux / Bash

CLI automation, cron, scripting

Ask AI

Scientific Platforms

CDD Vault

Data model + REST API integration

D360 / Dotmatics

Assay data pipeline patterns

LiveDesign

DMTA design data flows

ChEMBL API

Live bioactivity integration (this page)

UniProt API

Live protein data (this page)

ClinicalTrials.gov

REST + AACT PostgreSQL

Ask AI

Infrastructure

AWS (EC2/ECS/S3/RDS)

CDK IaC, Fargate, lifecycle rules

Docker / Compose

Containerized R + Python stack

Airflow / Prefect

DMTA workflow orchestration DAG

CI/CD (GitHub Actions)

Automated test + deploy

Ask AI

Data Engineering

Data Modeling

DMTA compound/assay/results schema

ETL / ELT Pipelines

Validated, paginated, tested

Data Validation / QC

Outlier detection, curation flags

Proteomics workflows

UniProt integration, protein data

Vendor management

Scoped at RCH + Vertex + Pfizer

Ask AI

Full Stack & AI Engineering

Next.js / React

This page: layouts, interactive panels, route handlers

TypeScript

Typed UI and server integration

Live-backed prompts

Toolbar prompts built from ChEMBL, UniProt, and AACT payloads

Server data fetching

RSC + caching: fresh scientific data without exposing keys in the browser

Ask AI

Target Proteins

STAT6, IRAK4, and IRF5 from UniProt REST API

Live data
STAT6P42226

Signal transducer and activator of transcription 6

Carries out a dual function: signal transduction and activation of transcription. Involved in IL4/interleukin-4- and IL3/interleukin-3-mediated signaling

847 aaSwiss-Prot ✓CytoplasmNucleus
Ask AI
IRAK4Q9NWZ3

Interleukin-1 receptor-associated kinase 4

Serine/threonine-protein kinase that plays a critical role in initiating innate immune response against foreign pathogens. Involved in Toll-like receptor (TLR) and IL-1R signaling pathways (PubMed:17878374). Is rapidly recruited by MYD88 to the receptor-signaling complex upon TLR activation to form the Myddosome together with IRAK2. Phosphorylates initially IRAK1, thus stimulating the kinase activ

460 aaSwiss-Prot ✓Cytoplasm
Ask AI
IRF5Q13568

Interferon regulatory factor 5

Transcription factor that plays a critical role in innate immunity by activating expression of type I interferon (IFN) IFNA and INFB and inflammatory cytokines downstream of endolysosomal toll-like receptors TLR7, TLR8 and TLR9 (PubMed:11303025, PubMed:15695821, PubMed:22412986, PubMed:25326418, PubMed:32433612). Regulates the transcription of type I IFN genes (IFN-alpha and IFN-beta) and IFN-stim

498 aaSwiss-Prot ✓CytoplasmNucleus
Ask AI
GEThttps://rest.uniprot.org/uniprotkb/{P42226,Q9NWZ3,Q13568}.jsonNo key required, 24h cache

ChEMBL Bioactivity Data

31,876 total measurements across STAT6, IRAK4, IRF5

Live data
STAT6top hits by pChEMBL
224 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL487451IC500.79.15Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL593171IC500.79.15Inhibition of STAT6 in IL4-stimulated human FW4 re2009
CHEMBL470963IC501.48.85Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL488814IC501.68.80Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL519249IC501.88.74Inhibition of STAT6 activation in FW4 reporter cel2008
CHEMBL512874IC501.88.74Inhibition of STAT6 activation in FW4 reporter cel2008
Ask AI
IRAK4top hits by pChEMBL
31,578 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL4443947IC500.02210.66Binding affinity to human IRK4 using myelin basic 2019
CHEMBL4556091IC500.02610.59Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL4566431IC500.07810.11Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL4545898IC500.08110.09Inhibition of IRAK4 in human whole blood assessed 2020
CHEMBL6030179IC500.110.00IRAK4 Enzymatic DELFIA Assay, Protocol A: This is 2023
CHEMBL4066705IC500.110.00Inhibition of full-length IRAK4 (unknown origin) i2024
Ask AI
IRF5top hits by pChEMBL
74 total
MoleculeTypeValue (nM)pChEMBLAssayYear
CHEMBL463914IC503,4005.47Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL3960997IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL463914IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL1014IC505,6005.25Inhibition of human DNA polymerase kappa (19 to 522016
CHEMBL1917196IC506,8005.17Inhibition of C-terminal His6-tagged human DNA pol2011
CHEMBL1917198IC508,1005.09Inhibition of C-terminal His6-tagged human DNA pol2011
Ask AI
GEThttps://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id=CHEMBL5401&assay_type=B&pchembl_value__isnull=false&order_by=-pchembl_value

AACT PostgreSQL

Deterministic slice: presets (degrader, glue, kinase, combined) power this panel and queryAactPreset in /api/chat. Flexible queries (queryAactFlexible) add target, sponsor, drug, condition, and phase filters.

Live data

Bifunctional molecules (PROTACs) that recruit E3 ligases to ubiquitinate and degrade target proteins

aact-db.ctti-clinicaltrials.org · 5432 / aactLive query executed server-side
SELECT DISTINCT ON (s.nct_id)
  s.nct_id, s.brief_title, s.overall_status, s.phase,
  sp.name AS sponsor, s.enrollment, s.start_date
FROM studies s
JOIN interventions i  ON i.nct_id = s.nct_id
JOIN sponsors sp      ON sp.nct_id = s.nct_id AND sp.lead_or_collaborator = 'lead'
WHERE (
  LOWER(i.name) LIKE '%protac%'
  OR LOWER(i.name) LIKE '%degrader%'
  OR LOWER(s.brief_title) LIKE '%degrader%'
)
  AND s.overall_status IN ('RECRUITING', 'ACTIVE_NOT_RECRUITING', 'NOT_YET_RECRUITING')
ORDER BY s.nct_id, s.start_date DESC NULLS LAST
LIMIT 10;

Active trials · PROTACs / Degraders

Recruiting
Planned
Clinical Trial to Evaluate the Tolerance of TQB3201 Tablets
NCT07172126Phase 1/Phase 2Chia Tai Tianqing Pharmaceutical Group Co., Ltd.291 pts
Search all PROTACs / Degraders trials on ClariTrial →

By trial phase

Phase 1
5
NA
4
Phase 1/Phase 2
1
Phase 4
1
Not Applicable
1

Top sponsors

BeiGene1
Boundless Bio, Inc.1
Horsens Hospital1
Manchester University NHS Foundation Trust1
Medtronic - MITG1
Nantes University Hospital1

DMTA Cycle: Design, Make, Test, Analyze

Tools, data inputs, and outputs at each stage of drug discovery

Step 1Design

Tools

SchrödingerLiveDesignRDKitPython

Data in

Target structures, docking scores, ADMET predictions

Output

Virtual compound library (SMILES + predicted properties)

Ask AI
Step 2Make

Tools

ELN (Signals)CDD VaultLIMSBarcode scanning

Data in

Synthesis routes, reaction yields, batch purity (HPLC/NMR)

Output

Registered compound batches with analytical certificates

Ask AI
Step 3Test

Tools

D360 / Dotmaticsplate readersAirflow DAG

Data in

IC50, DC50, Dmax, selectivity panels, DMPK assays

Output

Validated bioactivity dataset loaded to research DB

Ask AI
Step 4Analyze

Tools

R / ShinyPython (pandas, scipy)SpotfireSQL

Data in

SAR trends, multiparameter optimization, statistical models

Output

Go/no-go decisions, next design hypotheses

Ask AI

Integration Architecture

Sources to ingestion to storage to apps; agentic chat sits beside dashboards as a separate path for messy, multi-step questions

Data Sources

ChEMBLBioactivity
UniProtProteins
ClinicalTrials.govTrials
PubMedLiterature
Instrument exportsAssays
Ask AI

Ingestion & Processing

Python ETLAirflow DAG
SQL CurationPostgreSQL
R scriptsAnalysis
QC PipelineValidation
Ask AI

Storage & Platform

Amazon RDSPostgreSQL
Amazon S3Raw data
CDD Vault / D360Scientific
Ask AI

Applications

Next.js / web UIFull stack
R ShinyScientists
REST APIsIntegrations
DashboardsLeadership
Ask AI

Code Samples

Python ETL, R Shiny app, PostgreSQL DMTA schema, Airflow DAG, Docker and AWS CDK

PythonR / ShinySQLApache AirflowDockerAWS CDK

ETL pipeline that fetches compound bioactivity data from ChEMBL, validates records, flags outliers and duplicates, and writes a curated Parquet file. Mirrors the CDD Vault / D360 ingestion pattern.

chembl_etl.py
#!/usr/bin/env python3
"""
ChEMBL → Scientific Database ETL Pipeline
Fetches bioactivity data for Kymera target proteins,
validates and loads into research data store.

Pattern mirrors CDD Vault / D360 integration workflows.
"""

import numpy as np
import pandas as pd
import requests
import logging
from dataclasses import asdict, dataclass, field
from typing import Iterator

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"

# Kymera priority targets
TARGETS = {
    "STAT6":  "CHEMBL5401",
    "IRAK4":  "CHEMBL3778",
    "IRF5":   "CHEMBL5365",
}

@dataclass
class ActivityRecord:
    target: str
    molecule_id: str
    smiles: str | None
    assay_type: str
    standard_type: str           # IC50, Ki, Kd, etc.
    value_nm: float | None
    pchembl: float | None
    assay_description: str
    journal: str | None
    year: int | None
    flags: list[str] = field(default_factory=list)

def fetch_activities(target_name: str, chembl_id: str) -> Iterator[dict]:
    """Paginate ChEMBL activity endpoint for a given target."""
    offset, limit = 0, 100
    while True:
        url = (
            f"{CHEMBL_BASE}/activity.json"
            f"?target_chembl_id={chembl_id}"
            f"&assay_type=B&pchembl_value__isnull=false"
            f"&limit={limit}&offset={offset}"
        )
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        data = r.json()
        rows = data.get("activities", [])
        log.info(f"{target_name}: fetched {len(rows)} records (offset={offset})")
        yield from rows
        if not data["page_meta"]["next"]:
            break
        offset += limit

def parse_record(target: str, raw: dict) -> ActivityRecord:
    """Parse and validate a raw ChEMBL activity dict."""
    try:
        value = float(raw["standard_value"]) if raw.get("standard_value") else None
    except (ValueError, TypeError):
        value = None

    try:
        pchembl = float(raw["pchembl_value"]) if raw.get("pchembl_value") else None
    except (ValueError, TypeError):
        pchembl = None

    flags = []
    if raw.get("potential_duplicate"):
        flags.append("POTENTIAL_DUPLICATE")
    if value is not None and value < 0:
        flags.append("NEGATIVE_VALUE")
    if raw.get("data_validity_comment"):
        flags.append(f"DVC:{raw['data_validity_comment']}")

    return ActivityRecord(
        target=target,
        molecule_id=raw.get("molecule_chembl_id", ""),
        smiles=raw.get("canonical_smiles"),
        assay_type=raw.get("assay_type", ""),
        standard_type=raw.get("standard_type", ""),
        value_nm=value,
        pchembl=pchembl,
        assay_description=raw.get("assay_description", ""),
        journal=raw.get("document_journal"),
        year=raw.get("document_year"),
        flags=flags,
    )

def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Curate and validate bioactivity data:
    - Remove duplicates and flagged records
    - Filter to accepted activity types
    - Enforce value range (1 pM – 100 µM = 0.001–100000 nM)
    """
    initial = len(df)
    df = df[~df["flags"].apply(lambda f: "POTENTIAL_DUPLICATE" in f)]
    df = df[df["standard_type"].isin(["IC50", "Ki", "Kd", "EC50", "GI50"])]
    df = df[df["value_nm"].between(0.001, 100_000)]
    df = df.dropna(subset=["smiles", "value_nm"])
    log.info(f"Validation: {initial} → {len(df)} records ({initial - len(df)} removed)")
    return df

def run_pipeline(output_path: str = "kymera_bioactivity.parquet") -> pd.DataFrame:
    """
    Full ETL: Extract → Transform → Validate → Load
    Output: curated Parquet file ready for downstream analysis
    or CDD Vault / D360 import.
    """
    all_records = []
    for target_name, chembl_id in TARGETS.items():
        for raw in fetch_activities(target_name, chembl_id):
            all_records.append(asdict(parse_record(target_name, raw)))

    df = pd.DataFrame(all_records)
    df = validate_dataframe(df)
    df["pIC50"] = df["pchembl"].fillna(9 - np.log10(df["value_nm"]))
    df.to_parquet(output_path, index=False)
    log.info(f"Loaded {len(df)} curated records → {output_path}")
    return df

if __name__ == "__main__":
    df = run_pipeline()
    print(df.groupby("target")[["value_nm", "pchembl"]].describe().round(2))

On this page

What these numbers mean

31,876ChEMBL, EBI REST API

Bioactivity measurements

IC50, Ki, and Kd values for STAT6 (224), IRAK4 (31,578), and IRF5 (74) from published assays and patents. A higher count means more published evidence for a target, which affects how tractable it is.

10+AACT PostgreSQL, live SQL

Active degrader / PROTAC trials

Queried in real time from the AACT PostgreSQL replica of ClinicalTrials.gov. Filters on PROTAC and degrader in intervention names and titles, RECRUITING or ACTIVE status only. Shows how much the field has moved from preclinical into the clinic.

3proteins

Swiss-Prot curated entries for STAT6, IRAK4, IRF5 from UniProt. Includes sequence, function, and subcellular location.

5code samples

Python, R/Shiny, SQL, Docker, AWS CDK, plus this Next.js UI. Covers data engineering, AI-assisted cards from live payloads, and full stack delivery for the Kymera informatics case study.

31,876
10+