IBD-Quantum-ESM Project Report

Project Overview

This report details the status of the ibd-quantum-esm project, a proof-of-concept pipeline that applies quantum machine learning (QML) to the classification of Inflammatory Bowel Disease (IBD) multi-omics data. It summarizes the project's goals, current outcomes, and performance metrics.

Executive Summary

The project repository now delivers a full, CLI-driven workflow for embedding protein sequences, ingesting multi-omics tables, training both classical and quantum classifiers, and generating summary artifacts. A key limitation is the current reliance on a synthetic dataset due to restricted access to real IBD multi-omics data. Therefore, all metrics reported here serve as pipeline validation rather than conclusive biomedical evidence.

Original Objective

Deliver a proof-of-concept quantum-enhanced classifier on a real IBD multi-omics cohort (e.g., IBDMDB), compare it against classical baselines, and surface interpretable molecular signatures for IBD flare vs. remission.

Current Implementation

Implements the entire workflow on a synthetic, high-separation multi-omics dataset (240 samples, 13 features). No real patient data or IBM hardware runs have been executed yet.

Interactive Workflow

The pipeline is orchestrated through a series of CLI commands. Each stage is described below with its purpose and the associated command, covering the end-to-end process from data ingestion to model training.

1. Embed Sequences: ESM2 Embedding
2. Ingest Omics: Load & Process Data
3. Train Classical: Baseline Model
4. Train Quantum: QSVC Kernel Model

Stage 1: Embed Protein Sequences

This command uses the ESM2 model to generate numerical embeddings from protein FASTA files, storing them in a DuckDB database.

python -m src embed data/demo.fa --backend esm2_t6_8M_UR50D --outdb results/duckdb/embeddings.duckdb

Stage 2: Ingest Multi-omics Data

This step reads the synthetic multi-omics CSV, identifies ID and label columns, and saves the processed data into a DuckDB database.

python -m src ingest-omics data/ibd_multiomics_synthetic.csv --id-col sample_id --label-col clinical_status --outdb results/duckdb/ibd.duckdb --overwrite
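The column handling described for this stage can be sketched with the standard library alone. This is a minimal illustration and not the project's ingestion code: the helper name `split_omics_rows`, the feature names `feat_a` and `feat_b`, and the sample rows are hypothetical, and the DuckDB write step is omitted.

```python
import csv
import io


def split_omics_rows(csv_text, id_col, label_col):
    """Split a multi-omics CSV into sample IDs, labels, and a numeric feature matrix.

    Hypothetical helper: every column other than id_col and label_col
    is treated as a numeric feature.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    for required in (id_col, label_col):
        if required not in fieldnames:
            raise ValueError(f"missing column: {required}")
    feature_cols = [c for c in fieldnames if c not in (id_col, label_col)]

    ids, labels, features = [], [], []
    for row in reader:
        ids.append(row[id_col])
        labels.append(row[label_col])
        features.append([float(row[c]) for c in feature_cols])
    return ids, labels, feature_cols, features


# Tiny made-up table for illustration only; not project data.
demo_csv = """sample_id,clinical_status,feat_a,feat_b
S1,flare,0.9,1.2
S2,remission,0.1,0.3
"""
ids, labels, feature_cols, features = split_omics_rows(
    demo_csv, id_col="sample_id", label_col="clinical_status"
)
```

Failing early on a missing ID or label column mirrors the role of the --id-col and --label-col flags: the rest of the pipeline depends on both being present.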

Stage 3: Train Classical Classifier

Trains a classical Logistic Regression model on the ingested omics data and saves metrics, the model, and predictions.

python -m src train-omics --db results/duckdb/ibd.duckdb --out results/metrics/omics_classifier.json --model-out results/models/omics_classifier.joblib --pred-out results/predictions_omics.csv
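To illustrate what a logistic regression baseline does on well-separated data, here is a from-scratch sketch using batch gradient descent. It is an assumption-laden teaching example, not the implementation behind train-omics (which the report does not describe): the function names, learning rate, epoch count, and toy points are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logreg(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample (probability minus label).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Return 1 where the predicted probability is at least 0.5, else 0."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]


# Made-up, well-separated toy points (1 = "flare", 0 = "remission").
X = [[2.0, 1.8], [1.9, 2.2], [2.3, 2.0], [-2.1, -1.9], [-1.8, -2.2], [-2.0, -2.4]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logreg(X, y)

# Accuracy on the training points themselves, not a held-out evaluation.
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
```

When the two classes sit far apart, as in these toy points, a linear decision boundary classifies every sample correctly, which is the kind of behaviour a high-separation dataset produces.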

Stage 4: Train Quantum Classifier

Trains a Quantum Support Vector Classifier (QSVC) using a quantum kernel. This example uses 3 PCA components and a statevector simulator.

python -m src train-qsvc-quantum --db results/duckdb/ibd.duckdb --table embeddings --labels-csv data/ibd_multiomics_synthetic.csv --label-col clinical_status --pca-components 3 --reps 1 --backend statevector --C 100 --out results/metrics/qsvc_quantum.json
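The idea of a quantum kernel evaluated on a statevector simulator can be sketched in plain Python. The report does not state which feature map the project uses, so this example substitutes a simple RY angle encoding (one qubit per feature) as a stand-in; the function names and sample values are hypothetical. Each kernel entry is the squared overlap between two encoded states.

```python
import math


def angle_encoded_state(x):
    """Statevector of RY(x_i)|0> applied to each qubit, one qubit per feature."""
    state = [1.0]
    for xi in x:
        qubit = (math.cos(xi / 2.0), math.sin(xi / 2.0))
        # Kronecker product of the running state with the next qubit.
        state = [amp * q for amp in state for q in qubit]
    return state


def fidelity_kernel(x, y):
    """Kernel entry k(x, y) = |<psi(x)|psi(y)>|^2 (amplitudes are real here)."""
    overlap = sum(
        a * b for a, b in zip(angle_encoded_state(x), angle_encoded_state(y))
    )
    return overlap ** 2


def kernel_matrix(samples):
    """Pairwise kernel values for a list of samples."""
    return [[fidelity_kernel(a, b) for b in samples] for a in samples]


# Three made-up samples with 3 features each (mirroring --pca-components 3).
samples = [[0.1, 0.5, 1.0], [0.2, 0.4, 1.1], [2.5, 2.0, 0.3]]
K = kernel_matrix(samples)
```

In a QSVC, a matrix like `K` is handed to a classical support vector machine as a precomputed kernel, with C controlling regularization. For this product-state encoding the kernel reduces to a product of cos^2((x_i - y_i)/2) terms, so it can be computed classically; entangling feature maps do not factor this way.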

Results Dashboard

This section compares the performance of the classical and quantum models. All results are based on the synthetic dataset, which was found to be easily separable, leading to perfect scores for the classical model.

Data Limitation

The current data is synthetic and not representative of real-world biological complexity. The 1.00 F1 score for the classical model indicates high separability in the demo data, not a perfect model.

Classical Model (Logistic Regression): 1.00 Macro F1 Score

Quantum Model (QSVC Simulator): 0.83 Macro F1 Score
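For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. The sketch below computes it from scratch; the function name and the label lists are made up for illustration and are not the project's predictions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["flare"] * 4 + ["remission"] * 4
y_pred = ["flare", "flare", "remission", "remission"] + ["remission"] * 4
score = macro_f1(y_true, y_pred)
```

Here the "flare" class has F1 = 4/6 and the "remission" class has F1 = 8/10, so the macro average is their mean, roughly 0.73.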

Findings & Next Steps

The primary finding is that the end-to-end pipeline is functional. The gap between the classical (1.00) and quantum (~0.83) Macro F1 scores is likely due to the simplicity of the synthetic data, which the linear baseline separates easily, and to untuned parameters for the quantum model. The focus must now shift to applying this pipeline to real data.

Key Findings

The CLI-driven workflow is fully implemented and functional.
Classical model (LogReg) achieves a perfect 1.00 Macro F1 on the synthetic data, indicating high separability.
Quantum model (QSVC) achieves ~0.83 Macro F1, showing viability but also a performance gap on this simple data.
The synthetic data is insufficient for meaningful biomedical interpretation or model comparison.

Recommended Next Steps

Secure access to IBDMDB or an equivalent real patient cohort.
Update ingestion scripts to merge different data modalities and labels from the real dataset.
Re-run training, tune PCA/qubit settings, and explore IBM managed simulators or hardware backends.
Extend reporting notebooks to interpret dominant biomarkers (microbes, metabolites, host genes).
Package the workflow (e.g., Makefile or tox) and enable CI tests for reproducibility.
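As one possible starting point for the packaging step, the four stage commands from this report could be chained by a small runner script. This is a sketch under assumptions: the `STAGES` list and `run_pipeline` function are hypothetical names, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# The four stage commands, copied from the workflow section of this report.
STAGES = [
    "python -m src embed data/demo.fa --backend esm2_t6_8M_UR50D "
    "--outdb results/duckdb/embeddings.duckdb",
    "python -m src ingest-omics data/ibd_multiomics_synthetic.csv "
    "--id-col sample_id --label-col clinical_status "
    "--outdb results/duckdb/ibd.duckdb --overwrite",
    "python -m src train-omics --db results/duckdb/ibd.duckdb "
    "--out results/metrics/omics_classifier.json "
    "--model-out results/models/omics_classifier.joblib "
    "--pred-out results/predictions_omics.csv",
    "python -m src train-qsvc-quantum --db results/duckdb/ibd.duckdb "
    "--table embeddings --labels-csv data/ibd_multiomics_synthetic.csv "
    "--label-col clinical_status --pca-components 3 --reps 1 "
    "--backend statevector --C 100 --out results/metrics/qsvc_quantum.json",
]


def run_pipeline(stages=STAGES, dry_run=True):
    """Run each stage in order, stopping at the first failure."""
    argvs = [shlex.split(cmd) for cmd in stages]
    for argv in argvs:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(argv, check=True)
    return argvs


planned = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the stages in sequence; a CI job could call the same function so that local and automated runs share one definition of the workflow.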

Command Reference

Below is a reference of the key commands used in the project's workflow. The final command is the Stage 4 training command with its additional options spelled out.

python -m src runtime-test --limit 5

python -m src embed data/demo.fa --backend esm2_t6_8M_UR50D --outdb results/duckdb/embeddings.duckdb

python -m src ingest-omics data/ibd_multiomics_synthetic.csv --id-col sample_id --label-col clinical_status --outdb results/duckdb/ibd.duckdb --overwrite

python -m src train-omics --db results/duckdb/ibd.duckdb --out results/metrics/omics_classifier.json --model-out results/models/omics_classifier.joblib --pred-out results/predictions_omics.csv

python -m src train-qsvc-quantum --db results/duckdb/ibd.duckdb --table embeddings --labels-csv data/ibd_multiomics_synthetic.csv --label-col clinical_status --pca-components 3 --reps 1 --shots 1024 --backend statevector --per-class-limit 120 --C 100 --test-size 0.2 --random-state 123 --out results/metrics/qsvc_quantum.json --model-out results/models/qsvc_quantum.joblib