Skip to main content

Methods

Comprehensive documentation of the image analysis pipeline and statistical methods.

Image Analysis Pipeline

1. Overview

HistoAtlas is a pan-cancer atlas of computational histopathology features derived from The Cancer Genome Atlas (TCGA). It integrates 40 quantitative histomics features extracted from hematoxylin-and-eosin (H&E) whole-slide images with clinical, genomic, and molecular annotations. The analysis pipeline has three stages: (1) cell and tissue segmentation from H&E whole-slide images (§3–§4), (2) spatial region definition and histomics feature extraction (§5), and (3) statistical analysis of feature–outcome associations (§6–§17). The platform provides a suite of statistical analyses: survival modelling, correlation analysis, categorical association testing, gene set enrichment, and unsupervised clustering, all designed to characterise how tissue morphology relates to patient outcomes and tumour biology across 24 cancer types. All analyses are executed through a reproducible Snakemake pipeline with multiple testing correction, evidence strength classification, and confounding adjustment applied systematically.

2. Data Sources and Cohort Definition

Tissue slides and clinical annotations are sourced from TCGA via the Genomic Data Commons (GDC). The atlas includes 9,028 diagnostic whole-slide images spanning 33 TCGA cancer types. One representative formalin-fixed, paraffin-embedded (FFPE) diagnostic slide is selected per patient case. Slides are excluded if the viable tissue area (i.e., the area of all non-whitespace, non-artefact compartments after tissue segmentation, §4) falls below 1 mm², or if the slide exhibits severe processing artefacts (pen marks covering >20% of tissue area, out-of-focus regions), or if essential clinical metadata (vital status, follow-up time) is missing. In addition, nine TCGA cancer types are excluded entirely because their dominant cell morphologies fall outside the training domain of the HistoPLUS cell detection model (§3), which was developed primarily on squamous and epithelial tissue: KIRC, KIRP, and KICH (renal clear-cell, papillary, and chromophobe carcinomas), DLBC (diffuse large B-cell lymphoma), LAML (acute myeloid leukaemia), LGG and GBM (lower-grade and high-grade gliomas), SKCM (cutaneous melanoma), and PCPG (pheochromocytoma and paraganglioma). After all exclusions, the final cohort comprises 24 cancer types.

Clinical endpoints include overall survival (OS), progression-free survival (PFS), disease-specific survival (DSS), and disease-free survival (DFS), with time-to-event and censoring indicators sourced from the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR; Liu et al., 2018).

Molecular annotations are derived from published TCGA companion studies. Somatic mutation calls originate from the MC3 multi-caller ensemble (Ellrott et al., 2018). Immune cell fraction estimates are obtained from CIBERSORT (Newman et al., 2015) and xCell (Aran et al., 2017). Tumour purity is estimated by ABSOLUTE (Carter et al., 2012). Immune subtypes (C1–C6) are taken from the pan-cancer immune landscape analysis of Thorsson et al. (2018), which classified TCGA samples into six categories: wound healing (C1), IFN-γ dominant (C2), inflammatory (C3), lymphocyte depleted (C4), immunologically quiet (C5), and TGF-β dominant (C6), based on integrated immune gene expression, leukocyte fraction, and neoantigen data.

3. Cell Segmentation

Cell-level instance segmentation is performed using Owkin’s HistoPLUS model (Adjadj et al., 2025), a cell detection, segmentation and classification model for computational pathology. HistoPLUS detects and classifies individual cells into nine morphological types:

Cell typeDescription
Tumor cellsNeoplastic epithelial cells
LymphocytesT cells, B cells, NK cells
FibroblastsStromal spindle cells
PlasmocytesAntibody-secreting B cell derivatives
NeutrophilsPolymorphonuclear granulocytes
EosinophilsBilobed eosinophilic granulocytes
Red blood cellsErythrocytes (excluded from density computations)
Apoptotic bodiesCell death fragments
Mitotic figuresCells undergoing division

Inference is performed tile-by-tile on 224 × 224 px tiles. When 40× magnification is available (0.25 μm/px), it is used directly; otherwise, the model falls back to 20× (0.50 μm/px). Tiles are extracted with a 64-pixel (16 μm) overlap margin between adjacent tiles. Cells detected in overlap regions are deduplicated via a union-find algorithm that merges instances whose centroids fall within 10 μm of each other, preventing border-effect artefacts at tile boundaries.

The output for each cell is an instance segmentation mask, a centroid coordinate (), and a cell-type label. HistoPLUS achieves a mean panoptic quality (PQ) of 0.509 across cell types on its evaluation benchmark; see Adjadj et al. (2025) for the full evaluation protocol. These per-cell annotations serve as inputs for downstream density, ratio, and spatial features (§5).

4. Tissue Segmentation

Tissue-level semantic segmentation classifies each pixel of the whole-slide image into one of nine tissue compartments. The model uses a CellViT-inspired architecture (Hörst et al., 2024) with a Phikon self-supervised backbone (Filiot et al., 2023), trained on the PanopTILs crowdsourced annotation dataset (Amgad et al., 2019). Inference is performed at 0.5 μm/px on 224 × 224 px tiles. Adjacent tiles overlap by 32 pixels and the final segmentation mask is obtained by majority voting in overlap regions, followed by downsampling to the analysis resolution of 8 μm/px (§5). On the PanopTILs held-out test set, the model achieves a mean intersection-over-union (mIoU) of 0.72 across tissue classes; see supplementary materials for the per-class breakdown.

Tissue classDescription
Cancerous epitheliumTumour cell regions
StromaConnective tissue and fibroblast-rich areas
NecrosisDead tissue regions
Normal epitheliumNon-neoplastic epithelial tissue
TILsTumour-infiltrating lymphocyte-dense areas
Junk / DebrisProcessing artefacts and tissue folds
BloodVascular lumens and haemorrhagic regions
OtherUnclassified tissue
WhitespaceBackground / empty glass

TIL region reclassification

Before any downstream computation, regions classified as “TILs” by the tissue segmentation model are reclassified as Stroma. TIL-dense zones are biologically embedded within the stromal compartment (Saltz et al., 2018), and treating them as a separate compartment would shrink the effective stromal area, inflating cell density estimates in immune-rich regions and creating a systematic bias in density-based features. This convention follows the International Immuno-Oncology Biomarker Working Group recommendation to assess stromal TILs as a fraction of the total stromal area (Salgado et al., 2015). After reclassification, the effective compartments used for spatial analysis are: cancerous epithelium, stroma, necrosis, normal epithelium, and blood.

5. Spatial Region Definition

Histomics features are computed within spatially defined regions (bands) that capture the tumour–stroma interface at biologically meaningful scales. Let denote the full tissue domain of a whole-slide image, discretised on a regular pixel lattice at resolution . All spatial computations are performed at this resolution, ensuring invariance across different TCGA scanners. The tissue segmentation model (§4) partitions into disjoint compartments; after TIL reclassification (§4), the effective partition is:

where is cancerous epithelium, is stroma (including reclassified TIL regions), is necrosis, is normal epithelium, is blood, and is the union of non-tissue classes (whitespace, junk/debris, other) excluded from all downstream computation. The compartments are pairwise disjoint and their union covers .

Resolution standardisation

Let denote the native resolution of a given scanner (in μm/px). Before any spatial computation, compartment masks are resampled to the common lattice at by nearest-neighbour interpolation. All distance thresholds, area thresholds, and kernel sizes defined below are in physical units (μm, μm²) and are converted to pixel counts via .

Signed distance transform

Let denote the inner boundary of the tumour compartment under 8-connectivity, where is the set of 8-connected neighbours of pixel on the lattice. The signed Euclidean distance transform assigns to each pixel a value :

where is the Euclidean norm in . In practice, is computed using the exact Euclidean distance transform on the discrete pixel lattice (Maurer et al., 2003), which yields the exact L2 distance from each pixel to the nearest boundary pixel. By convention, inside the tumour and outside; pixels exactly on the boundary satisfy . This transform is the basis for defining the four tumour–stroma bands below.

Analogously, for necrosis, let denote the inner boundary of and define the unsigned distance to the necrosis boundary:

Band definitions

Five spatial bands are defined as subsets of via the distance transforms and . Each band is a set of pixels satisfying a distance predicate:

The first four bands are pairwise disjoint and satisfy and . Note: the stromal bands () include all non-tumour tissue pixels within the distance range, not only pixels classified as stroma. A necrotic or normal-epithelium pixel at is included in . This convention captures the full peritumoral microenvironment regardless of tissue class. Cell-type-specific density features are then computed per band, so the tissue composition within each band is resolved at the feature level rather than the band definition level. The necrosis ring may overlap with any of the four tumour–stroma bands.

BandPredicatePurpose
Tumor front Invasive margin (tumour side)
Tumor core Deep tumour interior
Stroma near Peritumoral stroma
Stroma far Extended stromal microenvironment
Necrosis ring Perinecrotic zone

Island removal

Let denote the connected components of a compartment (where "comp" is Tum, Str, or Nec), computed under 8-connectivity on the pixel lattice. Let denote the physical area of component in μm². Components below the area threshold are removed:

The filtered compartments replace the originals before computing the distance transforms and . This prevents noisy segmentation fragments from producing spurious boundary pixels that would distort the distance transform and introduce artefactual band regions.

Macro-tumour mask

Let denote a disk structuring element of radius . The macro-tumour mask is obtained by morphological closing of the filtered tumour compartment :

where denotes Minkowski dilation and Minkowski erosion. The closing bridges gaps smaller than in infiltrative tumours where thin stromal septa fragment the tumour region. The macro-tumour mask is used exclusively for the quality-control metric micro_interface_ratio, not for primary feature computation.

Growth pattern classification

Slides are classified into growth pattern regimes based on the front fraction , defined as the ratio of tumour-front area to total tumour area:

where denotes pixel count (proportional to physical area at fixed ). Growth pattern regimes:

  • Mass-forming (): compact tumour with a small invasive margin relative to total area
  • Intermediate (): mixed growth pattern
  • Infiltrative (): highly dispersed tumour with most tumour area near the stromal interface

The same features are computed for all slides regardless of growth pattern; the regime label aids interpretation of spatial features and is not used as a filter.

Design rationale

  • 50 μm front band: approximately 5 cell diameters (typical epithelial cell diameter ≈ 10 μm), capturing the invasive margin zone where tumour–stroma interactions are most active
  • 200 μm stroma far cutoff: beyond this distance, tumour–stroma interaction effects attenuate, as shown by spatial analyses of immune cell infiltration gradients (Saltz et al., 2018; Keren et al., 2018)
  • resolution: balances spatial precision with computational tractability and is invariant across TCGA scanner platforms
  • island threshold: corresponds to pixels at , removing objects smaller than a single cell cluster
  • Identical parameters for all 24 cancer types: ensures cross-cancer comparability of all histomics features
Statistical Methods

6. Feature Preprocessing

Let denote the raw feature matrix, where is the number of samples and is the number of morphological features. Each sample belongs to exactly one cancer type ; we write for the submatrix of samples of type , with . Log-transformation and winsorisation are applied to the full dataset; z-score standardisation is applied differently depending on the downstream consumer (see Standardisation below).

This preprocessing is applied exclusively for statistical modelling and embedding computation (UMAP, K-means clustering, Cox regression, RMST, Spearman correlations, and categorical association tests). Feature values displayed in tables and tooltips throughout the platform are the raw entries of ; winsorisation, log-transformation, and standardisation are never applied to displayed data.

Log transformation

Let with denote the index set of right-skewed features (densities, ratios, heterogeneity measures). For each feature and sample , we apply the transform to stabilise variance and attenuate the influence of extreme values:

The remaining features, which are approximately symmetric, are left unchanged. The full list of log-transformed features is defined in the preprocessing configuration.

Winsorisation

To limit the influence of outliers, all features are winsorised at the 0.5th and 99.5th percentiles across the full dataset. Let denote the -quantile of feature computed over all samples (after the log-transformation step). The winsorised values are:

where . The two global preprocessing steps are applied in order: (1) log-transformation on the right-skewed features, (2) winsorisation at the 0.5th and 99.5th percentiles. Z-score standardisation, when applied, occurs downstream and its scope depends on the analysis (see below).

Standardisation

Z-score standardisation is applied differently depending on the downstream analysis. Let and denote the sample mean and standard deviation of the winsorised feature computed over a reference set of samples (specified below):

The scope of standardisation varies by consumer:

  • Embeddings and clustering (§7, §8): global standardisation across all samples, yielding the matrix with column-wise zero mean and unit variance over the full cohort. This ensures that embedding distances and cluster assignments are comparable across cancer types.
  • Cox regression (§9): per-cancer-type standardisation at analysis time. For each cancer type , features are standardised to zero mean and unit variance within the samples of that cancer type. This ensures that hazard ratios are interpretable as per-standard-deviation effects within each cancer type.
  • Spearman correlations (§10): no z-score standardisation is applied. Spearman's rank correlation operates on ranks of the preprocessed (log-transformed and winsorised) values; the rank transformation is the natural standardisation for rank-based methods.

7. Dimensionality Reduction

Two-dimensional embeddings are computed for interactive visualisation. The input is the preprocessed feature matrix from §6 (log-transformed and winsorised). Because some entries may be missing, we first impute each missing value with the column median of the corresponding feature across all samples, then apply global z-score standardisation (§6), yielding the complete matrix (column-wise zero mean, unit variance over the full cohort).

UMAP

Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018) computes a map that approximately preserves local neighbourhood structure. Applied row-wise to , it produces the embedding matrix:

where contains the two-dimensional coordinates used for scatter-plot visualisation. Hyperparameters are set to n_neighbors=15, min_dist=0.1, with Euclidean distance in the ambient space . These balance local neighbourhood preservation with global structure visibility. A fixed random seed (random_state=42) ensures reproducibility.

8. Clustering and Stability Assessment

K-means clustering

Samples are clustered using K-means on the imputed, re-standardised matrix from §7. K-means seeks a partition of the samples and a set of centroids that minimise the within-cluster sum of squares (inertia):

where is the -th row of . The algorithm is run with 10 random initialisations (n_init=10), retaining the partition that achieves the smallest .

Hierarchical cluster levels

Two levels of clustering granularity are provided. L1 (coarse) is computed at fixed values (5, 10, 15, 20) with selected for downstream analysis based on silhouette analysis as a balance between granularity and interpretability. L2 (fine, cancer-specific) uses the elbow method to select the optimal within the range . Both and are min-max normalised to , and the optimal is the value that maximises the perpendicular distance from the point to the line segment connecting the endpoints and in the normalised plane. This hierarchy allows users to explore broad morphological groupings or finer-grained subtypes.

Bootstrap stability assessment

Cluster stability is assessed via bootstrap resampling (50 iterations, 80% subsample without replacement). For each bootstrap replicate , K-means is re-run to obtain a bootstrap partition , which is compared to the full-data partition using two metrics:

  • Adjusted Rand Index (ARI): the chance-corrected Rand index measuring overall partition agreement. ARI = 1 indicates perfect agreement; ARI = 0 corresponds to the expected value under random assignment.
  • Jaccard similarity: computed per cluster as the overlap with its best-matching bootstrap cluster :
    capturing per-cluster recovery on a scale.

Validation metrics

Internal cluster quality is quantified using three complementary metrics.

Silhouette score. For each sample , let be the mean distance to all other samples in the same cluster, and the mean distance to samples in the nearest neighbouring cluster. The silhouette coefficient is:

The overall silhouette score is .

Calinski-Harabasz (CH) index. Let be the between-cluster dispersion, where is the global centroid. Then:

Higher CH indicates more compact, well-separated clusters.

Davies-Bouldin (DB) index. Let be the mean intra-cluster distance in cluster and the inter-centroid distance. Then:

Lower DB indicates better cluster separation. These three metrics are reported alongside stability results to provide a comprehensive assessment of clustering quality.

9. Survival Analysis

Cox proportional hazards regression

The association between each morphological feature and time-to-event outcomes is estimated using Cox proportional hazards regression. For sample , define the covariate vector , where is the feature value standardised to zero mean and unit variance within cancer type at analysis time (per-cancer-type standardisation, §6) and are the confounding covariates defined in §16 (the number of covariates depends on the adjustment model: for unadjusted, for minimal, for full). The Cox model specifies the hazard function as:

where is the unspecified baseline hazard, is the coefficient for the morphological feature, and is the vector of covariate coefficients. Models are fit by partial-likelihood maximisation using the lifelines library with no L2 penalisation (penalizer=0.0). The hazard ratio for feature is:

with 95% Wald confidence interval:

where is the standard normal quantile function () and is the standard error derived from the observed Fisher information matrix. Two-sided Wald p-values test . Since features are standardised within cancer type, hazard ratios are interpretable as the multiplicative change in hazard per one standard deviation increase in the feature.

Proportional hazards assumption

The proportional hazards assumption is tested for each model using the Schoenfeld residual test with the Kaplan-Meier time transform. The resulting p-value determines a three-level flag:

FlagConditionAction
PassNo evidence of PH violation; Cox HR reported without caveat
WarnBorderline violation; HR reported with a warning annotation
FailStrong evidence of non-proportionality; RMST difference becomes the primary reported effect

When the PH assumption is violated (fail), the restricted mean survival time (RMST) difference is reported as the primary effect measure instead of the hazard ratio.

Restricted mean survival time (RMST)

RMST provides a clinically interpretable, assumption-free measure of survival benefit. Let be the ordered distinct event times, the number of events at , and the number at risk just before . The Kaplan-Meier estimator of the survival function is:

The RMST at restriction time is defined as the area under the Kaplan-Meier curve:

The variance is estimated using the Irwin (1949) formula:

The restriction time is set to 1,095 days (3 years) by default, with cancer-specific overrides for aggressive tumours (730 days for GBM, PAAD, MESO) and indolent tumours (1,825 days for LGG, THCA, PRAD). These horizons are chosen to ensure adequate follow-up and event counts within each cancer type.

Samples are split into two groups by the median of feature : group (above median) and (at-or-below median). The RMST difference is:

95% confidence intervals for are obtained by bootstrap percentile method (1,000 resamples). Statistical significance is assessed via two-sided permutation test (5,000 permutations), where the p-value is the proportion of null differences with absolute value at least as large as the observed difference.

Kaplan-Meier curves

For visualisation, continuous features are dichotomised at the median (or split into quartiles) to produce Kaplan-Meier survival curves. These serve as visual summaries and are not used for formal inference.

Sample size requirements

Cox models require a minimum of subjects with at least 10 observed events. RMST analyses require subjects with at least 5 per group. Analyses not meeting these thresholds are excluded and flagged as having insufficient data.

Cluster-level survival comparisons

For cluster-level survival comparisons, a two-sided log-rank test is computed between cluster members and non-members, with Benjamini-Hochberg correction applied within each cluster level, analysis type, cancer type, and endpoint.

10. Continuous Association Analysis

Spearman rank correlation

Associations between morphological features and continuous molecular variables are quantified using Spearman's rank correlation coefficient. Let be the vector of preprocessed feature values (log-transformed and winsorised per §6) and the vector of molecular target values for the samples in a given cancer type. Define the rank operator that replaces each value by its rank among the entries, with tied values receiving the average of their ranks. Spearman's rank correlation is:

Spearman's is robust to non-linear monotonic relationships and is not sensitive to the distributional assumptions that limit Pearson's .

Partial Spearman correlation

To control for confounders, partial Spearman correlations are computed using Freedman-Lane residualisation. Let denote the covariate matrix (§16). All variables are first rank-transformed: , , and each column of is likewise replaced by its ranks. OLS residuals are then computed:

The partial Spearman correlation is the Pearson correlation of these residuals:

For inference under the null , the Freedman-Lane procedure permutes the outcome residuals while preserving covariate structure. Let be the fitted values and a permutation matrix. The permuted outcome is:

This breaks the feature-outcome association while preserving the covariate structure under the null.

Inference

P-values are computed by permutation test (5,000 two-sided permutations). The permutation p-value is defined as the proportion of null correlations with absolute value at least as large as the observed value. 95% confidence intervals for are obtained by bootstrap percentile method (1,000 resamples).

Fisher z-transform for MDES

For power calculations (§14), the Fisher z-transform variance-stabilises the correlation coefficient:

Under , the standard error of is:

where is the sample size and is the number of covariates in the partial model. The inverse transform maps back to the correlation scale (used in §14 for MDES computation).

11. Categorical Association Analysis

Two-group comparisons

For binary categorical variables (e.g., mutation status), the Mann-Whitney U test is used with a two-sided alternative. Let and be the preprocessed feature values in group 0 and group 1 respectively, with group sizes and . The effect size is quantified by Cliff's delta:

where is the sign function. Cliff's , with 0 indicating no stochastic dominance. 95% confidence intervals are obtained by bootstrap percentile method (1,000 resamples).

Multi-group comparisons

For categorical variables with levels, the Kruskal-Wallis H test (omnibus, non-directional) is used. Let be the total sample size and the rank sum for group , where is the rank of sample among all observations. The Kruskal-Wallis statistic is:

The effect size is reported as the bias-corrected epsilon-squared (labelled in the platform for brevity), with 95% bootstrap percentile confidence intervals (1,000 resamples):

For the unadjusted model, inference uses the exact or asymptotic distribution of the test statistic (Mann-Whitney U or Kruskal-Wallis H) as implemented in scipy.stats. Freedman-Lane permutation testing is used only for the covariate-adjusted models described below.

Covariate-adjusted tests (rank-ANCOVA)

For adjusted models, a rank-based ANCOVA is performed using Freedman-Lane permutation (1,000 permutations). Let be the vector of rank-transformed feature values, the group indicator matrix, and the rank-transformed covariate matrix. Define the full and reduced OLS models:

Let and be the residual sums of squares from the full and reduced models respectively. The group sum of squares and F-statistic are:

where is the number of covariates. The partial eta-squared is:

The Freedman-Lane procedure residualises on the covariates, then permutes these residuals to generate the null distribution of . The permutation p-value is the proportion of null -statistics at least as large as the observed value.

12. Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA) is performed to identify molecular pathways enriched in morphologically defined clusters, following the method of Subramanian et al. (2005). Enrichment is computed against MSigDB Hallmark gene sets (Liberzon et al., 2015) intersected with the 133-gene curated panel, plus a set of hand-curated pathway signatures (immune, cell cycle, EMT, etc.). After intersection with the panel, gene set sizes are substantially reduced; sets with fewer than 5 genes after intersection are excluded from analysis.

Ranking metric

Genes are ranked by Welch's t-statistic (unequal variance) comparing expression in cluster members versus non-members. Welch's t is used rather than the standard t-test to account for potentially unequal group variances that arise from unbalanced cluster sizes.

Enrichment score and normalisation

Let the genes be ordered by decreasing Welch's -statistic: . For a gene set of size , define the running sum at position using the weighted scheme () from the original GSEA method:

The enrichment score is the value of the running sum with maximum absolute deviation from zero:

To account for gene set size, the normalised enrichment score (NES) is computed by dividing the observed ES by the mean absolute null ES, separately for positive and negative enrichments:

where and denote the positive and negative enrichment scores from the permutation null distribution, respectively.

Significance

Null distributions are generated by permuting phenotype labels (2,000 permutations). False discovery rates are estimated using a pooled null NES distribution across all gene sets. Gene sets with FDR are reported as significant, following the conventional GSEA threshold.

13. Multiple Testing Correction

All p-values are corrected for multiple testing using the Benjamini-Hochberg (BH) procedure to control the false discovery rate (FDR) at α = 0.05. Correction is applied within defined families of tests to balance statistical power against false positive control:

Analysis typeGrouping key (correction family)
Cox survival (features)cancer_type × endpoint × model
Cluster survival (Cox)cluster_level × analysis_type × cancer_type × endpoint × model
Cluster survival (log-rank)cluster_level × analysis_type × cancer_type × endpoint
RMSTcancer_type × endpoint × model
Spearman correlationscancer_type × target_set_id × corr_method × model
Categorical associationscancer_type × categorical_var × test_type × model
GSEAPooled null NES per cluster (canonical GSEA FDR)

Each result stores its correction_family_id and n_tests_in_family for full transparency. The family definitions ensure that biologically related comparisons (e.g., all features tested against the same endpoint within one cancer type and model) share a correction burden, while unrelated comparisons across cancer types or analysis modalities do not inflate each other's thresholds.

Permutation p-values throughout (RMST, Spearman, rank-ANCOVA) use the add-one correction of Phipson & Smyth (2010): , where is the count of null statistics at least as extreme as the observed value and is the number of valid permutations. This prevents p-values of exactly zero and provides a conservative estimate.

GSEA uses its own canonical FDR procedure based on pooled null NES distributions, as described in the original method (Subramanian et al., 2005), rather than BH correction.

14. Statistical Power and Minimum Detectable Effect Sizes

For each analysis, HistoAtlas computes the minimum detectable effect size (MDES) at 80% power and (two-sided). The MDES contextualises observed results: a non-significant finding is more informative when the study was well-powered to detect small effects.

Cox regression (Schoenfeld-Freedman)

Let denote the standard normal quantile function, with the two-sided significance quantile and the power quantile. At the defaults , (80% power):

For Cox models, the variance of the log-hazard-ratio estimator under the Schoenfeld (1982) approximation with worst-case binary allocation () is:

where is the number of observed events. The MDES on the hazard ratio scale is then:

Correlation (Fisher z-transform)

For Spearman correlations, the MDES is computed via the Fisher z-transform (§10). Using the standard error from §10:

where is the number of covariates in the partial model and inverts the Fisher z-transform, mapping back from the variance-stabilised scale to the correlation scale.

Mann-Whitney and Kruskal-Wallis (simulation-based)

For nonparametric categorical tests, closed-form MDES formulae are not available. Instead, MDES is computed via simulation-based binary search (2,000 simulations per candidate effect size). The algorithm searches for the smallest effect size that achieves 80% rejection rate at .

15. Evidence Strength Classification

Each analysis result is classified into one of four evidence tiers based on statistical significance, effect magnitude, confidence interval precision, and sample size. The classification integrates multiple dimensions of evidence quality rather than relying on a single p-value threshold:

TierP-valueEffect sizeCI widthSample size
StrongAbove threshold (see below)Narrow
ModerateAbove threshold (see below)Narrow or moderate
Suggestive or CI excludes null (HR = 1, = 0, = 0, or = 0)
Insufficient

Confidence interval width categories

The precision of point estimates is categorised using scale-appropriate thresholds:

ScaleNarrowModerateWide
Ratio (HR)CI upper/lower < 22 ≤ ratio < 4ratio ≥ 4
Additive (, )CI width < 0.30.3 ≤ width < 0.6width ≥ 0.6

For ratio-scale quantities (hazard ratios), the CI width is measured as the ratio of upper to lower bounds. For additive quantities (correlations, Cliff's delta), the CI width is the difference between bounds. These categories are calibrated to distinguish clinically informative from uninformative estimates.

Effect size thresholds

Each analysis type has preset minimum effect size thresholds for the “Strong” and “Moderate” tiers. The specific thresholds are:

AnalysisEffect metricStrong / Moderate threshold
SurvivalHR≥ 1.5 or ≤ 0.667
Correlation≥ 0.3
Categorical (2-group)≥ 0.3
Categorical (k-group) or ≥ 0.06

16. Confounding Adjustment Framework

All association analyses (survival, correlation, categorical) are run under three nested covariate adjustment models:

ModelCovariatesPurpose
UnadjustedNone (feature only)Marginal association
MinimalAge, sexDemographic confounding
FullAge, sex, AJCC pathological stage, tissue source site, tumor purityComprehensive confounding adjustment

By reporting results across all three models, users can assess the robustness of associations to confounding. An effect that attenuates substantially from the unadjusted to the full model suggests confounding, while a stable effect provides stronger evidence of a direct morphology-outcome relationship.

Covariate encoding

Continuous covariates (age, purity) enter models as-is. Categorical covariates (sex, AJCC pathological stage, tissue source site) are one-hot encoded with drop_first=True to avoid multicollinearity.

Missing data handling

Complete case analysis is used: observations with missing values in any covariate are excluded from the adjusted models. The sample size for each model is reported to allow assessment of the impact of missing data on statistical power.

Analysis typeMinimum nAdditional requirement
Cox proportional hazards30Events ≥ 10
RMST20≥ 5 per group
Spearman correlation30
Categorical (2-group)30≥ 5 per group
Categorical (k-group)30≥ 5 per group

17. Software and Reproducibility

All analyses are implemented in Python and orchestrated by a Snakemake workflow that defines a directed acyclic graph (DAG) of computational dependencies. Key libraries include:

  • lifelines: Cox regression and Kaplan-Meier estimation
  • scipy: Mann-Whitney U, Kruskal-Wallis, and permutation tests
  • scikit-learn: K-means clustering, silhouette scores
  • umap-learn: UMAP embedding
  • statsmodels: OLS regression for Freedman-Lane residualisation
  • numpy / pandas: numerical computation and data handling

Exact library versions are pinned in the project lockfile for reproducibility.

All random processes (permutation tests, bootstrap resampling, K-means initialisation) use explicit random seeds for reproducibility. The pipeline supports a dry-run mode that subsets to 3 cancer types, 10 features, and 500 samples for rapid validation of statistical outputs before full production runs.

18. References

  1. Cox, D.R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B, 34(2), 187–220.
  2. Kaplan, E.L. & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282), 457–481.
  3. Irwin, J.O. (1949). The standard error of an estimate of expectation of life, with special reference to expectation of tumourless life in experiments with mice. Journal of Hygiene, 47(2), 188.
  4. Royston, P. & Parmar, M.K. (2013). Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Medical Research Methodology, 13, 152.
  5. Schoenfeld, D. (1982). Partial residuals for the proportional hazards regression model. Biometrika, 69(1), 239–241.
  6. Freedman, D. & Lane, D. (1983). A nonstochastic interpretation of reported significance levels. Journal of Business & Economic Statistics, 1(4), 292–298.
  7. Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
  8. Subramanian, A. et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.
  9. McInnes, L., Healy, J. & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv preprint, arXiv:1802.03426.
  10. Cliff, N. (1993). Dominance statistics: ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509.
  11. Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.
  12. Fisher, R.A. (1921). On the “probable error” of a coefficient of correlation deduced from a small sample. Metron, 1, 3–32.
  13. Kruskal, W.H. & Wallis, W.A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583–621.
  14. Mann, H.B. & Whitney, D.R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
  15. Phipson, B. & Smyth, G.K. (2010). Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 39.
  16. Mölder, F. et al. (2021). Sustainable data analysis with Snakemake. F1000Research, 10, 33.
  17. Davidson-Pilon, C. (2019). lifelines: survival analysis in Python. Journal of Open Source Software, 4(40), 1317.
  18. Liberzon, A. et al. (2015). The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems, 1(6), 417–425.
  19. Thorsson, V. et al. (2018). The Immune Landscape of Cancer. Immunity, 48(4), 812–830.e14.
  20. Adjadj, B. et al. (2025). Towards comprehensive cellular characterisation of H&E slides. arXiv preprint, arXiv:2508.09926.
  21. Amgad, M. et al. (2019). Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics, 35(18), 3461–3467.
  22. Filiot, A. et al. (2023). Scaling Self-Supervised Learning for Histopathology with Masked Image Modeling. medRxiv preprint, doi:10.1101/2023.07.21.23292757.
  23. Liu, J. et al. (2018). An Integrated TCGA Pan-Cancer Clinical Data Resource to drive high-quality survival outcome analytics. Cell, 173(2), 400–416.e11.
  24. Ellrott, K. et al. (2018). Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Systems, 6(3), 271–281.e7.
  25. Newman, A.M. et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453–457.
  26. Aran, D., Hu, Z. & Butte, A.J. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biology, 18, 220.
  27. Carter, S.L. et al. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5), 413–421.
  28. Hörst, F. et al. (2024). CellViT: Vision Transformers for precise cell segmentation and classification. Medical Image Analysis, 94, 103143.
  29. Saltz, J. et al. (2018). Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Reports, 23(1), 181–193.e7.
  30. Salgado, R. et al. (2015). The evaluation of tumor-infiltrating lymphocytes (TILs) in breast cancer: recommendations by an International TILs Working Group 2014. Annals of Oncology, 26(2), 259–271.
  31. Keren, L. et al. (2018). A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell, 174(6), 1373–1387.e19.
  32. Maurer, C.R., Qi, R. & Raghavan, V. (2003). A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(2), 265–270.