An Analytic Pipeline to Obtain Reliable Genetic Ancestry Estimates from Tumor-Derived RNA Sequencing Data

Published in Cancer Epidemiology, Biomarkers & Prevention, 2025

This study presents a validated computational pipeline for estimating global genetic ancestry directly from tumor-derived RNA sequencing (RNASeq) data, enabling ancestry inference when germline DNA is unavailable. The pipeline was developed on archival formalin-fixed paraffin-embedded (FFPE) tumor tissues from women with epithelial ovarian cancer (EOC). It integrates SeqKit, HISAT2, SAMtools, BCFtools, PLINK, and ADMIXTURE, with 1000 Genomes Project samples as the reference. RNA-derived estimates of African and European ancestry correlated strongly with estimates from germline DNA (correlations of 0.76–0.94), supporting the method's accuracy. This approach substantially expands analytic opportunities for cancer studies that rely on archival tumor tissue, particularly in admixed populations, by supporting ancestry-informed analyses of tumor biology and survival outcomes even in the absence of germline samples.
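
As an illustration of the validation step, the sketch below compares RNA-derived and germline-derived ancestry proportions for the same samples. It is not the authors' code. It assumes Pearson correlation (the summary above does not name the statistic) and the whitespace-delimited `.Q` layout that ADMIXTURE writes, with one row per sample and one column per ancestral population. The helper names `parse_q_lines`, `column`, and `pearson`, and all sample values, are hypothetical.

```python
from math import sqrt


def parse_q_lines(lines):
    """Parse ADMIXTURE-style .Q rows into lists of ancestry proportions.

    Assumes one whitespace-delimited row per sample; blank lines are skipped.
    """
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def column(rows, k):
    """Return the k-th ancestry proportion (0-based) for every sample."""
    return [row[k] for row in rows]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical proportions for four samples and two ancestral populations.
    rna_rows = parse_q_lines(["0.82 0.18", "0.10 0.90", "0.55 0.45", "0.95 0.05"])
    dna_rows = parse_q_lines(["0.80 0.20", "0.05 0.95", "0.60 0.40", "0.97 0.03"])

    # Compare the first ancestry component across the two data sources.
    r = pearson(column(rna_rows, 0), column(dna_rows, 0))
    print(f"correlation for component 0: {r:.3f}")
```

In practice the rows of the two files would need to be matched by sample identifier before comparison, and the column order of the ancestry components must agree between runs.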

Recommended citation: Johnson C.E., **Ran X.**, Wrobel J., Davidson N.R., Greene C.S., Epstein M.P., Marks J.R., Peres L.C., Doherty J.A., & Schildkraut J.M. (2025). "An Analytic Pipeline to Obtain Reliable Genetic Ancestry Estimates from Tumor-Derived RNA Sequencing Data." Cancer Epidemiology, Biomarkers & Prevention. PMID:40622249