High-dimensional regression analysis links magnetic resonance imaging features and protein expression and signaling pathway alterations in breast invasive carcinoma
1 Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
2 Department of Statistics, Purdue University, West Lafayette, IN, USA
3 Department of Biostatistics, Emory University, Atlanta, GA, USA
4 Department of Radiology, UT Southwestern, Dallas, TX, USA
5 Department of Diagnostic Radiology, Roswell Park Cancer Institute, Buffalo, NY, USA
6 Department of Radiology, University of Wisconsin—Madison, Madison, WI, USA
7 Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
8 Department of Radiology, MD Anderson Cancer Center, Houston, TX, USA
9 Department of Radiology, University of Miami Health System, Miami, FL, USA
10 Department of Radiology, Mayo Clinic, Rochester, MN, USA
11 Department of Radiology, University of Pittsburgh, Pittsburgh, PA, USA
Correspondence to: Arvind Rao, email:firstname.lastname@example.org
Keywords: breast invasive carcinoma; MRI; protein expression; signaling pathway analysis; TCGA
Received: October 16, 2017
Accepted: December 15, 2017
Published: February 26, 2018
Copyright: Lehrer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC BY 3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Background: Imaging features derived from MRI scans can be used for not only breast cancer detection and measuring disease extent, but can also determine gene expression and patient outcomes. The relationships between imaging features, gene/protein expression, and response to therapy hold potential to guide personalized medicine. We aim to characterize the relationship between radiologist-annotated tumor phenotypic features (based on MRI) and the underlying biological processes (based on proteomic profiling) in the tumor.
Methods: Multiple-response regression of the image-derived, radiologist-scored features with reverse-phase protein array expression levels generated association coefficients for each combination of image-feature and protein in the RPPA dataset. Significantly-associated proteins for features were analyzed with Ingenuity Pathway Analysis software. Hierarchical clustering of the results of the pathway analysis determined which features were most strongly correlated with pathway activity and cellular functions.
Results: Each of the twenty-nine imaging features was found to have a set of significantly correlated molecules, associated biological functions, and pathways.
Conclusions: We interrogated the pathway alterations represented by the protein expression associated with each imaging feature. Our study demonstrates the relationships between biological processes (via proteomic measurements) and MRI features within breast tumors.
Breast cancer is the most common cancer in women , with incidence rates rising since the 1990s . Molecular expression profiling of tumors has been effective in allowing for individualized therapy plans in certain types of breast cancer . Expression of three receptors-- estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor (HER2)—are routinely used to determine optimal treatment plans for breast cancer patients . PR and ER expression are associated with luminal A and B subtypes of breast cancer, with a lower proliferation index and pathological grade . Disease-free and overall survival is lower in HER2 over-expression and triple negative breast cancers when compared to luminal A and B subtypes .
Despite obtaining multiple specimens from percutaneous biopsies as well as analysis of surgical specimens, the temporal and spatial heterogeneity of tumor gene and protein expression cannot be adequately determined [6, 7, 8]. Readily available imaging databases such as The Cancer Imaging Archive (TCIA) are leveraged in order to address the problem of tumor heterogeneity and to predict gene expression and patient responses to therapy based on imaging data . MRI, as well as other modalities, is now used by researchers for extraction of features which correlate with patient responses and gene expression [10-12]. Breast cancer radiomic signatures can potentially predict recurrence when compared with multi-gene assays . Deciphering the associations between imaging features, breast tumor gene/protein expression levels, and patient outcomes holds the potential to guide personalized medicine [12, 14].
High-dimensional variable selection (Supplementary Table S1) is commonly used to analyze relationships between multiple modalities (copy number, expression, etc.) in genomic data. To avoid generating spurious correlations, a number of Bayesian and frequentist approaches have been devised. Bayesian approaches use a sparsity-inducing prior, such as spike-and-slab [15, 16], double-exponential , horseshoe , horseshoe+, or generalized double Pareto prior . Frequentist approaches use penalized regression models: l1, horseshoe+ , generalized double Pareto prior , L1-norm penalty of the LASSO , combined l1/l2 penalty of elastic net , or combined L1 and L2- norm penalty of elastic net . These regression models allow us to ignore the loss of statistical efficiency that occurs through correlation structures because they treat all variables as independent . Several approaches to high-dimensional variable selection in highly-correlated datasets have been taken [24-26]. In this study, we used a Bayesian approach to model the correlation structure as previously described .
Analyzing a cohort of 82 breast cancer patients included in the TCGA database, we built a model correlating MRI-derived imaging features with proteomics data using a high-dimensional regression approach. Though a previous study of 353 breast cancer patients assessed correlations between 21 imaging traits and mRNA transcript levels , to our knowledge our approach has not yet been applied to proteomics data for breast cancer.
Molecules were found to be significantly correlated to each imaging feature (Supplementary Table S2) with the exception of clumped non-mass internal enhancement. These molecules were obtained through high dimensional regression of the RPPA protein expression data on the imaging features set. For example, the axillary lymphadenopathy feature was found to be directly correlated with expression of EIF4EBP1 and PRDX1, and inversely correlated with RAB25, SHC1, XRCC1, and PARK7. Cell surface receptors associated with imaging features are EGFR, KDR, and PDK1.
IPA analysis was implemented to determine the functional implications of the molecules. The IPA software generated p-values and Z-scores for the IPA Canonical Pathways of each feature (Figure 2), as well as scores for the IPA Diseases and Biological Functions of each feature (Figure 3). The Canonical Signaling Pathways most strongly associated with each imaging feature are summarized in Table 2 . The results show that the same proteins are found to be correlated with a specific feature, irrespective of whether the data sets were separated into global or primary features.
In order to determine which features were most strongly associated with functional alterations to signaling pathways, agglomerative unsupervised hierarchical clustering was performed on the p-values and Z-scores (Figures 2 and 3). This analysis separated the features into groups based on the strength of their correlations with altered pathway activity and disease functions. The most strongly deregulated IPA Diseases and Biological functions featured activation Z-scores between -3.5 and +3.5 (Supplementary Table S3).
The strength of the associations between imaging features and protein expression, signaling pathways, and biological functions was computed using a sequential analysis of the protein expression data found through RPPA analysis of MRI scans of the TCGA patients. Correlation coefficients for each possible combination of imaging feature and protein were computed using a high-dimensional regression with a Bayesian selection of covariates. Corrected p-values were computed for each correlation coefficient in order to minimize the false discovery rate (FDR). Only the strongest ten percent of significantly-correlated molecules were analyzed using the standardized Core Analysis workflow in IPA, using correlation coefficients in lieu of gene expression values. The IPA analysis provides associations with pathway activity and pathobiology, allowing for hypotheses regarding the relationship between pathway activity at the cellular level and the manifestations of the alterations at the macroscopic, imaging levels. The activation Z-scores computed from the correlation coefficients indicate whether each pathway (or function) is up- or down-regulated by upstream transcription factor activity. A similar approach integrated breast cancer transcriptomics data with imaging features and extended the interpretation with gene set enrichment analysis to identify metagene signatures such as wound response and hypoxia . Our study extends this approach by leveraging the IPA Knowledge Base to interpret the patterns of protein expression associated with each imaging feature.
In our study, we used a stringent two-step method to select the correlations least likely to result from chance association, overcoming a common issue with high dimensional regression analysis. Despite this, the approach we have described is essentially a hypothesis-generation pipeline, and should be interpreted carefully, following in-vivo perturbation experiments in appropriate model systems.
We found that enhancing rim fraction score, a quantitative MRI feature, was shown to be significantly associated with the expression of the long, non-coding RNA HOTAIR . This expression is known to be associated with breast cancer progression and metastasis . The results of the high dimensional regression method used hints at the molecular underpinnings of macroscopic imaging phenotypes. It is known that MRI features correlate with pathologic stage and lymph node involvement . The results found in this study point to multiple significant associations between molecular expression patterns in the tumor cells and how these manifest as MRI phenotypes .
TCGA patient datasets
Eighty-two patients from multiple institutions with de-identified MRIs and reverse-phase protein array (RPPA) expression data were included in this study. All subject data was de-identified prior to the study through inclusion in The Cancer Genome Atlas (TCGA), and was thus exempt from requiring institutional review board approval, following the terms of the TCGA data use agreement. Imaging data was obtained through The Cancer Imaging Archive (TCIA) database. RPPA protein expression data was obtained from the TCGA through Firehose (https://gdac.broadinstitute.org/).
Scores of twenty-nine MRI semantic features were defined by the TCGA Breast Phenotype Research Group . We used the imaging features as defined by the TCGA group to include mass- and non-mass associated features as shown in Table 3. These feature groups include background features, tumor related features, tumor dimensional features, features associated with the morphology of the non-mass enhancing lesion, and T2-weighted MR acquisition features.
In order to ensure that the effects of each individual feature were appropriately described, the feature set was split into three subsets: one set with only the 8 mass-associated features, one with only the 21 global features, and an aggregate set with all 29 features. The features were isolated in order to determine if there were any significant proteins, associated pathways, or biological functions that appeared in purely global or mass-associated-only feature sets.
High dimensional regression
High dimensional regression was done in Matlab using the joint Bayesian selection of covariates developed by Bhadra and Mallick . In this analysis, the independent variables were the imaging features, and the molecules (proteins and phospho-proteins) were the response variables. This arrangement allowed the expression of each protein to be correlated with the expression of many other proteins. Multiple-testing correction
Multiple testing correction was employed to control the false-discovery rate (FDR) by sequentially designating p-value thresholds . First, the posterior probabilities of the covariates were thresholded at an FDR of 0.25, giving a sparse set of predictors (imaging variables). Second, t-tests were performed using “no-association” as the null hypothesis and “non-zero association” as the alternative hypothesis. The t-tests were computed between each combination of imaging features and molecules in the RPPA dataset. Correlation coefficients with p-values in the 10th percentile and that were less than 0.05 after adjustment for multiple comparisons were considered statistically significant. This approach is similar to that used to discern the relative impact of copy number alterations on messenger RNAs and microRNAs in glioblastoma .Pathway analysis
Pathway analysis was performed on each of the three data sets (based on the image feature subsets) using the “Core Analysis” feature in the IPA software . For the purposes of this analysis, regression correlation coefficients served as expression values. P-values and activation Z-scores were computed internally in IPA as previously described.Hierarchical clustering
Agglomerative unsupervised hierarchical clustering of p-values and activation Z-scores was carried out the using the “Stats” package in R. Euclidean distance matrices were computed and Ward’s method was minimized within-cluster variance .
Conception and design: A.R, A.B Development of methodology: A.R, A.B, M.L Acquisition of data: M.L, A.B, S.A, V.R, Y.Z, B.D, E.B, E.S.B, E.M, E.S, G.J.W, J.N, K.B, M.G, M.Z, A.R
Analysis and interpretation of data: M.L, S.A, A.B, A.R,Writing, review, and/or revision of manuscript: M.L, A.B, S.A, A.R
Administrative, technical, or material support (i.e., reporting or organizing data, constructing database): A.R, E.S, E.M, E.S.B
CONFLICT OF INTEREST
The authors have no conflict of interest to declare.
ACKNOWLEDGMENTS AND FUNDING
A.R. was supported by CCSG Bioinformatics Shared Resource P30 CA016672, an Institutional Research Grant from The University of Texas MD Anderson Cancer Center (MD Anderson), CPRIT RP170719, CPRIT RP150578, NCI R01CA214955-01A1, a Career Development Award from the MD Anderson Brain Tumor SPORE, and a Research Scholar Grant from the American Cancer Society (RSG-16-005-01).
A.B. was supported by grant no. DMS-1613063 from the National Science Foundation. This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
E.S and E.M were funded in part through the NIH/NCI Cancer Center Support Grant P30 CA008748. The sponsor had no involvement in the study design; the collection, analysis and interpretation of data; the writing of the report; and the decision to submit the article for publication.
Last Modified: 2018-03-07 14:44:23 EST