Pathway analysis of bladder cancer genome-wide association study identifies novel pathways involved in bladder cancer development

Genome-wide association studies (GWAS) are designed to identify individual regions associated with cancer risk, but only explain a small fraction of the inherited variability. Alternative approach analyzing genetic variants within biological pathways has been proposed to discover networks of susceptibility genes with additional effects. The gene set enrichment analysis (GSEA) may complement and expand traditional GWAS analysis to identify novel genes and pathways associated with bladder cancer risk. We selected three GSEA methods: Gen-Gen, Aligator, and the SNP Ratio Test to evaluate cellular signaling pathways involved in bladder cancer susceptibility in a Texas GWAS population. The candidate genetic polymorphisms from the significant pathway selected by GSEA were validated in an independent NCI GWAS. We identified 18 novel pathways (P < 0.05) significantly associated with bladder cancer risk. Five of the most promising pathways (P ≤ 0.001 in any of the three GSEA methods) among the 18 pathways included two cell cycle pathways and neural cell adhesion molecule (NCAM), platelet-derived growth factor (PDGF), and unfolded protein response pathways. We validated the candidate polymorphisms in the NCI GWAS and found variants of RAPGEF1, SKP1, HERPUD1, CACNB2, CACNA1C, CACNA1S, COL4A2, SRC, and CACNA1C were associated with bladder cancer risk. Two CCNE1 variants, rs8102137 and rs997669, from cell cycle pathways showed the strongest associations; the CCNE1 signal at 19q12 has already been reported in previous GWAS. These findings offer additional etiologic insights highlighting the specific genes and pathways associated with bladder cancer development. GSEA may be a complementary tool to GWAS to identify additional loci of cancer susceptibility.


INTRODUCTION
Urinary bladder cancer is the fourth most common cancer in men in U.S. Estimates in 2015 indicate urinary bladder cancer affects 56,320 males and 17,680 females, and will lead to 16,000 deaths in the U.S. [1] . Bladder cancer is a heterogeneous disease attributed to many risk factors. The number one risk factor is tobacco smoking, which explains 30-50% of bladder cancer risk [2]. Occupational exposure to chemicals [3,4], genetic factors, and other environmental factors such as dietary factors, lifestyle factors, medical factors, fluid intake, also contribute to bladder cancer carcinogenesis [5], although some of the risk factors are inconclusive and vary in different studies. There is substantial evidence that there is an important genetic contribution to susceptibility to bladder cancer, initially based on familial clustering of bladder cancer. Epidemiologic studies have demonstrated a two-fold elevation in bladder cancer risk among firstdegree relatives of bladder cancer patients [6,7]. A segregation analysis in 1,193 families suggests a paucity of high penetrance gene for sporadic bladder cancer but instead many low penetrance genes with modest effects [8], indicative of a complex, polygenic model [9]. A large population based twin study evaluated the contribution of hereditary factors to the causation of various sporadic cancers and estimated the genetic contribution of bladder cancer to be roughly 30% [10].
Recent GWAS have provided valuable insights into the genetic basis of human disease but GWAS do not fully explain heritability in sporadic cancers [21]. So far, the low estimated effect sizes of the individual SNPs account for a small portion of the heritability of bladder cancer. In addition, majority of disease associated SNPs found in GWAS are tagging SNPs at non-genic regions without clear functional implication (http://www.genome. gov/gwastudies/). They may be highly correlated with the variants directly associated with bladder cancer susceptibility [22].

DISCUSSION
In this study, we pursued a pathway approach in the Texas bladder cancer GWAS data using three GSEA methods including Gen-Gen, Aligator, and SNP Ratio Test. We identified 18 promising pathways out of 781 predefined gene-sets, which were associated with bladder cancer risk according to our screening criteria. The top five significant pathways involved cell cycle control at G1 and S phase, NCAM and PDGF induced intracellular signaling, and unfolded protein response. From these top five pathways, 17 SNPs from CCNE1, RAPGEF1, SKP1, HERPUD1, CACNB2, CACNA1C, CACNA1S, COL4A2, SRC, CACNA1C, appear to be associated with bladder cancer risk and were subsequently observed in the NCI bladder cancer GWAS.
The pathways highlighted were two cell cycle related pathways "BIOCARTA_RACCYCD_PATHWAY", and "BIOCARTA_SKP2E2F_PATHWAY" at G1 and S phase (Supplementary Figure 2A). The G1/S phase transition is the rate-limiting step in cell cycle. This process is sequentially and coordinately regulated by the formation of several Cyclin-Cyclin Dependent Kinase (CDK) complexes, for example, Cyclin-D -CDK4/6 complex for G1 progression, Cyclin-E -CDK2 complex for the G1-S transition, and Cyclin-A -CDK2 complex for S-phase progression [35,36]. Disruption of these complexes leads to either cell cycle arrest or uncontrolled cell cycle proliferation. Somatic and germline alterations of this pathway had been found in bladder cancer and other tumors [37][38][39][40]. In particular, the expression of Cyclin E1 has been correlated with more advanced and invasive bladder cancer, as well as poor clinical outcomes [41]. In our GSEA, the most significant SNP after validation in NCI population is CCNE1 rs8102137, which maps to the *The overlapping genes in "REACTOME_SIGNALING_BY_PDGF" and "REACTOME_NCAM_SIGNALING_FOR_NEURITE_OUT_GROWTH", and "BIOCARTA_RACCYCD_PATHWAY" and "BIOCARTA_SKP2E2F_PATHWAY" were indicated by asterisk. chromosome 19q12 at the 5' flanking region of CYCLIN E1 gene with P value of 1×10 -6 ( Table 3). This SNP has been reported in previous GWAS, which is reassuring for the analytical approach [15]. CCNE1 rs8102137 is also linked with the top signal rs60560217 in our imputation results (Figure 2A). There were no functional implications for rs60560217. Rs997669, located at the intron 4 of CCNE1, was significantly associated with bladder cancer risk (pooled P value of 3.6×10 -5 ) independent of the above two SNPs.
In the "REACTOME_NCAM_SIGNALING_FOR_ NEURITE_OUT_GROWTH" pathway (Supplementary Figure 2B), the neural cell adhesion molecule, NCAM, belongs to the immunoglobulin superfamily. The NCAM induced intracellular signaling not only functions in neuronal differentiation, synaptic plasticity, and * "REACTOME_SIGNALING_BY_PDGF" and "REACTOME_NCAM_SIGNALING_FOR_NEURITE_OUT_GROWTH" had overlapping SNPs. The shared SNPs and Genes were indicated by asterisk. "BIOCARTA_RACCYCD_PATHWAY" and "BIOCARTA_SKP2E2F_PATHWAY" had overlapping SNPs. The shared SNPs and Genes were indicated by asterisk. MAF-minor allele frequency regeneration, but is also involved in the regulation of growth factor signaling, and cytoskeletion, etc. In the "REACTOME_SIGNALING_BY_PDGF" pathway (Supplementary Figure 2C), the binding of Plateletderived growth factors (PDGF) to its two tyrosine kinase receptor induces the receptor dimerization and autophosphorylation, and enables the activation of many downstream molecules, such as SRC, PI3K, CRK, STAT, SHP-2, NCK, GAP, SHC, GRB2, GRB7, and PLC-γ1. Therefore PDGF can elicit the crosstalk of many downstream pathways, for example RAS-RAF-MEK-MAPK and PI3K-AKT pathways, to influence diverse functions, such as cell growth and motility [42].
In growth factor mediated intracellular pathways (Supplementary Figure 2B and 2C), RAPGEF1 rs7040470 was significantly associated with bladder cancer risk and the significance level reached 1.2×10 -5 in the pooled analysis (Table 3). RAPGEF1 maps to 9q34.13 and encodes the RAP guanine nucleotide exchange factor 1. RAPGEF1 regulates the RAS-CRK-RAP1 cellular signal transduction system which has shown abnormality in lung carcinogenesis [43,44]. Rs7040470 is at the downstream near gene region of RAPGEF1.
UPR contributes to a critical decision point between homeostasis or apoptosis of cell (Supplementary Figure  2D). During the ER stress, UPR initially decreases protein translation and enhances the unfolded protein degradation response to enforce the cell to maintain a homeostastic status [45]. If unable to maintain homeostasis within a certain time, the cell will commit apoptosis. In the UPR pathway (Supplementary Figure 2D), rs2518054 of HERPUD1was the only significant SNP validated with pooled P value 0.002 (Table 3). HERPUD1, located at 16q13, is an endoplasmic reticulum (ER) resident protein which is up-regulated in response to ER stress [46]. Interestingly, variants in HERPUD1 have been associated with the metabolic syndrome in GWAS [47][48][49] .
The GSEA method is an attractive approach for identifying additional susceptibility signals but it does require both larger sample sizes and independent replication sets to conclusively establish novel loci. GSEA can detect evidence for subtle effects of multiple SNPs in the same gene set, though it does not dissect pleiotropy in a given region. In our GSEA, we not only validated one SNP at CCNE1 from previous GWAS, but also highlighted several novel SNPs, genes, and pathways potentially involved in bladder cancer tumorigenesis (Tables 1-3). Since GSEA grouped millions of SNPs into hundreds of gene sets, the burden of multiple comparisons have been greatly reduced. In addition, incorporation of the biological knowledge into statistical analysis renders our finding more relevant to biological interpretation. The biggest challenge for GSEA is how to define a gene set as misclassification leads directly to a loss of power. Due to the complexity of cell biology, some of the gene sets will be inevitably redundant or overlapping. Thus, associations could be driven by significant genes that are overlapped in different pathways. Another major limitation of GSEA is that it can only assess SNPs in or near gene regions so non-genic variants are not considered. Furthermore, the analysis assumes SNPs having only local cis-effects, an assumption that may be limiting. Finally, we only validated the genes and SNPs but not the pathways associated with bladder cancer risk since the NCI data was derived from multiple GWAS genotyping panels (HumanHap 1M, HumanHap610-Quad, HumanHap610, and HumanHap550 equivalents) compared to the MD Anderson data of using a single beadchip (HumanHap610). Differences in gene coverage and the selection of tagSNPs for each gene region in the various Illumina GWAS panels precluded us from confirming the results at the pathway level.
In summary, we implemented three different GSEA methods as internal validation to identify the biological pathways consistently associated with bladder cancer risk and also validated our results in an independent NCI population, which may reduce false discovery in our findings. GSEA is a complementary tool to identify additional genetic contributions to the heritability of bladder cancer, and may also be applicable to clinical outcome studies [50] by incorporating the biological pathway information into GWAS analysis. Our findings may pinpoint potential pathway targets for cancer prevention and treatment and to improve the risk prediction model of bladder cancer. However, to pursue these strategies, further research are needed to validate, fine-map and conduct functional characterization to pinpoint the variants directly associated with bladder cancer risk.

Study population for primary GWAS
Study population was derived from our previous published GWAS (14), which included a total of 969 Caucasian cases and 957 Caucasian controls. Cases were recruited from MD Anderson Cancer Center and Baylor College of Medicine between 1999 and 2007. They were newly diagnosed bladder cancer patients, histologically confirmed, and previously untreated (ICD codes 188.1-188.9). There were no restrictions on age, sex, ethnicity, and cancer stage in case recruitment. Control subjects were recruited from Kelsey Seybold clinics and were frequency-matched to cases by age (±5 years), sex and ethnicity. We restricted our analysis to Caucasians, due to the small number of minority participants in our population. All epidemiology data were collected by trained interviewers after signing of the consent form by study participants. The study was approved by the institutional review boards of MD Anderson Cancer Center, Baylor College of Medicine, and Kelsey-Seybold Clinic. No inflation was found in the study population structure [14]. All leukocyte DNAs were genotyped by Illumina HumanHap610 chip. Quality control for genotyping has been described previously (14). Briefly, cases and controls were excluded from analysis if they had genotyping call rates less than 95%; were found on review not to be of European ancestry; or were found to be duplicated samples, not matched according to established criteria, or to have reported a sex that did not match with X chromosome heterozygosity. We also excluded samples that deviated by more than 4 standard deviation from other study subjects using similarity in genotypes implemented in PLINK [51]. We randomly selected 2% of the samples for duplicate genotyping. The concordance of SNP genotype calls was > 99% for duplicated samples. Among the 620,901 markers on HumanHap610 chip, we excluded those that were copy number variation markers, did not yield genotypes, variants with minor allele frequency (MAF) less than 0.01 or with call rate < 95%. We further removed SNPs that deviated from Hardy-Weinberg equilibrium in the controls at P < 0.0001. These procedures left 556,429 SNPs for the final analysis.
The validation population consists of the primary scan of the NCI bladder cancer GWAS (available on dbGaP), which includes five studies with 3,532 cases and 5,120 controls of European ancestry [15,34]. These five studies are Spanish Bladder Cancer Study (SBCS), New England, Maine and Vermont Bladder Cancer Study (NEBCS-ME/VT), Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the American Cancer Society Cancer Prevention Study II Nutrition Cohort (CPS-II), and the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO). The same ICD codes as Texas GWAS were used for patient selection.

Pathway definition and annotation
The molecular signature database (http://www. broadinstitute.org/gsea/msigdb/) from Broad Institute was used to define gene sets/pathways, which were composed of positional gene sets, curated gene sets, motif gene sets, computational gene sets and GO gene sets. We downloaded the 880 canonical pathways for GSEA. To avoid the overly narrow and broad definition of a biological pathway, we confined the input pathway to contain 10-100 genes per pathway, resulting in 781 pathways in our GSEA analysis (Supplementary Figure.  1). Among these, 151 pathways were selected from KEGG (http://www.genome.jp/kegg/), 214 were from Biocarta (http://www.biocarta.com/), 377 were from Reactome (http://www.reactome.org/), and 39 were from other resources. Biocarta generally has the smallest pathway size in terms of the number of gene in each pathway, with a median gene number of 18 per pathway. In contrast, KEGG pathway has the largest size with a median gene number of 44 per pathway. The significant pathways selected by GSEA were input into the Ingenuity Pathway Analysis tools (http://www.ingenuity.com/index.html) for functional annotation.

SNP-gene map
Gene information was downloaded from NCBI dbSNP build 36.3. SNP information was from the Illumina HumanHap610 chip and validated by USCS genome browser (http://genome.ucsc.edu/). SNPs were mapped to gene region and ±20KB upstream and downstream of gene boundaries to cover the gene coding region and most of the regulatory components.

Data preparation
To assess the association between each SNP and disease status, we built a 2x2 contingency table by counting the number of times each possible allele appears in a case or control and allelic 1 degree of freedom (d.f.) test implemented in PLINK was performed similar to the primary GWAS analysis [14]. We conducted quantilequantile plot analysis to assess the distribution of chi2 test statistics of all GWAS SNPs using the R installed package snpMatrix (http://www.bioconductor.org/packages/2.3/ bioc/html/snpMatrix.html) and hexbin (http://cran.rproject.org/web/packages/hexbin/index.html). Deviation of observed data from expected results might indicate the possibility of population stratification, inadequacy of casecontrol matching, or differential genotyping in cases and controls. We randomly permuted the case-control status 1000 times in order to test the presence of differential genotyping. In each permutated data the same number of cases and controls was generated and an allelic 1 d.f. test statistic and P value was re-calculated for each SNP using the permuted case-control status. We applied three GSEA methods for comparison to determine which pathway(s) associated with bladder cancer were likely true findings, not derived by chance. For all methods, pathways with P values < 0.05 were considered significant.

Pathway analysis GenGen [28]
The statistic value of each gene was represented by the highest statistic value among all SNPs mapped to the gene and then sorted from largest to smallest (r (1) ,…, r (N) ) for all N genes in the GWAS dataset. For any given gene set S, composed of N H genes, a weighted Kolmogorov-Smirnov-like running-sum statistic was calculated that reflects the overrepresentation of genes within the set S www.impactjournals.com/Genes&Cancer at the top of entire ranked list of genes in the genome: , where N R = ∑ Gj*∈s |r (j*) | p , p was the weight to genes with extreme statistic values. The enrichment score, ES(S), indicated the maximum deviation of the sum of the statistic values in gene set S from a set of randomly picked genes in the genome. Normalized enrichment score, was used which enabled the comparisons among different gene sets [27].

Aligator [30]
Aligator utilizes a predefined threshold P-value of 0.01 wherein significant SNPs were defined on the basis of less than the predefined threshold. If a gene had one or more than one significant SNP, the gene was considered significant. Assume the total number of significant gene was K in the overall data. The number of significant genes was counted for each gene set. To determine the statistical significance of the gene set, 5000 replicate gene lists were generated by randomly selecting SNPs from all available SNPs and adding the genes that encompass the SNP to the gene list until the size of the gene list reached K. The P-value for each gene set was evaluated by comparing the number of significant genes from the observed data to the number of significant genes from the 5000 replicate gene lists. To correct for multiple testing, the program randomly selected one replicate gene list as the observed data and sample 5000 gene lists with replacement from the 5000 replicate gene lists. P-values for each gene set were calculated as before, using permutation. This procedure was repeated 1000 times to determine whether there was a significant excess of significant gene sets.

SNP ratio test [29]
For a given pathway W, the SNP ratio r w = the number of significant SNP in W / the number of SNPs in W. The empirical P value for a particular pathway, P = (s+1)/(N+1), where s is the number of simulated datasets that produce a ratio greater than or equal to the original ratio, and N is the total number of simulated datasets.

Gene and SNP analysis
We selected SNPs identified in the pathway analyses above for validation in a second dataset from the already published NCI GWAS [15]. The P values of the distribution of genotypes of these SNPs between case and controls were assessed by allelic 1 degree of freedom (d.f.) test in both Texas and NCI populations. For SNPs to be considered "validated", the differential distribution of the genotypes in cases vs. controls in the validation group is consistent with the discovery population (same direction of change), and also both their associations with bladder cancer risk are significant at P < 0.05. A meta-analysis of the Texas and NCI populations was also used to further support the findings. Imputation of the SNPs at gene region of interest for Texas bladder cancer GWAS data was conducted by Impute 2 software (The University of Oxford http://mathgen.stats.ox.ac.uk/ impute/impute_v2.html) using 1000 genomes data (http:// www.1000genomes.org/) as reference panel.