Integrative analysis of DNA methylation and gene expression to identify clear cell renal cell carcinoma diagnostic biomarkers via a machine learning approach
Maryam Sadat Hosseini,1 Parvaneh Nikpour,2,*
1. Department of Genetics and Molecular Biology, Isfahan University of Medical Sciences, Isfahan, Iran 2. Department of Genetics and Molecular Biology, Faculty of Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
Introduction: Kidney tumors are responsible for 2.2% of all cancer diagnoses around the world each year. Approximately 70% of malignant kidney tumors are clear cell renal cell carcinoma (ccRCC).
Because the initial symptoms of ccRCC are obscure, most patients are diagnosed at a late stage, which underscores the need for biomarkers that can aid early detection.
In the current study, our purpose is to integrate methylation and gene expression data from The Cancer Genome Atlas (TCGA) to find potential diagnostic biomarkers for ccRCC via bioinformatics and a machine learning approach.
Methods: Study population:
DNA methylation and gene expression data for the KIRC (kidney renal clear cell carcinoma) project were downloaded from TCGA using the TCGAbiolinks package in R. For DNA methylation, we obtained 160 normal and 324 tumor samples profiled on the Illumina Infinium HumanMethylation450 platform. IlluminaHiSeq RNASeqV2 gene expression data for 72 normal and 533 tumor samples were used for expression analysis.
Differential methylation and expression analyses:
First, probes located on sex chromosomes, containing SNPs, or having missing values were removed. We used the ChAMP package to find differentially methylated CpGs (DMCs). CpGs with adjusted p-values < 0.05 and |delta β| > 0.15 were considered DMCs. The IlluminaHumanMethylation450kanno.ilmn12.hg19 package was used to annotate methylation probes.
The gene expression dataset was normalized with the DESeq2 package. Genes were considered differentially expressed genes (DEGs) if they satisfied the thresholds of |log2 fold change| > 1 and adjusted p-value < 0.05.
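The two thresholding rules above can be illustrated with a minimal Python sketch. It is not the ChAMP or DESeq2 workflow itself. It only applies the stated cutoffs to result records assumed to be already computed. The record layout, the function names (select_dmcs, select_degs), and all values are hypothetical and chosen for illustration.

```python
# Minimal sketch: apply the abstract's cutoffs to precomputed results.
# All identifiers and numbers below are invented for illustration.

def select_dmcs(probes, padj_cutoff=0.05, delta_beta_cutoff=0.15):
    """Keep probes with adjusted p < cutoff and |delta beta| > cutoff."""
    return [
        p["probe"]
        for p in probes
        if p["padj"] < padj_cutoff and abs(p["delta_beta"]) > delta_beta_cutoff
    ]


def select_degs(genes, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with adjusted p < cutoff and |log2 fold change| > cutoff."""
    return [
        g["gene"]
        for g in genes
        if g["padj"] < padj_cutoff and abs(g["log2fc"]) > lfc_cutoff
    ]


probes = [
    {"probe": "cg_A", "delta_beta": -0.30, "padj": 0.001},
    {"probe": "cg_B", "delta_beta": 0.10, "padj": 0.001},  # effect too small
    {"probe": "cg_C", "delta_beta": 0.22, "padj": 0.20},   # not significant
]
genes = [
    {"gene": "gene_1", "log2fc": 2.4, "padj": 0.0005},
    {"gene": "gene_2", "log2fc": -0.6, "padj": 0.0005},    # fold change too small
    {"gene": "gene_3", "log2fc": -1.8, "padj": 0.01},
]

dmcs = select_dmcs(probes)
degs = select_degs(genes)
print(dmcs, degs)
```

Both filters take the absolute value of the effect size, so hyper- and hypomethylated CpGs, and up- and down-regulated genes, are retained alike.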
Protein–protein interaction (PPI) network:
The Search Tool for the Retrieval of Interacting Genes (STRING) was used to create a PPI network with an interaction score cutoff of 0.4. MCODE was used to extract modules from the PPI network. The top two modules were analyzed using the cytoHubba app in Cytoscape.
Screening of diagnostic biomarkers via machine learning:
The recursive feature elimination (RFE) method was used to choose the top three genes from those selected by cytoHubba. These three genes were then used to construct a logistic regression model.
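A minimal scikit-learn sketch of this step is shown here, under stated assumptions. The expression matrix is replaced by synthetic data from make_classification, the gene names are placeholders, and the base estimator inside RFE is assumed to be logistic regression, which the abstract does not specify. The sketch also fits RFE on the training portion only, a choice made here to avoid information leakage that may differ from the order of operations used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an expression matrix: 200 samples x 12 candidate genes.
X, y = make_classification(
    n_samples=200,
    n_features=12,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
gene_names = [f"gene_{i}" for i in range(X.shape[1])]  # placeholder names

# Hold out 30% of the samples for testing, stratified by class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Recursive feature elimination down to three features.
# Assumption: logistic regression is used as the ranking estimator.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_train, y_train)
selected = [g for g, keep in zip(gene_names, selector.support_) if keep]

# Fit the final diagnostic model on the three selected features only.
model = LogisticRegression(max_iter=1000)
model.fit(selector.transform(X_train), y_train)
accuracy = model.score(selector.transform(X_test), y_test)

print("selected features:", selected)
print("test accuracy:", round(accuracy, 3))
```

With heavily imbalanced classes, as in a cohort with far more tumor than normal samples, accuracy alone can be optimistic, so metrics such as sensitivity, specificity, or ROC AUC may be worth reporting alongside it.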
Results: Because promoter methylation has a strong impact on gene expression, we considered only DMCs that mapped to gene promoters, which yielded 1400 CpGs. We then mapped these DMCs to genes and identified 874 unique genes whose promoter methylation level differed significantly between tumor and normal samples.
For expression data, after normalization, a total of 5826 DEGs were found.
The next step was to intersect the 874 differentially methylated genes with the 5826 DEGs. The intersection yielded 303 genes, which we used to construct a PPI network. MCODE was applied for module analysis of the network, and the top two modules were selected for downstream analysis with cytoHubba. For each module, we chose the top 15 genes by each of three criteria (degree, betweenness, and closeness). To determine reliable ccRCC biomarkers, we then intersected these three lists to find the common top genes, of which there were 6 in each module. We imported the expression data of these 12 genes into the scikit-learn library in Python. After feature selection with the RFE method, three genes (CENPM, GAPDH, and LAPTM5) were selected. We split the data into 70% training and 30% test sets and built a diagnostic logistic regression model with the three selected genes on the training samples. The accuracy of the model on the test data was 96%, indicating that the three markers performed well in distinguishing KIRC tumor samples from normal samples.
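The consensus step described above, taking the top-ranked genes under several centrality criteria and keeping only those common to all lists, can be sketched in plain Python. This is an illustrative stand-in, not the cytoHubba implementation. The node names, the scores, and the helper functions (top_k, consensus_hubs) are hypothetical, and k is set to 3 here only to keep the toy example small, whereas the study used the top 15.

```python
def top_k(scores, k):
    """Return the set of the k node names with the highest score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def consensus_hubs(rankings, k):
    """Intersect the top-k sets obtained from each centrality ranking."""
    top_sets = [top_k(scores, k) for scores in rankings.values()]
    return set.intersection(*top_sets)


# Invented centrality scores for five nodes of a toy module.
rankings = {
    "degree": {"n1": 9, "n2": 8, "n3": 7, "n4": 2, "n5": 1},
    "betweenness": {"n1": 0.40, "n2": 0.05, "n3": 0.30, "n4": 0.35, "n5": 0.01},
    "closeness": {"n1": 0.80, "n2": 0.60, "n3": 0.75, "n4": 0.50, "n5": 0.70},
}

hubs = consensus_hubs(rankings, k=3)
print(sorted(hubs))
```

Requiring agreement across all three criteria is a conservative filter: a node that ranks highly on degree alone, such as n2 in this toy data, is dropped.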
Conclusion: In this study, we analyzed DNA methylation and gene expression profiles of ccRCC samples from TCGA and identified three genes that can discriminate tumor from normal samples with 96% accuracy.
Several studies have shown that LAPTM5 plays a role in the development of different types of cancer. For example, it strongly affects the proliferation of bladder cancer cells via the G0/G1 phase of the cell cycle.
GAPDH can affect metastasis in renal cancer. It also contributes to cancer pathogenesis through regulation of autophagy and apoptosis.
CENPM facilitates tumor metastasis in pancreatic cancer via the mTOR/p70S6K signaling pathway. It also affects cancer cell invasion in hepatocellular carcinoma.
In conclusion, because these markers passed multiple filters and because machine learning approaches are effective for building diagnostic models, these markers could be helpful for early detection of ccRCC.