1-1.什麼是maftools

maftools : efficient and comprehensive analysis of somatic variants in cancer

With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has sequenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.

簡單來說就是癌症基因體計畫所產生的突變資料都存在一種叫MAF格式的檔案裡。maftools就是透過視覺化輔助分析的方法來讓人看得懂MAF的內容。

Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch PMID: 30341162

原始連結: https://www.bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

1-2.Overview of the package

1-3.MAF格式

GDC MAF Format: Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF Files and are generated on a project-level.

https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

Row

重要欄位

  • Hugo_Symbol
  • Chromosome
  • Start_Position
  • End_Position
  • Reference_Allele
  • Tumor_Seq_Allele2
  • Variant_Classification
  • Variant_Type
  • Tumor_Sample_Barcode

對照表

2-1 建立MAF object

column

read.maf

read.maf function reads MAF files, summarizes it in various ways and stores it as an MAF object.

  1. 載入maftools套件 library(maftools)
  2. read.maf() 函式
  3. MAF file https://gdac.broadinstitute.org
  1. clinical information containing survival information and histology (optional)

maftools內建範例

Acute Myeloid Leukemia (LAML)

# 1.載入maftools套件
library(maftools)

# path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') 

# clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools') 

# 利用read.maf() 建立 maf object: laml
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)

# 檢視maf object, 直接輸入laml
laml


# 2.Shows sample summary
getSampleSummary(laml)

# 3.Shows gene summary
getGeneSummary(laml)

# 4.Shows clinical data associated with samples
getClinicalData(laml)

# 5.Shows all fields in MAF
getFields(laml)

# 6.Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')

# 7.plotmafSummary
We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

# 8,9. Oncoplots

oncoplot(maf = laml, top = 10)
oncoplot(maf = laml, top = 10, fontSize = 10,showTumorSampleBarcodes = TRUE, SampleNamefontSize = 2)

# 10.Transition and Transversions

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv)

# 11, 12. Lollipop plots for amino acid changes

#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.

lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE)
lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE, showDomainLabel=FALSE,labelPos=882)

# 13. Cross project comparison
tcgaCompare(maf = laml, cohortName = 'Example-LAML')

# 14. Somatic Interactions
Mutually exclusive or co-occurring set of genes can be detected using somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.

somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

# 15. Oncogenic Signaling Pathways
OncogenicPathways function checks for enrichment of known Oncogenic Signaling Pathways in TCGA cohorts

OncogenicPathways(maf = laml)

column

1

-Reading
-Validating
-Silent variants: 475 
-Summarizing
-Processing clinical data
-Finished in 0.293s elapsed (2.224s cpu) 
An object of class  MAF 
                   ID          summary  Mean Median
 1:        NCBI_Build               37    NA     NA
 2:            Center genome.wustl.edu    NA     NA
 3:           Samples              193    NA     NA
 4:            nGenes             1241    NA     NA
 5:   Frame_Shift_Del               52 0.271      0
 6:   Frame_Shift_Ins               91 0.474      0
 7:      In_Frame_Del               10 0.052      0
 8:      In_Frame_Ins               42 0.219      0
 9: Missense_Mutation             1342 6.990      7
10: Nonsense_Mutation              103 0.536      0
11:       Splice_Site               92 0.479      0
12:             total             1732 9.021      9

2

3

4

5

6

7

8

9

10

11

     HGNC refseq.ID protein.ID aa.length
1: DNMT3A NM_175629  NP_783328       912
2: DNMT3A NM_022552  NP_072046       912
3: DNMT3A NM_153759  NP_715640       723

12

     HGNC refseq.ID protein.ID aa.length
1: DNMT3A NM_175629  NP_783328       912
2: DNMT3A NM_022552  NP_072046       912
3: DNMT3A NM_153759  NP_715640       723

13

14

     gene1  gene2       pValue oddsRatio  00 11 01 10              Event
  1: ASXL1  RUNX1 0.0001541586 55.215541 176  4 12  1       Co_Occurence
  2:  IDH2  RUNX1 0.0002809928  9.590877 164  7  9 13       Co_Occurence
  3:  IDH2  ASXL1 0.0004030636 41.077327 172  4  1 16       Co_Occurence
  4:  FLT3   NPM1 0.0009929836  3.763161 125 17 16 35       Co_Occurence
  5:  SMC3 DNMT3A 0.0010451985 20.177713 144  6 42  1       Co_Occurence
 ---                                                                    
296: PLCE1  ASXL1 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
297: RAD21  FAM5C 1.0000000000  0.000000 183  0  5  5 Mutually_Exclusive
298: PLCE1  FAM5C 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
299: PLCE1  RAD21 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
300:  EZH2  PLCE1 1.0000000000  0.000000 186  0  4  3 Mutually_Exclusive
             pair event_ratio
  1: ASXL1, RUNX1        4/13
  2:  IDH2, RUNX1        7/22
  3:  ASXL1, IDH2        4/17
  4:   FLT3, NPM1       17/51
  5: DNMT3A, SMC3        6/43
 ---                         
296: ASXL1, PLCE1         0/9
297: FAM5C, RAD21        0/10
298: FAM5C, PLCE1         0/9
299: PLCE1, RAD21         0/9
300:  EZH2, PLCE1         0/7

15

       Pathway  N n_affected_genes fraction_affected
 1:    RTK-RAS 85               18        0.21176471
 2:      Hippo 38                7        0.18421053
 3:      NOTCH 71                6        0.08450704
 4:        MYC 13                3        0.23076923
 5:        WNT 68                3        0.04411765
 6:       TP53  6                2        0.33333333
 7:       NRF2  3                1        0.33333333
 8:       PI3K 29                1        0.03448276
 9: Cell_Cycle 15                0        0.00000000
10:   TGF-Beta  7                0        0.00000000

2-2 Pre-compiled TCGA MAF objects

column

指令

# 載入TCGAmutations套件
library(TCGAmutations)

# 1.查看TCGA cancer projects
tcga_available()

# 2.利用Study_Abbreviation載入maf object
coad = tcga_load(study = 'COAD')

# 3.plotmafSummary()
plotmafSummary(coad)

# 4.oncoplots
oncoplot(maf = coad, top = 10, fontSize = 1,showTumorSampleBarcodes = TRUE, SampleNamefontSize = 1)

# 5.getFields
getFields(coad)

# 6.Lollipop plots for amino acid changes
lollipopPlot(maf = coad, gene = 'APC', AACol = 'HGVSp_Short', showMutationRate = TRUE)

column

1

2

3

4

5

 [1] "Hugo_Symbol"               "Chromosome"               
 [3] "Start_Position"            "End_Position"             
 [5] "Reference_Allele"          "Tumor_Seq_Allele2"        
 [7] "Variant_Classification"    "Variant_Type"             
 [9] "Tumor_Sample_Barcode"      "HGVSp_Short"              
[11] "n_depth"                   "n_ref_count"              
[13] "n_alt_count"               "t_depth"                  
[15] "t_ref_count"               "t_alt_count"              
[17] "Tumor_Sample_Barcode_full" "sample_type_description"  
[19] "ExAC_AF"                   "GMAF"                     
[21] "Cohort"                    "IMPACT"                   

6

   HGNC    refseq.ID   protein.ID aa.length
1:  APC NM_001127511 NP_001120983      2825
2:  APC NM_001127510 NP_001120982      2843
3:  APC    NM_000038    NP_000029      2843

2-3 ICGC to MAF objects

column

ICGC Data Portal

Download

Acute Lymphoblastic Leukemia - TARGET, US

# clinical information
download.file("https://dcc.icgc.org/api/v1/download?fn=/current/Projects/ALL-US/donor.ALL-US.tsv.gz",destfile = "donor.ALL-US.tsv.gz")

# simple somatic mutation
download.file("https://dcc.icgc.org/api/v1/download?fn=/current/Projects/ALL-US/simple_somatic_mutation.open.ALL-US.tsv.gz",destfile="simple_somatic_mutation.open.ALL-US.tsv.gz")

# icgc TSV to MAF format
all=icgcSimpleMutationToMAF(icgc = "simple_somatic_mutation.open.ALL-US.tsv.gz", addHugoSymbol = TRUE)

# convert maf object
icgc = read.maf(all)

# plotmafSummary
plotmafSummary(icgc)

column

--Removed 5200 duplicated variants
--Found  7  variants with no Gene Symbols
--Annotating them as 'UnknownGene' for convenience
-Validating
-Silent variants: 493 
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.070s elapsed (1.139s cpu) 

作業

請將 TCGA 的 Colon_adenocarcinoma (COAD) &Rectum_adenocarcinoma (READ)兩個 projects的 KRAS gene mutations 畫在同一張 lollipop 上。包含程式碼&圖。

column

提示:

??lollipopPlot2

   HGNC refseq.ID protein.ID aa.length
1: KRAS NM_004985  NP_004976       188
2: KRAS NM_033360  NP_203524       189
   HGNC refseq.ID protein.ID aa.length
1: KRAS NM_004985  NP_004976       188
2: KRAS NM_033360  NP_203524       189