6.1 Import from Tapestri Insights

Here is how we used to do it and what I wrote last time: Our first step is to extract the data from output from Tapestri Insights. To do so, I’ve written a few functions that will be helpful to do this on a long list of samples. As this project grew it becamse clear that this work was better suited for a package. Before that though, looking forward we’ll plan to analyze data using the tapestri R package and integrate formatting from standard single cell rna-seq data in R. For now, we’ll just source a series of functions that can be found in the appendix. I’ll build in explanations as best I can over time.

Below I’ve copied directly the README.txt file output from Tapestri Insights, as it most accurately describes the output from each of the files.

This folder contains data exported from Tapestri Insights 2.1 application. The following files are included:
DP.csv - read depth. The Read Depth per Variant metric is the filtered depth, at the cell level. This gives the number of filtered reads that support each of the reported alleles.
GQ.csv - quality scores. The genotype quality score represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype normalized likelihoods of possible genotypes (PL). Specifically, the quality score is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. The values of the PLs are normalized so that the most likely PL is always 0, so the quality score ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of quality score is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a quality score of 99. Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another.
NGT.csv - genotype call converted to categorical (0-reference, 1-heterozygous mutation, 2-homozygous mutation, 3-unknown). Calls are made by GATK/Haplotypecaller.
AF.csv - variant allele frequency (# of reads with evidence for mutation / total # of reads * 100)
Variants.csv - variant metadata

In the weeds a little here to explain a detour
It is worth mentioning here on why we chose to proceed in the following fashion, and why we didn’t do all the parsing in R directly from the loom file. When we first started running these samples, there were some incompatibilities with the loom v3 files produced from Bluebee, and the available packages in R that were only working with loom v2. This has now of course been corrected. We considered using python, but the reality was my proficiency in R was significantly greater than python and this was not a tennable solutuon. Going through Tapestri Insights also had two added benefits: 1) efficient SNV to protein identification, and 2) data access for others in the lab who were not quickly pouring into the data in R. Future projects in our lab are likely to bypass Tapestri Insights, and bring loom files straight into R as you’ll see in the DNA+Protein section. Furthermore, as this project progressed, new versions of Tapestri Insights became available, and this led to minor modifications in column names and file names. All of this was parsed in R, and the version of Tapestri Insight did not change the analysis, we just did not rerun >100 samples through Tapestri Insights to make sure it was all formatted directly. This is all to explain why you will see some patchwork code in some of the extraction functions upcoming in the next section.

all_samples <- list.files("/Volumes/LevineLab/Levine Lab/MissionBio_Tapestri/Insights_Output/ASH_Cohort")
blacklist <- read.xls("/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/From_LAM/Blacklist2.xlsx")
whitelist <- read.xls("/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/From_LAM/Whitelist.xlsx")
pheno <- read.xls("/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/From_LAM/pheno_MBio.xlsx")
sample_set<- intersect(as.character(all_samples),as.character(pheno$Sample.ID)) 

setwd("/Volumes/LevineLab/Levine Lab/MissionBio_Tapestri/Insights_Output/ASH_Cohort")
sample_SNPS<-lapply(as.list(sample_set),function(x){
  y<-extract_SNP_info(x)
  z <- y[!(rownames(y)%in%blacklist[,"Blacklisted.coordinate"]|y[,"protein"]%in%blacklist[,"Protein.Change"]),]
  z[,"SNP_variant"] <- rownames(z)
  z[,"Sample"] <- x
  return(z)
})

names(sample_SNPS) <- sample_set
saveRDS(sample_SNPS,file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/sample_SNPS.rds")

sample_NGTs<-lapply(as.list(names(sample_SNPS)),function(x){
  extract_NGT_files(x,sample_SNPS[[x]]$SNP_variant)
})
names(sample_NGTs) <- names(sample_SNPS)

named_sample_NGTs<-lapply(sample_NGTs,function(x){
  rownames(x) <- paste("Cell",1:nrow(x),sep="_")
  return(x)
})
saveRDS(named_sample_NGTs,file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/sample_NGTs.rds")


pheno <- read.xls("/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/From_LAM/pheno_MBio.xlsx")

# Now we filter out any synonymous mutations as well
sample_SNPS_filter<-lapply(sample_SNPS,function(x){ x%>%filter(!(grepl("[=>]",protein)))%>%filter(!grepl("ASXL1:p.[DNGPIL]81",protein))%>%filter(!grepl("FLT3_INS_chr13:28602226:",protein)) }) 
colnames_to_grab <- c("Variant","protein","Gene","gene","allele_freq_loom","allele_freq","SNP_variant","Sample")
sample_SNPS_filter_cols<-lapply(sample_SNPS_filter,function(x){x[,colnames(x)%in%colnames_to_grab]}) 

sample_NGTs_filter<-lapply(names(sample_SNPS),function(x){setNames(data.frame(sample_NGTs[[x]][,sample_SNPS_filter[[x]][,"SNP_variant"]]),sample_SNPS_filter[[x]][,"SNP_variant"])}); names(sample_NGTs_filter) <- names(sample_SNPS)
final_variants_of_interest<-lapply(sample_NGTs_filter,function(x){colnames(x)[colSums(x)>0]})
names(final_variants_of_interest) <-names(sample_SNPS)

test_NGT <- setNames(lapply(names(final_variants_of_interest),function(x){setNames(data.frame(sample_NGTs_filter[[x]][,final_variants_of_interest[[x]]]),final_variants_of_interest[[x]] )}),names(sample_SNPS))

final_NGT<-setNames(lapply(names(test_NGT),function(x){
  if(ncol(test_NGT[[x]])>1){
    j<-test_NGT[[x]][,apply(test_NGT[[x]],2,function(z){sum(z%in%c(1,2))>=2})]
    z<-j[!apply(j,1,function(y){any(y==3)}),]
    q<-z[,apply(z,2,function(h){sum(h%in%c(1,2))>=2})]
  } else{
    j<-data.frame(test_NGT[[x]][test_NGT[[x]]==1|test_NGT[[x]]==2])
    q<-data.frame(j[!j==3])      
  }
  return(q)
}),names(test_NGT))

final_variants_of_interest <- lapply(final_NGT,colnames)

final_SNP <-setNames(lapply(names(final_variants_of_interest),function(x){
  sample_SNPS_filter_cols[[x]]%>%filter(SNP_variant%in%final_variants_of_interest[[x]])}),names(sample_SNPS))

final_SNP_mat<-do.call(bind_rows,final_SNP)
final_SNP_mat$Gene <- ifelse(final_SNP_mat$Gene=="","FLT3",final_SNP_mat$Gene)
final_SNP_mat$gene <- ifelse(final_SNP_mat$gene=="","FLT3",final_SNP_mat$gene)
final_SNP_mat$Gene <- ifelse(is.na(final_SNP_mat$Gene),final_SNP_mat$gene,final_SNP_mat$Gene) 
colnames(final_SNP_mat)[5] <- "Sample.ID"

pheno_mut_melted<-inner_join(pheno,final_SNP_mat[,2:5])
pheno_mut_melted$Gene<- factor(pheno_mut_melted$Gene,levels=names(sort(table(pheno_mut_melted$Gene), decreasing=TRUE)))

for(x in 1:length(names(final_NGT))){
  colnames(final_NGT[x][[1]])<-pheno_mut_melted[pheno_mut_melted$SNP_variant%in%colnames(final_NGT[[x]])&
                                                  pheno_mut_melted$Sample.ID==names(final_NGT)[x],"protein"]
  rownames(final_NGT[x][[1]])<-paste("Cell",1:dim(final_NGT[x][[1]])[[1]],sep="_")
}


#These lines were used to regenerated a new blacklist
protein_variant_counts<-data.frame(sort(table(pheno_mut_melted[,"protein"]),decreasing=TRUE))
SNP_variant_counts<-data.frame(sort(table(pheno_mut_melted[,"SNP_variant"]),decreasing=TRUE))
Muts_per_patientx<-data.frame(sort(table(pheno_mut_melted[,"Sample.ID"]),decreasing=TRUE))

processing_NGT <- final_NGT
names(processing_NGT) <- names(final_NGT)

cell_number_per_sample_cutoff<-40
final_sample_set<-names(which(do.call(rbind,lapply(processing_NGT,dim))[,1]>cell_number_per_sample_cutoff))

samples_of_interest<-setdiff(names(processing_NGT),final_sample_set)
lapply(processing_NGT[samples_of_interest],nrow)
lapply(sample_NGTs[samples_of_interest],nrow)

cells_of_interest<-setNames(lapply(processing_NGT[final_sample_set],function(x){
  rownames(x)
}),final_sample_set)

saveRDS(processing_NGT[final_sample_set],file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/final_NGTs.rds")
saveRDS(pheno_mut_melted,file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/pheno_mut_melted.rds")
saveRDS(final_sample_set,file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/final_sample_set.rds")
saveRDS(cells_of_interest,file="/Volumes/LevineLab/Levine Lab/Bobby/Collaborations/MissionBio/Analysis/2020/January/cells_of_interest.rds")