#This file contains a brief description of the data extracts and files generated to support the reference bias analyses for STAR-WASP benchmarking.

Initial data preparation was done with "Data_Prep_Ref_Bias.R". The overall objective is to create a dataframe with the variables needed for the downstream reference bias analyses.

Note that the vG coordinates in the STAR SAM files are 0-based, while VCF coordinates are 1-based. We therefore added 1 to the vG coordinates from the SAM files to create the value "var_vG_Pos_Plusone", which is used to find these variants in the VCF files and extract genotype information.

Input file (to Data_Prep_Ref_Bias.R):
/home/asiimwe/projects/run_env/alpha_star_wasp_comparison/STAR_WASP_Runs/HG00512/32threads/STAR_vW_Tagged_Reads_vA_vG (extracted for each sample)

Example of a line in the input file:
ERR1050076.10000027 chr14 105855536 vA:B:c,1 vG:B:i,105855557 vW:i:1
(Subset of the file with vW-tagged reads; only the columns of interest were extracted.)

Output files:
i. samples_subset1.txt (Contains Samples: ["HG00512", "HG00513", "HG00731", "HG00732", "HG00733", "NA19238", "NA19239", "NA19240"])
ii. samples_subset2.txt (Contains Samples: ["NA12878_Nucleus_nonPolyA", "NA12878_Nucleus_nonPolyA_Rep", "NA12878_Nucleus_PolyA_Rep", "NA12878_Nucleus_PolyA"])
iii. samples_subset3.txt (Contains Samples: ["NA12878_PolyA", "NA12878_PolyA_Rep", "NA12878_RAMPAGE", "NA12878_RAMPAGE_Rep", "NA12878_Total", "NA12878_Total_Rep"])

These files contain: "Read_ID", "Chr", "Pos", "vW_Tag", "Sample", "VCF_Path", "vA_Base", "vA", "vG_Base", "var_vG_Pos" (variant's genomic coordinate), "var_vG_Pos_Plusone" (variant's genomic coordinate + 1, for mapping to VCF file coordinates)

Samples were split into subsets to create smaller, more manageable sets that fit into memory.

** Note that Data_Prep_Ref_Bias.R also removes inconsistent haplotypes so that only definitive haplotypes remain, e.g. it removes vA:B:c,1,2 1 but keeps vA:B:c,1,1 1

Once all files are extracted, all quotes leading or trailing each value should be removed for easier processing downstream. This was done at the terminal by replacing every quote character with the empty string.

Run "overlapped_variants_genotype_mapper_subset1.py" and "overlapped_variants_genotype_mapper_subset2_and_3.py" to extract a mappability key from the VCF files (extracted as a Python dictionary). The samples_subset files have a similar key, generated by concatenating chr, position and sample.
These Python files will create "vcf_mapper_subset1.txt" and "vcf_mapper_subset2_and_3_NA12878.txt" for subset 1 and subsets 2 and 3, respectively.

** The Python files also filter to remove multi-allelic REF and ALT occurrences. We also filter to keep heterozygous variants.

An inner join is then done on the vcf_mapper files and the samples_subset files, after sorting both on the mapper variable.

Sorting the samples_subset files (mapper variable in column 12):
    sort -k12,12 samples_subset1.txt > sorted_samples_subset1.txt
    sort -k12,12 samples_subset2.txt > sorted_samples_subset2.txt

Sorting the VCF extracts (mapper variable in column 7):
    sort -k7,7 vcf_mapper_subset1.txt > sorted_vcf_mapper_subset1.txt

Join:
    join -1 7 -2 12 sorted_vcf_mapper_subset2_and_3_NA12878.txt sorted_samples_subset2.txt > mapped_subset2.txt

The resulting mapped files (e.g. mapped_subset2.txt) are then fed into Ref_Bias_Analysis_STAR_WASP.Rmd for analysis.
Cleaned datasets for analysis are in "Analysis_files"
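
To make the coordinate shift and the mapper key concrete, here is a minimal Python sketch of how one vW-tagged line could be turned into per-variant records. It is not the code in Data_Prep_Ref_Bias.R. The function name parse_tagged_read, the "_" separator in the key, the rule that a haplotype is consistent when all of its vA codes are equal, and the choice to emit one record per overlapped variant are all assumptions made for illustration.

```python
def parse_tagged_read(line, sample):
    """Turn one vW-tagged line into per-variant records (hypothetical helper)."""
    read_id, chrom, pos, va_tag, vg_tag, vw_tag = line.split()

    # "vA:B:c,1,1" -> [1, 1]; "vG:B:i,105855557" -> [105855557]
    alleles = [int(x) for x in va_tag.split(",")[1:]]
    positions = [int(x) for x in vg_tag.split(",")[1:]]

    # Assumed consistency rule: every overlapped variant carries the same
    # allele code, so "1,1" is kept and "1,2" is dropped.
    if len(set(alleles)) != 1:
        return []

    records = []
    for allele, vg_pos in zip(alleles, positions):
        plus_one = vg_pos + 1  # shift the vG coordinate to match the VCF
        records.append({
            "Read_ID": read_id,
            "Chr": chrom,
            "Pos": int(pos),
            "vW_Tag": int(vw_tag.rsplit(":", 1)[1]),
            "vA": allele,
            "var_vG_Pos": vg_pos,
            "var_vG_Pos_Plusone": plus_one,
            # Assumed key layout: chr, shifted position and sample joined by "_".
            "key": f"{chrom}_{plus_one}_{sample}",
        })
    return records


example = "ERR1050076.10000027 chr14 105855536 vA:B:c,1 vG:B:i,105855557 vW:i:1"
records = parse_tagged_read(example, "HG00512")
```

For the example line above, var_vG_Pos is 105855557 and var_vG_Pos_Plusone is 105855558, which is the position to look up in the VCF file.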
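
The sort and join step can also be sketched in Python, which may help when checking the join output on a small sample. This is an illustrative equivalent, not part of the pipeline. The function inner_join and the toy rows are hypothetical, and the default key columns (7 and 12) simply mirror the "join -1 7 -2 12" command in this file.

```python
from collections import defaultdict


def inner_join(vcf_rows, sample_rows, vcf_key_col=7, sample_key_col=12):
    """Inner-join two lists of already-split rows on 1-based key columns."""
    # Index the VCF mapper rows by their key so each lookup is O(1).
    by_key = defaultdict(list)
    for row in vcf_rows:
        by_key[row[vcf_key_col - 1]].append(row)

    joined = []
    for row in sample_rows:
        key = row[sample_key_col - 1]
        for vcf_row in by_key.get(key, []):
            # Same column layout as join(1): key, rest of file 1, rest of file 2.
            rest_vcf = vcf_row[:vcf_key_col - 1] + vcf_row[vcf_key_col:]
            rest_sample = row[:sample_key_col - 1] + row[sample_key_col:]
            joined.append([key] + rest_vcf + rest_sample)
    return joined


# Toy rows (made up for illustration), with the key in the last column.
toy_vcf = [["A", "G", "k1"], ["C", "T", "k2"]]
toy_reads = [["r1", "k2"], ["r2", "k3"], ["r3", "k1"]]
toy_joined = inner_join(toy_vcf, toy_reads, vcf_key_col=3, sample_key_col=2)
```

Unlike join, this sketch does not need sorted input, and it returns rows in the order of the samples file rather than in key order. Rows whose key is missing from either side are dropped, as in an inner join.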