#This file contains a brief description of data extracts and files generated to support analyses on reference bias for STAR-WASP benchmarking

Initial data preparation was done with "Data_Prep_Ref_Bias.R". The overall objective of this script is to create a dataframe with the variables needed for downstream reference bias analyses.
Note that the vG coordinates in the STAR SAM files are 0-based, while VCF coordinates are 1-based. We therefore added 1 to the vG coordinates from the SAM files to create the value "var_vG_Pos_Plusone", which is used to look up these variants in the VCF files and extract genotype info.
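
The plus-one adjustment can be sketched in Python as follows. This is only an illustration of the arithmetic, not the code in Data_Prep_Ref_Bias.R; the function name vg_to_vcf_positions is hypothetical, and it assumes the tag has the form shown in the example input line (vG:B:i, followed by one or more comma-separated coordinates).

```python
def vg_to_vcf_positions(vg_tag):
    """Return plus-one positions for a STAR vG tag such as 'vG:B:i,105855557'."""
    # Everything after the first comma is the list of variant coordinates.
    _, _, values = vg_tag.partition(",")
    # Add 1 to each coordinate, as described above for var_vG_Pos_Plusone.
    return [int(v) + 1 for v in values.split(",") if v]


print(vg_to_vcf_positions("vG:B:i,105855557"))  # [105855558]
```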

Input file (to Data_Prep_Ref_Bias.R): /home/asiimwe/projects/run_env/alpha_star_wasp_comparison/STAR_WASP_Runs/HG00512/32threads/STAR_vW_Tagged_Reads_vA_vG (extracted for each sample)
Example line from the input file: ERR1050076.10000027 chr14 105855536 vA:B:c,1 vG:B:i,105855557 vW:i:1 (the input is a subset of the vW_Tagged reads, with only the columns of interest extracted)
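
As a rough illustration of the input layout, the example line above could be parsed in Python like this. The function name and the dictionary keys are hypothetical and chosen only for this sketch; the actual parsing is done in Data_Prep_Ref_Bias.R and may differ.

```python
def parse_tagged_read(line):
    """Split one whitespace-delimited input line into named fields."""
    read_id, chrom, pos, va_tag, vg_tag, vw_tag = line.split()[:6]
    return {
        "read_id": read_id,
        "chrom": chrom,
        "pos": int(pos),
        # vA:B:c,1 -> [1]; values after the first comma are the alleles
        "va": [int(x) for x in va_tag.split(",")[1:]],
        # vG:B:i,105855557 -> [105855557]
        "vg": [int(x) for x in vg_tag.split(",")[1:]],
        # vW:i:1 -> 1
        "vw": int(vw_tag.rsplit(":", 1)[1]),
    }


example = "ERR1050076.10000027 chr14 105855536 vA:B:c,1 vG:B:i,105855557 vW:i:1"
record = parse_tagged_read(example)
print(record["chrom"], record["va"], record["vg"], record["vw"])
```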

Output files:
	i.   samples_subset1.txt (Contains Samples: ["HG00512", "HG00513", "HG00731", "HG00732", "HG00733", "NA19238", "NA19239", "NA19240"])
	ii.  samples_subset2.txt (Contains Samples: ["NA12878_Nucleus_nonPolyA",  "NA12878_Nucleus_nonPolyA_Rep", "NA12878_Nucleus_PolyA_Rep", "NA12878_Nucleus_PolyA"])
	iii. samples_subset3.txt (Contains Samples: ["NA12878_PolyA", "NA12878_PolyA_Rep",  "NA12878_RAMPAGE",  "NA12878_RAMPAGE_Rep",  "NA12878_Total", "NA12878_Total_Rep"])

These files contain: "Read_ID", "Chr", "Pos", "vW_Tag", "Sample", "VCF_Path", "vA_Base", "vA", "vG_Base", "var_vG_Pos" (variant's genomic coordinate) and "var_vG_Pos_Plusone" (variant's genomic coordinate + 1, for matching against VCF file coordinates)

Samples were split into smaller, more manageable subsets that fit into memory

** Note that Data_Prep_Ref_Bias.R also removes inconsistent haplotypes so that only definitive haplotypes remain, e.g. it removes vA:B:c,1,2 1 but keeps vA:B:c,1,1 1
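
A minimal Python sketch of such a consistency check is shown below. It assumes that "inconsistent" means the vA tag lists different alleles for the variants overlapped by one read (as in the 1,2 example), and the function name is hypothetical; the R script may apply additional rules.

```python
def has_consistent_haplotype(va_tag):
    """True if every allele value in a vA tag (e.g. 'vA:B:c,1,1') is the same."""
    values = va_tag.split(",")[1:]  # drop the 'vA:B:c' prefix
    return len(values) > 0 and len(set(values)) == 1


print(has_consistent_haplotype("vA:B:c,1,1"))  # True
print(has_consistent_haplotype("vA:B:c,1,2"))  # False
```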

Once all files are extracted, the quotes surrounding each value should be removed for easier downstream processing (we replaced all quotes with an empty string at the terminal)
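
The same clean-up could be done in Python instead of at the terminal. The sketch below assumes the quotes are plain double quotes (the default that R's write.table puts around character values); the function name is hypothetical.

```python
def strip_quotes(line):
    """Remove every double quote from a line of the extracted file."""
    return line.replace('"', "")


print(strip_quotes('"ERR1050076.10000027" "chr14" 105855536'))
# ERR1050076.10000027 chr14 105855536
```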

Run "overlapped_variants_genotype_mapper_subset1.py" and "overlapped_variants_genotype_mapper_subset2_and_3.py" to extract a mapper key from the VCF files (the scripts build a Python dictionary). The samples_subset files have a matching key, generated by concatenating chr, position and sample.
These Python scripts create "vcf_mapper_subset1.txt" for subset 1 and "vcf_mapper_subset2_and_3_NA12878.txt" for subsets 2 and 3.
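
For illustration, a key of this kind could be built as below. The separator and the function name are assumptions made for this sketch; the mapper scripts may concatenate the fields differently.

```python
def make_key(chrom, pos, sample, sep="_"):
    """Concatenate chromosome, position and sample into one join key."""
    return sep.join([chrom, str(pos), sample])


print(make_key("chr14", 105855558, "HG00512"))  # chr14_105855558_HG00512
```

Whatever separator is used, the key should contain no whitespace, because sort and join treat blanks as field delimiters by default.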

** The Python scripts also filter out multi-allele REF and ALT occurrences. We also filter to retain heterozygous genotypes.
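
The two filters might look like the following sketch. It assumes that a "multi-allele" entry is any REF or ALT value that is not a single base, and that the genotypes kept are heterozygous calls such as 0|1 or 1|0; both function names are hypothetical and this is not the code in the mapper scripts.

```python
import re

BASES = {"A", "C", "G", "T"}


def is_single_base_biallelic(ref, alt):
    """True if REF and ALT are each exactly one nucleotide (e.g. A and G)."""
    return ref in BASES and alt in BASES


def is_heterozygous(gt):
    """True for a diploid genotype with two different called alleles."""
    alleles = re.split(r"[|/]", gt)  # handles phased (|) and unphased (/) calls
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


print(is_single_base_biallelic("A", "G"), is_heterozygous("0|1"))    # True True
print(is_single_base_biallelic("A", "G,T"), is_heterozygous("1/1"))  # False False
```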

An inner join is then done on the vcf_mapper files and the samples_subset files, after sorting both on the mapper variable. The samples_subset files were sorted like so:
sort -k12,12 samples_subset1.txt > sorted_samples_subset1.txt
sort -k12,12 samples_subset2.txt > sorted_samples_subset2.txt

The VCF extracts were also sorted on the mapper variable like so:
sort -k7,7 vcf_mapper_subset1.txt > sorted_vcf_mapper_subset1.txt

Join:
join -1 7 -2 12 sorted_vcf_mapper_subset2_and_3_NA12878.txt sorted_samples_subset2.txt > mapped_subset2.txt
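
For readers more familiar with Python, the inner join above can be approximated with the sketch below. It is an assumption-laden illustration rather than part of the pipeline: the function name and the toy rows are hypothetical, and field numbers are 1-based to mirror the -1 and -2 options of join. As with join's default output, each result row starts with the key, followed by the remaining fields of the first file and then those of the second.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, left_field, right_field):
    """Inner-join two lists of rows on 1-based key fields."""
    li, ri = left_field - 1, right_field - 1
    # Index the left rows by key, keeping the non-key fields.
    by_key = defaultdict(list)
    for row in left_rows:
        by_key[row[li]].append(row[:li] + row[li + 1:])
    joined = []
    for row in right_rows:
        key = row[ri]
        rest_right = row[:ri] + row[ri + 1:]
        # Emit one output row per matching left row; unmatched rows are dropped.
        for rest_left in by_key.get(key, []):
            joined.append([key] + rest_left + rest_right)
    return joined


# Toy rows (made up for this sketch): key in field 4 on the left, field 2 on the right.
vcf_rows = [["A", "G", "0|1", "k1"], ["C", "T", "1|0", "k2"]]
read_rows = [["r1", "k1"], ["r2", "k3"], ["r3", "k1"]]
for out in inner_join(vcf_rows, read_rows, 4, 2):
    print(" ".join(out))
```

Unlike join, this version does not need pre-sorted input, but it holds the first file in memory and returns rows in the order of the second file rather than sorted by key.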

The resulting mapped_subset files (e.g. mapped_subset2.txt) are then fed into Ref_Bias_Analysis_STAR_WASP.Rmd for analysis. Cleaned datasets for analysis are in "Analysis_files"