NAME
samtools markdup - mark duplicate alignments in a coordinate sorted file

DESCRIPTION
Mark duplicate alignments from a coordinate sorted file that has been run through samtools fixmate with the -m option. The program relies on the MC and ms tags that fixmate provides.

OPTIONS
-l INT
    Expected maximum read length of INT bases.

-S
    Mark supplementary reads of duplicates as duplicates.

-d distance
    The optical duplicate distance. Suggested settings are 100 for HiSeq style platforms or about 2500 for NovaSeq ones. When set, duplicate reads are tagged with dt:Z:SQ for optical duplicates and dt:Z:LB otherwise. Calculation of distance depends on coordinate data embedded in the read names produced by the Illumina sequencing machines. Optical duplicate detection will not work on non-standard names.

-c
    Clear previous duplicate settings and tags.

-t
    Mark duplicates with the name of the original in a do tag.

--mode TYPE
    Duplicate decision method for paired reads. Mode t measures positions based on template start/end. Mode s measures positions based on sequence start. The two methods identify mostly the same reads as duplicates.

--include-fails
    Include reads that failed quality checks.

--no-multi-dup
    Stop checking duplicates of duplicates for correctness. Reads are still marked as duplicates, but further checks to make sure all optical duplicates are found are not carried out. This also affects tagging, where reads may be tagged with a better quality read but not necessarily the best one. Using this option can speed up duplicate marking when there are a great many duplicates for each original read.

--no-PG
    Do not add a PG line to the output file.

--threads INT
    Number of input/output compression threads to use in addition to the main thread.

STATISTICS
EXAMINED: reads examined for duplication.
SINGLE: reads that are not part of a pair.
DUPLICATE PAIR: reads in a duplicate pair.
DUPLICATE SINGLE: single read duplicates.
DUPLICATE PAIR OPTICAL: optical duplicate paired reads.
DUPLICATE SINGLE OPTICAL: optical duplicate single reads.
DUPLICATE NON PRIMARY: supplementary/secondary duplicate reads.
DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical duplicate reads.
DUPLICATE PRIMARY TOTAL: number of primary duplicate reads.
DUPLICATE TOTAL: total number of duplicate reads.
ESTIMATED LIBRARY SIZE: estimate of the number of unique fragments in the sequencing library.

Estimated library size makes various assumptions, e.g. that the library consists of unique fragments that are randomly selected (with replacement). However, it can provide a useful guide to how many unique read pairs are likely to be available. In particular, it can be used to determine how much more data might be obtained by further sequencing of the library.

Excluded reads are those marked as secondary, supplementary or unmapped. By default QC failed reads are also excluded, but they can be included as an option. Excluded reads are not used for calculating duplicates. They can optionally be marked as duplicates if they have a primary that is also a duplicate.

EXAMPLES
This first collate command can be omitted if the file is already name ordered or collated:

    samtools collate -o namecollate.bam example.bam

Add ms and MC tags for markdup to use later:

    samtools fixmate -m namecollate.bam fixmate.bam

Sort into coordinate order, as markdup requires:

    samtools sort -o positionsort.bam fixmate.bam

Mark the duplicates:

    samtools markdup positionsort.bam markdup.bam

Typically the fixmate step would be applied immediately after sequence alignment and the markdup step after sorting by chromosome and position. Thus no additional sort steps are normally needed.

AUTHOR
Written by Andrew Whitwham from the Sanger Institute.

NAME
samtools stats - produces comprehensive statistics from alignment file

DESCRIPTION
samtools stats collects statistics from BAM files and outputs them in a text format. The output can be visualized graphically using plot-bamstats.

A summary of output sections is listed below, followed by more detailed descriptions.

    ACGT content per cycle for first fragments only
    ACGT content per cycle for last fragments only

Not all sections will be reported, as some depend on the data being coordinate sorted while others are only present when specific barcode tags are in use.

Some of the statistics are collected for “first” or “last” fragments. Records are put into these categories using the PAIRED (0x1), READ1 (0x40) and READ2 (0x80) flag bits, as follows:

    Unpaired reads (i.e. PAIRED is not set) are all “first” fragments. For these records, the READ1 and READ2 flags are ignored.
    Reads where PAIRED and READ1 are set, and READ2 is not set are “first”.
    Reads where PAIRED and READ2 are set, and READ1 is not set are “last”.
    Reads where PAIRED is set and either both READ1 and READ2 are set or neither is set are not counted in either category.

Information on the meaning of the flags is given in the SAM specification.

The CHK row contains distinct CRC32 checksums of read names, sequences and quality values. The checksums are computed per alignment record and summed, meaning the checksum does not change if the input file has the sequence order changed.

The SN section contains a series of counts, percentages, and averages, in a similar style to samtools flagstat.
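The “first”/“last” fragment rules given above for samtools stats can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the samtools implementation; the function name fragment_category is hypothetical, and the flag values are those defined in the SAM specification.

```python
# Flag bit values as defined in the SAM specification.
PAIRED = 0x1
READ1 = 0x40
READ2 = 0x80


def fragment_category(flag):
    """Return "first", "last" or None for a SAM FLAG value.

    None means the record is not counted in either category.
    (Hypothetical helper, written only to illustrate the rules.)
    """
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    is_read1 = bool(flag & READ1)
    is_read2 = bool(flag & READ2)
    if is_read1 and not is_read2:
        return "first"
    if is_read2 and not is_read1:
        return "last"
    # PAIRED with both READ1 and READ2 set, or with neither set.
    return None
```

For example, a FLAG of 99 (0x63) has PAIRED and READ1 set and so falls into “first”, while 147 (0x93) has PAIRED and READ2 set and falls into “last”.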
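The reason the CHK checksums described above do not depend on record order is that addition is commutative: a CRC32 is computed for each record and the results are summed. The sketch below shows the idea using Python's zlib.crc32. The function name summed_crc32 is hypothetical, and truncating the running total to 32 bits is an assumption made for this example; the exact arithmetic used by samtools stats may differ.

```python
import zlib


def summed_crc32(values):
    """Sum per-record CRC32 values into one order-independent checksum.

    `values` is an iterable of bytes objects, e.g. read names.
    Keeping the total within 32 bits is an assumption of this sketch.
    """
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


names = [b"read_a", b"read_b", b"read_c"]
# Reordering the records leaves the summed checksum unchanged.
assert summed_crc32(names) == summed_crc32(list(reversed(names)))
```

A checksum built this way can confirm that two files hold the same set of values even when one has been re-sorted, but it cannot detect a change in order.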