A Sourceful of Secrets

Andrew E. Bruno

Counting the number of reads in a BAM file

with 22 comments

The output from short read aligners like Bowtie and BWA is commonly stored in SAM/BAM format. When presented with one of these files a common first task is to calculate the total number of alignments (reads) captured in the file. In this post I show some examples for finding the total number of reads using samtools and directly from Java code. For the examples below, I use the HG00173.chrom11 BAM file from the 1000 genomes project which can be downloaded here.

First, we look at using the samtools command directly. One way to get the total number of alignments is to simply dump the entire SAM file and tell samtools to count instead of print (-c option):

$ samtools view -c HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322

If we’re only interested in counting the total number of mapped reads we can add the -F 4 flag. Alternativley, we can count only the unmapped reads with -f 4:

# Mapped reads only
$ samtools view -c -F 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5068340

# Unmapped reads only
$ samtools view -c -f 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
149982

To understand how this works we first need to inspect the SAM format. The SAM format includes a bitwise FLAG field described here. The -f/-F options to the samtools command allow us to query based on the presense/absence of bits in the FLAG field. So -f 4 only output alignments that are unmapped (flag 0x0004 is set) and -F 4 only output alignments that are not unmapped (i.e. flag 0x0004 is not set), hence these would only include mapped alignments.

An example for paired end reads you could do the following. To count the number of reads having both itself and it’s mate mapped:

$ samtools view -c -f 1 -F 12 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
4906035

The -f 1 switch only includes reads that are paired in sequencing and -F 12 only includes reads that are not unmapped (flag 0x0004 is not set) and where the mate is not unmapped (flag 0x0008 is not set). Here we add 0x0004 + 0x0008 = 12 and use the -F (bits not set), meaning you want to include all reads where neither flag 0x0004 or 0x0008 is set. For help understanding the values for the SAM FLAG field there’s a handy web tool here.

There’s also a nice command included in samtools called flagstat which computes various summary statistics. However, I wasn’t able to find much documentation describing the output and it’s not mentioned anywhere in the man page. This post examines the C code for the flagstat command which provides some insight into the output.

$ samtools flagstat HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
5218322 + 0 in total (QC-passed reads + QC-failed reads)
273531 + 0 duplicates
5068340 + 0 mapped (97.13%:-nan%)
5205999 + 0 paired in sequencing
2603248 + 0 read1
2602751 + 0 read2
4881994 + 0 properly paired (93.78%:-nan%)
4906035 + 0 with itself and mate mapped
149982 + 0 singletons (2.88%:-nan%)
19869 + 0 with mate mapped to a different chr
15271 + 0 with mate mapped to a different chr (mapQ>=5)

The above shows a few simple examples using the samtools command but what if you wanted to count the total number of reads in code? I’ve been using the excellent Picard Java library as of late and haven’t found a simple way to do this via the API. I was looking for a fast way to compute this without having to scan the entire BAM file each time. Would love to see this added as a public function to the BAMIndexMetaData object or similar. Here’s a function I wrote to calcuate the total mapped reads from a BAM file. This makes use of the BAM index for speed and obviously requires you to first index your BAM file:

public int getTotalReadCount(SAMFileReader sam) {
    int count = 0;

    AbstractBAMFileIndex index = (AbstractBAMFileIndex) sam.getIndex();
    int nRefs = index.getNumberOfReferences();
    for (int i = 0; i < nRefs; i++) {
        BAMIndexMetaData meta = index.getMetaData(i);
        count += meta.getAlignedRecordCount();
    }

    return count;
}

This uses the BAMIndex to loop through each reference and sum the total mapped reads. A complete working example is included below:

import java.io.File;

import net.sf.samtools.AbstractBAMFileIndex;
import net.sf.samtools.BAMIndexMetaData;
import net.sf.samtools.SAMFileReader;

public class CountMapped {

    public static void main(String[] args) {
        File bamFile = new File(args[0]);

        SAMFileReader sam = new SAMFileReader(bamFile, 
                                 new File(bamFile.getAbsolutePath() + ".bai"));

        AbstractBAMFileIndex index = (AbstractBAMFileIndex) sam.getIndex();

        int count = 0;
        for (int i = 0; i < index.getNumberOfReferences(); i++) {
            BAMIndexMetaData meta = index.getMetaData(i);
            count += meta.getAlignedRecordCount();
        }

        System.out.println("Total mapped reads: " + count);
    }

}

Requires the Picard Java library. To compile/run:

$ javac -cp samtools.jar CountMapped.java
$ java -cp samtools.jar:. CountMapped HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
Total mapped reads: 5068340
About these ads

Written by Andrew

2012/04/13 at 22:31

Posted in Bioinformatics, Java

22 Responses

Subscribe to comments with RSS.

  1. Hi Andrew,

    Thanks for such an informative post. There are lot of posts related to SAM/BAM format but your explanation about the bitwise FLAG is the best. I understand it much better now.

    Thanks

    Ashutosh

    2012/04/23 at 13:24

  2. Reads can be nicely counted from bam index by using samtools idxstats or picard BamIndexStats:

    samtools idxstats in.bam | awk ‘{s+=$3+$4} END {print s}’ // total reads

    samtools idxstats in.bam | awk ‘{s+=$3} END {print s}’ // mapped reads

    Ig

    2012/06/04 at 20:07

  3. Hi lg, The article explains programmatic retrieval/availability of read counts with which you can calculate some thing else (e.g., RPKM).

    Parthiban

    2012/10/07 at 09:26

  4. Parthiban,
    I understand that. My comment’s goal was to show a trick how to get a fast total mapped/unmapped read count. It takes a split second from .bai file, not everybody appreciates that.

    It’s crucial for large bam files, since samtools flagstat or samtools view can take up to 30 min or more on them.

    Ig

    2012/10/07 at 18:50

  5. Hi,
    Reads in specific target regions can be counted by extracting bam file in the specific target region using samtools:
    samtools view in.bam chr2:20,100,000-20,200,000 > out.bam

    How can we count the reads falling in multiple target regions?

    Anonymous

    2012/10/10 at 09:00

  6. Best explanation ever of the bitwise Flag ! Thanks for sharing

    Antoine

    2012/11/14 at 04:47

  7. Hi, how could we extract a reads from specific locus/gene from bam file? Especially from accepted_hits.bam which I generated from Tophat. Thank you.

    Wan Fahmi

    2012/12/14 at 05:52

  8. Dear all,
    could you help me with this task?
    I think I have an inconsistency with some bam files.

    The problem is that samtools give me more reads than the original size of the library.
    I will paste the commands and the results:

    Library size (using FASTQC, file fastqc_data.txt)
    Total Sequences 9462380

    After, I aligned with tophat (tophat.log)
    left reads: min. length=40, max. length=101, 9462231 kept reads (149 discarded)

    But if I inspect accepted.bam with samtools, I have:

    samtools view -c accepted.bam 9534699
    samtools view -c -F 4 accepted.bam 9534699
    samtools view accepted.bam | cut -f 1 | sort | uniq | wc -l 9034090

    Also, if I do, as Ig did suggested:

    samtools idxstats accepted.bam | awk ‘{s+=$3+$4} END {print s}’ // total reads 9534699
    samtools idxstats accepted.bam | awk ‘{s+=$3} END {print s}’ // mapped reads 9534699

    Do you think it is a mistake?

    Thank you very much in advance

    Estefania

    Estefania

    2013/01/04 at 16:14

  9. The samtools idxstats looks like a useful tool. However, the awk commands listed above by Ig did not work on our RHEL system. Since I don’t know awk that well, I wasn’t able to debug it. The error was: awk:

    samtools idxstats EAP119_R2.fastq.nodup.srt.bam | awk ‘{s+=$3+$4} END {print s}’
    ‘{s+=+}
    awk: ^ invalid char ‘�’ in expression

    which means that awk was unable to pick up $3 and $4.

    Any idea why this didn’t work?

    JWhite

    Joe White

    2013/03/26 at 11:37

  10. @Joe If you copy and pasted the awk command I’m guessing it’s using the “smart” quote character (for example: U+2018). Just delete the single qutoes and replace them with an ascii single quote: '

    Andrew

    2013/03/26 at 11:52

  11. Andrew,

    Thanks, it was the smart quote characters. They should be banned! I have never seen a good use of smart quotes. What was MS thinking?

    Joe

    Joe White

    2013/03/26 at 11:58

  12. Hi,

    Aren’t you guys reporting “the number of alignments” instead of “the number of reads”? They are different.

    WOody

    Woody Lin

    2013/06/23 at 22:24

  13. @Woody Yes, we’re referring to sequence reads that have been aligned (using one of the many short read aligner programs). In particular, this post is about reporting the number of mapped vs unmapped reads.

    Andrew

    2013/06/24 at 10:40

  14. Unfortunately, all the scripts above are actually giving you “the number of alignment results”. It is wrong to use those scripts to get “the number of mapped reads”.

    Woody Lin

    2013/06/24 at 11:35

  15. @Woody Can you elaborate? Raw sequencing reads are aligned to a reference genome via an alignment tool which outputs results in SAM format (Sequence Alignment/Map format). SAM format has a bitwise field indicating whether or not a read was unmapped. See the FAQ. Here, we’re referring to “unmapped reads” as reads which could not be aligned to the reference genome. Is your issue with the use of the word “read”?

    Andrew

    2013/06/24 at 12:59

  16. The title of the article is for counting the number of reads, which include mapped reads. The comments are mainly interested in the number of mapped reads. I think that there is a confusion. It is also the case for Estefania’s inconsistency problem. I think that it is better to clarify that Samtools doesn’t provide the function for counting the number of reads but the number of alignments. If you use “-f 4″, what you actually got is the number of alignments. Because unmapped reads have only one alignment result, you will probably get the numbers right. For certain circumstance, you can get the number wrong if the sam or bam file doesn’t contain unmapped alignment results.
    Bowtie will report the numbers for mapped and unmmaped reads in the end of the run. Comparing the numbers with the one from “samtools view” or “samtools idxstats”, you will find difference. For 100% sure, You will need a script to count the number of reads properly.

    Woody Lin

    2013/06/25 at 06:14

  17. In the case when the BAM file includes non-primary alignments, you can get inconsistencies with idxstats/awk. In this case the “read” will be mapped multiple times and have multiple “alignments”.
    BWA by default outputs only primary alignment per read (so this is not an issue there).
    But that’s most likely Estefania’s issue.
    You can use the -F with 0x100 (secondary alignment) flag to deal with that in samtools view.
    [It's not a mapped/unmapped issue.]

    Ig

    2013/06/25 at 11:08

  18. @Ig @Wood Thanks for the feedback. I was not aware of the non-primary alignments. I suppose it would be more accurate here to say “counting the number of mapped/unmapped alignments” and not use the term “read” and “alignment” interchangeably as they are not a one-to-one mapping.

    Andrew

    2013/06/25 at 16:44

  19. @Ig I have also tried “-f 260″ (exclude unmapped and not primary alignment). The number is not correct.

    Woody Lin

    2013/06/25 at 18:40

  20. @Ig I have also tried “-F 260″ (exclude unmapped and not primary alignment). The number is not correct.

    Woody Lin

    2013/06/25 at 18:40

  21. This was a great discussion, I agree with Andrew that is better to say “counting the number of mapped/unmapped alignments” and not use the term “read” and “alignment” interchangeably as they are not a one-to-one mapping”. Thanks

    Estefania

    2013/06/28 at 09:22

  22. Hi Andrew,
    Thanks for your tips. I am interested in extracting number of mapped reads from my bam file and i would like to d othis for e series of windows. To be more precise i divided the genome into 100bp windows and want to get the number of mapped reads for each window. I used the command you mentioned above
    samtools view -c -F 4 mybam 1:100-200 1:201-300…

    I tried to write a for loop for this so that it calls the same command for each window. But this is extremely time consuming. Do you if this is possible, either by samtools or any other tool?
    thanks

    salih

    2013/07/07 at 05:57


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: