Extracting Mapped and Unmapped Reads from FASTQ

Bioinformatics
Command Line
Sequencing
A practical command-line workflow to separate mapped and unmapped reads and recover them into FASTQ files.
Author

Bhargava Reddy Morampalli

Published

4 March 2019

Modified

5 February 2026

Originally published on my legacy blog in 2019. Updated for technical accuracy and command clarity on 5 February 2026.

When only a fraction of reads maps to a reference genome, a common next step is to split mapped and unmapped reads for separate inspection.

This post shows a straightforward workflow using samtools and seqtk.

Workflow overview

  1. Align reads to a reference and produce a SAM/BAM file.
  2. Split mapped vs unmapped alignments using SAM FLAG filters.
  3. Extract read IDs from each group.
  4. Pull those reads from the original FASTQ.

Step 1: Start from an alignment file

Assume you already have an alignment file such as sample.bam.

Step 2: Split mapped and unmapped alignments

Use SAM FLAG-based filters:

  • -F 4 excludes records with the unmapped bit (0x4) set, so it keeps every mapped record, including secondary and supplementary ones.
  • -f 4 keeps only records with the unmapped bit set.

samtools view -b -F 4 sample.bam > sample.mapped.bam
samtools view -b -f 4 sample.bam > sample.unmapped.bam
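
To make the bit test behind these filters concrete, here is a minimal sketch in plain shell arithmetic. The helper name describe_flag is made up for illustration and is not part of samtools; the bit values (4 unmapped, 256 secondary, 2048 supplementary) come from the SAM specification.

```shell
# describe_flag is a hypothetical helper, not a samtools command.
# It reports how a single SAM FLAG value is treated by -f 4 / -F 4.
describe_flag() {
  flag=$1
  if [ $(( flag & 4 )) -ne 0 ]; then
    state="unmapped"   # kept by -f 4, dropped by -F 4
  else
    state="mapped"     # kept by -F 4, dropped by -f 4
  fi
  kind="primary"
  if [ $(( flag & 256 )) -ne 0 ]; then kind="secondary"; fi
  if [ $(( flag & 2048 )) -ne 0 ]; then kind="supplementary"; fi
  printf '%s\t%s\t%s\n' "$flag" "$state" "$kind"
}

# A few example FLAG values (chosen for illustration only).
describe_flag 0
describe_flag 4
describe_flag 16
describe_flag 256
describe_flag 2064
```

For real files, samtools also ships a flags subcommand (for example samtools flags 2064) that converts between numeric and named flags.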

Step 3: Extract unique read IDs

Get the first column (QNAME), sort, and deduplicate:

samtools view sample.mapped.bam | cut -f1 | sort -u > mapped_ids.lst
samtools view sample.unmapped.bam | cut -f1 | sort -u > unmapped_ids.lst

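Using sort -u collapses repeated IDs, which appear when a read has secondary or supplementary alignments or, in paired-end data, when both mates are listed under the same QNAME.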
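
With paired-end data, a pair in which one mate mapped and the other did not contributes the same QNAME to both lists. A quick way to see how many IDs are affected is to intersect the two files. This is only a sketch: shared_ids is a hypothetical helper name, and it relies on both lists being sorted the same way, which sort -u in the same locale provides.

```shell
# shared_ids is a hypothetical helper: print IDs present in both sorted lists.
shared_ids() {
  # comm -12 hides lines unique to either file, leaving only the intersection.
  comm -12 "$1" "$2"
}

# Count IDs that occur in both lists, if the files from Step 3 exist.
if [ -f mapped_ids.lst ] && [ -f unmapped_ids.lst ]; then
  shared_ids mapped_ids.lst unmapped_ids.lst | awk 'END { print NR }'
fi
```

For single-end data the count should be 0, because a read is either mapped or unmapped.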

Step 4: Recover reads from the original FASTQ

Pass each ID list to seqtk subseq, which prints the records from the FASTQ whose names appear in the list (one name per line):

seqtk subseq original.fastq mapped_ids.lst > mapped.fastq
seqtk subseq original.fastq unmapped_ids.lst > unmapped.fastq

Now you have:

  • mapped.fastq: reads that mapped to the reference.
  • unmapped.fastq: reads that did not map.
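
A simple sanity check is to compare the number of IDs in each list with the number of records that were actually extracted. The sketch below assumes plain, uncompressed FASTQ with exactly four lines per record; count_fastq_reads is a hypothetical helper name.

```shell
# count_fastq_reads is a hypothetical helper.
# Assumption: uncompressed FASTQ with exactly 4 lines per record.
count_fastq_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Report ID count vs extracted read count for each group, if the files exist.
for group in mapped unmapped; do
  if [ -f "$group.fastq" ] && [ -f "${group}_ids.lst" ]; then
    ids=$(awk 'END { print NR }' "${group}_ids.lst")
    reads=$(count_fastq_reads "$group.fastq")
    printf '%s\tids=%s\treads=%s\n' "$group" "$ids" "$reads"
  fi
done
```

For single-end data the two numbers should agree. A shortfall in reads can mean the names in the FASTQ differ from the QNAME column, for example when an aligner strips a /1 or /2 suffix.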

Optional: convert unmapped FASTQ to FASTA

seqtk seq -a unmapped.fastq > unmapped.fa

This can be useful for quick downstream checks such as BLAST searches.

Practical notes

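  • For paired-end data, both mates share one QNAME, so a pair with one mapped and one unmapped mate ends up in both ID lists. Extract from the R1 and R2 files with the same list to keep mates synchronized.
  • Check whether your aligner emits secondary/supplementary records. They inflate record counts in sample.mapped.bam, although sort -u in Step 3 prevents them from duplicating reads in the FASTQ output.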
  • Keep a record of software versions to improve reproducibility.
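
One lightweight way to record versions is to print one line per tool and save the output next to the results. This is a sketch under stated assumptions: log_tool_version is a hypothetical helper, and the grep patterns assume that samtools --version starts its first line with "samtools" and that running seqtk without arguments prints a line starting with "Version".

```shell
# log_tool_version is a hypothetical helper.
# Usage: log_tool_version TOOL PATTERN COMMAND [ARGS...]
# Prints TOOL and the first output line of COMMAND that matches PATTERN.
log_tool_version() {
  tool=$1
  pattern=$2
  shift 2
  if command -v "$tool" >/dev/null 2>&1; then
    # Some tools print usage/version text on stderr, so merge both streams.
    line=$("$@" 2>&1 | grep -e "$pattern" | head -n 1)
    printf '%s\t%s\n' "$tool" "$line"
  else
    printf '%s\tnot found on PATH\n' "$tool"
  fi
}

log_tool_version samtools '^samtools' samtools --version
log_tool_version seqtk '^Version' seqtk
```

Redirecting the two calls to a file such as tool_versions.tsv (name chosen here for illustration) keeps the record alongside the FASTQ outputs.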