Checking bioinformatics files before you trust them

Bioinformatics
CLI
Bash
A reference for the command-line checks you run before trusting a file, launching a pipeline, or debugging a result.
Author

Bhargava Reddy Morampalli

Published

1 June 2026

A lot of mistakes in bioinformatic analyses happen before the pipeline runs. A truncated download, a problem with the reference, an annotation built against a different assembly and a lot of these subtle mistakes do not throw an error. The tools run to completion and hand you a result that looks fine but it is not. By the time a strange number shows up three steps later, you’ve spent a day of compute and you’re debugging the wrong thing. And the worst part is you might even publish that result, and the error only comes to light when someone tries to reproduce it.

This is my attempt to write about some of the checks you can perform that can catch some of the common problems. This is by no means comprehensive and it’s not even close. This is just a subset of things I have used in my work. I hope to write more of these similar posts in the future, and I hope this one is useful to someone.

Each check asks one question of the file, and the useful part is knowing what a bad answer looks like. These are the checks I run most on the nanopore and bacterial-genomics data I work with: cheap, and worth the time. The commands are mostly grep, awk, file, and gzip, plus the sequencing tools samtools and seqkit. The sections below start with general checks, then go through the file types I see most often. For each one the question is the same: is the file intact, is it the format I think it is, and does its content make sense?

NoteEnvironment assumed

These one-liners assume the operating system as Ubuntu or a similar Linux shell, as that’s where you would be working most of the time with large-scale genomics data.

Before anything else: file sizes and disk space

Start with sizes and disk usage:

# File sizes in human-readable format
ls -lh *.fastq.gz

# How much space a data directory is using (useful as a quick check after transferring the data to a server)
du -sh data/

Output

This might look like something as shown below

-rw-r--r--  1 user  user   1.9G  Jun 1 09:10  sample_A_R1.fastq.gz
-rw-r--r--  1 user  user   1.9G  Jun 1 09:10  sample_A_R2.fastq.gz
-rw-r--r--  1 user  user   2.1G  Jun 1 09:12  sample_B_R1.fastq.gz
-rw-r--r--  1 user  user     0B  Jun 1 09:12  sample_B_R2.fastq.gz   # <- stop here
4.0G  data/

A zero-byte or suspiciously small file usually means something failed: an interrupted transfer, a job that ran out of disk, a symlink pointing at a path that no longer exists. In a set of files that should be about the same size, the odd one out is usually the problem.

Is the file intact?

A file can be the right size and still be truncated or corrupt. Compression makes this worse: zcat file.gz | head shows you the first few reads even if the last 200 MB never finished downloading. To catch that, you have to read the file all the way to the end.

# Is this actually a gzip file, or just named like one?
file reads.fastq.gz

# Test the entire compressed stream for integrity (reads to EOF)
gzip -t reads.fastq.gz && echo "OK"

# If checksums came with the download, verify those too (this is a great way to check for data you received from a collaborator. You can ask them to run this check for the data and send you the output so you can match it with your own check)
sha256sum -c SHA256SUMS

Output

reads.fastq.gz: gzip compressed data, was "reads.fastq", from Unix
OK
reads.fastq.gz: OK

file reads the magic bytes, so it catches the classic case of a download that returned an HTML error page saved under a .fastq.gz name. gzip -t decompresses the whole thing without writing anything out; if it returns an error, the archive is truncated or corrupt. If a collaborator gives you checksums, verify them too, because a readable gzip file isn’t always the exact file you meant to download. For BAMs, the matching first check is samtools quickcheck (below), which looks for a readable header and the expected end-of-file marker.

A truncated gzip often decompresses correctly right up to the break, so a head or a preview in a viewer looks okay. The corruption only surfaces when a tool tries to read past the truncation point, usually deep into your run. gzip -t catches it by reading the stream to the end; samtools quickcheck does the matching BAM check by looking for the expected header and end marker.

Do the files agree with each other?

This check catches some of the most expensive mistakes, and it’s very easy to miss. Do your reference, alignment, annotation, and variants all use the same names for the same sequences (or contig names for example)? The chromosome that one file calls 1, another calls chr1, a third calls Chr1, and a fourth calls NC_000001.11. No tool errors on this. They find no overlap and return empty output, “no reads in region,” or zero variants: results that look like biology but are really a string mismatch.

Pull the sequence names from each file and compare the sets directly:

# Reference FASTA
grep "^>" ref.fa | sed 's/>//; s/ .*//' | sort > fa.names

# BAM header (@SQ lines)
samtools view -H aln.bam \
  | awk -F '\t' '$1=="@SQ"{for (i=1; i<=NF; i++) if ($i ~ /^SN:/) {sub(/^SN:/, "", $i); print $i}}' \
  | sort > bam.names

# Annotation
grep -v "^#" ann.gff | cut -f1 | sort -u > gff.names

# Variants
grep -v "^#" var.vcf | cut -f1 | sort -u > vcf.names

# Compare any two sets
diff fa.names bam.names
comm -3 fa.names gff.names      # lines unique to each side
$ comm -3 fa.names gff.names
1
    chr1

That output is the whole problem in two lines: the FASTA says 1, the GFF says chr1. After confirming the lengths/assembly match, it may be a one-line naming fix with sed; caught downstream, it’s a day of wondering why a chromosome has no annotated genes.

Matching names aren’t enough on their own. The lengths have to agree too. Compare the @SQ lengths in the BAM header against the reference:

samtools faidx ref.fa
cut -f1,2 ref.fa.fai | sort > fa.lengths

samtools view -H aln.bam \
  | awk -F '\t' '$1=="@SQ"{name=len=""; for (i=1; i<=NF; i++) {if ($i ~ /^SN:/) name=substr($i,4); if ($i ~ /^LN:/) len=substr($i,4)}; if (name!="") print name "\t" len}' \
  | sort > bam.lengths

diff fa.lengths bam.lengths

If a contig has the same name in two files but a different length, you’re mixing assembly versions. Coordinates won’t line up, and every position-based lookup is off by however much the assemblies differ. That’s harder to spot than a name mismatch, because nothing comes back empty. It’s just wrong.

This subsection above is probably a bit more complex code but you don’t need to remember these lengthy piped one liners. These are something you usually pick up while working with the data and you can refer back to your “useful commands list”, which I think everyone should maintain for later use.

FASTA files

FASTA files hold genome assemblies, reference sequences, and transcript sets. The > lines are headers; everything else is sequence.

# Count sequences (contigs, chromosomes, transcripts)
grep -c "^>" assembly.fa

# Skim the first few headers
grep "^>" assembly.fa | head

# Full summary: counts, length distribution, N50, GC
seqkit stats -a assembly.fa

I use the pattern as ^> so only a > at the start of a line counts. On a clean FASTA that’s the same number as grep -c ">", because grep -c counts matching lines rather than matches; the anchor just stops a stray > inside a sequence line from being counted. The command I actually reach for, regularly, is seqkit stats -a, which shows the whole shape of an assembly at once:

file         format  type  num_seqs    sum_len  min_len    avg_len    max_len        N50  GC(%)
assembly.fa  FASTA   DNA          3  5,012,345    4,521  1,670,781  4,612,310  4,612,310  50.79

For a finished bacterial genome this is about what I expect: a handful of sequences, a total length in the right ballpark for the organism, and an N50 equal to the chromosome, because most of the assembly is one contiguous piece. The failure modes are just as easy to read. Hundreds of sequences with an N50 in the kilobases means a fragmented assembly, and a sum_len that’s half or double what the organism should be points at a mis-assembly or contamination before you ever map a read to it. Again, as you can see, the seqkit stats output is a lot more informative than the simple sequence count, and it’s worth the extra time to run it on every assembly you get your hands on.

FASTQ files

FASTQ files store raw reads. In the standard form, each read is four lines: header, sequence, +, and quality string.

# Count reads in an uncompressed FASTQ
wc -l file.fastq | awk '{print $1/4}'

# Count reads in a gzipped FASTQ
zcat file.fastq.gz | wc -l | awk '{print $1/4}'

# Peek at the first two reads
head -8 file.fastq

# Full stats, straight off the gzipped file
seqkit stats -a reads.fastq.gz

seqkit stats -a is where the sequencing platform becomes obvious:

file            format  type  num_seqs        sum_len  min_len  avg_len  max_len    N50  Q20(%)  Q30(%)  GC(%)
reads.fastq.gz  FASTQ   DNA  1,287,443  1,799,512,043       57    1,398   42,310  1,642    71.0    38.4  50.6

What “normal” looks like depends on the platform. For Illumina I expect a tight, fixed read length and a high Q30 (often above 85%). For the nanopore data I worked with, the read-length distribution is wide: a low min_len, a max_len in the tens of kilobases or more for DNA (less for RNA), and a real gap between mean and N50, with Q20 and Q30 lower by design. A nanopore run reporting a uniform 150 bp length, or Illumina data with a wide read-length spread, means something is mislabeled or you have a wrong file.

BAM files

BAM files store aligned reads. Most checks use samtools, and several need an index (.bai). Running them before you trust a BAM for variant calling or visualization keeps a mapping problem from being carried downstream.

# Quick structural check: readable header and expected EOF marker
samtools quickcheck file.bam

# Index (required for region queries and idxstats)
samtools index file.bam

# The overall alignment summary
samtools flagstat file.bam

samtools quickcheck is a quick, limited check: a silent exit means the file passes a structural check, not that every alignment record makes biological sense. After that, samtools flagstat is the first thing I run on any BAM:

24,563,210 + 0 in total (QC-passed reads + QC-failed reads)
24,102,145 + 0 primary
   312,044 + 0 secondary
   149,021 + 0 supplementary
23,426,589 + 0 primary mapped (97.20% : N/A)
24,102,145 + 0 paired in sequencing
22,984,510 + 0 properly paired (95.36% : N/A)
   127,813 + 0 singletons (0.53% : N/A)

The lines I actually read are primary mapped % and, for paired data, properly paired %. A mapping rate far below what I expect for the experiment points at the wrong reference, contamination, or untrimmed adapters. A rate suspiciously close to 100% when the biology says it shouldn’t be can mean the reference was effectively built from these very reads. A low properly-paired fraction next to a high mapped fraction usually means insert-size or orientation trouble. (For long-read data I lean on the overall mapped fraction and watch the supplementary count, since long reads can split across an alignment.)

Per-reference and header checks:

# Reads per reference sequence
samtools idxstats file.bam

# View the header: references, lengths, and the aligner used
samtools view -H file.bam | head
1   1000   2   0
2    500   2   0
*      0   0   1

idxstats (columns: name, length, mapped, unmapped) is a fast way to spot a contig that should have reads but doesn’t, or many records in the unmapped * row.

A natural way to count mapped reads is:

samtools view -c -F 4 file.bam     # excludes unmapped

But -F 4 counts alignment records, and one read can produce several: its primary alignment plus any secondary (0x100) and supplementary (0x800) records. On a BAM with multi-mappers or split long reads, this number is inflated. To count primary mapped reads, exclude all three:

samtools view -c -F 0x904 file.bam   # unmapped + secondary + supplementary

On a test BAM where a single read has a primary, a secondary, and a supplementary record, samtools view -c -F 4 counts all three records, while -F 0x904 counts just the one primary read. That gap is the whole point.

A common one-liner is:

samtools depth file.bam | awk '{sum+=$3; n++} END {print sum/n}'

The catch: samtools depth skips zero-coverage positions by default, so this is the mean over covered bases, not over the genome. It ignores everything your reads never reached and overstates coverage. For a genome-wide mean plus the breadth of coverage (the fraction of each reference with at least one read), use samtools coverage:

samtools coverage file.bam
#rname  startpos  endpos    numreads  covbases  coverage  meandepth  meanmapq
1       1         4641652   23104567  4641652    100.00      312.4      58.9
2       1           98472      41205     97800     99.32      298.1      57.2

coverage here is breadth (the percentage of the contig covered); meandepth is the average depth. A high mean depth with low breadth means reads are piling onto a few regions and leaving the rest bare, which is the thing a covered-positions-only average hides.

Viewing wide files without wrapping

I thought I would finish this post with a small quality of life improvement as you deal with a lot of ‘wide’ files with many columns and lot of rows. When you try to view them on a command line, they become unreadable.

# Scroll a wide file without line wrapping (q to quit, arrows to pan)
less -S file.tsv

# Same, for a gzipped file
zcat variants.vcf.gz | less -S

# Line up the columns first (split on tabs only)
column -t -s $'\t' file.tsv | less -S

less -S feels minor until you try it. Turning off line-wrap makes the structure of a file much easier to see.

A practical order

When a file lands or a step finishes, I work through it in the same order every time:

  1. Is it intact? Size looks right, and gzip -t or samtools quickcheck pass.
  2. Is it the format I think it is? The first few lines look right, and file confirms the container (gzip, not an HTML error page).
  3. Does the content make sense? Sequence counts and mapping rate all land where I’d expect.
  4. Do the files agree with each other? Sequence names (and lengths) are consistent across reference, alignment, annotation, and variants.

The time taken for running these checks is minimal compared to not catching these issues in the beginning. Together these steps catch a lot of common problems while they’re still cheap to fix, before they turn into a result you trusted and a day you can’t get back.