Quick Checks: CLI One-Liners for Common Bioinformatics File Types

Bioinformatics

CLI

Bash

A reference for the command-line checks you run before trusting a file, launching a pipeline, or debugging a result.

Author

Bhargava Reddy Morampalli

Published

28 May 2026

Modified

29 May 2026

Most bioinformatics mistakes happen before the pipeline runs. A truncated download, a reference that doesn’t match the reads, an annotation built against a different assembly — these rarely throw an error. The tools run to completion and hand you a result that is quietly, confidently wrong. By the time a strange number shows up three steps downstream, you’ve spent a day of compute and you’re debugging the wrong thing.

So before I trust a file, I spend thirty seconds checking it on the command line. The point isn’t to run commands for their own sake — it’s to ask a specific question of each file and know what a bad answer looks like. These are the checks I’ve built into muscle memory over years of running nanopore and bacterial-genomics pipelines: cheap to run, and worth far more than the time they cost. The commands are mostly grep, awk, file, and gzip, plus the standard bioinformatics tools samtools, bcftools, and seqkit.

The sections below start with general checks, then move through the file types I see most often. For each, the question is the same: is this file intact, is it the format I think it is, and does its content make sense?

Environment assumed

These one-liners are written for Ubuntu systems using GNU command-line tools. They are not written for macOS or BSD shells, where command names and flags can differ, especially for compressed-file helpers and checksum tools.

For reproducibility, I record the operating system and tool versions before saving the checks:

grep '^PRETTY_NAME=' /etc/os-release
samtools --version | head -n 1
bcftools --version | head -n 1
seqkit version
gzip --version | head -n 1

Before anything else: file sizes and disk space

These two belong in muscle memory.

# File sizes, human-readable
ls -lh *.fastq.gz

# How much space a directory is using
du -sh data/

-rw-r--r--  1 user  staff   1.9G  sample_A_R1.fastq.gz
-rw-r--r--  1 user  staff   1.9G  sample_A_R2.fastq.gz
-rw-r--r--  1 user  staff   2.1G  sample_B_R1.fastq.gz
-rw-r--r--  1 user  staff     0B  sample_B_R2.fastq.gz   # <- stop here

A zero-byte or suspiciously small file tells you something failed — an interrupted transfer, a job that ran out of disk, a symlink to a path that no longer exists — before you waste any time on it. In a set of files that should be about the same size, the file that does not match the others is usually the problem.

Is the file even intact?

A file can be the right size and still be corrupt or truncated. Compression makes this worse: zcat file.gz | head will happily show you the first few reads of a file whose last 200 MB never finished downloading. To catch that, you have to read the file all the way to the end.

# Is this actually a gzip file, or just named like one?
file reads.fastq.gz

# Test the entire compressed stream for integrity (reads to EOF)
gzip -t reads.fastq.gz && echo "OK"

# If checksums came with the download, verify those too
sha256sum -c SHA256SUMS

reads.fastq.gz: gzip compressed data, was "reads.fastq", from Unix
OK

file inspects the magic bytes, so it catches the classic case of a download that returned an HTML error page saved under a .fastq.gz name. gzip -t decompresses the whole thing without writing anything out; if it returns an error, the archive is truncated or corrupt. If a provider gives you checksums, verify them as well; a readable gzip file is not always the exact file you meant to download. For BAMs, a similar first check is samtools quickcheck (below), which checks for a readable header and the expected end-of-file marker.

Going deeper: why “it opened fine” isn’t enough

A truncated gzip often decompresses correctly right up to the break, so a head or a preview in a viewer looks perfectly healthy. The corruption only surfaces when a tool tries to read past the truncation point — usually deep into your run. gzip -t catches this by reading the stream to the end; samtools quickcheck does the matching BAM check by looking for the expected header and end marker.

Do the files agree with each other?

This is the check that catches the most expensive mistakes, and almost nobody runs it: do your reference, alignment, annotation, and variants actually use the same names for the same sequences? The chromosome that one file calls 1, another calls chr1, a third calls Chr1, and a fourth calls NC_000001.11. No tool errors on this. They simply find no overlap and return empty output, “no reads in region,” or zero variants — results that look like biology when they’re really a string mismatch.

Pull the sequence names from each file and compare the sets directly:

# Reference FASTA
grep "^>" ref.fa | sed 's/>//; s/ .*//' | sort > fa.names

# BAM header (@SQ lines)
samtools view -H aln.bam \
  | awk -F '\t' '$1=="@SQ"{for (i=1; i<=NF; i++) if ($i ~ /^SN:/) {sub(/^SN:/, "", $i); print $i}}' \
  | sort > bam.names

# Annotation
grep -v "^#" ann.gff | cut -f1 | sort -u > gff.names

# Variants
grep -v "^#" var.vcf | cut -f1 | sort -u > vcf.names

# Compare any two sets
diff fa.names bam.names
comm -3 fa.names gff.names      # lines unique to each side

$ comm -3 fa.names gff.names
        chr1
1

That output is the whole problem in two lines: the FASTA says 1, the GFF says chr1. Caught here, it’s a one-line fix with sed; caught downstream, it’s a day of wondering why a chromosome has no annotated genes.

Going deeper: same name, different sequence

Matching names aren’t quite enough — the lengths have to agree too. Compare the @SQ lengths in the BAM header against the reference:

samtools view -H aln.bam \
  | awk -F '\t' '$1=="@SQ"{name=len=""; for (i=1; i<=NF; i++) {if ($i ~ /^SN:/) name=substr($i,4); if ($i ~ /^LN:/) len=substr($i,4)}; if (name!="") print name "\t" len}'

If a contig shares a name across files but reports a different length, you’re mixing assembly versions. Coordinates won’t line up, and every position-based lookup is off by however much the assemblies diverge — a harder problem to spot than a name mismatch, because nothing looks empty; it’s just wrong.

FASTA files

FASTA files hold genome assemblies, reference sequences, and transcript sets. The > lines are headers; everything else is sequence.

# Count sequences (contigs, chromosomes, transcripts)
grep -c "^>" assembly.fa

# Skim the headers
grep "^>" assembly.fa | head

# Full summary: counts, length distribution, N50, GC
seqkit stats -a assembly.fa

I use grep -c "^>" rather than grep -c ">" so a stray > inside a header line can’t make the count too high. But the command I actually reach for is seqkit stats -a, which gives the whole shape of an assembly at once:

file         format  type  num_seqs    sum_len  min_len    avg_len    max_len        N50  GC(%)
assembly.fa  FASTA   DNA          3  5,012,345    4,521  1,670,781  4,612,310  4,612,310  50.79

For a finished bacterial genome this is exactly what I’d expect: a handful of sequences, a total length in the right ballpark for the organism, and an N50 equal to the chromosome because most of the assembly is one contiguous piece. The failure modes are just as easy to read — hundreds of sequences with an N50 in the kilobases means a fragmented assembly, and a sum_len that’s half or double what the organism should be points at a mis-assembly or contamination before you ever map a read to it.

FASTQ files

FASTQ files store raw reads. In the common unwrapped form, each read is four lines: header, sequence, +, and quality string.

# Count reads in an uncompressed FASTQ
wc -l file.fastq | awk '{print $1/4}'

# Count reads in a gzipped FASTQ
zcat file.fastq.gz | wc -l | awk '{print $1/4}'

# Peek at the first two reads
head -8 file.fastq

# Full stats, straight off the gzipped file
seqkit stats -a reads.fastq.gz

The wc -l divided by four trick is a useful quick check for conventional unwrapped FASTQ. If that count doesn’t divide evenly, the file is truncated or malformed — which is the same problem gzip -t catches from the other direction. For format-aware counting, use seqkit stats.

seqkit stats -a is where the sequencing platform becomes obvious:

file            format  type  num_seqs        sum_len  min_len  avg_len  max_len    N50  Q20(%)  Q30(%)  GC(%)
reads.fastq.gz  FASTQ   DNA  1,287,443  1,799,512,043       57    1,398   42,310  1,642    71.0    38.4  50.6

What “normal” looks like depends entirely on the platform. For Illumina I expect a tight, fixed read length and high Q30 (often >85%). For the nanopore data I work with, the read-length distribution is wide — a low min_len, a max_len in the tens of kilobases, and a meaningful gap between mean and N50 — and the Q20/Q30 figures sit lower by design. A nanopore run reporting a uniform 150 bp length, or Illumina data with a wide read-length distribution, means something upstream is mislabeled or the wrong file got handed to me.

BAM files

BAM files store aligned reads. Most checks use samtools, and several need an index (.bai). Running these before you trust a BAM for variant calling or visualization keeps a mapping problem from being carried into everything downstream.

# Quick structural check: readable header and expected EOF marker
samtools quickcheck file.bam

# Index (required for region queries and idxstats)
samtools index file.bam

# The overall alignment summary
samtools flagstat file.bam

samtools quickcheck is a quick, limited check: a silent exit means the file passes that structural check, not that every alignment record is biologically sensible. samtools flagstat is the first thing I run after that on any BAM:

24,563,210 + 0 in total (QC-passed reads + QC-failed reads)
24,102,145 + 0 primary
   312,044 + 0 secondary
   149,021 + 0 supplementary
23,426,589 + 0 primary mapped (97.20% : N/A)
24,102,145 + 0 paired in sequencing
22,984,510 + 0 properly paired (95.36% : N/A)
   127,813 + 0 singletons (0.53% : N/A)

The lines I actually read are primary mapped % and, for paired data, properly paired %. A mapping rate far below what I expect for the experiment points at the wrong reference, contamination, or untrimmed adapters. A rate suspiciously close to 100% when biology says it shouldn’t be can mean the reference was effectively built from these very reads. A low properly-paired fraction alongside a high mapped fraction usually means insert-size or orientation trouble. (For long-read data, I lean on the overall mapped fraction and watch the supplementary count, since long reads legitimately split across an alignment.)

Per-reference and depth checks:

# Reads per reference sequence
samtools idxstats file.bam

# View the header: references, lengths, and the aligner used
samtools view -H file.bam | head

1   1000   2   0
2    500   2   0
*      0   0   1

idxstats (columns: name, length, mapped, unmapped) is a fast way to spot a contig that should have reads but doesn’t, or many records in the unmapped * row.

Going deeper: you’re counting records, not reads

A natural way to count mapped reads is:

samtools view -c -F 4 file.bam     # excludes unmapped

But -F 4 counts alignment records, and one read can produce several — its primary alignment plus any secondary (0x100) and supplementary (0x800) records. On a BAM with multi-mappers or split long reads, this number is inflated. To count primary mapped reads, exclude all three:

samtools view -c -F 0x904 file.bam   # unmapped + secondary + supplementary

On a small test BAM where one read has a primary, a secondary, and a supplementary alignment, -F 4 returns 4 while -F 0x904 returns 2. That gap is the whole point.

Going deeper: “mean depth” usually isn’t what you mean

A common one-liner is:

samtools depth file.bam | awk '{sum+=$3; n++} END {print sum/n}'

The catch: samtools depth skips zero-coverage positions by default, so this is the mean over covered bases, not across the genome — it silently ignores everything your reads never reached and overstates coverage. For a genome-wide mean plus the breadth of coverage (the fraction of each reference with at least one read), use samtools coverage if your installed samtools includes it:

samtools coverage file.bam

#rname  startpos  endpos    numreads  covbases  coverage  meandepth  meanmapq
1       1         4641652   23104567  4641652    100.00      312.4      58.9
2       1           98472      41205     97800     99.32      298.1      57.2

coverage here is breadth (% of the contig covered); meandepth is the average depth. A high mean depth with low breadth means reads are piling onto a few regions and leaving the rest bare — exactly the thing a covered-positions-only average hides.

GFF and GTF files

Annotation files describe a genome’s features: genes, transcripts, exons, and so on.

# Header / metadata lines
grep "^#" annotation.gff

# Count each feature type
grep -v "^#" annotation.gff | cut -f3 | sort | uniq -c | sort -rn

# Which sequences are annotated
grep -v "^#" annotation.gff | cut -f1 | sort -u

  4218 exon
  3902 CDS
  3811 mRNA
  3799 gene
   112 tRNA
    22 rRNA

The feature-type tally is the first thing I look at with an unfamiliar annotation: it tells me immediately whether the file carries exon-level detail or only gene-level entries, and roughly whether the gene count fits the organism. It also shows which naming style the file uses — Ensembl, NCBI/RefSeq, and Prokka don’t name or nest features identically (gene vs mRNA vs transcript, the presence of CDS), and a downstream tool that expects one style will quietly do nothing with another.

VCF files

VCF files store variant calls: SNPs, indels, and structural variants.

# Count variant records (excluding the header)
grep -v "^#" variants.vcf | wc -l

# List the samples
bcftools query -l variants.vcf      # or: grep "^#CHROM" variants.vcf

# Rough SNP vs indel split
grep -v "^#" variants.vcf \
  | awk '{print $4, $5}' \
  | awk '{if (length($1)==length($2)) print "SNP"; else print "INDEL"}' \
  | sort | uniq -c

These are quick checks: total count, who’s in the file, the rough balance of substitutions to indels. They’re fine for a sanity check and nothing more.

Going deeper: where the quick VCF split breaks

That awk classifier compares the length of REF to ALT as plain strings, so it gets multiallelic and symbolic sites wrong. A record like C → G,T has a three-character ALT field, so it’s tallied as an indel even though it’s two SNPs; symbolic alleles like <DEL> are misread the same way. On a tiny test VCF with one SNP, one true indel, and one multiallelic SNP site, the one-liner reports two indels and one SNP — one of those “indels” is the multiallelic row.

For anything you’ll quote or act on, let a format-aware tool do the counting:

bcftools stats variants.vcf

It splits SNPs, MNPs, indels, and multiallelic sites correctly, and reports ts/tv and per-sample breakdowns. While you’re at it, confirm the file is sorted and indexed (bcftools index / tabix -p vcf) before any region query — an unsorted or unindexed VCF will fail or, worse, silently return nothing for a region.

Viewing wide files without wrapping

Many of these files have very long lines that turn into unreadable mush in a default terminal.

# Scroll a wide file without line wrapping (q to quit, arrows to pan)
less -S file.tsv

# Same, for a gzipped file
zcat variants.vcf.gz | less -S

# Line up the columns first
column -t file.tsv | less -S

less -S is one of those underrated flags that feels minor until you try to read idxstats or a VCF without it. Turning off line-wrap makes the structure of a file much easier to read.

A practical order

When a file lands or a step finishes, I work through it in the same order every time:

Is it intact? Size looks right, and gzip -t / samtools quickcheck pass.
Is it the format I think it is? file agrees, and the first few lines look right.
Does the content make sense? Sequence counts, mapping rate, feature types, variant counts all land where I’d expect.
Do the files agree with each other? Sequence names — and lengths — are consistent across reference, alignment, annotation, and variants.

None of this takes more than a minute, and I keep the version information with the results so the numbers mean something later. Together these steps catch many common problems while they’re still cheap — before they turn into a result you trusted and a day you can’t get back.