Creating nf-core-style Nextflow modules
I have been spending some time cleaning up local Nextflow modules in one of my pipelines, including moving version capture over to topic channels. I thought I would write a post about making a module that meets nf-core standards, as a tutorial for people who are starting to create modules locally. These standards make modules easier to inspect, test, reuse, and debug.
A good module is not just a process that happens to run. It tells you what it consumes, what it emits, which tool version produced the output, how to test it in stub mode, and how to check that all of this still works after a future change.
I wrote this post in a way I think I would have liked when I was learning. This is in no way a replacement for the nf-core docs, and I have to add a caveat here at the beginning that the details will keep moving as Nextflow and nf-core evolve. It is a practical guide to the current style as of June 2026, based on the local modules I have just been modernising.
The short version is this:
- start from the nf-core module template when you can;
- keep main.nf boring and explicit;
- make meta.yml describe the actual channel structure;
- emit tool versions with topic: versions;
- test both real execution and stub mode;
- open the snapshot and check the version tuples yourself.
That last step matters more than it sounds.
I am assuming a local Nextflow pipeline repository with modules under modules/local/, nf-test available, and a recent Nextflow version. The examples use local modules, but the same habits apply when writing modules for a shared nf-core-style codebase.
Start from the template when possible
If you are creating a brand-new module, do not begin with a blank file unless you have a very good reason. Let nf-core/tools generate the boring parts first:
nf-core modules create fastqc --author @your-github-handle --label process_low --meta
For a tool/subcommand-style module, use the tool path:
nf-core modules create samtools/depth --author @your-github-handle --label process_low --meta
The command creates the module scaffold and prompts for the remaining details. If it finds a Bioconda entry, it can fill in some software/container information. If it finds a bio.tools entry, it may also suggest inputs, outputs, and EDAM ontology terms.
Do not treat those guesses as truth. Treat them as a useful first draft.
After generation, open the three generated files in your editor and read them for accuracy and completeness:
- main.nf
- meta.yml
- tests/main.nf.test
Check the command, the declared outputs, the metadata, and the test together. Most mistakes happen when those files drift away from each other.
The shape of a small module
A simpler example from one of my pipelines is a depth module based on samtools depth. It has:
- one input tuple: sample metadata plus a BAM file;
- one file output: a depth text file;
- one version output: the samtools version;
- one stub section that creates the expected output file without running the real tool.
I will go through the process section by section before showing the whole thing.
Process header
The process name is uppercase, the tag is useful in logs, and the label maps to your pipeline resource configuration:
process SAMTOOLS_DEPTH {
tag "$meta.id"
label 'process_low'
For most local modules, tag "$meta.id" is enough. If the module takes a reference or a method name, add that only if it makes the trace easier to read.
Software environment
Pin the tool in Conda and, when you know the image, pin the container too:
conda "bioconda::samtools=1.17"
container "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"
Inputs
Most nf-core-style modules pass sample information as a meta map in a tuple:
input:
tuple val(meta), path(bam)
That tuple shape is important because the modern meta.yml format mirrors the channel grouping. A tuple in main.nf becomes a list in meta.yml. A single value channel is not wrapped in the same way. This is one of those small formatting details that makes metadata very readable.
Outputs
The normal output gets an emit: name:
output:
tuple val(meta), path("*.txt"), emit: depth
Use a name that describes the channel, not the file extension. depth, bam, index, report, plot, and summary are easier to reason about later than out or result.
Version output with topic channels
Capturing the tool version is one of the most important steps for reproducibility, because it records which software produced each output. The version output should be a tuple of the process name, the tool name, and the version string. Give it an emit name that is unique within the module, and also send it to the shared versions topic:
tuple val("${task.process}"),
val('samtools'),
eval('(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d " " -f 2'),
emit: versions_samtools,
topic: versions
There are two separate ideas here:
- emit: versions_samtools gives the process output a unique name for nf-test snapshots and module metadata.
- topic: versions sends the same value into the shared versions topic so the pipeline can collect all tool versions centrally.
In a pipeline, this usually becomes something like:
ch_versions = Channel.topic('versions')
Topic channels are useful because many processes can send values to the same topic without wiring a long chain of mix() calls. The important caveat is that a process should not both consume from a topic and emit to that same topic, because that can create a pipeline that waits forever.
For modules with more than one tool, emit one version channel per tool:
tuple val("${task.process}"), val('python'), eval('(python --version 2>/dev/null || echo unknown) | sed "s/Python //"'), emit: versions_python, topic: versions
tuple val("${task.process}"), val('pandas'), eval('python -c "import pandas; print(pandas.__version__)" 2>/dev/null || echo unknown'), emit: versions_pandas, topic: versions
That is what we did for modules such as coverage_plot, where the process uses Python plus plotting libraries. One process can emit several version tuples to the same versions topic.
The version expression should print one stable value. Some tools print multi-line banners, write versions to stderr, or print extra dependency information. Normalise that in the expression. For samtools, the full samtools --version output is too much for a clean tuple, so the expression keeps only the first line and extracts the version number.
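To make that normalisation concrete, here is a minimal Python sketch of what the `head -n 1 | cut -d " " -f 2` part of the expression does to a banner. The sample banner is an illustrative assumption modelled on the usual first lines of `samtools --version`, and `normalise_version` is a hypothetical helper, not part of Nextflow or nf-core.

```python
# Illustrative banner (an assumption), shortened to the first two lines.
SAMPLE_BANNER = "samtools 1.17\nUsing htslib 1.17\n"


def normalise_version(banner: str) -> str:
    """Mimic `head -n 1 | cut -d " " -f 2` on a version banner."""
    lines = banner.splitlines()
    if not lines:
        return ""  # no output at all, so there is nothing to extract
    first_line = lines[0]           # head -n 1
    fields = first_line.split(" ")  # cut -d " "
    if len(fields) == 1:
        # cut passes a line through unchanged when it contains no delimiter,
        # which is how the "unknown" fallback survives the pipeline.
        return first_line
    return fields[1]                # -f 2


print(normalise_version(SAMPLE_BANNER))
print(normalise_version("unknown"))
```

The second call shows why the `echo unknown` fallback still produces a usable single value: a line without spaces goes through `cut` untouched.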
when:
Most modules should respect task.ext.when:
when:
task.ext.when == null || task.ext.when
This lets the pipeline turn a module on or off with process configuration without adding branching logic inside the module itself.
Script block
The script should be readable from left to right:
script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
"""
samtools depth $args -a $bam > ${prefix}.txt
"""
I like using task.ext.args even when the first version of the module does not need extra options. It gives you one obvious place for process-specific arguments later. task.ext.prefix is similarly useful for making filenames predictable while still allowing a workflow to override them.
Stub block
Stub mode should create the same output paths, just without running the real tool:
stub:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
touch ${prefix}.txt
"""
}
This is essential for testing. nf-core module tests must include a stub test, and stub tests are often the difference between a module that can be checked easily in CI and a module that only gets tested when someone has all the real dependencies and input data.
The stub does not need to produce meaningful data. It needs to prove that the process wiring, filenames, channels, and version outputs still exist.
The whole main.nf
Putting the pieces together:
process SAMTOOLS_DEPTH {
    tag "$meta.id"
    label 'process_low'

    conda "bioconda::samtools=1.17"
    container "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"

    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("*.txt"), emit: depth
    tuple val("${task.process}"), val('samtools'), eval('(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d " " -f 2'), emit: versions_samtools, topic: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    samtools depth $args -a $bam > ${prefix}.txt
    """

    stub:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.txt
    """
}
Writing meta.yml
The modern meta.yml file should describe the channel structure in the module.
Start with the module identity:
name: "samtools_depth"
description: Calculate per-base depth coverage from BAM files
keywords:
  - depth
  - coverage
  - bam
  - samtools
Keep the description factual. Keywords should be specific enough to help someone find the module later.
Tools
Write one entry per real tool. Use homepage, documentation, doi, licence, and tool_dev_url when you know them. Do not invent bio.tools IDs or ontology terms just to make the file look complete.
tools:
  - samtools:
      description: |
        SAMtools is a suite of programs for interacting with high-throughput sequencing data.
      homepage: http://www.htslib.org/
      documentation: https://www.htslib.org/doc/samtools-depth.html
      doi: 10.1093/bioinformatics/btp352
      licence: ["MIT"]
For a Python plotting module, it is fine to list Python and the libraries that matter:
tools:
  - python:
      description: Python programming language
      homepage: https://www.python.org/
      documentation: https://docs.python.org/3/
      tool_dev_url: https://github.com/python/cpython
      licence: ["PSF-2.0"]
  - pandas:
      description: Python library providing data structures and data analysis tools
      homepage: https://pandas.pydata.org/
      documentation: https://pandas.pydata.org/docs/
      tool_dev_url: https://github.com/pandas-dev/pandas
      licence: ["BSD-3-Clause"]
The rule is not “list every import in the script”. The rule is “list the software whose versions explain the output”.
Inputs in grouped form
This is the modern grouped structure for one tuple channel:
input:
  - - meta:
        type: map
        description: |
          Groovy Map containing sample information
          e.g. `[ id:'sample1', single_end:false ]`
    - bam:
        type: file
        description: BAM file for depth calculation
        pattern: "*.bam"
        ontologies:
          - edam: "http://edamontology.org/format_2572" # BAM
Notice the double list:
input:
  - - meta:
    - bam:
That means “one channel, and this channel is a tuple containing meta and bam.” It looks odd until you get used to it, but it matches the channel shape exactly.
Outputs keyed by emit:
Outputs are now keyed by the emit: names from main.nf:
output:
  depth:
    - - meta:
          type: map
          description: |
            Groovy Map containing sample information
            e.g. `[ id:'sample1', single_end:false ]`
      - "*.txt":
          type: file
          description: Per-base depth coverage file
          pattern: "*.txt"
          ontologies:
            - edam: "http://edamontology.org/format_3475" # TSV
This is much easier to review than an ungrouped list of files. You can compare the output block directly against main.nf:
tuple val(meta), path("*.txt"), emit: depth
If the emit: name is depth, the metadata key should be depth.
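That comparison is easy to automate for a quick local check. The following is a minimal sketch, not an nf-core tool: `emit_names` and `compare_keys` are hypothetical helpers, the regular expression assumes the `emit: name` style used in this post, and the metadata keys are passed in as a plain list on the assumption that you have already read them from meta.yml with whatever YAML parser you use. The version expression is elided in the sample because only the emit names matter here.

```python
import re

# Output block from the SAMTOOLS_DEPTH example, with the eval expression elided.
MAIN_NF_OUTPUTS = '''
output:
tuple val(meta), path("*.txt"), emit: depth
tuple val("${task.process}"), val('samtools'), eval('...'), emit: versions_samtools, topic: versions
'''


def emit_names(main_nf_text: str) -> set[str]:
    """Collect every name that follows `emit:` in the module text."""
    return set(re.findall(r"\bemit:\s*(\w+)", main_nf_text))


def compare_keys(main_nf_text: str, meta_output_keys: list[str]) -> dict[str, list[str]]:
    """Report emit names and metadata keys that do not have a partner."""
    emitted = emit_names(main_nf_text)
    documented = set(meta_output_keys)
    return {
        "missing_from_meta": sorted(emitted - documented),
        "missing_from_main": sorted(documented - emitted),
    }


# Pretend meta.yml only documents `depth` so far.
print(compare_keys(MAIN_NF_OUTPUTS, ["depth"]))
```

Both lists being empty is the state you want before moving on to the tests.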
Version outputs in metadata
Every version output gets its own output: entry:
  versions_samtools:
    - - "${task.process}":
          type: string
          description: The name of the process
      - samtools:
          type: string
          description: The name of the tool
      - "(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d \" \" -f 2":
          type: eval
          description: The expression to obtain the version of the tool
The version-capture command must exactly match the one in main.nf. Keeping the two in sync is one of the main things the metadata is there to protect, so if the command ever changes, update it in both places.
And the same tuple appears under topics.versions:
topics:
  versions:
    - - "${task.process}":
          type: string
          description: The name of the process
      - samtools:
          type: string
          description: The name of the tool
      - "(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d \" \" -f 2":
          type: eval
          description: The expression to obtain the version of the tool
This duplication feels a little fussy, but it is useful. output: documents the named process output. topics: documents what the process contributes to the shared topic channel.
For a multi-tool process, repeat this once per tool. In coverage_plot, for example, we had:
output:
  versions_python:
  versions_pandas:
  versions_matplotlib:
  versions_seaborn:
topics:
  versions:
    # python tuple
    # pandas tuple
    # matplotlib tuple
    # seaborn tuple
The snapshot should then show all four keys and all four tool names. Always open the snapshot and manually confirm that the versions are captured properly.
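Because the version expression has to be identical in main.nf and meta.yml, a small script can catch drift between the two. This is only a sketch under stated assumptions: `eval_expressions` is a hypothetical helper, the regular expression assumes a single-quoted `eval('...')` with no embedded single quotes, and `META_EXPRESSION` stands for the key as it looks after a YAML parser has already removed the `\"` escaping.

```python
import re

# The version output line from the SAMTOOLS_DEPTH example.
MAIN_NF_LINE = r'''tuple val("${task.process}"), val('samtools'), eval('(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d " " -f 2'), emit: versions_samtools, topic: versions'''

# The same expression as it would appear once meta.yml has been parsed.
META_EXPRESSION = '(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d " " -f 2'


def eval_expressions(main_nf_text: str) -> list[str]:
    """Return the contents of every single-quoted eval('...') in the text."""
    return re.findall(r"eval\('([^']*)'\)", main_nf_text)


found = eval_expressions(MAIN_NF_LINE)
if META_EXPRESSION in found:
    print("version expression matches")
else:
    print("version expression differs:", found)
```

A plain string comparison is deliberately strict here, since even a whitespace difference means the two files no longer describe the same command.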
A full meta.yml example
Here is the compact version for the samtools_depth example:
name: "samtools_depth"
description: Calculate per-base depth coverage from BAM files
keywords:
  - depth
  - coverage
  - bam
  - samtools
tools:
  - samtools:
      description: |
        SAMtools is a suite of programs for interacting with high-throughput sequencing data.
      homepage: http://www.htslib.org/
      documentation: https://www.htslib.org/doc/samtools-depth.html
      doi: 10.1093/bioinformatics/btp352
      licence: ["MIT"]
input:
  - - meta:
        type: map
        description: |
          Groovy Map containing sample information
          e.g. `[ id:'sample1', single_end:false ]`
    - bam:
        type: file
        description: BAM file for depth calculation
        pattern: "*.bam"
        ontologies:
          - edam: "http://edamontology.org/format_2572" # BAM
output:
  depth:
    - - meta:
          type: map
          description: |
            Groovy Map containing sample information
            e.g. `[ id:'sample1', single_end:false ]`
      - "*.txt":
          type: file
          description: Per-base depth coverage file
          pattern: "*.txt"
          ontologies:
            - edam: "http://edamontology.org/format_3475" # TSV
  versions_samtools:
    - - "${task.process}":
          type: string
          description: The name of the process
      - samtools:
          type: string
          description: The name of the tool
      - "(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d \" \" -f 2":
          type: eval
          description: The expression to obtain the version of the tool
topics:
  versions:
    - - "${task.process}":
          type: string
          description: The name of the process
      - samtools:
          type: string
          description: The name of the tool
      - "(samtools --version 2>/dev/null || echo unknown) | head -n 1 | cut -d \" \" -f 2":
          type: eval
          description: The expression to obtain the version of the tool
authors:
  - "@your-github-handle"
maintainers:
  - "@your-github-handle"
The real file can have more detail, but this is the skeleton I now look for.
Writing the nf-test
A module test should answer three questions:
- Does the process run?
- Are the expected outputs present?
- Are the version outputs visible in the snapshot?
For this module:
nextflow_process {

    name "Test Process SAMTOOLS_DEPTH"
    script "../main.nf"
    process "SAMTOOLS_DEPTH"

    tag "modules"
    tag "modules_local"
    tag "samtools"
    tag "samtools/depth"

    test("Should calculate per-base depth from BAM file") {
        when {
            process {
                """
                input[0] = [
                    [ id: 'test', single_end: false ],
                    file("${projectDir}/modules/local/samtools_depth/tests/data/test.bam", checkIfExists: true)
                ]
                """
            }
        }
        then {
            assertAll(
                { assert process.success },
                { assert snapshot(process.out).match() },
                { assert process.out.versions_samtools }
            )
        }
    }

    test("Should work in stub mode") {
        options "-stub"
        when {
            process {
                """
                input[0] = [
                    [ id: 'test_stub', single_end: false ],
                    file("${projectDir}/modules/local/samtools_depth/tests/data/test.bam", checkIfExists: true)
                ]
                """
            }
        }
        then {
            assertAll(
                { assert process.success },
                { assert process.out.depth },
                { assert process.out.versions_samtools }
            )
        }
    }
}
I like snapshotting process.out in the real test because it catches channel names, file fingerprints, and version tuples in one place. For stub tests, I often keep the assertions lighter if the real test already snapshots the full output.
When to sanitize snapshots
Snapshots are very good at catching accidental changes, but they can become noisy with unstable outputs:
- binary files that differ between runs;
- logs with timestamps or absolute paths;
- directories containing generated files with variable names;
- plots whose metadata changes even when the visual output is effectively the same.
In those cases, snapshot only the stable part:
def report = process.out.report.get(0).get(1)
assert snapshot(
    file(report).getName().toString(),
    process.out.versions_python
).match()
Or collect stable files and unstable filenames separately for directory outputs:
def stableFiles = []
def unstableNames = []
file(process.out.results.get(0).get(1)).eachFileRecurse { f ->
    if (!f.isDirectory() && f.getName().endsWith(".tsv")) {
        stableFiles.add(f)
    }
    if (!f.isDirectory() && f.getName().endsWith(".log")) {
        unstableNames.add(f.getName().toString())
    }
}
assert snapshot(
    stableFiles,
    unstableNames,
    process.out.versions_tool
).match()
The key point: sanitize unstable files if needed, but keep version outputs snapshotted normally. Version tuples are one of the main things the snapshot is there to protect.
Module-local test config
Sometimes a module needs small test-only configuration. For example, a stub-only test may not need to solve a difficult Conda environment, or a test may need ext.args.
Use a tiny tests/nextflow.config:
process {
    withName: 'SAMTOOLS_DEPTH' {
        ext.args = { params.module_args ?: '' }
        ext.prefix = { "${meta.id}" }
    }
}
Then include it only in the test that needs it:
test("Should calculate per-base depth from BAM file") {
    config "modules/local/samtools_depth/tests/nextflow.config"
    when {
        params {
            module_args = "-q 10"
        }
        process {
            """
            input[0] = [
                [ id: 'test', single_end: false ],
                file("${projectDir}/modules/local/samtools_depth/tests/data/test.bam", checkIfExists: true)
            ]
            """
        }
    }
}
Keep this file small. It is for module test configuration, not a second pipeline config.
Run the tests
The normal rhythm is:
# run the test for the first time
nf-test test modules/local/samtools_depth/tests/main.nf.test
# update the snapshot after changing the module
nf-test test modules/local/samtools_depth/tests/main.nf.test --update-snapshot
Then open the snapshot file, tests/main.nf.test.snap, and read it. After updating the snapshot, run the test again without the --update-snapshot flag to confirm that it passes against the stored snapshot.
For samtools_depth, the important part looks like this:
{
    "versions_samtools": [
        [
            "SAMTOOLS_DEPTH",
            "samtools",
            "1.17"
        ]
    ]
}
That tells you the emitted channel is named correctly and the tool version tuple is actually reaching the test output.
For a module with four tools, I want to see four keys:
versions_python
versions_pandas
versions_matplotlib
versions_seaborn
and four tool names in the tuples:
python
pandas
matplotlib
seaborn
If the test passes but the snapshot does not contain those keys and tool names, the module is not done until that is fixed.
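Reading the snapshot by eye is the main point, but the same check can be sketched in a few lines of Python as a second pair of eyes. This assumes you have already pulled out the part of the snapshot that maps channel names to version tuples, in the shape shown above; how that fragment is nested inside the real .snap file is not handled here, and `version_problems` is a hypothetical helper. It also flags the `unknown` fallback, which would otherwise let a broken version command pass quietly.

```python
import json

# The fragment shown above for samtools_depth, parsed into a dict.
SNAPSHOT_FRAGMENT = json.loads("""
{
    "versions_samtools": [
        ["SAMTOOLS_DEPTH", "samtools", "1.17"]
    ]
}
""")


def version_problems(process_out: dict, expected_tools: list[str]) -> list[str]:
    """List every expected tool whose version tuple is missing or suspicious."""
    problems = []
    for tool in expected_tools:
        key = f"versions_{tool}"
        tuples = process_out.get(key, [])
        if not tuples:
            problems.append(f"{key}: missing or empty")
            continue
        for _process_name, tool_name, version in tuples:
            if tool_name != tool:
                problems.append(f"{key}: tool name is {tool_name!r}, expected {tool!r}")
            if not version or version == "unknown":
                problems.append(f"{key}: version is {version!r}")
    return problems


print(version_problems(SNAPSHOT_FRAGMENT, ["samtools"]))
print(version_problems(SNAPSHOT_FRAGMENT, ["samtools", "python"]))
```

For the four-tool module, the expected list would be the four tool names above, and an empty result means every key and name is present.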
Linting
The ideal final check is nf-core module linting:
nf-core modules lint samtools/depth --dir . --local
or for a local module name:
nf-core modules lint samtools_depth --dir . --local
In a fully recognised nf-core pipeline or modules repository, this checks that meta.yml exists, validates against the module schema, and agrees with main.nf inputs, outputs, and topics.
The final checklist I use
For one module, I now check this in order:
main.nf
[ ] process name matches the module name
[ ] input tuples are explicit
[ ] every output has a useful emit name
[ ] every version output has emit: versions_<tool>
[ ] every version output also has topic: versions
[ ] task.ext.when is respected
[ ] script block uses task.ext.prefix, and task.ext.args when useful
[ ] stub block creates the expected output paths
meta.yml
[ ] input channels are grouped by channel shape
[ ] output map is keyed by emit names
[ ] versions_<tool> outputs are documented
[ ] topics.versions documents the same version tuples
[ ] EDAM ontology terms are included where obvious and verified
[ ] authors and maintainers are present
tests
[ ] real test succeeds, unless a real blocker is documented
[ ] stub test succeeds
[ ] snapshot contains versions_<tool> keys
[ ] snapshot contains the expected tool names and version values
[ ] unstable outputs are sanitized only as much as needed
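A few of the main.nf items in this checklist are mechanical enough to script. The sketch below is a rough text-level check, not a replacement for nf-core linting: `check_main_nf` is a hypothetical helper, it assumes `topic: versions` directly follows `emit: versions_<tool>` as in the examples in this post, and the sample module text is abbreviated for illustration.

```python
import re

# Abbreviated module text (an illustrative assumption, not a full main.nf).
GOOD_SAMPLE = '''
output:
tuple val(meta), path("*.txt"), emit: depth
tuple val("${task.process}"), val('samtools'), eval('samtools --version'),
    emit: versions_samtools,
    topic: versions

when:
task.ext.when == null || task.ext.when

stub:
touch out.txt
'''


def check_main_nf(text: str) -> list[str]:
    """Return a list of checklist items that the module text fails."""
    problems = []
    version_emits = re.findall(r"emit:\s*(versions_\w+)", text)
    # \s also matches newlines, so one-line and multi-line layouts both work.
    with_topic = set(
        re.findall(r"emit:\s*(versions_\w+)\s*,\s*topic:\s*versions\b", text)
    )
    if not version_emits:
        problems.append("no versions_<tool> output found")
    for name in sorted(set(version_emits) - with_topic):
        problems.append(f"{name} has no topic: versions")
    if "task.ext.when" not in text:
        problems.append("task.ext.when is not referenced")
    if not re.search(r"^\s*stub:\s*$", text, flags=re.MULTILINE):
        problems.append("no stub: block found")
    return problems


print(check_main_nf(GOOD_SAMPLE))
```

An empty list only means these particular text patterns are present. The tests and the snapshot are still what show that the module works.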
References
These are the docs I keep open while doing this work: