Python for Biological Data Analysis, Part 1: Strings and Sequence Basics

Python

Bioinformatics

Foundations

A practical introduction to Python string operations for biological sequence analysis.

Author: Bhargava Reddy Morampalli

Published: 2 April 2019

Modified: 5 February 2026

Originally published on my legacy blog in 2019. Updated for clarity and Python 3 syntax on 5 February 2026.

This post introduces a few Python basics that are immediately useful for working with DNA or RNA sequences. The examples are deliberately small so the logic is easy to follow.

Why start with strings?

In many beginner bioinformatics workflows, sequences are first handled as strings. Learning how to transform and inspect strings gives you a reliable foundation before moving to larger FASTA/FASTQ files and specialized libraries.

Core terms

str (string): an ordered sequence of characters, wrapped in single or double quotes (for example, "ATGC").
float: a numeric value with a decimal point (for example, 2.0 or 3.5).
int: a whole number with no decimal point (for example, 3).

You can also use Python as a calculator, but here we focus on sequence-style string operations.
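To make the three terms concrete, here is a minimal sketch that builds one value of each type and prints its type name. The variable names are only illustrative.

```python
# Illustrative values; the variable names are hypothetical.
sequence = "ATGC"        # str: characters wrapped in quotes
length = len(sequence)   # int: len() returns a whole number
half = length / 2        # float: in Python 3, / always returns a float

print(type(sequence).__name__)  # str
print(type(length).__name__)    # int
print(type(half).__name__)      # float
```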

Converting uppercase and lowercase letters

Use .upper() and .lower() to normalize sequence casing.

sequence = "ATTCGTACTACTGACGT"
lower_sequence = sequence.lower()
upper_sequence = sequence.upper()

print(sequence)
print(lower_sequence)
print(upper_sequence)

ATTCGTACTACTGACGT
attcgtactactgacgt
ATTCGTACTACTGACGT

Python string comparisons are case-sensitive, so normalizing case before comparing sequences avoids false mismatches between otherwise identical sequences.
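A small sketch of the kind of bug this prevents. The two example sequences are made up for illustration and differ only in case.

```python
# Two hypothetical copies of the same sequence, differing only in case.
seq_a = "atgcgt"
seq_b = "ATGCGT"

# Comparing the raw strings treats "a" and "A" as different characters.
raw_match = seq_a == seq_b

# Normalizing both sides to one case first compares the actual bases.
normalized_match = seq_a.upper() == seq_b.upper()

print(raw_match)         # False
print(normalized_match)  # True
```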

Replacing characters in a sequence

Use .replace(old, new) to substitute nucleotides or motifs. It returns a new string and leaves the original unchanged.

rna = "AUGGCUAACUGGUCAG"
cdna = rna.replace("U", "T")

print(cdna)

ATGGCTAACTGGTCAG

You can also replace longer patterns, not only single characters.
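As a sketch of replacing a longer pattern, the example below lowercases every copy of a six-base motif so it stands out. The sequence is made up for illustration. The motif GAATTC happens to be the EcoRI recognition site, but any substring works the same way.

```python
# Hypothetical sequence containing two copies of the motif GAATTC.
dna = "ATGGAATTCGCTGAATTCTAA"

# Replace every occurrence of the motif with a lowercase copy.
highlighted = dna.replace("GAATTC", "gaattc")

# An optional third argument limits how many occurrences are replaced.
first_only = dna.replace("GAATTC", "gaattc", 1)

print(highlighted)  # ATGgaattcGCTgaattcTAA
print(first_only)   # ATGgaattcGCTGAATTCTAA
print(dna)          # ATGGAATTCGCTGAATTCTAA (the original is unchanged)
```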

Extracting subsequences with slicing

Slicing uses [start:end], where start is included and end is excluded.

sequence = "ATTGCTAGC"

print(sequence[0:5])
print(sequence[6:])
print(sequence[3:7])

ATTGC
AGC
GCTA

Important reminders:

Python uses zero-based indexing.
Indices are written inside square brackets.
If start is omitted, slicing begins at the first character.
If end is omitted, slicing continues to the last character.
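The reminders above can be sketched with the same sequence. The last step, splitting the sequence into three-base chunks, goes slightly beyond this section and is included only to show one use of the start-included, end-excluded rule.

```python
sequence = "ATTGCTAGC"

first_base = sequence[0]    # zero-based: index 0 is the first character
first_three = sequence[:3]  # start omitted: begins at the first character
rest = sequence[3:]         # end omitted: runs to the last character

# Because end is excluded, consecutive slices do not overlap.
# This splits the sequence into three-base chunks (an illustrative extra).
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(first_base)   # A
print(first_three)  # ATT
print(rest)         # GCTAGC
print(codons)       # ['ATT', 'GCT', 'AGC']
```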

Counting and locating nucleotides

Use .count() to count non-overlapping occurrences of a nucleotide or motif, and .find() to get the index of its first occurrence. If there is no match, .find() returns -1.

dna = "ATGGCTTAAGCTGCAGTCGTAGCTGACGTGCA"

print("A count:", dna.count("A"))
print("T count:", dna.count("T"))
print("G count:", dna.count("G"))
print("C count:", dna.count("C"))

print("First A index:", dna.find("A"))
print("First C index:", dna.find("C"))

A count: 7
T count: 8
G count: 10
C count: 7
First A index: 0
First C index: 4
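Both methods also accept multi-character motifs. The sketch below reuses the same sequence to count and locate a motif, shows what .find() returns when nothing matches, and combines two counts into a GC fraction. The GC calculation is an illustrative extra and not part of the original example.

```python
dna = "ATGGCTTAAGCTGCAGTCGTAGCTGACGTGCA"

# Count and locate a three-base motif instead of a single nucleotide.
motif_count = dna.count("GCT")
first_motif = dna.find("GCT")

# .find() returns -1 when the substring is absent.
missing = dna.find("N")

# Illustrative extra: fraction of bases that are G or C.
gc_fraction = (dna.count("G") + dna.count("C")) / len(dna)

print("GCT count:", motif_count)        # GCT count: 3
print("First GCT index:", first_motif)  # First GCT index: 3
print("Index of N:", missing)           # Index of N: -1
print("GC fraction:", gc_fraction)      # GC fraction: 0.53125
```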

Takeaways

String methods are often your first tools for sequence inspection.
Case normalization, replacement, slicing, counting, and searching cover many early tasks.
The same concepts scale to larger workflows when you later read from files and use bioinformatics libraries.