Python for Biological Data Analysis, Part 1: Strings and Sequence Basics
Originally published on my legacy blog in 2019. Updated for clarity and Python 3 syntax on 5 February 2026.
This post introduces a few Python basics that are immediately useful for working with DNA or RNA sequences. The examples are deliberately small so the logic is easy to follow.
Why start with strings?
In many beginner bioinformatics workflows, sequences are first handled as strings. Learning how to transform and inspect strings gives you a reliable foundation before moving to larger FASTA/FASTQ files and specialized libraries.
Core terms
- String: an ordered sequence of characters, wrapped in quotes.
- float: a numeric value with a decimal point (for example, 2.0 or 3.5).
- int: a whole number with no decimal point.
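If the three terms feel abstract, a quick check with type() may help. This is only a small sketch; the variable names codon, codon_length, and half_length are made up for illustration.

```python
# A hypothetical three-letter sequence stored as a string.
codon = "ATG"

# len() returns a whole number (int).
codon_length = len(codon)

# The / operator always returns a float in Python 3, even for whole results.
half_length = codon_length / 2

# type(...).__name__ gives the short name of each value's type.
print(type(codon).__name__)
print(type(codon_length).__name__)
print(type(half_length).__name__)
```

The three print calls report the type names str, int, and float, in that order.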
You can also use Python as a calculator, but here we focus on sequence-style string operations.
Converting uppercase and lowercase letters
Use .upper() and .lower() to normalize sequence casing.
sequence = "ATTCGTACTACTGACGT"
lower_sequence = sequence.lower()
upper_sequence = sequence.upper()
print(sequence)
print(lower_sequence)
print(upper_sequence)
Output:
ATTCGTACTACTGACGT
attcgtactactgacgt
ATTCGTACTACTGACGT
Consistent case helps avoid subtle bugs when comparing sequences.
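To make the casing point concrete, here is a minimal sketch. The names read_a and read_b are invented for this example. Python compares strings case-sensitively, so the same bases in different case count as different until both sides are normalized.

```python
# Two hypothetical reads with the same bases but different casing.
read_a = "ATTCGTAC"
read_b = "attcgtac"

# A direct comparison is case-sensitive, so this evaluates to False.
raw_match = read_a == read_b

# Normalizing both sides to uppercase first makes the comparison succeed.
normalized_match = read_a.upper() == read_b.upper()

print(raw_match)
print(normalized_match)
```

Normalizing once, right after reading a sequence, is usually simpler than remembering to do it at every comparison.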
Replacing characters in a sequence
Use .replace(old, new) to substitute nucleotides or motifs.
rna = "AUGGCUAACUGGUCAG"
cdna = rna.replace("U", "T")
print(cdna)
Output:
ATGGCTAACTGGTCAG
You can also replace longer patterns, not only single characters.
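As a sketch of replacing a longer pattern, the example below masks a six-base motif with N characters. The sequence and the choice of GAATTC as the motif are assumptions made only for illustration. The optional third argument to .replace() limits how many occurrences are replaced.

```python
# A made-up sequence that contains the motif GAATTC twice.
sequence = "ATGGAATTCGCTGAATTCTAA"
motif = "GAATTC"

# Replace every occurrence of the motif with a run of N characters.
masked_all = sequence.replace(motif, "N" * len(motif))

# Passing a count of 1 replaces only the first occurrence.
masked_first = sequence.replace(motif, "N" * len(motif), 1)

# Strings are immutable, so the original sequence is left unchanged.
print(masked_all)
print(masked_first)
print(sequence)
```

Because .replace() returns a new string, you need to assign the result to a variable if you want to keep it.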
Extracting subsequences with slicing
Slicing uses [start:end], where start is included and end is excluded.
sequence = "ATTGCTAGC"
print(sequence[0:5])
print(sequence[6:])
print(sequence[3:7])
Output:
ATTGC
AGC
GCTA
Important reminders:
- Python uses zero-based indexing.
- Indices are written inside square brackets.
- If start is omitted, slicing begins at the first character.
- If end is omitted, slicing continues to the last character.
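The reminders above can be sketched with the same nine-base sequence. The final step, splitting the sequence into three-base chunks, is an extra illustration that is not part of the original example. It assumes you want to start reading at the first base.

```python
sequence = "ATTGCTAGC"

# Omitting start: the slice begins at index 0.
first_three = sequence[:3]

# Omitting end: the slice runs through the last character.
rest = sequence[3:]

# Step through the sequence three bases at a time (indices 0, 3, 6).
chunks = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

print(first_three)
print(rest)
print(chunks)
```

Note that a slice that runs past the end of the string does not raise an error; Python simply returns whatever characters are available.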
Counting and locating nucleotides
Use .count() to count how often a nucleotide occurs and .find() to locate the index of its first occurrence. If there is no match, .find() returns -1.
dna = "ATGGCTTAAGCTGCAGTCGTAGCTGACGTGCA"
print("A count:", dna.count("A"))
print("T count:", dna.count("T"))
print("G count:", dna.count("G"))
print("C count:", dna.count("C"))
print("First A index:", dna.find("A"))
print("First C index:", dna.find("C"))
Output:
A count: 7
T count: 8
G count: 10
C count: 7
First A index: 0
First C index: 4
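One common way to combine these counts is a GC fraction, sketched below with the same dna string. The names gc_count, gc_fraction, and missing_index are invented for this example, and the code assumes the sequence is uppercase and not empty.

```python
dna = "ATGGCTTAAGCTGCAGTCGTAGCTGACGTGCA"

# Add the G and C counts, then divide by the total length.
gc_count = dna.count("G") + dna.count("C")
gc_fraction = gc_count / len(dna)

# .find() returns -1 when the character does not occur at all.
missing_index = dna.find("N")

print("GC count:", gc_count)
print("GC fraction:", gc_fraction)
print("Index of N:", missing_index)
```

Checking for -1 before using the result of .find() avoids accidentally treating "not found" as a valid position.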
Takeaways
- String methods are often your first tools for sequence inspection.
- Case normalization, replacement, slicing, counting, and searching cover many early tasks.
- The same concepts scale to larger workflows when you later read from files and use bioinformatics libraries.