Needleman-Wunsch algorithm in Ruby
While at university I had the pleasure of studying bioinformatics. Bioinformatics is all about using the processing power of computers to lean about and infer data from DNA and protein sequences.
For example if you want to find the common parts of a DNA sequence from 500 different sequences and each sequence is 50000 bases long it will take some time to compare them all and you are bound to make some mistakes.
This is where bioinformatics comes in and uses tried and tested computer science principles to make these large tasks easier, more accurate and faster. One example is the Needleman-Wunsch algorithm which is used to score a string of bases against another.
The course I was on covered this and it also came up in the exam. Although no code implementation of the algorithm was provided I found making a working copy very helpful in understanding it as well as getting some coding practice.
My inital version was in Java but I re-wrote it in Ruby as some of the abilities of the language regarding overriding of array methods made it slightly more terse.
I list my code below to help people in the future as well as a number of resources that helped me
Resources
- The Wikipedia description of the algorithm
- A Ruby implementation of the Wikipedia sudo code
- Ruby and Python version of the Edit distance algorithm by Brian Schroeder (the algorithm is similar to N-W)
Code
# The cell object holds the value of an allignment
# and the location of the previous cell
class Cell
@@valid_paths = [:ABOVE, :LEFT, :DIAG, nil]
attr_reader :value, :path_from
def initialize(value, path_from=nil)
raise “Invalid path: ‘#{path_from}’”
unless @@valid_paths.include?(path_from)
@value = value
@path_from = path_from
end
# This enables comparison between one cell to another.
def <=>(other)
self.value <=> other.value
end
end
# The grid object is a generic 2D array which can store any object.
# In this case it stores Cell objects and provides a numbers of
# functions to make traversing and accessing the grid cleaner
class Grid
# 2D array initalised to nil
def initialize(number_of_rows, number_of_cols)
@grid = Array.new(number_of_rows){Array.new(number_of_cols){nil}}
end
# Allows use of [x,y] over [x][y]
def [](row, column)
@grid[row][column]
end
# Allows use of [x,y]=z over [x][y]=z
def []=(row, column, value)
@grid[row][column] = value
end
# Loops through each index to negate the need for
# nested loops or blocks in other parts of the code
def each_index
@grid.each_with_index do |row, row_index|
row.each_with_index do |col, column_index|
yield row_index, column_index
end
end
end
# Returns each row array from the grid.
# Not convinced how usefull this is
def each
@grid.each do |row|
yield row
end
end
# Very basic output of the grid
def to_s
@grid.map{ |row| row.join(”\t”) }.join(”\n”)
end
end
class NeedlemanMunch
attr_reader :score, :allignment
def initialize(reference_seq, comparison_seq, similarity_matrix, gap_penalty=-1)
@ref_sq, @comp_sq = reference_seq , comparison_seq
# Size of f_matrix
@cols = @ref_sq.size+1
@rows = @comp_sq.size+1
# Allignment score params
@gap_pen = gap_penalty
@sim_matrix = similarity_matrix
# Matrix creation
@f_matrix = Grid.new(@rows, @cols)
# Results
@score = nil
@allignment = nil
# Calcultate the allignment and score
# then trace the path
calculate_f_matrix
trace_path
end
def print_f_matrix
output = ""
@f_matrix.each do |row|
line = ""
row.each do |col|
line << "#{col.value}:#{col.path_from}\t"
end
output << line << "\n"
end
output
end
def to_s
puts @allignment
end
private
def calculate_f_matrix
@f_matrix.each_index do |r, c|
@f_matrix[r,c] =
if first_cel?(r,c) then Cell.new(0)
elsif first_col?(r,c) then Cell.new(@f_matrix[r-1,c].value + @gap_pen)
elsif first_row?(r,c) then Cell.new(@f_matrix[r,c-1].value + @gap_pen)
else calc_cell(r,c)
end
end
@score = @f_matrix[@rows-1,@cols-1].value
end
# Three functions that determin the location
# of a cell in the grid based on the coords
def first_cel?(row, col) row==0 && col==0 end
def first_row?(row, col) row==0 && col!=0 end
def first_col?(row, col) row!=0 && col==0 end
# Calculate the value of the cell based on the 3 surrounding
# cells, the dynamic programming bit.
def calc_cell(row, col)
[Cell.new(@f_matrix[row-1, col ].value + @gap_pen, :ABOVE),
Cell.new(@f_matrix[row, col-1].value + @gap_pen, :LEFT),
Cell.new(@f_matrix[row-1, col-1].value + match_score(row,col), :DIAG)].max
end
# Get the score for 2 bases from the similarity matrix
def match_score(row, col)
@sim_matrix[(@ref_sq[col-1].chr + @comp_sq[row-1].chr).upcase]
end
# Function: trace_path
#
# When the f_matrix for both sequences has been
# calculated the allignment must be worked out.
#
def trace_path
ref = ''
seq = ''
i = @comp_sq.length
j = @ref_sq.length
# This is currently a horrid case statement
# that crawls back through the f_matrix looking
# at each cells path_from attribute to determin
# the allignment of the two sequences.
#
# This will determin one allignment, there may be
# others!
while (i>0 and j>0)
current = @f_matrix[i,j]
case current.path_from
when :DIAG
ref = @comp_sq[i-1].chr << ref
seq = @ref_sq[j-1].chr << seq
i -= 1
j -= 1
when :ABOVE
ref = @comp_sq[i-1].chr + ref
seq = '-' << seq
i -= 1
when :LEFT
ref = '-' << ref
seq = @ref_sq[j-1].chr + seq
j-= 1
else
# Something has gone wrong
raise UnknownPathError
end
end
while (i>0)
ref = @comp_sq[i-1].chr + ref
seq = ‘-’ << seq
i-=1
end
while (j > 0)
ref = ‘-’ + ref
seq = @ref_sq[j-1].chr << seq
j -= 1
end
@allignment = [seq , ref]
end
end
class UnknownPathError < StandardError; end
if __FILE__ == $0
example = 2
# a few similarity matrix options for you to try
if example == 1
similarity_matrix = {
'AA' => 10, 'AG' => -1, 'AC' => -3, 'AT' => -4,
'GA' => -1, 'GG' => 7, 'GC' => -5, 'GT' => -3,
'CA' => -3, 'CG' => -5, 'CC' => 9, 'CT' => 0,
'TA' => -4, 'TG' => -3, 'TC' => 0, 'TT' => 8
}
gap_penalty = -5
a = 'GAATTCAGTTA'
b = 'GGATCGA'
elsif example == 2
similarity_matrix = {
'AA' => 2, 'AG' => -1, 'AC' => -1, 'AT' => -1,
'GA' => -1, 'GG' => 2, 'GC' => -1, 'GT' => -1,
'CA' => -1, 'CG' => -1, 'CC' => 2, 'CT' => -1,
'TA' => -1, 'TG' => -1, 'TC' => -1, 'TT' => 2
}
gap_penalty = -1
a = 'ACCGGTAT'
b = 'ACCTATC'
elsif example == 3
similarity_matrix = {
'AA' => 1, 'AG' => 0, 'AC' => 0, 'AT' => 0,
'GA' => 0, 'GG' => 1, 'GC' => 0, 'GT' => 0,
'CA' => 0, 'CG' => 0, 'CC' => 1, 'CT' => 0,
'TA' => 0, 'TG' => 0, 'TC' => 0, 'TT' => 1
}
gap_penalty = 0
a = 'AGACTAGTTAC'
b = 'CGAGACGT'
end
nm = NeedlemanMunch.new(a, b, similarity_matrix, gap_penalty)
puts nm.allignment
puts nm.score
puts nm.print_f_matrix
end
Hope this helps someone even if it is just with some Ruby stuff.
No Comments Yet