Posted

17 April 2008 @ 11am

 

Needleman-Wunsch algorithm in Ruby

While at university I had the pleasure of studying bioinformatics. Bioinformatics is all about using the processing power of computers to lean about and infer data from DNA and protein sequences.

For example if you want to find the common parts of a DNA sequence from 500 different sequences and each sequence is 50000 bases long it will take some time to compare them all and you are bound to make some mistakes.

This is where bioinformatics comes in and uses tried and tested computer science principles to make these large tasks easier, more accurate and faster. One example is the Needleman-Wunsch algorithm which is used to score a string of bases against another.

The course I was on covered this and it also came up in the exam. Although no code implementation of the algorithm was provided I found making a working copy very helpful in understanding it as well as getting some coding practice.

My inital version was in Java but I re-wrote it in Ruby as some of the abilities of the language regarding overriding of array methods made it slightly more terse.

I list my code below to help people in the future as well as a number of resources that helped me

Resources

Code


# The cell object holds the value of an allignment
# and the location of the previous cell
class Cell
  @@valid_paths = [:ABOVE, :LEFT, :DIAG, nil]
  attr_reader :value, :path_from

  def initialize(value, path_from=nil)
    raise “Invalid path: ‘#{path_from}’”
      unless @@valid_paths.include?(path_from)
    @value = value
    @path_from = path_from
  end

  # This enables comparison between one cell to another.
  def <=>(other)
    self.value <=> other.value
  end
end


# The grid object is a generic 2D array which can store any object.
# In this case it stores Cell objects and provides a numbers of
# functions to make traversing and accessing the grid cleaner
class Grid
  # 2D array initalised to nil
  def initialize(number_of_rows, number_of_cols)
    @grid = Array.new(number_of_rows){Array.new(number_of_cols){nil}}
  end

  # Allows use of [x,y] over [x][y]
  def [](row, column)
    @grid[row][column]
  end

  # Allows use of [x,y]=z over [x][y]=z
  def []=(row, column, value)
    @grid[row][column] = value
  end

  # Loops through each index to negate the need for
  # nested loops or blocks in other parts of the code
  def each_index
    @grid.each_with_index do |row, row_index|
      row.each_with_index do |col, column_index|
        yield row_index, column_index
      end
    end
  end

  # Returns each row array from the grid.
  # Not convinced how usefull this is
  def each
    @grid.each do |row|
      yield row
    end
  end

  # Very basic output of the grid
  def to_s
    @grid.map{ |row| row.join(”\t”) }.join(”\n”)
  end
end


class NeedlemanMunch

  attr_reader :score, :allignment

  def initialize(reference_seq, comparison_seq, similarity_matrix, gap_penalty=-1)

    @ref_sq, @comp_sq = reference_seq , comparison_seq

    # Size of f_matrix
    @cols = @ref_sq.size+1
    @rows = @comp_sq.size+1

    # Allignment score params
    @gap_pen = gap_penalty
    @sim_matrix = similarity_matrix

    # Matrix creation
    @f_matrix = Grid.new(@rows, @cols)

    # Results
    @score = nil
    @allignment = nil

    # Calcultate the allignment and score
    # then trace the path
    calculate_f_matrix
    trace_path
  end

  def print_f_matrix
    output = ""
    @f_matrix.each do |row|
      line = ""
      row.each do |col|
        line << "#{col.value}:#{col.path_from}\t"
      end
      output << line << "\n"
    end
    output
  end

  def to_s
    puts @allignment
  end

private

  def calculate_f_matrix
    @f_matrix.each_index do |r, c|
      @f_matrix[r,c] =
        if    first_cel?(r,c) then Cell.new(0)
        elsif first_col?(r,c) then Cell.new(@f_matrix[r-1,c].value + @gap_pen)
        elsif first_row?(r,c) then Cell.new(@f_matrix[r,c-1].value + @gap_pen)
        else  calc_cell(r,c)
      end
    end
    @score = @f_matrix[@rows-1,@cols-1].value
  end

  # Three functions that determin the location
  # of a cell in the grid based on the coords
  def first_cel?(row, col) row==0 && col==0 end
  def first_row?(row, col) row==0 && col!=0 end
  def first_col?(row, col) row!=0 && col==0 end

  # Calculate the value of the cell based on the 3 surrounding
  # cells, the dynamic programming bit.
  def calc_cell(row, col)
    [Cell.new(@f_matrix[row-1, col  ].value + @gap_pen, :ABOVE),
     Cell.new(@f_matrix[row,   col-1].value + @gap_pen, :LEFT),
     Cell.new(@f_matrix[row-1, col-1].value + match_score(row,col), :DIAG)].max
  end

  # Get the score for 2 bases from the similarity matrix
  def match_score(row, col)
    @sim_matrix[(@ref_sq[col-1].chr + @comp_sq[row-1].chr).upcase]
  end

  # Function: trace_path
  #
  # When the f_matrix for both sequences has been
  # calculated the allignment must be worked out.
  #
  def trace_path
    ref = ''
    seq = ''
    i = @comp_sq.length
    j = @ref_sq.length

    # This is currently a horrid case statement
    # that crawls back through the f_matrix looking
    # at each cells path_from attribute to determin
    # the allignment of the two sequences.
    #
    # This will determin one allignment, there may be
    # others!
    while (i>0 and j>0)
      current = @f_matrix[i,j]

      case current.path_from
      when :DIAG
        ref = @comp_sq[i-1].chr << ref
        seq = @ref_sq[j-1].chr << seq
        i -= 1
        j -= 1
      when :ABOVE
        ref = @comp_sq[i-1].chr + ref
        seq = '-' << seq
        i -= 1
      when :LEFT
        ref = '-' << ref
        seq = @ref_sq[j-1].chr + seq
        j-= 1
      else
        # Something has gone wrong
        raise UnknownPathError
      end
    end

    while (i>0)
      ref = @comp_sq[i-1].chr + ref
      seq = ‘-’ << seq
      i-=1
    end

    while (j > 0)
      ref = ‘-’ + ref
      seq = @ref_sq[j-1].chr << seq
      j -= 1
    end
    @allignment = [seq , ref]
  end
end


class UnknownPathError < StandardError; end


if __FILE__ == $0

  example = 2

  # a few similarity matrix options for you to try
  if example == 1

    similarity_matrix = {
      'AA' => 10, 'AG' => -1, 'AC' => -3, 'AT' => -4,
      'GA' => -1, 'GG' =>  7, 'GC' => -5, 'GT' => -3,
      'CA' => -3, 'CG' => -5, 'CC' =>  9, 'CT' =>  0,
      'TA' => -4, 'TG' => -3, 'TC' =>  0, 'TT' =>  8
    }
    gap_penalty = -5
    a = 'GAATTCAGTTA'
    b = 'GGATCGA'

  elsif example == 2

    similarity_matrix = {
      'AA' =>  2, 'AG' => -1, 'AC' => -1, 'AT' => -1,
      'GA' => -1, 'GG' =>  2, 'GC' => -1, 'GT' => -1,
      'CA' => -1, 'CG' => -1, 'CC' =>  2, 'CT' => -1,
      'TA' => -1, 'TG' => -1, 'TC' => -1, 'TT' =>  2
    }
    gap_penalty = -1
    a = 'ACCGGTAT'
    b = 'ACCTATC'

  elsif example == 3

    similarity_matrix = {
      'AA' => 1, 'AG' => 0, 'AC' => 0, 'AT' => 0,
      'GA' => 0, 'GG' => 1, 'GC' => 0, 'GT' => 0,
      'CA' => 0, 'CG' => 0, 'CC' => 1, 'CT' => 0,
      'TA' => 0, 'TG' => 0, 'TC' => 0, 'TT' => 1
    }
    gap_penalty = 0
    a = 'AGACTAGTTAC'
    b = 'CGAGACGT'

  end

  nm = NeedlemanMunch.new(a, b, similarity_matrix, gap_penalty)
  puts nm.allignment
  puts nm.score
  puts nm.print_f_matrix
end

Hope this helps someone even if it is just with some Ruby stuff.

 

No Comments Yet

There are no comments yet. You could be the first!

Leave a Comment