Implementing the longest common subsequence in Golang

·

5 min read

Implementing algorithms can be tricky because it's possible to write an implementation that builds and runs, but the outcome may be incorrect. An understanding of the algorithm is the best tool for implementing that algorithm correctly. In addition, familiarity with testing, debugging, and basics of the language used to implement the algorithm are useful tools for turning correct code into idiomatic and efficient code, but maintainability is second to correctness.

Having a good resource is the first step towards learning about an algorithm. There are many ways to correctly write pseudo-code for lcs; however, the three main ones are recursive, dynamic but top-down, and dynamic but bottom-up. The pseudo-code for any of these versions of lcs are relatively short, but the dynamically programmed ones are more performant and easier to debug than the recursive one. Furthermore, the bottom-up version is only a few lines long, which makes it an appealing choice.

Iterative LCS:

int lcs_length(char * A, char * B)
{
  allocate storage for array L;
  for (i = m; i >= 0; i--)
      for (j = n; j >= 0; j--)
      {
      if (A[i] == '\0' || B[j] == '\0') L[i,j] = 0;
      else if (A[i] == B[j]) L[i,j] = 1 + L[i+1, j+1];
      else L[i,j] = max(L[i+1, j], L[i, j+1]);
      }
  return L[0,0];
}

Even with good pseudo-code, it is still useful to examine existing implementations in order to evaluate the fit. In the example implementation, the table of values is populated in reverse.

Table from the pseudo-code
     a    b    c
a    2    2    1    0
b    1    1    1    0
     0    0    0    0
Table from the existing implementation
          c    b    a
     0    0    0    0
b    0    1    1    1
a    0    1    2    2

The reason for this reversal is because the loops start counting from the back.

for i := 0; i < len(a); i++ {
  for j := 0; j < len(b); j++ {

In addition, this implementation returns the substring itself. After adding tests, it's clear that the implementation works, but improvements can still be made. The reversal of the table adds some friction when comparing pseudo-code from reference material to the implementation, reducing how readable the code is. With this improvement in mind, a new implementation can be written.

String matching algorithms are useful, and where it's necessary to have one, it's likely necessary to have others. When adding code to an existing library, the conventions from the library inform and constrain the implementation. For example, each test case is accompanied by a brief comment. Also, this library comes with utility functions, such as maxI, which returns the maximum between two numbers.

The final implementation traverses the table as specified in the pseudo-code. The only difference is the logic that sets the edges of the table to zero are removed because of Go's zero values. Consequently, the implementation starts the for-loops 1 from the end.

Much of the work that goes into implementing an algorithm lies in the reading of pseudo-code and actual code. The subsequent writing of the code still takes some care, but it's significantly less work. Fortunately, many resources exist to make the process of understanding, writing, debugging, and testing algorithms smoother.

Further reading: