Finding the longest common substring or common prefix or suffix of a group of strings is a common text processing problem. And it is a special case of edit distance problem.
This problem has a wide range of applications in different areas, i.e. computation biology and information retrieval from texts, and so on. I encountered this problem at work and I would like to give an extensive survey to this problem.
The key to this problem is to build a data structure called suffix tree over the text or group of strings you want to work with.
Suffix trees were first introduced by Weiner 1973. However this paper was considered too difficult to understand and had little follow-on work.
McCreight 1979 came along with another algorithm with better space efficiency. But this method is more difficult to extend for generalized suffix trees.
Ukkonen 1992 then came up with an easy to understand algorithm, which is an on-line construction of suffix trees in linear-time. And this algorithm is currently also what most suffix tree implementations are based on.
To understand Ukkonen’s paper, it takes some efforts to observe how suffix tree is grown from the text being processed from left to right. To gain some insights and detailed explanations, try to read the following links:
- statckoverflow post about Ukkonen’s algorithm
- A clear explanation including C++ implementation
- Explanation on
Okay, so suffix tree can be constructed in time linear to the text size, but how can it be used to solve Longest Common Substring Problem? First, build a generalized suffix tree for a set of strings. Then do a tree traversal (in linear time) to find the deepest internal node which has leaf nodes from all the strings in the subtree below it.
If you are interested in the implementation of suffix tree data structure, check out the following links:
Feeling enthusiastic to roll out your own implementation of suffix tree?