TY - JOUR
T1 - A branch and bound irredundant graph algorithm for large-scale MLCS problems
AU - Wang, Chunyang
AU - Wang, Yuping
AU - Cheung, Yiuming
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China ( 61872281 ).
PY - 2021/11
Y1 - 2021/11
N2 - Finding the multiple longest common subsequences (MLCS) among many long sequences (i.e., the large scale MLCS problem) has many important applications, such as gene alignment, disease diagnosis, and documents similarity check, etc. It is an NP-hard problem (Maier et al., 1978). The key bottle neck of this problem is that the existing state-of-the-art algorithms must construct a huge graph (called direct acyclic graph, briefly DAG), and the computer usually has no enough space to store and handle this graph. Thus the existing algorithms cannot solve the large scale MLCS problem. In order to quickly solve the large-scale MLCS problem within limited computer resources, this paper therefore proposes a branch and bound irredundant graph algorithm called Big-MLCS, which constructs a much smaller DAG (called Small-DAG) than the existing algorithms do by a branch and bound method, and designs a new data structure to efficiently store and handle Small-DAG. By these schemes, Big-MLCS is more efficient than the existing algorithms. Also, we compare the proposed algorithm with two state-of-the-art algorithms through the experiments, and the results show that the proposed algorithm outperforms the compared algorithms and is more suitable to large-scale MLCS problems.
AB - Finding the multiple longest common subsequences (MLCS) among many long sequences (i.e., the large scale MLCS problem) has many important applications, such as gene alignment, disease diagnosis, and documents similarity check, etc. It is an NP-hard problem (Maier et al., 1978). The key bottle neck of this problem is that the existing state-of-the-art algorithms must construct a huge graph (called direct acyclic graph, briefly DAG), and the computer usually has no enough space to store and handle this graph. Thus the existing algorithms cannot solve the large scale MLCS problem. In order to quickly solve the large-scale MLCS problem within limited computer resources, this paper therefore proposes a branch and bound irredundant graph algorithm called Big-MLCS, which constructs a much smaller DAG (called Small-DAG) than the existing algorithms do by a branch and bound method, and designs a new data structure to efficiently store and handle Small-DAG. By these schemes, Big-MLCS is more efficient than the existing algorithms. Also, we compare the proposed algorithm with two state-of-the-art algorithms through the experiments, and the results show that the proposed algorithm outperforms the compared algorithms and is more suitable to large-scale MLCS problems.
KW - Branch and bound
KW - Gene alignment
KW - Multiple longest common subsequences
KW - Small DAG
UR - http://www.scopus.com/inward/record.url?scp=85107763078&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2021.108059
DO - 10.1016/j.patcog.2021.108059
M3 - Journal article
AN - SCOPUS:85107763078
SN - 0031-3203
VL - 119
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 108059
ER -