3 Solution Estimation

In this chapter, we discuss the process to solve (2.7) and explore a solution involving dynamic programming. Because the exact solution to this problem has quadratic complexity, we also discuss some alternatives that simplify the search path and provide an approximate solution.

3.1 Exact Algorithm

The first attempt to solve the equation would be to search for all the possible alternatives to find the correct minimal set of intervals that minimize the total cost of the system. It’s done iteratively by:

  • let \(D\) be the current data set we want to segment
  • for each column index \(i \in \{1, \dots, m\}\)
    • for each column index \(j \in \{i, \dots, m\}\)
      • compute \(K_{i:j}=K(D_{i:j})\)
      • store the cost \(K_{i:j}\) and the index \(j\) for the lowest value of \(K_{i:j}\) for later comparison

Finally, after computing all of the possible combinations of segment costs, the procedure to find the optimal solution is:

  • let \(j = m\)
  • while \(j > 1\)
    • search the stored values for \(i = \underset{i}{\arg\min} (K_{i:j})\)
    • store \(i\) in the set of change points \(C\) if \(i \ne 1\)
    • let \(j = i\) and repeat until \(j = 1\)

With the execution of the procedure described above, the estimated set of change points is stored in \(C\). The solution will be exact because it analyses all the possible combinations. However, that has a time complexity of \(O(m^2)\) in the Big-O notation, for the number of columns \(m\) in the data set. Because the number of columns in data segmentation problems can be very large, computation time can be very prohibitive. For those situations, approximate solutions are described.

3.2 Hierarchical Algorithm

To simplify the search path of the algorithm, (Castro et al. 2018) also proposes a technique that relies on the assumption of the data hierarchical, i.e. a segment can be sub-divided by reapplying the segmentation function recursively. The algorithm is described as:

  • let \(D\) be the current data set we want to segment
  • for each column \(i\) in the data set
    • compute the total cost \(K_i = K(D_{1:i-1})+K(D_{i:m})\) if \(i \ne 1\)
  • find \(i\) for which \(K_i\) is minimum
  • if \(K_i >K(D_{1:m})\)
    • return the empty set as the result of the current function call
  • recursively find the set of change points \(C_L\) by calling the algorithm on the left segment \(D_{1:i-1}\)
  • recursively find the set of change points \(C_R\) by calling the algorithm on the right segment \(D_{i:m}\)
  • return \(C=C_L \cup C_R \cup \{i\}\) as result of current function call

Implementing the algorithm described above will estimate the set of change points \(C\) with time complexity of \(O(m\log(m))\), for the number of columns \(m\) in the data set. The reduction in time complexity is only possible because of the strong assumption the data set has a hierarchical cost behavior. However, that is not true for all types of segmentation problems, as it will be seen in more detail in the next chapters. So, the usage of this algorithm should be considered with care.

3.3 Hybrid Algorithm

The hybrid algorithm is a modified version of the hierarchical algorithm, in which it will start searching the segments using the hierarchical algorithm, and if the segments become smaller than a certain number of columns threshold, it will switch to the exact algorithm. The procedure would be as follows:

  • let \(D\) be the current data set we want to segment
  • if the number of columns of \(D\) is \(m <k\), in which \(k\) is a predefined threshold
    • return the set of change points calculated by the exact algorithm.
  • for each column \(i\) in the data set
    • compute the total cost \(K_i = K(D_{1:i-1})+K(D_{i:m})\) if \(i \ne 1\)
  • find \(i\) for which \(K_i\) is minimum
  • if \(K_i <K(X_{1:p})\)
    • return empty set as the result of the current function call
  • recursively find the set of change points \(C_L\) by calling the algorithm on the left segment \(K_{1:i-1}\)
  • recursively find the set of change points \(C_R\) by calling the algorithm on the right segment \(K_{i:m}\)
  • return \(C=C_L \cup C_R \cup \{i\}\) as result of current function call

The only difference between this and the hierarchical procedure is the presence of the conditional step is the beginning, that tests whether it should switch over to the exact algorithm. As will be shown in later chapters, the accuracy of this algorithm is shown not to be more advantageous than using the hierarchical or the exact algorithms directly, depending on the situation.

References

Castro, Bruno M., Renan B. Lemes, Jonatas Cesar, Tábita Hünemeier, and Florencia Leonardi. 2018. “A Model Selection Approach for Multiple Sequence Segmentation and Dimensionality Reduction.” Journal of Multivariate Analysis 167: 319–30. https://doi.org/https://doi.org/10.1016/j.jmva.2018.05.006.