4 Simulations
To exemplify the utility of this package, a handful of simulated data sets are presented, together with proposed cost functions that are expected to segment them adequately.
4.1 Segments of Independent Variables
To show compatibility with the work presented in (Castro et al. 2018), we analyze the same problem considered in that paper.
Let \(B = (B_1, \dots , B_6)\) be a sequence of independent random variables, each with a Bernoulli distribution of probability \(0.5\), and let \(X = (X_1, \dots , X_{15})\) be another sequence of random variables that depend on \(B\) through the relationships described in (4.1). From these relationships, it is clear the first segment \(X_1, \dots, X_5\) depends on \(B_1, B_2\), the second segment \(X_6, \dots, X_{10}\) depends on \(B_3, B_4\), and the third and last segment \(X_{11}, \dots, X_{15}\) depends on \(B_5, B_6\). Therefore, the set of change points for \(X\) is \(C = \{6, 11\}\).
\[\begin{equation} \begin{aligned} X_1 & = B_1 \\ X_2 & = B_1 - B_2 \\ X_3 & = B_2 \\ X_4 & = B_1 + B_2 \\ X_5 & = B_1 \\ X_6 & = B_3 \\ X_7 & = B_3 - B_4 \\ X_8 & = B_4 \\ X_9 & = B_3 + B_4 \\ X_{10} & = B_3 \\ X_{11} & = B_5 \\ X_{12} & = B_5 - B_6 \\ X_{13} & = B_6 \\ X_{14} & = B_5 + B_6 \\ X_{15} & = B_5 \end{aligned} \tag{4.1} \end{equation}\]
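As an illustration, the model in (4.1) can be sampled directly in R. This is a minimal sketch: the seed and the number of rows (chosen here to match Table 4.1) are arbitrary.

```r
set.seed(1)

# Draw one row of D: sample B_1, ..., B_6 and apply the relationships in (4.1)
sample_row <- function() {
  b <- rbinom(6, size = 1, prob = 0.5)
  c(b[1], b[1] - b[2], b[2], b[1] + b[2], b[1],   # X_1, ..., X_5
    b[3], b[3] - b[4], b[4], b[3] + b[4], b[3],   # X_6, ..., X_10
    b[5], b[5] - b[6], b[6], b[5] + b[6], b[5])   # X_11, ..., X_15
}

D <- t(replicate(6, sample_row()))
colnames(D) <- paste0("X", 1:15)
```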
Given \(D\), a sample data set of the sequence of random variables \(X\) illustrated in Table 4.1, the multivariate likelihood function \(L\) (Park 2017) can be used to estimate the likelihood of a given segment in the discrete data set, as described in (4.2). The multivariate() function, available in the segmentr package, is a fast native-code implementation of that likelihood. We then use it to define a penalized cost function \(K\) in (4.3), with an extra term that penalizes segments that are too large, based on the number of columns \(\text{ncol}(S_k)\) of the segment. Note that a penalized cost function is important here, because the original likelihood function tends to favor larger segments.
\[\begin{equation} \log L(X \mid S) = \sum_{k=0}^t \log P(X_{c_k:c_{k+1}-1} = D_{c_k:c_{k+1}-1}) \tag{4.2} \end{equation}\]
\[\begin{equation} K(S_k) = - \log(L(S_k)) + \text{ncol}(S_k) \tag{4.3} \end{equation}\]
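A minimal sketch of this setup with segmentr follows. It assumes segment() maximizes the value returned by the function passed as likelihood, so minimizing \(K\) is expressed as maximizing \(-K\); the changepoints field of the result is likewise an assumption about the shape of the returned object.

```r
library(segmentr)

# Maximizing log L(S_k) - ncol(S_k) is equivalent to minimizing K(S_k) in (4.3)
penalized_likelihood <- function(X) multivariate(X) - ncol(X)

results <- segment(D, likelihood = penalized_likelihood, algorithm = "exact")
results$changepoints  # expected: 6 11, as in Table 4.2
```

Passing multivariate directly as the likelihood corresponds to the unpenalized estimate of Table 4.3, i.e. a single segment spanning all columns.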
Segmenting the data set \(D\) with the cost function \(K\) then yields the result shown in Table 4.2, matching the expected set of change points \(C\). For contrast, the change points estimated with an unpenalized cost are shown in Table 4.3.
Table 4.1: Sample values of the correlated random variables defined in (4.1).
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | -1 | 2 | 3 | 1 |
1 | -1 | 2 | 3 | 1 | 2 | 0 | 2 | 4 | 2 | 2 | 1 | 1 | 3 | 2 |
2 | 0 | 2 | 4 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | -1 | 2 | 3 | 1 |
2 | 1 | 1 | 3 | 2 | 1 | 0 | 1 | 2 | 1 | 2 | 0 | 2 | 4 | 2 |
2 | 0 | 2 | 4 | 2 | 2 | 1 | 1 | 3 | 2 | 2 | 1 | 1 | 3 | 2 |
2 | 1 | 1 | 3 | 2 | 1 | -1 | 2 | 3 | 1 | 1 | 0 | 1 | 2 | 1 |
Table 4.2: Change points estimated with the penalized cost function \(K\).

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 5 |
2 | 6 | 10 |
3 | 11 | 15 |
Table 4.3: Change points estimated with the unpenalized likelihood cost.

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 15 |
4.2 Segments with Similar Averages
Ceballos et al. (2018) describe how to find windows of contiguous homozygosity, i.e. segments of the genetic data in which the alleles are of the same type. Such runs of homozygosity are of interest to researchers investigating diseases.
In this scenario, segmentr can be used to segment random variables \(X_1, ..., X_m\) representing genetic data, encoded as \(0\) for homozygosity, i.e. when the alleles are the same, and \(1\) for heterozygosity, when they differ.
To use segmentr to solve this problem, it is necessary to find a cost function that favors the homogeneity of a given segment. We approach this by proposing a heterogeneity cost function, i.e. a function that penalizes segments whose elements are far from the segment average. One such function is defined in (4.4). Notice, however, that this cost function favors single-column segments, as that is the size for which all elements are closest to the segment average. To counter this undesirable behavior, we penalize the function by adding a constant to it, as described in (4.5). Since the constant is incurred once per segment, solutions with many segments accumulate a higher total cost, so wider segments are favored in the estimated solution.
In order to observe how the proposed function behaves, consider the example defined in (4.6), in which \(X_i\), for \(i \in \{1, \dots, 20\}\), each represent a column indexed by \(i\) of a data set \(D\), and \(\text{Bern}(p)\) represents the Bernoulli distribution with probability \(p\). The result of segmenting a data set sampled from (4.6) using the penalized heterogeneity cost described in (4.5) is shown in Table 4.4.
\[\begin{equation} K(S_k)=\sum_{x_i \in S_k}(x_i-E[S_k])^2 \tag{4.4} \end{equation}\]
\[\begin{equation} K_p(S_k)=K(S_k) + 1 \tag{4.5} \end{equation}\]
\[\begin{equation} \begin{aligned} X_{1} & \sim \text{Bern}(0.9) \\ X_{2} & \sim \text{Bern}(0.9) \\ X_{3} & \sim \text{Bern}(0.9) \\ X_{4} & \sim \text{Bern}(0.9) \\ X_{5} & \sim \text{Bern}(0.9) \\ X_{6} & \sim \text{Bern}(0.1) \\ X_{7} & \sim \text{Bern}(0.1) \\ X_{8} & \sim \text{Bern}(0.1) \\ X_{9} & \sim \text{Bern}(0.1) \\ X_{10} & \sim \text{Bern}(0.1) \\ X_{11} & \sim \text{Bern}(0.1) \\ X_{12} & \sim \text{Bern}(0.1) \\ X_{13} & \sim \text{Bern}(0.1) \\ X_{14} & \sim \text{Bern}(0.1) \\ X_{15} & \sim \text{Bern}(0.1) \\ X_{16} & \sim \text{Bern}(0.9) \\ X_{17} & \sim \text{Bern}(0.9) \\ X_{18} & \sim \text{Bern}(0.9) \\ X_{19} & \sim \text{Bern}(0.9) \\ X_{20} & \sim \text{Bern}(0.9) \end{aligned} \tag{4.6} \end{equation}\]
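As a sketch, the heterogeneity cost (4.4) and its penalized form (4.5) can be written as plain R functions and, under the same maximization convention assumed earlier, passed to segment() with their signs flipped. The sample size of 30 rows is an arbitrary choice.

```r
set.seed(2)

# Sample the data set of (4.6): 20 Bernoulli columns with the given probabilities
probs <- c(rep(0.9, 5), rep(0.1, 10), rep(0.9, 5))
D2 <- sapply(probs, function(p) rbinom(30, size = 1, prob = p))

heterogeneity <- function(X) sum((X - mean(X))^2)           # K(S_k) in (4.4)
penalized_heterogeneity <- function(X) heterogeneity(X) + 1 # K_p(S_k) in (4.5)

results <- segment(D2, likelihood = function(X) -penalized_heterogeneity(X),
                   algorithm = "exact")
results$changepoints  # expected: 6 16, as in Table 4.4
```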
Table 4.4: Change points estimated with the penalized heterogeneity cost (4.5).

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 5 |
2 | 6 | 15 |
3 | 16 | 20 |
4.3 Accuracy of Estimates Using Different Algorithms
Given different solutions obtained by the segment function, it is often necessary to compare them with each other. For that, the Hausdorff distance, defined in Section 2.3, can be used to measure the distance between two sets of estimated change points.
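For illustration, a direct implementation of this distance for change point sets might look as follows; this is a generic helper, not necessarily the implementation used elsewhere in this work.

```r
# Hausdorff distance between two non-empty sets of change points
hausdorff_distance <- function(a, b) {
  nearest <- function(x, y) max(sapply(x, function(p) min(abs(p - y))))
  max(nearest(a, b), nearest(b, a))
}

hausdorff_distance(c(6, 11), c(6, 8, 11))  # 2, as in Table 4.5
```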
We want to measure how well the different algorithms provided in the segmentr package find the segments in the correlated columns data set defined in (4.1), for which the exact solution is known.
Table 4.5: Change points estimated by the different algorithms on the correlated columns data set, with the Hausdorff distance to the expected solution.

Description | Changepoints | Hausdorff Distance |
---|---|---|
Expected solution | 6, 11 | 0 |
Exact algorithm estimate | 6, 11 | 0 |
Hierarchical algorithm estimate | 6, 8, 11 | 2 |
Hybrid algorithm estimate with threshold 50 | 6, 11 | 0 |
Hybrid algorithm estimate with threshold 4 | 6, 8, 11 | 2 |
In Table 4.5 we can see the comparison of the different algorithms with different parameters. The “exact” algorithm matches the expected change points exactly, whereas the “hierarchical” algorithm finds an extra segment, which causes the Hausdorff distance to be greater than zero. The two “hybrid” cases are interesting in that they show how the algorithm works by adopting either the exact or the hierarchical algorithm, depending on the threshold and the size of the segment being analyzed. When the threshold argument is large, it applies the exact algorithm to the data set, whereas when the threshold is small, it applies the hierarchical algorithm instead.
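The comparison above can be reproduced along the following lines, reusing the penalized likelihood and the data set \(D\) from Section 4.1. The algorithm values and the threshold argument are those mentioned in the text; treat the exact call signature as an assumption to be checked against the installed segmentr version.

```r
estimates <- list(
  exact        = segment(D, likelihood = penalized_likelihood,
                         algorithm = "exact"),
  hierarchical = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hierarchical"),
  hybrid_50    = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hybrid", threshold = 50),
  hybrid_4     = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hybrid", threshold = 4)
)

# Hausdorff distance of each estimate to the expected solution C = {6, 11}
sapply(estimates, function(r) hausdorff_distance(r$changepoints, c(6, 11)))
```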
Consider now the simulated data set of segments with similar averages described in equation (4.6). We do the same algorithm comparison with this data set in Table 4.6. In contrast with the previous experiment, all the algorithms manage to find the expected solution to the problem.
Table 4.6: Change points estimated by the different algorithms on the similar averages data set, with the Hausdorff distance to the expected solution.

Description | Changepoints | Hausdorff Distance |
---|---|---|
Expected solution | 6, 16 | 0 |
Exact algorithm estimate | 6, 16 | 0 |
Hierarchical algorithm estimate | 6, 16 | 0 |
Hybrid algorithm estimate with threshold 50 | 6, 16 | 0 |
Hybrid algorithm estimate with threshold 4 | 6, 16 | 0 |
References
Castro, Bruno M., Renan B. Lemes, Jonatas Cesar, Tábita Hünemeier, and Florencia Leonardi. 2018. “A Model Selection Approach for Multiple Sequence Segmentation and Dimensionality Reduction.” Journal of Multivariate Analysis 167: 319–30. https://doi.org/10.1016/j.jmva.2018.05.006.
Ceballos, Francisco, Peter Joshi, David W. Clark, Michèle Ramsay, and James F. Wilson. 2018. “Runs of Homozygosity: Windows into Population History and Trait Architecture.” Nature Reviews Genetics 19 (January). https://doi.org/10.1038/nrg.2017.109.
Park, K. I. 2017. Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer International Publishing. https://books.google.com.br/books?id=cd2SswEACAAJ.