4 Simulations
To exemplify the utility of this package, a handful of simulated data sets are presented, together with proposed cost functions that are expected to segment them adequately.
4.1 Segments of Independent Variables
To show compatibility with the work presented in (Castro et al. 2018), we analyze the same problem considered in that paper.
Let \(B = (B_1, \dots , B_6)\) be a sequence of independent random variables, each with a Bernoulli distribution of probability \(0.5\), and let \(X = (X_1, \dots , X_{15})\) be another sequence of random variables that depend on \(B\) through the relationships described in (4.1). From these relationships, it is clear the first segment \(X_1, \dots, X_5\) depends on \(B_1, B_2\), the second segment \(X_6, \dots, X_{10}\) depends on \(B_3, B_4\), and the third and last segment \(X_{11}, \dots, X_{15}\) depends on \(B_5, B_6\). Therefore, the set of change points for \(X\) is \(C = \{6, 11\}\).
\[\begin{equation} \begin{aligned} X_1 & = B_1 \\ X_2 & = B_1 - B_2 \\ X_3 & = B_2 \\ X_4 & = B_1 + B_2 \\ X_5 & = B_1 \\ X_6 & = B_3 \\ X_7 & = B_3 - B_4 \\ X_8 & = B_4 \\ X_9 & = B_3 + B_4 \\ X_{10} & = B_3 \\ X_{11} & = B_5 \\ X_{12} & = B_5 - B_6 \\ X_{13} & = B_6 \\ X_{14} & = B_5 + B_6 \\ X_{15} & = B_5 \end{aligned} \tag{4.1} \end{equation}\]
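As an illustration, the model in (4.1) can be sampled directly in R. This is a minimal sketch: the seed and the number of rows (chosen here to match Table 4.1) are arbitrary.

```r
set.seed(1)

# Draw one row of D: sample B_1, ..., B_6 and apply the relationships in (4.1)
sample_row <- function() {
  b <- rbinom(6, size = 1, prob = 0.5)
  c(b[1], b[1] - b[2], b[2], b[1] + b[2], b[1],   # X_1, ..., X_5
    b[3], b[3] - b[4], b[4], b[3] + b[4], b[3],   # X_6, ..., X_10
    b[5], b[5] - b[6], b[6], b[5] + b[6], b[5])   # X_11, ..., X_15
}

D <- t(replicate(6, sample_row()))
colnames(D) <- paste0("X", 1:15)
```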
Given \(D\), a sample data set of the sequence of random variables \(X\) illustrated in Table 4.1, the multivariate likelihood function \(L\) (Park 2017) can be used to estimate the likelihood of a given segment in the discrete data set, as described in (4.2). The multivariate() function, available in the segmentr package, is a fast native-code implementation of that likelihood. We then use it to define a penalized cost function \(K\) in (4.3), with an extra term that penalizes segments that are too large, based on the number of columns \(\text{ncol}(S_k)\) of the segment. Note that a penalized cost function is important here, because the original likelihood function tends to favor larger segments.
\[\begin{equation} \log L(X \mid S) = \sum_{k=0}^t \log P(X_{c_k:c_{k+1}-1} = D_{c_k:c_{k+1}-1}) \tag{4.2} \end{equation}\]
\[\begin{equation} K(S_k) = - \log(L(S_k)) + \text{ncol}(S_k) \tag{4.3} \end{equation}\]
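A minimal sketch of this setup with segmentr follows. It assumes segment() maximizes the value returned by the function passed as likelihood, so minimizing \(K\) is expressed as maximizing \(-K\); the changepoints field of the result is likewise an assumption about the shape of the returned object.

```r
library(segmentr)

# Maximizing log L(S_k) - ncol(S_k) is equivalent to minimizing K(S_k) in (4.3)
penalized_likelihood <- function(X) multivariate(X) - ncol(X)

results <- segment(D, likelihood = penalized_likelihood, algorithm = "exact")
results$changepoints  # expected: 6 11, as in Table 4.2
```

Passing multivariate directly as the likelihood corresponds to the unpenalized estimate of Table 4.3, i.e. a single segment spanning all columns.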
Segmenting the data set \(D\) with the cost function \(K\) then yields the result shown in Table 4.2, matching the expected set of change points \(C\). For contrast, the change points estimated with an unpenalized cost are shown in Table 4.3.
Table 4.1: Sample values of the correlated random variables defined in (4.1).
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | -1 | 2 | 3 | 1 |
1 | -1 | 2 | 3 | 1 | 2 | 0 | 2 | 4 | 2 | 2 | 1 | 1 | 3 | 2 |
2 | 0 | 2 | 4 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | -1 | 2 | 3 | 1 |
2 | 1 | 1 | 3 | 2 | 1 | 0 | 1 | 2 | 1 | 2 | 0 | 2 | 4 | 2 |
2 | 0 | 2 | 4 | 2 | 2 | 1 | 1 | 3 | 2 | 2 | 1 | 1 | 3 | 2 |
2 | 1 | 1 | 3 | 2 | 1 | -1 | 2 | 3 | 1 | 1 | 0 | 1 | 2 | 1 |
Table 4.2: Change points estimated with the penalized cost function \(K\).

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 5 |
2 | 6 | 10 |
3 | 11 | 15 |
Table 4.3: Change points estimated with the unpenalized likelihood cost.

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 15 |
4.2 Segments with Similar Averages
Ceballos et al. (2018) describe how to find windows of contiguous homozygosity, i.e. segments of the genetic data in which the alleles are of the same type. Such runs of homozygosity are of interest to researchers investigating diseases.
In this scenario, segmentr can be used to segment random variables \(X_1, ..., X_m\) representing genetic data, encoded as \(0\) for homozygosity, i.e. when the alleles are the same, and \(1\) for heterozygosity, when they differ.
To use segmentr to solve this problem, it is necessary to find a cost function that favors the homogeneity of a given segment. We approach this by proposing a heterogeneity cost function, i.e. a function that penalizes segments whose elements are far from the segment average. One such function is defined in (4.4). Notice, however, that this cost function favors single-column segments, as that is the size for which all elements are closest to the segment average. To counter this undesirable behavior, we penalize the function by adding a constant to it, as described in (4.5). Since the constant is incurred once per segment, solutions with many segments accumulate a higher total cost, so wider segments are favored in the estimated solution.
In order to observe how the proposed function behaves, consider the example defined in (4.6), in which \(X_i\), for \(i \in \{1, \dots, 20\}\), each represent a column indexed by \(i\) of a data set \(D\), and \(\text{Bern}(p)\) represents the Bernoulli distribution with probability \(p\). The result of segmenting a data set sampled from (4.6) using the penalized heterogeneity cost described in (4.5) is shown in Table 4.4.
\[\begin{equation} K(S_k)=\sum_{x_i \in S_k}(x_i-E[S_k])^2 \tag{4.4} \end{equation}\]
\[\begin{equation} K_p(S_k)=K(S_k) + 1 \tag{4.5} \end{equation}\]
\[\begin{equation} \begin{aligned} X_{1} & \sim \text{Bern}(0.9) \\ X_{2} & \sim \text{Bern}(0.9) \\ X_{3} & \sim \text{Bern}(0.9) \\ X_{4} & \sim \text{Bern}(0.9) \\ X_{5} & \sim \text{Bern}(0.9) \\ X_{6} & \sim \text{Bern}(0.1) \\ X_{7} & \sim \text{Bern}(0.1) \\ X_{8} & \sim \text{Bern}(0.1) \\ X_{9} & \sim \text{Bern}(0.1) \\ X_{10} & \sim \text{Bern}(0.1) \\ X_{11} & \sim \text{Bern}(0.1) \\ X_{12} & \sim \text{Bern}(0.1) \\ X_{13} & \sim \text{Bern}(0.1) \\ X_{14} & \sim \text{Bern}(0.1) \\ X_{15} & \sim \text{Bern}(0.1) \\ X_{16} & \sim \text{Bern}(0.9) \\ X_{17} & \sim \text{Bern}(0.9) \\ X_{18} & \sim \text{Bern}(0.9) \\ X_{19} & \sim \text{Bern}(0.9) \\ X_{20} & \sim \text{Bern}(0.9) \end{aligned} \tag{4.6} \end{equation}\]
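As a sketch, the heterogeneity cost (4.4) and its penalized form (4.5) can be written as plain R functions and, under the same maximization convention assumed earlier, passed to segment() with their signs flipped. The sample size of 30 rows is an arbitrary choice.

```r
set.seed(2)

# Sample the data set of (4.6): 20 Bernoulli columns with the given probabilities
probs <- c(rep(0.9, 5), rep(0.1, 10), rep(0.9, 5))
D2 <- sapply(probs, function(p) rbinom(30, size = 1, prob = p))

heterogeneity <- function(X) sum((X - mean(X))^2)           # K(S_k) in (4.4)
penalized_heterogeneity <- function(X) heterogeneity(X) + 1 # K_p(S_k) in (4.5)

results <- segment(D2, likelihood = function(X) -penalized_heterogeneity(X),
                   algorithm = "exact")
results$changepoints  # expected: 6 16, as in Table 4.4
```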
Table 4.4: Change points estimated with the penalized heterogeneity cost (4.5).

Segment no. | First index | Last index |
---|---|---|
1 | 1 | 5 |
2 | 6 | 15 |
3 | 16 | 20 |
4.3 Accuracy of Estimates Using Different Algorithms
Given different solutions obtained by the segment function, it is often necessary to compare them with each other. For that, the Hausdorff distance, defined in Section 2.3, can be used to measure the distance between two sets of estimated change points.
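For illustration, a direct implementation of this distance for change point sets might look as follows; this is a generic helper, not necessarily the implementation used elsewhere in this work.

```r
# Hausdorff distance between two non-empty sets of change points
hausdorff_distance <- function(a, b) {
  nearest <- function(x, y) max(sapply(x, function(p) min(abs(p - y))))
  max(nearest(a, b), nearest(b, a))
}

hausdorff_distance(c(6, 11), c(6, 8, 11))  # 2, as in Table 4.5
```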
We want to measure how well the different algorithms provided in the segmentr package find the segments in the correlated columns data set defined in (4.1), for which the exact solution is known.
Table 4.5: Change points estimated by the different algorithms on the correlated columns data set, with the Hausdorff distance to the expected solution.

Description | Changepoints | Hausdorff Distance |
---|---|---|
Expected solution | 6, 11 | 0 |
Exact algorithm estimate | 6, 11 | 0 |
Hierarchical algorithm estimate | 6, 8, 11 | 2 |
Hybrid algorithm estimate with threshold 50 | 6, 11 | 0 |
Hybrid algorithm estimate with threshold 4 | 6, 8, 11 | 2 |
In Table 4.5 we can see the comparison of the different algorithms with different parameters. The “exact” algorithm matches the expected change points exactly, whereas the “hierarchical” algorithm finds an extra segment, which causes the Hausdorff distance to be greater than zero. The two “hybrid” cases are interesting in that they show how the algorithm works by adopting either the exact or the hierarchical algorithm, depending on the threshold and the size of the segment being analyzed. When the threshold argument is large, it applies the exact algorithm to the data set, whereas when the threshold is small, it applies the hierarchical algorithm instead.
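The comparison above can be reproduced along the following lines, reusing the penalized likelihood and the data set \(D\) from Section 4.1. The algorithm values and the threshold argument are those mentioned in the text; treat the exact call signature as an assumption to be checked against the installed segmentr version.

```r
estimates <- list(
  exact        = segment(D, likelihood = penalized_likelihood,
                         algorithm = "exact"),
  hierarchical = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hierarchical"),
  hybrid_50    = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hybrid", threshold = 50),
  hybrid_4     = segment(D, likelihood = penalized_likelihood,
                         algorithm = "hybrid", threshold = 4)
)

# Hausdorff distance of each estimate to the expected solution C = {6, 11}
sapply(estimates, function(r) hausdorff_distance(r$changepoints, c(6, 11)))
```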
Consider now the simulated data set of segments with similar averages described in equation (4.6). We do the same algorithm comparison with this data set in Table 4.6. In contrast with the previous experiment, all the algorithms manage to find the expected solution to the problem.
Table 4.6: Change points estimated by the different algorithms on the similar averages data set, with the Hausdorff distance to the expected solution.

Description | Changepoints | Hausdorff Distance |
---|---|---|
Expected solution | 6, 16 | 0 |
Exact algorithm estimate | 6, 16 | 0 |
Hierarchical algorithm estimate | 6, 16 | 0 |
Hybrid algorithm estimate with threshold 50 | 6, 16 | 0 |
Hybrid algorithm estimate with threshold 4 | 6, 16 | 0 |
References
Castro, Bruno M., Renan B. Lemes, Jonatas Cesar, Tábita Hünemeier, and Florencia Leonardi. 2018. “A Model Selection Approach for Multiple Sequence Segmentation and Dimensionality Reduction.” Journal of Multivariate Analysis 167: 319–30. https://doi.org/10.1016/j.jmva.2018.05.006.
Ceballos, Francisco, Peter Joshi, David W. Clark, Michèle Ramsay, and James F. Wilson. 2018. “Runs of Homozygosity: Windows into Population History and Trait Architecture.” Nature Reviews Genetics 19 (January). https://doi.org/10.1038/nrg.2017.109.
Park, K. I. 2017. Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer International Publishing. https://books.google.com.br/books?id=cd2SswEACAAJ.