We consider the problem of breakpoint detection in a regression modeling framework. To that end, we introduce a novel method, the max-EM algorithm, which combines a constrained Hidden Markov Model with the Classification-EM (CEM) algorithm. This algorithm has linear complexity and provides accurate breakpoint detection and parameter estimation. We derive a theoretical result showing that the likelihood of the data, as a function of the regression parameters and the breakpoint locations, increases at each step of the algorithm. We also present two initialization methods for the breakpoint locations in order to deal with local maxima. Finally, a statistical test is developed for the single-breakpoint setting. Simulation experiments based on linear, logistic, Poisson and Accelerated Failure Time regression models show that the final method, which combines the initialization procedure with the max-EM algorithm, performs strongly in terms of both parameter estimation and breakpoint detection. The statistical test is also evaluated and exhibits a correct rejection rate under the null hypothesis and strong power under various alternatives. Two real datasets are analyzed, the UCI bike sharing data and the health disease data, illustrating the value of the method for detecting heterogeneity in the distribution of the data.
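To make the alternating structure concrete, the following is a minimal sketch of a max-EM-style iteration, not the authors' implementation. It assumes a Gaussian linear regression with unit variance in every segment and ignores the transition probabilities of the constrained Hidden Markov Model, so the max-step reduces to finding the best left-to-right segmentation. All function and variable names (`best_segmentation`, `fit_segments`, `max_em_sketch`, `init_breaks`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def best_segmentation(logp):
    """Max-step: best state path that starts in state 0, ends in state K-1,
    and at each observation either stays or advances by one state."""
    n, K = logp.shape
    score = np.full((n, K), -np.inf)
    advanced = np.zeros((n, K), dtype=bool)
    score[0, 0] = logp[0, 0]
    for i in range(1, n):
        for k in range(K):
            stay = score[i - 1, k]
            move = score[i - 1, k - 1] if k > 0 else -np.inf
            advanced[i, k] = move > stay
            score[i, k] = max(stay, move) + logp[i, k]
    # Backtrack from the last state to recover the segment labels.
    states = np.empty(n, dtype=int)
    k = K - 1
    for i in range(n - 1, -1, -1):
        states[i] = k
        if i > 0 and advanced[i, k]:
            k -= 1
    return states

def fit_segments(x, y, states, K):
    """Parameter step: ordinary least squares (intercept, slope) per segment."""
    X = np.column_stack([np.ones(len(x)), x])
    coefs = np.zeros((K, 2))
    for k in range(K):
        m = states == k
        coefs[k], *_ = np.linalg.lstsq(X[m], y[m], rcond=None)
    return coefs

def max_em_sketch(x, y, init_breaks, max_iter=50):
    """Alternate between refitting segments and re-segmenting until stable."""
    n, K = len(x), len(init_breaks) + 1
    # init_breaks holds the index at which each new segment starts.
    states = np.searchsorted(np.asarray(init_breaks), np.arange(n), side="right")
    for _ in range(max_iter):
        coefs = fit_segments(x, y, states, K)
        resid = y[:, None] - (coefs[:, 0] + np.outer(x, coefs[:, 1]))
        # Unit-variance Gaussian log-density, up to an additive constant (assumption).
        new_states = best_segmentation(-0.5 * resid**2)
        if np.array_equal(new_states, states):
            break
        states = new_states
    coefs = fit_segments(x, y, states, K)
    breaks = np.flatnonzero(np.diff(states)) + 1
    return breaks, coefs

# Simulated example: one breakpoint at index 120 (illustrative data only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.where(np.arange(200) < 120, 1 + 2 * x, 6 - 3 * x) + rng.normal(0, 0.1, 200)
breaks, coefs = max_em_sketch(x, y, init_breaks=[100])
```

Each max-step costs O(nK) operations for n observations and K segments, which is linear in n for a fixed number of breakpoints. Because the result depends on `init_breaks`, a poor starting segmentation can leave this sketch in a local maximum, which is the issue the initialization methods mentioned in the abstract are meant to address.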