The effect of bias on hypothesis formation is characterized for an automated, data-driven projection pursuit neural network that extracts and selects features for binary classification of data streams. This intelligent exploratory process partitions a complete vector state space into disjoint subspaces to create working hypotheses quantified by the similarities and differences observed between two groups of labeled data streams. Data streams are typically time-sequenced and may exhibit complex spatio-temporal patterns. For example, given atomic trajectories from molecular dynamics simulations, the machine's task is to quantify the dynamical mechanisms that promote function by comparing protein mutants, some of which are known to function while others are nonfunctional. Using synthetic two-dimensional molecules that mimic the dynamics of functional and nonfunctional proteins, biases are identified and controlled in both the machine learning model and the selected training data under different contexts. The refinement of a working hypothesis converges to a statistically robust multivariate perception of the data based on a context-dependent perspective. Including diverse perspectives during data exploration enhances the interpretability of the multivariate characterization of similarities and differences.