Biochemical discovery increasingly relies on classifying molecular structures when the consequences of different errors are highly asymmetric. In mutagenicity and carcinogenicity, misclassifying a harmful compound as benign can trigger substantial scientific, regulatory, and health risks, whereas false alarms primarily increase laboratory workload. Modern representations transform molecular graphs into persistence image tensors that preserve multiscale geometric and topological structure, yet existing tensor classifiers and deep tensor neural networks provide no finite-sample guarantees on type I error and often exhibit severe error inflation in practice. We develop the first Tensor Neyman-Pearson (Tensor-NP) classification framework that achieves finite-sample control of type I error while exploiting the multi-mode structure of tensor data. Under a tensor-normal mixture model, we derive the oracle NP discriminant, characterize its Tucker low-rank manifold geometry, and establish tensor-specific margin and conditional detection conditions enabling high-probability bounds on excess type II error. We further propose a Discriminant Tensor Iterative Projection estimator and a Tensor-NP Neural Classifier combining deep learning with Tensor-NP umbrella calibration, yielding the first distribution-free NP-valid methods for multiway data. Across four biochemical datasets, Tensor-NP classifiers maintain type I errors at prespecified levels while delivering competitive type II error performance, providing reliable tools for asymmetric-risk decisions with complex molecular tensors.
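The umbrella calibration mentioned in the abstract can be illustrated with a minimal sketch of the generic NP umbrella threshold rule from the Neyman-Pearson classification literature. It is not the authors' Tensor-NP implementation. It assumes that class 0 is the high-stakes class (for example, harmful compounds), that any scoring function (tensor-based or otherwise) has already produced real-valued scores on a held-out class-0 sample, and that larger scores favor class 1. The names `umbrella_rank`, `np_threshold`, and `classify` are hypothetical, as are the target level `alpha` and the tolerance `delta`.

```python
import math


def umbrella_rank(n, alpha, delta):
    """Return the smallest rank k (1-indexed) such that thresholding at the
    k-th smallest held-out class-0 score keeps the type I error above alpha
    with probability at most delta. Returns None if n is too small.

    Uses the binomial tail bound
        v(k) = sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j),
    which decreases as k grows.
    """
    tail = 0.0
    best = None
    # Accumulate the tail from j = n downward; stop once it exceeds delta.
    # Note: for very large n, work in log-space to avoid float overflow.
    for k in range(n, 0, -1):
        tail += math.comb(n, k) * (1 - alpha) ** k * alpha ** (n - k)
        if tail <= delta:
            best = k
        else:
            break
    return best


def np_threshold(class0_scores, alpha, delta):
    """Pick the score threshold from held-out class-0 scores (assumed i.i.d.)."""
    k = umbrella_rank(len(class0_scores), alpha, delta)
    if k is None:
        raise ValueError("too few class-0 samples for the requested alpha and delta")
    return sorted(class0_scores)[k - 1]


def classify(score, threshold):
    """Predict class 1 only when the score strictly exceeds the threshold."""
    return 1 if score > threshold else 0
```

Under these assumptions, the guarantee depends only on the held-out class-0 scores being exchangeable, not on how the scores were produced, which is why the same calibration step can be wrapped around different base classifiers. With `alpha = 0.05` and `delta = 0.05`, at least 59 held-out class-0 samples are needed before any valid rank exists.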