The rapid development of Large Language Models (LLMs) for code generation has transformed software development by automating coding tasks with unprecedented efficiency. However, the training of these models on open-source code repositories (e.g., from GitHub) raises critical ethical and legal concerns, particularly regarding data authorization and open-source license compliance. Developers are increasingly questioning whether model trainers have obtained proper authorization before using repositories for training, especially given the lack of transparency in data collection. To address these concerns, we propose a novel data marking framework RepoMark to audit the data usage of code LLMs. Our method enables auditors to verify whether their code has been used in training, while ensuring semantic preservation, imperceptibility, and theoretical false detection rate (FDR) guarantees. By generating multiple semantically equivalent code variants, RepoMark introduces data marks into the code files, and during detection, RepoMark leverages a novel ranking-based hypothesis test to detect model behavior difference on trained data. Compared to prior data auditing approaches, RepoMark significantly enhances data efficiency, allowing effective auditing even when the user's repository possesses only a small number of code files. Experiments demonstrate that RepoMark achieves a detection success rate over 90\% on small code repositories under a strict FDR guarantee of 5\%. This represents a significant advancement over existing data marking techniques, all of which only achieve accuracy below 55\% under identical settings. This further validates RepoMark as a robust, theoretically sound, and promising solution for enhancing transparency in code LLM training, which can safeguard the rights of code authors.


暂无翻译

0
下载
预览

Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.


暂无翻译

0
下载
预览

Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose \eval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. \eval assesses models' virus understanding through three lenses, including sequence modeling, mutational effects prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: Excluded knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations, and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.


暂无翻译

0
下载
预览

We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.


暂无翻译

0
下载
预览

Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it particularly well-for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA's efficiency, adaptability, and broad applicability.


暂无翻译

0
下载
预览

While traditionally not considered part of the scientific method, science communication is increasingly playing a pivotal role in shaping scientific practice. Researchers are now frequently compelled to publicise their findings in response to institutional impact metrics and competitive grant environments. This shift underscores the growing influence of media narratives on both scientific priorities and public perception. In a current trend of personality-driven reporting, we examine patterns in science communication that may indicate biases of different types, towards topics and researchers. We focused and applied our methodology to a corpus of media coverage from three of the most prominent scientific media outlets: Wired, Quanta, and The New Scientist -- spanning the past 5 to 10 years. By mapping linguistic patterns, citation flows, and topical convergence, our objective was to quantify the dimensions and degree of bias that influence the credibility of scientific journalism. In doing so, we seek to illuminate the systemic features that shape science communication today and to interrogate their broader implications for epistemic integrity and public accountability in science. We present our results with anonymised journalist names but conclude that personality-driven media coverage distorts science and the practice of science flattening rather than expanding scientific coverage perception. Keywords : selective sourcing, bias, scientific journalism, Quanta, Wired, New Scientist, fairness, balance, neutrality, standard practices, distortion, personal promotion, communication, media outlets.


暂无翻译

1
下载
预览

It is well known that phase formation by electrodeposition yields films of poorly controllable morphology. This typically leads to a range of technological issues in many fields of electrochemical technology. Presently, a particularly relevant case is that of high-energy density next-generation batteries with metal anodes, that cannot yet reach practical cyclability targets, owing to uncontrolled elelctrode shape evolution. In this scenario, mathematical modelling is a key tool to lay the knowledge-base for materials-science advancements liable to lead to concretely stable battery material architectures. In this work, we introduce the Evolving Surface DIB (ESDIB) model, a reaction-diffusion system posed on a dynamically evolving electrode surface. Unlike previous fixed-surface formulations, the ESDIB model couples surface evolution to the local concentration of electrochemical species, allowing the geometry of the electrode itself to adapt in response to deposition. To handle the challenges related to the coupling between surface motion and species transport, we numerically solve the system by proposing an extension of the Lumped Evolving Surface Finite Element Method (LESFEM) for spatial discretisation, combined with an IMEX Euler scheme for time integration. The model is validated through six numerical experiments, each compared with laboratory images of electrodeposition. Results demonstrate that the ESDIB framework accurately captures branching and dendritic growth, providing a predictive and physically consistent tool for studying metal deposition phenomena in energy storage devices.


暂无翻译

0
下载
预览

We propose a mathematically principled PDE gradient flow framework for distributionally robust optimization (DRO). Exploiting the recent advances in the intersection of Markov Chain Monte Carlo sampling and gradient flow theory, we show that our theoretical framework can be implemented as practical algorithms for sampling from worst-case distributions and, consequently, DRO. While numerous previous works have proposed various reformulation techniques and iterative algorithms, we contribute a sound gradient flow view of the distributional optimization that can be used to construct new algorithms. As an example of applications, we solve a class of Wasserstein and Sinkhorn DRO problems using the recently-discovered Wasserstein Fisher-Rao and Stein variational gradient flows. Notably, we also show some simple reductions of our framework recover exactly previously proposed popular DRO methods, and provide new insights into their theoretical limit and optimization dynamics. Numerical studies based on stochastic gradient descent provide empirical backing for our theoretical findings.


暂无翻译

0
下载
预览

Shallow free surface flows are often characterized by both subdomains that require high modeling complexity and subdomains that can be sufficiently accurately modeled with low modeling complexity. Moreover, these subdomains may change in time as the water flows through the domain. This motivates the need for space and time adaptivity in the simulation of shallow free surface flows. In this paper, we develop the first adaptive simulations using the recently developed Shallow Water Moment Equations, which are an extension of the standard Shallow Water Equations that allow for vertically changing velocity profiles by including additional variables and equations. The model-specific modeling complexity of a shallow water moment model is determined by its order. The higher the order of the model, the more variables and equations are included in the model. Shallow water moment models are ideally suited for adaptivity because they are hierarchical such that low-order models and high-order models share the same structure. To enable adaptive simulations, we propose two approaches for the coupling of the varying-order shallow water moment equations at their boundary interfaces. The first approach dynamically updates padded state variables but cannot be written in conservative form, while the second approach uses fixed padded state variable values of zero and reduces to conservative form for conservative moment equations. The switching procedure between high-order models and low-order models is based on a new set of model error estimators, originating from a decomposition of the high-order models. Numerical results of the collision of a dam-break wave with a smooth wave yield accurate results, while achieving speedups up to 60 percent compared to a non-adaptive model with fixed modeling complexity.


暂无翻译

0
下载
预览

Local principal component analysis (Local PCA) has proven to be an effective tool for estimating the intrinsic dimension of a manifold. More recently, curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly accounting for the curvature of the underlying manifold, rather than assuming local flatness. Building on these insights, we propose a general framework for manifold dimension estimation that captures the manifold's local graph structure by integrating PCA with regression-based techniques. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art alternatives.


暂无翻译

0
下载
预览
登陆后查看更多精品内容
VIP会员
本周荟萃主题
区块链
区块链(Blockchain)是由节点参与的分布式数据库系统,它的特点是不可更改,不可伪造,也可以将其理解为账簿系统(ledger)。它是比特币的一个重要概念,完整比特币区块链的副本,记录了其代币(token)的每一笔交易。通过这些信息,我们可以找到每一个地址,在历史上任何一点所拥有的价值。
深度学习
机器学习的一个分支,它基于试图使用包含复杂结构或由多重非线性变换构成的多个处理层对数据进行高层抽象的一系列算法。
机器学习
“机器学习是近20多年兴起的一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。机器学习理论主要是设计和分析一些让 可以自动“ 学习”的算法。机器学习算法是一类从数据中自动分析获得规律,并利用规律对未知数据进行预测的算法。因为学习算法中涉及了大量的统计学理论,机器学习与统计推断学联系尤为密切,也被称为统计学习理论。算法设计方面,机器学习理论关注可以实现的,行之有效的学习算法。很多 推论问题属于 无程序可循难度,所以部分的机器学习研究是开发容易处理的近似算法。”

——中文维基百科
强化学习
强化学习(RL)是机器学习的一个领域,与软件代理应如何在环境中采取行动以最大化累积奖励的概念有关。除了监督学习和非监督学习外,强化学习是三种基本的机器学习范式之一。 强化学习与监督学习的不同之处在于,不需要呈现带标签的输入/输出对,也不需要显式纠正次优动作。相反,重点是在探索(未知领域)和利用(当前知识)之间找到平衡。 该环境通常以马尔可夫决策过程(MDP)的形式陈述,因为针对这种情况的许多强化学习算法都使用动态编程技术。经典动态规划方法和强化学习算法之间的主要区别在于,后者不假设MDP的确切数学模型,并且针对无法采用精确方法的大型MDP。
推荐系统
推荐系统,是指根据用户的习惯、偏好或兴趣,从不断到来的大规模信息中识别满足用户兴趣的信息的过程。推荐推荐任务中的信息往往称为物品(Item)。根据具体应用背景的不同,这些物品可以是新闻、电影、音乐、广告、商品等各种对象。推荐系统利用电子商务网站向客户提供商品信息和建议,帮助用户决定应该购买什么产品,模拟销售人员帮助客户完成购买过程。个性化推荐是根据用户的兴趣特点和购买行为,向用户推荐用户感兴趣的信息和商品。随着电子商务规模的不断扩大,商品个数和种类快速增长,顾客需要花费大量的时间才能找到自己想买的商品。这种浏览大量无关的信息和产品过程无疑会使淹没在信息过载问题中的消费者不断流失。为了解决这些问题,个性化推荐系统应运而生。个性化推荐系统是建立在海量数据挖掘基础上的一种高级商务智能平台,以帮助电子商务网站为其顾客购物提供完全个性化的决策支持和信息服务。
卷积神经网络
在深度学习中,卷积神经网络(CNN或ConvNet)是一类深度神经网络,最常用于分析视觉图像。基于它们的共享权重架构和平移不变性特征,它们也被称为位移不变或空间不变的人工神经网络(SIANN)。它们在图像和视频识别,推荐系统,图像分类,医学图像分析,自然语言处理,和财务时间序列中都有应用。
计算机网络
计算机网络( Computer Networks )指将地理位置不同的多台计算机及其外部设备,通过通信线路连接起来,在网络操作系统及网络通信协议的管理和协调下,实现资源共享和信息传递的计算机系统。
命名实体识别
命名实体识别(NER)(也称为实体标识,实体组块和实体提取)是信息抽取的子任务,旨在将非结构化文本中提到的命名实体定位和分类为预定义类别,例如人员姓名、地名、机构名、专有名词等。
机器翻译
机器翻译,又称为自动翻译,是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程。它是计算语言学的一个分支,是人工智能的终极目标之一,具有重要的科学研究价值。
计算机视觉
计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取‘信息’的人工智能系统。
微信扫码咨询专知VIP会员