Large pools of synthetic DNA molecules have recently been used to reliably store significant volumes of digital data. While DNA has enormous potential as a storage medium because of its high storage density, its practical use is currently severely limited by the high cost and low throughput of available DNA synthesis technologies. We study the role of batch optimization in reducing the cost of large-scale DNA synthesis, which translates to the following algorithmic task. Given a large pool $\mathcal{S}$ of random quaternary strings of fixed length, partition $\mathcal{S}$ into batches in a way that minimizes the sum of the lengths of the shortest common supersequences across batches. We introduce two ideas for batch optimization, each of which improves (in a different way) upon a naive baseline: (1) using both $(ACGT)^{*}$ and its reverse $(TGCA)^{*}$ as reference strands, and batching appropriately, and (2) batching via the quantiles of an appropriate ordering of the strands. We also prove asymptotically matching lower bounds on the cost of DNA synthesis, showing that one cannot improve upon these two ideas. Our results uncover a surprising separation between two cases that naturally arise in the context of DNA data storage: the asymptotic cost savings of batch optimization are significantly greater when the strings in $\mathcal{S}$ do not contain repeats of the same character (homopolymers) than when the strings in $\mathcal{S}$ are unconstrained.
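
To make the cost model concrete, the following is a minimal sketch, not the algorithm analyzed in this work. It assumes that a batch is synthesized against a periodic reference strand, so that the batch cost is the length of the shortest prefix of the reference containing every strand of the batch as a subsequence. Such a prefix is a common supersequence of the batch, so this cost is an upper bound on the length of its shortest common supersequence. The function names and the greedy rule that assigns each strand to whichever of $(ACGT)^{*}$ or $(TGCA)^{*}$ covers it sooner are illustrative assumptions.

```python
def cycles_needed(strand: str, period: str = "ACGT") -> int:
    """Length of the shortest prefix of period* that contains `strand`
    as a subsequence (hypothetical helper, for illustration only)."""
    pos = 0  # number of reference characters consumed so far
    for ch in strand:
        # Distance from the current reference position to the next
        # occurrence of `ch` in the periodic reference.
        offset = (period.index(ch) - pos) % len(period)
        pos += offset + 1
    return pos


def single_reference_cost(strands) -> int:
    """Cost of one batch synthesized against (ACGT)*: the longest
    prefix required by any strand in the batch."""
    return max((cycles_needed(s, "ACGT") for s in strands), default=0)


def two_reference_cost(strands) -> int:
    """Cost of two batches, one per reference strand.

    Assumption: each strand is greedily assigned to the reference,
    (ACGT)* or (TGCA)*, that covers it with the shorter prefix.
    """
    forward, reverse = [], []
    for s in strands:
        f = cycles_needed(s, "ACGT")
        r = cycles_needed(s, "TGCA")
        if f <= r:
            forward.append(f)
        else:
            reverse.append(r)
    # Total cost is the sum of the per-batch prefix lengths.
    return max(forward, default=0) + max(reverse, default=0)
```

The greedy assignment is only one possible way to split the pool between the two references; it is not guaranteed to minimize the total cost, and it does not implement the quantile-based batching of idea (2).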