Methods: we adopted a biological-networks approach that enables systematic interrogation of the entities ChatGPT links together. In particular, we designed an ontology-driven fact-checking algorithm that compares biological graphs constructed from approximately 200,000 PubMed abstracts with counterparts constructed from a dataset generated by the ChatGPT (GPT-3.5 Turbo) model. Nodes represent biological entities (genes and diseases) that occur in the text; edges represent the co-occurrence of two entities in the same document, weighted by the proximity (character) distance between them. This research adopts a ``closed-world assumption'': fact-checking is performed using only the literature dataset as ground truth.

Results: in ten samples of 250 records drawn randomly from the ChatGPT dataset of 1,000 ``simulated'' articles, the fact-checking link accuracy ranged from 70% to 86%, while the remaining links could not be verified. Given the closed-world assumption, this precision is substantial. When measuring and comparing the proximity distances of edges in the literature graphs against those in the ChatGPT graphs, we found that the ChatGPT distances were significantly shorter, ranging from 90 to 153 characters, whereas the proximity distances of biological entities identified in the literature ranged from 236 to 765 characters. This pattern held for all relationships among biological entities in the ten samples.

Conclusion: this study demonstrated a reasonably high aggregate fact-checking accuracy for disease-gene relationships found in ChatGPT-generated texts. The strikingly consistent pattern of short proximity distances across all samples offers illuminating feedback on the biological knowledge captured in the literature today.
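The graph construction and closed-world fact-checking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes entity mentions are located by exact string matching, uses the minimum character distance between mentions as the edge weight, and treats a generated edge as verified only if the same entity pair co-occurs in the literature graph. The function and variable names are hypothetical.

```python
from itertools import combinations

def build_cooccurrence_graph(documents, entities):
    """Build a co-occurrence graph: nodes are entities, and an edge links
    any two entities mentioned in the same document, weighted by the
    minimum character distance between their mentions (an assumption;
    the paper's exact weighting may differ)."""
    graph = {}  # (entity_a, entity_b) -> minimum character distance
    for text in documents:
        # Record the character offset of each entity's first mention.
        positions = {}
        for entity in entities:
            idx = text.find(entity)
            if idx != -1:
                positions[entity] = idx
        # Connect every pair of entities found in this document.
        for a, b in combinations(sorted(positions), 2):
            dist = abs(positions[a] - positions[b])
            graph[(a, b)] = min(dist, graph.get((a, b), dist))
    return graph

def fact_check(generated_graph, literature_graph):
    """Closed-world fact-checking: a generated edge counts as verified
    only if the same entity pair co-occurs in the literature graph."""
    if not generated_graph:
        return 0.0
    verified = [e for e in generated_graph if e in literature_graph]
    return len(verified) / len(generated_graph)
```

Under this sketch, an edge absent from the literature graph is not labeled false, only unverified, which mirrors the closed-world assumption: the literature is the sole ground truth, so accuracy is the fraction of generated links it can confirm.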