Data Integrity Policy
June 2025
The Data Integrity Policy outlines the principles and procedures System follows to ensure the accuracy, consistency, and reliability of its data throughout its lifecycle. This policy helps maintain trust in the organization’s data systems, supports compliance with regulatory standards, and ensures that data can be used confidently to support decision-making and reporting. It establishes guidelines for data entry, storage, access, modification, and transfer to prevent unauthorized changes, corruption, or loss.
These are the eight key steps we take to ensure the accuracy and reliability of our data:
We maintain high security and compliance standards for the storage and management of data. We conduct regular audits with third-party providers to ensure compliance with strict standards. We design our processes to minimize exposure to private or personally identifiable data.
We use a continuous human-in-the-loop framework to monitor and improve our extraction process. The data in the System Graph and System Clinical Graph are based on cutting-edge text extraction, tuned to exceed industry standards for accuracy. We regularly measure accuracy with validation from subject matter experts.
We use automated tools to check for data quality at every step in our processing pipeline. We employ monitoring to ensure data integrity and quality.
We ensure that data in the System Graph and System Clinical Graph are traceable directly to the original source. We include original source metadata for every extracted relationship so that System’s products all have citations for original sources. System Syntheses are always created from extracted relationships and not from pre-summarized text. Users can track all citations found in any of System’s products back to the original source.
We engineer System Synthesis to minimize the likelihood of “hallucination.” Unlike some research summary services, we do not synthesize large amounts of unstructured text; we create summaries of relationships extracted from original sources using rule-based algorithms (not LLM-generated text). Users only see syntheses that are based on context provided to LLMs (and not based on knowledge encoded in LLMs). We further reduce the likelihood of hallucinations by running post-processing to ensure that all statements in a synthesis are traceable to one or more source statements.
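As an illustration only, a traceability post-processing check of the kind described above might look like the following sketch. It is not System's actual implementation: the `SynthesisStatement` type, the `cited_ids` field, the `keep_traceable` function, and the relationship identifiers are all hypothetical names chosen for the example. The sketch keeps a synthesized statement only if it cites at least one identifier and every cited identifier belongs to a known extracted relationship.

```python
from dataclasses import dataclass, field


@dataclass
class SynthesisStatement:
    """A synthesized statement and the relationship IDs it cites (hypothetical shape)."""
    text: str
    cited_ids: list[str] = field(default_factory=list)


def keep_traceable(statements, known_relationship_ids):
    """Return only statements whose citations all resolve to known relationships.

    A statement with no citations is treated as untraceable and dropped.
    """
    known = set(known_relationship_ids)
    kept = []
    for statement in statements:
        has_citations = bool(statement.cited_ids)
        all_resolve = all(cid in known for cid in statement.cited_ids)
        if has_citations and all_resolve:
            kept.append(statement)
    return kept


# Hypothetical example data.
statements = [
    SynthesisStatement("X is associated with Y", ["rel-1"]),
    SynthesisStatement("Y causes Z", []),          # no citation: dropped
    SynthesisStatement("Z lowers W", ["rel-9"]),   # unknown citation: dropped
]
traceable = keep_traceable(statements, {"rel-1", "rel-2"})
print([s.text for s in traceable])
```

Dropping a statement outright is the most conservative choice for a sketch like this; a production pipeline could instead route untraceable statements to human review.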
We regularly benchmark the accuracy of System Synthesis using question-answering standards produced by third parties (including BioASQ and Mayo Clinic). In our latest benchmark review, we had 85-90% accuracy across different test datasets.
We have created tools and processes for users to suggest revisions to information they find inaccurate. Any data that is flagged by a user is hidden from other users while our team works to correct the information.
We proactively remove retracted studies from our data. Studies flagged by Retraction Watch and similar services are automatically removed from our data when identified; any extracted relationships from retracted studies are not seen by users.
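The retraction step above can be sketched in a few lines, under stated assumptions. This is a hypothetical illustration, not System's production code: it assumes each extracted relationship carries the DOI of its source study and that the list of retracted DOIs has already been obtained from a service such as Retraction Watch. The names `Relationship`, `normalize_doi`, and `drop_retracted` are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One extracted relationship plus the DOI of its source study (hypothetical shape)."""
    statement: str
    source_doi: str


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip the resolver prefix so equivalent spellings match."""
    return doi.strip().lower().removeprefix("https://doi.org/")


def drop_retracted(relationships, retracted_dois):
    """Return the relationships whose source study is not in the retracted set."""
    retracted = {normalize_doi(d) for d in retracted_dois}
    return [r for r in relationships if normalize_doi(r.source_doi) not in retracted]


# Hypothetical example data; the DOIs are placeholders, not real studies.
relationships = [
    Relationship("A increases B", "10.1000/example.1"),
    Relationship("C reduces D", "10.1000/example.2"),
]
retracted_dois = ["https://doi.org/10.1000/EXAMPLE.2"]

visible = drop_retracted(relationships, retracted_dois)
print([r.statement for r in visible])
```

Normalizing identifiers before comparison matters here because the same DOI can appear with different casing or with a resolver URL prefix, and a missed match would leave a retracted study visible.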