# What types of data do we collect?

System programmatically collects and computes the following information from each source of evidence:

<table><thead><tr><th width="150">Source</th><th width="598.4285714285713">Information collected and computed</th></tr></thead><tbody><tr><td>Peer-Reviewed Scientific Articles</td><td><ul><li><p>System has fine-tuned multiple large language models (LLMs) on a human-labeled scientific corpus to extract statistical relationships from text. At present, System supports more than 100 statistical associations and algorithms, including ratio types, differences and gains, correlations, and regression and other coefficients. System also calculates differences and gains from two reported group values (e.g., when given two means and a p-value, our models calculate and return a mean difference).</p><ul><li><p>These relationships consist of:</p><ul><li>A pair of variables (e.g., the intervention and outcome, or the independent and dependent variables)</li><li>A point estimate of the statistic (e.g., an odds ratio or hazard ratio value)</li><li>A confidence interval and confidence level</li><li>A p-value</li></ul></li></ul></li><li><p>We have also layered in causal and mechanistic statements generated from rules-based models applied to scientific articles.</p><ul><li><p>These relationships consist of: </p><ul><li>A pair of agents (subject and object)</li><li>A statement type (e.g., activates, inhibits, phosphorylates)</li></ul></li></ul></li><li><p>Additionally, when available, System extracts various population characteristics for each study. For example:</p><ul><li>Study population type (e.g., humans, rats)</li><li>Sample size</li><li>Age range</li><li>Sex</li><li>Location</li></ul></li></ul></td></tr><tr><td>Dataset</td><td><ul><li>For each feature in a dataset, System computes feature statistics, which vary depending on the data type of the feature. For example, for numerical features, System computes summary statistics and produces a histogram of the feature’s distribution. 
When features are “character” or “string” types, System computes statistics based on the string length so as to mask potential sources of personally identifiable information (PII) or protected health information (PHI).</li><li>For the relevant features in each dataset, System also computes pairwise correlations between features, again depending on their data types.</li><li>When pairs of features are both numerical, System calculates Pearson <em>r</em> correlation values and Kendall rank correlation values.</li><li>When pairs of features are both numerical and time series, System also calculates statistics on the data after differencing and detrending, and computes “lag” correlations between features (where one time series feature is related to lagged values of another time series feature). Appropriate lags are chosen according to the unit of the time series: <code>{1 day: 30-unit lag, 7 days: 12-unit lag, 30 days: 6-unit lag, 365 days: 10-unit lag}</code></li><li>When pairs of features are both categorical, System calculates Cramér’s V association values.</li><li>When one feature in a pair is numerical and the other is categorical, System calculates Kruskal-Wallis <em>H</em> test values.</li><li>For datasets retrieved with the platform’s data warehouse integrations (Redshift, BigQuery, Snowflake), System currently computes only Pearson <em>r</em> correlation values (for numerical and time series features) and computes the relevant relationships between time series features.</li></ul></td></tr><tr><td>Model</td><td><ul><li>For each feature in a dataset used for training or testing a model, System computes feature statistics, which vary depending on the data type of the feature. For example, for numerical features, System computes summary statistics and produces a histogram of the feature’s distribution. 
When features are “character” or “string” types, System computes statistics based on the string length so as to mask potential sources of personally identifiable information (PII) or protected health information (PHI).</li><li>For the relevant features in each dataset, System also computes pairwise correlations between features, again depending on their data types.</li><li>When pairs of features are both numerical, System calculates Pearson <em>r</em> correlation values and Kendall rank correlation values.</li><li>When pairs of features are both numerical and time series, System also calculates statistics on the data after differencing and detrending, and computes “lag” correlations between features (where one time series feature is related to lagged values of another time series feature, and the appropriate lags are chosen according to the unit of time).</li><li>When pairs of features are both categorical, System calculates Cramér’s V association values.</li><li>When one feature in a pair is numerical and the other is categorical, System calculates Kruskal-Wallis <em>H</em> test values.</li><li>For datasets retrieved with the platform’s data warehouse integrations (Redshift, BigQuery, Snowflake), System currently computes only Pearson <em>r</em> correlation values (for numerical and time series features) and computes the relevant relationships between time series features. 
</li><li>For models shared on System, the platform collects information about the model and computes performance metrics and feature importance values based on submitted “test” datasets (or “training” datasets, when also submitted).</li><li>For model objects, metadata about the statistical package used to train the model (e.g., scikit-learn, XGBoost, statsmodels, or TensorFlow), the model specification (including any specified hyperparameters), and the names of features used to train the model are collected and presented on model pages.</li><li>Depending on the model type, relevant performance metrics are computed using the “test” sample provided. For classifier models, these include (but are not limited to) accuracy, precision, recall, F1 score, ROC AUC, and confusion matrices. For regression models, these include R2 scores and measures of prediction error such as RMSE.</li><li>Additionally, feature importance scores are calculated. For each model performance metric, a “permutation score” is computed, where each feature’s contribution to the performance metric is estimated by measuring how randomly permuting that feature’s values affects the metric. This effect is reported, and features are assigned rank values based on their relative importance.</li><li>System also computes interpretability and explainability measures when appropriate. Model pages often include partial dependence plots (PDPs) alongside feature importance values to show the marginal impact of a feature on a model’s predicted values. Model pages will soon also include SHAP importance plots for features.</li><li>For models deployed “in production,” drift in both model performance and feature importance is tracked.</li></ul></td></tr></tbody></table>

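The “lag” correlations described for time series features can be illustrated with a minimal sketch. It is not System's implementation: the function names are hypothetical, first-order differencing stands in for the differencing and detrending step, and reading the quoted lag values as the maximum lag searched for each time unit is an assumption.

```python
import math

# Lag windows quoted in the table above, keyed by the time-series unit in days.
# Treating each value as the maximum lag to search is an assumption.
MAX_LAG_BY_UNIT_DAYS = {1: 30, 7: 12, 30: 6, 365: 10}


def pearson_r(x, y):
    """Plain Pearson correlation; returns None when either series is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def difference(series):
    """First-order differencing, a simple stand-in for differencing and detrending."""
    return [b - a for a, b in zip(series, series[1:])]


def best_lag_correlation(leader, follower, unit_days, min_overlap=3):
    """Correlate follower[t] with leader[t - lag]; return the (lag, r) with the largest |r|."""
    max_lag = MAX_LAG_BY_UNIT_DAYS[unit_days]
    best = None
    for lag in range(max_lag + 1):
        x = leader[: len(leader) - lag]
        y = follower[lag:]
        if len(x) < min_overlap:
            break  # too few overlapping points to correlate
        r = pearson_r(x, y)
        if r is not None and (best is None or abs(r) > abs(best[1])):
            best = (lag, r)
    return best


# Toy data: the follower repeats the leader two steps later.
leader = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
follower = [0, 0] + leader[:-2]
lag, r = best_lag_correlation(leader, follower, unit_days=30)
print(lag, round(r, 3))  # 2 1.0
```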
**How System determines “strength”**

Strength is an algorithm-agnostic measure of the magnitude of the effect implied by an association. System's methodology differs based on the type of the association.

For correlation-style associations (such as Pearson's r or Kendall's tau), we use commonly accepted community guidelines to bucket each association into one of the following five categories:

| STRENGTH    | PEARSON-R   | KENDALL-TAU | CRAMER-V      | EFFECT SIZE |
| ----------- | ----------- | ----------- | ------------- | ----------- |
| Very Weak   | \[0, 0.1)   | \[0, 0.1)   | \[0, 0.05)    | \[0, 0.1)   |
| Weak        | \[0.1, 0.3) | \[0.1, 0.3) | \[0.05, 0.1)  | \[0.1, 0.3) |
| Medium      | \[0.3, 0.6) | \[0.3, 0.6) | \[0.1, 0.15)  | \[0.3, 0.6) |
| Strong      | \[0.6, 0.9) | \[0.6, 0.9) | \[0.15, 0.25) | \[0.6, 0.9) |
| Very Strong | \[0.9, 1]   | \[0.9, 1]   | \[0.25, …)    | \[0.9, 1]   |

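As a minimal sketch, the table above can be read as a threshold lookup. The function and dictionary names are hypothetical, and applying the thresholds to the absolute value of the statistic is an assumption (the table lists only non-negative ranges).

```python
from bisect import bisect_right

LABELS = ["Very Weak", "Weak", "Medium", "Strong", "Very Strong"]

# Lower bounds of the Weak, Medium, Strong and Very Strong buckets, per the table.
THRESHOLDS = {
    "pearson_r": [0.1, 0.3, 0.6, 0.9],
    "kendall_tau": [0.1, 0.3, 0.6, 0.9],
    "cramer_v": [0.05, 0.1, 0.15, 0.25],
    "effect_size": [0.1, 0.3, 0.6, 0.9],
}


def strength_label(statistic, value):
    """Map a statistic's value to a strength bucket using half-open intervals."""
    magnitude = abs(value)  # assumption: the sign of the association is ignored
    # bisect_right counts how many lower bounds the magnitude has reached,
    # so a value exactly on a boundary falls into the higher bucket.
    return LABELS[bisect_right(THRESHOLDS[statistic], magnitude)]


print(strength_label("pearson_r", 0.983))  # Very Strong
print(strength_label("pearson_r", 0.867))  # Strong
print(strength_label("cramer_v", 0.12))    # Medium
```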
For associations derived from predictive models, we use the evidence already on System to bin the value of a feature’s importance into one of the above buckets. The feature importance value (e.g., permutation score) is multiplied by the performance of the model from which the association was derived (e.g., F1 score), and the product is compared with those of similar associations on System.

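For readers unfamiliar with permutation scores, the idea can be sketched in a few lines of plain Python: shuffle one feature at a time and record how much the performance metric drops. This is an illustration under stated assumptions, not System's implementation. The function names, the toy model, the number of repeats, and the use of accuracy as the metric are all hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_scores(predict, rows, y_true, metric, n_repeats=20, seed=0):
    """Mean drop in `metric` when each feature's values are shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(y_true, [predict(row) for row in rows])
    scores = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[feature] for row in rows]
            rng.shuffle(shuffled)
            # Copy each row with only this feature replaced by a shuffled value.
            permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
            drops.append(baseline - metric(y_true, [predict(row) for row in permuted]))
        scores[feature] = sum(drops) / n_repeats
    return scores


def rank_features(scores):
    """Assign rank 1 to the feature with the largest permutation score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Toy classifier that depends on "x" and ignores "noise".
def toy_model(row):
    return row["x"] >= 0.5


rows = [{"x": i / 20, "noise": (i * 7) % 5} for i in range(20)]
y_true = [row["x"] >= 0.5 for row in rows]

scores = permutation_scores(toy_model, rows, y_true, accuracy)
ranks = rank_features(scores)
print(ranks)
```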
| STRENGTH    | <p>REGRESSORS<br>(R2 SCORE \* PERMUTATION SCORE)</p> | <p>CLASSIFIERS<br>(F1 SCORE \* PERMUTATION SCORE)</p> |
| ----------- | ---------------------------------------------------- | ----------------------------------------------------- |
| Very Weak   | \[0, 0.1) of max on System                           | \[0, 0.1) of max on System                            |
| Weak        | \[0.1, 0.3) of max on System                         | \[0.1, 0.3) of max on System                          |
| Medium      | \[0.3, 0.6) of max on System                         | \[0.3, 0.6) of max on System                          |
| Strong      | \[0.6, 0.9) of max on System                         | \[0.6, 0.9) of max on System                          |
| Very Strong | \[0.9, 1] of max on System                           | \[0.9, 1] of max on System                            |

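The model-derived buckets can be sketched the same way. In this hypothetical example the product of model score and permutation score is expressed as a fraction of the largest such product on System. The `max_on_system` argument is a placeholder for a value only the platform knows, and clamping the fraction to the range 0 to 1 (for example, when a permutation score is negative) is an assumption.

```python
from bisect import bisect_right

LABELS = ["Very Weak", "Weak", "Medium", "Strong", "Very Strong"]
CUTS = [0.1, 0.3, 0.6, 0.9]  # lower bounds of Weak .. Very Strong, per the table


def model_strength(model_score, permutation_score, max_on_system):
    """Bucket (model score * permutation score) relative to the platform maximum."""
    if max_on_system <= 0:
        raise ValueError("max_on_system must be positive")
    fraction = (model_score * permutation_score) / max_on_system
    fraction = min(max(fraction, 0.0), 1.0)  # assumption: clamp to [0, 1]
    return LABELS[bisect_right(CUTS, fraction)]


# Hypothetical inputs: an F1 or R2 score, a permutation score, and a platform maximum.
print(model_strength(0.9, 0.5, 1.0))   # Medium
print(model_strength(0.5, 0.1, 1.0))   # Very Weak
print(model_strength(0.8, 0.25, 0.2))  # Very Strong
```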
**Examples**

| Source                                                                                                                                | Source Type | Statistical Association Retrieved                                                                                                                                                | Strength    | Significance | Relationship on System                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- | ------------ | ------------------------------------------------------------------------------------------------------- |
| [World Bank Education Indicators in Least Developed Countries](https://beta.system.com/view/dataset/sx8xzYYNzbC?view_context=catalog) | Dataset     | <p>PEARSON R: 0.983 between</p><p><code>primary\_school\_life\_expectancy\_years</code></p><p>and</p><p><code>primary\_school\_completion\_rate\_female</code></p>               | Very Strong | P < 0.001    | [**Link**](https://beta.system.com/view/metric-relationship/XxuIvFIoBn4/tqNoyTzhyMj?view_context=graph) |
| [COVID US Aggregate Lag Analysis](https://beta.system.com/view/dataset/JRk1M23zLdI?view_context=catalog)                              | Dataset     | <p>Max PEARSON R (when one feature is lagged): 0.867 between</p><p><code>Confirmed Cases Of COVID-19</code></p><p>and</p><p><code>Deaths From COVID-19</code></p><p>+ 23 lag</p> | Strong      | P < 0.001    | [**Link**](https://beta.system.com/view/metric-relationship/keU4ewFbpt2/STQBWdquZ6H?view_context=graph) |
| [Regression XGBoost Predicting Weekly Incident COVID-19 Deaths](https://beta.system.com/view/model/rJilPCxYteE?view_context=catalog)  | Model       | <p>R2 Permutation Score: 0.225 between</p><p><code>Two\_Week\_Prior\_Weekly\_Deaths</code></p><p>and</p><p><code>Weekly\_Deaths</code></p>                                       | Very Strong | P < 0.001    | [**Link**](https://beta.system.com/view/model/rJilPCxYteE?view_context=catalog)                         |
| [Adolescent Caffeine Use, ADHD, And Cigarette Smoking](https://beta.system.com/view/project/7vhgr45EWUs?view_context=graph)           | Paper       | <p>Adjusted Odds Ratio: 1.94 between</p><p><code>Individual Is A Lifetime Cigarette Smoker</code></p><p>and</p><p><code>Individual Consumes Caffeinated Coffee</code></p>        | Very Strong | P = 0.003    | [**Link**](https://beta.system.com/view/metric-relationship/v4A6yw9TgNG/AqSFuIzaUED?view_context=graph) |
