3DGenBench is a web server for scoring the performance of 3D genomic models. 3DGenBench provides two challenges. The first challenge quantifies how accurately a model predicts experimental data. The second challenge estimates how well a model can predict changes in chromosome folding caused by structural genomic mutations.
Overview
There are five steps required to obtain 3DGenBench scores:
- Explore reference Hi-C dataset
- Generate computational predictions of Hi-C contacts for one or multiple samples (see example data)
- Upload your predictions to the 3DGenBench server
- Provide sample metadata and compute metrics
- Explore metrics
Step 1. Explore Hi-C Dataset
There are two main dataset types for prediction. The rearrangement dataset (hereafter Paired) contains capture Hi-C and complementary epigenetic data, such as CTCF ChIP-seq, for wild-type and mutated samples (Homo sapiens, Mus musculus, Drosophila melanogaster cell lines). The genomic region dataset (hereafter Single) contains loci for prediction larger than 10 Mbp without known chromosome rearrangements. Those datasets can be found here.
Sample metadata include the following information:
The chr, start prediction, and end prediction columns describe the genomic region for which Hi-C interactions are expected to be predicted. The start prediction bin corresponds to the interval from (start_prediction - resolution) to start_prediction; the same rule applies to the end prediction bin.
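The bin convention above can be illustrated with a minimal Python sketch. The function name `bin_interval` and the example coordinates are hypothetical and not part of 3DGenBench; the only rule taken from the text is that a bin labelled X covers (X - resolution) to X.

```python
from typing import Tuple


def bin_interval(coordinate: int, resolution: int) -> Tuple[int, int]:
    """Return the (start, end) interval covered by the bin labelled `coordinate`.

    Convention from the metadata description: a bin labelled X spans
    (X - resolution) to X. The function name is illustrative only.
    """
    if resolution <= 0:
        raise ValueError("resolution must be a positive number of base pairs")
    return coordinate - resolution, coordinate


# Hypothetical example: a start prediction of 1,050,000 at 5 kb resolution.
start, end = bin_interval(1_050_000, 5_000)
print(start, end)
```

The same helper applies to the end prediction value, so the predicted region spans from the start of the first bin to the end of the last one.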
The Rearr #n Start and Rearr #n end columns describe the rearrangement coordinates.
For each sample there are several columns for rearrangement coordinates if multiple simultaneous mutations have been found in the region.
The rearrangement type can be found in the Rearrangement Type column.
Also pay attention to the sample cell type, genomic assembly used, and available Hi-C map resolutions (5, 10, 20, 25, or 50 kb).
Hi-C maps for wild-type and mutated conditions are available in the most commonly used formats: hic, cool (for 5, 10, 20, 25, or 50 kb resolution), and pairs.
Also, for most datasets we provide supplementary tracks describing CTCF binding.
All the data can be downloaded via hyperlinks in the table.
If you need all available Hi-C data for one particular sample, please follow the links in the WT Archived Data or MUT Archived Data columns.
You can also explore the dataset folder on our local FTP storage using the hyperlinks in the WT FTP Folder or MUT FTP Folder columns.
The detailed description of files can be found here.
If you need to download the entire Hi-C dataset, use the command:
wget -r -np https://genedev.bionet.nsc.ru/ftp/by_Project/INC_COST_3DBenchmark/hic_dataset_zipped/
You can also download CTCF data in narrowPeak format using the links in the CTCF Data column.
These files have 2 additional columns with information about CTCF binding site orientation calculated using GimmeMotifs.
If you want to download the entire CTCF dataset, use the command:
wget -r -np https://genedev.bionet.nsc.ru/ftp/by_Project/INC_COST_3DBenchmark/CTCF_data/
Step 2. Predict Hi-C Contacts or Insulation Score Data
Use your computational model to predict Hi-C contacts or insulation score data for one of the reference samples.
Hi-C Contacts Input
The predicted list of contacts should be provided as a tab-separated values (TSV) file which contains the following columns:
chr contact_start contact_end contact_count
Here contact_start corresponds to the interval from (contact_start - resolution) to contact_start; the same applies to contact_end.
An example file can be downloaded here.
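As a rough illustration of the contacts format, the following Python sketch serializes predicted contacts into the four tab-separated columns listed above. The function name and the sample values are hypothetical. Whether the server expects a header row is not stated here, so this sketch omits it as an assumption; check the example file before uploading.

```python
def format_contacts_tsv(contacts):
    """Serialize (chr, contact_start, contact_end, contact_count) rows as TSV text.

    Assumption: no header row is written. Coordinates are bin labels that
    follow the (coordinate - resolution) to coordinate convention.
    """
    lines = []
    for chrom, contact_start, contact_end, contact_count in contacts:
        lines.append(f"{chrom}\t{int(contact_start)}\t{int(contact_end)}\t{contact_count}")
    return "\n".join(lines) + "\n"


# Hypothetical predictions at 5 kb resolution.
predicted = [
    ("chr1", 1050000, 1055000, 12.5),
    ("chr1", 1050000, 1060000, 7.25),
]
tsv_text = format_contacts_tsv(predicted)
print(tsv_text, end="")
```

The returned text can be written to a file of your choice (for example with `open(path, "w")`) before upload.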
Insulation Scores Input
The predicted insulation score track should be provided as a BedGraph file without header (technically a BedGraph-like TSV file). Columns are the following:
chrom chrom_start chrom_end insulation_score
For the Paired benchmark, two predicted tracks should be provided: one for the WT sample and one for the mutated sample.
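Before uploading, it can help to check that a track matches the headerless four-column layout described above. The following is a minimal Python sketch of such a check; the function name is hypothetical, and the checks only reflect the format as stated here, not the server's own validation.

```python
def parse_insulation_track(text):
    """Parse BedGraph-like text into (chrom, chrom_start, chrom_end, insulation_score) tuples.

    Raises ValueError if a line does not have four tab-separated columns or
    if a coordinate or score is not numeric (which also catches a header line).
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 tab-separated columns, got {len(fields)}"
            )
        chrom, chrom_start, chrom_end, score = fields
        try:
            rows.append((chrom, int(chrom_start), int(chrom_end), float(score)))
        except ValueError as error:
            raise ValueError(
                f"line {line_number}: non-numeric value (is this a header line?)"
            ) from error
    return rows


# Hypothetical two-bin track at 5 kb resolution.
example_track = "chr1\t1000000\t1005000\t-0.31\nchr1\t1005000\t1010000\t0.12\n"
print(parse_insulation_track(example_track))
```

For the Paired benchmark, the same check would be run on both the WT and the mutated track.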
Step 3. Upload Predicted Data
The data can be uploaded here. The uploaded files will be available in dropdown list here (see next Step).
If you have too many files to upload through the web form, you can upload your data via SFTP using any SFTP-capable client, such as FileZilla or WinSCP, with the following settings:
- Protocol: SFTP
- Host name: gate1.cytogen.ru
- Port number: 8046
- Username: sftp_user
- Password: 3DGenBench
Step 4. Provide Sample Metadata & Compute Metrics
Once the data is uploaded, go here, choose the type of prediction (Single, Paired, or insulation score-only for both types), then fill in the form according to its labels. You can use the button to load an example of a predicted contacts file. Alternatively, example samples can be loaded as shown in the figure below.
The page allows you to submit predictions for several samples using the button.
Step 5. Explore Metrics
The status of the submission is available at the link in the success message, or here by ID. A cyan status indicates that your job is queued, orange means that your job is running on the server, green shows that the job completed successfully, and red indicates that there was an error. If your job has failed, you can read the job logs (at the bottom of the job page) or contact us.
If the job ends successfully, you will see the metrics page. These metrics describe the prediction accuracy of your model (see the section below). You can see an example of computed metrics by checking any ID with green status or by submitting a test unit as described above.
What Do Output Metrics Mean?
The following metrics reflect how well the model predicts experimental Hi-C data:
- Spearman’s correlation between experimental and predicted Hi-C matrices
- SCC (stratum-adjusted correlation coefficient) from Yang et al. (2017) between experimental and predicted Hi-C matrices, computed with hicreppy with the max_dist parameter set to 1500000
- Spearman’s correlation of insulation score at each bin (computed using Cooltools calculate_insulation_score)
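To make the first metric concrete, here is a minimal pure-Python sketch of Spearman's correlation between two contact matrices. It is not the server's implementation. The function names are hypothetical, and comparing the upper triangle including the diagonal is an assumption about which matrix entries are used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    position = 0
    while position < len(order):
        tie_end = position
        while (tie_end + 1 < len(order)
               and values[order[tie_end + 1]] == values[order[position]]):
            tie_end += 1
        mean_rank = (position + tie_end) / 2 + 1
        for k in range(position, tie_end + 1):
            ranks[order[k]] = mean_rank
        position = tie_end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for constant input
    return covariance / math.sqrt(var_x * var_y)


def upper_triangle(matrix):
    """Flatten the upper triangle (diagonal included) of a square matrix."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]


# Hypothetical 3x3 maps; the prediction preserves the ordering of contacts.
experimental = [[10, 5, 1], [5, 10, 4], [1, 4, 10]]
predicted = [[20, 9, 3], [9, 20, 8], [3, 8, 20]]
rho = spearman(upper_triangle(experimental), upper_triangle(predicted))
print(rho)
```

Because Spearman's correlation depends only on ranks, a prediction that preserves the ordering of contact counts scores highly even if its absolute scale differs from the experiment.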
The following metrics reflect how well the model captures differences in 3D genome architecture caused by the rearrangement:
- Ectopic interactions computed as in Simona Bianco et al. (2018). Briefly, we subtract the WT Hi-C map from the MUT Hi-C map, distance-normalize the result, and identify values that lie more than 3 standard deviations from the mean of the distribution of observed differences. Those outliers are designated as ectopic interactions.
To quantify the overlap of ectopic interactions, we plot Precision-Recall (PR) curves, report the Area Under the Curve (AUC), and show the overlap of predicted and experimentally measured ectopic interactions compared to randomized controls.
- Changes in insulation score.
To calculate the ectopic insulation score, we compute the insulation score (using Cooltools calculate_insulation_score) at each bin for the WT and MUT conditions and divide one track by the other element-wise.
That gives us fold changes of the insulation score for each locus (bin).
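The two rearrangement metrics above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the server's code: the function names are hypothetical, distance normalization is approximated by dividing each difference by the mean WT contact count at the same genomic distance, and the fold change is taken as MUT divided by WT (the text does not state the direction).

```python
import math
import statistics


def ectopic_interactions(wt, mut, threshold=3.0):
    """Return upper-triangle cells whose normalized MUT - WT difference is an outlier.

    Assumption: each difference is divided by the mean WT contact count at
    the same distance from the diagonal. Outliers are cells further than
    `threshold` standard deviations from the mean of all normalized differences.
    """
    size = len(wt)
    normalized = {}
    for distance in range(size):
        cells = [(i, i + distance) for i in range(size - distance)]
        expected = statistics.mean(wt[i][j] for i, j in cells)
        if expected == 0:
            continue  # no signal at this distance, nothing to normalize by
        for i, j in cells:
            normalized[(i, j)] = (mut[i][j] - wt[i][j]) / expected
    values = list(normalized.values())
    center = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return sorted(cell for cell, value in normalized.items()
                  if abs(value - center) > threshold * spread)


def insulation_fold_change(wt_scores, mut_scores):
    """Element-wise MUT / WT ratio per bin (direction is an assumption)."""
    return [mut / wt if wt != 0 else math.nan
            for wt, mut in zip(wt_scores, mut_scores)]


# Hypothetical 6x6 maps: MUT equals WT except for one gained contact.
wt_map = [[10.0 / (1 + abs(i - j)) for j in range(6)] for i in range(6)]
mut_map = [row[:] for row in wt_map]
mut_map[1][4] += 5.0
print(ectopic_interactions(wt_map, mut_map))
print(insulation_fold_change([1.0, 2.0], [2.0, 1.0]))
```

In the benchmark, the set of ectopic cells derived from predicted maps would then be compared with the set derived from experimental maps.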
Two additional metrics are used to compare predicted and experimental Hi-C contacts with regard to genomic region datasets (Single):
- Spearman’s correlation between experimental and predicted decay of contact frequency with genomic distance P(s).
- Spearman’s correlation between experimental and predicted compartment strength computed as in Martin Falk et al. (2019).
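The P(s) curve mentioned above can be approximated as the mean contact count at each genomic distance. The following Python sketch shows that computation for a dense square matrix. The function name is hypothetical, and averaging whole diagonals without any binning or masking is a simplifying assumption.

```python
def contact_decay(matrix, resolution):
    """Return (genomic_distance, mean_contact_count) pairs for a square matrix.

    Each off-diagonal at offset d corresponds to a distance of d * resolution.
    The main diagonal (distance 0) is skipped.
    """
    size = len(matrix)
    curve = []
    for offset in range(1, size):
        values = [matrix[i][i + offset] for i in range(size - offset)]
        curve.append((offset * resolution, sum(values) / len(values)))
    return curve


# Hypothetical 3x3 map at 5 kb resolution.
example_map = [[10, 4, 1], [4, 10, 2], [1, 2, 10]]
decay = contact_decay(example_map, 5000)
print(decay)
```

Computing this curve for both the experimental and the predicted map yields two series of equal length, which can then be compared with Spearman's correlation.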