SEQdata-BEACON is a comprehensive database of sequencing performance and statistical tools for performance evaluation and yield simulation in BGISEQ-500. We chose 60 BGISEQ-500 sequencers in the BGI-Wuhan lab and collected all available files generated following chemical reactions and basecalling. Metrics were selected from these files according to three criteria: 1) metrics of great concern that are closely related to the sequencing process; 2) metrics covering information on the sequencing type, the optical path state of the machine, etc.; 3) metrics carrying traceable information for troubleshooting. Based on these criteria, we used the flow cell identifier (FC) as an index to extract 65 metrics from the data resources and create the database ‘SEQdata-BEACON’. Each entry was assigned a unique identification number (ID). We accumulated lanes in paired-end 100 (PE100) sequencing, since it is the major sequencing type in BGISEQ-500. To protect the privacy of customers, no detailed sample information was entered into the database. The database was constructed on a MySQL server (version 8.0).
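The layout described above (one entry per lane, a unique ID, and the flow cell identifier as the lookup index) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for the MySQL server, and the table name `lane_metrics`, the column names, and the sample values are all hypothetical, not the actual SEQdata-BEACON schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per lane, a unique ID, and two of the metrics.
conn.execute(
    """
    CREATE TABLE lane_metrics (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique identification number
        fc    TEXT NOT NULL,                      -- flow cell identifier
        lane  TEXT NOT NULL,
        q30   REAL,
        reads REAL
    )
    """
)
# Non-unique index on FC (assumption: one flow cell may hold several lanes).
conn.execute("CREATE INDEX idx_lane_metrics_fc ON lane_metrics (fc)")

# Made-up example rows; no sample information is stored.
rows = [
    ("FC_EXAMPLE_1", "L01", 90.5, 455.0),
    ("FC_EXAMPLE_1", "L02", 88.0, 410.0),
    ("FC_EXAMPLE_2", "L01", 91.2, 470.0),
]
conn.executemany(
    "INSERT INTO lane_metrics (fc, lane, q30, reads) VALUES (?, ?, ?, ?)", rows
)
conn.commit()


def lanes_for_flow_cell(connection, fc):
    """Return (id, lane, q30, reads) tuples for one flow cell, ordered by ID."""
    cursor = connection.execute(
        "SELECT id, lane, q30, reads FROM lane_metrics WHERE fc = ? ORDER BY id",
        (fc,),
    )
    return cursor.fetchall()
```

Calling `lanes_for_flow_cell(conn, "FC_EXAMPLE_1")` returns the two lanes recorded for that hypothetical flow cell.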
The database primarily contained 65 metrics covering sample, yield, quality, machine status and supplies. Furthermore, we explored our database to describe the statistical features of the metrics and to construct a yield simulation model based on yield-related metrics. To provide open access to our data, we designed a comprehensive website ‘SEQdata-BEACON’ with Home, Browse, Tools, Download and Guide pages to display the database and data-mining applications. The Google Chrome web browser (version 68.0.3440.106) is recommended for accessing the website.
We expect SEQdata-BEACON to be a comprehensive platform: with data accumulation, it can demonstrate the actual performance of the sequencing platforms; by developing more data-mining applications, it can enrich functional tools such as QC metrics models and metrics standards; by presenting data and statistical results on the website, it can also give users useful optimization and troubleshooting suggestions to solve their problems.
The ‘Browse’ page allows users to look through the numerical metric features. The interface gives a preliminary view of the overall features of the metrics through a heatmap of Pearson’s correlation coefficients and a scatterplot of Q30 versus Reads. Users can also choose a metric of interest from the drop-down menu to display the corresponding distribution chart at the bottom of the webpage. For example, the distribution of FIT and its changes per cycle are both illustrated in charts, so that distribution patterns and fluctuations can be observed.
A detailed description of the 65 metrics is also listed.
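The values behind such a heatmap are pairwise Pearson's correlation coefficients between metric columns. A minimal sketch of that computation is given here, assuming each metric is available as a list of per-lane values; the function names and the sample numbers are hypothetical and are not taken from the database.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(metrics):
    """Map each pair of metric names to its Pearson coefficient."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}


# Made-up per-lane values, for illustration only.
lanes = {
    "Q30": [88.0, 90.5, 91.2, 86.4, 92.1],
    "Reads": [410.0, 455.0, 470.0, 380.0, 490.0],
}
matrix = correlation_matrix(lanes)
```

Each cell of the heatmap would then correspond to one entry `matrix[a][b]`, with the diagonal equal to 1 by construction.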
We established a linear regression model to simulate the yield from seven input metrics. Users who want to predict their yield can enter specific metric values on our website in the example format and click ‘Start’; the confidence interval of the expected yield will then be shown.
The 95% confidence interval is indicated in grey on the normal distribution.
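To make the idea concrete, the sketch below fits an ordinary least-squares model and reports a 95% interval around a predicted yield under a normal approximation. It is an illustration only, not the model used on the website: the seven input metrics are not named here, so the example uses two hypothetical predictors with made-up values, and the interval is simplified in that it uses only the residual standard deviation and ignores uncertainty in the fitted coefficients.

```python
from statistics import NormalDist


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows, y):
    """Least-squares fit with intercept; returns (coefficients, residual SD)."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    dof = len(y) - p
    if dof <= 0:
        raise ValueError("need more observations than coefficients")
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(p)]
    coefs = solve(xtx, xty)
    resid = [t - sum(c * v for c, v in zip(coefs, r)) for r, t in zip(design, y)]
    sd = (sum(e * e for e in resid) / dof) ** 0.5
    return coefs, sd


def predict_interval(coefs, sd, row, level=0.95):
    """Return (lower, predicted, upper) using a normal approximation."""
    mean = coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean, mean + z * sd


# Made-up training data: two hypothetical predictors per lane and a yield value.
train_rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
train_yield = [9.2, 7.9, 19.1, 17.8, 26.0]
coefs, sd = fit_linear(train_rows, train_yield)
lower, predicted, upper = predict_interval(coefs, sd, (3.0, 3.0))
```

With seven predictors the same functions apply unchanged, provided there are more lanes than coefficients and the predictors are not collinear.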
Example format: enter the value of each parameter in every cycle, with values separated by a hashtag; all parameters must be given for the same cycles.
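An input in this format could be checked before submission along the following lines. This is a sketch under assumptions: it reads the format as one hashtag-separated string of per-cycle values for each parameter, it treats "the same cycle" as meaning every parameter has the same number of values, and the parameter names used are hypothetical.

```python
def parse_cycle_values(text):
    """Split a hashtag-separated string into a list of per-cycle floats."""
    values = []
    for cycle, part in enumerate(text.strip().split("#"), start=1):
        part = part.strip()
        if not part:
            raise ValueError(f"empty value at cycle {cycle}")
        values.append(float(part))
    return values


def parse_parameters(raw):
    """Parse every parameter and check that all cover the same number of cycles."""
    parsed = {name: parse_cycle_values(text) for name, text in raw.items()}
    lengths = {len(values) for values in parsed.values()}
    if len(lengths) > 1:
        raise ValueError("all parameters must have values for the same cycles")
    return parsed


# Hypothetical parameter names and made-up values for three cycles.
example = parse_parameters({
    "param_a": "0.71#0.70#0.69",
    "param_b": "12.5#12.1#11.8",
})
```

A string with a missing value, such as `"0.71##0.69"`, or two parameters with different numbers of cycles, raises a `ValueError` instead of being silently accepted.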
The ‘Download’ page allows users to obtain the data in Excel format as of each update; all the data and analysis results will be updated every two months.