Sensitivity of detecting reads with barcodes depending on length of barcoded reference primer. Sensitivity was calculated for two different precision levels over barcoded reference primer lengths from 7 nt (just the reference barcode) to 27 nt (7 nt reference barcode plus 20 nt of reference primer appended). The staircase effect occurs due to discrete threshold steps and fixed precision levels.

Detecting barcoded reads is only the first step in the demultiplexing protocol. In the next step, barcodes are decoded i. Without a threshold, approximately

Precision and sensitivity of assigning reads to samples. Tests were conducted with the [7,3] barcode set (solid points) and the set of barcoded nt-long PCR primers (empty triangles). (A) Precision of assigning reads to samples for different thresholds. (B) Sensitivity of assigning reads to samples for different thresholds. (C) Precision and sensitivity plotted against each other.

For the variant analysis of the experimental data, we decided on a threshold based on the set of barcoded nt-long PCR primers that balanced high sensitivity with high precision. We took into consideration that insufficient precision could have led to false variant calls, and insufficient sensitivity could have led to no variant calls at all.
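The trade-off described above can be illustrated with a small sketch. It assumes that each simulated read carries a distance to its best-matching barcoded primer and a known label (barcoded or orphaned), and that a read is called barcoded when its distance does not exceed the threshold. The function names and the selection rule (highest sensitivity subject to a minimum precision) are illustrative assumptions, not the authors' implementation.

```python
def precision_sensitivity(distances, is_barcoded, threshold):
    """Precision and sensitivity when reads with distance <= threshold are called barcoded."""
    pairs = list(zip(distances, is_barcoded))
    tp = sum(1 for d, b in pairs if d <= threshold and b)
    fp = sum(1 for d, b in pairs if d <= threshold and not b)
    fn = sum(1 for d, b in pairs if d > threshold and b)
    precision = tp / (tp + fp) if tp + fp else 1.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


def pick_threshold(distances, is_barcoded, min_precision):
    """Hypothetical rule: the threshold with the highest sensitivity whose precision is acceptable."""
    best = None
    for t in range(max(distances) + 1):  # thresholds are discrete steps
        p, s = precision_sensitivity(distances, is_barcoded, t)
        if p >= min_precision and (best is None or s > best[2]):
            best = (t, p, s)
    return best  # (threshold, precision, sensitivity) or None


# Toy data (made up for illustration): distances and true labels of eight simulated reads.
toy_distances = [0, 1, 1, 2, 3, 4, 5, 6]
toy_labels = [True, True, True, True, False, True, False, False]
strict = pick_threshold(toy_distances, toy_labels, min_precision=0.9)
relaxed = pick_threshold(toy_distances, toy_labels, min_precision=0.8)
```

Because the thresholds are integers, precision and sensitivity change in steps rather than smoothly, which is one way to picture the staircase effect mentioned in the figure caption.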

Importantly, the median length of reads without any barcode was 1, nt, compared to a median length of 2, nt for those reads with at least one barcode at either end (see Figure S1 in Additional file 3). This supports the hypothesis that the former had genuinely no barcode at either end of the read. The median number of reads assigned per sample was 2,

A search for Atp1a1 sequence variants in the experimental data helped us to examine whether our method was applicable to the real experimental design and how well it performed at avoiding cross-contamination of samples or a reduction in the number of usable sequence reads per sample.

The SNV having two base changes one nucleotide apart, observed in two clones, is consistent with non-tandem double mutations occasionally caused by polymerase errors at and near a single DNA damage site after ultraviolet light. Each variant call was supported by a large number of high-quality aligned reads, with coverage ranging from to median The quality of variant calls was consistently high, with all Phred quality scores reaching Still, a close examination of aligned reads assigned with different thresholds using genome viewer IGV showed signs of cross-contamination with reads that had a differing SNV.

A screenshot is available in Additional file 5. Multiplexed deep sequencing technologies are popular among researchers due to high information output and steadily decreasing processing time and costs. In multiplexing experiments, proper design of the barcodes is highly important.

Careful consideration must be given to their physico-chemical and biological properties as well as to their error-correction capabilities. Instead, additional information is needed such as the position of the barcode or adjacent primer sequences. Available deep sequencing platforms differ in their approaches to this problem. However, this approach is not completely error-free. In addition, using positional information is not always possible either, since that technique is restricted to specific platforms and applications.

Although this approach looks intuitively obvious, it is not clear what can be taken as the optimal solution for the choice of barcodes, primers, and detection algorithms. Additionally, sequencing errors add more noise to the data, which in turn requires proper thresholding for correct sequence assignment. Our presented solution is built on the idea of controlling the tail area-based false discovery rate (Fdr), and offers researchers a versatile tool to find an optimal threshold for detecting barcoded sequences.
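As a rough illustration of a tail area-based Fdr, the sketch below uses the standard estimate Fdr(t) = eta0 * F0(t) / F(t), where F0 is the cumulative distribution of the test statistic among null cases (here: simulated orphaned reads), F is the cumulative distribution among all observed reads, and eta0 is the assumed fraction of null cases. Using a distance as the test statistic, the function names, and the toy numbers are all assumptions made for this example; the exact estimator used in the work may differ.

```python
def tail_area_fdr(null_distances, observed_distances, null_fraction, threshold):
    """Estimate Fdr(t) = eta0 * F0(t) / F(t) for calls with distance <= threshold."""
    f0 = sum(d <= threshold for d in null_distances) / len(null_distances)
    f = sum(d <= threshold for d in observed_distances) / len(observed_distances)
    if f == 0:
        return 0.0  # nothing is called at this threshold, so there are no false discoveries
    return min(1.0, null_fraction * f0 / f)


def largest_threshold_within(null_distances, observed_distances, null_fraction, max_fdr):
    """Hypothetical helper: the most permissive threshold whose estimated Fdr stays within max_fdr."""
    chosen = None
    for t in sorted(set(observed_distances)):
        if tail_area_fdr(null_distances, observed_distances, null_fraction, t) <= max_fdr:
            chosen = t
    return chosen


# Toy data (made up): distances of simulated orphaned reads and of observed reads.
toy_null = [3, 4, 4, 5, 5, 5, 6, 6, 7, 7]
toy_observed = [0, 0, 1, 1, 1, 2, 4, 5, 6, 7]
conservative = largest_threshold_within(toy_null, toy_observed, 0.4, max_fdr=0.05)
permissive = largest_threshold_within(toy_null, toy_observed, 0.4, max_fdr=0.2)
```

A stricter Fdr target yields a smaller distance threshold and therefore fewer reads called as barcoded, which is the precision-sensitivity trade-off in another form.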

Additionally, it gives researchers a reliable impression of the quality of their threshold decision and the trade-off between precision and sensitivity, as well as facilitating further conclusions on the validity of the demultiplexing processing step. The method is generally usable for this particular problem, yet it needs to be adapted to the specific technology and circumstances. The part of the method that needs to be adapted is the simulation of reads. Read simulation algorithms and analyses of read properties of common Next Generation Sequencing technologies can be found in the literature [35-37].

The approach of controlling the false discovery rate for a discrete test is new and still in an experimental stage. Nonetheless, recent developments in the field of Fdr controlling procedures give the impression that exploiting the discreteness of the data increases reliability and sensitivity [38]. In this work, we focused on the specific advantages and issues of the PacBio SMRT platform, a next generation technology specialized in sequencing single large molecules [13].

Our protocol preferred sequencing primers attached to both ends of the DNA target. In reality, for several reasons, actual reads rarely appear in the expected form. One of the two barcodes is frequently missing. Technologically, with PacBio SMRT, the extension of the sequence by the immobilized polymerase and the reading may not be well synchronized.

If the polymerase has been too fast or the deliberate time delay too long, the start of the insert could have been missed together with the barcode and the PCR primer. In some cases, the polymerase does not continue the reaction all the way to the end of the sequence. This means that the reverse complemented barcode at the end of the sequence may be missing as well [14, 15].

Having calculated similarities between barcodes or barcoded primers and the Mus musculus reference genome database, we see that longer barcode sequences generally show fewer randomly occurring similarities.

This advantage is diminished by the number of barcodes used in the experiment: more barcodes increase the likelihood of coincidental similarities. The solution to this problem is to use longer barcodes or to concatenate barcodes with adjacent primer sequences. Here we demonstrate the major dilemma of the optimality of the barcode design and identification. On the one hand, barcode sequences should be short and distinct to minimize different kinds of sequencing errors. On the other hand, a short barcode sequence is not unique in a genomic context and requires additional information for correct identification.
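The effect of barcode length and barcode count on coincidental similarities can be made concrete with a back-of-the-envelope sketch. It assumes a deliberately simplified model in which bases are independent and uniformly distributed, counts exact matches only, and considers a single strand; real genomes are not uniform, and the analysis above also tolerates errors, so the true number of coincidental similarities is higher. The function name and parameters are illustrative.

```python
def expected_exact_hits(sequence_length, word_length, n_words=1):
    """Expected number of exact occurrences of n_words distinct words of a given
    length in a random sequence (uniform, independent bases; one strand only)."""
    positions = max(sequence_length - word_length + 1, 0)
    return n_words * positions / 4 ** word_length


# In a stretch with 16,384 start positions, a single 7-nt word is expected once ...
one_barcode = expected_exact_hits(16384 + 6, 7)
# ... and a set of 20 such words about 20 times.
twenty_barcodes = expected_exact_hits(16384 + 6, 7, n_words=20)
# Extending the word to 27 nt (barcode plus 20 nt of primer) makes chance hits far rarer.
extended = expected_exact_hits(16384 + 6, 27)
```

Under this model each additional nucleotide divides the expected number of chance hits by four, while each additional barcode adds linearly to it, which is consistent with the dilemma stated above.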

For example, the barcode sequence itself can be extended by adding an adjacent primer sequence. This minimizes the false discovery rate due to decreased risk of coincidental similarities. In this work, we found that using additional information from the PCR primer sequence improved barcode recovery tremendously. In future work, the experiment should be designed to handle the case where no such information is available.

Firstly, adding an identical artificial sequence (a so-called stop-word) to each barcode sequence solves the problem presented by redundancy of the words in big genomes. The best choice of stop-word is based on its dissimilarity to the targeted genome or insert.

Secondly, sets of longer barcodes with error-correction capabilities beyond one error can be generated, which are beneficial to the overall statistics of the true barcode recovery. The Fdr has to be calculated once per experimental data set, which includes the simulation of reads and matching them to the experimental data. The computational complexity of the method grows quadratically with the length of the barcodes or barcoded primers used.

We found that longer barcoded primers increase sensitivity compared to shorter barcoded primers, while computational time was moderate in all cases. Additionally, we found that the increase in sensitivity plateaued for very long barcoded primers. The statistical approach described here provides a solid method for finding an optimal threshold to separate barcoded and orphaned reads in real sequencing data sets. In addition to our main theme, the sample assignment of the genetic material was sufficiently precise and sensitive to generate a large number of high-quality and well-aligned reads.

The structure of the results indicated very low cross-contamination of insert read assignments caused by incorrect barcode calls and high-quality calls due to the large number of aligned reads at the respective SNV position. PacBio offers their own method for the detection of barcodes in circular consensus reads (CCS) as part of their Quiver analysis software [39].

Our method can be considered an alternative approach to the same problem. It also offers additional benefits, such as statistical insight into the reliability of the decision in the context of hundreds of thousands of reads, as well as the systematic discovery of a suitable threshold. We presented a method for enhancing the detection of barcoded reads that can be adapted to different sequencing technologies and protocols.

The method is based on false discovery rate statistics that were designed to assess the likelihood of true positives in an ocean of coincidental positives. Based on the precision-sensitivity estimates derived with our method, individual users can decide on a proper cutoff or threshold to detect sequence reads as being barcoded. Users can quantify the quality of the assignment of reads to samples.

Additionally, they can select their particular trade-off between precision and sensitivity, thereby increasing the confidence in the results even in highly error-prone situations. Depending on the outcome, performance of the method can be further improved by using longer barcodes with higher error-correcting properties, or by elongating the barcode with adjacent adapter or PCR primer sequences during computational detection to increase sensitivity.

Special acknowledgments are made to Verena Zuber for reading and correcting the statistical elements of the manuscript. We wish to thank Elizabeth Kelly for helping with proofreading. We thank Dr. James Trosko, Michigan State University, for advice on the ouabain resistance assay. The University of Dresden has been granted a patent on the Sequence-Levenshtein technology used in this work, for which TB is registered as inventor. The ID is DE 10 All authors read and approved the final manuscript.

Authors: Tilo Buschmann, Rong Zhang, Douglas E Brash, Leonid V Bystrykh.


Results: In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. Conclusion: Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. BMC Bioinformatics. Published online Aug 7.

Received Dec 13; Accepted Jul. This article is published under license to BioMed Central Ltd.

Additional file 1: Dynamic algorithm of Sequence-Levenshtein distance. A fast algorithm to calculate the Sequence-Levenshtein distance between sequences A and B. (PDF, 77 KB)
Additional file 2: Supplement. (PDF)
Additional file 3: Distribution of read lengths. The figure depicts the distribution of read lengths, grouped according to their status as being barcoded at neither, one, or both ends. (PDF, 13 KB)
Additional file 4: Variant calls. This archive contains the variant calls in bcf file format as exported by samtools. (ZIP, 6 MB)
Additional file 5: Evidence of cross-contamination. This screenshot from the genome viewer IGV shows signs of cross-contamination in the aligned reads when a small threshold, a middle threshold, and a very high threshold were used. The screenshot shows variants at position , which is an SNV that was reliably found in other samples.

Barcode preparation. The Sequence-Levenshtein distance between two DNA sequences A and B is the minimal number of insertions, deletions, and substitutions necessary to transform one sequence into any prefix of the other or vice versa.

Simulation of barcoded and orphaned PacBio reads. We begin with a set of experimental reads $S_{\mathrm{emp}}$ which we want to simulate for further analysis c.
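The definition of the Sequence-Levenshtein distance given above translates directly into a dynamic programming sketch: fill the classic Levenshtein table, then take the minimum over its last row (A against every prefix of B) and its last column (every prefix of A against B). This is written from the definition only and is not necessarily the fast algorithm of Additional file 1; the function name is an assumption.

```python
def sequence_levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions needed to turn
    one sequence into any prefix of the other, or vice versa."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] is the classic Levenshtein distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    best_prefix_of_b = min(dist[len(a)])                 # a versus every prefix of b
    best_prefix_of_a = min(row[len(b)] for row in dist)  # every prefix of a versus b
    return min(best_prefix_of_b, best_prefix_of_a)
```

For instance, a barcode `ACGT` followed by arbitrary insert sequence, as in `ACGTTTGA`, has distance 0 to the barcode, and a read that lost the first base of the barcode (`CGTAAA`) has distance 1. The table has one cell per pair of positions, so the running time grows with the product of the two sequence lengths.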

Frequency of test statistic in simulated data. The frequency distribution of such a set $S_{\mathrm{sim}}$ was the sum of the frequency distributions of both sets and :

Fitting simulated read sets to empirical read sets. In the next step, we fitted one set of simulated reads $S_{\mathrm{sim}}$ to the set of empirical reads $S_{\mathrm{emp}}$.

Experimental validation. To validate the Fdr approach, we asked whether we could successfully identify single-nucleotide variants (SNVs) within the genomic portion of samples that were sequenced in multiplexed fashion.
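One way to picture the mixing and fitting steps is the sketch below. It assumes that the simulated set is a mixture of barcoded and orphaned reads, that the test statistic is an integer distance, and that the mixing fraction is chosen by a least-squares grid search against the empirical frequency distribution. The fitting criterion, the grid search, and all names are assumptions made for illustration; the source text does not specify them.

```python
from collections import Counter


def relative_frequencies(distances, max_distance):
    """Relative frequency of each integer distance from 0 to max_distance."""
    counts = Counter(distances)
    total = len(distances)
    return [counts.get(d, 0) / total for d in range(max_distance + 1)]


def fit_barcoded_fraction(barcoded_sim, orphaned_sim, empirical, steps=100):
    """Grid search for the barcoded fraction w that minimizes the squared difference
    between w * barcoded + (1 - w) * orphaned and the empirical distribution."""
    max_d = max(max(barcoded_sim), max(orphaned_sim), max(empirical))
    freq_barcoded = relative_frequencies(barcoded_sim, max_d)
    freq_orphaned = relative_frequencies(orphaned_sim, max_d)
    freq_empirical = relative_frequencies(empirical, max_d)
    best_w, best_error = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        error = sum(
            (w * fb + (1 - w) * fo - fe) ** 2
            for fb, fo, fe in zip(freq_barcoded, freq_orphaned, freq_empirical)
        )
        if error < best_error:
            best_w, best_error = w, error
    return best_w


# Toy data (made up): the empirical set is built as 75% barcoded-like, 25% orphan-like reads.
toy_barcoded = [0, 0, 1, 1]
toy_orphaned = [4, 5, 5, 6]
toy_empirical = toy_barcoded * 3 + toy_orphaned
fitted_fraction = fit_barcoded_fraction(toy_barcoded, toy_orphaned, toy_empirical)
```

In such a setup, one minus the fitted fraction could serve as the null fraction when estimating a false discovery rate.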

Variant calling. The reads of the 20 samples were stripped of their barcodes and then aligned to the Mus musculus reference mRNA using the software package bwa-mem version 0.

Coincidental barcode similarities in the reference Mus musculus DNA database. All our experimental and simulation barcode sets were designed to correct one insertion, deletion, or substitution error.

Coincidental and genuine barcode similarities in Atp1a1 sequencing data. In the experimental data, the expected complete size of the Atp1a1 insert was 3, bp including 7-nt-long barcodes at both ends.

Coincidental and real similarities in experimental Atp1a1 reads. In the experimentally obtained Atp1a1 sequence reads, at least a certain percentage of reads must have actually started with a barcode.

Universal DNA tag systems: a combinatorial design scheme.

J Comput Biol. ChemInform Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res. A pyrosequencing-tailored nucleotide barcode design unveils opportunities for large-scale sample multiplexing. Barcodes for DNA sequencing with guaranteed error-correction capability. Electron Lett. Generalized DNA barcode design based on hamming codes.

Buschmann T, Bystrykh L. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. Efficient computation of absent words in genomic sequences. Genomic DNA k-mer spectra: models and modalities. Genome Biol. Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harbor Protoc. Target-enrichment strategies for next-generation sequencing.

Nat Meth. Kircher M, Kelso J. High-throughput DNA sequencing — concepts and limitations. SMRT Technology. A tale of three next generation sequencing platforms comparison of ion torrent, pacific biosciences and Illumina MiSeq sequencers. BMC Genomics.

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. Ser B Methodol ; 57 — Efron B. Local False Discovery Rates. Stanford University: Division of Biostatistics; Storey JD. A direct approach to false discovery rates. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica. Identifying differentially expressed genes using false discovery rate controlling procedures.

Controlling false discoveries in genetic studies. Strimmer K. A unified approach to false discovery rate estimation. Lexicographic codes: Error-correcting codes from game theory. Piscataway: IEEE; Greedy closure evolutionary algorithms ; pp. Genome reference consortium mouse build Characterization of ultraviolet light-induced ouabain-resistant mutations in chinese hamster cells.

