SPADA Draft Documents


newly emerging infectious diseases or diseases that have sparked little research interest. In such cases, utilizing only the few full-length genomes would result in “over-fitting,” wherein many regions appear to be conserved, but deeper sequencing would show that many of those regions are not appropriate for primer design. In such a case, it is advantageous to include both full-length and partial or incomplete genomes in the inclusivity dataset. However, as most assay design methods attempt to maximize the number of inclusivity sequences detected with the smallest number of assays, including unmodified partial sequences will force assays to cover the regions that have been sequenced most often, rather than focusing on the regions of the genome that are actually most conserved. While this can be a good strategy when dealing with highly variable genomes for which the strain diversity is better represented by the available amplicon sequences than by the available whole-genome sequences, it is a poor strategy if the available amplicon sequences are generated from a hypervariable region or a region that is perfectly conserved in near neighbors. An alternate strategy is to “fill in” and “extend” missing sequence data by interpolating and extrapolating partial and incomplete sequences (10).

Bacteria also present challenges for many design algorithms since they usually have circular genomes without a defined starting point, and they code for proteins on both strands. As a result, different sequencing labs can publish the genomes with different strands and/or starting points. Thus, it is useful to perform work up front to include the same strand in the inclusivity database for all members of the set.
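One way to picture this up-front normalization is the minimal Python sketch below. It is an illustration under stated assumptions, not a method prescribed by this document: the function names `reverse_complement` and `rotate_to_anchor` are hypothetical, and the approach assumes the user can supply a short anchor sequence that occurs exactly once in every genome of the set (for example, the start of a gene shared by all members). The sketch checks both strands and rotates the circular sequence so that the anchor sits at position 0.

```python
# Hypothetical helper names; the anchor sequence is an assumption supplied by the user.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate_to_anchor(genome, anchor):
    """Put a circular genome on the anchor's strand, starting at the anchor.

    Returns the rotated sequence, or None if the anchor is found on
    neither strand. Assumes the anchor is unique within the genome.
    """
    if not anchor:
        raise ValueError("anchor must be a non-empty sequence")
    for strand in (genome, reverse_complement(genome)):
        # Append the first len(anchor) - 1 bases so that an anchor spanning
        # the arbitrary published start point is still found.
        wrapped = strand + strand[: len(anchor) - 1]
        pos = wrapped.find(anchor)
        if pos != -1:
            return strand[pos:] + strand[:pos]
    return None


# Toy example: the same circular sequence published with a different
# starting point and on the opposite strand.
reference = "ATGCCGTTAGGA"
shifted = reference[5:] + reference[:5]
flipped = reverse_complement(shifted)

print(rotate_to_anchor(shifted, "ATGCCG"))
print(rotate_to_anchor(flipped, "ATGCCG"))
```

Both calls print the same sequence, so the two records can be compared position by position. A real pipeline would also need to handle ambiguous bases, plasmids, and multi-contig assemblies, none of which this sketch covers.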
Bacteria also present challenges due to their genome size, which is roughly 100 to 1000 times larger than that of viruses, thereby placing demands on CPU (central processing unit) and memory resources for signature analysis algorithms (below, we describe efficient k-mer algorithms that are capable of handling bacterial genomes). It is recommended that partial genomes be segregated into a separate database from the inclusivity
