SPADA Meeting Book

4.3.4 Target region selection

Traditionally, the first step in design is to find a region of the pathogen genome that is conserved among variants of a given target. A multiple-sequence-alignment (MSA) algorithm (e.g., CLUSTAL, T-COFFEE, MAFFT, or MUSCLE) is the traditional approach to identify such conserved regions. However, MSA algorithms do not scale well (in terms of CPU and memory) with either large numbers of sequences or long sequence lengths. Even with modern cloud computing resources, computing a large MSA can be intractable. In addition, pathogen DNA and RNA sequences vary significantly in their number of bases, substitutions, insertions, and deletions. When combined with the low complexity of nucleic acids (i.e., only 4 bases for nucleic acids vs. 20 amino acids for proteins), it is particularly difficult to get the high-quality alignments that are required to deduce the desired conserved regions. These limitations make it essentially impossible to apply an MSA to large collections of bacterial genomes or highly variable viral genomes (e.g., LCMV, CCHFV, Lassa virus, HPV, and HRV). Sequence alignments of the final design region, however, are helpful for displaying the variations present and provide a useful reality check after a design region is discovered with a k-mer approach. Thus, we do recommend using an MSA that is restricted to the design region of interest, but not for the entire genome.

A superior approach for determining the optimal design region(s) is to analyze targets using k-mers (i.e., substrings of length k, usually 14-25, depending on the application; the rationale is described in references 8, 9, and 11). Such k-mer algorithms are computationally efficient for large databases and long sequences and can be applied to databases of pathogenic viruses and bacteria.
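
To make the k-mer idea concrete, the following is a minimal Python sketch of one way to score conservation without an alignment. It is an illustration only and not the method of references 8, 9, and 11. The function names (`kmer_presence`, `conservation_profile`, `conserved_windows`), the toy sequences, and the choice of k = 4 are assumptions made for readability; a real design would use k in the 14-25 range and would likely also handle reverse complements and inexact matches, which this sketch does not.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_presence(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    presence = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most 1 per k-mer.
        presence.update(set(kmers(seq.upper(), k)))
    return presence


def conservation_profile(reference, sequences, k):
    """For each start position in the reference, return the fraction of
    sequences that contain the k-mer starting at that position."""
    presence = kmer_presence(sequences, k)
    total = len(sequences)
    return [presence[kmer] / total for kmer in kmers(reference.upper(), k)]


def conserved_windows(profile, k, threshold):
    """Merge runs of consecutive positions scoring >= threshold into
    (start, end) intervals in reference coordinates, end exclusive."""
    windows = []
    start = None
    for i, score in enumerate(profile):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The last passing k-mer started at i - 1 and spans k bases.
            windows.append((start, i - 1 + k))
            start = None
    if start is not None:
        windows.append((start, len(profile) - 1 + k))
    return windows


# Toy example (hypothetical sequences; k = 4 only to keep it readable).
variants = [
    "ACGTACGGTTCA",
    "ACGTACGGATCA",
    "TCGTACGGTTCA",
]
reference = variants[0]
profile = conservation_profile(reference, variants, k=4)
windows = conserved_windows(profile, k=4, threshold=1.0)
for start, end in windows:
    print(start, end, reference[start:end])
```

Each sequence is scanned once, so the work grows roughly linearly with the total number of bases, which is the property that lets k-mer methods reach databases where a whole-genome MSA is impractical. The candidate windows found this way can then be checked with an MSA restricted to the design region.
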
An optimal design region from a pathogen would show high conservation among the variants of the desired target (e.g., clinical isolates of a pathogen) and show a lack of
