Proposed virulence-associated genes of Streptococcus suis isolates from the United States serve as predictors of pathogenicity

Background There is limited information on the distribution of virulence-associated genes (VAGs) in U.S. Streptococcus suis isolates, resulting in little understanding of the pathogenic potential of these isolates. This lack also reduces our understanding of the epidemiology associated with S. suis in the United States and thus affects the efficiency of control and prevention strategies. In this study we applied whole genome sequencing (WGS)-based approaches for the characterization of S. suis and identification of VAGs.
Results Of 208 S. suis isolates classified as pathogenic, possibly opportunistic, and commensal pathotypes, the genotype based on the classical VAGs (epf, mrp, and sly encoding the extracellular protein factor, muramidase-release protein, and suilysin, respectively) was identified in 14% (epf+/mrp+/sly+) of the pathogenic pathotype. Using the chi-square test and LASSO regression model, the VAGs ofs (encoding the serum opacity factor) and srtF (encoding sortase F) were selected out of 71 published VAGs as having a significant association with pathotype, and both genes were found in 95% of the pathogenic pathotype. The ofs+/srtF+ genotype was also present in 74% of ‘pathogenic’ isolates from a separate validation set of isolates. Pan-genome clustering resulted in the differentiation of a group of isolates from five swine production companies into clusters corresponding to clonal complex (CC) and virulence-associated (VA) genotypes. The same CC-VA genotype patterns were identified in multiple production companies, suggesting a lack of association between production company, CC, or VA genotype.
Conclusions The proposed ofs and srtF genes were stronger predictors for differentiating pathogenic and commensal S. suis isolates compared to the classical VAGs in two sets of U.S. isolates. Pan-genome analysis in combination with metadata (serotype, ST/CC, VA genotype) was illustrated to be a valuable subtyping tool to describe the genetic diversity of S. suis.
Supplementary Information The online version contains supplementary material available at 10.1186/s40813-021-00201-6.


Background
The severe clinical presentation associated with Streptococcus suis infection is of increasing concern in the U.S. swine industry. The heterogeneity of S. suis can be described by serotyping and multi-locus sequence typing (MLST), and currently 29 true serotypes (1-19, 21, 23-25, 27-31, and 1/2) and 1551 registered sequence type (ST) profiles (as of December 2020) exist [1][2][3][4][5]. The numerous S. suis serotypes and STs limit our attempts to understand the epidemiology of the disease in an effort to prevent and manage the various clinical manifestations. Further, S. suis has zoonotic potential, and many of the effective antibiotics available for treatment are of high or critically important status per U.S. Food and Drug Administration's Guidance for Industry #152 [6]. Also, serotype variations make it difficult to compare isolates within and across geographically distinct pig populations. The development of effective universal vaccines is hindered by the number of different virulent serotypes and the lack of knowledge of serotype- or ST-specific virulence markers and associated clinical disease.
Historically, systematic characterization of S. suis isolates occurred more extensively in other countries compared to the United States. For instance, serotypes 2, 3, and 1/2 have been well-characterized in Canadian swine populations [7][8][9]. In addition, virulence-associated genes (VAGs) (epf, mrp, and sly) and STs indicative of virulence potential were identified, mostly for serotype 2 [9][10][11][12]. Experimental studies illustrating the virulence potential of Canadian serotype 2 strains (ST1, ST25, and ST28) suggest the virulence potential of ST28 strains is low at best in Canada, but a very different clinical presentation was being observed on U.S. swine farms with ST28 [11,13,14]. In the past 4 to 5 years, S. suis infections on U.S. swine farms appeared to be more persistent and severe [15,16]. However, whether this is a result of new circulating strains, an increase in virulence, or some other cause is not well understood, reinforcing the importance of the characterization of S. suis to monitor changes in strains within a herd.
U.S. swine practitioners utilize herd vaccination strategies as a means of controlling S. suis disease. However, selecting representative isolates and properly timing the administration of vaccines still remain a challenge [17]. As a result of the diversity of S. suis, limited commercial vaccines are available, and many practitioners develop and maintain farm-specific autogenous vaccines. In addition, clear criteria for identifying pathogenic strains that cause primary disease are lacking, making isolate selection for vaccines more difficult [18]. Isolates are commonly selected based on criteria such as serotype and isolation from systemic tissues [19,20]. However, due to the diversity within and between serotypes, cross-protection between, and even within, different serotypes is difficult to attain [21][22][23][24]. Moreover, the presence of virulence markers is critical for selecting isolates for autogenous vaccines. Over 100 putative and confirmed virulence factors and markers (not crucial or critical for virulence) for S. suis have been described in the literature, but few have been verified in experimental models [18,25]. These include Eurasian serotype 2 virulence markers extracellular protein factor (epf gene), muramidase-released protein (mrp gene), and suilysin (sly gene), which have been investigated in STs 1, 25, and 28 strains from North America [11,26,27].
The application of genomic approaches to identify associations between VAGs and disease manifestation can lead to a better understanding of S. suis pathogenesis. However, a comparative genomic study investigating the current distribution of S. suis VAGs in U.S. isolates is lacking. Recently, we reported associations of various pathotypes with subtypes, including serotype and ST, of S. suis [28]. In this current study, a genomic approach was utilized to identify associations between VAGs and pathotype of U.S. isolates while evaluating the likelihood of classical Eurasian serotype 2 VAGs and newly proposed VAGs to identify pathogenic strains. In addition, pan-genome genetic relationships, along with their VAG profiles (virulence-associated genotypes), were investigated for isolates within and between swine production companies. Finally, we applied the genomic approach for identifying associations between VAGs and pathogenicity classification to a validation set of S. suis isolates to determine whether our approach was robust enough to identify pathogenic strains isolated from other swine production companies.

Source of isolates and collection of epidemiological data
A training set of 208 S. suis isolates was used in this study. These isolates, previously described by Estrada et al. (2019), were classified into three pathotypes (pathogenic, possibly opportunistic, and commensal) based on clinical information and site of isolation. "Pathogenic" isolates were obtained from systemic tissues such as the brain/meninges and heart. "Possibly opportunistic" isolates were predominantly from lung samples from pigs without signs of neurological or systemic disease. "Commensal" isolates were from laryngeal, tonsil, or nasal samples retrieved from farms with no current control methods for S. suis disease.
Furthermore, epidemiological data, such as swine production company and site, were collected for the training set of isolates. The swine production companies coded as A, D, E, K, and M are all large operations that range in size from 70,000 to 340,000 sows and with headquarters in the United States (A and D = MN, E = MO, K = KS, and M = IL).

VAG profiling
VAG profiling was performed on the training set using a custom database of previously published VAGs of S. suis (Additional file 1) [28]. Illumina sequencing reads were mapped to reference DNA sequences (≥ 60% coverage and ≥ 90% sequence identity) using the SRST2 (Short Read Sequence Typing for Bacterial Pathogens) program [29]. The construction of a presence and absence heatmap (Euclidean distances and UPGMA clustering) was performed with R software [30].

Statistical analysis
Associations between published S. suis VAGs and pathotype, as previously defined by Estrada et al. [28], were investigated. Published VAGs present in a majority of isolates (> 90%, 188/208) were removed. VAGs present in < 50% of the pathogenic pathotype (< 70/139) were also removed. Remaining VAGs were tested by chi-square, comparing the three pathotypes and the status (presence/absence) of individual genes. Genes lacking a significant (chi-square p-value < 0.05) association with pathotype were removed from the analysis. The remaining genes were analyzed using the Least Absolute Shrinkage and Selection Operator (LASSO) regression model [31].
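The filtering and chi-square steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the function names and the example gene counts are hypothetical, and only the pathotype sizes (139, 47, 22) and the thresholds (> 90% overall, < 50% of pathogenic isolates, p < 0.05) come from the text. For a 3 × 2 table (three pathotypes by presence/absence) the test has 2 degrees of freedom, for which the chi-square p-value reduces to exp(−χ²/2).

```python
import math

# Pathotype sizes reported for the training set.
PATHOTYPE_SIZES = {"pathogenic": 139, "possibly_opportunistic": 47, "commensal": 22}


def passes_prefilter(present, sizes=PATHOTYPE_SIZES):
    """Apply the two removal rules described in the text.

    `present` maps each pathotype to the number of isolates carrying the gene.
    """
    total = sum(sizes.values())
    if sum(present.values()) / total > 0.90:
        return False  # present in a majority (> 90%) of all isolates
    if present["pathogenic"] / sizes["pathogenic"] < 0.50:
        return False  # present in < 50% of the pathogenic pathotype
    return True


def chi_square_presence(present, sizes=PATHOTYPE_SIZES):
    """Pearson chi-square for a 3 x 2 table (pathotype x presence/absence)."""
    total = sum(sizes.values())
    total_present = sum(present.values())
    total_absent = total - total_present
    if total_present == 0 or total_absent == 0:
        raise ValueError("gene must be present in some isolates and absent in others")
    stat = 0.0
    for pathotype, n in sizes.items():
        cells = ((present[pathotype], total_present),
                 (n - present[pathotype], total_absent))
        for observed, column_total in cells:
            expected = n * column_total / total
            stat += (observed - expected) ** 2 / expected
    p_value = math.exp(-stat / 2.0)  # exact survival function for 2 d.f.
    return stat, p_value


# Hypothetical counts, for illustration only.
example_gene = {"pathogenic": 120, "possibly_opportunistic": 30, "commensal": 4}
if passes_prefilter(example_gene):
    stat, p = chi_square_presence(example_gene)
    print(f"chi-square = {stat:.1f}, p = {p:.2g}, significant = {p < 0.05}")
```

Genes that pass both steps would then be carried forward to the LASSO model.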
The LASSO regression model shrinks coefficients toward zero and sets the coefficients of genes that have no or low correlation with the target variable to exactly zero, thereby eliminating them from the model. The LASSO model was used to determine VAGs that may serve as the 'best' predictors of pathogenicity, in this case using the pathogenic pathotype as the indicator of pathogenicity. The analysis was performed using the R package glmnet and the best lambda penalty value to determine the fewest number of predictor genes [32]. Due to variation in the number of predictor VAGs in each run, we ran 100 iterations of the LASSO model to determine the most relevant predictor genes. Predictor VAGs reported in all 100 iterations were considered relevant candidate VAGs.
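The "reported in all iterations" rule can be illustrated with a minimal Python sketch. It assumes the LASSO fits themselves were done elsewhere (the study used glmnet in R) and that each run yields the set of genes with non-zero coefficients. The function name and the three example runs, including the placeholder `geneX`, are hypothetical.

```python
from collections import Counter


def stable_predictors(selected_per_run):
    """Return, sorted, the genes selected in every run.

    `selected_per_run` is a list with one collection of gene names per
    LASSO iteration (the genes whose coefficients were non-zero).
    """
    n_runs = len(selected_per_run)
    if n_runs == 0:
        return []
    # Count each gene at most once per run.
    counts = Counter(gene for run in selected_per_run for gene in set(run))
    return sorted(gene for gene, count in counts.items() if count == n_runs)


# Three hypothetical runs; the study used 100 iterations.
runs = [{"ofs", "srtF", "sly"}, {"ofs", "srtF"}, {"srtF", "ofs", "geneX"}]
print(stable_predictors(runs))  # ['ofs', 'srtF']
```

Requiring selection in every run is the strictest possible stability threshold; a gene dropped in even one iteration is excluded.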

Genome assembly and pan-genome analysis
Genome assembly was performed on Illumina sequencing data from the training set [28]: SRA accession numbers SRR9123061-SRR9123268. Genome assemblies were generated using MEGAHIT de novo assembler (kmer range of 25-225) and polished using Pilon [33,34]. QUAST was used to evaluate the genome assemblies [35]. Only contigs that were 500 bp or larger were kept for annotation by Prokka to predict coding sequences [36]. The pan-genome was annotated using Roary with a 90% BLASTp identity cut-off to define clusters of genes and allowing paralog clustering [37,38]. The FastTree program was used to generate an approximately-maximum-likelihood phylogenetic tree based on the binary presence and absence of core and accessory genes.
Percent similarity was calculated as the percentage of shared genes in the pan-genome.
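One possible reading of this definition is sketched below in Python. It assumes that two isolates "share" a gene cluster when they have the same presence/absence state for it, and that the denominator is the total number of gene clusters in the pan-genome (a simple matching coefficient). That interpretation is an assumption made for illustration, as are the function name and the ten-gene example profiles. Counting only clusters present in both genomes would give much lower values.

```python
def percent_similarity(profile_a, profile_b):
    """Percentage of pan-genome gene clusters with the same state in both isolates.

    Each profile is a sequence of 0/1 (absent/present) values, one per gene
    cluster, in the same cluster order for both isolates.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same gene clusters")
    if len(profile_a) == 0:
        raise ValueError("profiles must not be empty")
    matches = sum(1 for a, b in zip(profile_a, profile_b) if bool(a) == bool(b))
    return 100.0 * matches / len(profile_a)


# Hypothetical ten-cluster pan-genome; real profiles have thousands of clusters.
isolate_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
isolate_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(percent_similarity(isolate_1, isolate_2))  # 90.0
```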

Selection and whole genome sequencing of validation set
Thirty-two isolates obtained from a single swine production company from 2017 to 2019 were classified as either 'pathogenic' or of 'unknown-pathogenicity' based on tissue source (Additional file 2). Isolates classified as 'pathogenic' were obtained from the brain (n = 19). The isolates of 'unknown-pathogenicity' were isolated from non-systemic tissues (no neurological signs) (n = 13). The S. suis isolates were sequenced and the sequencing reads were processed using a similar method as described for the training set [28]. Isolates were confirmed as S. suis if they possessed the S. suis-specific recombination/repair protein (recN) sequence (Streptococcus suis 05HAS68, Accession CP002007).

Serotype, MLST, VAG profile, and pan-genome analysis of validation set
The serotyping of the validation set of S. suis isolates was verified using a S. suis serotyping pipeline described by Athey et al. (2016) to differentiate serotypes 2 and 1/2 and serotypes 1 and 14 [39]. In-silico MLST analysis was performed using the SRST2 program, and the ST allele sequences and profiles obtained from the S. suis MLST database [5]. Novel STs were further grouped into major clonal complexes (CCs) as previously described [28]. Data on presence or absence of the classical VAGs (epf, mrp, and sly) was obtained for each of the 32 isolates as described above for the training set. Similar genome assembly and pan-genome analysis as described for the training set were performed on the 32 isolates. The number of gene clusters identified for the training and validation sets may differ due to gene duplication, pseudogenes, gene acquisition/loss, and other genomic variations, as well as differences in the number of genomes included in the pan-genome analysis [40,41].

VAG profiling
Distribution of the epf, mrp, and sly genes
In our previous study of 208 S. suis isolates (referred to as the training set), 139, 47, and 22 were classified as the pathogenic, possibly opportunistic, and commensal pathotype, respectively [28]. The training set was also characterized by determination of serotype, MLST, and CC. In the current study, the distribution of the epf, mrp, and sly genes was bioinformatically determined for the 208 isolates in the training set. These classical VAGs epf, mrp, and sly were identified in 20 (14.4%), 127 (91.4%), and 77 (55.4%) isolates of the pathogenic pathotype, respectively (Table 1). The epf gene was predominantly present in serotypes 1, 2, and 14 and CC1 isolates while mrp and sly were distributed among a diverse set of subtypes. The epf, mrp, and sly genes were identified in 0 (0%), 6 (27.3%), and 4 (18.2%) isolates of the commensal pathotype, respectively (Table 2). We further investigated genotype combinations of the epf, mrp, and sly genes and their distributions in STs 1, 25, and 28 (Table 2). The predominant genotype in the pathogenic pathotype was epf−/mrp+/sly− (41.0%, 57/139) followed by epf−/mrp+/sly+ (36.0%, 50/139). A majority of the ST28 (94%) and both ST25 isolates in the training set possessed the epf−/mrp+/sly− genotype. The epf+/mrp+/sly+ genotype was identified in only 20 of the 139 isolates classified as the pathogenic pathotype. All 17 ST1 isolates possessed the epf+/mrp+/sly+ genotype. In summary, a majority of isolates, even those of the pathogenic pathotype, lacked the three classical VAGs, but all the isolates containing the three VAGs were classified as pathogenic or ST1.
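The genotype tabulation described above can be sketched in Python as follows. This is a minimal sketch, assuming per-isolate presence calls are already available as sets of gene names. The function names and the four example isolates are hypothetical, and an ASCII hyphen stands in for the minus sign in the genotype labels.

```python
from collections import Counter

CLASSICAL_VAGS = ("epf", "mrp", "sly")


def genotype(present_genes, genes=CLASSICAL_VAGS):
    """Build a label such as 'epf-/mrp+/sly+' from the set of detected genes."""
    return "/".join(f"{g}{'+' if g in present_genes else '-'}" for g in genes)


def genotype_distribution(isolates):
    """Map each genotype to (count, percent of isolates), most common first."""
    counts = Counter(genotype(present) for present in isolates.values())
    n = len(isolates)
    return {g: (c, round(100.0 * c / n, 1)) for g, c in counts.most_common()}


# Hypothetical presence calls for four isolates.
isolates = {
    "iso1": {"mrp"},
    "iso2": {"mrp", "sly"},
    "iso3": {"epf", "mrp", "sly"},
    "iso4": {"mrp"},
}
for label, (count, pct) in genotype_distribution(isolates).items():
    print(label, count, pct)
```

The same function accepts any tuple of gene names, so passing `genes=("ofs", "srtF")` to `genotype` would produce labels for the proposed two-gene genotype.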

Determining predictors of pathotype by VAG profiling
Given the limited distribution of classical VAGs (epf and sly) among isolates in the pathogenic pathotype, the classical VAGs are not appropriate indicators of pathogenicity for non-serotype 2 S. suis isolates from the United States. Thus, a total of 71 previously published S. suis VAGs (including epf, mrp, and sly) were investigated for the presence of alternative genes that may be indicators of pathogenic strains. Thirty-two (45%) VAGs were present in all genomes regardless of pathotype and were clearly not indicators of the pathogenic pathotype (Table 3 & Additional file 3). Five VAGs were absent in all of the isolates in the commensal pathotype. SalK and salR, which encode the SalK/SalR two-component signal transduction system [42], were not detected in any of the isolates despite mapping to different reference sequences.
Clustering analysis was used to determine if relationships between the presence of previously published VAGs and pathotype existed. The analysis of 71 VAGs identified three clusters (Cluster I-III), two of which were associated with pathotype (Fig. 1 & Additional file 4). Cluster I consisted of isolates of all three pathotypes. Cluster II predominantly consisted of isolates from the pathogenic pathotype and lacked isolates from the commensal pathotype. Cluster III contained the majority of isolates from the commensal pathotype (73%). Isolates in the pathogenic cluster (Cluster II) were predominantly characterized as serotype 1/2 CC28. Serotype 1/14 CC1 isolates formed a subcluster of Cluster I which lacked isolates from the commensal pathotype. Clustering analysis also illustrated multiple candidate published VAGs for discriminating between pathotypes, specifically VAGs present in the two pathogenic clusters and absent in the commensal cluster.
We then performed statistical analyses to test for associations between VAGs and pathotype. Of the 71 published VAGs detected in the genomes, 16 were tested by chi-square and 14 were considered significant (chi-square p < 0.05) (Table 3 & Additional file 5). The classical VAGs mrp and sly were considered significant by chi-square. The 14 VAGs that were significant by chi-square were further analyzed by the LASSO model. The sly gene was in the top ten VAGs identified by LASSO, but mrp was not [data not shown]. The LASSO model identified four other candidate VAGs associated with the pathogenic pathotype (Table 4). The VAGs ofs and srtF were present in over 95% (≥ 132/139) of isolates in the pathogenic pathotype and thus, the presence of both genes was tested as predictors of the pathogenic pathotype. Ninety-five percent (132/139) of the pathogenic pathotype contained both genes while only 23% (5/22) of the commensal pathotype contained both genes.

Diversity of U.S. S. suis by pan-genome analysis
Relatedness of 208 S. suis isolates by pan-genome analysis
Pan-genome analysis of 208 S. suis genomes generated a pan-genome of 8373 gene clusters and illustrated multiple clusters that corresponded to the five MLST CCs (CC1, CC28, CC94, CC104, and CC750) (Fig. 2) [28]. Isolates from at least sixteen swine production companies (A-P) (≥ 2 isolates each) were identified in the data set, with A (n = 13), D (n = 16), E (n = 18), K (n = 21), and M (n = 16) representing the five production companies with the most isolates in this study (predominant production companies). The most predominant CCs (CC1, CC28, and CC94) were identified in multiple production companies. CCs 1, 28, and 94 were identified in 12, 11, and 12 of the 16 production companies, respectively.

Relatedness of isolates within the five predominant production companies
The genetic relationships between pathogenic and possibly opportunistic isolates within a production company were investigated using pan-genome analysis for each of the five predominant production companies A, D, E, K, and M (Fig. 3). None of the isolates from these production companies were classified as the commensal pathotype [28]. In addition, we explored associations between pan-genome clusters and genotypes of the classical (epf, mrp, and sly) and proposed pathogenic (ofs and srtF genes) VAGs. The isolates demonstrated various genotypes of classical VAGs, and a majority (96.4%) possessed the proposed ofs+/srtF+ genotype for predicting pathogenic strains. Due to the diversity of isolates within each production company, we investigated pan-genome clusters and

Relatedness of commensal isolates
We further investigated the genetic relationships between the 22 isolates of the commensal pathotype. An 82.6-99.9% similarity was observed, with isolates forming two large clusters and multiple sub-clusters (Fig. 4). Thirteen isolates lacked a CC, while one, three, and five isolates were assigned to CC1, CC94, and CC750, respectively. The CC1 and CC94 isolates possessed more VAGs than the other commensal isolates with all possessing mrp and sly, and both ofs and srtF, while a majority of commensal isolates (77.3%) lacked the classical and proposed pathogenic VAGs.

Fig. 2 Relatedness of 208 S. suis isolates by pan-genome analysis. Genetic relationships between isolates are based on the presence and absence of 8373 gene clusters among 208 S. suis genomes. The phylogenetic tree is color-coded (branches) and labeled (right) by CC; multiple STs did not form a CC or formed a CC without a primary founder. Isolates from at least sixteen swine production companies (A-P) (≥ 2 isolates each) were identified in the data set. Misc. refers to miscellaneous production companies (single isolates each). Isolates belonging to the five predominant production companies (A, D, E, K, and M) are color-coded by their respective production company. * strains in the commensal pathotype (n = 22)

Characterization of the validation set of S. suis isolates
A distinct validation set of 32 S. suis isolates was obtained from a single production company to perform pan-genome analysis and further test the novel proposed pathogenic genotype (ofs+/srtF+). These isolates were classified as either 'pathogenic' or of 'unknown-pathogenicity.' The pan-genome consisted of 7078 gene clusters among the 32 genomes, and these pan-genome clusters associated with the 'pathogenic' and 'unknown-pathogenicity' classifications, as well as with virulence-associated genotypes (Fig. 5). Clusters c-f corresponded to the 'pathogenic' classification. Only the isolates in clusters e and f possessed the classical VAGs epf, mrp, and sly. Moreover, all the isolates in these two clusters possessed the proposed pathogenic ofs+/srtF+ genotype. A majority of the isolates in cluster d (67%) possessed the ofs+/srtF+ genotype. Cluster a and singletons g-n corresponded to the 'unknown-pathogenicity' classification (Fig. 5). A majority (86%) of the isolates in cluster a possessed the classical VAGs mrp and sly, and three isolates (43%) also possessed the proposed pathogenic genotype. A majority (75%) of the singletons g-n lacked both the classical and proposed VAGs. Two isolates possessed the proposed pathogenic genotype. The diversity and lack of VAGs in clusters g-n is similar to the diversity seen among the commensal pathotype, suggesting these isolates are commensal strains.

Fig. 3 Pan-genome analysis of isolates from the five predominant production companies. The predominant production companies are presented as A, D, E, K, and M. Color-coding of isolate names by production company and color-coding of phylogenetic tree branches by CC follow the same color schemes as Fig. 2. The percent similarity of isolates within a cluster is defined as the percentage of shared genes from a total of 8373 genes. The presence of the classical VAGs epf, mrp, and sly is represented in green and the proposed VAGs ofs and srtF in orange

Discussion
In this study, 71 published S. suis VAGs (including the classical VAGs epf, mrp, sly) were evaluated to identify pathogenic isolates associated with systemic and neurological disease from the United States. Notably, VAGs ofs and srtF demonstrated stronger associations with the pathogenic pathotype than the other 69 VAGs, suggesting novel published VAGs associated with pathogenicity. A genotyping scheme consisting of these two genes (ofs+/srtF+ genotype) identified pathogenic isolates in a validation set of S. suis isolates, demonstrating its potential application for predicting pathogenicity in other swine production companies. The genetic diversity of isolates within and between swine production companies was evaluated by pan-genome analysis, and important associations were observed among pan-genome clusters, CCs, and virulence-associated (VA) genotypes.
Muramidase-released protein has been associated with enhanced survival of S. suis in human blood and an increase in blood-brain barrier permeability in mice, while suilysin plays a role in the inflammatory response, although neither has been described as a critical virulence factor [43][44][45][46][47][48]. The epf gene was identified in only 14% and the sly gene was identified in 55% of isolates in the pathogenic pathotype. The mrp gene was identified in 91% of the pathogenic pathotype and 27% of the commensal pathotype, suggesting the classical VAG mrp continues to be an adequate identifier of pathogenic strains. The epf+/mrp+/sly+ genotype is correlated with S. suis clinical disease caused by European and Asian ST1 strains belonging to serotypes 1, 2, 9, and 14 [27,49,50]. The ST1 isolates in the training set had the epf+/mrp+/sly+ genotype, and the ST25 and ST28 isolates had the epf−/mrp+/sly− genotype, confirming the use of the classical VAGs for identifying virulent ST1 strains but the limited use for identifying ST25 and ST28 strains in North America [11].
Various subtyping methods, including serotyping and MLST, have been used for evaluating the genetic diversity of S. suis isolates and identifying patterns specific to clinical isolates. Pulsed-field gel electrophoresis (PFGE) has been used for evaluating the genetic diversity of S. suis serotype 2, 1/2, 3, 7, and 9 strains [51,52]. Although PFGE has high discriminatory power, typing a large number of isolates is time consuming and labor intensive. Unique randomly amplified polymorphic DNA patterns have been recovered from S. suis isolates from diseased pigs and correlated with the production of virulence markers [53,54]. However, these analyses were mainly focused on serotype 2 strains. Multiplex PCR assays were developed for the differentiation of isolates into serotypes and detection of multiple VAGs [55][56][57]. A limitation of multiplex PCR assays is the number of targets that can successfully be tested in a single assay [58]. In this study, we utilized pan-genome analysis in conjunction with serotyping, MLST, and VAG profiling as a subtyping tool for S. suis. Whole genome sequencing (WGS)-based approaches, such as comparative genome hybridization, minimum core genome sequence typing, pan-genome and Bayesian analysis of population structure, and genome-wide association studies, have been used in combination with phenotypic methods for the identification and classification of S. suis strains into groups of differing levels of virulence [21,22,59,60,61,62]. WGS-based approaches have multiple advantages over molecular subtyping techniques, such as the ability to characterize the entire genome, a higher discriminatory power capable of discriminating closely related strains, the ability to perform in silico (via computer simulation) analyses, and access to a vast number of bioinformatics tools for the analysis of whole genomes [63,64].

Fig. 4 Pan-genome analysis of the 22 commensal isolates. Color-coding of phylogenetic tree branches by CC follows the same color scheme as Fig. 2. The presence of the classical VAGs epf, mrp, and sly is represented in green and the proposed VAGs ofs and srtF in orange
Novel published VAGs for the identification of pathogenic S. suis isolates were selected using a chi-square test and a LASSO regression model testing associations between published VAGs and pathotype. As a result, the two genes ofs and srtF were selected as the 'best' indicators of pathogenicity for isolates in our study. The ofs gene encodes a serum opacity factor and was associated with virulence attenuation in an experimental pig model [65]. The srtF gene encodes a class C sortase and is part of the srtF pilus gene cluster composed of four genes, srtF, sipF, sfp1, and sfp2 [66]. srtF mutants of S. suis serotype 2 ST1 strain P1/7 showed attenuated virulence in an intranasal caesarean-derived colostrum-deprived (CDCD) pig model [67]. However, the presence of the pilus gene cluster does not guarantee pilus protein expression [11]. Our research identifies the genes as markers for pathogenicity and not the expression of proteins. The percentage of isolates containing the ofs+/srtF+ genotype that were classified as pathogenic increased from 79 to 96% (132/137) when excluding the possibly opportunistic pathotype (isolates possibly associated with respiratory disease) from the analysis. The proposed pathogenic genotype for predicting pathogenicity was further tested in a validation set consisting of 32 S. suis isolates to evaluate the likelihood of these two genes identifying pathogenic strains in other swine production companies. The ofs+/srtF+ genotype was observed in 73.7% (14/19) of the 'pathogenic' isolates, together indicating a ≥ 74% probability that an isolate will be classified as pathogenic given the proposed genotype. The proposed ofs+/srtF+ genotype, in complement to the classical VAGs for ST1, identifies pathogenic strains in the United States. A potential application of this research is the development of a diagnostic PCR test targeting these two proposed VAGs.
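The proportions quoted in this paragraph can be reproduced from the reported counts with a short Python sketch. The variable and function names are hypothetical; the counts (132 of 139 pathogenic and 5 of 22 commensal training isolates, and 14 of 19 'pathogenic' validation isolates, carrying ofs+/srtF+) are taken from the text. The sketch keeps two kinds of proportion apart: the share of genotype-positive isolates that are pathogenic, and the share of pathogenic isolates that are genotype-positive.

```python
def percent(numerator, denominator):
    """Return numerator/denominator as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


# Counts of ofs+/srtF+ isolates reported in the text.
train_pathogenic_pos, train_pathogenic_n = 132, 139
train_commensal_pos, train_commensal_n = 5, 22
valid_pathogenic_pos, valid_pathogenic_n = 14, 19

# Share of genotype-positive isolates that are pathogenic
# (possibly opportunistic isolates excluded): 132 / (132 + 5).
pathogenic_given_genotype = percent(
    train_pathogenic_pos, train_pathogenic_pos + train_commensal_pos)

# Share of each group that carries the genotype.
genotype_in_pathogenic = percent(train_pathogenic_pos, train_pathogenic_n)
genotype_in_commensal = percent(train_commensal_pos, train_commensal_n)
genotype_in_validation = percent(valid_pathogenic_pos, valid_pathogenic_n)

print(round(pathogenic_given_genotype, 1))  # 96.4
print(round(genotype_in_pathogenic, 1))     # 95.0
print(round(genotype_in_commensal, 1))      # 22.7
print(round(genotype_in_validation, 1))     # 73.7
```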
Seven of 139 isolates in the pathogenic pathotype lacked the ofs+/srtF+ genotype, suggesting the possibility of misclassification of these isolates based on tissue source (systemic versus non-systemic). In addition to pathogen-specific traits, environmental and management conditions and host traits contribute to the development of S. suis disease. These factors include temperature fluctuations, overcrowding, concurrent infections, and host immunity and genetics [20,68,69]. The ofs+/srtF+ genotype was identified in five isolates in the commensal pathotype but four of these isolates were characterized as CC1 or CC94, which are generally pathogenic subtypes [70,71]. The five commensal isolates are present in Cluster I (Fig. 1), which represents a cluster containing all three pathotypes, indicating these isolates share similar VAGs with pathogenic isolates. Virulent strains have been previously isolated from the nasal cavities and tonsils of clinically healthy pigs, so characterization by tissue source can be misleading [72,73].

Fig. 5 Pan-genome analysis of the validation set of 32 S. suis isolates. Isolates are color-coded by classification, 'pathogenic' (red) or 'unknown-pathogenicity' (blue). The phylogenetic tree branches are color-coded by CC. The percent similarity of isolates within a cluster is defined as the percentage of shared genes from a total of 7078 genes. The presence of the classical VAGs epf, mrp, and sly is represented in green and the proposed VAGs ofs and srtF in orange
Pan-genome analysis in combination with metadata (serotype, ST/CC, VA genotype) was used in this study as a subtyping tool to describe the genetic diversity of S. suis isolates within a production company and between companies for epidemiological purposes. The differentiation of S. suis may provide information on the origin of isolates (geographical location, year, source, etc.) or aid in the identification and tracking of strains over time [74][75][76]. Isolates from the pathogenic pathotype in this study formed distinct clusters with correlation to CC and VA genotypes, which is consistent with previous studies [54,77]. The same CC-VA genotype patterns were identified in multiple production companies, suggesting a lack of association between production company, CC, or VA genotype. These observed patterns may be widespread as opposed to originating from a common source of infection as previously suggested [78][79][80]. Furthermore, the high genetic similarity and identical CC and VA genotypes within a pan-genome cluster (such as in cluster A in production company E) are indicative of a clone, providing useful information for the identification and tracking of clones over time [81][82][83]. Thus, the use of WGS to complement metadata (e.g. epidemiological, clinical, and demographic data) provides a valuable tool for subtyping S. suis as part of epidemiological studies [84,85]. Further, pan-genome analysis of U.S. S. suis isolates may be used to identify candidate VAGs not yet identified or characterized.
The differentiation of S. suis isolates is also crucial for the development of autogenous vaccines [86]. Different strains have been recovered from diseased pigs from the same herd and selecting the strain or strains associated with disease is challenging [52,87,88,89]. For the validation set, multiple CC-VA genotype patterns were found among the 'pathogenic' clusters, indicating multiple clones were present in this production company. This diversity of isolates is supported by the identification of five serotypes (1, 1/2, 2, 14, and 7) in the validation set, all of which are generally pathogenic subtypes [24,90,91]. Despite the diversity of clinical strains in the same herd, previous reports indicate a specific strain is the predominant cause of disease and the primary candidate for an autogenous vaccine [87][88][89]92]. CC1 was predominantly identified in this production company, and these CC1/ST1 isolates (clusters e and f) demonstrated similar gene content (99% similarity) and genotypes but had different serotypes (serotypes 1/14 vs serotype 2). These results suggest two sub-populations with differences in virulence potential and the need for multiple isolates in a vaccine [93,94]. On the other hand, the CC28 isolates (cluster d) demonstrated similar gene content (92-99% similarity), serotypes, and genotypes, suggesting similar virulence potential, and the selection of a single isolate for vaccine [11,13]. As these isolates came from the same production company, all three isolates may be recommended for vaccine development. In addition to the genetic diversity of S. suis isolates, historical background of a production company should be considered while selecting isolates. Historical factors such as prior on-farm identification of S. suis, historic and current sources of replacement animals, and other confounding disease factors can further support the inclusion of multiple isolates in vaccine development.

Conclusion
In this study, the current distribution of published, including classical, VAGs in U.S. isolates was determined, which indicated that classical VAGs are not sufficient to differentiate pathogenic and commensal U.S. strains. Of the 71 published VAGs investigated, the ofs and srtF genes were shown to be stronger predictors of pathogenicity in both a training and a validation set of isolates. Furthermore, a WGS-based approach was used to determine the genetic diversity of isolates demonstrating its use in epidemiological studies and vaccine isolate selection.