Inverted repeats in the monkeypox virus genome are hot spots for mutation

Abstract
The current monkeypox virus (MPXV) strain differs from the strain arising in 2018 by 50+ single-nucleotide polymorphisms (SNPs) and is mutating much faster than expected. The cytidine deaminase apolipoprotein B messenger RNA editing enzyme, catalytic subunit 3 (APOBEC3) was hypothesized to be driving this increased mutation. APOBEC has recently been shown to preferentially mutate cruciform DNA secondary structures formed by inverted repeats (IRs). IRs were recently identified as hot spots for mutation in severe acute respiratory syndrome coronavirus 2, and we aimed to determine whether IRs were also hot spots for mutation within MPXV genomes. We found that MPXV genomes were replete with IR sequences. Of the 50+ SNPs identified in the 2022 outbreak strain, 63.9% were found to have arisen within IR regions in the 2018 reference strain (MT903344.1). Notably, IR sequences found in the 2018 reference strain were significantly lost over time, with an average of 32.5% of these sequences being conserved in the 2022 MPXV genomes. This evidence strongly indicated that mutations were arising within IRs. These data provide further support for the hypothesis that APOBEC may be driving MPXV mutation and highlight the need for greater surveillance of IRs in MPXV genomes to detect new mutations.


| INTRODUCTION
single-nucleotide polymorphisms (SNPs). As the reference sequence from 2018 only differs from the current sequence by approximately 100 bp, the mutation rate was between 6- and 12-fold higher than expected over this time period. These mutations were primarily G > A and C > T mutations, which they concluded was likely due to the activity of apolipoprotein B messenger RNA (mRNA) editing enzyme, catalytic subunit 3 (APOBEC3) family members. APOBEC3 is a cytidine deaminase with innate antiviral activity that is upregulated during viral infections.
This enzyme promotes G > A and C > T hypermutations at 'hot spots' within viral DNA to render the virus less infective and prevent biological processes such as replication. [3] However, there is also evidence that sublethal mutagenesis can contribute to greater genetic diversity and enhance viral propagation. Only a single mutation observed in the study above was not a G > A or C > T transition, which strongly suggests that APOBEC was involved in driving this mutational diversity.
There is growing evidence that non-B DNA secondary structures such as cruciforms (formed by inverted repeats [IRs]), triplexes, and G-quadruplexes (G4s) are involved in driving mutational diversity. [4][5][6][7] IRs are not to be confused with the inverted terminal repeats (ITRs), repeat sequences of around 2-12 kbp which can occur within the first and last 12 kbp of poxvirus genomes. In this instance, the terminal repeat at the 3′ end is complementary to the terminal repeat at the 5′ end of the entire genome sequence. In contrast, IRs are much shorter sequences and can be found interspersed throughout the entire genome. IRs consist of a single-stranded sequence of nucleotides, followed downstream by its reverse complement, with the two separated by a short loop sequence consisting of any nucleotides (e.g., 5′-AAGCTnnnnnAGCTT-3′).
When the loop length is zero, the sequence is referred to as a palindrome. IRs have been demonstrated to play important roles in genome instability, where they contribute to evolution and disease. [8][9][10] Indeed, it was recently shown that mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) occurred with greater frequency within IRs, suggesting that IRs are important drivers of viral mutational diversity. [11] Interestingly, a recent study identified that APOBEC mutagenic activity was much higher against IRs compared with other non-B or B-DNA structures. [12] Thus, one could question whether APOBEC might be driving mutational diversity in MPXV by inducing mutations within IRs. Here, we analyzed 247 MPXV genomes to identify the presence of both G4s and IRs. Furthermore, we determined which of the SNPs identified in the 2022 outbreak genomes arose within IR regions in the 2018 reference strain and whether IRs were hot spots for mutation in MPXV.
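The IR definition given above (a sequence, an optional loop, then the reverse complement of the first sequence) can be checked mechanically. The following is a minimal sketch, not code from the study; the function names `reverse_complement` and `split_inverted_repeat` are hypothetical and chosen only for illustration.

```python
# Illustrative helpers (hypothetical names, not from any tool used in the study).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_inverted_repeat(seq: str, arm_len: int):
    """Split seq into (left arm, loop, right arm) if it is a perfect IR.

    The right arm must equal the reverse complement of the left arm.
    Returns None when the arms do not pair or arm_len does not fit.
    """
    if arm_len <= 0 or 2 * arm_len > len(seq):
        return None
    left = seq[:arm_len]
    loop = seq[arm_len:len(seq) - arm_len]
    right = seq[len(seq) - arm_len:]
    if right.upper() == reverse_complement(left).upper():
        return left, loop, right
    return None


# The example from the text: two 5-nt arms separated by a 5-nt loop.
print(split_inverted_repeat("AAGCTnnnnnAGCTT", 5))
# With a loop length of zero the same arms form a palindrome.
print(split_inverted_repeat("AAGCTAGCTT", 5))
```

Only the arms are compared, so the loop may contain any characters, which matches the "any nucleotide" wording of the definition.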

| Genome sequences
Two hundred and forty-seven genomes were obtained from the National Center for Biotechnology Information (NCBI) database and analyzed for the presence of IRs and G4s (last accessed 11/11/2022; Table S1). One hundred and twenty-four of these genomes were chosen at random to explore which IRs in the 2018 reference strain

| Detection of G4-forming sequences in MPXV genomes
Analysis of genomes for G4-forming sequences was conducted using G4Hunter. [13] G4Hunter identifies all sequences with a propensity to form G4s within a genome. The number of G4-forming sequences present was identified at the detection thresholds 0-1.2, 1.2-1.4, 1.4-1.6, 1.6-1.8, 1.8-2, and above 2. The window size was 25 nucleotides. Sequences appearing at higher thresholds had a higher propensity to fold into G4s, and those with a near-zero average score were likely to form duplexes. These data can be found in Table S2. To identify the location of G4-forming sequences within annotated genomic features, the files containing known genomic features in the MPXV genomes were downloaded from the NCBI database. The presence of G4-forming sequences within a pre-defined genomic feature (e.g., gene), or within ±100 bp of these genomic features, was analyzed. The location of G4-forming sequences in known genomic features was identified using a publicly available script found at https://pypi.org/project/dna-analyser-ibp/.
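To make the windowed scoring concrete, the sketch below follows our understanding of the published G4Hunter scheme: each G in a run of n consecutive Gs scores min(n, 4), each C scores the negative equivalent, A and T score 0, and a window's score is the mean over its 25 nucleotides. This is an assumption-laden illustration rather than the G4Hunter implementation itself; it omits the merging of overlapping windows, and the bin boundaries are assumed to be lower-inclusive because the text does not say how ties are handled.

```python
from itertools import groupby

# Threshold bins as listed in the text; boundary handling is an assumption.
BIN_EDGES = [1.2, 1.4, 1.6, 1.8, 2.0]
BIN_LABELS = ["0-1.2", "1.2-1.4", "1.4-1.6", "1.6-1.8", "1.8-2", ">2"]


def base_scores(seq):
    """Score each base: +min(run, 4) for G runs, -min(run, 4) for C runs, else 0."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if base == "G":
            value = min(n, 4)
        elif base == "C":
            value = -min(n, 4)
        else:
            value = 0
        scores.extend([value] * n)
    return scores


def window_scores(seq, window=25):
    """Mean base score for every full window, sliding one nucleotide at a time."""
    scores = base_scores(seq)
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    means = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        means.append(total / window)
    return means


def threshold_bin(score):
    """Assign a window score to one of the detection-threshold bins by magnitude."""
    magnitude = abs(score)
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if magnitude < edge:
            return label
    return BIN_LABELS[-1]


# A 25-nt toy sequence with four GGG runs: 12 Gs scoring 3 each, mean 36/25.
toy = "GGGTTAGGGTTAGGGTTAGGGTTAA"
for score in window_scores(toy):
    print(score, threshold_bin(score))
```

The absolute value is used for binning so that C-rich windows, which indicate a G4-forming sequence on the opposite strand, are counted in the same bins.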

| Detection of IR sequences in MPXV genomes
Genomes were analyzed using Palindrome Analyzer to detect the presence and localization of IRs. [14] The default parameters for analysis detected IRs with a size between 6 and 30 bp, a spacer size from 0 to 10 bp, and up to one mismatch. Information about the number and frequency of IRs within the MPXV genomes can be found in Table S3. The nucleotide positions of the SNPs in the 2018 reference strain were obtained from Isidro et al. and cross-referenced with our IR analyses to identify whether they were located within an IR sequence. We then manually assessed whether these exact IR sequences were conserved among other MPXV genomes between 2018 and 2022.
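The cross-referencing step described above can be sketched as follows. This is a brute-force illustration and not Palindrome Analyzer: it assumes that "size" refers to the length of each repeat arm, it is far too slow for a full genome, and the function names `find_inverted_repeats` and `snps_in_irs` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_inverted_repeats(seq, min_arm=6, max_arm=30, max_spacer=10, max_mismatch=1):
    """Return 0-based half-open (start, end) intervals of candidate IRs.

    A candidate is a left arm, a spacer of 0..max_spacer nt, and a right arm
    that matches the reverse complement of the left arm with at most
    max_mismatch mismatches. Characters other than A/C/G/T are not
    complemented, so ambiguous bases should be filtered out beforehand.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for start in range(n):
        for arm in range(min_arm, max_arm + 1):
            for spacer in range(max_spacer + 1):
                end = start + 2 * arm + spacer
                if end > n:
                    break  # longer spacers only extend further past the end
                left = seq[start:start + arm]
                right = seq[end - arm:end]
                target = left.translate(COMPLEMENT)[::-1]
                mismatches = sum(a != b for a, b in zip(target, right))
                if mismatches <= max_mismatch:
                    hits.append((start, end))
    return hits


def snps_in_irs(snp_positions, irs):
    """Return the 1-based SNP positions that fall inside any IR interval."""
    return [p for p in snp_positions if any(s < p <= e for s, e in irs)]


# Toy sequence: arm ACGTTG, spacer AAA, then the reverse complement CAACGT.
toy = "ACGTTGAAACAACGT"
irs = find_inverted_repeats(toy)
snps = [3, 20]  # hypothetical 1-based SNP coordinates
inside = snps_in_irs(snps, irs)
print(irs)
print(f"{100 * len(inside) / len(snps):.1f}% of SNPs fall within an IR")
```

SNP coordinates are treated as 1-based, as is usual for reference-genome positions, while the IR intervals use Python's 0-based half-open convention; the comparison `s < p <= e` converts between the two.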

| Statistics
Data were first tested for normality via a Shapiro-Wilk normality test. To assess whether IRs were being lost compared with the 2018 reference strain (MT903344.1), all data were normalized to this strain (mean of 100%) and significance was determined via a one-sample t-test. A p-value of <0.05 was considered statistically significant.
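For reference, the one-sample t-test described above compares the mean of the n normalized values x_i against the fixed reference value of 100%. The standard form of the statistic is given below; the symbols are ours, and the text does not state whether the test was one- or two-sided.

```latex
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
\mu_0 = 100
\]
```

Under the null hypothesis, t follows a Student's t distribution with n - 1 degrees of freedom, which is why normality of the data is checked first.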

| RESULTS
It has previously been reported that members of the Poxviridae family have some of the lowest frequencies of G4-forming sequences among viruses. [15] However, it has also recently been shown that all MPXV genomes from the 2022 outbreak contain an unstable G4 in the C9L gene, which increases inhibition of the immune response. [16] We first analyzed MPXV genome sequences for the presence of G4-forming sequences. As expected, we identified very few G4-forming sequences within these genomes, ranging from 6 to 10 sequences (frequency of 0.030-0.055 per kbp; Figure 1A; Table S2).
Most of these sequences were found within 100 bp before genes and within genes themselves, with very few sequences being identified in the 100 bp following the gene sequence (Figure 1B). […] (Table 1). Of these, 100% of the […] (Table S4).

| DISCUSSION
In this study, we identified that MPXV genomes were depleted of […] provides a potential link between APOBEC3 activity and mutation in MPXV genomes.

AUTHOR CONTRIBUTIONS
Stefan Bidula, Emily F. Warner, and Václav Brázda developed the research question and analysis plan. Stefan Bidula, Václav Brázda, and Michaela Dobrovolná were involved in data collection and analysis.
All authors were involved in preparing the final manuscript.

ACKNOWLEDGMENTS
No funding was received for this manuscript.

CONFLICTS OF INTEREST
The authors declare that there are no conflicts of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon request.