The D614G Variant
Increasing Frequency and Global Distribution
The Spike D614G amino acid change is caused by an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain; it was the only site identified in our first Spike variation analysis in early March 2020 that met our threshold criterion. At that time, the G614 form was rare globally but gaining prominence in Europe, and GISAID was also tracking the clade carrying the D614G substitution, designating it the “G clade.” The D614G change is almost always accompanied by three other mutations: a C-to-T mutation in the 5′ UTR (position 241 relative to the Wuhan reference sequence), a silent C-to-T mutation at position 3,037, and a C-to-T mutation at position 14,408 that results in an amino acid change in RNA-dependent RNA polymerase (RdRp P323L). The haplotype comprising these 4 genetically linked mutations is now the globally dominant form. Prior to March 1, 2020, it was found in 10% of 997 global sequences; between March 1 and March 31, 2020, it represented 67% of 14,951 sequences; and between April 1 and May 18, 2020 (the last data point available in our May 29, 2020 sample), it represented 78% of 12,194 sequences. The transition from D614 to G614 occurred asynchronously in different regions throughout the world, beginning in Europe, followed by North America and Oceania and then Asia (Figures 1, 2, 3, S2, and S3).
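To make the four linked sites concrete, the following minimal sketch checks a genome for the haplotype described above. It is an illustration only, not the pipeline used in this study: the helper names are hypothetical, and the genome is assumed to be already aligned to the Wuhan reference so that the 1-based positions quoted in the text index it directly.

```python
# Hypothetical helper names. Positions are the 1-based reference coordinates
# quoted in the text; the input is ASSUMED to be aligned to the Wuhan
# reference with no coordinate shifts at these sites.
G_CLADE_SITES = {
    241: ("C", "T"),    # 5' UTR
    3037: ("C", "T"),   # silent change
    14408: ("C", "T"),  # RdRp P323L
    23403: ("A", "G"),  # Spike D614G
}


def classify_sites(aligned_genome):
    """Return {position: 'ref' | 'alt' | 'other'} for the four linked sites."""
    calls = {}
    for pos, (ref, alt) in G_CLADE_SITES.items():
        base = aligned_genome[pos - 1].upper()  # 1-based to 0-based index
        if base == ref:
            calls[pos] = "ref"
        elif base == alt:
            calls[pos] = "alt"
        else:
            calls[pos] = "other"  # ambiguous base, gap, or a different change
    return calls


def has_full_g614_haplotype(aligned_genome):
    """True only if all four sites carry the variant base."""
    return all(c == "alt" for c in classify_sites(aligned_genome).values())
```

A sequence that carries G at position 23,403 but the reference base at any of the other three sites would be reported as lacking the full haplotype, which is how the rare partial forms can be separated from the dominant four-mutation form.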
Figure 1. The Global Transition from the Original D614 Form to the G614 Variant
(A) Changes in the global distribution of the relative frequencies of the D614 (orange) and G614 (blue) variants in 2 time frames. Circle size indicates the relative sampling within each map. Through March 1, 2020, the G614 variant was rare outside of Europe, but by the end of March 2020 it had increased in frequency worldwide. These data are explored regionally in Figure 2 (Europe), Figure S2 (North America), and Figure S3 (Australia and Asia).
(B) Paired bar charts compare the fraction of sequences with D614 and with G614 for two time periods separated by a 2-week gap. The first time period (left bar) includes all sequences up to the onset day (see main text). The second time period (right bar) includes all sequences acquired at least 2 weeks after the onset date. All regions are shown that met the minimal threshold criteria for inclusion (see main text) with a significant shift in frequency (two-sided Fisher’s exact test, p < 0.05). Four hierarchical geographic levels are split out based on GISAID naming conventions.
(C) Running weekly average counts of sampled sequences exhibiting the D614 (orange) and G614 (blue) variants on different continents between January 12 and May 12, 2020. The measure of interest is the relative frequency over time. The shape of the overall curve just reflects sample availability; sequencing was more limited earlier in the epidemic (hence the left-hand tail), and there is a time lag between viral sampling and sequence availability in GISAID (hence the right-hand tail).
Weekly running count plots were generated with Python Matplotlib (Hunter, 2007); all elements of this figure are frequently updated at cov.lanl.gov/.
Figure 2. The Transition from D614 to G614 in Europe
(A) Maps of relative D614 and G614 frequencies in Europe in 2 time frames.
(B) Weekly running counts of G614 illustrating the timing of its spread in Europe. The legend for Figure 1 explains how to read these figures. Some nations essentially had G614 epidemics when sampling began, but even in these cases, small traces of D614 found early were soon lost (e.g., France and Italy). The Italian epidemic started with the D614 clade, but Italy had the first sampled case of the full G614 haplotype and had shifted to all G614 samples prior to March 1, 2020 (Figure S5). European nations that began with a mixture of D614 and G614 most clearly reveal the frequency shifts (e.g., Germany, Spain, and the United Kingdom). The United Kingdom is richly sampled and so is subdivided into smaller regions (England, Wales, and Scotland) and then further divided to display two well-sampled English cities. Even in settings with very well-established D614 epidemics (e.g., Wales and Nottingham; see also Figures S2 and S3), G614 becomes prevalent soon after its appearance. The increase in G614 frequency often continues well after stay-at-home orders are in place (pink line) and past the subsequent 2-week incubation period (pink transparent box). The figures shown here can be recreated with contemporary data from GISAID at the cov.lanl.gov/ website. UK stay-at-home order dates were based on the date of the national proclamation, and others were documented on the web.
Figure 3. Modeling the Daily Fraction of the G614 Variant as a Function of Time in Local Regions Using Isotonic Regression
(A) Analysis summaries for all of the level 3 and 4 regional subdivisions from GISAID data (Figure 1) that have at least 5 each of D614 and G614 variants and that are sampled on at least 14 days. We report the number of each variant, the number of days with test results, and the number of days spanning the first and the last reported tests. The p values are for two one-sided tests, comparing the null hypothesis of no consistent changes in relative frequency over time with positive pressure (the fraction of the G614 variant increasing with time) or negative pressure (the fraction of the G614 variant decreasing with time). Across all regions with sufficient data, binomial p values (against the null hypothesis that increases and decreases are equally likely) indicate that the consistency of increasing G614 is highly significant. California has increasing and decreasing patterns with low p values; this can happen when different time windows support opposing patterns. The G614 decreasing time window in California was driven by sampling from Santa Clara county, a rare region that has retained the D614 form (Figure S4). In the May 29, 2020 dataset used here, Santa Clara county was sampled later in May than any other region in California, so the California G614 frequency dips at this last available time point. When Santa Clara county is removed from the California sample, the pattern of increasing levels of G614 is restored (red asterisk).
(B) Three examples of cities, plotting the daily fraction of G614 as a function of time and accompanied by plots of running weekly counts. The dot size is proportional to the number of sequences sampled that day. The staircase line is the maximum likelihood estimate under the constraint that the logarithm of the odds ratio is non-decreasing. Two typical examples are shown, highlighted in blue (Sydney and Cambridge), and one exception is shown, highlighted in orange (Yakima). Yakima had a brief sampling window enriched for G614 early in the sampling period, but otherwise G614 maintained a low frequency.
Summaries and plots for all regional data at levels 2–4 (including country) are included in Data S1.
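The staircase fit described in Figure 3B can be illustrated with the pool-adjacent-violators algorithm. The sketch below is one standard way to obtain such a fit and is not necessarily the implementation used for the figure; the function name and input layout are assumptions. Because the log odds is a monotone function of the fraction, constraining the log odds to be non-decreasing is equivalent to constraining the daily G614 fraction to be non-decreasing, and for binomial counts the constrained maximum likelihood estimate is obtained by pooling adjacent days that violate the ordering.

```python
def isotonic_fraction(g_counts, totals):
    """Non-decreasing fit of the daily G614 fraction (pool adjacent violators).

    g_counts[i] and totals[i] are the G614 count and the total number of
    sequences on the i-th sampled day, in date order; every total must be > 0.
    Returns one fitted fraction per day.
    """
    blocks = []  # each block holds [pooled G614 count, pooled total, n_days]
    for g, n in zip(g_counts, totals):
        blocks.append([g, n, 1])
        # Merge backwards while the earlier block has a higher fraction than
        # the later one; cross-multiplying avoids floating-point division.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            g_last, n_last, k_last = blocks.pop()
            blocks[-1][0] += g_last
            blocks[-1][1] += n_last
            blocks[-1][2] += k_last
    fitted = []
    for g, n, k in blocks:
        fitted.extend([g / n] * k)  # one flat step per pooled block
    return fitted
```

Days with more sequences carry more weight in each pooled step, which mirrors the dot sizes in the plots. Fitting a non-increasing curve uses the same routine on the D614 counts.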
Figure S2. The Increasing Frequency of the D614G Variant over Time in North America, Related to Figure 1
A. Maps of the relative frequencies of D614 and G614 in North America in two different time windows. B. Weekly running counts of G614 illustrating the timing of its spread in North America. This figure complements Figures 2 and S3, and Figure 1 has details about how to read these figures. When a particular stay-at-home order date was known for a state or county, it is shown as a pink line, followed by a light pink block indicating the maximum two-week incubation time. Different counties in California had different stay-at-home order dates (March 16-19) and so are not highlighted, but more detail regarding California can be seen in Figure S4. The decline in D614 frequency often continues well after the stay-at-home orders were initiated, and sometimes beyond the 14-day maximum incubation period, when serial reintroduction of G614 would be unlikely. On the right, Washington State is shown, with details from two heavily sampled counties, Snohomish and King. Both counties had well-established ongoing D614 epidemics when G614 variants were introduced, undoubtedly by travelers. Washington State’s stay-at-home order was initiated March 24. At this time there were 1170 confirmed cases in King County and 614 confirmed cases in Snohomish County (confirmed COVID-19 case count data from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University). Testing was limited, so this is a lower bound on actual cases. Of the sequences sampled by March 24, 95% from King County (153/161) and 100% from Snohomish County (33/33) were the original D614 form (Part B, details at cov.lanl.gov/). By mid-April, D614 was rarely sampled. Whatever the geographic origin of the G614 variants that entered these counties, and whether one or multiple G614 variants were introduced, the rapid expansion of G614 variants occurred in the framework of well-established local D614 variant epidemics. Santa Clara county is one of the two exceptions to the pattern of D614 decline in Figure 1B; details are provided in Figures S4A and S4C.
We developed two statistical approaches to assess the consistency and significance of the D614-to-G614 transition. In general, to observe a significant change in the frequency of variants in a geographic region, three requirements must be met. First, both variants must at some point be co-circulating in the geographic area. Second, there must be sampling over an adequate duration to observe a change in frequency. Third, enough samples must be available for adequate statistical power to detect a difference. Both of our approaches enable us to systematically extract all GISAID local and regional data that meet these three requirements.
Our first approach requires that there be an “onset,” defined as the first day where the cumulative number of sequences reached 15 and both forms were represented at least 3 times; we further require that there be at least 15 sequences available at least 2 weeks after onset. Each geographic region that meets these criteria is extracted separately based on the hierarchical geographic/political levels designated in GISAID (Figure 1B). A two-sided Fisher’s exact test compares the counts in the pre-onset period with the counts after the 2-week delay period and provides a p value against the null hypothesis that the fraction of D614 versus G614 sequences did not change. All regions that met the above criteria and that showed a significant change in either direction (p < 0.05) are included. Almost all shifted toward increasing G614 frequencies: 5 of 5 continents, 16 of 17 countries (two-sided binomial p value of 0.00027), 16 of 16 regions (p = 0.00003), and 11 of 12 counties and cities (p = 0.0063).
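The first approach can be sketched in a few lines of standard-library Python. This is an illustrative reading of the criteria above, not the analysis code used for Figure 1B: the function names and the per-day `(D614, G614)` input layout are hypothetical, and the sketch assumes that the pre-onset period includes the onset day itself and that the two-sided binomial test doubles the upper tail under a null probability of 0.5.

```python
from math import comb


def find_onset(daily_counts, min_total=15, min_each=3):
    """daily_counts: list of (d614, g614) per calendar day, in date order.
    Return the index of the first day on which the cumulative total reaches
    min_total and each form has been seen at least min_each times."""
    d_cum = g_cum = 0
    for day, (d, g) in enumerate(daily_counts):
        d_cum += d
        g_cum += g
        if d_cum + g_cum >= min_total and min(d_cum, g_cum) >= min_each:
            return day
    return None


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


def onset_shift_test(daily_counts, gap_days=14, min_after=15):
    """p value for a change in the D614:G614 ratio, or None if the region
    does not meet the inclusion criteria."""
    onset = find_onset(daily_counts)
    if onset is None:
        return None
    before = daily_counts[: onset + 1]       # ASSUMED to include the onset day
    after = daily_counts[onset + gap_days:]  # at least 2 weeks after onset
    d1, g1 = sum(d for d, _ in before), sum(g for _, g in before)
    d2, g2 = sum(d for d, _ in after), sum(g for _, g in after)
    if d2 + g2 < min_after:
        return None
    return fisher_exact_two_sided(d1, g1, d2, g2)


def binomial_two_sided(k, n):
    """Two-sided binomial p value for k of n regions shifting one way,
    under the null that either direction is equally likely."""
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Under these assumptions, `binomial_two_sided(16, 17)`, `binomial_two_sided(16, 16)`, and `binomial_two_sided(11, 12)` evaluate to approximately 0.00027, 0.00003, and 0.0063, matching the values reported above for countries, regions, and counties and cities.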
In Figure 2 (Europe), Figure S2 (North America), and Figure S3 (Australia and Asia), we break down the relationships shown in Figure 1B in detail. The G614 variant increased in frequency even in regions where D614 was the clearly dominant form of a well-established local epidemic when G614 entered the population. Examples of this scenario include Wales, Nottingham, and Spain (Figure 2); Snohomish county and King county (Figure S2); and New South Wales, China, Japan, Hong Kong, and Thailand (Figure S3). Although introduction of a new variant might sometimes result in emergence of the new form because of stochastic effects or serial re-introductions or apparent emergence because of sampling biases, the consistency of the shift to G614 across regions is striking. The increase in G614 often continued after national stay-at-home orders were implemented and, in some cases, beyond the 2-week maximum incubation period.
Figure S3. The Increasing Frequency of the D614G Variant over Time in Australia and Asia, Related to Figures 1 and 2
This figure complements Figure 2 and Figure S2, and Figure 1 has details about how to read these figures. The plot representing national sampling in Australia is on the left, with two regional subsets of the data on the right. In each case, a local epidemic started with the D614 variant, and although D614 was well established, the G614 variant soon dominates the sampling. Only limited recent sampling from Asia is currently available in GISAID; to include more samples on the map, the 10-day period of March 11-20 is shown rather than the period of March 21-30. Even the limited sampling in mid-March supports the repeated pattern of a shift to G614. The Asian epidemic was overwhelmingly D614 through February, and despite this, G614 repeatedly becomes prominent in sampling by mid-March.
We found two exceptions to the pattern of increasing G614 frequency in Figure 1B; details regarding these cases are shown in Figure S4. The first is Iceland. Changes in sampling strategy during a regional molecular epidemiology survey conducted through the month of March 2020 might explain this exception (Gudbjartsson et al., 2020). In early March 2020, only high-risk people were sampled, the majority being travelers from countries in Europe where G614 dominated. In mid-March 2020, screening began to include the local population; this coincided with the appearance of the D614 variant in the sequence dataset. The second exception is Santa Clara county, one of the most heavily sampled regions in California (Deng et al., 2020). The D614 variant dominates sequences from the Santa Clara Department of Public Health (DPH) to date; the G614 variant was apparently not established in that community. In contrast, a smaller set of Santa Clara county sequences, sampled from mid-March to early April 2020, were specifically noted to be from Stanford; the Stanford samples had a mixture of both forms co-circulating (Figure S4), suggesting that the two communities in Santa Clara County are effectively distinct. A June 19, 2020 GISAID update for several California counties is provided in Figure S4C, and the G614 form is present in the most recent Santa Clara DPH samples.
Figure S4. Two Exceptions to the Pattern of Increasing Frequency of the G614 Variant over Time, from Figure 1B, Related to Figure 1
A. Details regarding Santa Clara county, the only exceptional pattern at the county/city level in Figure 1B. Many samples from the Santa Clara County Department of Public Health (DPH) were obtained from March into May, and D614 has steadily dominated the local epidemic among those samples. The subset of Santa Clara county samples specifically labeled “Stanford,” however, were sampled over a few weeks from mid-March through early April and have a mixture of both the G614 and D614 forms. These distinct patterns suggest relatively little mixing between the two local epidemics. Why Santa Clara county DPH samples should maintain the original form is unknown, but one possibility is that they represent a relatively isolated community that had limited exposure to the G614 form, so that G614 may not have had the opportunity to become established in this community, though this may be changing (see Part C). The local stay-at-home orders were initiated relatively early, on March 16, 2020. B. Details regarding Iceland, the only country with an exceptional pattern in Figure 1B. All Icelandic samples are from Reykjavik, and only G614 variants were initially observed there, with a modest but stable introduction of the original form D614 in mid-March. This atypical pattern might be explained by local sampling. The Icelanders conducted a detailed study of their early epidemic (Gudbjartsson et al., 2020), and all early March samples were collected from high-risk travelers from Europe and people in contact with people who were ill; the majority of the traveler samples from early March were from people coming in from Italy and Austria, and G614 dominated both regions. On March 13, they began to sequence samples from local population screening, and on March 15, more travelers from the UK and USA with mixed G614/D614 infections began to be sampled in the high-risk group; those events were coincident with the appearance of D614. C. Updated data regarding California from the June 19, 2020 GISAID sampling. Most of the analysis in this paper was undertaken using the May 29, 2020 GISAID download, but because California was an interesting outlier and more recent sampling conducted while the paper was under review was informative, we have included some additional plots from California data that were available at the time of our final response to review, on June 19th. Informative examples from well-sampled local regions are shown. Stay-at-home order dates are shown as a pink line, followed by a light pink block indicating the maximum two-week incubation time.
N indicates the number of available sequences. Overall California, and specifically, San Diego and San Joaquin, show a clear shift from D614 to G614. The transition for San Joaquin was well after the stay-at-home orders and incubation period had passed. San Francisco shows a trend toward G614.
Santa Clara DPH, which was essentially all D614 in our May 29th GISAID download, had 7 G614 forms sampled in late May that were evident in our June 19th GISAID download. Ventura is an example of a setting that was essentially all G614 when it began to be sampled significantly in early April, so a transition cannot be tracked; i.e., we cannot differentiate in such cases whether the local epidemic originated as a G614 epidemic or whether it went through a transition from D614 to G614 prior to sampling. The figures in Parts A, B, and C can be recreated with more current data at cov.lanl.gov/content/index.