An American scientist has incited a new skirmish over the origin of the coronavirus, reporting that he has retrieved potentially significant genetic data about SARS-CoV-2 that had been stored in, and later deleted from, a digital archive at the National Institutes of Health.
Jesse Bloom, a computational biologist at the Fred Hutchinson Cancer Research Center in Seattle, posted his findings on the preprint server bioRxiv, where papers that have not yet been peer-reviewed or published in a journal have been landing by the thousands since the start of the pandemic.
The scientific significance of Bloom’s research remained unclear Wednesday, but it stirred instant online reaction, favorable and unfavorable alike, among scientists who have been debating the flurry of theories about the initial coronavirus outbreak.
“I recognize this is a hot-button topic,” Bloom said in an interview with The Washington Post. “It’s not a highly traditional scientific study, but at least it has some new data and new information.”
Bloom, who retrieved the data through Google Cloud, does not claim that it advances one theory or another. But he contends it bolsters evidence that the virus was circulating in Wuhan, China, before a December outbreak of covid-19, the illness caused by the virus. That outbreak was linked to a market selling live animals.
What is not in dispute is that raw data was deleted from a database at the NIH. Processed forms of the same data were included in a preprint paper from Chinese scientists posted in March 2020 and, after peer review, published that June in the journal Small.
The NIH released a statement Wednesday saying that a researcher who originally published the genetic sequences asked for them to be removed from the NIH database so that they could be included in a different database. The agency said it is standard practice to remove data if requested to do so. The NIH statement did not identify the scientist who requested that the material be excised from the agency’s sequence read archive, known as the SRA.
“These SARS-CoV-2 sequences were submitted for posting in SRA in March 2020 and subsequently requested to be withdrawn by the submitting investigator in June 2020. The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA to avoid version control issues,” the NIH said.
The statement said the NIH “can’t speculate on motive beyond a submitter’s stated intentions.”
Bloom’s paper acknowledges that there are benign reasons why researchers might want to delete data from a public database.
The data cited by Bloom are not alone in being removed by the NIH during the pandemic. The agency, in response to an inquiry from The Post, said the National Library of Medicine has so far identified eight instances since the start of the pandemic when researchers had withdrawn submissions to the library.
“This one from China and the rest from submitters predominantly in the U.S.,” the NIH said in its response. “All of those followed standard operating procedures.”
Bloom said in an email to The Post that he was not accusing the NIH of wrongdoing. But Bloom’s online paper suggests the deletion of data violates scientific norms and the code of trust essential to science. On Twitter, Bloom said the data was also taken down from a Chinese database.
“Certainly, the consequence of removing the sequences was to obscure their existence,” Bloom told The Post in the interview.
In the preprint, he wrote that “the current study suggests that at least in one case, the trusting structures of science have been abused to obscure sequences relevant to the early spread of SARS-CoV-2 in Wuhan.”
Efforts by The Post to reach the senior author of the sequencing paper have been unsuccessful.
Robert Garry, a Tulane University virologist who co-wrote an influential March 2020 paper saying SARS-CoV-2 was a natural virus and not engineered, took issue with the new Bloom paper. Among his criticisms: The key data from the China study, a list of mutations seen in the virus sequences, has remained available to researchers in an appendix. He said Bloom found the same mutations.
“Jesse Bloom found exactly nothing new that is not already part of the scientific literature,” Garry wrote in an email. He called the Bloom paper “inflammatory.”
Benjamin Neuman, a virologist at Texas A&M University, agreed that the data on mutations remained public. Neuman said he understood Bloom’s goal – to use the raw genomic sequences to construct what is known as a phylogenetic tree of SARS-CoV-2. Such a diagram would show how and when the virus evolved and splintered into different lineages.
“The question is what constitutes adequate publication?” Neuman said in an email. “Is it having access to the data, which we have through [the paper from China], or access to the data in your preferred form, which is what Bloom mined out? It’s the exact same data in refined vs. raw form.”
Bloom is no stranger to the debate over the virus’s origins. He was the lead author of a letter to the journal Science, signed by an additional 17 prominent scientists, that last month criticized a World Health Organization probe into the origins of the virus. The letter called for a deeper investigation of the “lab leak” hypothesis, which asserts that the coronavirus – accidentally or by design – potentially slipped out of a laboratory in Wuhan.
Stanford University microbiologist David Relman, another organizer of that letter, said of Bloom’s findings: “It shows how critical it is that early data be sought, preserved, and shared in trying to infer virus evolutionary paths and origins, since early data are always sparse to begin with, and since analyses are therefore so sensitive to specific data that happen to be available.”
In his paper, Bloom does not claim that the data he retrieved advances the argument for a lab leak or a natural zoonosis.
“This study provides no evidence either way,” Bloom said in an email. “But it does indicate that we probably have not exhausted all relevant data.”
Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic
Jesse D Bloom
doi: https://doi.org/10.1101/2021.06.18.449051
Abstract
The origin and early spread of SARS-CoV-2 remains shrouded in mystery. Here I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH's Sequence Read Archive. I recover the deleted files from the Google Cloud, and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic analysis of these sequences in the context of carefully annotated existing data suggests that the Huanan Seafood Market sequences that are the focus of the joint WHO-China report are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2's bat coronavirus relatives.
Competing Interest Statement
The author consults for Moderna on SARS-CoV-2 evolution and epidemiology, consults for Flagship Labs 77 on viral evolution and deep mutational scanning, and has the potential to receive a share of IP revenue as an inventor on a Fred Hutch licensed technology/patent (application WO2020006494) related to deep mutational scanning of viral proteins.