How Scientists Discovered 85,000 Viral Species in Leftover Data

[Phage viruses attempt to infect a cell. Image via Wikimedia Commons.]

In 2015, the largest database of genetic information in the world, the National Center for Biotechnology Information (NCBI), had complete genomes for 45,000 species of bacteria but only 2,200 genomes from viruses.

Viruses outnumber bacteria in every habitat researchers have sampled. In fact, they outnumber stars in the universe and grains of sand on planet Earth.  Scientists tend to zero in on the handful of viruses that threaten human lives, but we know nothing about the vast majority of our invisible, arguably non-living neighbors. 

Last week, a paper in Nature announced that scientists had identified 85,000 previously undiscovered viral species by combing through leftover data from environmental DNA samples. Many of those viruses appear to infect bacteria and microbes that we’ve never seen come down with an infection before.

No spiffy new virus-capturing techniques were required; the researchers, led by Nikos Kyrpides and David Paez-Espino, simply used existing data collected by previous scientific projects. Scientists gather environmental samples all the time. When microbiologists want to see if a bacterial species lives in people’s mouths, they do a cheek swab. When marine biologists track the spread of algae-killing viruses, they scoop up samples of ocean water. But human mouths and open oceans are both home to complex microbial ecosystems. When scientists sequence the DNA from their organism of interest, they often end up sequencing the DNA from many of the other microbes in their sample, too.

Most of the time, that data about off-target species isn’t used in the original study, but sometimes scientists add their raw environmental DNA data (aka “metagenomic data”) to publicly available databases.

Kyrpides and Paez-Espino, who both work at the Department of Energy’s Joint Genome Institute in Walnut Creek, California, had access to a vast database. “The largest amount of data was in metagenomic sequences,” said Kyrpides. “We were very interested in mining all of this information.”

The habitats in the data they used ranged from deep-sea hydrothermal vents to human guts, from forest soil to synthetic environments like petri dishes, and everywhere in between. Freshwater lakes, saltwater lakes, human mouths, open oceans, sewage, swamps, termite guts, and more were all represented in the data they crunched.

To spot the viral sequences, the team assembled a set of 1,800 genetic markers from known viruses and had the computer sift through the metagenome data, the sum total of all the DNA found in the environmental samples, looking for sequences that matched or at least nearly matched. (If there’s a stretch of unidentified DNA bracketed by two known viral genes, that mystery DNA in the middle most likely belongs to an undiscovered viral species.)

[Correction: Paez-Espino wrote to me in an email, “This is not like that. We basically built-up a set of 25 thousand specific viral protein families from (1) known viruses and (2) a set of 1,800 manually-curated metagenomic viral sequences from diverse habitats.”

…So they were looking for clusters of their 25,000+ genes that are known to code for viral proteins.] 

Using that logic and a lot of computing power, they found 125,000 DNA sequences that were probably from viruses. Some were large enough that they could likely be entire genomes; others were just fragments of viral genomes. A few sequences matched existing DNA data on known species, but 99% didn’t.
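For readers who like to see the logic spelled out, here is a toy Python sketch of marker-based screening. It is not the team's pipeline: the family names, motifs, contigs, and the `min_families` cutoff are all made up for illustration, and exact motif matching stands in for the far more sensitive protein-family comparisons a real analysis would use. The idea it shows is simply that a stretch of DNA whose predicted proteins hit several different viral protein families is a good candidate for being viral.

```python
# Toy sketch: flag contigs whose proteins hit several viral marker families.
# Exact substring matching is a deliberate simplification of real
# protein-family searches.

# Hypothetical marker motifs for made-up viral protein families.
VIRAL_FAMILY_MOTIFS = {
    "capsid_like": ["MKTAYIAK", "GGSLNDV"],
    "terminase_like": ["WDEHRRYL"],
    "portal_like": ["QPLFENNV"],
}


def families_hit(proteins, family_motifs=VIRAL_FAMILY_MOTIFS):
    """Return the set of marker families with a motif in any of the proteins."""
    hits = set()
    for family, motifs in family_motifs.items():
        if any(motif in protein for protein in proteins for motif in motifs):
            hits.add(family)
    return hits


def flag_viral_contigs(contigs, min_families=2):
    """Keep contigs whose proteins hit at least `min_families` marker families."""
    flagged = {}
    for name, proteins in contigs.items():
        hits = families_hit(proteins)
        if len(hits) >= min_families:
            flagged[name] = sorted(hits)
    return flagged


# Made-up contigs, each represented by its predicted protein sequences.
contigs = {
    "contig_A": ["AAMKTAYIAKLL", "PPWDEHRRYLTT", "MSTNQ"],
    "contig_B": ["MSTNQ", "LLGGSLNDVKK"],
    "contig_C": ["MSTNQ", "HHHHH"],
}

print(flag_viral_contigs(contigs))
```

Here only `contig_A` clears the bar, because it carries markers from two different families; `contig_B` has a single hit and `contig_C` has none. Requiring more than one family is one simple way to cut down on false alarms, though the right threshold is a judgment call.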

Kyrpides argues that scientists miss out on a lot of information by only searching for the species they already know about. “That has been a huge mistake, because we’re losing the connectivity,” says Kyrpides. “If you ask, ‘How do viruses move from the environment to human skin and back?’ We have no idea!”

The result of their data crunching was a sprawling study, which yielded several surprises.  But the most important takeaway is simply that viruses are much more diverse than researchers have given them credit for. 

Matchmaker, Matchmaker, CRISPR me a catch!

For the information on the tens of thousands of new viral species they identified to be useful, Kyrpides and his colleagues needed to understand where these species fit into ecosystems of viruses and hosts. That meant figuring out which host organisms these viruses infect. 

“There are some bacteria that we know cause infectious disease. So if we know for the first time that some of the viruses that can kill [disease-causing bacteria], that’s useful for bioengineering,” says Paez-Espino.

Playing matchmaker between viruses and hosts required several strategies. First, Paez-Espino and his colleagues ran their new viral genomes through a computer program that estimated how well the new viruses fit into existing viral families. If the new viral genome looked very similar to a known virus, it stands to reason that both viruses infect the same host. That strategy found matches about 2.4% of the time.
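The similarity strategy can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the sequences, host labels, the k-mer size, and the 0.5 threshold are invented, and comparing shared k-mers (short overlapping snippets of DNA) is a stand-in for whatever similarity measure the actual program used.

```python
# Toy sketch: guess a new virus's host from its most similar known virus.
# k-mer Jaccard similarity is an assumed stand-in for the real comparison.

def kmers(seq, k=4):
    """Return the set of all overlapping length-k snippets in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared items divided by total distinct items (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_host(new_seq, references, k=4, threshold=0.5):
    """Return the host of the best-matching reference virus, or None."""
    new_kmers = kmers(new_seq, k)
    best_host, best_score = None, 0.0
    for ref_seq, host in references.values():
        score = jaccard(new_kmers, kmers(ref_seq, k))
        if score > best_score:
            best_host, best_score = host, score
    return best_host if best_score >= threshold else None


# Hypothetical reference viruses with known hosts.
references = {
    "phage_ref_1": ("ATGGCGTACGTTAGCCTAGGA", "host_A"),
    "phage_ref_2": ("TTTTCCCCGGGGAAAATTTCC", "host_B"),
}

print(predict_host("ATGGCGTACGTTAGCCTAGGT", references))  # near-copy of phage_ref_1
print(predict_host("CACACACACACACACA", references))       # resembles nothing known
```

The second call returns `None`, which is the common case in practice: a virus with no close known relative gets no host assignment from this strategy, consistent with the low match rate reported above.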

Then, Paez-Espino and colleagues turned to the witnesses of past viral infections: the genomes of potential hosts. Many bacteria and archaea carry a genetic immune system called CRISPR, which samples viral genetic material and splices it directly into the bacterial genome for future reference. The spliced-in viral sequences, or “spacers,” are flanked by distinctive DNA sequences called CRISPR repeats. With bacterial genomes from the wild, scientists can pretty safely assume that DNA in between two CRISPR repeats originally came from a virus.

[Side note: Yes. The CRISPR in question is that CRISPR, which has been blowing up biotech news feeds for the past 3 years. The CRISPR-Cas9 system has been grabbing a lot of headlines because of its usefulness as a genetic engineering tool and the legal battle over who discovered it, but CRISPR-Cas splicing is a natural part of life. Its original inventors were bacteria and archaea. Unfortunately, prokaryotes cannot file patents.]

So Paez-Espino and company compared the genetic data from their 125,000 putative viral sequences to bacterial and archaeal genomes in a microbial database, looking for CRISPR-bracketed matches in known bacteria. Several thousand more of the newly discovered species were linked to their infection target that way.
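Here is a minimal Python sketch of that CRISPR matchmaking idea. Everything in it is hypothetical: the repeat sequence, the host genomes, and the viral contigs are invented, and the length limits are arbitrary. It also assumes the repeat is already known and demands exact matches, whereas a real analysis would have to find the repeats first and would tolerate a few mismatches. The code pulls out the DNA sandwiched between repeats (called spacers in the code) and checks whether it appears in any viral sequence, on either DNA strand.

```python
# Toy sketch: link viruses to hosts via DNA found between CRISPR repeats.
# Assumes a known repeat and exact matches, both simplifications.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_spacers(host_genome, repeat, min_len=8, max_len=60):
    """Return the sequences sandwiched between consecutive repeat copies."""
    pieces = host_genome.split(repeat)
    # The first and last pieces lie outside the repeat array, so skip them.
    return [p for p in pieces[1:-1] if min_len <= len(p) <= max_len]


def link_viruses_to_hosts(host_genomes, viral_contigs, repeat):
    """Return sorted (virus, host) pairs where a host spacer matches the virus."""
    links = set()
    for host, genome in host_genomes.items():
        for spacer in extract_spacers(genome, repeat):
            for virus, contig in viral_contigs.items():
                if spacer in contig or reverse_complement(spacer) in contig:
                    links.add((virus, host))
    return sorted(links)


REPEAT = "GTTTTAGAGCTA"  # hypothetical repeat sequence

host_genomes = {
    "host_1": "AAAC" + REPEAT + "ACGTACGTTTGACC" + REPEAT
              + "TTGACCATGCAAGG" + REPEAT + "CCCT",
    "host_2": "CCAA" + REPEAT + "GGGGCCCCAAAACC" + REPEAT + "CCAA",
}

viral_contigs = {
    "virus_X": "TTAA" + "ACGTACGTTTGACC" + "GGAA",
    "virus_Y": "AT" + reverse_complement("GGGGCCCCAAAACC") + "AT",
    "virus_Z": "ACACACACACACACACAC",
}

print(link_viruses_to_hosts(host_genomes, viral_contigs, REPEAT))
```

In this made-up data, `virus_X` is linked to `host_1` and `virus_Y` to `host_2` (via the opposite strand), while `virus_Z` stays unmatched. An unmatched virus is not evidence that it has no host, only that no sampled host genome kept a record of it.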

Many of the bacterial hosts the researchers identified had zero known viral enemies prior to this study. 

Don’t Let the Size Fool You

The research team also discovered 7 giant viruses that infect bacteria.

[Correction: An earlier version of this post said they were the first to discover giant prokaryote-infecting viruses, aka “giant phages”. Actually, one giant bacteriophage species had been discovered previously, but they added 7 newly discovered giant phages to the mix.] 

Viruses  have a reputation for being tiny, so when researchers look for viruses, they often run their sample through a fine mesh with holes so small that no eukaryotic or bacterial cells can get through. That method makes it easier to isolate the teensiest viruses, but it also means that  giant viruses get left out.

Paez-Espino and his colleagues found several viral genomes that would likely code for surprisingly sizeable viruses. The largest one they found is likely half the total length of its bacterial host!  “Using the old protocol, we’re missing a lot of information on the biggest viruses,” says Paez-Espino.

Infectious Cosmopolitanism

The vast, vast majority of viruses do not infect humans. Ever. They simply don’t have the equipment to do so. Bacteria are radically different from our cells, so phages that evolved to target prokaryotes, by and large, don’t hurt us at all.  Even jumping between closely related animal species is pretty difficult for most viruses.

However, a handful of the new viruses turned up in a shockingly broad range of environments. And while it’s possible that those viruses are just common in the labs that process metagenomic data, some seem to be genuinely capable of infecting and thriving under many different conditions.  They left CRISPR scars in wide ranges of microbes.

“We can’t explain why some of the viruses are cosmopolitan and able to infect across so many different lineages,” said Paez-Espino.

But those viruses are probably among the most likely to be singled out for future study.

Discovered by Computer

One notable aspect of this study is that it took place almost entirely in silico, as the scientists say. The team didn’t isolate the physical bodies of any of these new viral species; they inferred the new viruses’ existence based on extensive computation. 

Some of the new species may be red herrings, but the chance of them all being computer-generated mirages is extremely low.

And there are most likely thousands more viruses, without resemblance to known viral species, still lurking in metagenomic databases, waiting to be discovered.

Up until the advent of computers, microbiologists could only spot the minority of bacteria and viruses that can survive and thrive in petri dishes. The field increasingly relies on genomics data to make educated guesses about microbial diversity.

Additionally, our samples of microbes from natural environments only represent a tiny fraction of the total microbiosphere. Even a major experiment like the much-touted marine virus survey, which got a fancy Carl Zimmer write-up in Quanta and everything, took only a handful of samples.

“This is really absolutely the low-hanging fruit,” said Kyrpides. The team concentrated on that low-hanging fruit. “What we found is maybe 10% of what might be around.”

Or it might be an even smaller percentage…


Scientists found 85,000 new virus species by re-analyzing data from previous experiments. There are likely millions and millions of viral species out there that we haven’t discovered yet.
