How Scientists Discovered 85,000 Viral Species in Leftover Data

[Phage viruses attempt to infect a cell. Image via Wikimedia Commons.]

In 2015, the largest database of genetic information in the world, the National Center for Biotechnology Information (NCBI), had complete genomes for 45,000 species of bacteria but only 2,200 genomes from viruses.

Viruses outnumber bacteria in every habitat researchers have sampled. In fact, they outnumber stars in the universe and grains of sand on planet Earth. Scientists tend to zero in on the handful of viruses that threaten human lives, but we know almost nothing about the vast majority of our invisible, arguably non-living neighbors.

Last week, a paper in Nature announced that scientists had identified 85,000 previously undiscovered viral species by combing through leftover data from environmental DNA samples. Many of those viruses appear to infect bacteria and other microbes that we’ve never seen come down with an infection before.

No spiffy new virus-capturing techniques were required; the researchers, led by Nikos Kyrpides and David Paez-Espino, simply used existing data collected by previous scientific projects. Scientists gather environmental samples all the time. When microbiologists want to see whether a bacterial species lives in people’s mouths, they do a cheek swab. When marine biologists track the spread of algae-killing viruses, they scoop up samples of ocean water. But human mouths and open oceans are both home to complex microbial ecosystems. When scientists sequence the DNA from their organism of interest, they often end up sequencing the DNA from many of the other microbes in their sample, too.

Most of the time, the data about off-target species isn’t used in the original study, but sometimes scientists add their raw environmental DNA data (aka “metagenomic data”) to publicly available databases.

Kyrpides and Paez-Espino, who both work at the Department of Energy’s Joint Genome Institute in Walnut Creek, California, had access to a vast database. “The largest amount of data was in metagenomic sequences,” said Kyrpides. “We were very interested in mining all of this information.”

The habitats in the data they used ranged from deep sea hydrothermal vents to human guts, and from forest soil to synthetic environments like petri dishes. Freshwater lakes, saltwater lakes, human mouths, open oceans, sewage, swamps, termite guts, and more were all represented in the data they crunched.
