How many people does a sample of wastewater represent?
Researchers have developed a machine learning model that uses the assortment of microbes found in wastewater to determine how many individual people a sample represents.
Research from Fangqiong Ling’s lab showed earlier this year that the amount of SARS-CoV-2 in a sanitation system correlated with the burden of disease – COVID-19 – in the region it served.
But before that work could be done, Ling needed to know: How can you determine the number of individuals represented in a random sample of wastewater?
A chance encounter with a colleague helped Ling, an assistant professor in the Department of Energy, Environmental, and Chemical Engineering at the McKelvey School of Engineering at Washington University in St. Louis, create a method for doing exactly that.
In the future, this method may be able to link other properties of wastewater to data about individuals.
The problem was simple: “If you only take one measurement of wastewater, you don’t know how many people you are measuring,” Ling said. This goes against the way studies are usually designed.
“Usually when you design your experiment, you design your sample size, you know how many people you’re measuring,” says Ling. Before she could look for a correlation between SARS-CoV-2 and the number of people with COVID, she needed to figure out how many people were represented in the water she was testing.
Initially, Ling thought that machine learning might be able to uncover a direct relationship between the diversity of microbes and the number of people a sample represented, but the simulations, performed with “off-the-shelf” machine learning, did not work.
Then Ling had a chance meeting with Likai Chen, assistant professor of mathematics and statistics. Both realized they shared an interest in working with new and complex data. Ling mentioned that she was working on a project Chen might be able to contribute to.
“She told me about the problem and I told her that it was indeed something we could do,” says Chen. It turns out that Chen was working on a problem that used a technique that Ling also found useful.
The key to determining how many individual people are represented in a sample is that the larger the sample, the more likely it is to resemble the average. In reality, individuals tend not to be exactly “average.” Therefore, if a sample looks like an average sample of microbiota, it is likely to be made up of many people. The further a sample is from the mean, the more likely it is to represent a single individual.
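The intuition can be illustrated with a toy simulation. The sketch below is not the researchers' algorithm. It assumes each hypothetical person's microbiome is a random mix of 20 made-up microbe types and that a pooled sample is a simple average of the people in it. All function names and parameters are illustrative.

```python
import random


def simulate_individual(rng, n_taxa):
    """Draw one hypothetical person's relative abundances (they sum to 1)."""
    weights = [rng.gammavariate(0.5, 1.0) for _ in range(n_taxa)]
    total = sum(weights)
    return [w / total for w in weights]


def pooled_sample(rng, n_people, n_taxa):
    """Average the profiles of n_people individuals into one mixed sample."""
    people = [simulate_individual(rng, n_taxa) for _ in range(n_people)]
    return [sum(p[t] for p in people) / n_people for t in range(n_taxa)]


def distance_from_mean(sample, mean_profile):
    """Euclidean distance between a sample and the population-average profile."""
    return sum((s - m) ** 2 for s, m in zip(sample, mean_profile)) ** 0.5


def mean_distance(n_people, n_taxa=20, trials=100, seed=0):
    """Average distance from the population mean over many pooled samples."""
    rng = random.Random(seed)
    # Every taxon is drawn the same way, so the expected profile is uniform.
    mean_profile = [1.0 / n_taxa] * n_taxa
    dists = [
        distance_from_mean(pooled_sample(rng, n_people, n_taxa), mean_profile)
        for _ in range(trials)
    ]
    return sum(dists) / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "people: average distance from mean =", round(mean_distance(n), 4))
```

In this toy setup, samples pooled from more people sit closer to the average profile, which is the pattern the method relies on. Real wastewater data is far messier, which is why the researchers needed a purpose-built statistical approach.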
“But now we’re dealing with high-dimensional data, aren’t we?” Chen said. There are an almost infinite number of ways to group these different microbes together to form a sample. “So that means we have to figure out how to aggregate this information across different sites?”
Using this basic intuition – and a lot of math – Chen worked with Ling to develop a more tailored machine learning algorithm. Once trained on real microbiota samples from more than 1,100 people, it could determine how many people were represented in a wastewater sample (these samples were unrelated to the training data).
“It’s much faster and you can train it on a laptop,” Ling says. And it is useful for more than the microbiome: given enough examples (training data), the algorithm could also use viruses from the human virome or metabolic chemicals to link individuals to wastewater samples.
“This method was used to test our ability to measure population size,” says Ling. But it goes much further. “Now we are developing a framework to enable validation across studies.”
The research appears in the journal PLOS Computational Biology.
Source: Washington University in St. Louis