Dr. Malik Magdon-Ismail, Professor of Computer Science at RPI, has been working on a machine learning model for predicting the impacts of the pandemic on smaller cities, like New York’s Capital Region.
My main research thrust is machine learning. What that means is: I build models and algorithms, and do the mathematics around them. Someone gives me a data set, and they'd like to extract some information, some prediction, usually some projection. So what we do is we build a model, and then we calibrate the model to the data, which we call learning. The data itself doesn't contain the information they want, for example the predictions, but the model that has been calibrated to the data can be used to make those predictions. So that's my main research thrust.
What got me interested in this problem was the fact that early on, in early March, there was a lot of uncertainty about the nature of this pandemic: does asymptomatic infection have a large impact on the way the disease spreads, and so on. So I thought, what does the data say? That got me looking into the data. I started collecting it, I started analyzing it at a very superficial level, and I realized that this is exactly in the sweet spot of my expertise: we have a little bit of data, and we'd like to make predictions on things like when will the infection rates start decreasing, and what will the total number of infections be.
So it's exactly the framework that I typically focus on in research, which is: I have a little bit of data, and I'd like to make predictions. This is a very nice application of that.
Based on the summary that I read before we got on the phone together, it seems that the idea of keeping people home, the social distancing, is vital here, right? Your projections found that if only half of everybody was staying home, the peak would come later, with more people infected in the Capital Region; whereas if three quarters of the population stayed home, we would see an earlier peak with fewer infections. Is that correct?
That's a good summary of what the model predicts. There are two main things going on in the model, as far as how we're modeling the pandemic. One is how quickly the virus spreads. The other is how many people the virus has available to infect.
Typically, when you observe such a pandemic, you'll see a steep rise in the number of infections; then it, as we say, flattens out, and then you start seeing a decrease in the number of infections. The steep rise is telling you how quickly this thing can infect people. But as it infects people, it eventually runs out of people to infect, and that's when you get the flattening out and the slowing down of the infection rates. If you keep more people at home, you're affecting that second thing: how many people the virus has to infect. And indirectly, when you do social distancing, you're also affecting the rate at which it can infect people. So by controlling these two parameters, you can drastically change when the epidemic peaks, and you can drastically change how many people will get infected.
The simple reason why you change how many people get infected is that you've allowed the virus to infect fewer people by keeping more people at home.
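The two levers he describes, the rate of infection and the pool of people available to infect, are the ingredients of a standard SIR-type compartmental model. Here is a minimal sketch of that idea; the function name and all parameter values are illustrative assumptions, not the calibrated model discussed in the interview:

```python
# Minimal SIR-style sketch. The two levers are beta (how quickly the virus
# spreads) and S0 (how many people it has available to infect). All numbers
# below are illustrative, not calibrated values.
def sir(S0, I0, beta, gamma, days, dt=0.1):
    """Integrate a simple SIR model with forward Euler; return daily I(t)."""
    S, I, R = float(S0), float(I0), 0.0
    N = S0 + I0
    traj = []
    for _ in range(days):
        for _ in range(int(1 / dt)):
            new_inf = beta * S * I / N * dt   # infections this time step
            new_rec = gamma * I * dt          # recoveries this time step
            S -= new_inf
            I += new_inf - new_rec
            R += new_rec
        traj.append(I)
    return traj

# Keeping more people at home shrinks S0: the peak arrives earlier and is lower.
open_traj = sir(S0=100_000, I0=10, beta=0.4, gamma=0.1, days=200)
closed_traj = sir(S0=25_000, I0=10, beta=0.4, gamma=0.1, days=200)
```

With the smaller susceptible pool, the simulated peak is both earlier and lower, matching the qualitative behavior he describes.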
Governor Cuomo has been talking about that in the last couple of days, as we speak, saying that if you can keep the number of people that the average person infects under one, at 0.9, then you can stop the spread. But if that goes over one, to 1.2 people apiece, then it starts to become a really hard math problem. Is that basically right?
That's more or less right. There's a famous number in epidemic spreading models, which is the number one, and you can think about it this way. If, before I get cured, I infect more than one person, say two people, then once I'm cured I disappear, but two people appear, and the thing has grown. On the other hand, if you can get me down to below one, then gradually the epidemic will disappear. So this critical threshold of one is crucial to being able to hold the infection at bay.

How will the model change now that there's more testing available, even in places like the Capital Region?

OK, so the model that I've been using is primarily based on what I call early data: the infections as they were coming back early in the pandemic, up to, let's say, April 10 or 11.
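The threshold of one he describes can be illustrated with a simple branching calculation: if each infected person infects R others on average before recovering, the expected caseload in generation n scales like R to the power n. A toy sketch, with purely illustrative numbers:

```python
# Expected number of active cases per generation if each case infects R
# others on average before recovering. Purely illustrative numbers.
def generations(R, initial=100, n=20):
    cases = [float(initial)]
    for _ in range(n):
        cases.append(cases[-1] * R)   # each generation multiplies by R
    return cases

dying_out = generations(R=0.9)   # below one: the outbreak shrinks away
growing = generations(R=1.2)     # above one: the outbreak compounds
```

Twenty generations at R = 0.9 shrink 100 cases to about a dozen, while R = 1.2 multiplies them nearly fortyfold, which is why the difference between 0.9 and 1.2 matters so much.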
And I'm basing it on the question of who goes to get tested. The people we're seeing test COVID positive are the people who were serious enough to go and get tested. That's one of the assumptions behind this model: you go to get tested when you feel sick. Later on we might start implementing, and some countries have implemented, random testing, where you don't just test people who feel sick; you test people in a random way. That will affect the data, and it will also affect the modeling and the modeling assumptions.
And if we were able to test everybody every day, then we would be able to control the disease much more efficiently than in the current state, where we really only test people who come to us when they're sick.
And also, I think it should be pointed out, we don't actually know if 75% of the people are staying at home, right?
Oh, we do not. That's one of the things the model cannot learn from the data. This is where you can do what we call scenario analysis. I have analyzed the scenario where 75% of the people stay at home; we can analyze the scenario where 80% stay at home, or 50%, or 20%. What's the right number? It's hard to judge, and from my perspective, I'm doing the scenario analysis, but other people might have a lot more insight on what exactly that number is. And if you give me that exact number, then I can plug it in.
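Scenario analysis of this kind amounts to re-running the same model with different values for the stay-at-home fraction and comparing the outcomes. A hedged sketch using a toy SIR-style simulator, with illustrative parameters rather than his calibrated ones:

```python
# Scenario analysis: rerun a toy epidemic simulator for several stay-at-home
# fractions and compare the peaks. All parameters are illustrative.
def simulate(stay_home, pop=1_000_000, beta=0.35, gamma=0.1, days=300):
    S = pop * (1 - stay_home)      # susceptibles still circulating
    I, R = 10.0, 0.0
    N = S + I
    infections = []
    for _ in range(days):
        new_inf = beta * S * I / N
        new_rec = gamma * I
        S -= new_inf
        I += new_inf - new_rec
        R += new_rec
        infections.append(I)
    return infections

for frac in (0.2, 0.5, 0.75, 0.8):
    traj = simulate(frac)
    peak = max(traj)
    print(f"{frac:.0%} at home: peak {peak:,.0f} on day {traj.index(peak)}")
```

Plugging in a different fraction is a one-line change, which is the sense in which "if you give me that exact number, I can plug it in."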
Are any Capital Region organizations or governments taking this data from you to help plan that you're aware of?
Not that I'm aware of. It's available, and I'm happy to help out in whatever way I can, because this is the only way I can help out. I'm not on the front lines; I can just help with understanding what's going on.
Did I send you an email?
Yeah, I'm looking at some photos right now. I'm not gonna pretend to know exactly what I'm looking at, but maybe you can explain.
Yes, you should see a bunch of red dots.
So let me tell you a little bit about what those red dots are. You can see the date, so for example 3/30, March 30. If you go up to the dot that corresponds to 3/30, it sits somewhere around two on the vertical axis, but you have to multiply by 10, because the scale is a times-10 scale. Got it?
So that red dot corresponding to 3/30 means approximately 20 confirmed infections on 3/30. Now, why do I say approximately? Because the number that I have here is the number that was reported by the DOH. The DOH reports the number when they get the confirmed positive test; we don't know when that test actually became a confirmed positive, and we don't know when the test was requested. And it's approximate because it's only the people who went to get tested who are coming back with confirmed positives. We don't know how many people are out there who are positive.
OK, yeah, that's been the big problem here, obviously.
So you look at these red dots, and they show you how many confirmed infections we've had every day, going up to about April 10. Now I'm going to ask you to look at these dots and tell me what's going to happen. And you're probably going to tell me, well, that's a very hard problem; it doesn't even look solvable. That's what we're trying to solve with the machine learning approach, which says: let's take an epidemiological model, a simple one, and try to explain this data using that simple model. And the model can explain this data.
But we'd really like to know what's going on on May 29. Unfortunately, the data doesn't go out to May 29. That's where the model comes in. The model says: listen, you don't have data to May 29, but the model, which explains the data you have so far, can still give you something to look at for May 29.
Now you'll see this magical black line. That black line is the model; in the language of machine learning, we might say it's been fit to the red dots, so it explains the red dots. You can see that it goes up steeply, and you can probably guess that the steep decline corresponds to the first episode of social distancing. It starts going up again, then it spikes down, that's the lockdown, and then it starts going up again. So the black line is the model. Now let me emphasize: the model has only been calibrated to the red dots, but the model exists everywhere. The model says: I fit the data, I'm consistent with what you've seen so far, and here's what I say later on. Then we use the model to project, because we don't have data where we really want to know what's going on, and in this case the model is indicating that we're going to peak somewhere around late May, early June.
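The fitting step he describes, calibrating a model to the red dots, can be sketched as a least-squares minimization. This toy version uses a simple geometric-growth model, synthetic data, and a coarse grid search; his actual model and fitting procedure are more sophisticated:

```python
# Calibrate a toy growth model to daily case counts by least squares.
# The observed list is synthetic; the real input would be DOH daily counts.
def model(day, c0, r):
    """Toy model: cases grow geometrically at rate r from a baseline c0."""
    return c0 * (1 + r) ** day

observed = [5, 6, 8, 9, 12, 14, 18, 21, 27, 32]   # synthetic daily counts

def sse(c0, r):
    """Sum of squared errors between the model and the observed counts."""
    return sum((model(d, c0, r) - y) ** 2 for d, y in enumerate(observed))

# "Learning" here is just minimization: pick the parameters that best
# explain the data, via a coarse grid search over (c0, r).
c0, r = min(
    ((b, i / 100) for b in range(1, 11) for i in range(5, 50)),
    key=lambda p: sse(*p),
)
projection_day_30 = model(30, c0, r)   # the fitted model exists everywhere,
                                       # so it can project past the data
```

The key point is the last line: the model is only calibrated against the observed days, but once fitted it can be evaluated on any future day.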
I guess the question is: are you updating with new data beyond the original red dot inputs?
Yes. The DOH is updating the data daily, as it gets reports every day, and every one or two days, when I get a new red dot, I can put it in, rerun the numbers, rerun the model, and the predictions might change slightly. The analogy is like this: I want to know how much rainfall will occur in Albany by the end of the summer, and I have a nice weather model. It's going to give me a prediction for the end of the summer, and that prediction is going to have a very large range, which corresponds to the gray region you see in the picture. As I get closer to that date, incorporating more data from the more recent days, cloud cover, what have you, my prediction will get more and more accurate.

So it would appear that the social distancing policies, the shutting down of social life, more or less, according to the model, are going to have the desired effect by May or June, and the number of cases will level off. We're talking about the Capital Region here.

Yes.
So presumably, and maybe you don't want to answer this because it's speculation, but when we've heard discussion about a second wave, that is because the social distancing policies are relaxed, and then more people are coming in contact with other people who may or may not be positive, and that might cause this model to adjust upward, right?
Yeah. So what's happening here is, there's this black curve and the gray region, and I'll tell you more about the gray region a little later; let's focus on the black curve. This black curve is based on two things. Remember, I talked about how many people are available for the virus to infect, and the rate at which it infects people. The rate at which it infects people, the model tries to figure out from observing how the red dots are behaving. However, the way the red dots are behaving doesn't tell you much about how many people are available to be infected. That's a number that I'm putting in as 75%, and it's a guess to some extent. I'm using what I see around me, and I'm using estimates from mobility data that Google has put out, which suggest that non-essential business in the Albany region has dropped by somewhere between 60 and 70%. So I'm using an educated guess that 75% of people are staying at home, and that's why we get this curve. Now, one of the uses we can put this model to is to say: suppose on May 29, when we think we are peaking, we relax the 75% restriction and allow people, in some phased manner, to slowly go back to normal life. Say we come down from 75% staying at home to 50% staying at home. It's easy enough to plug that number, 50% instead of 75%, into the model and see what will happen to this curve. And what you're saying is true. One of the things that can happen is that the turnaround and the decay slow down a little, because all of a sudden you've given the virus more people to infect, and it's just going to go around and infect them.
Or it could even restart and continue to rise, because you've given it way more people to infect, and instead of peaking, it says, hey, I've got a new pasture, and I'm going to go and infect that. So these are some of the things that could play out. And it depends a lot on something people are talking about, which is: even though you peaked at the end of May with a certain number of confirmed positives, how many people out there have been unconfirmed positives, asymptomatic positives, who have recovered and built what people are terming herd immunity to this virus? What that means is that even though you've opened up the population a little more, there's also a group of people who are immune, who might be able to buffer the newly opened-up uninfected people from the people who are infected, and prevent the infection from crossing, just because they're immune.
So, all right, what is the gray area representing?
Ah, the gray area is, to make a long story short, uncertainty.
So if you ask me for a prediction on May 29, I'm going to go to the black curve and say: I think on May 29, or whenever at the end of May, we're going to have around 700 infections per day, approximately, but we'll have peaked, and it's going to start dropping. What the gray region addresses is: how did I get that black curve? I looked at the red dots and said, that's the data; find me the model that's consistent with the data. Then I use that model to predict forward. But when I look at the data, I find that there are a lot of different models, minor tweaks of each other, that are all just as consistent with the data. So the question is: what should I do with all these models that look more or less equivalent when it comes to explaining the data? The idea is that I should incorporate all of those models into this gray region.
So the gray region says there are some models, some scenarios, which peak around mid-June; some scenarios, if you follow the bottom gray curve, peak around the 10th of May. So if you ask me for a range of when the peak would be, it would be around the 10th of May to, let's say, the 10th of June. That gray region represents a whole host of models that are all equally consistent with the data, the red dots. And then you might ask: well, why are there a whole host of models equally consistent with the data? That comes back to the data, and how hard the problem was. When I showed you only the data, you saw the data and you said, this is a hard task, to predict what's going on in the future. That's because the data has a lot of uncertainty in it. You see this jumping up and down in the data, and it's very hard to believe that that's actually what's going on; it's just a bunch of random effects taking place, some having to do with reporting, some having to do with which people decide to go and get tested, and so on. That randomness in the data, that irregular behavior, is absolutely the norm when you look at any real data. And what it usually translates into is that there are going to be many models that are equally good at explaining the data, and you need to take all of those models into consideration when you make a prediction. That's what the gray region is.
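One common way to build such an uncertainty band, offered here as a hedged sketch rather than his exact procedure, is to refit the model to many noisy resamples of the data and take the spread of the resulting projections:

```python
import random

random.seed(0)

def fit_rate(data):
    """Fit a geometric daily growth rate by averaging day-over-day ratios."""
    ratios = [b / a for a, b in zip(data, data[1:])]
    return sum(ratios) / len(ratios)

observed = [5, 6, 8, 9, 12, 14, 18, 21, 27, 32]   # synthetic daily counts

# Refit on many noisy versions of the data; each refit projects out to day 30.
# Each noisy copy plays the role of one "equally consistent" model.
projections = []
for _ in range(500):
    noisy = [max(1.0, y + random.gauss(0, 0.1 * y)) for y in observed]
    projections.append(observed[0] * fit_rate(noisy) ** 30)

projections.sort()
lo, hi = projections[12], projections[487]   # an approximate 95% band
```

The band from lo to hi is wide because small tweaks to the early data yield models whose 30-day projections differ substantially, which is exactly the point about the gray region.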
And so your model is saying it's plausible, on the one hand, that we could see some number of cases well above that line, or some number of cases well below it. That is, they all have a fair chance of happening, the same reasonable chance of happening.
I don't want to get you in trouble here, so I want to say clearly that this is me saying this. But Governor Cuomo, in a press conference on Friday, said that when President Trump was criticizing him for his request for all the ventilators the state was seeking, that request was based on a model, and the model ended up not being what happened in reality. So I'm imagining that people shouldn't misunderstand that: a high estimate doesn't necessarily mean that it was the wrong policy.
Well, this is really outside my expertise. But yes, a high estimate doesn't necessarily mean wrong policy.
So that sets me up to ask this question. What do you wish that the average person out there understood about how these predictions are made and how this work is done?
So, yeah, one of the reasons why, for example, the average person might look at this and say: you know what, you got it wrong by more than 100.

That's what I'm getting at, exactly.

Yeah. There are two points that I'd like to make. The first is that this is an inherently hard problem. You've got a little bit of data, because we are trying to make predictions very early in the game. There's no point in looking back a year later and saying, oh, we could have done this and that. The real challenge for modelers, public health officials, epidemiologists, and so on is to make a prediction after, let's say, 10 or 20 days of watching what's going on, not 120 days. And typically, when you have a little bit of data, there is going to be a large range in the predictions. One of the things that is useful for the public to keep in mind is that this range exists partly because we're trying to make such an early prediction. So then why make the prediction at all? Because we need to base our actions on something, and something that can come out of these red dots is things like these ranges.
So if we explain that these ranges are what they are, and you can set your expectations to these ranges, then people might understand the context in which we're trying to do this modeling. The analogy I give is: ask a weatherman to tell you how much rain there will be in the Capital District by the end of the summer. OK, rain in Albany is nothing compared to coronavirus, but you can see that it's very hard to predict the total amount of rain over an extended period, and there will be a large range in that prediction.
But we've grown accustomed to weather predictions having these uncertainties. So when we look at the temperature a week from now, and we observe that it was 10 degrees off from what was predicted a week ago, we don't feel so bad, because it's just the temperature; that's the inherent nature of the problem, and we've gotten used to it. Whereas in this case, it's a very high-impact prediction, and though we'd like to have it as accurate as we can, the reality is that the data doesn't allow it.