You are correct that if they include a lot of factors, they need enough data to estimate each one properly. They do have a lot of data, though: Opta says they use a database of millions of shots. That still might not be enough if they include too many variables in the model, simply because of the sheer number of possible combinations of variables. From what I’ve read about this, they seem to be pretty disciplined about what they include.
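To put rough numbers on the "combinations" point, here is a quick back-of-the-envelope sketch. The variable names and the number of levels for each are made up for illustration (they are not Opta's actual feature set), and 1,000,000 is just a stand-in for "millions of shots".

```python
from math import prod

# Hypothetical variables and how many distinct levels each could take.
# These are illustrative assumptions, not the real inputs of any xG model.
levels = {
    "distance_bin": 10,
    "angle_bin": 8,
    "body_part": 3,
    "assist_type": 6,
    "game_state": 3,
    "defensive_pressure": 4,
}

n_shots = 1_000_000  # assumed database size, for illustration only

# Every combination of levels is a separate "cell" the model could need data for.
cells = prod(levels.values())
shots_per_cell = n_shots / cells

print(f"{cells} combinations, about {shots_per_cell:.0f} shots each on average")
```

Each extra variable multiplies the number of cells, and real shots are not spread evenly across them, so some combinations will have very few examples even in a huge database.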
In any case, the things the model won’t predict well are most likely rare events that show up only a few times in the training data. But since those shots are rare anyway, it doesn’t matter so much. There is a trade-off: you can include enough variables to predict most events really well (learning from the huge amount of data you do have on common shooting chances in football), or you could in theory limit what the model can predict so that it is “less surprised” by outlier events, but nobody would really want that.
You have an intuition that most goal attempts are “unique”, but xG seems to do pretty well on average. That “uniqueness” is still captured by the model as variation around the expected number of goals; unfortunately, they never present that.
I personally would like to see some information about xG uncertainty, because that WOULD reflect how rare a shot is in the training data. A super rare shot should in theory (assuming a good model) have more uncertainty around its xG.
Suppose we saw an xG range of plausible estimates instead. If all we had in the game was one penalty, the range might be 0.7-0.9: penalties are pretty common, so the uncertainty can be low (a narrow range of plausible values). But if instead of a single penalty there were five half chances with rare events included, we might see something like 0.2-1.4. Both cases could have the same total xG, but by presenting the uncertainty we would have a much better understanding of what the model actually predicts and what type of game we are dealing with.
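Here is a minimal sketch of how such a range could be produced. It is not how Opta or anyone else actually does it. I am assuming each shot's xG is estimated from a count of similar shots in the training data, using a Beta posterior with a uniform prior, and that the shots are independent. The counts and the `xg_range` helper are invented for the example.

```python
import random

def xg_range(shots, n_draws=20000, seed=0):
    """Return a 90% interval for a match's total xG.

    shots: list of (goals, attempts) pairs, the hypothetical number of
    similar shots in the training data and how many of them were scored.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_draws):
        # Draw one plausible xG per shot from Beta(1 + goals, 1 + misses),
        # then sum the draws to get one plausible match total.
        total = sum(
            rng.betavariate(1 + goals, 1 + attempts - goals)
            for goals, attempts in shots
        )
        totals.append(total)
    totals.sort()
    return totals[int(0.05 * n_draws)], totals[int(0.95 * n_draws)]

# One penalty: a very common shot type, so lots of similar training examples.
penalty_game = [(7800, 10000)]
# Five half chances, each a rare shot type with only 17 similar examples.
half_chance_game = [(2, 17)] * 5

pen_low, pen_high = xg_range(penalty_game)
hc_low, hc_high = xg_range(half_chance_game)
print(f"penalty game:      {pen_low:.2f} to {pen_high:.2f}")
print(f"half-chance game:  {hc_low:.2f} to {hc_high:.2f}")
```

Both games come out at a total xG of about 0.78 on average, but the penalty game gets a very narrow range while the half-chance game gets a wide one, purely because the model has seen far fewer shots like those.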