by Gord
A while back we had a note from a reader who suggested that it was not unreasonable to expect each year that Arsenal should win the title or at least be challenging for the title up to the last few weeks of the season.
In reply Untold showed that such a situation had never been achieved by any club year after year, and indeed in most seasons no club other than the eventual winner was looking like winning until the last few weeks.
To explore this further I started looking at another football situation which looks as if it ought to have a simple, predictable outcome, but in fact doesn’t.
I set up a simple model of a 20 team league. Team A averages 2 goals per game; the other 19 teams average only 1 goal per game. And I had the computer set to let them play the same 38 game season the EPL plays. In fact I ran quite a few “seasons” to see if common sense prevailed.
Now you would obviously expect Team A to win the league every year in the same way that the German league is won most years by Bayern Munich. But what I wanted to know was what would happen if the computer followed the rule of Team A averaging two goals a game and the others one goal per game: would Team A always win?
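As a sketch of what such a model might look like: the original is a couple of hundred lines of Perl, but the same idea can be shown as a small Python toy, using Knuth's method for Poisson deviates. The names and the 200-season count here are illustrative choices, not Gord's, and every run will differ.

```python
import math
import random

def poisson(lam):
    # Knuth's multiplicative method for Poisson random deviates
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def season_points(means):
    # Double round-robin: each team plays the other 19 home and away (38 games)
    n = len(means)
    pts = [0] * n
    for home in range(n):
        for away in range(n):
            if home == away:
                continue
            hg, ag = poisson(means[home]), poisson(means[away])
            if hg > ag:
                pts[home] += 3
            elif ag > hg:
                pts[away] += 3
            else:
                pts[home] += 1
                pts[away] += 1
    return pts

# Team A (index 0) averages 2 goals per game, the other 19 average 1
means = [2.0] + [1.0] * 19
seasons = 200
outright_wins = 0
for _ in range(seasons):
    pts = season_points(means)
    best = max(pts)
    if pts[0] == best and pts.count(best) == 1:
        outright_wins += 1
print(f"Team A won outright in {outright_wins} of {seasons} seasons")
```

Each match is an independent Poisson draw for each side, scored with the usual 3/1/0 points; Team A wins most seasons, but not all of them.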
Well, strange as it might seem, Team A doesn’t always win the league. The highest point total at which Team A was not the outright winner is 77; in that season it was tied with Team 2, while Team 14 came in 3rd with 65 points.
In another random season, despite scoring on average two goals a game Team A ends with 73 points, but Team 16 (averaging only one goal a game) wins the league with 74 points. Team A has to settle for second, with Team 2 at 59 points and Team 4 at 58 points.
In another “season” Team 5 wins with 72 points. In another, Team 19 wins with 73 points, Team A having to settle for second with 72. And in yet another Team A drops to third with 71 points, with the league winner on 79 and the second place team on 75.
Looking at the worst seasons for Team A, one season it only got 64 points and came 4th, six points behind the winner. And worst of all it actually dropped to 52 points one season, while the team in 6th place got 60 points.
So it goes on: even with the set formula that Team A always averages two goals per game, across the years it can get anywhere between 53 and 99 points.
Meanwhile at the bottom teams were able to be relegated with anything between 25 and 46 points.
So what does this tell us?
Basically that even with the league fixed in a way that common sense suggests should result in Team A always winning the league, it doesn’t always happen. The vagaries of chance get in the way and knock Team A down the league.
And these vagaries happen without ownership changes, with no management changes, while players never get tired or injured and play the same all season long. Oh, the officiating was perfect and impartial, not like what this crap PGMO provides. And each of these surveys was 50 times as long as Wenger has been at Arsenal. The aaa must be turning in their graves because it turns out that even with a league so fixed that one club averages two goals a game and the rest one goal a game, they are still not guaranteed to win the league each season.
Just to look at another variation: instead of a German “one club” league approach, I tried a “top 4” approach where four teams score an average of two goals a game and 16 get an average of one.
In this scenario the lowest observed spread in points among the “Top 4” teams is 9 points and the maximum is 37. And not surprisingly, there are circumstances where not all of the “Top 4” finish in the top 4. The lowest point total I have found for 20th place is 21 points. And that in a league where four clubs get an average of two goals a game and the rest an average of one goal a game.
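For anyone who wants to poke at the “top 4” variation themselves, a hypothetical Python toy along the same lines (the real model is Perl; the team means and the 100-season count here are my own picks, and the counts will differ run to run):

```python
import math
import random

def poisson(lam):
    # Knuth's multiplicative method for Poisson random deviates
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def season_points(means):
    # Double round-robin over a 20 team league, 3/1/0 points
    n = len(means)
    pts = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gi, gj = poisson(means[i]), poisson(means[j])
            if gi > gj:
                pts[i] += 3
            elif gj > gi:
                pts[j] += 3
            else:
                pts[i] += 1
                pts[j] += 1
    return pts

# "Top 4" setup: four strong teams (indices 0..3), sixteen ordinary ones
means = [2.0] * 4 + [1.0] * 16
misses = 0
for _ in range(100):
    pts = season_points(means)
    order = sorted(range(20), key=lambda t: pts[t], reverse=True)
    if set(order[:4]) != {0, 1, 2, 3}:
        misses += 1
print(f"Top 4 did not finish 1-4 in {misses} of 100 seasons")
```

In a run of 100 seasons, some fraction will not end with the four strong teams occupying the top four places, in line with the counts quoted above.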
What this shows is that the vagaries of even a totally fixed league can get in the way of endless winning.
Do like Tony suggests, and research things. Don’t be a spade and just shovel your advice at people.
Technical background
Around 1 million Poisson random deviates were created to produce this document. Thanks to Math::Random::MT::Auto, the Mersenne Twister supplied all the randomness needed, and /dev/random never got depleted.
“The purpose of computing is insight, not numbers.” R. W. Hamming, Numerical Methods for Scientists and Engineers, Dover Press, 1962 and 1973.
Or to put it another way, “To study, and when the occasion arises to put what one has learned into practice – is that not deeply satisfying?” Confucius, Analects 1.1.1
Gord,
Thanks, I like it. It shows the difference between average and actual performance. Out of curiosity, what were the differences between the maximum scores for the 2-goal-average team and the rest? The minimum for both obviously being 0.
Andrew
Some of the text has been polished a little.
These runs are simulating 1000 years, so 38000 games each. Which is a minute or so of computer time.
I hadn’t looked at what you specifically asked for, so I will do another run and get back. I just got back from the mail box, and got some goodies in the mail. I think the one package is an upgraded CPU for one of my computers (going from 2 core to 8 core, and more speed).
Andrew.
In a 1000 year simulation, there were 158 times where the top 2 of this league with 2 strong teams did not finish 1,2. Which would seem to be fairly common.
Regardless of which teams were 1,2; the gap between second and third place varied between 0 (40 times) and 31 points. For gaps larger than 22 points, the number of times these occurred was always less than 10. Beyond 26, always 1 (or 0). Just eyeballing the data, it appears that all gaps between 0 and about 15 points are equally likely.
If I just look at circumstances where the top 2 teams are the 2 higher scoring teams, the distribution of the gap is about the same, except that the likelihood of a gap of 0 has dropped to about 1/3 of what I saw above. This being a different run of data, there were 137 times where the league didn’t finish with these 2 teams at the top.
Looking at the analogous situation for 4 strong teams, in 258 years out of 1000 the top four did not finish with some permutation of the four top scoring teams.
For this same situation (the 4 top scoring teams all finishing in the top 4), the gap distribution is similar to the 2 strong team situation. The chance is more or less the same for gaps of 1 to 9 points, and then it tails off. Largest gap seen was 21. A gap of 0 occurred about 1/3 to 1/2 as often as gaps of 1 through 9.
Oops, largest gap was 25 in that last one.
One more spontaneous experiment. I increased the strong teams to about 2.5 goals per game (but all a little different), and slightly spread out the weaker teams (0.925 to 1.085).
Now, the number of times the league didn’t finish with the 4 strongest at the top, drops down to 30 (from about 250).
The character of the gap changes. It is now of a more curved shape (with a single mode), with the mode being about a gap of 15. Again, the gap of zero is less than extrapolating small gaps to zero would suggest. The largest gap grew out to 32.
Something to add.
Most of this is comparing a team which scores 2 goals per game on average, to one which scores 1 goal per game.
Some days, it seems like you just can’t buy a goal, and other days they might as well be falling like rain. Some managers will quite willingly let their team run up a high score if things happen to be bouncing that way on the day.
I don’t think Wenger does this. I think he starts to apply the handbrake, if things look to be heading that way. If this is true, and he starts applying the handbrake at say 60 minutes (which is 2/3 of the game), then you need to multiply the number of goals Arsenal score by something like the inverse (which would be 3/2) for that particular game.
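That back-of-envelope correction can be written down directly. A sketch, with the 60-minute handbrake point purely an assumption:

```python
def corrected_rate(observed_goals, handbrake_minute=60, match_length=90):
    # If a team effectively stops trying to score at handbrake_minute, scale
    # the goals it did score by match_length / handbrake_minute to estimate
    # what its full-match scoring rate would have been
    return observed_goals * match_length / handbrake_minute
```

So two goals scored before an assumed 60-minute handbrake correspond to a full-match rate of 3, which is the 3/2 factor mentioned above.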
interesting exercise
seems to align well with recent Bundesliga and LaLiga results
yeah, i’d say handbrake when in lead was pretty common (compared to eg MC) in previous years, though this year seems a bit different
I have the model sitting there, if you have specific questions, maybe we can try them?
Gord, would be nice if you could opensource the model, so other people could tinker with it or give feedback.
It’s only 200 lines of Perl, with a bunch of those lines being comments. Where would I publish it?
I still plan to finish that “November” project, but that will mean writing proper modules and database stuff (all in Perl). Which would get pushed to CPAN. I suppose I could put this model in an /examples directory.
Gord, eg github.com
One thing i’d want to try over the weekend is to play with different distributions and perhaps try to infer one from a real season data.
I’ll look into github.
I don’t think what you propose is easy. I am using a Poisson distribution for the random deviates. You would need to consider binomial and negative binomial distributions for scoring that is under- or over-dispersed with respect to a Poisson.
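To make the dispersion point concrete: the variance equals the mean for a Poisson, is smaller than the mean for a binomial, and is larger for a negative binomial (drawn here as a gamma-Poisson mixture). A hedged Python sketch, with all parameters chosen purely for illustration:

```python
import math
import random

def poisson(lam):
    # Knuth's multiplicative method for Poisson random deviates
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def binomial(n, p):
    # sum of Bernoulli trials: variance np(1-p) is below the mean np
    return sum(1 for _ in range(n) if random.random() < p)

def negbin(r, mean):
    # gamma-Poisson mixture: variance mean + mean^2/r exceeds the mean
    return poisson(random.gammavariate(r, mean / r))

def dispersion(draws):
    # sample variance-to-mean ratio
    n = len(draws)
    m = sum(draws) / n
    v = sum((x - m) ** 2 for x in draws) / n
    return v / m

N = 20000
under = dispersion([binomial(4, 0.5) for _ in range(N)])  # ratio near 0.5
exact = dispersion([poisson(2.0) for _ in range(N)])      # ratio near 1
over = dispersion([negbin(2.0, 2.0) for _ in range(N)])   # ratio near 2
```

All three samplers target a mean of 2 goals per game, so the variance-to-mean ratio is what separates them.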
But, if nobody tries, then one doesn’t really know.
But just running single threaded code on a FX-8320E, running in perldb mode inside emacs, a run only takes about 1 minute.
The model sounds interesting, but it cannot recreate the ambiguity in the rules where the same foul is treated differently.
I still say that ambiguity is the real problem.
It’s like “training” your children: if you are ambiguous in discipline, the child will not know which laws are to be followed and which are not. The players know this, and plan for the first few fouls that they may get away with.
On the other hand, imagine no ambiguity in the rules at all, that is, “a foul is a foul is a foul” regardless of the perspectives. Yellow always.
A bad foul is a red always.
All players will become used to it very quickly i’ll bet.
Wow! Very impressive, my friend. So there really exists many a slip between the cup and the lips.
Great job, Gord.
You’re correct Para. This is a very simple model. It only considers final score, it doesn’t even try to model when goals are scored.
What it does allow for, is setting up detail in scoring. To try and do this with tables from a statistics textbook with pencil and paper would be burdensome. Especially if you looked at binomial or negative binomial scoring in addition to Poisson.
This is a fantastic idea! I second the requests to opensource and/or collaborate. I use Monte Carlo simulations for work, but never thought to apply them to a football season. Will definitely see if I have the time to tinker this weekend. Also, I feel that the simplicity of the model is justified and does not really detract from (what I understand to be) the claim:
under ideal conditions, the team that scores more on average does not always win the league. Would it also be right to say that they don’t always end up as top scorers either?
It makes me curious to see the relationship between average goals scored vs league position. I mean, there must be some point at which the top team nearly always wins. If so, it might also be informative about leagues in real life. Good work!
I’ve done a bunch of Monte Carlo here, not sure how you missed it. 🙂
I’ve got some things that need to get done, before I try to clean up the code a bit and add some documentation. I normally do this stuff in Perl. And then if needs arise, I could translate to C, FORTRAN or something else. I guess you could translate to Python, Ruby or whatever; in a lot of ways they all sort of look like C. If you translate to LISP, I wouldn’t have a clue what to do with it. Or Forth for that matter (I don’t think on a stack).
But, a while ago, someone was looking for data. For myself, just stuffing it into DBM::Deep is fine (just saves the Perl structure). But I have started on turning this into SQL with DBIx::Class. I roughed out a structure, and in trying to actually implement it, I am having to change some things.
Thanks! Much appreciated.
So even with such stats (goals!) in favour you can’t be guaranteed a title?
I blame Vengarggghhhh.
‘He’s lost the dressing room’ etc. / replace with the genius’ meme of the week.
Sure, if you crank the goals up, you can pretty much guarantee a title.
If we look at the Wikipedia plot of the distribution for an average of 4, you can see there is a slight chance of one team actually scoring 10 goals in a game. (It’s about 0.5%.) Or, maybe 1 game in 2.5 seasons. There are real 10-goal games, even a few with totals around 20, but they are unusual. And when looking that far from the average, you really should consider distributions other than Poisson as well, just to use something that has some logic to it.
So, my guess was that an average of 2 was probably appropriate for a strong team. Maybe I should have used 2.2? I did the one run with 2.5 (or so) above.
Sorry, that should be more like 1 game in 5 seasons.
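The tail probability itself is easy to check; a quick sketch of the arithmetic behind that “1 in 5 seasons” figure:

```python
import math

def poisson_pmf(k, lam):
    # exact Poisson probability of k goals at an average of lam per game
    return math.exp(-lam) * lam ** k / math.factorial(k)

p10 = poisson_pmf(10, 4.0)          # about 0.5%, as quoted above
seasons_per_event = 1 / (p10 * 38)  # expected 38-game seasons per 10-goal game
```

which comes out near one occurrence per five 38-game seasons for a single team.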
@Gord,
Sorry, maybe I wasn’t clear. I just meant that I also use MC simulations for work, but never thought to apply them to this kind of thing. 🙂 I get that this was also MC, I was just applauding its application.
I’m gearing up to do some weather related stuff. Downscaling so that I can connect circulation models to local weather, and trying to calculate prevailing winds taking landscape into account. So soon, I’ll have 22 amd64 cores and 5 GPUs running. Which I hope is enough to start this.
First thing like this I did, was a simulation of grain growth in solids. Back in 1984, on a VAX 11/785. Big difference in number crunching now from then.
Okay. A user and repository exists. At github, there is a new user (UntoldArsenal) which has a single project (ua1). This project contains a .gitignore file, a README.md file, and a BFDepl.pl file. Some time soon, I hope to add a LICENSE file, which will be the typical license of Perl projects unless there is a reason to use something else. For the moment, there is no LICENSE, because the file isn’t worth it.
Thanks for doing this. I haven’t really used Perl before so this will give me an excuse to learn. Although, as a novice it might be easier for me to just rewrite in something with which I am more familiar. In any case, this is very appreciated.
It is possible to find Perl code that is fast, and difficult to understand. I tend not to write that way.
Like C, there are times when having braces around a block of code is optional. I (nearly) always put braces around code.
Perl has some “magic”, which might need explanation. So ask, if you don’t understand.
Perl has “kinds of variables”. The type of a variable is largely determined by context. If you are using something as a string, it stays a string until you do something that requires it not to be a string. And then it will “magically” change, if it can. Ints and doubles are other kinds of content. The number 0 and the empty string are logically FALSE; pretty much anything else is TRUE. A variable can also hold a reference to a value.
A variable which starts with ‘$’ is nominally a scalar (a single thing). A variable which starts with ‘@’ is a list; an element of a list is a scalar. A variable which starts with ‘%’ is a hash (keys and values).
If you are “counting occurrences” of things, it is common to see something like:
$h->{$item}++;
$h is, in this circumstance, a reference to a hash structure. The variable $item is what you are counting. If the hash does not already contain a key of $item, Perl will automagically vivify (autovivify) such a key, and the value it starts with behaves as 0. If we then do something like:
@sortedItem = sort keys( %$h );
the list @sortedItem will contain a list of all the items that were seen, sorted into ASCIIbetical order.
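Since rewriting in a more familiar language was mentioned above, here is what the same counting idiom might look like in Python. This is purely an analogy, with defaultdict standing in for Perl's autovivification:

```python
from collections import defaultdict

counts = defaultdict(int)        # missing keys spring into existence as 0
for item in ["b", "a", "b"]:
    counts[item] += 1            # the analogue of $h->{$item}++

sorted_items = sorted(counts)    # the analogue of: sort keys %$h
```

After the loop, counts holds {"b": 2, "a": 1} and sorted_items is ["a", "b"].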
There are Perl environments which install on “non-UNIX” machines, which mostly means Windows. I believe Strawberry Perl is one of the better Windows ones. The environment typically comes with a lot of documentation, in the form of “man” pages (see perldoc). The format for writing Perl Documentation is called POD (Plain Old Documentation). There are things which convert POD to markdown and others.
It turns out Strawberry Perl calls Github home. So, I nominally copied (with 3 substitutions) their LICENSE file. And updated the line number for the last executable line of code in BFDepl.pl, since I added the license statement to the POD at the beginning of the program.
With a complete perl environment, if you were to run:
perldoc BFDepl.pl
the perldoc program will look for sections of POD in the file, and display it.
I am not a git expert. I had made some changes and added a file, and some changes became visible but not all. I had installed a GUI for git, and it pointed out what I did wrong.
In any event, the README.md still has only partially useful content. It does point to a Plans.md file (and the link seems to work). This is my idea of plans for things in this project. Nothing is carved in stone. The BFDepl.pl file does have an email address which gets to me. I suppose spammers farm github, so I may at some point need to filter email.
There are many command line calculator type programs out there. RosettaCode had 3 chunks of code related to RPN, infix and a rpn calculator. I copied those 3 chunks as a starting point, but really only worked on the rpnCalc routine.
If a user puts an explicit decimal point in a number, the program puts the number (with an assumed error of one half the smallest significant figure present) into a Number::WithError object. When the RPN string is rewritten so that Perl can ‘eval’ it, the two numbers (joined by the operator) are replaced with variable names, so that Number::WithError’s operator overloading is invoked when carrying out the desired operation.
Essentially, calc2.pl is just a demonstration of:
rpnCalc( '5 4 *' ); # 20
rpnCalc( '5. 4.0 *' ); # 2.00e+01 +/- 2.0e+00
rpnCalc( '5.00 4.000 *' ); # 2.0000e+01 +/- 2.0e-02
The first example does not make use of any Number::WithError functionality, as neither number has a decimal point.
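For readers re-implementing this in another language, the rule described above can be sketched as follows. This uses plain Gaussian error propagation for a product, which reproduces the example results, though Number::WithError's internals may differ in detail, and these function names are mine:

```python
import math

def parse_with_error(token):
    # A number written with a decimal point carries an assumed error of
    # half its smallest significant figure; a bare integer carries none
    if '.' in token:
        decimals = len(token.split('.')[1])
        return float(token), 0.5 * 10.0 ** (-decimals)
    return float(token), 0.0

def mul_with_error(x, dx, y, dy):
    # standard Gaussian error propagation for a product
    return x * y, math.sqrt((y * dx) ** 2 + (x * dy) ** 2)

x, dx = parse_with_error('5.')       # 5.0 +/- 0.5
y, dy = parse_with_error('4.0')      # 4.0 +/- 0.05
v, e = mul_with_error(x, dx, y, dy)  # roughly 20 +/- 2.0
```

With ‘5.’ parsed as 5.0 ± 0.5 and ‘4.0’ as 4.0 ± 0.05, the product comes out as 20 ± roughly 2.0, matching the second rpnCalc example.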