Regressing My Way To March Madness
I used math to fill out my NCAA basketball brackets. But to keep things interesting, I picked the underdog whenever they were within striking distance of the favorite.
A few months ago, I proposed reasonable rankings for college football. I applied a logistic function to the scoring margin of each game, and then assigned an index to every team using linear regression. The output was a ranking system that quite reasonably agreed with the subjective rankings of the college football playoff selection committee.
The logistic function could be tuned by a single parameter, which I called 𝜆. When 𝜆 was small, wins were much more important than scoring margins; when 𝜆 was large, scoring margins were more important. In a subsequent piece, I showed that the committee didn’t particularly care much about scoring margin or home field advantage.
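To make the role of 𝜆 concrete, here is a minimal sketch of the general idea, not the exact model behind the rankings. It assumes the logistic function has the form 1/(1 + e^(−margin/𝜆)) and that each game's transformed margin is regressed on the difference between the two teams' indices; the function names, the toy games, and the iterative least-squares solver are all illustrative choices.

```python
import math

def logistic(margin, lam):
    """Map a scoring margin to (0, 1). The exact form is an assumption."""
    x = margin / lam
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # stable branch for large negative x
    return e / (1.0 + e)

def fit_indices(games, lam, sweeps=200):
    """Least-squares fit of r[a] - r[b] ~ logistic(margin, lam) - 0.5.

    games: list of (team_a, team_b, margin), margin = a's score - b's score.
    Solved with Gauss-Seidel sweeps over the normal equations.
    """
    teams = sorted({t for a, b, _ in games for t in (a, b)})
    r = {t: 0.0 for t in teams}
    for _ in range(sweeps):
        for t in teams:
            total, n = 0.0, 0
            for a, b, margin in games:
                y = logistic(margin, lam) - 0.5
                if t == a:
                    total += r[b] + y
                    n += 1
                elif t == b:
                    total += r[a] - y
                    n += 1
            r[t] = total / n
    # Indices are only defined up to a constant, so center them at zero.
    mean = sum(r.values()) / len(r)
    return {t: v - mean for t, v in r.items()}

# Toy schedule with placeholder teams (not real results).
games = [("A", "B", 10), ("B", "C", 10), ("A", "C", 20)]
index = fit_indices(games, lam=11.6)
ranking = sorted(index, key=index.get, reverse=True)
```

With a tiny 𝜆, a 1-point win and a 30-point win both map to essentially 1, so only the result matters. With a huge 𝜆, the transformed value is nearly proportional to the margin, so a 30-point win counts about three times as much as a 10-point win.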
Since it’s that time of year, I’m conducting a similar analysis of men’s and women’s college basketball. And for added fun, I’ll use the resulting rankings to populate my own March Madness brackets.
Setting the Parameters
As with football, it turned out that a popular ranking system (here, the Associated Press rankings) didn’t care much about home field advantage or scoring margin. However, because college basketball has a rather inclusive tournament with more than 64 teams, these rankings matter much less than football’s CFP rankings. In other words, I’m less interested in predicting the rankings themselves and more interested in predicting the results of actual games. Therefore, I included both home field advantage and scoring margin in my regression.
Let’s take a closer look at home field advantage. In the 2024-25 NCAA men’s basketball season, the scoring margin favored the home team by about 9 points on average. That meant home field advantage was probably something like 4.5 points: the home team got a 4.5-point boost, while the away team operated at a 4.5-point disadvantage. In women’s basketball, the scoring margin favored the home team by 7.8 points on average, so the home field advantage was something like 3.9 points. (I acknowledge that both of these margins would probably decrease if I looked exclusively at intra-conference games.)
As for 𝜆, it was tempting to set this value to 1, matching what I had done for college football. This would have meant wins were more important than scoring margins. If I had significantly more time to devote to this analysis, I’d see which value of 𝜆 historically results in the best prediction of tournament brackets. As a quick alternative, I used the standard deviation for margin of victory, after adjusting for home field advantage. This turned out to be about 11.6, which was significantly greater than 1.
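The arithmetic behind these two numbers can be sketched as follows. This is one possible reading, under two assumptions: that "adjusting for home field advantage" means subtracting the average home margin from each game's home-minus-away margin, and that the population standard deviation is the one intended. The margins below are made up for illustration.

```python
import statistics

# Made-up home-minus-away margins for six games (illustrative only, not real data).
home_margins = [12, -4, 20, 6, -2, 16]

# Average margin in favor of the home team.
avg_home_margin = statistics.fmean(home_margins)

# Split evenly: a boost for the home team, an equal penalty for the away team.
home_edge = avg_home_margin / 2

# Remove the home tilt, then measure the spread of what is left.
adjusted = [m - avg_home_margin for m in home_margins]
lam = statistics.pstdev(adjusted)
```

Subtracting a constant does not change a standard deviation, so under this reading the adjustment only recenters the margins around zero, which is where a symmetric logistic function is centered.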
Men’s Rankings
After plugging all 6,173 men’s games into a regression, here were the top-ranked teams that popped out:
From this, it became clear that some teams were ranked by the AP much lower than they should have been, either because of reputation, losses (to good teams), or some other reason. Gonzaga was striking in this regard, and wound up being an 8-seed in the tournament. Meanwhile, several schools (such as Louisville) were looked upon quite favorably by the AP.
Women’s Rankings
After plugging all 5,818 women’s games into a regression, here were the top-ranked teams that popped out:
First off, the indices of the top teams (South Carolina and Connecticut) were significantly higher than those on the men’s side, indicating that there’s greater parity in the men’s game. But those two teams towered above the rest, which made it all the more shocking (to me, at least) that Connecticut received a 2-seed in the tournament.
Several teams were underrated by the AP, including Iowa and Michigan State. Meanwhile, West Virginia, ranked 16th by the AP and 12th in my regression, somehow wound up receiving a 6-seed in the tournament.
My Brackets
Once my regression was complete, the next thing I did was fill out a bracket in which higher-ranked teams (according to my index) advanced. However, this bracket was incredibly boring. On the men’s side, there was only one upset (a team with a numerically higher seed defeating a team with a numerically lower seed): (9) Baylor over (8) Mississippi State, in the first round.
While unfair at times, the seedings this year do a great job of cleaning up their own messes in subsequent rounds. For example, I already said that Gonzaga should have been higher than an 8-seed, based on its performance this year. But by virtue of being an 8-seed, a victory in the first round means facing the 1-seeded (and vaunted) Houston in the second round.
On the verge of falling asleep staring at this “boringest of brackets,” I started anew. But this time, whenever any matchup’s underdog was within ~5 index points of the favorite (according to my model), I awarded a victory to the underdog. That said, for the Final Four, all bets were off, and I resumed picking the favorite.
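The picking rule above can be sketched as a small function. This is an illustration under stated assumptions: the caller decides who the favorite and the underdog are (for example, by seed), `index` maps each team to its regression index, and the threshold of 5 stands in for "~5 index points." The team names and index values are placeholders, not output from the regression.

```python
def pick_winner(favorite, underdog, index, threshold=5.0, final_four=False):
    """Pick the underdog when it is within `threshold` index points of the favorite."""
    gap = index[favorite] - index[underdog]
    if final_four:
        # From the Final Four on, simply take the team with the higher index.
        return favorite if gap >= 0 else underdog
    return underdog if gap <= threshold else favorite

# Placeholder teams and index values (hypothetical).
index = {"Fav": 30.0, "Close": 27.0, "Far": 10.0}

early_close = pick_winner("Fav", "Close", index)                  # gap of 3
early_far = pick_winner("Fav", "Far", index)                      # gap of 20
late_close = pick_winner("Fav", "Close", index, final_four=True)  # favorite rule
```

Setting `threshold=0` recovers the "boring" bracket in which the team with the higher index always advances.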
Without further ado, here is my men’s bracket (remember, I have an underdog winning whenever they are almost as good as the favorite):
Here are the notable upsets in my men’s bracket:
In the first round, I picked (12) Colorado State (my 49th best team) over (5) Memphis (my 46th best team).
In the first round, I picked (11) VCU (my 33rd best team) over (6) BYU (my 23rd best team).
In the second round, I picked (6) Illinois (my 20th best team) over (3) Kentucky (my 15th best team).
In the Sweet 16, I picked (3) Texas Tech (my 11th best team) over (2) St. John’s (my 10th best team).
In the Sweet 16, I picked (3) Iowa State (my 8th best team) over (2) Michigan State (my 7th best team).
And here’s my corresponding women’s bracket:
Here are the notable upsets in my women’s bracket:
In the first round, I picked (10) Nebraska (my 37th best team) over (7) Louisville (my 31st best team).
In the second round, I picked (6) West Virginia (my 12th best team) over (3) North Carolina (my 18th best team).
In the Sweet 16, I picked (3) LSU (my 10th best team) over (2) NC State (my 11th best team).
In the Sweet 16, I picked (3) Notre Dame (my 5th best team) over (2) TCU (my 8th best team).
In the Elite 8, I picked (3) Notre Dame (my 5th best team) over (1) Texas (my 3rd best team).
In the Elite 8, I picked (2) Connecticut (my 2nd best team) over (1) USC (my 6th best team).
In the Semi-Final, I picked (2) Connecticut (my 2nd best team) over (1) UCLA (my 4th best team).
These brackets were considerably spicier than if I had just picked what my regression said was the favored team. And while I admit these precise brackets are less likely to occur than having the favorites always win, they better resemble what will ultimately happen, as they include a smattering of upsets.
In any case, I’ll certainly be keeping an eye on the games and my brackets as I write up puzzles over the next few weeks.
Great post again. What if you left lambda as a free parameter in your regression? Just think it might be interesting to check (and for the football).
Note: this is the question I had last time but I think you misinterpreted it (and instead answered a probably more interesting question)
> Let’s take a closer look at home field advantage. In the 2024-25 NCAA men’s basketball season, the average scoring margin favored the home team by an average of about 9 points.
That's going to be affected by good teams having more home games against bad teams, right? Top schools tend to pay cupcakes to come play them as one-offs, rather than scheduling home and homes. Take Auburn, for example:
Auburn played 15 home games to 10 away games; their non-conference home games were exclusively against teams that they out-classed: a 32 point win over FAU, a 51 point win over Vermont, a 23 point win over Kent St, a 33 point win over North Alabama, a 44 point win over Richmond, a 41 point win over Georgia State, and a 29 point win over Monmouth. They played one non-conference away game at Duke, and the rest of their non-conference schedule was neutral site. Is that inflating the apparent home court advantage?