I have been combing through lots and lots of data, as I prepare my own entry to the Kaggle Machine Learning March Mania Contest again this year. I won’t go into how I am managing my entry right now, as the competition is obviously still open, but I thought I would share some of the insights I have accumulated along the way.
First off, you need to have a strategy. You can be the guy with the chalk bracket or the batshit-crazy-upset-dude, but we all know somewhere in the middle is probably where you need to go… just enough chalk, just enough upsets.
To get a good feel for how the tournament has played out over the past 30 years, I have put together a few graphics. The first one shows the winning percentage for each seed against every other seed since 1985. (The winning percentage is for the seed down the left side.)
It’s kind of crazy. If you look at it, #1 seeds have won only 40% of their games against #11 seeds since 1985. WTH? This obviously needs context, so here’s the same chart showing how many times each seed has played every other seed in that time frame.
Now we can calculate that #11 seeds have actually won 3 of 5 games against #1 seeds. In other words, that 40% rests on just five games. Great, but what does this mean?
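If you want to build the same two charts yourself, here is a rough sketch in Python. It is not how I produced the graphics above, just one simple way to do it. The function names and the `(winning seed, losing seed)` input format are my own assumptions, and the sample list only reproduces the five 1-vs-11 games mentioned above. I also skip same-seed games (a 1 playing a 1 in the Final Four, for example), which is a choice you may want to change.

```python
from collections import defaultdict

def seed_matchup_tables(games):
    """Build win and games-played counts keyed by (seed, opponent_seed).

    `games` is an iterable of (winning_seed, losing_seed) pairs.
    Same-seed games are skipped (an assumption of this sketch).
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    for winner, loser in games:
        if winner == loser:
            continue
        wins[(winner, loser)] += 1
        played[(winner, loser)] += 1
        played[(loser, winner)] += 1
    return wins, played

def win_pct(wins, played, seed, opponent):
    """Winning percentage of `seed` against `opponent`, or None if they never met."""
    n = played.get((seed, opponent), 0)
    if n == 0:
        return None
    return wins.get((seed, opponent), 0) / n

# Only the 1-vs-11 matchup from the text: 11 seeds won 3 of the 5 games.
games = [(11, 1), (11, 1), (11, 1), (1, 11), (1, 11)]
wins, played = seed_matchup_tables(games)

print(played[(1, 11)])                # 5 games played
print(win_pct(wins, played, 1, 11))   # 0.4
print(win_pct(wins, played, 11, 1))   # 0.6
```

Keeping the games-played count next to the percentage is the whole point: a 40% built on five games should carry a lot less weight than one built on a hundred.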
Hopefully this can help you solidify your strategy once the draw comes out. Maybe you like a certain 11-seed. How far should you consider riding them? It should also be a guide to help you LIMIT your upsets so they don’t get too wacky.
Another thing to consider is just how volatile the tournament will be. I have analyzed each year individually since 1985 and here are a few of my thoughts.
Of the past 30 seasons, 19 have been below “average” when it comes to upsets. I have defined these as the Chalk years. They tend to have fewer upsets and fewer large-scale upsets. The list includes: 1987, 1988, 1989, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2003, 2004, 2005, 2007, 2008, 2009 and 2012.
Since ’85, there have been an average of 17.7 “Upsets” (by seed) and 7.9 “Big Upsets” per season. I define Big Upsets as those where the seed differential was greater than 4 (at least a 6 over a 1). I also used Mean Upset: the sum of all upset differentials divided by the number of tournament games.
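To make those three numbers concrete, here is a minimal Python sketch of the definitions as stated above. The function name `upset_metrics` and the `(winning seed, losing seed)` input are assumptions for illustration, and the six games in the example are made up, not a real bracket. I am also assuming that any win by the higher-numbered seed counts as an upset, including a 9 over an 8.

```python
def upset_metrics(games, big_threshold=4):
    """Summarize one season of (winning_seed, losing_seed) results.

    An upset is any game won by the higher-numbered seed.
    A big upset has a seed differential greater than `big_threshold`.
    Mean upset is the sum of upset differentials over ALL games played.
    """
    games = list(games)
    diffs = [winner - loser for winner, loser in games if winner > loser]
    return {
        "upsets": len(diffs),
        "big_upsets": sum(1 for d in diffs if d > big_threshold),
        "mean_upset": sum(diffs) / len(games) if games else 0.0,
    }

# Made-up results, for illustration only.
season = [(1, 16), (9, 8), (12, 5), (6, 1), (2, 7), (15, 2)]
m = upset_metrics(season)

print(m["upsets"])                 # 4
print(m["big_upsets"])             # 3
print(round(m["mean_upset"], 2))   # 4.33
```

Note that Mean Upset divides by every game, not just the upsets, so a season with a handful of huge upsets and a season with lots of small ones can land on a similar number.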
QUIRKY STAT MOMENT: The two years with the most upsets since 1985? 1999 (23) and 2014 (22). Guess who won both years? UCONN. Strange, huh?
When figuring out the “upset” and “chalk” years, the upset years stood out. Those would be 1985, 1986, 1990, 1999, 2001, 2002, 2006, 2010, 2011, 2013 and 2014. As you can see, four of the last five seasons fall into this category. Why? That’s a story for another day.
I am sure there are plenty of ways to argue the way I divided up the years, but the concept is solid: there are upset years and there are chalk years… and we seem to be in a time of upsets.
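If you want to try your own split, here is one simple way to do it in Python: call a season chalk when its upset count is below the average, and an upset year otherwise. This is only a sketch of the idea, not my exact method, and `classify_seasons` is a hypothetical name. The 1999 and 2014 counts and the 17.7 average come from the numbers above; the third entry is a made-up season so there is something on the chalk side.

```python
from statistics import mean

def classify_seasons(upsets_by_year, average=None):
    """Label each season 'chalk' (below average upsets) or 'upset'.

    If `average` is not given, the mean of the supplied counts is used.
    """
    if average is None:
        average = mean(upsets_by_year.values())
    return {
        year: ("chalk" if count < average else "upset")
        for year, count in upsets_by_year.items()
    }

upsets_by_year = {
    "1999": 23,                 # from the text
    "2014": 22,                 # from the text
    "hypothetical season": 14,  # made up for illustration
}

labels = classify_seasons(upsets_by_year, average=17.7)
print(labels["1999"])                 # upset
print(labels["hypothetical season"])  # chalk
```

Swap in Big Upsets or Mean Upset for the raw count and you may get a somewhat different list, which is exactly the kind of argument I mean.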
Remember, even though a season is defined as chalk, there are still plenty of upsets. In 2012, the most recent chalk year, two 15-seeds, Lehigh and Norfolk State, both won games over 2-seeds. Also, 10, 11, 12 and 13 seeds all had first-round wins. However, the tournament was dominated by the favorites (the lower-numbered seeds) throughout.
I hope some of this helps. Remember: get a little crazy, but not too crazy… and it also helps to be really lucky.