2014 US Open Men’s Draw Simulation

The U.S. Open main draw begins this morning and for the fourth year in a row, I will not be able to attend. Gone are the good ol’ days of working for the USTA and getting to take the trip up to New York to take it all in.

Since I cannot go, I decided to utilize Markov Chain models and Monte Carlo simulations to predict who will win.

A Markov model for tennis takes a few initial inputs, such as each player’s probability of winning a point on serve, and uses them to work through an entire match, giving the probability that player A beats player B. A Monte Carlo simulation then plays the entire tournament over and over using those match probabilities. Even if you can do the math, one of the most difficult parts is creating the initial inputs for the Markov model.
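To make the Markov chain idea concrete, here is a minimal Python sketch of the smallest building block: the chance that a server holds a single game, given the chance of winning each point on serve. It assumes every point is independent with the same probability, which is the usual simplification in these models. The function name hold_probability is made up for illustration and is not taken from any of the code discussed in this post.

```python
def hold_probability(p: float) -> float:
    """Probability the server wins a game when each service point is won with probability p.

    Assumes points are independent and identically distributed.
    """
    q = 1.0 - p
    # Server wins four points while the returner wins 0, 1 or 2 (game to love, 15 or 30).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Deuce (three points each) can be reached in C(6, 3) = 20 orderings.
    reach_deuce = 20 * p**3 * q**3
    # From deuce the server must win two points in a row before the returner does.
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    for p in (0.50, 0.62, 0.70):
        print(p, round(hold_probability(p), 3))
```

The same approach can be stacked upward: game probabilities feed a set model, and set probabilities feed a match model. A small edge per point grows into a much larger edge per game, which is why the point-level inputs matter so much.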

MY METHODOLOGY
I decided to experiment with an idea that begins with something I read in the PhD dissertation Dr. Kamran Aslam wrote at USC. Dr. Aslam and his advisor, Dr. Paul K. Newton, published portions of that work several times, including in the Journal of Quantitative Analysis in Sports back in 2009.

Dr. Aslam’s idea is to start by finding the overall mean probability of winning a point while returning. This is defined as the returning average of ‘the field’. Let’s say this is 0.330. If Roger Federer is playing Novak Djokovic and Roger’s average probability of winning a point on return is 0.40, then he is 0.07 better than ‘the field’. If Novak’s average is 0.41, then he is 0.08 better than ‘the field’.

Then, if Roger wins 0.70 of his service points, you subtract Novak’s ability ‘above the field’ (0.08), making Roger’s effective serving percentage 0.62.

Likewise, if Novak’s serving percentage is 0.68, then his effective serving percentage is 0.61 (0.68 minus Roger’s 0.07). Therefore, the inputs to the program for Roger would be 0.62 on serve and 0.39 on return (one minus Novak’s effective serving percentage). If you ran this for Novak, the inputs would be 0.61 and 0.38.
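The arithmetic in the Federer and Djokovic example can be written out as a short Python sketch. The function name effective_inputs and its argument names are made up for illustration; only the numbers come from the example above.

```python
def effective_inputs(serve_a, return_a, serve_b, return_b, field_return):
    """Return (A's effective serve %, A's effective return %) for a match against B."""
    # How far each player's return game sits above (or below) the field average.
    a_above_field = return_a - field_return
    b_above_field = return_b - field_return
    # Each server is marked down by the opponent's return edge over the field.
    a_eff_serve = serve_a - b_above_field
    b_eff_serve = serve_b - a_above_field
    # A's chance of winning a point on B's serve is one minus B's effective serve.
    return a_eff_serve, 1.0 - b_eff_serve


roger = effective_inputs(0.70, 0.40, 0.68, 0.41, field_return=0.33)
novak = effective_inputs(0.68, 0.41, 0.70, 0.40, field_return=0.33)

if __name__ == "__main__":
    print("Roger:", [round(x, 2) for x in roger])
    print("Novak:", [round(x, 2) for x in novak])
```

Up to floating-point rounding, this gives 0.62 and 0.39 for Roger and 0.61 and 0.38 for Novak, matching the example.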

Modifications
To get the data, I scraped all serving and receiving stats from the ATP website for each player in the draw. I also decided to scale the data.

Scaling
Using only hard court results for the 2014 season, I scaled the data based on the level of competition. This allowed me to include Challenger-level data, which is available on the ATP site, alongside the ATP-level data. If an opponent was ranked inside the top 64, no scaling was done. If the opponent was ranked between 65 and 128, I scaled the numbers down by 1.5%. If the opponent was ranked between 129 and 192, I scaled them down another 1.5%. I scaled them down another 1.5% for opponents ranked between 193 and 256, and another 1.5% for those ranked below 256.

For some matches, the opponent’s ranking is listed as N/A. In those cases, I scaled based on the player’s own ranking, which seemed close enough to the opponent’s actual ranking except in a few instances.
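One way to express the ranking bands in Python is sketched below. Two things here are assumptions rather than details from the post: the 1.5% steps are treated as cumulative reductions applied as a multiplier (0%, 1.5%, 3%, 4.5%, 6%), and the function name scale_factor is hypothetical. The fallback to the player’s own ranking covers the N/A case.

```python
from typing import Optional

STEP = 0.015      # 1.5% per ranking band, as described in the text
BAND_WIDTH = 64   # bands are 1-64, 65-128, 129-192, 193-256, 257+
MAX_BANDS = 4     # everything beyond 256 gets the same, largest reduction


def scale_factor(opponent_rank: Optional[int], own_rank: int) -> float:
    """Multiplier applied to a match's stats, based on the opponent's ranking.

    Assumption: each band below the top 64 removes a further 1.5% multiplicatively
    from the raw percentage. If the opponent's rank is missing (N/A), the player's
    own rank is used as a stand-in.
    """
    rank = opponent_rank if opponent_rank is not None else own_rank
    bands_below_top = min((rank - 1) // BAND_WIDTH, MAX_BANDS)
    return 1.0 - STEP * bands_below_top


if __name__ == "__main__":
    for opp in (10, 100, 150, 200, 400, None):
        print(opp, scale_factor(opp, own_rank=68))
```

Tightening the scaling for Challenger-level matches would only require a larger STEP or a separate table of factors per band.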

Scaling this way may not be the best solution, but this is a solid starting point.
I then found ‘the field’ by averaging the scaled percentages of all players in the tournament. Five players have not played on the hard courts yet this season, so I removed them when calculating the field. Also, rather than placing zeroes in the data for them, I substituted numbers slightly below the averages for both serving and receiving.
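A minimal Python sketch of that step: average the scaled percentages of everyone with hard court data, then fill in the missing players with values a little under the average. The post does not say how far below the average the substitutes were, so the discount of 0.98 is a placeholder assumption, and the function name field_and_fill and the toy numbers are invented for illustration.

```python
from statistics import mean


def field_and_fill(stats, discount=0.98):
    """Compute field averages and fill in players with no data.

    stats maps player name -> (serve_pct, return_pct), or None when the player
    has no hard court matches. discount is an assumed placeholder for
    "slightly below the averages".
    """
    with_data = [s for s in stats.values() if s is not None]
    field_serve = mean(s[0] for s in with_data)
    field_return = mean(s[1] for s in with_data)
    fallback = (field_serve * discount, field_return * discount)
    filled = {name: (s if s is not None else fallback) for name, s in stats.items()}
    return field_serve, field_return, filled


if __name__ == "__main__":
    toy = {"Player A": (0.60, 0.30), "Player B": (0.70, 0.40), "Player C": None}
    serve_avg, return_avg, filled = field_and_fill(toy)
    print(round(serve_avg, 3), round(return_avg, 3))
    print(filled["Player C"])
```

Players without data are left out of the average itself, so the placeholders do not drag the field up or down.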

In future versions, I may substitute scaled full-season statistics, regardless of surface, for these players.

Noah Rubin
Then there was the case of Noah Rubin, who had played only one hard court match, a loss last week in Winston-Salem in which he still posted some pretty good numbers. In this case, I manually lowered his percentages to bring them closer to those I entered for the five players who had not played a single hard court match.

Shortcomings
Most of the problems come from too little data on some players. Some of this can be handled by using more stringent scaling for Challenger-level matches. Two of the most noticeable cases are Gilles Muller and Jared Donaldson. Muller won the Guadalajara Challenger and none of his opponents’ rankings are listed in the data, so those results were scaled only by his own No. 68 ranking. Donaldson also had a lot of Challenger results that were not scaled sufficiently.

Coding
Jeff Sackmann at tennisabstract.com published some Python code to run the Markov models a few years ago (here’s a link to his 2014 predictions, which you may like more than mine). He uses similar inputs and generates the probability that player A wins the match. I modified Jeff’s code for my purposes, then wrapped it within a Monte Carlo simulation and ran it 50,000 times.
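The Monte Carlo wrapper can be sketched in Python roughly as follows. This is not the code used for this post, and it is not Jeff’s code either. The names simulate_draw, monte_carlo and toy_win_prob are made up, and the toy strength ratings stand in for the Markov model that would normally supply each match probability.

```python
import random
from collections import Counter, defaultdict


def simulate_draw(draw, win_prob, rng):
    """Play one single-elimination bracket and return {player: round index of exit}.

    draw lists players in bracket order and must have a power-of-two length.
    Round indices start at 0; the champion gets the index one past the final.
    """
    exits = {}
    alive = list(draw)
    round_no = 0
    while len(alive) > 1:
        winners = []
        # Adjacent entries in the bracket play each other.
        for a, b in zip(alive[::2], alive[1::2]):
            a_wins = rng.random() < win_prob(a, b)
            winners.append(a if a_wins else b)
            exits[b if a_wins else a] = round_no
        alive = winners
        round_no += 1
    exits[alive[0]] = round_no
    return exits


def monte_carlo(draw, win_prob, trials, seed=0):
    """Repeat the bracket many times and count where each player finishes."""
    rng = random.Random(seed)
    tally = defaultdict(Counter)
    for _ in range(trials):
        for player, rnd in simulate_draw(draw, win_prob, rng).items():
            tally[player][rnd] += 1
    return tally


# Toy stand-in for the Markov model: hypothetical strength ratings.
strength = {"A": 4.0, "B": 1.0, "C": 2.0, "D": 1.0}


def toy_win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])


tally = monte_carlo(list(strength), toy_win_prob, trials=10_000)

if __name__ == "__main__":
    for player in strength:
        print(player, [tally[player][r] for r in range(3)])
```

Each printed row has the same shape as a row in a results table: the number of trials in which the player went out in each round, followed by the number of titles. For a 128-player draw the same loop runs seven rounds per trial.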

I am not posting my entire code on GitHub just yet, but hope to soon. I need to refine my entire process, soup to nuts, before I feel comfortable with that.

THE RESULTS
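The table below shows how far each player advanced across the 50,000 trials. Columns R1 through F count the trials in which the player lost in that round, W counts tournament wins, and PCT is W as a share of all trials. For instance, Roger Federer lost in the first round in 1,802 of the 50,000 trials, but won the tournament 16,895 times.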

Federer seems to be the biggest winner here with Rafael Nadal out. I know this isn’t perfect, but it is a good start and something to work with moving forward. There are some basic assumptions I make and some data that needs refining, but overall I am satisfied with the outcome.

PLAYER R1 R2 R3 R16 Q S F W PCT
Roger-Federer 1802 1399 4752 5083 5099 8057 6913 16895 33.8%
Tomas-Berdych 4984 1008 5294 6528 7710 11934 5119 7423 14.8%
Novak-Djokovic 244 18702 4269 3604 5764 4425 5658 7334 14.7%
Andy-Murray 795 7560 7183 7213 13748 5073 4877 3551 7.1%
Gilles-Muller 5497 25948 3678 3013 3864 2726 2705 2569 5.1%
Milos-Raonic 3273 16891 7391 7462 5420 5007 2704 1852 3.7%
Stan-Wawrinka 10010 5230 11653 5295 7827 5558 2753 1674 3.3%
Kei-Nishikori 9637 4005 2004 16234 7388 6449 2777 1506 3.0%
David-Ferrer 6078 14349 3023 8772 9880 5189 1576 1133 2.3%
Blaz-Kavcic 6213 9611 16077 5177 6755 3917 1524 726 1.5%
Marin-Cilic 17736 6164 6662 8257 6488 3217 905 571 1.1%
Peter-Gojowczyk 6533 24512 5850 5298 3446 2674 1152 535 1.1%
Jared-Donaldson 21742 3033 10349 5732 6508 1532 677 427 0.9%
David-Goffin 2882 3878 15412 13738 10691 2144 836 419 0.8%
Roberto-Bautista-Agut 9661 6577 12280 16611 2253 1578 685 355 0.7%
Adrian-Mannarino 9823 8095 13559 14600 1878 1270 528 247 0.5%
Paolo-Lorenzi 4830 17542 13070 6225 6295 1285 515 238 0.5%
Simone-Bolelli 12534 12440 6899 10541 4619 2052 681 234 0.5%
Facundo-Bagnis 22299 5898 6488 11127 2308 1093 590 197 0.4%
Bernard-Tomic 16933 18795 2488 5200 4245 1712 432 195 0.4%
Igor-Sijsling 10360 6161 21214 6118 3278 2052 626 191 0.4%
Ernests-Gulbis 10555 16235 7808 10532 2450 1743 487 190 0.4%
Gael-Monfils 28258 2975 8844 4478 4152 860 292 141 0.3%
Dominic-Thiem 13221 16989 7085 8985 1991 1272 317 140 0.3%
Ivo-Karlovic 12126 5162 26714 3018 1564 938 342 136 0.3%
Jo-Wilfried-Tsonga 20942 8099 8426 7785 3365 856 395 132 0.3%
Richard-Gasquet 13310 18723 9403 4126 3460 664 222 92 0.2%
Yen-Hsun-Lu 15007 14542 16524 1747 1306 530 254 90 0.2%
Benoit-Paire 20522 10624 8481 6731 2643 642 277 80 0.2%
Grigor-Dimitrov 15605 14089 10471 5488 3458 618 197 74 0.1%
Philipp-Kohlschreiber 27701 5789 5829 8094 1545 654 317 71 0.1%
Marcos-Baghdatis 32264 5563 4588 4262 2312 804 148 59 0.1%
John-Isner 14229 12162 12604 8587 1547 599 217 55 0.1%
Kevin-Anderson 3544 18025 16794 7219 3297 922 151 48 0.1%
Juan-Monaco 29058 7290 6559 4912 1679 336 124 42 0.1%
Sam-Querrey 2888 23684 19638 1877 1236 444 192 41 0.1%
Alexander-Kudryavtsev 24959 5369 15596 2145 1205 570 117 39 0.1%
Bradley-Klahn 21459 11049 12642 2322 1899 430 162 37 0.1%
Evgeny-Donskoy 25041 5464 15495 2100 1199 560 116 25 0.1%
Steve-Johnson 22309 10788 9442 5674 1130 539 93 25 0.1%
Dudi-Sela 751 25740 14179 6185 2661 380 84 20 0.0%
Tommy-Robredo 19699 17062 5206 5582 1758 570 106 17 0.0%
Radek-Stepanek 11794 30753 3478 2102 1464 282 111 16 0.0%
Andreas-Beck 2542 27796 11149 6347 1739 323 89 15 0.0%
Sergiy-Stakhovsky 15749 11889 12577 7060 2052 550 109 14 0.0%
Julien-Benneteau 29478 9134 6133 3817 1163 200 61 14 0.0%
Wayne-Odesnik 40363 3301 1383 3746 849 295 54 9 0.0%
James-McGee 10881 25214 7925 4574 1153 192 52 9 0.0%
Andrey-Kuznetsov 28541 9826 9132 1393 891 159 49 9 0.0%
Jiri-Vesely 39990 3818 3995 1132 811 202 44 8 0.0%
Tatsuma-Ito 27691 9767 7701 3842 650 298 43 8 0.0%
Lleyton-Hewitt 45016 1036 2024 1202 499 183 34 6 0.0%
Ivan-Dodig 21405 16081 8031 3614 590 241 32 6 0.0%
Mikhail-Youzhny 14304 19251 10826 4501 894 191 27 6 0.0%
Marco-Chiudinelli 20849 21573 4124 2442 820 164 22 6 0.0%
Dustin-Brown 33067 12283 1366 1966 1030 242 41 5 0.0%
Jan-Lennard-Struff 21002 16351 8368 3712 434 111 17 5 0.0%
Blaz-Rola 23408 15137 9150 1331 812 118 40 4 0.0%
Gilles-Simon 10414 5720 27011 4809 1745 265 32 4 0.0%
Thomaz-Bellucci 17702 25481 4806 1182 643 167 15 4 0.0%
Feliciano-Lopez 28595 13364 5638 2021 284 85 9 4 0.0%
Jeremy-Chardy 22487 19476 5697 1275 794 223 45 3 0.0%
Fernando-Verdasco 26592 13988 7748 1047 529 77 17 2 0.0%
Jerzy-Janowicz 25303 14188 7428 2276 656 132 15 2 0.0%
Ryan-Harrison 34395 9438 4306 1399 417 34 9 2 0.0%
Dusan-Lajovic 24697 14643 7500 2296 715 128 20 1 0.0%
Alejandro-Falla 27513 16945 4151 827 449 97 17 1 0.0%
Edouard-Roger-Vasselin 30301 13073 3280 2566 648 120 11 1 0.0%
Jack-Sock 17867 26513 1780 3228 486 114 11 1 0.0%
Fabio-Fognini 15873 23836 7021 2983 230 47 9 1 0.0%
Sam-Groth 16829 31352 1087 534 148 41 8 1 0.0%
Guillermo-Garcia-Lopez 34993 9023 5491 333 124 27 8 1 0.0%
Illya-Marchenko 29151 16700 2530 1245 321 47 5 1 0.0%
Victor-Estrella-Burgos 39640 4215 5357 598 145 39 5 1 0.0%
Kenny-De-Schepper 39445 7622 1860 930 112 26 4 1 0.0%
Mikhail-Kukushkin 28998 13406 5509 1915 134 34 3 1 0.0%
Andreas-Seppi 34251 8382 5378 1699 261 27 1 1 0.0%
Matthias-Bachinger 38206 11021 552 173 41 5 1 1 0.0%
Daniel-Gimeno-Traver 9294 29670 6064 4411 431 97 33 0 0.0%
Vasek-Pospisil 37466 7425 2746 1885 410 58 10 0 0.0%
Lukas-Rosol 3411 36289 9252 834 174 33 7 0 0.0%
Lukas-Lacko 36779 9154 2435 1390 181 57 4 0 0.0%
Paul-Henri-Mathieu 44503 5116 256 70 41 10 4 0 0.0%
Tobias-Kamke 9550 7288 27984 4629 464 82 3 0 0.0%
Marinko-Matosevic 48198 762 571 318 113 35 3 0 0.0%
Tim-Smyczek 20643 22118 5015 2062 125 34 3 0 0.0%
Marcos-Giron 35771 8081 4583 1414 124 24 3 0 0.0%
Pere-Riba 40177 4868 3526 1305 104 17 3 0 0.0%
Teymuraz-Gabashvili 22253 21265 5983 390 91 15 3 0 0.0%
Andreas-Haider-Maurer 40339 4504 3543 1481 108 23 2 0 0.0%
Filip-Krajinovic 29357 16801 2890 900 40 10 2 0 0.0%
Donald-Young 43787 3968 1813 305 106 20 1 0 0.0%
Denis-Istomin 36690 9754 2577 701 260 17 1 0 0.0%
Jarkko-Nieminen 37874 4004 7699 332 73 17 1 0 0.0%
Nicolas-Mahut 32298 15471 1808 306 102 14 1 0 0.0%
Alejandro-Gonzalez 24004 22641 2772 478 101 3 1 0 0.0%
Andrey-Golubev 34127 13201 2166 478 25 2 1 0 0.0%
Yoshihito-Nishioka 45170 3981 714 113 21 0 1 0 0.0%
Damir-Dzumhur 43922 4573 607 655 215 28 0 0 0.0%
Benjamin-Becker 43467 5726 551 194 54 8 0 0 0.0%
Pablo-Andujar 32133 16181 760 850 70 6 0 0 0.0%
Marcel-Granollers 12044 29804 7917 210 19 6 0 0 0.0%
Martin-Klizan 12104 35990 1429 390 82 5 0 0 0.0%
Nick-Kyrgios 35696 10478 3088 667 66 5 0 0 0.0%
Santiago-Giraldo 27747 17902 4055 243 48 5 0 0 0.0%
Frank-Dancevic 17016 28639 3479 755 108 3 0 0 0.0%
Taro-Daniel 46727 2871 309 78 13 2 0 0 0.0%
Dmitry-Tursunov 25996 21351 2271 317 64 1 0 0 0.0%
Aleksandr-Nedovyesov 39119 9397 1234 239 10 1 0 0 0.0%
Niels-Desein 47118 1643 1091 139 8 1 0 0 0.0%
Radu-Albot 39586 4073 6027 279 35 0 0 0 0.0%
Leonardo-Mayer 9476 29457 10544 512 11 0 0 0 0.0%
Albert-Ramos-Vinolas 33171 16487 255 76 11 0 0 0 0.0%
Federico-Delbonis 23729 20814 5281 167 9 0 0 0 0.0%
Noah-Rubin 26271 19393 4197 131 8 0 0 0 0.0%
Matthew-Ebden 40450 4523 4807 213 7 0 0 0 0.0%
Joao-Sousa 32984 15840 1045 125 6 0 0 0 0.0%
Michael-Llodra 40706 8643 555 93 3 0 0 0 0.0%
Robin-Haase 49205 666 115 11 3 0 0 0 0.0%
Pablo-Cuevas 46456 3144 374 24 2 0 0 0 0.0%
Steve-Darcis 37896 11966 124 14 0 0 0 0 0.0%
Jurgen-Melzer 37956 11030 1005 9 0 0 0 0 0.0%
Albert-Montanes 40524 8732 738 6 0 0 0 0 0.0%
Pablo-Carreno-Busta 47458 2446 93 3 0 0 0 0 0.0%
Diego-Schwartzman 49756 234 8 2 0 0 0 0 0.0%
Maximo-Gonzalez 47112 2751 136 1 0 0 0 0 0.0%
Carlos-Berlocq 49249 733 17 1 0 0 0 0 0.0%
Borna-Coric 46589 3335 76 0 0 0 0 0 0.0%