Tennis & Artificial Intelligence
By combining data used to predict match outcomes with post-match statistics (number of forehand unforced errors, break points won, double faults, etc.), Andre Cornman and his team have built a data source covering 46,114 matches.
To build the most accurate model, Andre gathered a wealth of match information from the past two decades and tested a number of machine learning algorithms before he was satisfied with his results. He felt it was necessary to give a variety of machine learning schools of thought a chance, and he landed on Bayesian machine learning, the same approach employed by RebellionResearch.com.
The bottom line is that the most successful algorithm scales quickly without consuming too much processing power. That made the Bayesian approach the obvious winner, thanks to its ability to handle a rapidly changing data set, much like the stock market.
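To illustrate why a Bayesian approach copes well with constantly arriving data, here is a minimal sketch of a Beta-Bernoulli update for one head-to-head matchup. It is not Andre's model: the uniform prior, the function names, and the four match results are all assumptions made for illustration. The point is that each new result only nudges two counters, so nothing has to be retrained from scratch.

```python
def update_beta(alpha, beta, player_a_won):
    """Fold one new match result into the Beta(alpha, beta) belief."""
    if player_a_won:
        return alpha + 1, beta
    return alpha, beta + 1


def posterior_mean(alpha, beta):
    """Current estimate of the probability that player A wins."""
    return alpha / (alpha + beta)


# Assumed starting point: a uniform prior, i.e. no opinion about the matchup.
alpha, beta = 1, 1

# Hypothetical results arriving one at a time (True means player A won).
for player_a_won in [True, True, False, True]:
    alpha, beta = update_beta(alpha, beta, player_a_won)

estimate = posterior_mean(alpha, beta)
print(alpha, beta, round(estimate, 3))
```

Each update costs the same small amount of work no matter how many matches came before, which is the kind of scalability the paragraph above describes.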
Two popular questions in machine learning discussions are “How well can we tune this hyperparameter?” and “How accurate is the model when we cross-validate it?” First, a hyperparameter is a setting chosen before training that controls how the machine learns predictive trends from the data; unlike the model's internal parameters, it is not learned from the data itself. Tuning hyperparameters involves experimenting with several different values, analyzing the results, and repeating the process.
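The tuning loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the team's code: the matches are synthetic, the model is a simple k-nearest-neighbors vote on ranking gap, and the hyperparameter being tuned is k. Every name here is hypothetical.

```python
import random


def make_matches(n, seed):
    """Synthetic (ranking_gap, favorite_won) pairs; purely illustrative data."""
    rng = random.Random(seed)
    matches = []
    for _ in range(n):
        gap = rng.uniform(0, 100)
        p_favorite = 0.5 + 0.4 * gap / 100  # bigger gap, likelier favorite win
        matches.append((gap, 1 if rng.random() < p_favorite else 0))
    return matches


def knn_predict(train, gap, k):
    """Majority vote among the k training matches with the closest gap."""
    nearest = sorted(train, key=lambda m: abs(m[0] - gap))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0  # k is odd below, so no ties


def accuracy(train, valid, k):
    hits = sum(knn_predict(train, gap, k) == label for gap, label in valid)
    return hits / len(valid)


def tune_k(train, valid, candidates):
    """Try each candidate value of k and keep the best validation score."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


train = make_matches(300, seed=1)
valid = make_matches(100, seed=2)
best_k, scores = tune_k(train, valid, candidates=[1, 5, 15, 45])
print(best_k, scores)
```

In practice the loop is repeated with new candidate values near the current best, which is the iterative part of the process.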
Answering the first question for each method has helped data scientists understand how the tennis algorithm might develop. To answer the second question, cross-validation runs the algorithm on different subsets of the input data and then checks its predictions against a held-out set of data that the model was never trained on.
It can be thought of as a component of the scientific method, where every experiment needs a control. In this scenario, the held-out data acts as the control: because the trained model has never seen it, the model's performance there shows how well it has learned to generalize patterns rather than memorize them.
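Cross-validation as described above can be sketched as follows. This is a minimal k-fold example under assumptions: the data is synthetic, and the "model" is a hypothetical cutoff on ranking gap rather than any of the algorithms the team tested. Each fold takes a turn as the held-out set while the model is fit on the rest.

```python
import random


def make_matches(n, seed=0):
    """Synthetic (ranking_gap, favorite_won) pairs; purely illustrative data."""
    rng = random.Random(seed)
    matches = []
    for _ in range(n):
        gap = rng.uniform(0, 100)
        p_favorite = 0.5 + 0.4 * gap / 100
        matches.append((gap, 1 if rng.random() < p_favorite else 0))
    return matches


def fit_threshold(train):
    """Pick the ranking-gap cutoff with the best accuracy on the training rows."""
    def hits(cutoff):
        return sum((gap >= cutoff) == bool(label) for gap, label in train)
    return max(range(0, 101, 10), key=hits)


def predict(cutoff, gap):
    return 1 if gap >= cutoff else 0


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, k):
    """Score the model on each fold after fitting on the other k - 1 folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        cutoff = fit_threshold(train)
        correct = sum(predict(cutoff, gap) == label for gap, label in test)
        scores.append(correct / len(test))
    return scores


matches = make_matches(200)
fold_scores = cross_validate(matches, k=5)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Averaging the fold scores gives a steadier estimate of how the model would fare on matches it has never seen than any single train/test split would.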
After discussing the two big questions hovering over machine learning case studies, here are the higher level advantages and disadvantages of the tested algorithms:
When data comes from a sport like tennis, with a substantial number of variables and thousands of documented matches, tuning the parameters is vital to the continuous improvement of the model. For instance, the team determined that their latest iteration of the random forest model is correct only fifty percent of the time when it picks the lower-ranked player to win.
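A figure like "correct fifty percent of the time when it picks the lower-ranked player" is an accuracy computed only over the model's upset picks. The sketch below shows one way to compute it. The record layout, field names, and six sample matches are hypothetical, chosen so the arithmetic is easy to follow; they are not the team's data.

```python
def upset_pick_accuracy(records):
    """Accuracy restricted to matches where the model picked the lower-ranked player.

    Returns None when the model made no upset picks at all.
    """
    upset_picks = [r for r in records if r["predicted"] == r["lower_ranked"]]
    if not upset_picks:
        return None
    correct = sum(r["actual"] == r["predicted"] for r in upset_picks)
    return correct / len(upset_picks)


# Hypothetical predictions: four upset picks (two right), two favorite picks.
records = [
    {"lower_ranked": "B", "predicted": "B", "actual": "B"},  # upset pick, right
    {"lower_ranked": "D", "predicted": "D", "actual": "C"},  # upset pick, wrong
    {"lower_ranked": "F", "predicted": "E", "actual": "E"},  # favorite pick
    {"lower_ranked": "H", "predicted": "H", "actual": "H"},  # upset pick, right
    {"lower_ranked": "J", "predicted": "J", "actual": "I"},  # upset pick, wrong
    {"lower_ranked": "L", "predicted": "K", "actual": "L"},  # favorite pick
]

print(upset_pick_accuracy(records))  # 2 correct out of 4 upset picks -> 0.5
```

Tracking this number separately from overall accuracy matters because a model can look strong overall simply by picking favorites, while its upset picks remain no better than a coin flip.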
There is a great degree of complexity in analyzing head-to-head matchups, because you have to account for factors such as surface and previous match history, along with intangibles like a player’s confidence. It will be interesting to see how this random forest model evolves and how the team goes about incorporating these distinct inputs.
Written by Jason Scanlon, Edited by Rachel Weissman & Alexander Fleiss