arxiv.org - René Manassé Galekwa, Jean Marie Tshimula, Etienne Gael Tajeuna, Kyamakya Kyandoghere
Abstract:The sports betting industry has experienced rapid growth, driven largely by technological advancements and the proliferation of online platforms. Machine learning (ML) has played a pivotal role in the transformation of this sector by enabling more accurate predictions, dynamic odds-setting, and enhanced risk management for both bookmakers and bettors. This systematic review explores various ML techniques, including support vector machines, random forests, and neural networks, as applied in different sports such as soccer, basketball, tennis, and cricket. These models utilize historical data, in-game statistics, and real-time information to optimize betting strategies and identify value bets, ultimately improving profitability. For bookmakers, ML facilitates dynamic odds adjustment and effective risk management, while bettors leverage data-driven insights to exploit market inefficiencies. This review also underscores the role of ML in fraud detection, where anomaly detection models are used to identify suspicious betting patterns. Despite these advancements, challenges such as data quality, real-time decision-making, and the inherent unpredictability of sports outcomes remain. Ethical concerns related to transparency and fairness are also of significant importance. Future research should focus on developing adaptive models that integrate multimodal data and manage risk in a manner akin to financial portfolios. This review provides a comprehensive examination of the current applications of ML in sports betting, and highlights both the potential and the limitations of these technologies.
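As a concrete illustration of the value-bet idea mentioned in this abstract (a minimal sketch only; the probabilities, odds, and function names below are invented for the example), a bet is a candidate "value bet" when a model's estimated win probability exceeds the probability implied by the bookmaker's decimal odds:

```python
# Minimal value-bet sketch (illustrative only): a bet has positive expected
# value when the model's win probability exceeds the probability implied by
# the bookmaker's decimal odds. Probabilities and odds below are made up.

def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds (ignores the bookmaker margin)."""
    return 1.0 / decimal_odds

def expected_value(model_prob: float, decimal_odds: float, stake: float = 1.0) -> float:
    """Expected profit of a bet under the model's probability."""
    return model_prob * (decimal_odds - 1.0) * stake - (1.0 - model_prob) * stake

candidates = [
    {"match": "A vs B", "model_prob": 0.55, "odds": 2.10},
    {"match": "C vs D", "model_prob": 0.40, "odds": 2.20},
]

for bet in candidates:
    ev = expected_value(bet["model_prob"], bet["odds"])
    flag = "VALUE" if bet["model_prob"] > implied_probability(bet["odds"]) else "skip"
    print(f'{bet["match"]}: EV per unit stake = {ev:+.3f} ({flag})')
```

A real system would also account for the bookmaker's overround, real-time information, and staking or risk management, which this simple expected-value test ignores.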
arxiv.org - Ryan S. Brill, Abraham J. Wyner
Abstract:Football analysts traditionally determine the relative value of draft picks by average future player value at each draft position. One implication is the loser's curse: top draft picks belonging to last year's worst teams produce less surplus value on average than draft picks later in the first round belonging to better teams. Additionally, these valuations do not match the valuation implied by the trade market. Either general managers are making terrible trades on average, or there is a sound economic reason for the discrepancy; we are partial to the latter explanation. Traditional analyses don't consider that variance in performance decays convexly across the draft, causing eliteness (e.g., right tail probability) to decay much more steeply than expected value. Because elite players have an outsize influence on winning the Super Bowl, we suspect general managers value performance nonlinearly, placing exponentially higher value on players as their eliteness increases. Draft curves that account for this closely resemble the trade market. Additionally, we create draft curves that adjust for position via a novel Bayesian hierarchical Beta regression model. We find that if you are interested in an elite quarterback, there is no loser's curse.
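The core argument can be made concrete with a small numerical sketch (the decay curves and the elite threshold below are invented; the paper's surplus-value data and Beta regression are not reproduced): if expected value declines gently across the first round while the spread of outcomes shrinks convexly, the right-tail probability of landing an elite player collapses far faster than the mean.

```python
# Illustrative only: expected performance declines linearly over the first
# round, while the standard deviation of outcomes shrinks convexly. The
# right-tail probability of an "elite" outcome then drops far more steeply
# than the expected value itself. All curves are made up.
import numpy as np
from scipy.stats import norm

picks = np.arange(1, 33)                        # first-round pick numbers
mean = 1.0 - 0.015 * (picks - 1)                # slowly declining expected value
sd = 0.45 * np.exp(-0.06 * (picks - 1)) + 0.10  # convexly decaying spread
elite_threshold = 1.8                           # "elite" outcome (arbitrary units)

p_elite = 1.0 - norm.cdf(elite_threshold, loc=mean, scale=sd)

for p in (1, 8, 16, 32):
    i = p - 1
    print(f"pick {p:>2}: E[value] = {mean[i]:.2f}, P(elite) = {p_elite[i]:.4f}")
```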
arxiv.org - Josh Brown, Yutong Bu, Zachary Cheesman, Benjamin Orman, Iris Horng, Samuel Thomas, Amanda Harsy, Adam Schultze
Abstract:In this research, we examine the capabilities of different mathematical models to accurately predict various levels of the English football pyramid. Existing work has largely focused on top-level play in European leagues; however, our work analyzes teams throughout the entire English Football League system. We modeled team performance using weighted Colley and Massey ranking methods which incorporate player valuations from the widely-used website Transfermarkt to predict game outcomes. Our initial analysis found that lower leagues are more difficult to forecast in general. Yet, after removing dominant outlier teams from the analysis, we found that top leagues were just as difficult to predict as lower leagues. We also extended our findings using data from multiple German and Scottish leagues. Finally, we discuss reasons to doubt that Transfermarkt's predictive value can be attributed to the wisdom of the crowd.
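For readers unfamiliar with the ranking machinery, the sketch below shows a weighted Colley system in miniature (the per-game weights and toy fixtures are illustrative; the paper's exact Transfermarkt-based weighting scheme may differ). Setting every weight to 1 recovers the standard Colley method.

```python
# A compact Colley rating sketch with per-game weights (illustrative; the
# paper's exact Transfermarkt weighting may differ). Each game is
# (winner, loser, weight); weight 1.0 recovers the standard Colley method.
import numpy as np

def weighted_colley(teams, games):
    idx = {t: k for k, t in enumerate(teams)}
    n = len(teams)
    C = 2.0 * np.eye(n)            # Colley matrix starts as 2*I
    b = np.ones(n)                 # right-hand side starts as 1
    for winner, loser, w in games:
        i, j = idx[winner], idx[loser]
        C[i, i] += w
        C[j, j] += w
        C[i, j] -= w
        C[j, i] -= w
        b[i] += w / 2.0            # winner gains half a (weighted) game
        b[j] -= w / 2.0            # loser loses half a (weighted) game
    ratings = np.linalg.solve(C, b)
    return dict(zip(teams, ratings))

# Toy example: weights could encode relative Transfermarkt squad values.
print(weighted_colley(["A", "B", "C"],
                      [("A", "B", 1.0), ("A", "C", 1.2), ("B", "C", 0.8)]))
```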
arxiv.org - Louise Schmidt, Cristian Lillo, Javier Bustos
Abstract:In this article we revise the football performance score PlayeRank, designed and evaluated by Pappalardo et al. in 2019. First, we analyze the weights extracted from the Linear Support Vector Machine (SVM) that solves the classification problem of "which set of events has a higher impact on the chances of winning a match". Here, we notice that the previously published results include the Goal-Scored event during the training phase, which produces inconsistencies. We fix these inconsistencies and present new weights that solve the same problem. Following the intuition that the best team should always win a match, we define the team's quality as the average number of players involved in the game. We show that, using the original PlayeRank, in 94.13% of the matches either the superior team beats the inferior team or the teams end tied when their scores are similar. Finally, we present a way to use PlayeRank in an online fashion using modified free analysis tools. Using this modified version of PlayeRank, we performed an online analysis of a real football match every five minutes of play. We evaluated the usefulness of that information with experts and managers, and conclude that the obtained data indeed provide useful information that was not previously available to the manager during the match.
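A minimal sketch of the weight-extraction step described above, run on synthetic data rather than the PlayeRank event logs (the event types, counts, and labels are invented): fit a linear SVM that classifies match outcome from per-match event counts and inspect the learned weights, with the Goal-Scored event deliberately left out of the feature set, mirroring the correction discussed in the abstract.

```python
# Sketch of the weight-extraction step (synthetic data, not the PlayeRank
# dataset): fit a linear SVM that classifies whether a team won from its
# per-match event counts, then read the learned weight of each event type.
# The "goal_scored" feature is intentionally excluded.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
event_types = ["pass_accurate", "shot_on_target", "tackle_won", "foul_committed"]

n_matches = 400
X = rng.poisson(lam=[320, 5, 18, 12], size=(n_matches, len(event_types))).astype(float)
# Synthetic labels: more accurate passes and shots on target raise win odds.
logits = 0.01 * (X[:, 0] - 320) + 0.4 * (X[:, 1] - 5) - 0.05 * (X[:, 3] - 12)
y = (logits + rng.normal(0, 1, n_matches) > 0).astype(int)

model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
model.fit(X, y)

weights = model.named_steps["linearsvc"].coef_.ravel()
for name, w in sorted(zip(event_types, weights), key=lambda t: -abs(t[1])):
    print(f"{name:>16}: {w:+.3f}")
```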
arxiv.org - Manuele Leonelli
Abstract:Biathlon is a unique winter sport that combines precision rifle marksmanship with the endurance demands of cross-country skiing. We develop a Bayesian hierarchical model to predict and understand shooting performance using data from the 2021/22 Women's World Cup season. The model captures athlete-specific, position-specific, race-type, and stage-dependent effects, providing a comprehensive view of shooting accuracy variability. By incorporating dynamic components, we reveal how performance evolves over the season, with model validation showing strong predictive ability at both overall and individual levels. Our findings highlight substantial athlete-specific differences and underscore the value of personalized performance analysis for optimizing coaching strategies. This work demonstrates the potential of advanced Bayesian modeling in sports analytics, paving the way for future research in biathlon and similar sports requiring the integration of technical and endurance skills.
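A stripped-down sketch of this kind of hierarchical model, written with PyMC on synthetic data (the athlete count, priors, and prone/standing encoding are assumptions; the paper's race-type, stage, and dynamic season components are omitted):

```python
# A minimal hierarchical sketch of the kind of model described (PyMC,
# synthetic data): hits out of 5 shots per shooting bout, with athlete-
# and position-specific effects on the logit of hit probability.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n_athletes, n_bouts = 20, 8
athlete_idx = np.repeat(np.arange(n_athletes), n_bouts)
prone_idx = np.tile(np.array([1, 0] * (n_bouts // 2)), n_athletes)  # 1 = prone, 0 = standing

true_ability = rng.normal(0, 0.5, n_athletes)
p_true = 1 / (1 + np.exp(-(1.2 + true_ability[athlete_idx] + 0.5 * prone_idx)))
hits = rng.binomial(5, p_true)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.5)                    # overall shooting level
    sigma_a = pm.HalfNormal("sigma_a", 1.0)           # athlete-to-athlete spread
    athlete = pm.Normal("athlete", 0.0, sigma_a, shape=n_athletes)
    prone_bonus = pm.Normal("prone_bonus", 0.0, 1.0)  # prone vs standing effect

    logit_p = mu + athlete[athlete_idx] + prone_bonus * prone_idx
    pm.Binomial("hits", n=5, p=pm.math.invlogit(logit_p), observed=hits)

    idata = pm.sample(1000, tune=1000, chains=2, target_accept=0.9, random_seed=1)
```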
arxiv.org - Max Grody, Sandeep Bansal, Huthaifa I. Ashqar
Abstract:Daily fantasy baseball has shortened the life cycle of an entire fantasy season into a single day. Today, it is familiar to more than 10 million people around the world who participate in online fantasy sports. As daily fantasy continues to grow, the importance of selecting a winning lineup becomes ever greater. The purpose of this paper is to determine how accurate FanDuel's current strategy of optimizing daily lineups is, and to use Python and linear programming to build a lineup optimizer for daily fantasy sports, with the goal of proposing a more accurate model to help daily fantasy participants select a winning lineup.
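The optimization core of such a tool can be sketched with PuLP (the roster slots, salary cap, player names, salaries, and point projections below are all invented and much smaller than a real FanDuel contest):

```python
# A minimal daily-fantasy lineup optimizer sketch using PuLP (illustrative;
# real FanDuel rosters have more positions and a larger salary cap, and the
# projections below are invented). It maximizes projected points subject to
# a salary cap and per-position roster slots.
import pulp

players = [
    # (name, position, salary, projected_points) -- all values made up
    ("Pitcher A", "P", 9000, 35.0), ("Pitcher B", "P", 7500, 30.0),
    ("Catcher A", "C", 3000, 9.0),  ("Catcher B", "C", 2600, 8.0),
    ("1B A", "1B", 4200, 12.5),     ("1B B", "1B", 3500, 10.0),
    ("OF A", "OF", 4800, 13.0),     ("OF B", "OF", 4100, 11.5),
    ("OF C", "OF", 3600, 10.5),     ("OF D", "OF", 3000, 9.5),
]
slots = {"P": 1, "C": 1, "1B": 1, "OF": 3}   # simplified roster
salary_cap = 25000

prob = pulp.LpProblem("lineup", pulp.LpMaximize)
pick = {name: pulp.LpVariable(f"pick_{i}", cat="Binary")
        for i, (name, *_) in enumerate(players)}

prob += pulp.lpSum(pick[name] * pts for name, _, _, pts in players)           # objective
prob += pulp.lpSum(pick[name] * sal for name, _, sal, _ in players) <= salary_cap
for pos, count in slots.items():
    prob += pulp.lpSum(pick[name] for name, p, _, _ in players if p == pos) == count

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([name for name in pick if pick[name].value() == 1])
```

A full optimizer would plug in projected points from a forecasting model and enforce FanDuel's complete roster rules rather than this reduced slot set.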
arxiv.org - Levi Harris
Abstract:We present a reliable temporal grounding pipeline for video-to-analytic alignment of basketball broadcast footage. Given a series of frames as input, our method quickly and accurately extracts time-remaining and quarter values from basketball broadcast scenes. Our work intends to expedite the development of large, multi-modal video datasets to train data-hungry video models in the sports action recognition domain. Our method aligns a pre-labeled corpus of play-by-play annotations containing dense event annotations to video frames, enabling quick retrieval of labeled video segments. Unlike previous methods, we forgo the need to localize game clocks by fine-tuning an out-of-the-box object detector to find semantic text regions directly. Our end-to-end approach improves the generality of our work. Additionally, interpolation and parallelization techniques prepare our pipeline for deployment in a large computing cluster. All code is made publicly available.
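A toy version of the alignment step, assuming the pipeline has already produced per-frame clock readings (the frame values, event list, and linear gap-filling below are illustrative, not the paper's implementation):

```python
# A small alignment sketch (not the paper's full pipeline): given per-frame
# (quarter, seconds_remaining) values read from the broadcast scoreboard,
# fill OCR gaps by interpolation and map each play-by-play event to the
# first frame at or before its game-clock timestamp. All data are synthetic.
import numpy as np

# Per-frame clock readings; NaN marks frames where text extraction failed.
seconds_remaining = np.array([720.0, np.nan, 718.0, 717.0, np.nan, np.nan, 714.0])
frame_ids = np.arange(len(seconds_remaining))

ok = ~np.isnan(seconds_remaining)
filled = np.interp(frame_ids, frame_ids[ok], seconds_remaining[ok])  # linear gap-fill

# Play-by-play events, each tagged with the game clock at which it occurred.
events = [("made 3PT", 718.0), ("defensive rebound", 715.5)]

for label, clock in events:
    # Clock counts down, so take the first frame whose reading has reached it.
    frame = int(frame_ids[filled <= clock][0])
    print(f"{label!r} at {clock:.1f}s -> frame {frame}")
```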
arxiv.org - Yiqi Zhong, Luming Liang, Bohan Tang, Ilya Zharkov, Ulrich Neumann
Abstract:We introduce motion graph, a novel approach to the video prediction problem, which predicts future video frames from limited past data. The motion graph transforms patches of video frames into interconnected graph nodes, to comprehensively describe the spatial-temporal relationships among them. This representation overcomes the limitations of existing motion representations such as image differences, optical flow, and motion matrix that either fall short in capturing complex motion patterns or suffer from excessive memory consumption. We further present a video prediction pipeline empowered by motion graph, exhibiting substantial performance improvements and cost reductions. Experiments on various datasets, including UCF Sports, KITTI and Cityscapes, highlight the strong representative ability of motion graph. On UCF Sports in particular, our method matches or outperforms the SOTA methods while reducing model size by 78% and GPU memory utilization by 47%.
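To make the representation concrete, here is a toy patch-graph construction in the spirit of the description above (not the authors' implementation; the patch size, search window, and cosine-similarity edge weights are assumptions for this sketch):

```python
# A toy construction in the spirit of the described representation (not the
# authors' implementation): each frame is split into patches, every patch
# becomes a node, and each node is linked to the most similar patch inside a
# small search window of the next frame. Frames here are random arrays.
import numpy as np

def to_patches(frame, p):
    h, w = frame.shape
    return {(i, j): frame[i*p:(i+1)*p, j*p:(j+1)*p].ravel()
            for i in range(h // p) for j in range(w // p)}

def motion_edges(frame_t, frame_t1, p=8, window=1):
    """Edges (node_t, node_t1, similarity) between consecutive frames."""
    a, b = to_patches(frame_t, p), to_patches(frame_t1, p)
    edges = []
    for (i, j), vec in a.items():
        best = None
        for di in range(-window, window + 1):
            for dj in range(-window, window + 1):
                key = (i + di, j + dj)
                if key in b:
                    sim = float(np.dot(vec, b[key]) /
                                (np.linalg.norm(vec) * np.linalg.norm(b[key]) + 1e-8))
                    if best is None or sim > best[1]:
                        best = (key, sim)
        edges.append(((i, j), best[0], best[1]))
    return edges

rng = np.random.default_rng(0)
f0, f1 = rng.random((32, 32)), rng.random((32, 32))
print(motion_edges(f0, f1)[:3])
```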
arxiv.org - Jiacheng Sun, Hsiang-Wei Huang, Cheng-Yen Yang, Zhongyu Jiang, Jenq-Neng Hwang
Abstract:Multi-object tracking in sports scenarios has become one of the focal points in computer vision, experiencing significant advancements through the integration of deep learning techniques. Despite these breakthroughs, challenges remain, such as accurately re-identifying players upon re-entry into the scene and minimizing ID switches. In this paper, we propose an appearance-based global tracklet association algorithm designed to enhance tracking performance by splitting tracklets containing multiple identities and connecting tracklets that appear to belong to the same identity. This method can serve as a plug-and-play refinement tool for any multi-object tracker to further boost performance. The proposed method achieved a new state-of-the-art performance on the SportsMOT dataset with a HOTA score of 81.04%. Similarly, on the SoccerNet dataset, our method enhanced multiple trackers' performance, consistently increasing the HOTA score from 79.41% to 83.11%. These significant and consistent improvements across different trackers and datasets underscore our proposed method's potential impact on sports player tracking applications. We open-source our project codebase at this https URL.
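A simplified flavor of appearance-based tracklet association (not the paper's exact splitting and connecting algorithm; the embeddings, similarity threshold, and frame ranges are synthetic): represent each tracklet by its mean appearance embedding and merge pairs that are highly similar and temporally disjoint.

```python
# Simplified appearance-based tracklet association (not the paper's exact
# algorithm): represent each tracklet by the mean of its appearance
# embeddings, then merge pairs whose embeddings are highly similar and
# whose frame ranges do not overlap. Embeddings are synthetic.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def associate(tracklets, sim_threshold=0.8):
    """tracklets: list of dicts with 'emb' (N, D) array and 'frames' (start, end)."""
    merged = []
    for t in tracklets:
        rep = t["emb"].mean(axis=0)
        target = None
        for group in merged:
            overlap = not (t["frames"][1] < group["frames"][0] or
                           group["frames"][1] < t["frames"][0])
            if not overlap and cosine(rep, group["rep"]) > sim_threshold:
                target = group
                break
        if target is None:
            merged.append({"rep": rep, "frames": list(t["frames"]), "members": [t["id"]]})
        else:
            target["members"].append(t["id"])
            target["frames"][0] = min(target["frames"][0], t["frames"][0])
            target["frames"][1] = max(target["frames"][1], t["frames"][1])
    return [g["members"] for g in merged]

rng = np.random.default_rng(2)
base = rng.normal(size=16)
tracklets = [
    {"id": "t1", "emb": base + rng.normal(0, 0.05, (10, 16)), "frames": (0, 90)},
    {"id": "t2", "emb": base + rng.normal(0, 0.05, (8, 16)),  "frames": (120, 200)},
    {"id": "t3", "emb": rng.normal(size=(12, 16)),            "frames": (50, 150)},
]
print(associate(tracklets))   # expect t1 and t2 grouped, t3 separate
```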
cnrs.fr - David Holzmüller, Léo Grinsztajn, Ingo Steinwart
For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) improved default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 71 classification and 47 regression datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 48 classification and 42 regression datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results show that RealMLP offers a better time-accuracy tradeoff than other neural nets and is competitive with GBDTs. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results on medium-sized tabular datasets (1K--500K samples) without hyperparameter tuning.
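The kind of comparison run in this benchmark can be illustrated with off-the-shelf scikit-learn models (the sketch uses a plain MLPClassifier, not RealMLP, and a single small dataset, so it only shows the shape of the experiment, not the paper's results):

```python
# An illustrative GBDT-vs-MLP comparison with library default parameters
# (plain scikit-learn MLP, not RealMLP), evaluated by cross-validation on
# one small tabular dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "GBDT (defaults)": HistGradientBoostingClassifier(random_state=0),
    "MLP  (defaults)": make_pipeline(StandardScaler(),
                                     MLPClassifier(max_iter=1000, random_state=0)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```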
arxiv.org - Assaf Shmuel, Oren Glickman, Teddy Lazebnik
Abstract:The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this area. Previous comparative benchmarks have shown that DL performance is frequently equivalent or even inferior to models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets where DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of our comparison: we evaluate 111 datasets with 20 different models, including both regression and classification tasks. These datasets vary in scale and include both those with and without categorical variables. Importantly, our benchmark contains a sufficient number of datasets where DL models perform best, allowing for a thorough analysis of the conditions under which DL models excel. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We present insights derived from this characterization and compare these findings to previous benchmarks.
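The final step, a meta-model over dataset characteristics, can be sketched as follows (the meta-features and labels are synthetic stand-ins; the paper's actual 111-dataset benchmark results are not reproduced here):

```python
# A schematic version of the final step (synthetic meta-data, not the
# paper's benchmark results): describe each dataset by simple meta-features
# and train a classifier to predict whether a DL model beat the best
# alternative on it.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_datasets = 111
meta = np.column_stack([
    rng.integers(500, 500_000, n_datasets),   # number of samples
    rng.integers(5, 200, n_datasets),         # number of features
    rng.integers(0, 30, n_datasets),          # number of categorical features
    rng.random(n_datasets),                   # share of missing values
])
# Synthetic label: pretend DL tends to win on larger, mostly numeric datasets.
dl_wins = ((meta[:, 0] > 50_000) & (meta[:, 2] < 5)).astype(int)
dl_wins ^= (rng.random(n_datasets) < 0.15).astype(int)   # add label noise

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, meta, dl_wins, cv=5).mean().round(3))
```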
nature.com - Taylor Chomiak & Bin Hu
Time-series forecasting is a practical goal in many areas of science and engineering. Common approaches for forecasting future events often rely on highly parameterized or black-box models. However, these are associated with a variety of drawbacks including critical model assumptions, uncertainties in their estimated input hyperparameters, and computational cost. All of these can limit model selection and performance. Here, we introduce a learning algorithm that avoids these drawbacks. A variety of data types including chaotic systems, macroeconomic data, wearable sensor recordings, and population dynamics are used to show that Forecasting through Recurrent Topology (FReT) can generate multi-step-ahead forecasts of unseen data. With no free parameters or even a need for computationally costly hyperparameter optimization procedures in high-dimensional parameter space, the simplicity of FReT offers an attractive alternative to complex models where increased model complexity may limit interpretability/explainability and impose unnecessary system-level computational load and power consumption constraints.
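FReT itself is not reproduced here. As a loosely related, parameter-light baseline in the same model-free spirit, the sketch below forecasts several steps ahead by following the continuation of the most similar past window of the series (a simple analog forecaster; the embedding dimension and horizon are arbitrary choices for illustration).

```python
# A simple analog (nearest-past-window) forecaster, offered only as a
# model-free point of comparison -- this is not the FReT algorithm.
import numpy as np

def analog_forecast(series, embed_dim=8, horizon=10):
    """Multi-step forecast from the nearest past window of length embed_dim."""
    x = np.asarray(series, dtype=float)
    query = x[-embed_dim:]
    best_i, best_d = None, np.inf
    # Only consider windows whose continuation of length `horizon` is known.
    for i in range(len(x) - embed_dim - horizon):
        d = np.linalg.norm(x[i:i + embed_dim] - query)
        if d < best_d:
            best_i, best_d = i, d
    return x[best_i + embed_dim: best_i + embed_dim + horizon]

t = np.arange(600)
series = np.sin(0.11 * t) + 0.05 * np.random.default_rng(4).normal(size=t.size)
print(analog_forecast(series, embed_dim=16, horizon=5))
```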