sciencedirect.com - Tadgh Hegarty, Karl Whelan
We compare the properties of betting market odds set in two distinct markets for a large sample of European soccer matches. We confirm inefficiencies in the traditional market for bets on a home win, an away win, or a draw, as found in previous studies such as Angelini and De Angelis (2019). In particular, there is a strong pattern of favourite–longshot bias. Conversely, we document how a betting market that has emerged in recent years, the Asian handicap market, can generate efficient forecasts for the same set of matches using a new methodology for mapping its odds into probabilities.
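Converting quoted decimal odds into implied probabilities is the standard first step in this kind of comparison; the simplest approach divides out the bookmaker's margin (the overround). A minimal sketch of that basic normalization in Python — not the paper's new mapping methodology for Asian handicap odds:

```python
# Convert decimal odds for a 1X2 market into implied probabilities
# by removing the bookmaker's overround via proportional normalization.
def implied_probabilities(decimal_odds):
    raw = [1.0 / o for o in decimal_odds]  # inverse odds; sum > 1 due to margin
    overround = sum(raw)
    return [p / overround for p in raw]

# Example: home / draw / away odds
odds = [2.10, 3.40, 3.60]
probs = implied_probabilities(odds)
print(probs, sum(probs))  # normalized probabilities now sum to 1
```

The favourite–longshot bias shows up precisely when these normalized probabilities systematically understate favourites' true win rates and overstate longshots'.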
pythonfootball.com - Python Football Review
What each metric measures, the gaps they fill (left by xA and xAG), which players have posted the best build-up numbers this season, and how to pull xGChain/xGBuildup data with Python.
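xGChain and xGBuildup figures are published by Understat, and one common way to pull them is to parse the JSON blob embedded in an Understat league page. A rough sketch, assuming the page still embeds a playersData variable (the URL, regex, and field names reflect Understat's conventions, not anything guaranteed by the article):

```python
import json
import re
import requests

# Fetch Understat's EPL page and extract the embedded playersData JSON.
# Assumes the page still ships data in a `var playersData = JSON.parse('...')` tag.
url = "https://understat.com/league/EPL/2023"
html = requests.get(url, timeout=30).text

match = re.search(r"playersData\s*=\s*JSON\.parse\('(.+?)'\)", html)
raw = match.group(1).encode("utf-8").decode("unicode_escape")  # undo \xNN escapes
players = json.loads(raw)

# Each record carries per-season totals, including xGChain and xGBuildup.
top = sorted(players, key=lambda p: float(p["xGBuildup"]), reverse=True)[:10]
for p in top:
    print(p["player_name"], p["team_title"], p["xGChain"], p["xGBuildup"])
```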
arxiv.org - Andrey Skripnikov, Sujit Sivadanam
Abstract: American football is unique in that offensive and defensive units typically consist of separate players who don't share the field simultaneously, which tempts one to evaluate them independently. However, a team's offensive and defensive performances often complement each other. For instance, turnovers forced by the defense can create easier scoring opportunities for the offense. Using drive-by-drive data from 2014-2020 Division-I college football (Football Bowl Subdivision, FBS) and 2009-2017 National Football League (NFL) seasons, we identify complementary football features that impact scoring the most. We employ regularized ordinal regression with an elastic penalty, enabling variable selection and partially relaxing the proportional odds assumption. Moreover, given the importance of accounting for strength of the opposition, we incorporate unpenalized components to ensure full adjustment for strength of schedule. For residual diagnostics of our ordinal regression models we apply the surrogate approach, creatively extending its use to non-proportional odds models. We then adjust each team's offensive (defensive) performance to project it onto a league-average complementary unit, showcasing the effects of these adjustments on team scoring. Lastly, we evaluate the out-of-sample prediction performance of our selected model, highlighting improvements gained from incorporating complementary football features alongside strength-of-schedule adjustments.
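For a flavor of the modeling approach, here is a minimal penalized ordinal regression on synthetic data using the mord package; note that mord offers an L2 penalty only, whereas the paper's elastic-penalty ordinal model with unpenalized strength-of-schedule components is more involved:

```python
import numpy as np
from mord import LogisticAT  # ordinal logistic regression (all-thresholds variant)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                         # stand-ins for complementary features
latent = X @ np.array([0.8, -0.5, 0.3, 0.0, 0.0, 0.6])
y = np.digitize(latent, [-1.0, 0.0, 1.0])             # ordinal outcome with 4 levels

model = LogisticAT(alpha=1.0)  # alpha is the L2 regularization strength
model.fit(X, y)
print(model.coef_)             # shrunken coefficients on the ordinal scale
```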
arxiv.org - Oktay Karakuş, Hasan Arkadaş
Abstract: Line-breaking passes (LBPs) are crucial tactical actions in football, allowing teams to penetrate defensive lines and access high-value spaces. In this study, we present an unsupervised, clustering-based framework for detecting and analysing LBPs using synchronised event and tracking data from elite matches. Our approach models opponent team shape through vertical spatial segmentation and identifies passes that disrupt defensive lines within open play. Beyond detection, we introduce several tactical metrics, including the space build-up ratio (SBR) and two chain-based variants, LBPCh^1 and LBPCh^2, which quantify the effectiveness of LBPs in generating immediate or sustained attacking threats. We evaluate these metrics across teams and players in the 2022 FIFA World Cup, revealing stylistic differences in vertical progression and structural disruption. The proposed methodology is explainable, scalable, and directly applicable to modern performance analysis and scouting workflows.
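A toy illustration of the core idea — segmenting the defending team's shape into lines by clustering a depth coordinate, then flagging passes that cross a line — with all positions and thresholds invented rather than taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy sketch: cluster defenders' depth coordinate into "lines", then flag
# a pass as line-breaking if it starts before and ends beyond a line.
rng = np.random.default_rng(1)
defender_depth = np.concatenate([
    rng.normal(30, 1.5, 4),   # back line
    rng.normal(45, 1.5, 4),   # midfield line
    rng.normal(60, 1.5, 2),   # forwards
]).reshape(-1, 1)

lines = KMeans(n_clusters=3, n_init=10, random_state=0).fit(defender_depth)
line_depths = sorted(c[0] for c in lines.cluster_centers_)

def breaks_line(pass_start_x, pass_end_x, depths):
    """A pass breaks a line if a defensive line sits between its endpoints."""
    return any(pass_start_x < d < pass_end_x for d in depths)

print(breaks_line(40.0, 52.0, line_depths))  # crosses the midfield line -> True
```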
arxiv.org - Minhao Qi, Hengrui Cai, Guanyu Hu, Weining Shen
Abstract: In sports analytics, home field advantage is a robust phenomenon where the home team wins more games than the away team. However, discovering the causal factors behind home field advantage presents unique challenges due to the non-stationary, time-varying environment of sports matches. In response, we propose a novel causal discovery method, DYnamic Non-stAtionary local M-estimatOrs (DYNAMO), to learn the time-varying causal structures of home field advantage. DYNAMO offers flexibility by integrating various loss functions, making it practical for learning linear and non-linear causal structures from a general class of non-stationary causal processes. By leveraging local information, we provide theoretical guarantees for the identifiability and estimation consistency of non-stationary causal structures without imposing additional assumptions. Simulation studies validate the efficacy of DYNAMO in recovering time-varying causal structures. We apply our method to high-resolution event data from the 2020-2021 and 2021-2022 English Premier League seasons, during which the former season had no audience presence. Our results reveal intriguing, time-varying, team-specific field advantages influenced by referee bias, which differ significantly with and without crowd support. Furthermore, the time-varying causal structures learned by our method improve goal prediction accuracy compared to existing methods.
arxiv.org - Martin Illum, Hans Christian Bechsøfft Mikkelsen, Emil Hovad
Abstract: Tennis is one of the world's biggest and most popular sports. Multiple researchers have, with limited success, modeled the outcome of matches using probability modelling or machine learning approaches. The approach presented here predicts the outcomes of points in tennis matches, estimating the probability of winning a point from the prior history of matches, the current match, the player rankings, and whether the point starts with a first or second serve. The use of historical public data from the matches and the players' rankings has made this study possible. In addition, we interpret the models in order to reveal important strategic factors for winning points. The historical data cover the years 2016 to 2020 in two Grand Slam tournaments, Wimbledon and the US Open, resulting in a total of 709 matches. Several machine learning methods are applied, e.g. logistic regression, random forest, AdaBoost, and XGBoost. These models are compared to a baseline model, namely a traditional statistical measure, in this case the average. An evaluation of the results showed that the point-level models proved to be a fraction better than the average. However, with the applied public data and the information level of the data, the approach presented here is not optimal for predicting who wins when the opponents hold similar positions in the rankings. The methodology is nonetheless interesting for examining which factors are important for the outcomes of points in tennis matches. Other, higher-quality data sets exist, e.g. from Hawk-Eye, although these are not available to the public.
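A sketch of the kind of comparison described: a logistic regression on point-level features against the naive baseline that always predicts the training-set average point-win rate. The data below is synthetic and the features are stand-ins for those in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000
rank_gap = rng.normal(0, 1, n)       # stand-in for ranking difference
first_serve = rng.integers(0, 2, n)  # 1 if the point starts on a first serve
p = 1 / (1 + np.exp(-(0.3 * rank_gap + 0.5 * first_serve - 0.1)))
y = rng.binomial(1, p)               # simulated point outcomes

X = np.column_stack([rank_gap, first_serve])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
baseline = np.full(len(y_te), y_tr.mean())  # the "average" baseline

print("model    Brier:", brier_score_loss(y_te, model.predict_proba(X_te)[:, 1]))
print("baseline Brier:", brier_score_loss(y_te, baseline))
```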
substack.com - Alex Marin Felices
The modeling of pass probabilities in football has traditionally followed two principal paths: time-to-intercept models and machine learning approaches. Time-to-intercept frameworks calculate the duration a player needs to reach a ball, thereby evaluating potential receivers. Machine learning methods, while more flexible and data-driven, often yield models that are "difficult to conceptualize" [1, 2]. This paper introduces a hybrid method: a physics-based time-to-intercept computation embedded within a statistical model. The authors outline four key desiderata for this model: (1) it must yield interpretable probabilities; (2) it must be empirically grounded in real match data; (3) it must operate predictively using only data available at the moment of the pass; and (4) it must vary smoothly with respect to small differences in intercept times, ensuring continuity in the output.
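The physics-based half of the hybrid is easy to sketch: each candidate receiver's time to reach an interception point, with reaction time and top speed as parameters. The values below are illustrative defaults, not the paper's fitted ones:

```python
import numpy as np

def time_to_intercept(player_pos, player_vel, target, v_max=8.0, t_react=0.7):
    """Time for a player to reach a target point: drift along the current
    velocity during the reaction time, then run at top speed to the target."""
    drift = player_pos + player_vel * t_react  # position after reaction time
    distance = np.linalg.norm(target - drift)
    return t_react + distance / v_max

player = np.array([20.0, 30.0])      # pitch coordinates in metres
velocity = np.array([2.0, 0.0])      # current velocity in m/s
ball_target = np.array([35.0, 40.0])
print(round(time_to_intercept(player, velocity, ball_target), 2), "s")
```

Comparing these times across all candidate receivers (and defenders) is what feeds the statistical layer that turns them into smooth, interpretable probabilities.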
mit.edu - Antonio Torralba, Phillip Isola, and William Freeman
This book covers foundational topics within computer vision, with an image processing and machine learning perspective. We want to build the reader’s intuition and so we include many visualizations. The audience is undergraduate and graduate students who are entering the field, but we hope experienced practitioners will find the book valuable as well. Our initial goal was to write a large book that provided a good coverage of the field. Unfortunately, the field of computer vision is just too large for that. So, we decided to write a small book instead, limiting each chapter to no more than five pages. Such a goal forced us to really focus on the important concepts necessary to understand each topic. Writing a short book was perfect because we did not have time to write a long book and you did not have time to read it. Unfortunately, we have failed at that goal, too.
arxiv.org - Tomasz Stanczyk, Seongro Yoon, Francois Bremond
Abstract: Multi-object tracking (MOT) is essential for sports analytics, enabling performance evaluation and tactical insights. However, tracking in sports is challenging due to fast movements, occlusions, and camera shifts. Traditional tracking-by-detection methods require extensive tuning, while segmentation-based approaches struggle with track processing. We propose McByte, a tracking-by-detection framework that integrates a temporally propagated segmentation mask as an association cue to improve robustness without per-video tuning. Unlike many existing methods, McByte does not require training, relying solely on pre-trained models and object detectors commonly used in the community. Evaluated on SportsMOT, DanceTrack, SoccerNet-tracking 2022 and MOT17, McByte demonstrates strong performance across sports and general pedestrian tracking. Our results highlight the benefits of mask propagation for a more adaptable and generalizable MOT approach. Code will be made available at this https URL.
arxiv.org - Dmitrii Vorobev, Artem Prosvetov, Karim Elhadji Daou
Abstract: We propose a computationally efficient method for real-time three-dimensional football trajectory reconstruction from a single broadcast camera. In contrast to previous work, our approach introduces a multi-mode state model with W discrete modes to significantly accelerate optimization while preserving centimeter-level accuracy -- even in cases of severe occlusion, motion blur, and complex backgrounds. The system operates on standard CPUs and achieves low latency suitable for live broadcast settings. Extensive evaluation on a proprietary dataset of 6K-resolution Russian Premier League matches demonstrates performance comparable to multi-camera systems, without the need for specialized or costly infrastructure. This work provides a practical method for accessible and accurate 3D ball tracking in professional football environments.
arxiv.org - Liam Salass, Jerrin Bright, Amir Nazemi, Yuhao Chen, John Zelek, David Clausi
Abstract: Puck detection in ice hockey broadcast videos poses significant challenges due to the puck's small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies caused by varying camera zoom and broadcast camera viewpoints. Prior works focus on appearance-based or motion-based cues of the puck without explicitly modelling the cues derived from player behaviour. Players consistently turn their bodies and direct their gaze toward the puck. Motivated by this strong contextual cue, we propose Puck Localization Using Contextual Cues (PLUCC), a novel approach for scale-aware and context-driven single-frame puck detections. PLUCC consists of three components: (a) a contextual encoder, which utilizes player orientations and positioning as helpful priors; (b) a feature pyramid encoder, which extracts multiscale features from the dual encoders; and (c) a gating decoder that combines latent features with a channel gating mechanism. For evaluation, in addition to standard average precision, we propose Rink Space Localization Error (RSLE), a scale-invariant homography-based metric for removing perspective bias from rink space evaluation. The experimental results of PLUCC on the PuckDataset dataset demonstrated state-of-the-art detection performance, surpassing previous baseline methods by an average precision improvement of 12.2% and RSLE average precision of 25%. Our research demonstrates the critical role of contextual understanding in improving puck detection performance, with broad implications for automated sports analysis.
arxiv.org - Yuto Kase, Kai Ishibe, Ryoma Yasuda, Yudai Washida, Sakiko Hashimoto
Abstract: In racket sports, such as tennis, locating the ball's position at impact is important in clarifying player and equipment characteristics, thereby aiding in personalized equipment design. High-speed cameras are used to measure the impact location; however, their excessive memory consumption limits prolonged scene capture, and manual digitization for position detection is time-consuming and prone to human error. These limitations make it difficult to effectively capture the entire playing scene, hindering the ability to analyze the player's performance. We propose a method for locating the tennis ball impact on the racket in real time using an event camera. Event cameras efficiently measure brightness changes (called 'events') with microsecond accuracy under high-speed motion while consuming less memory. These cameras enable users to continuously monitor their performance over extended periods. Our method consists of three identification steps: time range of swing, timing at impact, and contours of ball and racket. Conventional computer vision techniques are utilized along with an original event-based processing to detect the timing at impact (PATS: the amount of polarity asymmetry in time symmetry). The results of the experiments were within the permissible range for measuring tennis players' performance. Moreover, the computation time was sufficiently short for real-time applications.
uchicago.edu - Gregory F. Lawler and Vlada Limic
Random walk – the stochastic process formed by successive summation of independent, identically distributed random variables – is one of the most basic and well-studied topics in probability theory. For random walks on the integer lattice Z^d, the main reference is the classic book by Spitzer. This text considers only a subset of such walks, namely those corresponding to increment distributions with zero mean and finite variance. In this case, one can summarize the main result very quickly: the central limit theorem implies that under appropriate rescaling the limiting distribution is normal, and the functional central limit theorem implies that the distribution of the corresponding path-valued process (after standard rescaling of time and space) approaches that of Brownian motion.
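In symbols, writing S_n = X_1 + ... + X_n for i.i.d. increments with mean zero and variance sigma^2, the two limits the passage summarizes are:

```latex
% Central limit theorem and functional CLT (Donsker's theorem):
\frac{S_n}{\sigma\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1),
\qquad
\left(\frac{S_{\lfloor nt\rfloor}}{\sigma\sqrt{n}}\right)_{t\in[0,1]}
\xrightarrow{d} \left(B_t\right)_{t\in[0,1]},
```

where B is standard Brownian motion and the second convergence is convergence in distribution on path space.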
youtube.com
@JonKrohnLearns talks tabular data with Frank Hutter, Professor of Artificial Intelligence at Universität Freiburg in Germany. Despite the great strides deep learning has made in analysing images, audio, and natural language, tabular data has remained an insurmountable obstacle for it. In this episode, Frank Hutter details the path he has found around this obstacle, even with limited data, using a ground-breaking transformer architecture. Named TabPFN, this approach is vastly outperforming other architectures, as attested by a write-up of TabPFN’s capabilities in Nature. Frank talks about his work on version 2 of TabPFN, the architecture’s cross-industry applicability, and how TabPFN is able to return accurate results with synthetic data.
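TabPFN ships as a Python package with a scikit-learn-style interface. A minimal sketch of basic usage (the dataset here is a placeholder, and constructor arguments vary between package versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# TabPFN is a pre-trained transformer: there is no task-specific training
# loop, just a forward pass with the (small) training set as context.
clf = TabPFNClassifier()
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```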
xgblog.ai - Bojan Tunguz
No real world data collection process is perfect, and we are often left with all sorts of noise in our dataset: incorrectly recorded values, non-recorded values, corruption of data, etc. If we are able to spot all those irregular points, oftentimes the best we can do is treat them as missing values. Missing values are a fact of life if you work in data science, machine learning, or any other field that relies on real-world data. Most of us hardly give those data points much thought, and when we do, we rely on ready-made tools, algorithms, or rules of thumb to deal with them. However, to do them proper justice you sometimes need to dig deeper and make a judicious choice about what to do with them. And what you end up doing, as in many other circumstances in data science, boils down to the trusty old phrase "it depends". Missing data can significantly impact the results of analyses and models, potentially leading to biased or misleading outcomes.
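One concrete illustration of "it depends": gradient-boosted trees such as XGBoost accept missing values natively, learning a default split direction for them at each node, so naive imputation is not always necessary. A minimal sketch on synthetic data:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # knock out 20% of the entries

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# XGBoost routes NaNs down a learned "default direction" at each split,
# so the missing values need no imputation beforehand.
model = xgb.XGBClassifier(n_estimators=100, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
```

Whether that built-in handling beats explicit imputation depends on why the values are missing in the first place, which is exactly the post's point.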
argmin.net - Ben Recht
Earlier this year, I spent a few posts discussing de Finetti’s “score coherence” as a motivation for probability. To review, if a forecaster is scored by a proper scoring rule, then their predictions must obey the axioms of probability. If they don’t, there is always a set of forecasts they could have provided that would get a better score. Defensive Forecasting has a similar flavor. You can think of the forecasts as probabilities, and the forecaster chooses them so that no matter what the outcomes are, the errors obey the law of large numbers. Indeed, that’s a simple way to motivate the “game theoretic” strategy of Defensive Forecasting. Choose a forecast so that the error looks uncorrelated with the past, no matter what the future brings.
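A toy instance of that idea, in the simplest possible setting (binary outcomes, no side information): pick each forecast to oppose the sign of the accumulated error, which keeps the running error bounded no matter how the outcomes are chosen. This is a minimal reading of the strategy, not code from the post:

```python
import numpy as np

# Toy defensive forecast: choose p_t to push the running error back toward
# zero, so the average error vanishes regardless of the outcome sequence.
rng = np.random.default_rng(0)
T = 10_000
err = 0.0
for t in range(T):
    p = 1.0 if err > 0 else 0.0  # oppose the sign of the accumulated error
    y = rng.integers(0, 2)       # the adversary may pick y however it likes
    err += y - p

print("final average error:", err / T)
```

Because the running sum never leaves [-1, 1], the average error is O(1/T) against any outcome sequence, adversarial or not.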
arxiv.org - Juan Carlos Perdomo, Benjamin Recht
Abstract: This tutorial provides a survey of algorithms for Defensive Forecasting, where predictions are derived not by prognostication but by correcting past mistakes. Pioneered by Vovk, Defensive Forecasting frames the goal of prediction as a sequential game, and derives predictions to minimize metrics no matter what outcomes occur. We present an elementary introduction to this general theory and derive simple, near-optimal algorithms for online learning, calibration, prediction with expert advice, and online conformal prediction.
argmin.net - Ben Recht
Lots of folks in the comments were unimpressed with the example I used yesterday to defend Defensive Forecasting. Matt Hoffman wrote, “We care about a forecaster's average error, not the error of their average prediction.” John Quiggin quipped, “This isn't what I would call forecasting, but estimation.” I agree with both of them! What’s fascinating about Defensive Forecasting is that it lets you turn estimates into forecasts. If you can make predictions whose average has low error, you can also make predictions with low average error.
kumo.ai - Jan Eric Lenssen
Time series forecasting has been and will continue to be an important task in machine learning for several different applications. In this blog post, we described how to design an end-to-end pipeline for forecasting on graph structures, performing forecasting on a subset of graph nodes while using input signals from the whole graph, e.g., to combine data from multiple tables in a database. We also discussed the differences between point predictions and probabilistic forecasting using generative formulations, which we believe to be an interesting area for future investigation.