Sports Analytics Weekly by kubeia.io

SharpBetting: How It Works, What You Get

youtube.com

In this special video discussion, David provides a complete overview of what SharpBetting offers, who it’s for, and how to get the most from your membership.

🧠 What You’ll Learn:
* Why SharpBetting was created and who’s behind it
* What sports and markets are currently covered
* Which membership tier suits you best
* How to use the filters, settings

Python Football Review

pythonfootball.com - Martin

So I’m starting something new from scratch: The Python Football Review.
* What it is: A free weekly newsletter with hands‑on Python templates and football deep-dives.
* Who it’s for: Fans, scouts, journalists, and complete beginners who want practical data skills—no fluff and no jargon.
* When: Every Thursday.
* What to expect:
+ Deep dives into metrics like xG, xGOT, xT, PPDA, and more
+ Step‑by‑step Python snippets for scraping, wrangling and analysing data
+ Book reviews, case studies, and “replicate‑this‑project” guides
+ Occasional forecasting pieces (because that’s where many of us started)

Better Metrics for Football Forecasts: Moving Beyond the Ranked Probability Score

pena.lt - Martin Eastwood

I've recently been questioning whether RPS is really the best tool for evaluating football forecasts - especially when our ultimate goal is to identify the most informative models as efficiently and fairly as possible. In this article, I’ll explain why RPS might not be the optimal choice, introduce alternative scoring metrics like Log Loss (also known as Ignorance Score) and the multiclass Brier score, and share some experiments / ideas I've tested to explore which metrics are best suited for evaluating football predictive models.
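
To make the three metrics named above concrete, here is a minimal sketch that scores a single home/draw/away forecast with each of them. The two forecasts and the function names are hypothetical illustrations, not taken from the article or its experiments.

```python
import math

def log_loss(probs, outcome):
    """Log loss (ignorance score): negative log of the probability given to the actual result."""
    return -math.log(probs[outcome])

def brier_multiclass(probs, outcome):
    """Multiclass Brier score: squared error summed over every possible result."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2 for i, p in enumerate(probs))

def rps(probs, outcome):
    """Ranked Probability Score: squared error of cumulative probabilities over ordered results."""
    cum_forecast = cum_observed = total = 0.0
    for i, p in enumerate(probs[:-1]):
        cum_forecast += p
        cum_observed += 1.0 if i == outcome else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (len(probs) - 1)

# Hypothetical forecasts in (home, draw, away) order; both give the home win 50%.
forecast_a = [0.5, 0.2, 0.3]
forecast_b = [0.5, 0.3, 0.2]
outcome = 0  # index of the result that happened: a home win

for name, forecast in (("A", forecast_a), ("B", forecast_b)):
    print(name, log_loss(forecast, outcome), brier_multiclass(forecast, outcome), rps(forecast, outcome))
```

Both forecasts give the observed result the same probability, so log loss and the Brier score cannot separate them, while RPS prefers forecast B because it puts more of the remaining probability on the neighbouring outcome (the draw). Whether that sensitivity to ordering is desirable is the kind of question the article explores.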

Can AI Predict Player Performance in New Team Environments?

substack.com - Alex Marin Felices

Using Fine-Tuned LEMs to Assess Soccer Players' Impact Across Various Contexts.

Expected Goals (xG) 101

pythonfootball.com - Python Football Review

What xG really measures (and what it doesn’t), the key misconception that trips up even seasoned pros, why its loudest critics are mostly wrong, and how to pull tons of xG data with Python.

Stop the Simulations! - The xG Football Club

substack.com - Alex Marin Felices

A New Approach for Faster and More Accurate Tournament Outcome Predictions.

Calculating Expected Threat in Python Using Linear Algebra

pena.lt - Martin Eastwood

Imagine you're watching a soccer match and your team's midfielder has the ball at the halfway line. How dangerous is this position? What about if they dribble forward 10 yards? Or make a pass to the wing? Expected Threat (xT), originally developed by Sarah Rudd and popularised by Karun Singh, attempts to answer these questions by quantifying the offensive value of every position on the pitch. Unlike simpler metrics such as expected goals (xG) that only measure shot quality, xT evaluates both immediate shooting opportunities and the potential for creating future scoring chances. This makes it useful for analyzing buildup play and measuring contributions from those players who don't directly create shots.
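
As a rough illustration of the linear-algebra view of xT, the sketch below solves for zone values on a toy three-zone pitch. It assumes the common formulation in which a zone's value is its shot probability times its scoring probability, plus its move probability times the value of the zones the ball is moved to. All numbers, zone names, and function names are hypothetical, and this is not the article's implementation; in practice a routine such as `numpy.linalg.solve` would replace the small hand-written solver used here to keep the example dependency-free.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def expected_threat(shoot, move, goal, trans):
    """Solve xT = shoot * goal + move * (trans @ xT), i.e. (I - diag(move) trans) xT = shoot * goal."""
    n = len(shoot)
    a = [[(1.0 if i == j else 0.0) - move[i] * trans[i][j] for j in range(n)] for i in range(n)]
    b = [shoot[i] * goal[i] for i in range(n)]
    return solve_linear(a, b)

# Hypothetical zones: 0 = defensive third, 1 = midfield, 2 = attacking third.
shoot = [0.0, 0.1, 0.4]    # probability the player shoots from the zone
move = [1.0, 0.9, 0.6]     # probability the player moves the ball instead
goal = [0.0, 0.05, 0.2]    # probability a shot from the zone is scored
trans = [                  # successful-move probabilities; rows sum to < 1 because of turnovers
    [0.30, 0.50, 0.10],
    [0.20, 0.40, 0.30],
    [0.05, 0.25, 0.50],
]

xt = expected_threat(shoot, move, goal, trans)
print([round(v, 4) for v in xt])
```

Solving the system directly gives the same values that repeatedly re-applying the xT update would converge to, without choosing a number of iterations.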

Pi Ratings: The Smarter Way to Rank Football Teams

pena.lt - Martin Eastwood

Football analytics has come a long way in recent years, moving from simple league tables to more sophisticated methods of quantifying team performance. If you’ve ever looked at Elo ratings or FIFA rankings, you know that rating systems attempt to provide a clearer picture of how good a team really is, beyond just the wins and losses. But are these systems as accurate as they could be? Imagine two teams: Team A beats Team B 1-0 in a closely fought match, while Team C thrashes Team D 5-0. Should Team A and Team C gain the same rating boost? Many traditional rating systems don't differentiate much between these results, even though one clearly signals a more dominant performance. This is where Pi Ratings come in: a dynamic rating system designed to better reflect team ability by considering score discrepancies, home vs. away performances, and recent form.

How Do Professional Football Clubs Use Data?

substack.com - Alex Marin Felices

The following summary critically reviews the research conducted by Lorenzo Lolli, Pascal Bauer, Callum Irving, Daniele Bonanno, Oliver Höner, Warren Gregson, and Valter Di Salvo, titled "Data analytics in the football industry: a survey investigating operational frameworks and practices in professional clubs and national federations from around the world." All data, figures, and analysis presented here are drawn from their original work; I do not claim any authorship or ownership of the content. This summary has been written to provide a concise and technically informed synthesis of the paper’s findings, methodologies, and implications, while maintaining fidelity to the authors’ intellectual contributions.

Why Your xG Model Might Be Wrong: The Bayesian Solution to Accurate Scoring Predictions

substack.com - Alex Marin Felices

The concept of expected goals (xG) has become a fundamental metric in football analytics, estimating the likelihood of a shot resulting in a goal based on contextual features such as shot distance, angle, and body part used. However, mainstream xG models do not account for player-specific attributes, leading to a uniform probability assignment for identical shots taken by different players. This limitation disregards variations in individual skill levels, exemplified by a scenario where Lionel Messi and a National League player take the same shot under identical conditions but are assigned the same xG value. Intuitively, Messi's superior finishing ability should yield a higher probability of scoring, yet conventional xG models fail to incorporate this effect.

Framing Causal Questions in Sports Analytics: A Case Study of Crossing in Soccer

arxiv.org - Shomoita Alam, Erica E. M. Moodie, Lucas Y. Wu, Tim B. Swartz

Abstract: Causal inference has become an accepted analytic framework in settings where experimentation is impossible, which is frequently the case in sports analytics, particularly for studying in-game tactics. However, subtle differences in implementation can lead to important differences in interpretation. In this work, we provide a case study to demonstrate the utility and the nuance of these approaches. Motivated by a case study of crossing in soccer, two causal questions are considered: the overall impact of crossing on shot creation (Average Treatment Effect, ATE) and its impact in plays where crossing was actually attempted (Average Treatment Effect on the Treated, ATT). Using data from Shandong Taishan Luneng Football Club's 2017 season, we demonstrate how distinct matching strategies are used for different estimation targets - the ATE and ATT - though both aim to eliminate any spurious relationship between crossing and shot creation. Results suggest crossing yields a 1.6% additive increase in shot probability overall compared to not crossing (ATE), whereas the ATT is 5.0%. We discuss what insights can be gained from each estimand, and provide examples where one may be preferred over the alternative. Understanding and clearly framing analytics questions through a causal lens ensure rigorous analyses of complex questions.
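
To see why the ATE and ATT can differ, here is a small sketch using stratification on made-up data. It is a simplified stand-in for the matching strategies described in the abstract, and it assumes the strata capture all confounding. The strata, counts, and shot rates are hypothetical and are not meant to reproduce the paper's 1.6% and 5.0% estimates.

```python
# Each hypothetical stratum groups similar game situations:
# (label, plays with a cross, plays without a cross, shot rate with cross, shot rate without cross)
strata = [
    ("wide, final third", 300, 200, 0.12, 0.06),
    ("central, midfield", 50, 950, 0.03, 0.03),
]

def ate_and_att(strata):
    """Average the within-stratum effects with two different sets of weights."""
    n_total = sum(n_t + n_c for _, n_t, n_c, _, _ in strata)
    n_treated = sum(n_t for _, n_t, _, _, _ in strata)
    ate = att = 0.0
    for _, n_t, n_c, rate_t, rate_c in strata:
        effect = rate_t - rate_c                 # effect of crossing inside this stratum
        ate += effect * (n_t + n_c) / n_total    # weight by share of all plays
        att += effect * n_t / n_treated          # weight by share of plays that crossed
    return ate, att

ate, att = ate_and_att(strata)
print(f"ATE = {ate:.3f}, ATT = {att:.3f}")
```

In this toy data, crosses are mostly attempted in the situations where they help, so the effect among plays that actually crossed (ATT) is larger than the effect averaged over all plays (ATE).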

Simulating MLB Seasons using Bayesian Inference and Random Walks

arxiv.org - Simon Cha

Abstract: As a dedicated follower of sports statistics and with the MLB season beginning in late March, I set out to predict how many wins each team would accumulate by the end of the 162-game season. The goal was to build a simulation framework capable of forecasting the remainder of the season, starting from a 20-game burn-in period to establish initial estimates of team strength. My approach used a Bayesian inference model incorporating team win percentage, batting average, and pitching ERA to construct a posterior distribution of win probability for each matchup. For each game, I sampled from the posterior and simulated the outcome using a Bernoulli trial. Because future matchup inputs were unobserved, I forecasted batting averages using random walks and modeled pitching ERA with Kalman filters. After simulating many seasons, the model produced a distribution of win totals for all 30 teams and can also be used to estimate each team's probability of making the postseason.
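
The "sample from the posterior, then simulate a Bernoulli trial" loop can be sketched in a few lines. This is a heavily simplified illustration, not the paper's model: it assumes a single Beta posterior per team built only from the burn-in record with a uniform prior, and it ignores opponents, batting average, ERA, random walks, and Kalman filters. The function name and the example records are hypothetical.

```python
import random

def simulate_win_totals(wins, losses, games_left=142, n_sims=2000, seed=0):
    """Simulate end-of-season win totals from a burn-in record."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        # Beta(1, 1) prior updated with the burn-in record gives a Beta posterior.
        p_win = rng.betavariate(1 + wins, 1 + losses)
        # Each remaining game is one Bernoulli trial with the sampled probability.
        future_wins = sum(rng.random() < p_win for _ in range(games_left))
        totals.append(wins + future_wins)
    return totals

hot_start = simulate_win_totals(wins=12, losses=8)
cold_start = simulate_win_totals(wins=8, losses=12)
print(sum(hot_start) / len(hot_start), sum(cold_start) / len(cold_start))
```

Because a new win probability is drawn for every simulated season, the spread of the win totals reflects both game-to-game luck and uncertainty about team strength after only 20 games.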

From Players to Champions: A Generalizable Machine Learning Approach for Match Outcome Prediction with Insights from the FIFA World Cup

arxiv.org - Ali Al-Bustami, Zaid Ghazal

Abstract: Accurate prediction of FIFA World Cup match outcomes holds significant value for analysts, coaches, bettors, and fans. This paper presents a machine learning framework specifically designed to forecast match winners in the FIFA World Cup. By integrating both team-level historical data and player-specific performance metrics such as goals, assists, passing accuracy, and tackles, we capture nuanced interactions often overlooked by traditional aggregate models. Our methodology processes multi-year data to create year-specific team profiles that account for evolving rosters and player development. We employ classification techniques complemented by dimensionality reduction and hyperparameter optimization to yield robust predictive models. Experimental results on data from the 2022 FIFA World Cup demonstrate our approach's superior accuracy compared to baseline methods. Our findings highlight the importance of incorporating individual player attributes and team-level composition to enhance predictive performance, offering new insights into player synergy, strategic match-ups, and tournament progression scenarios. This work underscores the transformative potential of rich, player-centric data in sports analytics, setting a foundation for future exploration of advanced learning architectures such as graph neural networks to model complex team interactions.

TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

arxiv.org - Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang

Abstract: Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Although recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on a temporal prior for caption generation, so they cannot process the soccer video end-to-end. Some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context, resulting in suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.

Action Anticipation from SoccerNet Football Video Broadcasts

arxiv.org - Mohamad Dalal, Artur Xarles, Anthony Cioppa, Silvio Giancola, Marc Van Droogenbroeck, Bernard Ghanem, Albert Clapés...

Abstract: Artificial intelligence has revolutionized the way we analyze sports videos, whether to understand the actions of games in long untrimmed videos or to anticipate the player's motion in future frames. Despite these efforts, little attention has been given to anticipating game actions before they occur. In this work, we introduce the task of action anticipation for football broadcast videos, which consists in predicting future actions in unobserved future frames, within a five- or ten-second anticipation window. To benchmark this task, we release a new dataset, namely the SoccerNet Ball Action Anticipation dataset, based on SoccerNet Ball Action Spotting. Additionally, we propose a Football Action ANticipation TRAnsformer (FAANTRA), a baseline method that adapts FUTR, a state-of-the-art action anticipation model, to predict ball-related actions. To evaluate action anticipation, we introduce new metrics, including mAP@δ, which evaluates the temporal precision of predicted future actions, as well as mAP@∞, which evaluates their occurrence within the anticipation window. We also conduct extensive ablation studies to examine the impact of various task settings, input configurations, and model architectures. Experimental results highlight both the feasibility and challenges of action anticipation in football videos, providing valuable insights into the design of predictive models for sports analytics. By forecasting actions before they unfold, our work will enable applications in automated broadcasting, tactical analysis, and player decision-making. Our dataset and code are publicly available at this URL.

NFL Draft Modelling: Loss Functional Analysis

arxiv.org - Tanmay Grandhisiri

Abstract: In the NFL draft, teams must strategically balance immediate player impact against long-term value, presenting a complex optimization challenge for draft capital management. This paper introduces a framework for evaluating the fairness and efficiency of draft pick trades using norm-based loss functions. Draft pick valuations are modelled by the Weibull distribution. Utilizing these valuation techniques, the research identifies key trade-offs between aggressive, immediate-impact strategies and conservative, risk-averse approaches. Ultimately, this framework serves as a valuable analytical tool for assessing NFL draft trade fairness and value distribution, aiding team decision-makers and enriching insights within the sports analytics community.

Action Valuation in Sports: A Survey

arxiv.org - Artur Xarles, Sergio Escalera, Thomas B. Moeslund, Albert Clapés

Abstract: Action Valuation (AV) has emerged as a key topic in Sports Analytics, offering valuable insights by assigning scores to individual actions based on their contribution to desired outcomes. Despite a few surveys addressing related concepts such as Player Valuation, there is no comprehensive review dedicated to an in-depth analysis of AV across different sports. In this survey, we introduce a taxonomy with nine dimensions related to the AV task, encompassing data, methodological approaches, evaluation techniques, and practical applications. Through this analysis, we aim to identify the essential characteristics of effective AV methods, highlight existing gaps in research, and propose future directions for advancing the field.

How to optimise tournament draws: The case of the 2022 FIFA World Cup

arxiv.org - László Csató

Abstract: The organisers of major sports competitions use different policies with respect to constraints in the group draw. Our paper aims to rationalise these choices by analysing the trade-off between attractiveness (the number of games played by teams from the same geographic zone) and fairness (the departure of the draw mechanism from a uniform distribution). A parametric optimisation model is formulated and applied to the 2022 FIFA World Cup draw. A flaw of the draw procedure is identified: the pre-assignment of the host to a group implies additional but unnecessary distortions. All Pareto efficient sets of draw constraints are determined via simulations. The proposed framework can be used to find the optimal draw rules of a tournament and justify the distortion of the draw procedure for the stakeholders.

Space evaluation at the starting point of soccer transitions

arxiv.org - Yohei Ogawa, Rikuhei Umemoto, Keisuke Fujii

Abstract: Soccer is a sport played on a pitch where effective use of space is crucial. Decision-making during transitions, when possession switches between teams, has become increasingly important, but research on space evaluation in these moments has been limited. Recent space evaluation methods such as OBSO (Off-Ball Scoring Opportunity) use scoring probability, so they are not well-suited for assessing areas far from the goal, where transitions typically occur. In this paper, we propose OBPV (Off-Ball Positioning Value) to evaluate space across the pitch, including the starting points of transitions. OBPV extends OBSO by introducing the field value model, which evaluates the entire pitch, and by employing the transition kernel model, which reflects positional specificity through kernel density estimation of pass distributions. Experiments using La Liga 2023/24 season tracking and event data show that OBPV highlights effective space utilization during counter-attacks and reveals team-specific characteristics in how the teams utilize space after positive and negative transitions.

Action Spotting and Precise Event Detection in Sports: Datasets, Methods, and Challenges

arxiv.org - Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal

Abstract: Video event detection has become an essential component of sports analytics, enabling automated identification of key moments and enhancing performance analysis, viewer engagement, and broadcast efficiency. Recent advancements in deep learning, particularly Convolutional Neural Networks (CNNs) and Transformers, have significantly improved accuracy and efficiency in Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES). This survey provides a comprehensive overview of these three key tasks, emphasizing their differences, applications, and the evolution of methodological approaches. We thoroughly review and categorize existing datasets and evaluation metrics specifically tailored for sports contexts, highlighting the strengths and limitations of each. Furthermore, we analyze state-of-the-art techniques, including multi-modal approaches that integrate audio and visual information, methods utilizing self-supervised learning and knowledge distillation, and approaches aimed at generalizing across multiple sports. Finally, we discuss critical open challenges and outline promising research directions toward developing more generalized, efficient, and robust event detection frameworks applicable to diverse sports. This survey serves as a foundation for future research on efficient, generalizable, and multi-modal sports event detection.

SAM 2: Segment Anything in Images and Videos

openreview.net - Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle...

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing our main model, the dataset, an interactive demo and code.

Domain Adaptation of VLM for Soccer Video Understanding

arxiv.org - Tiancheng Jiang, Henry Wang, Md Sirajus Salekin, Parmida Atighehchian, Shinan Zhang

Abstract: Vision Language Models (VLMs) have demonstrated strong performance in multi-modal tasks by effectively aligning visual and textual representations. However, most video understanding VLM research has been domain-agnostic, leaving the understanding of their transfer learning capability to specialized domains under-explored. In this work, we address this by exploring the adaptability of open-source VLMs to specific domains, focusing on soccer as an initial case study. Our approach uses large-scale soccer datasets and an LLM to create instruction-following data, and uses that data to iteratively fine-tune the general-domain VLM in a curriculum learning fashion (first teaching the model key soccer concepts, then question-answering tasks). The final adapted model, trained using a curated dataset of 20k video clips, exhibits significant improvement in soccer-specific tasks compared to the base model, with a 37.5% relative improvement for the visual question-answering task and an accuracy improvement from 11.8% to 63.5% for the downstream soccer action classification task.

From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction

arxiv.org - Vladimir Golovkin, Nikolay Nemtsev, Vasyl Shandyba, Oleg Udin, Nikita Kasatkin, Pavel Kononov, Anton Afanasiev, Sergey Ulasen...

Abstract: Game State Reconstruction (GSR), a critical task in Sports Video Understanding, involves precise tracking and localization of all individuals on the football field (players, goalkeepers, referees, and others) in real-world coordinates. This capability enables coaches and analysts to derive actionable insights into player movements, team formations, and game dynamics, ultimately optimizing training strategies and enhancing competitive advantage. Achieving accurate GSR using a single-camera setup is highly challenging due to frequent camera movements, occlusions, and dynamic scene content. In this work, we present a robust end-to-end pipeline for tracking players across an entire match using a single-camera setup. Our solution integrates a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition. By ensuring both spatial accuracy and temporal consistency, our method delivers state-of-the-art game state reconstruction, securing first place in the SoccerNet Game State Reconstruction Challenge 2024 and significantly outperforming competing methods.

I don't like NumPy

dynomight.net - dynomight

They say you can’t truly hate someone unless you loved them first. I don’t know if that’s true as a general principle, but it certainly describes my relationship with NumPy. NumPy, by the way, is some software that does computations on arrays in Python. It’s insanely popular and has had a huge influence on all the popular machine learning libraries like PyTorch. These libraries share most of the same issues I discuss below, but I’ll stick to NumPy for concreteness.

Data Shapley in One Training Run

openreview.net - Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

Data Shapley offers a principled framework for attributing the contribution of data within machine learning contexts. However, the traditional notion of Data Shapley requires re-training models on various data subsets, which becomes computationally infeasible for large-scale models. Additionally, this retraining-based definition cannot evaluate the contribution of data for a specific model training run, which may often be of interest in practice. This paper introduces a novel concept, In-Run Data Shapley, which eliminates the need for model retraining and is specifically designed for assessing data contribution for a particular model of interest. In-Run Data Shapley calculates the Shapley value for each gradient update iteration and accumulates these values throughout the training process. We present several techniques that allow the efficient scaling of In-Run Data Shapley to the size of foundation models. In its most optimized implementation, our method adds negligible runtime overhead compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

The Gambler Who Cracked the Horse-Racing Code

bloomberg.com - Kit Chellel

Bill Benter did the impossible: He wrote an algorithm that couldn’t lose at the track. Close to a billion dollars later, he tells his story for the first time.

Football Prediction Models: Which Ones Work the Best?

pena.lt - Martin Eastwood

I've recently released version 1.1.0 of my penaltyblog Python package, bringing significant improvements to the speed and predictive performance of football (soccer) goals models. With this update, I thought it would be a great opportunity to compare the different models available, such as Poisson, Dixon and Coles, and more, exploring how they work, how to optimize their parameters, and how they perform on real-world data. Let's start off with a high-level look at the different models available, looking at how they work, what their strengths are and what their weaknesses are.
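
For readers new to goals models, the simplest member of the family can be sketched without any library: treat each team's goals as an independent Poisson count and add up the scoreline probabilities. This is a plain-Python illustration and does not use or mirror the penaltyblog API. The scoring rates are hypothetical inputs; a fitted model would estimate them from attack, defence, and home-advantage parameters.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def match_probabilities(home_rate, away_rate, max_goals=10):
    """Home win / draw / away win probabilities from independent Poisson goal counts."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the small mass cut off above max_goals
    return home / total, draw / total, away / total

# Hypothetical expected goals for the home and away side.
p_home, p_draw, p_away = match_probabilities(1.6, 1.1)
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3))
```

Dixon and Coles style models start from the same idea and then adjust the probabilities of low-scoring results, which independent Poisson counts tend to get slightly wrong.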

Sports Analytics Weekly by kubeia.io - 21/2025
