arxiv.org - Philipp Fleig, Vijay Balasubramanian
Every interaction of a living organism with its environment involves the placement of a bet. Armed with partial knowledge about a stochastic world, the organism must decide its next step or near-term strategy, an act that implicitly or explicitly involves the assumption of a model of the world. Better information about environmental statistics can improve the bet quality, but in practice resources for information gathering are always limited. We argue that theories of optimal inference dictate that "complex" models are harder to infer with bounded information and lead to larger prediction errors. Thus, we propose a principle of "playing it safe" where, given finite information gathering capacity, biological systems should be biased towards simpler models of the world, and thereby to less risky betting strategies. In the framework of Bayesian inference, we show that there is an optimally safe adaptation strategy determined by the Bayesian prior. We then demonstrate that, in the context of stochastic phenotypic switching by bacteria, implementation of our principle of "playing it safe" increases fitness (population growth rate) of the bacterial collective. We suggest that the principle applies broadly to problems of adaptation, learning and evolution, and illuminates the types of environments in which organisms are able to thrive.
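The core claim, that complex models are harder to infer from limited data and produce larger prediction errors, can be illustrated with a toy experiment outside the paper's Bayesian framework. The sketch below is my own illustration, not the authors' code: it fits a simple and a complex polynomial model to a handful of noisy samples and compares held-out prediction error; the degrees, sample size, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_signal(x):
    return np.sin(x)  # the "world" generating observations

# Bounded information: only a few noisy observations.
x_train = rng.uniform(0, 3, size=8)
y_train = true_signal(x_train) + rng.normal(0, 0.3, size=8)
x_test = np.linspace(0, 3, 200)

for degree in (1, 7):  # simple vs. complex model of the world
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    mse = np.mean((pred - true_signal(x_test)) ** 2)
    print(f"degree {degree}: held-out MSE = {mse:.3f}")
```

With so few samples, the complex model typically fits the noise and incurs the larger held-out error, mirroring the paper's argument for biasing towards simpler models.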
arxiv.org - Matteo Iacopini, Eoghan O'Neill, Luca Rossini
Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams.
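To make the setup concrete: rank-order models of this kind assume each item's observed rank is determined by a latent score that is a (here nonlinear) function of covariates. A minimal generative sketch follows; it is illustrative only, and the toy function below stands in for the Bayesian additive regression tree (sum-of-trees) scores actually used in ROBART/ARROBART.

```python
import numpy as np

rng = np.random.default_rng(1)

n_items = 5
covariates = rng.normal(size=(n_items, 2))

# Latent scores: a nonlinear function of covariates plus noise.
# (A stand-in for the BART sum-of-trees function in the paper.)
scores = np.sin(covariates[:, 0]) + covariates[:, 1] ** 2 \
         + rng.normal(0, 0.1, n_items)

# Observed ranking: items ordered by latent score (rank 1 = highest score).
order = np.argsort(-scores)
ranks = np.empty(n_items, dtype=int)
ranks[order] = np.arange(1, n_items + 1)

print("latent scores:", np.round(scores, 2))
print("observed ranks:", ranks)
```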
arxiv.org - Calvin C. K. Yeung, Keisuke Fujii
Complex interactions between two opposing agents frequently occur in machine learning, game theory, and other application domains. Quantitatively analyzing the strategies involved can provide an objective basis for decision-making. One such critical scenario is shot-taking in football, where decisions, such as whether the attacker should shoot or pass the ball and whether the defender should attempt to block the shot, play a crucial role in the outcome of the game. However, there are currently no effective data-driven and/or theory-based approaches to analyzing such situations. To address this issue, we propose a novel framework based on game theory, in which we estimate the expected payoff with machine learning (ML) models and extract additional features for the ML models with a theory-based shot-block model. Conventionally, successes or failures (1 or 0) are used as payoffs, but successful shots (goals) are extremely rare in football. Therefore, we propose the Expected Probability of Shot On Target (xSOT) metric to evaluate players' actions even if the shot does not result in a goal; this allows for effective differentiation and comparison between different shots and even enables counterfactual shot situation analysis. In our experiments, we validate the framework by comparing it with baseline and ablated models. Furthermore, we observe a high correlation between xSOT and existing metrics, which suggests that xSOT provides valuable insights. Lastly, as an illustration, we study optimal strategies in the 2022 World Cup and analyze a shot situation in EURO 2020.
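As a sketch of the game-theoretic core, consider a simplified two-player zero-sum game between attacker (shoot or pass) and defender (block or hold position), with payoffs given by expected probabilities such as xSOT rather than rare binary goal outcomes. The numbers below are hypothetical, chosen only to show how a mixed-strategy equilibrium could be computed:

```python
import numpy as np

# Rows: attacker actions (shoot, pass); columns: defender actions (block, hold).
# Entries: attacker's expected payoff, e.g. an xSOT-style probability
# (hypothetical numbers, not estimates from the paper).
A = np.array([[0.10, 0.35],   # shoot vs. block / hold
              [0.25, 0.15]])  # pass  vs. block / hold

# Closed-form mixed-strategy equilibrium for a 2x2 zero-sum game
# (valid here because this payoff matrix has no saddle point).
denom = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]
p_shoot = (A[1, 1] - A[1, 0]) / denom   # attacker's probability of shooting
q_block = (A[1, 1] - A[0, 1]) / denom   # defender's probability of blocking
value = (A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]) / denom

print(f"attacker shoots with p = {p_shoot:.2f}")
print(f"defender blocks with q = {q_block:.2f}")
print(f"game value (expected payoff) = {value:.3f}")
```

At these payoffs the attacker shoots about 29% of the time and the defender blocks about 57% of the time; either player deviating unilaterally cannot improve their expected payoff.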
youtube.com
When Moneyball was published in 2003, no one could have predicted its monumental impact across business, sports, culture, and beyond. Twenty years later, the book, and the blockbuster movie that followed, have sparked a renaissance that has totally changed how organizations think about and use data. Jackie MacMullan will lead a discussion with Michael Lewis, Bill James, Shane Battier, and Daryl Morey as the group reflects on the impact and legacy of Moneyball - and analytics-driven thinking - over the last two decades.
espn.com - Kevin Seifert
Johnson suggested that numbers influence the perception of players at other positions. He cited Hall of Fame quarterback Brett Favre, whose No. 4 "gave you the illusion that he could run and do a lot of things, even though he couldn't." In contrast, Johnson suggested that the perception of then-Jets quarterback Sam Darnold's athleticism was diminished by wearing No. 14.
forbes.com - Randy Bean
The late New Yorker writer Roger Angell called baseball “The Summer Game”, a sport distinguished by colorful greats from the past, the likes of which have included Honus Wagner, Ty Cobb, Satchel Paige, Dizzy Dean, and Josh Gibson. As we head into the 2022 MLB playoff season in a few short weeks, it is an apt moment to reflect on how this American pastime, which began in the wake of the Civil War (the first professional baseball team, the Cincinnati Red Stockings, was established in 1869), has in many ways been transformed through the use of modern data and analytics, as other professional sports are being transformed as well. I have written about similar data-and-analytics transformations of 19th century businesses in other industries, such as Levi’s in retail and JP Morgan Chase in banking.
arxiv.org - Rhys Tracy, Haotian Xia, Alex Rasla, Yuan-Fang Wang, Ambuj Singh
This research aims to improve the accuracy of complex volleyball predictions and provide more meaningful insights to coaches and players. We introduce a specialized graph encoding technique to add additional contact-by-contact volleyball context to an already available volleyball dataset without any additional data gathering. We demonstrate the potential benefits of using graph neural networks (GNNs) on this enriched dataset for three different volleyball prediction tasks: rally outcome prediction, set location prediction, and hit type prediction. We compare the performance of our graph-based models to baseline models and analyze the results to better understand the underlying relationships in a volleyball rally. Our results show that the use of GNNs with our graph encoding yields a much more advanced analysis of the data, which noticeably improves prediction results overall. We also show that these baseline tasks can be significantly improved with simple adjustments, such as removing blocked hits. Lastly, we demonstrate the importance of choosing a model architecture that will better extract the important information for a certain task. Overall, our study showcases the potential strengths and weaknesses of using graph encodings in sports data analytics and hopefully will inspire future improvements in machine learning strategies across sports and applications by using graph-based encodings.
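To give a flavor of what a contact-by-contact graph encoding might look like (a hypothetical construction, not the authors' exact scheme): each touch in a rally becomes a node carrying features such as player, contact type, and court position, with directed edges following the temporal order of contacts.

```python
import numpy as np

# One rally as a sequence of contacts (hypothetical features):
# (player_id, contact_type, x, y), types: 0=receive, 1=set, 2=hit
contacts = [(7, 0, 0.9, 0.5), (3, 1, 0.6, 0.4), (5, 2, 0.7, 0.2)]

n = len(contacts)
node_features = np.array([[p, t, x, y] for p, t, x, y in contacts], dtype=float)

# Directed adjacency matrix: edge from each contact to the next one in time.
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = 1.0

# node_features and adj are what a GNN layer would consume;
# e.g. one round of message passing is simply:
messages = adj.T @ node_features  # each node aggregates its predecessor's features
print(messages)
```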
arxiv.org - Federico Fioravanti, Fernando Delbianco, Fernando Tohmé
We seek to gain more insight into the effect of crowds on the home advantage by analyzing the particular case of Argentinean football (also known as soccer), where for more than ten years, visiting team fans were not allowed to attend games. Additionally, during the COVID-19 lockdown, a significant number of games were played without both away and home team fans. The analysis of more than 20 years of matches of the Argentinean tournament indicates that the absence of away team crowds was beneficial for the Top 5 teams during the first two years after away fans were banned. An additional intriguing finding is that the absence of both crowds significantly affects all teams, to the point of turning the home advantage into a home 'disadvantage' for most teams.
arxiv.org - Nathan Sandholtz, Lucas Wu, Martin Puterman, Timothy C. Y. Chan
For decades, National Football League (NFL) coaches' observed fourth down decisions have been largely inconsistent with prescriptions based on statistical models. In this paper, we develop a framework to explain this discrepancy using a novel inverse optimization approach. We model the fourth down decision and the subsequent sequence of plays in a game as a Markov decision process (MDP), the dynamics of which we estimate from NFL play-by-play data from the 2014 through 2022 seasons. We assume that coaches' observed decisions are optimal but that the risk preferences governing their decisions are unknown. This yields a novel inverse decision problem for which the optimality criterion, or risk measure, of the MDP is the estimand. Using the quantile function to parameterize risk, we estimate which quantile-optimal policy yields the coaches' observed decisions as minimally suboptimal. In general, we find that coaches' fourth-down behavior is consistent with optimizing low quantiles of the next-state value distribution, which corresponds to conservative risk preferences. We also find that coaches exhibit higher risk tolerances when making decisions in the opponent's half of the field than in their own, and that league average fourth down risk tolerances have increased over the seasons in our data.
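The notion of a quantile-optimal policy can be made concrete with a small sketch (my own illustration with made-up value samples, not the paper's estimated NFL dynamics): instead of picking the action with the highest expected next-state value, a conservative decision maker picks the action whose low quantile of the next-state value distribution is highest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampled next-state values for each fourth-down option (hypothetical units).
value_samples = {
    "go for it":  rng.normal(loc=0.8, scale=2.0, size=10_000),  # high mean, high variance
    "punt":       rng.normal(loc=0.3, scale=0.5, size=10_000),
    "field goal": rng.normal(loc=0.5, scale=1.0, size=10_000),
}

def best_action(tau):
    """Action maximizing the tau-quantile of the next-state value distribution."""
    return max(value_samples, key=lambda a: np.quantile(value_samples[a], tau))

print("median-optimizing (tau=0.5):", best_action(0.5))  # favors the high-mean gamble
print("conservative (tau=0.1):     ", best_action(0.1))  # favors the low-variance punt
```

Low values of tau reproduce the conservative behavior the paper attributes to coaches; the inverse problem in the paper runs the other way, inferring tau from observed decisions.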
arxiv.org - Ken Yamamoto, Seiya Uezu, Keiichiro Kagawa, Yoshihiro Yamazaki, Takuma Narizuka
In this study, the stochastic properties of player and team ball possession times in professional football matches are examined. Data analysis shows that player possession time follows a gamma distribution and that the player count of a team possession event follows a mixture of two geometric distributions. We propose a formula expressing team possession time in terms of player possession time and the player count of a team possession, verifying its validity through data analysis. Furthermore, we calculate an approximate form of the distribution of team possession time and study its asymptotic properties.
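The paper's decomposition can be simulated directly: draw a player count from a mixture of two geometric distributions, then sum that many gamma-distributed individual possession times to get a team possession time. The sketch below uses made-up parameter values purely for illustration, not the paper's fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

def team_possession_times(n_events=10_000):
    # Player count per team possession: mixture of two geometric distributions
    # (hypothetical mixture weight and success probabilities).
    use_first = rng.random(n_events) < 0.6
    counts = np.where(use_first,
                      rng.geometric(p=0.5, size=n_events),
                      rng.geometric(p=0.15, size=n_events))
    # Each player's possession time: gamma-distributed (hypothetical shape/scale).
    return np.array([rng.gamma(shape=2.0, scale=1.2, size=c).sum() for c in counts])

t = team_possession_times()
print(f"mean team possession time: {t.mean():.2f} s")
```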
arxiv.org - Yiqi Zhong, Luming Liang, Ilya Zharkov, Ulrich Neumann
A central challenge of video prediction lies in reasoning about objects' future motions from image frames while simultaneously maintaining the consistency of their appearances across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods that usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of each and every pair of feature patches in the input frames, and are the sole input of the motion prediction module in MMVP. This design improves video prediction in both accuracy and efficiency, and reduces the model size. Results of extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public data sets by non-negligible margins (about 1 dB in PSNR on UCF Sports) with significantly smaller models (84% of the size or smaller).
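The motion matrix idea is easy to sketch in isolation: given patch-level feature maps from two frames, compute the similarity of every patch in frame t with every patch in frame t+1; once similarities are taken, appearance information drops out. A minimal numpy version follows (illustrative only; MMVP's actual patch features come from a learned encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

h = w = 8   # feature map is h x w patches
c = 32      # channels per patch feature

feat_t  = rng.normal(size=(h * w, c))   # patch features of frame t
feat_t1 = rng.normal(size=(h * w, c))   # patch features of frame t+1

# Cosine similarity between every patch pair -> (h*w) x (h*w) motion matrix.
a = feat_t / np.linalg.norm(feat_t, axis=1, keepdims=True)
b = feat_t1 / np.linalg.norm(feat_t1, axis=1, keepdims=True)
motion_matrix = a @ b.T

print(motion_matrix.shape)  # (64, 64): appearance-agnostic temporal similarities
```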
arxiv.org - Nitin Nilesh, Tushar Sharma, Anurag Ghosh, C. V. Jawahar
Analysis of player movements is a crucial subset of sports analysis. Existing player movement analysis methods use recorded videos after the match is over. In this work, we propose an end-to-end framework for player movement analysis of badminton matches on live broadcast video. Unlike other approaches that use multi-modal sensor data, our approach uses only visual cues from the match. We propose a method to calculate the on-court distance covered by both players from the video feed of a live broadcast badminton match. To perform this analysis, we focus on the gameplay by removing replays and other redundant parts of the broadcast. We then perform player tracking to identify and track the movements of both players in each frame. Finally, we calculate the distance covered by each player and the average speed at which they move on the court. We further show a heatmap of the areas covered by each player on the court, which is useful for analyzing their gameplay. Our proposed framework was successfully used to analyze live broadcast matches in real time during the Premier Badminton League 2019 (PBL 2019), with commentators and broadcasters appreciating its utility.
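Once per-frame player positions are available in court coordinates, the distance and speed computations are straightforward; a minimal sketch with invented tracking data might look like this:

```python
import numpy as np

fps = 30.0  # broadcast frame rate

# Tracked player position per frame in court coordinates (metres);
# hypothetical data standing in for the tracker's output.
positions = np.array([[1.0, 2.0], [1.1, 2.2], [1.3, 2.5], [1.6, 2.9]])

steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)  # metres per frame
distance = steps.sum()
avg_speed = steps.mean() * fps  # metres per second

print(f"distance covered: {distance:.2f} m, average speed: {avg_speed:.2f} m/s")
```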
arxiv.org - Yisheng Pei, Varuna De Silva, Mike Caine
Although data-driven analysis of football players' performance has been developed for years, most research focuses only on on-ball events such as shots and passes, while off-ball movement remains a little-explored area in this domain. As a result, players' contributions to the match are evaluated unevenly: those who have more chances to score goals earn more credit than others, while the indirect and less noticeable impact of continuous off-ball movement is ignored. This research presents a novel deep-learning network architecture capable of predicting the potential end location of passes and how players' movement before the pass affects the final outcome. Trained on more than 28,000 pass events, the model achieves a robust Top-1 accuracy of more than 0.7. Based on these predictions, a better understanding of pitch control and pass options can be reached, allowing players' off-ball movement contribution to defensive performance to be measured. Moreover, the model provides football analysts with a better tool and metric for understanding how players' movement over time contributes to game strategy and final victory.
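For context on the Top-1 figure: if the pitch is discretized into zones and the model outputs a probability for each zone, Top-1 accuracy is simply the fraction of passes whose true end zone is the model's highest-probability zone. A sketch of the metric follows; the zone grid and predictions are invented, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n_passes, n_zones = 1000, 30  # e.g. a 6 x 5 grid over the pitch (hypothetical)
probs = rng.dirichlet(np.ones(n_zones), size=n_passes)  # model's zone probabilities
true_zone = rng.integers(0, n_zones, size=n_passes)     # observed pass end zones

top1 = np.mean(probs.argmax(axis=1) == true_zone)
print(f"Top-1 accuracy: {top1:.3f}")
```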
arxiv.org - Jiaben Chen, Huaizu Jiang
Human-centric video frame interpolation has great potential for improving people's entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available in the community, none of them is dedicated to human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution (≥720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. This highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve accuracy, we introduce two loss terms based on human-aware priors, adding auxiliary supervision from panoptic segmentation and human keypoint detection, respectively. The loss terms are model-agnostic and can easily be plugged into any video frame interpolation approach. Experimental results validate the effectiveness of our proposed loss terms, leading to consistent performance improvements over 5 existing models and establishing strong baselines on our benchmark. The dataset and code can be found at: this https URL.
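The model-agnostic loss design can be sketched abstractly: the interpolation reconstruction loss is augmented with weighted auxiliary terms supervising human keypoints and segmentation on the synthesized frame. A schematic PyTorch version follows; the weights and individual loss choices are my guesses, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_frame, gt_frame,
               pred_keypoints, gt_keypoints,
               pred_seg_logits, gt_seg_labels,
               w_kp=0.1, w_seg=0.1):
    # Standard frame-interpolation reconstruction loss.
    l_recon = F.l1_loss(pred_frame, gt_frame)
    # Auxiliary human-aware supervision on the synthesized frame.
    l_kp = F.mse_loss(pred_keypoints, gt_keypoints)
    l_seg = F.cross_entropy(pred_seg_logits, gt_seg_labels)
    return l_recon + w_kp * l_kp + w_seg * l_seg

# Example with dummy tensors (batch of 2, 17 keypoints, 5 segmentation classes):
loss = total_loss(
    torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
    torch.rand(2, 17, 2), torch.rand(2, 17, 2),
    torch.randn(2, 5, 64, 64), torch.randint(0, 5, (2, 64, 64)),
)
print(loss.item())
```

Because the extra terms only touch the predicted frame, they can be bolted onto any interpolation backbone, which is the sense in which the paper calls them model-agnostic.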
arxiv.org - Jerrin Bright, Yuhao Chen, John Zelek
Using videos to analyze pitchers in baseball can play a vital role in strategizing and injury prevention. Computer vision-based pose analysis offers a time-efficient and cost-effective approach. However, the use of accessible broadcast videos, with a 30fps framerate, often results in partial body motion blur during fast actions, limiting the performance of existing pose keypoint estimation models. Previous works have primarily relied on fixed backgrounds, assuming minimal motion differences between frames, or utilized multiview data to address this problem. Instead, we propose a synthetic data augmentation pipeline to enhance the model's capability to deal with the pitcher's blurry actions. In addition, we leverage in-the-wild videos to make our model robust under different real-world conditions and camera positions. By carefully optimizing the augmentation parameters, we observed a notable reduction in loss of 54.2% and 36.2% on the test dataset for 2D and 3D pose estimation, respectively. Applying our approach to existing state-of-the-art pose estimators, we demonstrate an average improvement of 29.2%. The findings highlight the effectiveness of our method in mitigating the challenges posed by motion blur, thereby enhancing the overall quality of pose estimation.
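The augmentation idea, synthesizing the motion blur that 30fps broadcast footage produces during fast actions, can be sketched with a simple linear blur kernel (a generic illustration; the paper's pipeline and parameters may differ):

```python
import numpy as np

def linear_motion_blur(image, length=9):
    """Blur each row with a length-tap box kernel, mimicking
    blur along a (here horizontal) direction of fast motion."""
    kernel = np.ones(length) / length
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis=1, arr=image)

rng = np.random.default_rng(0)
frame = rng.random((64, 64))           # grayscale frame, hypothetical data
augmented = linear_motion_blur(frame)  # training input with synthetic blur
print(augmented.shape)
```

Training on such blurred copies (with the original sharp keypoint labels) teaches the pose estimator to remain accurate when the pitching arm smears across frames.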
ssrn.com - Ali Kakhbod, Seyed Mohammad Kazempour, Jesse H. Jones
Tweet-level data from a social media platform reveals low average accuracy and high dispersion in the quality of advice by financial influencers, or “finfluencers”: 28% of finfluencers are skilled, generating 2.6% monthly abnormal returns; 16% are unskilled; and 56% have negative skill (“antiskill”), generating -2.3% monthly abnormal returns. Consistent with homophily shaping finfluencers’ social networks, antiskilled finfluencers have more followers and more influence on retail trading than skilled finfluencers. The advice of antiskilled finfluencers creates overly optimistic beliefs most of the time and persistent swings in followers’ beliefs. Consequently, finfluencers cause excessive trading and inefficient prices, such that a contrarian strategy yields 1.2% monthly out-of-sample performance.
arxiv.org - Eren Unlu
Based on recent empirical observations, it has been argued that the most significant factor in developing accurate language models may be proper dataset content and training strategy, rather than the number of parameters, training duration, or dataset size. Following this argument, we fine-tuned a general-purpose causal language model of one billion parameters on a dataset curated from team statistics of the first ten game weeks of the Italian football league, using low-rank adaptation. The limited training dataset was compiled within a framework where a powerful commercial large language model provided distilled paragraphs and question-answer pairs as intended. The training duration was kept relatively short to provide a basis for our minimal-setting exploration. In this article, we share our key observations on developing a special-purpose language model intended to interpret soccer data with constrained resources.
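A minimal version of the described setup with the Hugging Face `peft` library might look as follows; the model name, rank, and target modules below are placeholders, not the values used in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "some-1b-causal-lm"  # placeholder for the 1B-parameter base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Low-rank adaptation: train small adapter matrices instead of all 1B weights.
config = LoraConfig(
    r=8,                                  # adapter rank (hypothetical)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# Fine-tuning then proceeds as usual, e.g. with transformers.Trainer on the
# distilled paragraphs and question-answer pairs described above.
```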
youtube.com - Daniel Whithopf
Classical C++ linear algebra libraries offer sparse matrix types where the sparseness is only known at run-time. At compile-time, some libraries also incorporate information about the matrix shape (upper/lower triangular, diagonal, symmetric) but not about sparsity. However, due to run-time / SIMD considerations, these sparse matrix shapes are stored as dense matrices, meaning that 50% or more of the entries are trivially 0. While potentially beneficial for run-time, this also introduces a significant memory overhead. In this talk, we will see how it is indeed possible to get a free lunch by combining memory efficiency with run-time efficiency.
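The memory half of the trade-off is easy to see in any language: a lower-triangular n x n matrix has only n(n+1)/2 potentially nonzero entries, so packed storage with the index mapping k = i(i+1)/2 + j roughly halves the footprint. A language-agnostic sketch of that indexing (in Python for brevity; the talk's implementation is C++):

```python
import numpy as np

n = 4
dense = np.tril(np.arange(1.0, n * n + 1).reshape(n, n))  # lower-triangular matrix

# Packed storage: keep only the n*(n+1)//2 entries on or below the diagonal.
packed = dense[np.tril_indices(n)]

def at(packed, i, j):
    """Element (i, j) of the lower-triangular matrix, requires j <= i."""
    return packed[i * (i + 1) // 2 + j]

assert at(packed, 2, 1) == dense[2, 1]
print(f"dense: {dense.size} entries, packed: {packed.size} entries")
```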
youtube.com - Amir Kirsh , Alex Dathskovsky
Many developers believe that large language models have the potential to become a game changer for programming. In this talk we will explore the impacts and possibilities of incorporating AI bots into software development. How would this change the way we create software, and what would be the effects on our work processes? By examining the mistakes made in current ChatGPT-generated C++ code, what lessons can we learn about the challenges of working with machine-generated code? Additionally, how can we best prepare for the new era ahead? The talk will present some actual generated examples to illustrate these points, extrapolating to the future development of AI tools and their implications with C++ in mind. Audience participation in the discussion will be encouraged, with attendees sharing thoughts and comments.
youtube.com
Vector processing to accelerate computation was developed more than forty years ago, in the 1970s. At the time it was limited to extremely expensive machines dedicated to large mathematical problems. By 2016, single instruction, multiple data (SIMD) registers and pipelines had started occupying the silicon of processors available on every desktop. While the early promise of SIMD seemed to rely on the idea of the compiler vectorizing loops automatically, that mostly has not happened. Instead, over time programmers started exploring how to utilize SIMD directly by altering algorithms to exploit parallelism. The performance results were staggering, with some SIMD algorithms trouncing the performance of highly optimized code. In 2023, a new era is dawning in which portable SIMD applications can be built on top of libraries targeted at application developers, most notably C++'s std::simd.
playingnumbers.com - Tyler James Burch
In 2015, MLB introduced Statcast to all 30 stadiums. This system monitors player and ball movement and has provided a wealth of new information, in the process introducing many new terms to broadcasting parlance. Two specific terms, exit velocity and launch angle, have been used quite frequently since, with good reason – they’re very evocative of the action happening on the field.
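As a quick illustration of what the two numbers encode together: ignoring air resistance, the ideal range of a batted ball depends on exit velocity and launch angle through the standard projectile formula, range = v² sin(2θ) / g. The sketch below uses that vacuum approximation, so real batted-ball distances come out considerably shorter because of drag.

```python
import numpy as np

def vacuum_range_ft(exit_velo_mph, launch_angle_deg):
    """Projectile range ignoring drag: v^2 * sin(2*theta) / g."""
    v = exit_velo_mph * 0.44704               # mph -> m/s
    theta = np.radians(launch_angle_deg)
    range_m = v ** 2 * np.sin(2 * theta) / 9.81
    return range_m * 3.28084                  # m -> ft

# A well-struck fly ball: 100 mph off the bat at 28 degrees.
print(f"{vacuum_range_ft(100, 28):.0f} ft (no-drag upper bound)")
```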
twitter.com
The single biggest argument about statistics: is probability frequentist or Bayesian? It's neither, and I'll explain why.
twitter.com
Confounding factors are variables that can cause or prevent the outcome of interest, are not intermediate variables, and are not associated with the factor under investigation. They can provide a false perception of association between the study variable and the outcome.
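A small simulation makes the definition concrete: below, a confounder Z drives both the exposure X and the outcome Y, so X and Y appear correlated even though neither causes the other, and the spurious association disappears once Z is adjusted for. (Illustrative toy data only.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)      # confounder: causes both X and Y
x = z + rng.normal(size=n)  # exposure: no causal effect on Y
y = z + rng.normal(size=n)  # outcome

print(f"raw corr(X, Y): {np.corrcoef(x, y)[0, 1]:.2f}")  # spurious, ~0.5

# Adjust for Z by residualizing both variables on it.
x_res = x - z * np.polyfit(z, x, 1)[0]
y_res = y - z * np.polyfit(z, y, 1)[0]
print(f"corr after adjusting for Z: {np.corrcoef(x_res, y_res)[0, 1]:.2f}")  # ~0
```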