World Cup xG Lab
StatsBomb powers the past chance-quality model and shot-location views. FBref adds recent league-form context. Understat adds club xG context. DataMB adds 25/26 percentile scouting profiles. This dashboard is not a guaranteed 2026 World Cup prediction model.

Model performance

Expected Goals Model

A practical view of how the historical shot-quality model estimates chances, with source context kept separate.

StatsBomb

What xG Means Here

Expected goals estimates the chance that a shot becomes a goal based on the shot context. A 0.10 xG shot means similar shots are scored about 10% of the time.

The model describes shot quality in available historical data. It does not claim a player will score from a future location.

Production model

xgboost_xg_model

XGBoost was selected as the production expected-goals model because it performed best on log loss among the production candidates.

Why this matters for the World Cup dashboard

World Cup squads mix players from many leagues, competitions, and data sources. Training an xG model gives the dashboard one consistent way to translate historical shot locations and shot context into chance quality. For fans, that means the site can compare whether a player or team generated high-quality chances in available data, then clearly separate those chance-quality estimates from recent club context like FBref, Understat, and percentile profiles.

Long shot

Low chance

0.03 xG

A hopeful shot from distance. Similar chances are rarely scored.

Box shot

Medium chance

0.12 xG

A shot inside the box with some angle or defensive pressure.

Big chance

High chance

0.35 xG

A close-range chance where similar shots are scored much more often.
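
The three example chances above can be combined to show how xG values add up. The sketch below is illustrative only: it reuses the card values (0.03, 0.12, 0.35) and assumes the shots are independent, which real match sequences are not guaranteed to be. The function names are hypothetical and not part of the dashboard code.

```python
import math

# xG values copied from the three example cards above.
example_shots = {"long_shot": 0.03, "box_shot": 0.12, "big_chance": 0.35}


def total_xg(xg_values):
    """Sum of shot probabilities: the expected number of goals."""
    return sum(xg_values)


def prob_at_least_one_goal(xg_values):
    """1 minus the chance that every shot misses (assumes independent shots)."""
    return 1.0 - math.prod(1.0 - p for p in xg_values)


values = list(example_shots.values())
print(f"total xG: {total_xg(values):.2f}")
print(f"P(at least one goal): {prob_at_least_one_goal(values):.3f}")
```

Under these assumptions the three shots are worth about half a goal in total, while the chance of scoring at least once is somewhat lower than that sum because several goals from the same set of shots are also possible.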

Production model summary

These are the headline validation metrics for the model used by the dashboard historical xG layer.

Log loss

0.283

Rewards calibrated probabilities

StatsBomb

Brier score

0.081

Measures probability error

StatsBomb

ROC-AUC

0.795

Ranks goals above non-goals

StatsBomb

Accuracy

0.899

Secondary for xG

StatsBomb

Production Model Comparison

These are the dashboard's original production model candidates. The XGBoost model remains the production expected-goals layer.

| Model | Log loss | Brier | ROC-AUC | Accuracy | Rows |
|---|---|---|---|---|---|
| xgboost_xg_model (Best) | 0.283 | 0.081 | 0.795 | 0.899 | N/A |
| baseline_logistic_regression | 0.286 | 0.082 | 0.789 | 0.899 | N/A |
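
A minimal sketch of the selection rule described here, using only the metrics listed in the comparison above. The dictionary layout and the `pick_by_log_loss` helper are hypothetical; the actual selection code is not shown on this page.

```python
# Validation metrics copied from the production comparison above.
candidates = {
    "xgboost_xg_model": {
        "log_loss": 0.283, "brier": 0.081, "roc_auc": 0.795, "accuracy": 0.899,
    },
    "baseline_logistic_regression": {
        "log_loss": 0.286, "brier": 0.082, "roc_auc": 0.789, "accuracy": 0.899,
    },
}


def pick_by_log_loss(models):
    """Return the name of the model with the lowest log loss."""
    return min(models, key=lambda name: models[name]["log_loss"])


best = pick_by_log_loss(candidates)
# Accuracy is identical for both candidates, so it cannot separate them.
accuracy_is_tied = len({m["accuracy"] for m in candidates.values()}) == 1

print(best)
print(accuracy_is_tied)
```

Both candidates report the same accuracy, which is one reason a probability metric such as log loss is a more useful tie-breaker for an xG model.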

Research: Source Model Comparison

Research experiments compare StatsBomb-only, Understat-only, combined-source, and reduced-feature xG models. These results are shown for transparency and are not automatically promoted into the dashboard's production player xG layer.

| Model (source) | Test Source | Log loss | Brier | ROC-AUC | Accuracy | Rows |
|---|---|---|---|---|---|---|
| StatsBomb-only rich model (StatsBomb) | overall | 0.283 | 0.081 | 0.796 | 0.899 | 21,657 |
| StatsBomb-only rich model (StatsBomb) | StatsBomb | 0.283 | 0.081 | 0.796 | 0.899 | 21,657 |
| Understat-only model (Understat) | overall | 0.261 | 0.075 | 0.819 | 0.907 | 103,044 |
| Understat-only model (Understat) | Understat | 0.261 | 0.075 | 0.819 | 0.907 | 103,044 |
| Understat published xG benchmark (Understat published model), Best | Understat | 0.248 | 0.070 | 0.836 | 0.908 | 103,044 |
| Combined-source shared model (StatsBomb + Understat) | overall | 0.271 | 0.077 | 0.802 | 0.904 | 124,701 |
| Combined-source shared model (StatsBomb + Understat) | StatsBomb | 0.284 | 0.081 | 0.787 | 0.899 | 21,741 |
| Combined-source shared model (StatsBomb + Understat) | Understat | 0.268 | 0.077 | 0.806 | 0.905 | 102,960 |

Research: Feature Availability Experiment

This experiment tests one idea directly: do richer features beat reduced or missing event context? Accuracy barely moves between variants, so probability-quality metrics such as log loss and Brier score matter more.

| Model (source) | Missing Ref Features | Log loss | Brier | ROC-AUC | Accuracy | Rows |
|---|---|---|---|---|---|---|
| statsbomb_full_rich_features (StatsBomb) | 0% | 0.283 | 0.081 | 0.796 | 0.899 | 21,657 |
| statsbomb_geometry_only (StatsBomb) | 40% | 0.299 | 0.086 | 0.757 | 0.896 | 21,657 |
| statsbomb_understat_style_reduced (StatsBomb) | 10% | 0.284 | 0.082 | 0.794 | 0.898 | 21,657 |
| understat_only_reduced_features (Understat) | 10% | 0.261 | 0.075 | 0.819 | 0.907 | 103,044 |
| combined_source_shared_features (StatsBomb + Understat) | 10% | 0.271 | 0.078 | 0.802 | 0.904 | 124,701 |
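
To make the "accuracy barely moves" point concrete, the sketch below computes relative changes against the full-feature StatsBomb model, using only the StatsBomb figures listed in this experiment. The variable and function names are hypothetical.

```python
# Metrics copied from the StatsBomb variants in the experiment above.
full_features = {"log_loss": 0.283, "brier": 0.081, "roc_auc": 0.796, "accuracy": 0.899}
geometry_only = {"log_loss": 0.299, "brier": 0.086, "roc_auc": 0.757, "accuracy": 0.896}
understat_style = {"log_loss": 0.284, "brier": 0.082, "roc_auc": 0.794, "accuracy": 0.898}


def relative_change(variant, baseline):
    """Per-metric change of a variant relative to the baseline (0.05 means +5%)."""
    return {k: (variant[k] - baseline[k]) / baseline[k] for k in baseline}


geometry_change = relative_change(geometry_only, full_features)
reduced_change = relative_change(understat_style, full_features)

for name, change in [("geometry_only", geometry_change), ("understat_style", reduced_change)]:
    summary = ", ".join(f"{metric} {value:+.1%}" for metric, value in change.items())
    print(f"{name}: {summary}")
```

For the geometry-only variant, log loss, Brier score, and ROC-AUC each shift by several percent while accuracy changes by well under one percent, so accuracy alone would hide most of the quality loss.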

Log loss

Lower

Sensitive to confident wrong probabilities.

StatsBomb

Brier score

Lower

Average squared probability error.

StatsBomb

ROC-AUC

Higher

Ranks goals above non-goals across thresholds.

StatsBomb

Accuracy

Secondary

Can be misleading because most shots are not goals.

StatsBomb
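
The four metrics above can be sketched in plain Python to show why accuracy is treated as secondary. This is not the dashboard's evaluation code, and the ten-shot sample is made-up toy data rather than StatsBomb data. It compares three hypothetical predictors that all reach the same accuracy.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Share of (goal, non-goal) pairs where the goal got the higher score; ties count half."""
    goals = [p for y, p in zip(y_true, y_prob) if y == 1]
    misses = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if g > m else 0.5 if g == m else 0.0 for g in goals for m in misses)
    return wins / (len(goals) * len(misses))


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of shots where the thresholded prediction matches the outcome."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (not from the dashboard): 10 shots, 1 goal.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
never_goal = [0.0] * 10            # always certain of "no goal"
flat_rate = [0.1] * 10             # base rate for every shot
informative = [0.05] * 9 + [0.4]   # rates the actual goal as the best chance

for name, probs in [("never_goal", never_goal), ("flat_rate", flat_rate), ("informative", informative)]:
    print(
        f"{name}: acc={accuracy(y_true, probs):.2f} "
        f"log_loss={log_loss(y_true, probs):.3f} "
        f"brier={brier_score(y_true, probs):.4f} "
        f"auc={roc_auc(y_true, probs):.2f}"
    )
```

All three predictors score 0.90 accuracy on this toy sample because none of them ever crosses the 0.5 threshold, yet their log loss, Brier score, and ROC-AUC differ sharply.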

Features Used

Shot location
Distance to goal
Angle to goal
Body part
Shot type
Under pressure
Play pattern
Minute and period
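
As a rough illustration of the two geometric features, the sketch below derives distance and angle from a shot location. It assumes StatsBomb's 120 x 80 pitch coordinates with the goal on the x = 120 line and posts at y = 36 and y = 44. The function names are hypothetical, and the dashboard's exact feature code may differ.

```python
import math

# Assumed StatsBomb-style pitch geometry (an assumption, not shown on this page).
GOAL_X = 120.0
GOAL_CENTER_Y = 40.0
LEFT_POST_Y = 36.0
RIGHT_POST_Y = 44.0


def shot_distance(x, y):
    """Straight-line distance from the shot location to the center of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTER_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the goal mouth as seen from the shot location."""
    dx = GOAL_X - x
    to_right_post = math.atan2(RIGHT_POST_Y - y, dx)
    to_left_post = math.atan2(LEFT_POST_Y - y, dx)
    return abs(to_right_post - to_left_post)


# Example: a central shot 12 units from goal versus a wider shot at the same depth.
for x, y in [(108.0, 40.0), (108.0, 60.0)]:
    print(
        f"({x}, {y}): distance={shot_distance(x, y):.1f}, "
        f"angle={math.degrees(shot_angle(x, y)):.1f} deg"
    )
```

A central shot sees a much wider goal mouth than a shot from the same depth out wide, which is why angle carries information that distance alone does not.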

How this was built

Streamlit was used as an early prototype for model exploration and validation. The production portfolio path is a precomputed artifact pipeline served through FastAPI and rendered in Next.js.

Step 1: StatsBomb Open Data. Historical shot events.
Step 2: Pandas cleaning + feature engineering. Shot coordinates, distance, angle, context.
Step 3: scikit-learn / XGBoost model training. Chance-quality probability model.
Step 4: MLflow experiment tracking. Metrics, model runs, artifacts.
Step 5: Dashboard JSON artifacts. Precomputed team and player profiles.
Step 6: FastAPI backend. Clean API layer for the frontend.
Step 7: Docker deployment. Portable app runtime.
Step 8: AWS S3 / EC2. Portfolio deployment target.
Step 9: Next.js + Tailwind frontend. Fan-friendly scouting dashboard.
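
The precomputed-artifact step in the pipeline above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the field names, the `player_profiles.json` file name, and the helper functions are hypothetical and do not reflect the dashboard's real schema or its FastAPI code.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def build_player_profiles(shots):
    """Aggregate shot-level xG into one small profile per player."""
    totals = defaultdict(lambda: {"shots": 0, "xg": 0.0})
    for shot in shots:
        entry = totals[shot["player"]]
        entry["shots"] += 1
        entry["xg"] += shot["xg"]
    return {
        player: {"shots": t["shots"], "xg": round(t["xg"], 3)}
        for player, t in totals.items()
    }


def write_artifact(profiles, path):
    """Write profiles once, offline, so the API layer only has to read a file."""
    Path(path).write_text(json.dumps(profiles, indent=2, sort_keys=True), encoding="utf-8")


def read_artifact(path):
    """What a backend handler could do at request time: load the precomputed JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up shots with placeholder player names.
shots = [
    {"player": "Player A", "xg": 0.03},
    {"player": "Player A", "xg": 0.35},
    {"player": "Player B", "xg": 0.12},
]

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "player_profiles.json"
    write_artifact(build_player_profiles(shots), artifact_path)
    loaded = read_artifact(artifact_path)

print(loaded)
```

Precomputing the profiles keeps model inference out of the request path, so the serving layer needs neither the trained model nor the raw shot data at runtime.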

How to read this dashboard

StatsBomb

Past shot samples used by the expected-goals model. Not a 2026 prediction.

DataMB

25/26 percentile scouting profiles. Percentiles are not raw stats and are not model inputs.

FBref

Recent club and league form context. Not used by the trained expected-goals model.

Understat

Club expected-goals context from covered leagues, plus a separate experimental xG check where clearly labeled.

Known model limits

  • StatsBomb powers the historical shot-location model and shot maps.
  • FBref aggregate data is recent player context only and does not replace the xG model.
  • Understat aggregate and shot-derived data is club context. The Understat shot model is experimental.
  • Small samples are shown with warnings and should not be read as guaranteed future scoring locations.
  • This dashboard is not a guaranteed 2026 World Cup prediction model.