ball_y from a trained Linear(192, 10) state head, and sends that value back as the paddle target. A classical extrapolator ball.y + 5*ball.vy is computed alongside for comparison.
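The classical extrapolator is simple enough to write out. The sketch below is an illustration only: the `Ball` dataclass, the `classical_target` function name, and the units are assumptions, while the formula `ball.y + 5*ball.vy` is the one stated above. As written it does not reflect the ball at the walls; whether the demo's baseline does is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    """Ground-truth ball state; only y and vy are needed here (hypothetical type)."""
    y: float   # vertical position
    vy: float  # vertical velocity per physics tick


def classical_target(ball: Ball, horizon_ticks: int = 5) -> float:
    """Linear extrapolation: ball.y + horizon_ticks * ball.vy."""
    return ball.y + horizon_ticks * ball.vy


print(classical_target(Ball(y=40.0, vy=2.0)))  # 40 + 5*2 = 50.0
```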
Every 5 physics ticks (~6 Hz) the client draws the Pong game on an
offscreen 128x128 canvas, applies occlusion if enabled,
calls toDataURL('image/png'), and ships the base64-encoded PNG
over WebSocket.
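On the receiving side, the base64 payload has to be turned back into PNG bytes before it can be decoded into pixels. A minimal sketch, assuming the payload arrives either as a full data URL (the form toDataURL returns) or as a bare base64 string; the function name `decode_frame` is hypothetical and the actual server code may differ.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_frame(payload: str) -> bytes:
    """Return raw PNG bytes from a data URL or a bare base64 string."""
    # A data URL looks like 'data:image/png;base64,<data>';
    # keep only the part after the first comma if that prefix is present.
    if payload.startswith("data:"):
        _, _, payload = payload.partition(",")
    png = base64.b64decode(payload, validate=True)
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG image")
    return png


# Stand-in payload: the PNG signature only, not a real image.
fake = "data:image/png;base64," + base64.b64encode(PNG_SIGNATURE).decode("ascii")
print(decode_frame(fake) == PNG_SIGNATURE)
```

Checking the signature is a cheap guard against malformed messages; it does not validate the rest of the image.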
The server receives a ground_truth field in every message, but
uses it only for the classical baseline computation, never as model input.
The paddle target comes exclusively from
state_head(predictor(encoder(client_png)))[ball_y].
Trainable parameters: 1,930 (just the Linear(192, 10) state head).
Frozen parameters: 13,082,080 (encoder + predictor + projectors).
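The trainable count follows directly from the layer shape: a linear layer with bias has one weight per input-output pair plus one bias per output. The short check below reproduces the 1,930 figure; the helper name `linear_param_count` is made up for illustration, and the frozen count is simply copied from the line above.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer with bias."""
    return in_features * out_features + out_features


trainable = linear_param_count(192, 10)  # 192*10 weights + 10 biases
frozen = 13_082_080                      # figure quoted above
share = trainable / (trainable + frozen)

print(trainable)       # 1930
print(f"{share:.4%}")  # trainable share of all parameters
```

The trainable head is therefore roughly 0.015% of all parameters.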
Pong physics between bounces are trivial. Any algorithm that sees the ball plays near-optimally. To force an actual prediction task, the occlusion slider blanks out a vertical strip on the right side of the canvas before encoding. At 0% the model sees the whole court. At 40% the model must imagine the ball whenever it crosses midfield.
You (the human) always see the whole court through the semi-transparent purple overlay. Only the encoder input is occluded.
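The occlusion step can be sketched as a column mask applied to the frame before it is encoded. This is an illustration under assumptions, not the demo's canvas code: it treats the frame as rows of grayscale values, blanks to 0, and rounds the strip width to whole pixels; the function name `occlude_right` is hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of grayscale pixel values, indexed frame[y][x]


def occlude_right(frame: Frame, fraction: float, fill: int = 0) -> Frame:
    """Return a copy of frame with the rightmost `fraction` of columns set to fill."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    width = len(frame[0])
    first_hidden = width - round(width * fraction)  # index of first blanked column
    return [row[:first_hidden] + [fill] * (width - first_hidden) for row in frame]


frame = [[255] * 128 for _ in range(128)]  # all-white 128x128 stand-in
masked = occlude_right(frame, 0.40)
visible = sum(1 for v in masked[0] if v == 255)
print(visible)  # columns of the first row still visible to the encoder
```

The original frame is left untouched, which mirrors the behaviour described above: the human view keeps the full court while only the encoder input is masked.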
Running p50 / p95 / max absolute error of JEPA's prediction
and of the classical baseline against ground truth, reported separately for
ball_y and ball_x.
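One way to maintain such statistics is to collect absolute errors and read off percentiles on demand. The sketch below assumes the statistics run over all samples seen so far and uses the nearest-rank percentile definition; the demo may use a sliding window or interpolation instead, and the `ErrorTracker` class is hypothetical.

```python
import math
from typing import Dict, List


class ErrorTracker:
    """Collects absolute errors and reports p50 / p95 / max."""

    def __init__(self) -> None:
        self.errors: List[float] = []

    def update(self, predicted: float, truth: float) -> None:
        self.errors.append(abs(predicted - truth))

    def summary(self) -> Dict[str, float]:
        if not self.errors:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self.errors)

        def nearest_rank(p: int) -> float:
            # Nearest-rank percentile: the value at rank ceil(p/100 * n), 1-indexed.
            rank = max(1, math.ceil(p * len(ordered) / 100))
            return ordered[rank - 1]

        return {"p50": nearest_rank(50), "p95": nearest_rank(95), "max": ordered[-1]}


tracker = ErrorTracker()
for i in range(1, 21):          # errors 1.0, 2.0, ..., 20.0
    tracker.update(float(i), 0.0)
print(tracker.summary())        # {'p50': 10.0, 'p95': 19.0, 'max': 20.0}
```

In use, there would be one tracker per predictor and per quantity, for example four in total for JEPA and the baseline on ball_y and ball_x.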
The classical baseline is NOT a fair comparison -- it has direct access to ground-truth state and integrates physics for 5 steps. JEPA has only the PNG bytes. The interesting questions are: (a) how close does JEPA get despite having less information? (b) does JEPA's tail (p95, max) degrade gracefully? (c) does occlusion break JEPA?