[lepong] pixel-input jepa
A CNN-based JEPA plays Pong from pixels. The client renders the canvas and sends PNG bytes. The server sees only pixels: no ball state, no re-rendering. The occlusion slider blanks part of the court before sending, so the model must imagine the ball in the hidden region. Compare JEPA against a classical extrapolator.
How it works. The client draws the Pong game on a canvas, optionally blackens the right side, PNG-encodes it, and sends the bytes to the server. The server passes the PNG through a frozen CNN encoder and a 6-layer transformer predictor, reads ball_y from a trained Linear(192, 10) state head, and sends that value back as the paddle target. A classical extrapolator, ball.y + 5*ball.vy, is computed alongside for comparison.
pipeline

canvas -> PNG -> JEPA

Every 5 physics ticks (~6 Hz) the client draws the Pong game on an offscreen 128x128 canvas, applies occlusion if enabled, calls toDataURL('image/png'), and ships the base64-encoded PNG over WebSocket.
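
The receiving side of this step can be sketched in Python. This is a minimal illustration, not the demo's actual server code: it assumes the client sends the full data URL string that toDataURL returns (the page only says base64 bytes are shipped), and the function names decode_png_data_url and should_send are hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed first 8 bytes of every PNG file


def decode_png_data_url(data_url: str) -> bytes:
    """Return raw PNG bytes from a 'data:image/png;base64,...' string.

    Assumption: the client ships the unmodified toDataURL() result.
    """
    header, sep, payload = data_url.partition(",")
    if not sep or header != "data:image/png;base64":
        raise ValueError("expected a base64 PNG data URL")
    png_bytes = base64.b64decode(payload, validate=True)
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG")
    return png_bytes


def should_send(tick: int, every: int = 5) -> bool:
    """True on the physics ticks where a frame is shipped (every 5th tick)."""
    return tick % every == 0
```

Checking the PNG signature before decoding the image is a cheap way to reject malformed frames early; the actual server may validate differently.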

The server receives a ground_truth field in every message. It uses that field only to compute the classical baseline, never as model input. The paddle target comes exclusively from state_head(predictor(encoder(client_png)))[ball_y]. Trainable parameters: 1,930 (just the Linear(192, 10) state head: 192 x 10 weights plus 10 biases). Frozen parameters: 13,082,080 (encoder + predictor + projectors).
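
The separation between model input and ground truth, and the 1,930 figure, can be illustrated with a short sketch. Everything here is hypothetical scaffolding: the "png" message key, handle_frame, and the two injected callables stand in for the real server code, which the page does not show. Only the ground_truth field name and the Linear(192, 10) shape come from the text above.

```python
from typing import Callable


def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def handle_frame(
    message: dict,
    model_predict: Callable[[bytes], float],
    baseline_predict: Callable[[dict], float],
) -> dict:
    """Route one client message so ground truth never reaches the model.

    `message["png"]` is an assumed key name; `message["ground_truth"]`
    is the field named on this page.
    """
    paddle_target = model_predict(message["png"])          # pixels only
    baseline = baseline_predict(message["ground_truth"])   # comparison only
    return {"paddle_target": paddle_target, "baseline": baseline}


TRAINABLE = linear_param_count(192, 10)  # 192 * 10 + 10 = 1930
FROZEN = 13_082_080                      # figure quoted on this page
```

With these numbers the trainable head is roughly 0.015% of all parameters, which is why the state head can be described as a thin readout on top of a frozen world model.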

partial observation

occlusion slider (0-60%)

Pong physics between bounces is trivial: any algorithm that sees the ball plays near-optimally. To force an actual prediction task, the occlusion slider blanks out a vertical strip on the right side of the canvas before encoding. At 0% the model sees the whole court. At 40% the model must imagine the ball whenever it is inside the rightmost 40% of the court.
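
A minimal sketch of the occlusion step, written in Python for illustration even though the real blanking happens on the client canvas. It assumes a single-channel frame stored as a list of rows, a black (0) fill, and a strip width rounded to whole pixels; occlude_right and ball_is_hidden are hypothetical names.

```python
def hidden_start(width: int, fraction: float) -> int:
    """Index of the first occluded column for a strip on the right side."""
    if not 0.0 <= fraction <= 0.6:
        raise ValueError("the slider range is 0-60%")
    return width - round(width * fraction)


def occlude_right(frame: list[list[int]], fraction: float) -> list[list[int]]:
    """Return a copy of `frame` with the rightmost `fraction` of columns set to 0."""
    start = hidden_start(len(frame[0]), fraction)
    return [row[:start] + [0] * (len(row) - start) for row in frame]


def ball_is_hidden(ball_x: float, width: int, fraction: float) -> bool:
    """True when the ball's x position falls inside the blanked strip."""
    return ball_x >= hidden_start(width, fraction)


# Example: a 128x128 all-white frame with the slider at 40%.
frame = [[255] * 128 for _ in range(128)]
occluded = occlude_right(frame, 0.4)
```

Under these assumptions a 40% strip on a 128-pixel-wide frame covers 51 columns, so the hidden region begins at column 77 and a ball at x = 64 is still visible.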

You (the human) always see the whole court through the semi-transparent purple overlay. Only the encoder input is occluded.

metrics

what we measure

Running p50 / p95 / max absolute error of JEPA's prediction and the classical baseline vs ground truth, separately for ball_y and ball_x.
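
One way to keep such running statistics is sketched below. The ErrorTracker class is hypothetical, and the nearest-rank percentile definition is an assumption: the page does not say which percentile method the demo uses, nor how errors are normalized for display.

```python
import math


class ErrorTracker:
    """Accumulates absolute errors and reports p50 / p95 / max."""

    def __init__(self) -> None:
        self.errors: list[float] = []

    def add(self, predicted: float, truth: float) -> None:
        self.errors.append(abs(predicted - truth))

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile for p in (0, 100]."""
        if not self.errors:
            raise ValueError("no samples yet")
        ordered = sorted(self.errors)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def summary(self) -> dict:
        return {
            "p50": self.percentile(50),
            "p95": self.percentile(95),
            "max": max(self.errors),
            "samples": len(self.errors),
        }
```

In use, one tracker per (method, coordinate) pair, for example JEPA ball_y, JEPA ball_x, classical ball_y, classical ball_x, keeps the four sets of statistics separate. Sorting on every query is fine at a few frames per second; a long-running session might prefer a bounded window.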

The classical baseline is NOT a fair comparison: it has direct access to ground-truth state and integrates physics for 5 steps, while JEPA has only the PNG bytes. The interesting questions are: (a) how close does JEPA get despite having less information? (b) does JEPA's tail (p95, max) degrade gracefully? (c) does occlusion break JEPA?
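
For reference, the baseline formula ball.y + 5*ball.vy can be sketched as follows. The Ball dataclass and function names are hypothetical, and this sketch ignores wall bounces; the page does not say whether the real baseline handles them.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    y: float   # ground-truth vertical position
    vy: float  # ground-truth vertical velocity per physics step


def classical_ball_y(ball: Ball, steps: int = 5) -> float:
    """Closed-form linear extrapolation: ball.y + steps * ball.vy."""
    return ball.y + steps * ball.vy


def integrate_ball_y(ball: Ball, steps: int = 5) -> float:
    """The same prediction, stepping constant-velocity physics one tick at a time."""
    y = ball.y
    for _ in range(steps):
        y += ball.vy
    return y
```

With constant velocity and no bounce inside the 5-step horizon, both functions give the same answer, for example 50.0 for Ball(y=60.0, vy=-2.0). Either way the baseline reads exact position and velocity, which is the information advantage described above.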