lepong

[lepong] pixel-input jepa

A CNN JEPA plays Pong from pixels. The client renders the canvas and sends PNG bytes. The model sees only pixels: no ball state, no re-rendering. The occlusion slider blanks part of the court before sending, so the model must imagine the ball in the hidden region. Compare JEPA against a classical extrapolator.


How it works. The client draws the Pong game on a canvas, optionally blackens the right side, PNG-encodes it, and sends the bytes to the server. The server passes the PNG through a frozen CNN encoder and 6-layer transformer predictor, reads ball_y from a trained Linear(192, 10) state head, and sends that value back as the paddle target. A classical extrapolator, ball.y + 5*ball.vy, is computed alongside for comparison.
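The classical baseline is small enough to sketch directly. The version below is a minimal illustration, assuming a hypothetical Ball record whose y and vy fields mirror the ball.y and ball.vy in the formula above. It is plain linear extrapolation over 5 ticks with no wall-bounce handling, exactly as the formula reads; the real server code may differ.

```python
from dataclasses import dataclass


@dataclass
class Ball:
    """Hypothetical ground-truth ball state; the field names are assumptions."""
    y: float   # vertical position
    vy: float  # vertical velocity per physics tick


def classical_target(ball: Ball, horizon: int = 5) -> float:
    """Linear extrapolation: ball.y + horizon * ball.vy (no bounce handling)."""
    return ball.y + horizon * ball.vy


print(classical_target(Ball(y=40.0, vy=2.0)))  # 50.0
```

Because the extrapolator reads ground-truth state directly, it needs no image at all, which is why it serves only as a reference point.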

controls: ai paddle mode, occlusion

score: 0 : 0

live ball_y error: jepa median / p95 / max and classical median / p95 / max (all in %), samples

live ball_x error: jepa median / p95 / max and classical median / p95 / max (all in %)

runtime: rally, plan time (ms), ws rtt (ms)

history

federation: train while playing (idle; gameplay feeds the pool), rounds contributed, last val_loss, view hub

pipeline

canvas -> PNG -> JEPA

Every 5 physics ticks (~6 Hz) the client draws the Pong game on an offscreen 128x128 canvas, applies occlusion if enabled, calls toDataURL('image/png'), and ships the base64-encoded PNG over WebSocket.
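On the receiving side, the base64 string has to become raw PNG bytes again. The sketch below shows one way to do that in Python, assuming the client sends the full data URL that toDataURL returns (prefix included). The helper names png_bytes_from_data_url and should_send are hypothetical, and the actual message framing is not specified here.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"          # first 8 bytes of every PNG file
DATA_URL_PREFIX = "data:image/png;base64,"    # what toDataURL('image/png') prepends


def png_bytes_from_data_url(data_url: str) -> bytes:
    """Decode a canvas data URL into raw PNG bytes, rejecting anything else."""
    if not data_url.startswith(DATA_URL_PREFIX):
        raise ValueError("expected a base64 PNG data URL")
    raw = base64.b64decode(data_url[len(DATA_URL_PREFIX):], validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG")
    return raw


def should_send(tick: int, every: int = 5) -> bool:
    """True on every 5th physics tick, matching the send cadence described above."""
    return tick % every == 0
```

Checking the PNG signature before decoding the image is a cheap guard against malformed messages reaching the encoder.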

The server receives a ground_truth field in every message -- it uses it only for the classical baseline computation, never as model input. The paddle target comes exclusively from state_head(predictor(encoder(client_png)))[ball_y]. Trainable parameters: 1,930 (just the Linear(192, 10) state head). Frozen parameters: 13,082,080 (encoder + predictor + projectors).
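The trainable-parameter figure can be checked by hand: a linear layer with a bias has in_features * out_features weights plus out_features biases. The short sketch below does that arithmetic in Python, assuming the state head uses a bias term (which is what makes the count come out to 1,930); the function name is illustrative.

```python
def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weight matrix plus optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


TRAINABLE = linear_param_count(192, 10)  # the Linear(192, 10) state head
FROZEN = 13_082_080                      # encoder + predictor + projectors, as quoted above

print(TRAINABLE)                                   # 1930
print(f"{TRAINABLE / (TRAINABLE + FROZEN):.4%}")   # trainable share of all parameters
```

The trainable share works out to roughly 0.015% of the total, so almost all of the model's capacity is in the frozen backbone.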

partial observation

occlusion slider (0-60%)

Pong physics between bounces are trivial. Any algorithm that sees the ball plays near-optimally. To force an actual prediction task, the occlusion slider blanks out a vertical strip on the right side of the canvas before encoding. At 0% the model sees the whole court. At 40% the model must imagine the ball whenever it enters the hidden right-hand 40% of the canvas.

You (the human) always see the whole court through the semi-transparent purple overlay. Only the encoder input is occluded.
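The occlusion step amounts to zeroing the rightmost columns of the frame before it is encoded. The sketch below illustrates this in Python on a grayscale frame stored as a list of rows. The frame representation, the function name occlude_right, and the rounding of the strip width are all assumptions; the real client does this on the canvas before PNG encoding.

```python
def occlude_right(frame: list[list[int]], fraction: float) -> list[list[int]]:
    """Return a copy of `frame` with the rightmost `fraction` of columns set to 0 (black)."""
    if not 0.0 <= fraction <= 0.6:
        raise ValueError("slider range is 0-60%")
    width = len(frame[0])
    hidden = round(width * fraction)   # number of blanked columns
    first_hidden = width - hidden      # index of the first blanked column
    return [row[:first_hidden] + [0] * hidden for row in frame]


# A 128x128 all-white frame with 40% occlusion: 51 columns are blanked.
frame = [[255] * 128 for _ in range(128)]
masked = occlude_right(frame, 0.4)
print(masked[0][76], masked[0][77])  # 255 0
```

The original frame is left untouched, which mirrors the page's behaviour: the human view stays complete and only the encoder input is masked.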

metrics

what we measure

Running p50 / p95 / max absolute error of JEPA's prediction and the classical baseline vs ground truth, separately for ball_y and ball_x.

The classical baseline is NOT a fair comparison -- it has direct access to ground-truth state and integrates physics for 5 steps. JEPA has only the PNG bytes. The interesting questions are: (a) how close does JEPA get despite having less information? (b) does JEPA's tail (p95, max) degrade gracefully? (c) does occlusion break JEPA?