M2 · Gaze & DeepGaze — NTH Bootcamp


01 Why gaze matters for a phosphene prosthesis

Each phosphene a user sees lives at a fixed location in their visual field — not in the world. Because the brain's map of the visual field rotates with the eye, the entire phosphene image rotates with it too. When the user's eye moves, the whole perceived scene moves with it.

If the prosthesis ignores this and streams the camera feed straight to the implant, every saccade or fixational drift drags the rendered scene across the user's visual field. The brain expects the world to stay put when it commands an eye movement; the prosthesis breaks that expectation. The result is unstable, swimming vision — the "sliding effect".

To avoid this, the pipeline has to know where the user is currently looking and compensate, so that each patch of the world stays anchored to the same location in the world, rather than to the same location in the visual field, as the eye moves. This is called gaze-contingent rendering, and it is the main practical reason a prosthesis needs an answer to "where is the user looking right now?"
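A minimal sketch of the compensation step, assuming the gaze position is already known in camera-frame pixels. The function name gaze_contingent_crop and its parameters are hypothetical, chosen for illustration; a real pipeline would also have to handle calibration and latency.

```python
import numpy as np

def gaze_contingent_crop(frame, gaze_xy, size):
    """Return a size x size window of `frame` centred on the gaze point.

    frame   : 2-D array (H, W), the camera image
    gaze_xy : (x, y) gaze position in frame pixels
    size    : side length of the window sent on to the implant
    """
    h, w = frame.shape
    half = size // 2
    # Zero-pad so that windows near the border keep their full size.
    padded = np.pad(frame, half, mode="constant")
    x = int(np.clip(round(gaze_xy[0]), 0, w - 1))
    y = int(np.clip(round(gaze_xy[1]), 0, h - 1))
    # In padded coordinates the gaze pixel sits at (y + half, x + half),
    # so a window starting at (y, x) is centred on it.
    return padded[y:y + size, x:x + size]

frame = np.zeros((100, 100))
frame[40, 60] = 1.0  # a single bright landmark in the world

looking_at_it = gaze_contingent_crop(frame, (60, 40), 21)
looking_right = gaze_contingent_crop(frame, (65, 40), 21)
```

With the gaze on the landmark it sits at the window centre, (row 10, column 10). When the gaze moves 5 px to the right, the landmark moves 5 px to the left in the window, which is what the brain expects from a world that stays put.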

There are two further reasons gaze matters, especially in research:

Computational experiments without participants. You cannot recruit a patient every time you want to test a new algorithm. A gaze model lets you simulate plausible eye movements over still images or video and run the whole prosthesis pipeline end-to-end on a laptop — fast, repeatable, and consistent across conditions.
Constructing an importance map from gaze data. The implant has only a few hundred electrodes, far too few to render the whole scene in detail. If you can predict which region of the scene the user is most likely to look at, you can represent that region in more detail and leave the rest of the scene as sparse outlines.

This module is about that prediction step. We use DeepGaze III, a pretrained model that has learned where human eyes land on natural images, as a stand-in for the actual user's eye movements.

02 Heatmap vs scanpath

People mean two different things by "where the eye looks." A heatmap is static: "how interesting is each pixel?" A scanpath is dynamic: "where does the eye actually go, and in what order?" Same scene, very different signals.

Compare the two views

Left: the heatmap. Every pixel gets a score, all at once. No time, no order.

Right: a scanpath. A sequence of stops (fixations) connected by jumps (saccades). Press play to watch one unfold.


Both views start from the same model. What information does the scanpath carry that the heatmap throws away?

Order and time. The heatmap is a sum over all possible looks. The scanpath is one specific sequence, and it captures the fact that the eye does not jump to its second-favourite region right after its favourite - it explores. The next fixation depends on the recent ones. The heatmap cannot express that.
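To make the "order is thrown away" point concrete, here is a small sketch. The helper fixation_heatmap and the coordinates are made up for illustration: two scanpaths that visit the same stops in opposite order collapse to the same heatmap.

```python
import numpy as np

def fixation_heatmap(scanpath, shape):
    """Count how many fixations land on each pixel (no time, no order)."""
    heat = np.zeros(shape)
    for x, y in scanpath:
        heat[y, x] += 1
    return heat

scanpath_a = [(10, 10), (40, 12), (40, 35), (12, 30)]
scanpath_b = scanpath_a[::-1]  # same stops, visited in reverse order

heat_a = fixation_heatmap(scanpath_a, (50, 50))
heat_b = fixation_heatmap(scanpath_b, (50, 50))

same_heatmap = np.array_equal(heat_a, heat_b)  # True: the counts are identical
same_scanpath = scanpath_a == scanpath_b       # False: the order differs
```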

03 History matters

Past fixations bend the probability of the next one. Eyes do not loop on the same spot - they suppress recently-visited regions and explore. This is called inhibition of return.

Click to drop fixations

Click anywhere on the scene to add a fixation. The red shading is a probability map: bright red = the eye is very likely to land there next, dim = unlikely. Watch how each click pushes a "hole" into the map around the clicked point.

That hole is inhibition of return: the model has learned that humans, having just looked there, will probably look elsewhere next.


Red = next-fixation probability. Cream dots = your clicks (latest is darkest).

Click the same spot ten times in a row. What happens to the map there?

It darkens. Every click multiplies in another suppression bump at that location, so the region effectively becomes a probability hole. Real DeepGaze III only sees the last four fixations, so it cannot keep darkening forever - but the qualitative effect is the same.
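A rough sketch of a multiplicative suppression bump with a four-fixation memory. The Gaussian shape and the sigma and strength values are assumptions made for illustration; they are not taken from this page's demo or from DeepGaze III, which learns the effect from data rather than applying an explicit rule.

```python
import numpy as np

def suppress_recent(prob, fixations, sigma=8.0, strength=0.9, memory=4):
    """Multiply a probability map by one 'hole' per remembered fixation."""
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = prob.copy()
    # Only the most recent `memory` fixations are allowed to act.
    for fx, fy in fixations[-memory:]:
        bump = np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
        out *= 1.0 - strength * bump
    return out / out.sum()  # renormalise so the map sums to 1 again

prob = np.full((64, 64), 1.0 / (64 * 64))  # start from a flat map
once = suppress_recent(prob, [(32, 32)])
ten_times = suppress_recent(prob, [(32, 32)] * 10)
four_times = suppress_recent(prob, [(32, 32)] * 4)
```

Here ten_times is deeper than once at the clicked point, but identical to four_times: with a memory of four, the hole stops deepening after the fourth repeat.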

Why is "inhibition of return" useful for a prosthesis pipeline?

It tells you the user will not stare at the same point forever. If your pipeline only re-renders when the gaze moves, you need a realistic model of when it moves and where it goes. Without inhibition of return, simulated gaze would just camp on the most salient point and never explore the scene - which is the opposite of what real users do.

04 DeepGaze III in three boxes

DeepGaze III is a pretrained neural network trained on real human eye-tracking data (MIT1003 scanpaths, plus SALICON for the saliency backbone). It takes three inputs and returns one probability map over the image.

input 1: image (the scene)
input 2: Center Bias (where humans look by default)
input 3: fixation history (last 4 fixations)

→ DeepGaze III: probability map (where to look next)

→ sample: next fixation (one (x, y) point)

Center Bias

Humans fixate near the centre of an image more often than the edges - even before content matters. The Center Bias captures this. It is a fixed prior that gets added to every prediction, regardless of what is in the scene.

The default Center Bias was fit on the MIT1003 dataset of human eye-tracking. Replacing it lets you simulate non-typical users (e.g. someone with central vision loss who fixates off-centre).
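As an illustration of what replacing the prior could look like, here is a sketch that builds a Gaussian prior as a normalised log-density, once centred and once shifted off-centre. The function gaussian_center_bias, the sigma_frac parameter and the example coordinates are hypothetical; the default Center Bias is fit to data, not to a Gaussian.

```python
import numpy as np

def gaussian_center_bias(height, width, center=None, sigma_frac=0.25):
    """Log-density of a 2-D Gaussian prior over image pixels.

    center     : (x, y) peak of the prior; defaults to the image centre
    sigma_frac : standard deviation as a fraction of the shorter image side
    """
    if center is None:
        center = ((width - 1) / 2, (height - 1) / 2)
    ys, xs = np.mgrid[0:height, 0:width]
    sigma = sigma_frac * min(height, width)
    log_p = -((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2)
    # Normalise in log space (log-sum-exp) so that exp(log_p) sums to 1.
    m = log_p.max()
    log_p -= m + np.log(np.exp(log_p - m).sum())
    return log_p

central = gaussian_center_bias(60, 80)
off_centre = gaussian_center_bias(60, 80, center=(60, 15))
peak_row, peak_col = np.unravel_index(off_centre.argmax(), off_centre.shape)
```

The off-centre variant peaks at row 15, column 60, a crude stand-in for a user who habitually fixates away from the middle of the image.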

Fixation history

The last 4 fixation locations are passed in as (x, y) coordinates. The model has learned from data where humans look next, given where they have just looked. From this alone, the model recovers inhibition of return, typical saccade lengths, and scanning patterns - no explicit rules needed.

Probability map

The output is a map the same size as the image. Each pixel says how likely it is that the next fixation lands there. Bright = likely. Dim = unlikely.

To pick the next fixation you either take the brightest pixel (deterministic) or sample randomly from the map (stochastic, what we do here - it gives the variability you see between real human viewers).

A note on the term "log-density". If you read the DeepGaze code you will see the model actually returns the logarithm of the probability map, not the probability map itself. This is a numerical trick - very small probabilities are easier to handle in log space. You exponentiate to get back to a regular probability map. The notebook does this with one line: p = np.exp(log_density).
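The two ways of picking a fixation can be sketched as follows. The log_density array here is random stand-in data, not real model output, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a model output: an arbitrary normalised log-density map.
logits = rng.normal(size=(48, 64))
log_density = logits - np.log(np.exp(logits).sum())

p = np.exp(log_density)  # back to ordinary probabilities
p = p / p.sum()          # guard against rounding error before sampling

# Deterministic choice: the single brightest pixel.
y_best, x_best = np.unravel_index(np.argmax(p), p.shape)

# Stochastic choice: draw a flat index with probability proportional to p,
# then convert it back to (row, column).
flat_index = rng.choice(p.size, p=p.ravel())
y_samp, x_samp = np.unravel_index(flat_index, p.shape)
```

Note that array indexing is (row, column), that is (y, x), while fixations are usually reported as (x, y); mixing the two up is an easy mistake to make.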

If DeepGaze sees a face in the image, where does the probability map peak?

On the face - especially the eyes. We usually find faces to be very important in a scene, and we tend to fixate on them first. DeepGaze was trained on humans looking at natural images, so it captures this behavioural pattern.

What is the Center Bias actually doing, in plain words?

It is a fudge factor that says "if you have no idea what is in the image, guess that the user will look near the middle, because that is what people usually do." It improves predictions on average and lets the rest of the model focus on learning content-driven attention.

05 Sampling a trajectory

One fixation is not a scanpath. To get a sequence you sample, update the history, and run the model again. Repeat as many times as you want fixations.

Step through a scanpath

The loop, in words:

Start with the eye at the centre of the image (no real history yet).
Ask the model: given the image, the Center Bias, and the recent history, what is the probability map?
Pick one (x, y) point by sampling from that map: each pixel's chance of being chosen is proportional to its value. Bright spots are likelier; dim spots are still possible.
Add that point to the history; drop the oldest one so we keep the last four.
Repeat for as many fixations as you want.
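The loop above can be sketched in a few lines. Everything here is a stand-in: toy_model is a hypothetical replacement for the real network (it multiplies image brightness by a plain, non-log prior and damps recent fixations), and sample_scanpath is an illustrative name, not the notebook's solution.

```python
import numpy as np

def toy_model(image, center_bias, history, sigma=6.0):
    """Hypothetical stand-in for the real model: returns a probability map."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = (image + 1e-6) * center_bias
    for fx, fy in history:
        dist2 = (xs - fx) ** 2 + (ys - fy) ** 2
        p = p * (1.0 - 0.9 * np.exp(-dist2 / (2 * sigma ** 2)))
    return p / p.sum()

def sample_scanpath(image, center_bias, n_fixations, rng, memory=4):
    h, w = image.shape
    history = [(w // 2, h // 2)]  # start with the eye at the centre
    scanpath = []
    for _ in range(n_fixations):
        p = toy_model(image, center_bias, history)  # ask for the map
        idx = rng.choice(p.size, p=p.ravel())       # sample one pixel
        y, x = np.unravel_index(idx, p.shape)
        point = (int(x), int(y))
        scanpath.append(point)
        history = (history + [point])[-memory:]     # keep the last `memory`
    return scanpath

image = np.random.default_rng(1).random((40, 60))
flat_bias = np.ones_like(image)  # a prior with no preference
path_a = sample_scanpath(image, flat_bias, 12, np.random.default_rng(0))
path_b = sample_scanpath(image, flat_bias, 12, np.random.default_rng(0))
path_c = sample_scanpath(image, flat_bias, 12, np.random.default_rng(7))
```

Passing the random generator in explicitly makes runs repeatable: path_a and path_b share a seed and are identical, while path_c uses a different seed and will almost certainly differ.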


Red shading = current probability map. Numbered cream dots = the scanpath so far.

Run "play" twice. Do you get the same scanpath both times? Why?

No - the sampling step is random. The same probability map can produce many different scanpaths. This is realistic: two humans looking at the same image do not produce identical scanpaths either. The model gives you a distribution over plausible behaviours.

Reset and step manually. After the first fixation, does the probability map change shape?

Yes. The first fixation enters the history; the model is then asked "given that the user just looked at (x, y), where would they look next?" - and the map shifts. That shift is the model expressing inhibition of return and natural saccade lengths.

06 What can you measure?

Once you have a scanpath - or many - the interesting statistics begin. These are the numbers a prosthesis engineer cares about: how far does the eye jump? How fast? How long does it linger?

Generate & measure

Press resample to sample several scanpaths and accumulate their statistics. The histograms on the right fill in. Press reset to clear them and start fresh.

Saccade length: how far the eye jumps between two fixations. Real humans show many short jumps and occasional long ones.

Saccade angle: direction relative to the previous jump. Tells you if the eye prefers to keep going forward or to double back.

Dwell time: how long the eye stays at each fixation, in milliseconds.
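The first two measures can be computed directly from a list of (x, y) fixations. This is a sketch with a hypothetical helper, saccade_stats; dwell time is left out because it needs timing information that a bare list of positions does not carry.

```python
import numpy as np

def saccade_stats(scanpath):
    """Return saccade lengths (pixels) and turn angles (radians)."""
    pts = np.asarray(scanpath, dtype=float)
    jumps = np.diff(pts, axis=0)  # one (dx, dy) per saccade
    lengths = np.hypot(jumps[:, 0], jumps[:, 1])
    directions = np.arctan2(jumps[:, 1], jumps[:, 0])
    # Direction relative to the previous saccade, wrapped into [-pi, pi).
    turns = np.diff(directions)
    turns = (turns + np.pi) % (2 * np.pi) - np.pi
    return lengths, turns

scanpath = [(0, 0), (3, 4), (6, 8), (3, 4)]
lengths, turns = saccade_stats(scanpath)
```

In this example all three saccades are 5 px long. The first turn is 0 (the eye keeps going forward) and the second has magnitude pi (the eye doubles back).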

Histograms: saccade length (pixels), saccade angle (radians), dwell time (milliseconds).

Saccade length distribution is skewed - most jumps are short. Why does that matter for the implant?

Because most of the time the next region of interest is near the current one. The implant does not have to re-render the whole field; it can update mostly the local area. This is what makes foveated rendering with limited bandwidth feasible at all.

Dwell time averages a few hundred milliseconds. How often does the implant need to refresh?

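Faster than the dwell time. A refresh interval of 100-200 ms is the slowest you could tolerate; anything slower and the user will saccade away before the new region is ever rendered. In practice you want headroom: a ~60 Hz refresh (about 17 ms per frame) is comfortable, because it is far faster than the rate of a few saccades per second.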
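The arithmetic behind that headroom, as a small sketch. The helper frames_per_fixation and the example dwell times are illustrative, not measured values.

```python
def frames_per_fixation(dwell_ms, refresh_hz):
    """How many refresh frames fit inside one fixation."""
    frame_ms = 1000.0 / refresh_hz  # duration of one frame in milliseconds
    return dwell_ms / frame_ms

frames_short = frames_per_fixation(100, 60)  # a short 100 ms fixation
frames_long = frames_per_fixation(300, 60)   # a longer 300 ms fixation
frames_slow = frames_per_fixation(100, 5)    # the same fixation at 5 Hz
```

At 60 Hz even a 100 ms fixation spans about 6 frames and a 300 ms fixation about 18. At 5 Hz a 100 ms fixation gets only half a frame, so the eye may already have moved on before the update arrives.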

07 Self-check

Predict the answer first, then verify with the demos above (or the notebook).

Q1. Two representations: a saliency heatmap and a scanpath sequence. Which one do you feed a prosthesis pipeline, and why?

The scanpath. A prosthesis only stimulates a small patch of cortex at a time, so at every fixation it has to pick one next location — that is a sequence of decisions, not a set. A heatmap is a sum across many viewers and many looks; it tells you where attention pools on average but not the order in which a single user would visit those points.

Q2. Inhibition of return: why does it help, and what's the failure mode if you crank it too high?

It pushes the model to explore: without it, the most salient point wins every sample and the gaze camps there forever (this is the failure mode of §03). Cranked too high, the suppression around recent fixations becomes so strong that the gaze is pushed into low-saliency regions just to avoid where it has been — exploration without any content guiding it.

Q3. DeepGaze III takes image, Center Bias, AND fixation history. What does removing the fixation history give you?

You're back to a per-frame saliency model: the same image always produces the same map. The first sampled fixation is unchanged, but every later fixation ignores where the eyes have already been. The scanpath becomes a sequence of independent draws from the same map — no inhibition of return, no exploration, no temporal structure.

Q4. The saccade-length histogram in §06 is heavy-tailed (lognormal — see the disclaimer there). What about real eye movements produces that shape, and would running the real DeepGaze III on a natural image reproduce it?

Two contributions. (a) Real saliency maps are peaky: a few bright regions on a wide dim background. Sampling proportional to the map gives mostly short jumps to the dominant peak near the current fixation, occasional medium jumps to a secondary peak, and rare long jumps when the background scores a hit — that mixture has a long right tail. (b) Biology layers in short corrective saccades within an object and longer between-object saccades. The combination is approximately lognormal. Real DeepGaze III on natural images does recover this shape; the toy model on this page does not, because its “scene” has only five Gaussian blobs, so the peaks-vs-tails geometry is too uniform — which is exactly why §06 substitutes literature-derived draws.

08 Where to next

Next module: M3 — Neuromodulation & stimulation, where the patch of scene you decided to look at becomes an actual pattern of electrical pulses.

Going deeper is optional. The companion notebook revisits this material in Python — running the actual DeepGaze III model on a real image, in four parts — as a self-guided resource, not a workshop step.

Part A - warm-up

Load image, build Center Bias, run DeepGaze once, sample a single fixation.

Part B - trajectories

Add fixation to history, re-run. Write sample_scanpath. Compare across random seeds.

Part C - statistics

Saccade lengths and angles. A simple dwell-time model. Fraction of time on the most interesting regions.

Part D - challenges

Build a peripheral Center Bias. Add motor jitter. Continue a real eye-tracking prefix.

Files: gaze_workshop.ipynb (with TODOs), gaze_workshop_solutions.ipynb (full). The first cell installs deepgaze_pytorch from GitHub - allow ~2 min on CPU.

Tools & references

Tool: DeepGaze (deepgaze_pytorch), by Matthias Kümmerer and colleagues. This is the saliency / scanpath model this module is built around. This page uses a synthetic toy model for speed; the notebook runs the real DeepGaze III.
Paper: Kümmerer, Bethge & Wallis (2022), DeepGaze III: Modeling free-viewing human scanpaths with deep learning, Journal of Vision 22(5):7. doi:10.1167/jov.22.5.7.
Paper: Itti, Koch & Niebur (1998), A model of saliency-based visual attention for rapid scene analysis, IEEE TPAMI 20(11):1254-1259. doi:10.1109/34.730558. Origin of the saliency-map and inhibition-of-return ideas used in §02–§03.
Benchmark: MIT/Tübingen Saliency Benchmark, where DeepGaze models are evaluated against human fixation data.
