This is my current understanding of how AI works. Not necessarily correct, just where my thinking is right now.

Embeddings and Attention: The Feeling of Words

With enough dimensions and enough data, you can get a set of numbers – an embedding – that represents the feeling of a word, the feeling you get when you hear it.
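
Here’s a minimal sketch of what that might look like, with made-up vectors standing in for the learned ones:

```python
import numpy as np

# Toy embedding table: each word maps to a handful of made-up numbers.
# Real models learn millions of these vectors from data.
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.3]),
    "dog":   np.array([0.8, 0.2, 0.4]),
    "stock": np.array([0.1, 0.9, 0.2]),
}

def similarity(a, b):
    """Cosine similarity: how close two word 'feelings' are."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(embeddings["cat"], embeddings["dog"]))    # high: similar feelings
print(similarity(embeddings["cat"], embeddings["stock"]))  # lower: different feelings
```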

As it reads, the model takes your feeling about the current word and is reminded of similar feelings you’ve had about previous words. It pays attention to those. Some say that’s all you need.
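
A rough sketch of that “pays attention” step, using scaled dot-product attention over toy vectors (a real model uses learned projections and many attention heads; this only shows the shape of the idea):

```python
import numpy as np

def attention(query, keys, values):
    """The current word's feeling (query) is compared against previous
    feelings (keys); the more similar, the more weight that word's value gets."""
    scores = keys @ query / np.sqrt(len(query))      # similarity to each previous word
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax: attention weights
    return weights @ values                          # blend of previous feelings

# Toy "feelings" for three previous words and the current one.
prev = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
current = np.array([1.0, 0.1])

print(attention(current, prev, prev))  # leans toward the first two (similar) words
```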

So the model gets your feeling about the current word, remembers your previous feelings, and tries to keep the vibe going. It scores every possible next word, narrows down to a small set of top candidates, and picks one. Out comes the next word.
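
A sketch of that narrowing-down step, which is roughly what top-k sampling does (the vocabulary and scores here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores (logits) over a tiny vocabulary for "the cat sat on the ...".
vocab = ["mat", "floor", "sofa", "moon", "elephant"]
logits = np.array([3.0, 2.5, 2.0, -1.0, -3.0])

def sample_top_k(logits, k=3):
    """Keep only the k best-scoring words, renormalize, then pick one."""
    top = np.argsort(logits)[-k:]   # indices of the top candidates
    probs = np.exp(logits[top])
    probs /= probs.sum()
    return vocab[rng.choice(top, p=probs)]

print(sample_top_k(logits))  # almost always "mat", "floor", or "sofa"; never "elephant"
```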

Why “You”?

I use “you” deliberately. This is what happens inside you as you listen to someone talk:

  • You get feelings about words
  • You think back to previous statements on the fly
  • You don’t know exactly which word comes next
  • But you do have a sense of the possibilities
  • You don’t expect random “elephant” words out of nowhere

The model does something similar.

Images and CLIP

Somehow this same embedding approach – turning things into sets of numbers – extends to images. CLIP maps images and text into the same embedding space, so an image and the text that describes it end up close together.
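
Conceptually it goes something like this: an image encoder and a text encoder are trained so that matching pairs land close together, and afterwards you can compare across the two directly. A sketch, with made-up embeddings standing in for the two trained networks:

```python
import numpy as np

# Pretend outputs of an image encoder and a text encoder trained together,
# CLIP-style; in reality these vectors come from two neural networks.
image_embeddings = {
    "photo_of_cat.jpg": np.array([0.9, 0.1, 0.2]),
    "photo_of_car.jpg": np.array([0.1, 0.9, 0.3]),
}
text_embeddings = {
    "a cat sleeping": np.array([0.8, 0.2, 0.1]),
    "a red car":      np.array([0.2, 0.8, 0.4]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Training pushes matching image/text pairs together and mismatches apart,
# so afterwards you can compare across modalities directly:
for img, vec in image_embeddings.items():
    best = max(text_embeddings, key=lambda t: cosine(vec, text_embeddings[t]))
    print(img, "->", best)
```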

I have a sense of how this might be possible, but it raises questions:

  • Images seem to require more dimensions than linear text sequences
  • The underlying reality of an image feels so different from text
  • How can image and text embeddings line up so well?

But those problems have been solved. Which suggests something deeper is going on.

JEPA and World Models

JEPA brings in a World Model, and that World Model can itself live in embedding space. So the world, images, and text can all share the same embedding space.

The “next world” vibe is based on:

  • Current world, image, and text representations
  • Previous representations

The next-world embedding handles the higher-level representation (the world context), leaving a more constrained world for the lower-level embeddings to operate in. That should make them more capable: they can devote all of their numbers to this constrained world, the one where the higher-level information is already handled above.
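
A very loose sketch of how I picture that prediction, under my reading of JEPA: everything is an embedding, and a predictor guesses the next world embedding from the current and previous ones. The predictor here is just a random linear map to show the shapes; a real one would be a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # size of the shared embedding space (number made up)

# Pretend embeddings for a few previous world states and the current one;
# in a real system these would come from trained world/image/text encoders.
previous_worlds = rng.normal(size=(3, DIM))
current_world = rng.normal(size=DIM)

# Stand-in predictor: one random linear map over the concatenated context.
context = np.concatenate([previous_worlds.mean(axis=0), current_world])
W = rng.normal(size=(DIM, 2 * DIM)) * 0.1

predicted_next_world = W @ context  # a guess at the next world embedding

# Training would compare this guess against the encoder's embedding of what
# actually came next, in embedding space rather than pixel or word space.
print(predicted_next_world.shape)  # (8,)
```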

World Model as Notation

Here’s the insight: this world model embedding concept is really just a notation.

The World Model is a notation. We can use any notation.

Since it represents a world, it can be prompted to generate the Next World.

This means we can run evals on the notation. The World Model provides itself as context for the prompt, and the response takes the current world context into account.
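
Here’s how I imagine such an eval might look, as a hedged sketch only: the world notation goes into the prompt, the model proposes a next world, and a check scores the result. The function names and placeholder bodies are made up for illustration.

```python
# Hypothetical sketch: the World Model is just text (a notation), so it can sit
# in a prompt, and the response can be checked against the current world.
# These functions are stand-ins, not a real API.

def generate_next_world(world_notation: str) -> str:
    """Stand-in for a language-model call with the world notation as context,
    e.g. a prompt like: "Current world: {world}. Describe the next world."
    """
    return world_notation + " (one step later)"  # placeholder response

def is_consistent(current_world: str, next_world: str) -> bool:
    """Stand-in eval: does the proposed next world still honor the current one?"""
    return current_world in next_world

def eval_notation(worlds: list[str]) -> float:
    """Score the notation itself: how often generated next worlds stay consistent."""
    results = [is_consistent(w, generate_next_world(w)) for w in worlds]
    return sum(results) / len(results)

print(eval_notation(["a room with a cat on the mat", "a street with one parked car"]))
```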