*>You may already know this, but image generators like Stable Diffusion and Flux...

>You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”.

They... don't. Latents don't meaningfully represent human perception, they represent correlations in the dataset. Parent is talking about the function aligned with actual measured human perception (UCS is an example of that). Whether it's a good idea, and how trivial it is for the model to fit this function automatically, is another question.