Stanford Cme296 Diffusion And Large Vision Models Spring 2026 Lecture 8 Trending Topics
TITLE: Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 8 - Trending Topics
CHANNEL: Stanford Online
DATE: 2026-06-01
---TRANSCRIPT---
Hello everyone and welcome to lecture 8 of CME 296. So today is a special day because, as you know, today is the last lecture of this entire class. So the menu for today will be a little bit different. What we’ll do is divide this lecture into two parts. In the first part, we’ll try to piece together everything we’ve seen in the class up until now and see what we can take away from it. And in the second part, we’ll look at adjacent fields where we can apply what we have learned. Does that sound good to you?

So with that, we’ll start with the first part, which is just piecing together everything we’ve seen this quarter. The whole goal of this class has been to learn how to generate images. So for instance, given an input prompt, how can we generate an image that is well aligned with that prompt? And of course there are a lot of dimensions we can look at. So this class has been about decomposing how it works into tractable parts. And if you remember, the first three lectures were about understanding how we could generate images: let’s suppose we have a black-box model, what is the paradigm that would allow us to generate images? So the first lecture was about diffusion, learning how we could do that using the diffusion paradigm. And if you remember, the first thing we said was: what could be a good starting point for us to generate those images? So if you think about it, the images that we want to generate are part of a data distribution that we do not know. And that data distribution can be complex, and it’s difficult to sample from it. So one thing we said was, okay, what are some distributions that we know how to sample from? One such distribution was the Gaussian distribution.
And what we said was, maybe in order for us to sample from this complicated data distribution, one thing we could do is sample from an easy distribution and then make our way up to the complicated data distribution. So in this first lecture, what we said was, okay, let’s represent images with multi-dimensional variables that we denote x. And in order for us to learn how to go from Gaussian noise to clean images, we define a forward process that allows us to corrupt clean images into noise, and the purpose of diffusion is to learn how to reverse that process. So if you remember, the forward process had a nice formulation that also involved Gaussian distributions. And in order to derive the form of the reverse process, we tried to maximize the likelihood of the data under p_theta, and we went through several steps that allowed us to derive a loss. So if you remember, what we said was that maximizing the likelihood directly is not tractable. So what we do is derive a tractable lower bound, in which we used the forward process that we defined. This lower bound is called the ELBO, and what we did is expand the terms within that lower bound and show that we could actually compute it. It was tractable. And we ended up with a very simple L2 regression loss on the noise: the model takes a noised image as input along with its noise level, and it tries to estimate the noise that was added to produce that noisy image. So this was a first way of looking at things. Diffusion allowed us to go from an easy-to-sample distribution, the Gaussian distribution, to the data distribution by learning how to remove the noise.
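To make the noise-prediction loss recapped above concrete, here is a minimal sketch in plain Python. It assumes the closed-form DDPM forward process x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps with a linear beta schedule. The schedule values and the function names (`make_alpha_bars`, `add_noise`, `noise_prediction_loss`) are illustrative assumptions, not something stated in the lecture, and `predict_noise` stands in for the trained network.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process in closed form: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(predict_noise, x0, t, alpha_bars, rng):
    """L2 regression loss between the true noise and the model's estimate of it."""
    xt, eps = add_noise(x0, alpha_bars[t], rng)
    eps_hat = predict_noise(xt, t)  # hypothetical model: (noisy input, noise level) -> noise
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(eps)
```

A predictor that recovers the added noise exactly drives this loss to zero, which is the sense in which training teaches the model to remove the noise.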
The second lecture was about another way of doing things, which was: let’s look at this data distribution not in terms of noise that you need to remove, but more in terms of where you need to go. And we looked at a quantity called the score, which was defined as the gradient of the log of p. And we saw that this quantity had nice properties, and in particular it allowed us to not worry about the normalizing constant, which is intractable. And what the score was telling us was where to go in order to land at that data distribution. In particular, there was a nice formula from Langevin dynamics which basically said that if you knew how to compute the score, then you could sample from noise and make your way to the data distribution. So the whole goal of this lecture was to derive a way to estimate that score. The problem is that the score is not a quantity that we know. So what we did was introduce the Gaussian distribution to some extent, so that we could leverage the properties that we know from it, and in particular the fact that we know how to compute the score of a Gaussian distribution analytically. So we took observations from the target data distribution, we noised them, and then we looked at how we could estimate the score of the noised distribution. We saw that it was something that we could actually compute. But then there was a trade-off. The more noise you put in, the further your noised data distribution is from the data distribution of interest. But when you do that, you’re able to estimate the score of the noised data distribution quite easily.
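As an illustration of the Langevin dynamics idea recapped above, the sketch below samples from a one-dimensional Gaussian using only its analytic score. The target distribution, step size, and number of steps are assumptions chosen for the example. In an image model, the analytic `gaussian_score` would be replaced by a learned score estimate.

```python
import math
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score, num_steps=2000, step_size=0.01, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    if rng is None:
        rng = random.Random()
    x = rng.gauss(0.0, 1.0)  # start from easy-to-sample noise
    for _ in range(num_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x
```

Note that the sampler never evaluates the density itself, only its score, which is why the intractable normalizing constant does not matter here.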
So you have a trade-off: with a lot of noise, you can compute the quantity, but it’s not the quantity that you actually want; whereas with only a small amount of noise, the score of that low-noise data distribution is not something that you can estimate very well, but the target is closer to the actual quantity that you want. So you had this trade-off, and what we said was, well, how about we estimate the score not just as a function of where we are in space, but also as a function of how noised the data distribution is. And we ended up with the denoising score matching loss, which allowed us to estimate the score of that noised distribution given a noised image and the noise level. And once we did that, we saw that this way of looking at things was very similar to what we did in the first lecture with diffusion. We saw that the score of the forward process is equal to minus the noise that you added, divided by some coefficient, namely the standard deviation of that noise. So both approaches led to similar formulations, connected by this identity.

And then we realized that the way we were noising our data distribution was discrete, and that there were many choices you had to make at the very beginning, such as how many steps you want to take and so on. And so this motivated us to move from a discrete formulation to a continuous formulation. So we derived the continuous formulation of the forward process, which allowed us to come up with this stochastic differential equation: the change in position during the forward process is equal to a drift term that is deterministic plus a diffusion term that is stochastic. And here we’re using the continuous equivalent of the noise, which as we saw was the Wiener process dW. And you could figure out how to move your x when you wanted to noise it with respect to these two terms.
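The drift-plus-diffusion structure of the forward SDE can be simulated with a simple Euler-Maruyama loop. The sketch below assumes a variance-preserving SDE with a constant rate, dx = -0.5 * beta * x dt + sqrt(beta) dW. The constant `beta` and the function name `forward_vp_sde` are illustrative choices rather than the exact schedule used in the lecture.

```python
import math
import random

def forward_vp_sde(x0, beta=5.0, t_end=1.0, num_steps=250, rng=None):
    """Simulate dx = -0.5 * beta * x dt + sqrt(beta) dW from time 0 to t_end."""
    if rng is None:
        rng = random.Random()
    dt = t_end / num_steps
    x = x0
    for _ in range(num_steps):
        drift = -0.5 * beta * x                   # deterministic term, shrinks the signal
        diffusion = math.sqrt(beta)               # scale of the stochastic term
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Wiener increment over one step
        x = x + drift * dt + diffusion * dw
    return x
```

For this constant-rate case, the mean decays as x0 * exp(-0.5 * beta * t) while the variance approaches 1, which is the variance-preserving behaviour: the signal fades and is replaced by unit-variance noise.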
And we saw that the first approach from the first lecture and the second approach that we looked at here were special cases of this more generic formula. In particular, the DDPM formulation from lecture one was a variance-preserving formulation, whereas the one that we saw with noise conditional score networks was a variance-exploding formulation. And then what we said was, okay, our goal is not just to noise images, our goal is to learn the reverse process. And in order to do that, the nice thing was that there is a result from the 80s which tells us that, for a given forward process, there is a corresponding reverse process, for which you need to know the score, which is exactly the quantity that we’re estimating here. So what we had to do was just estimate the score in order to reverse the process.

So long story short: in lecture one we saw that in order to go from noise to clean, we need to predict the noise to remove. In lecture two we said that in order to go from noise to clean, we need to know where we want to move towards, which is the score. You can think of the score as a compass that tells you where the data distribution is located. And then in the third lecture we ended up with a third way of looking at things, which is the flow matching formulation, which frames the problem in a different way. What we did there was consider the noisy distribution at the beginning as just an initial distribution, and the data distribution of interest as just a target distribution. And the goal for us was to figure out how to move the probability density from the initial distribution to the target distribution. So that’s the flow matching way of looking at things. It’s the point of view of mass transport. And in particular, there was one quantity that was very important, which is called the vector field or the velocity, which is denoted u_t(x).
It is a vector that is defined at every position and at every time t. And what we saw was that in order to move each observation from one point to another, what we had to do was just follow the vector field. So we had this microscopic way of looking at things, which is the ODE that you see here: dx/dt is equal to the vector field at time t and position x. And we also had a second way of looking at the same thing. So the first formulation is the microscopic formulation, which is what is happening to your individual particles. The second way of looking at it is at the macro level, where what you say is that the change in the probability density at a given time t is equal to minus the divergence of the probability flux. This is the continuity equation, which just tells you that you’re not losing any probability mass between time t equals 0 and 1. And the whole point of us going into these formulas was to figure out a vector field that would allow us to go from this initial distribution to this target data distribution. In particular, let’s assume you know the vector field that you want to estimate. The strategy here would be to have a model that estimates the vector field, so here u_t^theta(x). Let’s assume you learn such a model. Then, in order to sample from the complicated data distribution, which is p_1, what you need to do is sample from your easy distribution p_0 and then numerically solve the ODE between time 0 and 1 using the vector field that you have learned. And we saw that it’s something you can do in practice if you construct the target vector field in a way that makes it tractable.
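The sampling recipe above (draw from p_0, then integrate the ODE from time 0 to 1) can be sketched with an explicit Euler solver. Since no trained model is available here, the example assumes a one-dimensional toy case where the velocity is known analytically: the path of marginals is N(t * m, (1 - t)^2 + t^2 * s^2), which moves a standard Gaussian to N(m, s^2). The function `gaussian_path_velocity` is a stand-in for the learned u_t^theta(x), and all names are illustrative.

```python
def gaussian_path_velocity(x, t, target_mean, target_std):
    """Analytic velocity for the Gaussian path N(t * m, (1 - t)^2 + t^2 * s^2).

    Uses u_t(x) = mu_t' + (sigma_t' / sigma_t) * (x - mu_t); stands in for a learned model.
    """
    var_t = (1.0 - t) ** 2 + (t * target_std) ** 2
    sigma_dsigma = t * target_std ** 2 - (1.0 - t)  # equals sigma_t * d(sigma_t)/dt
    return target_mean + (sigma_dsigma / var_t) * (x - t * target_mean)

def euler_sample(velocity, x0, num_steps=1000):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    dt = 1.0 / num_steps
    x = x0
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x
```

In this toy case the exact flow maps a starting point x0 to m + s * x0, so the Euler result can be checked against it. Using fewer steps trades accuracy for speed, which is the cost that straighter paths are meant to reduce.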
So we saw conditional probability paths, which were an easier way of looking at this: okay, we don’t know what the vector field is in the general case, so how about we look at a simpler case where we want to move the initial distribution not to the target data distribution but to a single point? That was the conditional probability path that we saw. And we saw that we could obtain a conditional flow matching loss that was equivalent to learning the vector field that is the aggregate of all these conditional vector fields, which allowed you to solve the problem of interest.

So these first three lectures were by far the most mathematically challenging. I hope what I said made sense. We spent six hours on this, and I hope I did a good job of recapping it in 10 minutes. At the end of it, I just want you to remember that we’re in 2026, and people nowadays use flow matching by default. So if there is one thing that you want to really master, I would recommend really understanding the flow matching part, because this is what most models nowadays use. In particular, they use a variant that we saw called rectified flow, which is an extension that makes the paths straighter, so that at inference time you can afford fewer steps in the numerical solver and sample images faster.

Okay. So, the first three lectures were the most mathematically challenging, and they were purely for us to learn a paradigm to generate images. But up until that point, we assumed two things. The first one is that we were generating images in an unconditional way. So we didn’t have any input prompt. We did not worry about this. And the second thing was that we just assumed we had a way to represent images. We represented them with x, some n-dimensional vector, but we did not see what representation would make sense.
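Going back to the conditional flow matching loss recapped above, here is a minimal one-dimensional sketch. It assumes the straight-line conditional path x_t = (1 - t) * x_0 + t * x_1, whose conditional velocity is x_1 - x_0, which is the choice associated with rectified flow. The function name `cfm_loss` and the scalar setting are illustrative assumptions, and `predict_velocity` stands in for the model being trained.

```python
import random

def cfm_loss(predict_velocity, data_samples, rng):
    """Monte Carlo estimate of the conditional flow matching loss on a straight-line path."""
    total = 0.0
    for x1 in data_samples:
        x0 = rng.gauss(0.0, 1.0)      # sample from the initial (noise) distribution
        t = rng.random()              # uniform time in [0, 1)
        xt = (1.0 - t) * x0 + t * x1  # point on the conditional path toward x1
        target = x1 - x0              # conditional velocity of that path
        total += (predict_velocity(xt, t) - target) ** 2
    return total / len(data_samples)
```

When the data is a single point c, the velocity (c - x) / (1 - t) matches every conditional target exactly, so the loss is zero. With many data points, the minimizer is instead the aggregate of the conditional velocities, which is the marginal vector field needed for sampling.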
And this was the focus of lecture four, which was titled latent space and guidance. What we saw there was that in order to have a meaningful representation of your images, you want a way to keep only the useful information. So how do you do that? Well, in pixel space, you have a pretty large amount of spatial correlation, which means that if you look at a pixel, a lot of the surrounding pixels have more or less the same value. So you can imagine that there is a lot of redundant information in there. That’s number one. And number two is that in pixel space the dimensionality is pretty high. Given that the cost of the model you’re going to build is a function of the dimension of your input, you want to make sure your input is not too big. So you want to compress the input into something that only keeps the useful bits. And in order to do that, we looked at a model called an autoencoder, whose only goal was to learn a way to represent images in a smaller space, called the latent space. So here on the slide you see that you have the image in pixel space as input, and then you have an encoder that does some downsampling operations and gets you a representation of the input image not in pixel space but in a lower-dimensional space, which is the latent space. And then you use that lower-dimensional representation to reconstruct your input. So you have a proxy task, which is: how well can you reconstruct the input by going through this bottleneck? And when you do that, you are indeed able to learn a latent space that compresses the information into fewer dimensions. But the problem we encountered was that if you did that, you had no way of controlling the shape of the latent space. And in particular, if you want to learn how to generate images, you want to have an easy time learning that space.
So you don’t want a space with spikes here and there and nothing in between. What you want is something more like the top of the slide, which is a data distribution that is more compact and well put together. So in order to do that, we went back to the way we formulated things and added a constraint to regularize the latent space. And this was the VAE that we saw, the variational autoencoder, which allowed us to structure the latent space in a way that would make our lives easier. And we saw that in order to derive a loss to learn that model, we used the same trick as in lecture one, which leveraged the ELBO, the evidence lower bound. I’m not going to go into details, but one thing I wanted to say here is that we ended up with a loss that had two terms. One was the pixel-wise reconstruction loss, which we had previously, and the second was a term that pushes the latent space as close as possible to a certain distribution called the prior distribution p(z), so that we get something that looks a little bit more like the top one.

Cool. So that was really the focus of that session, and we also learned how to represent the inputs. We saw different encoders, in particular transformer-based encoders including the ViT, the vision transformer, and then we saw that there are ways to combine different modalities in the same space, for instance with CLIP using a contrastive loss. And then we also saw methods to incentivize our generation to be more aligned with our condition. And here, if you remember, we looked at classifier-free guidance, which allowed us to do that.

Cool. So up until that point, we knew how to generate images in terms of the paradigm. We knew how to represent the input, and we knew how to represent our noisy latent. Well, now the question is, okay, what model are you actually going to use to do that? And this was lecture five, which was on image generation architectures.
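Coming back to the two-term VAE loss recapped above, the sketch below writes it out for vectors stored as plain lists. It assumes a diagonal Gaussian encoder output (mean `mu` and log-variance `logvar`) and a standard normal prior p(z), for which the KL term has a closed form. The function names and the `kl_weight` knob are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def vae_loss(x, x_reconstructed, mu, logvar, kl_weight=1.0):
    """Pixel-wise reconstruction error plus a term pulling the latent toward the prior."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed)) / len(x)
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)
```

The KL term is zero only when the encoder outputs exactly the prior, so it trades off against reconstruction quality. A small `kl_weight` favours faithful reconstructions, a large one favours a more regular latent space.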
And here we took a step back and reminded ourselves of the problem at hand, which is that we have a model and we want it to predict a quantity that will allow us to go from noise to clean. As I mentioned, the paradigm that is by far the most common these days is flow matching, so in particular a model that predicts the velocity. So what we want is a generation model that takes a noise level, the condition of interest such as your user prompt, and your noisy latent as input in order to predict the velocity. And we saw that one popular architecture was the U-Net, which was composed of two main parts. The first was a downsampling part, which allowed the model to gain a global understanding of the input through downsampling operations that made its receptive field bigger and bigger. What we mean here is that when you compute the activation map, if you take a single value, that value is the result of a pretty wide region of the input. So that’s the downsampling part, and then the upsampling part was just a way for us to get back to the shape that we wanted. And we had these copy-and-crop connections that allowed us to transfer lower-level details that were computed as part of the downsampling path. So this was the U-Net. But of course you know that in 2017 transformers took the whole field by storm, and more and more models were relying on the transformer and the self-attention mechanism, and it was not long before the diffusion transformer came into play, in 2022 I believe. That architecture dealt with a limitation of the U-Net, which was that patches that were far from one another in the image did not have a direct way of interacting with one another. And in cases where you wanted to generate an image of, let’s say, a teddy bear that looks at himself in the mirror, you would want that interaction to figure out the lower-level details.
And we saw that the diffusion transformer was injecting conditions using an adaptive layer norm framework, so it was just modulating the embeddings of the patches. And then we saw that, you know, today is May 29th, 2026, so now we have a lot of such models out there, and in particular one specific category has emerged, which is the multimodal diffusion transformer, which on top of doing self-attention also considers the condition as part of a joint attention. And we looked at the timeline. So here what’s in blue is U-Nets and what’s in green is DiT and its variants. Of course the timeline is not exhaustive, but it is just a way for you to see that nowadays almost every image generation model relies on a DiT-based architecture.

Cool. And then at that stage, we knew how we could generate images. We knew how to represent the input. We knew how to represent the latents. We knew what model to use. But we didn’t know how to train that model. And that was the focus of lecture six. Before digging into the details, one thing we wanted to align on was how the time step should be sampled during training, because back in lectures one to three, the time step that we sampled to train our model was drawn from a uniform distribution. And we spent some time thinking about whether all noise levels are equally hard or not. And we saw that noise levels that are neither very noisy nor close to clean, so those in the middle, were the most difficult, because those were the places where we had to make important decisions about where to go. And that’s the reason why sampling from a uniform distribution was not optimal. And we saw that the logit-normal distribution, which is a distribution that focuses much more on the middle steps, is more commonly used these days.
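A logit-normal time step is simply a Gaussian sample passed through a sigmoid. The sketch below shows this, with the location and scale parameters (`mean`, `std`) as assumed defaults rather than values given in the lecture.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian sample through the sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))
```

With these defaults, roughly 73% of the samples land in [0.25, 0.75], compared with 50% under uniform sampling, so training spends more of its budget on the intermediate noise levels described above. Shifting `mean` moves the emphasis toward one end of the noise range.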
And we saw that not only that, but the resolution of your images also matters in terms of what the perceived noise is. In particular, if you want to train a model on higher-resolution images, you want to make sure to noise your images a bit more than you did previously, because for a given noise level, a lower-resolution image will appear noisier than a higher-resolution one. And we did some math on the blackboard, if you remember, where we computed the uncertainty around the underlying pixel value. But intuitively it comes from the fact that there is spatial correlation within your image. So if you have more pixels to represent, let’s say, an area of the image, then when you noise them you have more chances to see the actual value, and when you look at the image you will have an easier time seeing what it is about. Whereas if you have a lower-resolution image and its pixels are noised, then you have fewer chances of seeing the underlying true pixel value.

And with that, we saw that there are a few stages in a typical model training process. The first one is pre-training, which is learning how to generate images. It’s the most time-consuming and expensive part of the whole pipeline, because you need to come up with a very big corpus of images that covers all of the possible images that you want to generate, and it needs to be of a certain quality, and you need to have good mixtures and so on. So a lot of effort is put into having pre-training data that reflects what you want your image generation model to learn. And after that, we saw that once you have a model that knows how to generate images, what you want is to generate good images, and good here can be something that people judge just in aesthetic terms.
So you want your image to look good, but you also want your image generation model to know how to generate images in your field of interest or for your task of interest. So for instance, if you have an image generation model that has been pre-trained on, let’s say, nature and, I don’t know, houses, but you want that model to generate images of, let’s say, teddy bears, then what you want is an extension of that training, which we call continued training, where you teach your model how to generate images in your set of images of interest, which here can be teddy bears. And we saw that a third, optional step could be the tuning step, where you want to generate images of one particular subject or one particular person many times. If you are in that situation, you want to tweak your model so that it remembers what you want to generate. And we saw DreamBooth as a way to do that. The idea was to gather a set of images, typically five to ten, that contain what you want to depict, and to train your model to generate images that contain that specific person or object while having a rare token as input. So what you were teaching your model was to rewire its brain towards generating that subject when it saw a specific token. And given that these models can be pretty big, we saw that there were also techniques to tune not all the weights but a much smaller set of low-rank weights, through a technique called LoRA, low-rank adaptation.

And once you did that, okay, let’s suppose you did either the first two or all three of the steps, what you want now is to deploy your model to production. So you want a model that does not cost you a lot and also does not take a lot of time to generate. So we saw a bunch of methods that aimed at reducing the number of steps needed to generate samples.
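As a small illustration of the low-rank adaptation idea mentioned above, the sketch below keeps a frozen weight matrix W and adds a trainable low-rank update, W_eff = W + (alpha / r) * B A, with A of shape r x in and B of shape out x r. Matrices are plain nested lists, and the helper names (`matmul`, `lora_effective_weight`, `count_params`) are illustrative assumptions, not a real library API.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in), rank at most r
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def count_params(out_dim, in_dim, rank):
    """Return (full fine-tuning parameters, LoRA parameters) for one weight matrix."""
    return out_dim * in_dim, rank * (out_dim + in_dim)
```

For a 1024 x 1024 layer at rank 8, this gives 1,048,576 parameters for full fine-tuning versus 16,384 trainable parameters for the low-rank update, which is why the approach is attractive for large models.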
So this category of methods is called distillation. We saw a number of them, so I would encourage you to look at that lecture. Progressive distillation was one example, but there were some others as well.

And then finally, last week what we said was, okay, great, we can do all this, but we have not seen how to evaluate how good our images are. And if you don’t know how to do that, well, you don’t know where to put your efforts. So you need to know how well you’re doing, which is why lecture 7 was focused on evaluating images. We saw that the most common way people evaluate images on things like leaderboards is by having images generated by the models on the leaderboard and running pairwise comparisons between them. And we saw that there was a particular way of measuring performance that took into consideration the history of each model, and by history I mean how people rated that model. The reason is that winning against a weak model is not the same thing as winning against a strong model. So you want to take into consideration how strong your opponent is. Otherwise, a metric like the win rate would be very dependent on who you are making your comparisons with. And we saw this metric called the Elo rating, or Elo score, which looks at this in the following way. You have a rating, let’s call it R, that quantifies how good each model is. Now let’s suppose you have a new model and you want to compute its rating. What you would do is compare that new model to, let’s say, a model that is already part of the list, and you would use the ratings of the models to quantify how good they are and take that into consideration in order to find the rating of the new model. So how do you do that? Well, the idea is quite simple. The ratings quantify how good models are.
So from that quantity you can compute what you expect will happen. For instance, if you have a strong model and a weak model, well, you expect the strong model to win. So you compute a quantity called the expected score, which is a function of these two ratings, and you compare it with the actual score, and the difference between the two, which here is delta, tells you how surprised you are. So if you won against a strong model, then you want your delta to be high, and if you won against a weak model, then it’s not a lot of signal for you, because almost everyone wins against that weak model, so delta would be lower. All of that to say that you will come across the Elo score quite a lot. And I just want you to remember that the Elo score is a smart way of aggregating pairwise comparisons between model A and model B by taking into consideration how strong your opponent is.

Cool. But not everyone has the luxury of having human ratings for everything they do. So that’s why we also have a bunch of automated metrics. We looked at one that is quite popular nowadays, which is called the FID score, the Fréchet inception distance. What this score does is compute the distance between the distribution of your generated images and the distribution of real images. And the idea is that if these distributions are far apart, then it means that your generated images don’t look like real images. I didn’t put the formula here, but the FID score is derived from a more general formulation, and the actual formula is obtained by assuming that these two distributions are Gaussian. So that’s the assumption we’re making when we compute this metric. And we saw that the lower this metric, the better. And of course, it’s a proxy, so it’s not a perfect metric. And then we looked at more clever ways of obtaining scores, in particular by leveraging these newer models.
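The expected-score and delta computation described above for the Elo rating can be sketched as follows. The logistic formula with base 10, the scale of 400, and the step size `k=32` are the conventions of classical chess Elo and are assumptions here. A given image arena may fit its ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_rating(rating_a, rating_b, actual_score_a, k=32.0):
    """Move A's rating by k * delta, where delta = actual score - expected score."""
    delta = actual_score_a - expected_score(rating_a, rating_b)
    return rating_a + k * delta
```

Two equally rated models each have an expected score of 0.5, so a win moves the winner up by k / 2. A win against a much stronger opponent has a small expected score and therefore a large delta, while a win against a much weaker one barely changes the rating.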
So here MLLMs, which stands for multimodal large language models, are able to take both text and images as inputs. So you could very well say, okay, you have this image, you have this input prompt, tell me what it looks like or how good it is, and you could obtain some scores. And we saw that you could leverage these models so that you don’t have to ask humans to rate your images, but instead ask your MLLM as a judge to rate them for you. You get a tighter loop that allows you to iterate faster, and you collect human ratings when you feel like your model is good enough to take that step.

And this was CME 296. So you see that in order to generate an image from a prompt, there is a lot going on underneath. And this is what lectures 1 to 7 were about.

Cool. So now you’ve finished this class after this lecture and you’re like, okay, now I know all of this, so what? So I wanted to talk to you about the state-of-the-art models that are out there. What we did was go to a popular arena that ranks the best models out there. There’s a leaderboard, and as you can see, the Elo score is used to determine the ranks. I wanted us to look at which models were the best and whether they were also using the methods that we saw in this class or not. So let’s start with the best models out there. We see that the first two ranks are from OpenAI, so GPT Image, and then we have two models from Google. And then the fifth model is from xAI. So it’s all closed-source models from top AI labs. The problem with those models is that we don’t know how they work, because they’re not publishing any reports. So there’s not a lot I can tell you. But that leaderboard also ranks open-weight models that have public technical reports. And here, this is a screenshot that dates from about 5 days ago.
And we see that the highest ranked model is from a company called HiDream. So it’s HiDream 01 Image at number one, and then you have Qwen-Image, which we saw in lecture five when we talked about the multimodal diffusion transformer, and then ranks three to five are this model from Black Forest Labs called FLUX.2. And what I wanted us to do was go through these models and see what they’re made of.

So let’s start with the last ones, the FLUX.2 models. They’re based on rectified flow, which as I mentioned is a derivative of flow matching, so something that we know. Their architecture is based on the diffusion transformer, and in particular on a combination of single-stream and double-stream diffusion transformers, which is also something that we saw. They rely on a VAE to make sure that the latent space is compact and easy to learn from. That’s also something that we saw. And there is also a pre-trained text encoder, and here it’s Mistral 3 that is used to generate the embeddings. So up until now, the usual, right?

Okay, let’s go to model number two. So, Qwen-Image: the loss is a flow matching loss, so also something that we know. The architecture is a multimodal diffusion transformer, it also relies on a VAE, and the text embeddings are based on Qwen. Also the usual.

Now let’s look at the top ranked model, which by the way was published I think two weeks ago, so it’s extremely recent. Flow matching loss, so similar to what we saw. The model is not exactly a multimodal diffusion transformer, but it’s based on the transformer, so also very similar. But it turns out that there is no VAE anymore, and there is no pre-trained text encoder. So does that mean that what we learned was not true? What’s going on? Let’s see how that relates to what we learned.
So that paper what it does is it actually performs the generation not in a latent space but in pixel space. And now you may think well how how is the dimensionality going because it’s much higher than a latent space. So that’s the first question. And then the second question is pixel space is much harder for you to learn because we saw that in the space valid images in the pixel space are more isolated. They’re not smooth distribution. Well here here is my interpretation. So first of all for the dimensionality. So it turns out that the patch size that the paper uses is not the traditional 2x2 that we’re used to for latent space. It’s actually a much bigger one. So it’s a 32x 32. So computationally here the idea is to make it tractable by just having bigger patches. And then when it comes to learnability, well, the fact of not having an easy space to learn from is just shifting the effort that needs to happen over to the diffusionbased transformer. And so what I found very interesting with this paper was even if you make your problem harder for your diffusion transformer or your transformerbased model, well you can still have amazing results if you scale it well enough because that model well turns out that it was scaled I believe to 8 billion and then also to 200 billion parameters which is huge for image generation models. And so what I wanted to say was that the space is fast moving, that what we talked about still holds true, but that the trade-off between making it easier for the transformer to learn versus doing it in the raw space is still something that was being figured out because there’s something that I have not mentioned which is when you learn the latent space. So you’re making it easier for your model to learn. You’re having a more compact space, but then the main thing that you’re losing is fidelity because it’s a lossy operation for you to operate in a space that is not your original space. 
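To make the dimensionality argument concrete, here is a minimal sketch that counts how many tokens the transformer has to process in each setting. The 1024x1024 resolution, the spatial compression factor of 8, and the 16 latent channels are assumptions chosen for illustration; only the 2x2 and 32x32 patch sizes come from the discussion above.

```python
def num_tokens(height, width, downsample, patch):
    """Number of transformer tokens for one image.

    downsample: spatial compression factor of the VAE (1 means pixel space).
    patch: side length of the square patches cut from the latent or pixel grid.
    """
    grid_h = height // downsample
    grid_w = width // downsample
    return (grid_h // patch) * (grid_w // patch)


def values_per_token(patch, channels):
    """Number of raw values packed into a single patch token."""
    return patch * patch * channels


# Assumed example: a 1024x1024 image.
latent_tokens = num_tokens(1024, 1024, downsample=8, patch=2)   # latent space, 2x2 patches
pixel_tokens = num_tokens(1024, 1024, downsample=1, patch=32)   # pixel space, 32x32 patches

latent_values = values_per_token(2, channels=16)  # 16 latent channels is an assumption
pixel_values = values_per_token(32, channels=3)   # RGB pixels

print(latent_tokens, pixel_tokens)   # 4096 1024
print(latent_values, pixel_values)   # 64 3072
```

Under these assumptions the pixel-space model actually sees fewer tokens than the latent one, but each token carries far more raw values, which is one way to read the remark that the effort shifts over to the transformer.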
So you know, when you train your VAE you just hope that you will reconstruct things faithfully, but it is not always the case, and it just turns out that this paper shows that the downsides of the VAE are such that we could maybe work without it if you scale things in a big enough way. So I found that interesting, and I think it’s a newer trend of doing things without a VAE. So I think it’s just good for us to keep an eye on this. So again, this paper was published a few weeks ago. I don’t want to make general statements, because maybe in a couple of months we will see that the VAE is necessary, but I think this is a part that is worth watching. Cool. So any questions on that? Yeah. Oh yeah. So the question is what is the alternative to a pre-trained encoder? So the alternative is to just learn it yourself. So instead of having a pre-trained encoder computing the embeddings for you, what you do is you take the text as input and you just tokenize it. So you divide it into arbitrary units of text and you learn their representation as part of your training. And there is one thing I didn’t say here: they’re not using a pre-trained text encoder to do that, but they’re doing some kind of prompt enhancement that allows you to be more explicit about the input prompt. Because you may have a prompt like "a teddy bear is reading", but in order to generate a very good image, you need to know for instance how the lighting should be, what the camera position should be, and so on. And so that piece is also making the job of the text encoder easier, because it just makes explicit what you need to have as a condition. Yeah. So the question is, if there is no separate text encoder, how are they able to understand text? So here the keyword is pre-trained. So there’s still a text encoder, but it’s just not something that you take off the shelf. It is something that you train yourself as part of the training. Okay, cool. 
So this concludes part one, which was about recapping what we did up until now. So now that we know all of this, what I want us to do is to see how adjacent fields can benefit from what we have learned. And a very natural extension that I want to talk to you about is videos. So if you think about it, videos are nothing else than a sequence of frames, which are images. So you just have an extra dimension. So here you go from a 2D image to a 3D image across time. And the question now for us is what should we have in mind when we want to build a video generation model, and what are the things we can use from what we have learned in order for us to be able to do that. So the first thing that I want to tell you is there is an extra dimension, which is time. The second thing that I want to talk to you about is that in a video you need to make sure that there is some temporal consistency, as in you cannot go from one frame to the other in a way that completely changes what you’re looking at. But you also need to make sure that whoever is being depicted in your frames does not suddenly have things that they did not have. So for instance, if you’re looking at a video of a teddy bear reading, you don’t want the teddy bear to all of a sudden acquire a hat and sunglasses at t plus one, but then at t not have any of those. So this is an extra thing that you need to worry about. So for images you wanted to generate something that was plausible. For videos you need to generate something that looks plausible at a given time t, but you also need to make sure that the way frames are sequenced is also making sense. So you need to have that temporal consistency. And of course you need to make computations tractable, because here, as you can imagine, if we go the route of representing our input as a sequence of 2D images, well, the dimension just got increased by a factor of T, which is the time. And so you need to make sure that whatever you’re doing is still keeping things tractable. 
And of course metrics as well, and I’ll just give you the spoiler. So we still leverage the metrics that we have seen for images, but we just extend them for videos. So the Fréchet inception distance, FID, is something that you may also see in the video world, where the only thing that changes is the representation of the quantities that you’re dealing with. So instead of having the Inception network represent your image, what you’re doing is you’re using a pre-trained encoder that represents your videos. And you have for instance this metric called the Fréchet video distance that tells you how far apart your generated videos are from real ones using a similar method. But I want to call out one thing. These metrics are just proxies. So it’s always good to have a human in the loop. Okay. So now let’s see how we can handle generating videos by going through specific parts of the architecture that we may need to change. So if you think about it, a traditional image generation model is a DiT-based model operating in a latent space. So you have a VAE that compresses your input into some latent space, and then you obtain, let’s say, the final prediction that you then decode back into your initial space. Well, the same thing is happening for videos. So here you are not only compressing from a spatial perspective, you’re also performing a temporal compression. So do you remember when we talked about f, which was the ratio of the height in your initial space over the height in your latent space? Do you remember? So maybe I’ll just quickly go over it. So we had this pixel space, and then we have the latent space here, and then you have big H here and little h here. So we had f, the spatial compression, which was just the ratio of big H over little h, and that ratio is typically on the order of eight for spatial compression. Here what you also want to do is to compress your input along the time dimension as well. So why do you want to do that? 
Well, the first thing is you want to make sure your problem is tractable from a dimensionality perspective. But then the second reason is that when you compress things into a smaller space, what you’re assuming is that you have redundant information. So spatially we saw that we have spatial correlation, meaning around a given pixel you have pixels of similar value, so you have redundant information. Well, it’s the same along the time axis, because you can very well imagine that if you go through the frames of a video, two consecutive frames will have a lot of the same information. So it would also make sense for you to compress things along the time axis. And so here this compression ratio, which we saw for space, is also something that we have along the time dimension. And as a result your latent space is composed not of spatial latents but of space-time latents. So what that means is the quantities represent not only things that are within a given image but also things with respect to time. So there is one thing that you may find interesting, which is that the number of frames in the input is 1 + T, and then the latent space has 1 + T/4 frames. But then you may ask, okay, why the 1 plus? Why one? And I would say it’s a good question. It’s also a question I asked myself. Well, it turns out that if you want to generate a video, you need to have a starting point. You need to have an initial frame. Well, that one is a way for us to consider the first frame as being something special. It has a special place. It’s the anchor frame. We want to make sure we’re representing it as well as possible. And so the reason why we have the 1 plus is to represent that first frame, that anchor frame, to its fullest extent, and then to use that as a way to continue the video in a way that is a natural continuation of that first frame. Does that make sense? So, by the way, we’re going super quickly over a lot of things. 
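As a small worked example of the 1 + T to 1 + T/4 bookkeeping, the sketch below computes the latent shape of a clip. The function name, the 81-frame 480x832 clip, and the default factors (8 for space, 4 for time) are assumptions picked for illustration.

```python
def latent_video_shape(num_frames, height, width, spatial_f=8, temporal_f=4):
    """Shape (frames, height, width) of the latent for a clip of 1 + T frames.

    The first frame is the anchor frame and is not compressed along time;
    the remaining T frames are compressed by temporal_f.
    """
    t = num_frames - 1  # frames after the anchor frame
    if t % temporal_f != 0:
        raise ValueError("expected 1 + T frames with T divisible by temporal_f")
    return (1 + t // temporal_f, height // spatial_f, width // spatial_f)


shape = latent_video_shape(81, 480, 832)
print(shape)  # (21, 60, 104)
```

With these numbers, 81 input frames become 21 latent frames: one for the anchor frame and 20 for the remaining 80.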
So, my goal is to just give you some pointers in case you’re interested, let’s say after today, and you want to explore more. These are just some small little things. The second thing I wanted to talk about with respect to the VAE is that, as we said, it is not a 2D VAE, it’s a 3D VAE, because it’s also operating along the time dimension. So you will see that that VAE is called a causal VAE. So why causal? So when you perform, let’s say, convolutions, which is typically one of the operations that a VAE contains, you usually perform your convolution in a symmetric manner. But here what you want to do for a given frame is to not compute feature maps that are a function of future frames. You only want the output for that frame to be a function of the same frame and the frames before. So that is why it’s called a causal VAE. It’s because the convolution is not the symmetric one that you’re thinking of. It’s asymmetric. And people do that for several reasons, but one of the reasons is you don’t want to go out of memory every time you do these things. So what people sometimes do is stream this encoding process. And you need to make sure that a given frame is not dependent on the future frames, keeping in mind by the way that the receptive field can widen if you have several convolution operations. So if you make sure that it only depends on that frame and the frames before, then computationally speaking you can actually stream that encoding and decoding process. Just giving you a reason why. And then the second thing that I wanted to talk about with respect to the video, oh yeah, you have a question. So yeah, great question. So the question is why would you need frames before the last frame? So let me give you an example. So let’s suppose a teddy bear walks across the street and the camera looks at that teddy bear and it sees a pedestrian that is coming this way and then later on that pedestrian comes back. 
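The causal convolution mentioned above can be illustrated on a one-dimensional toy signal along the time axis. This is only a sketch: real video VAEs apply 3D convolutions over space and time, and the function name and kernel values here are made up for the example.

```python
def causal_conv1d(frames, kernel):
    """Convolve along time so that output[t] depends only on frames[t], frames[t-1], ...

    kernel[0] weights the current frame, kernel[1] the previous one, and so on.
    Missing past frames are treated as zeros (left padding only).
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * frames[t - k]
        out.append(acc)
    return out


frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.3, 0.2]
out = causal_conv1d(frames, kernel)

# Changing a future frame leaves all earlier outputs untouched.
changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
print(out[:3] == changed[:3])  # True
```

Because no output looks ahead, the frames can be fed in chunks and the earlier outputs never need to be recomputed, which is what makes streaming the encoding possible.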
So the information of who is going behind that teddy bear is something that is important to you. You want to be able to capture things that makes your video consistent. So that temporal consistency, which is making sure you’re not inventing objects or people in your video that may have happened in the past. So that’s the reason why you want your receptive field to not be one. So you want to have some dependency on earlier frames compared to the the last one. Does that answer your question? Yeah, the question is uh yeah efficiency concerns and yeah I hear you. So um there is something we’ll discuss which is exactly what is happening in the latent space. So of course you cannot generate a video of you know very very big time t. So what people do is divide the generation into several parts where you can generate a video of let’s say a fixed length and then take the last frame as the anchor frame to generate another video of let’s say another fixed length. So one way for you to make this more tractable is to just take the last frame of the video that you generated as the first frame and then continue later on along with some condition. So yeah. So the the question here is how can you um teach your model how to generate the next frame given the given some conditions. So we’re actually going to see that in the next slide. So this part is just about compressing the input, but we’re not yet generating the video. So we’re going to see that in just one second, but yeah, great question. Cool. Any other questions? Okay, so I’ll answer your question right now. The way we generate videos is within that latent space similar to how we do it for image generation. We have this DIT based architecture that works from patches and that generates patches. The main difference from images is that these patches they are not space patches. They’re not spatial patches. They’re spacetime patches. 
And what that means is as input, we have if you want something that represents some part of the of the let’s say image, it’s not really image because you’re in the latent space across time. And in order for you to think that you’re doing a good job as producing a video that is consistent with oneself, one thing that you can tell yourself when you’re looking at that whole process is that within the DIT, you have the self attention mechanism that allows your space-time patches to interact with one another so that you’re producing an output that is coherent. across space but also across time. So I think your question also involves some amount of how can you make sure that it reflects causality. So can you expand on what you mean by causality? Yeah. So the question is okay how can you make sure you have some causality versus correlation? I think at the end of the day what you’re doing is you’re training your model on some data and the resulting model is just reflecting the patterns that it has seen at training time. So if you feed it things that reflect uh for instance if you put uh I don’t know a book on a table and there’s some dust that uh goes away these are the patterns that your model will also learn. So I think at the end of the day it all goes back to what kind of data you put inside. So yeah that’s how I would think about it. Yeah. Yeah. Yeah. So the question is uh well we know in LLM we have a masked self attention that makes things causal. Maybe there’s something that we can reuse here. Um well in the image generation world we’re making sure that all parts of the image they interact with one another and this is actually something that we also have here. The main reason being that you want some consistency between every part of your video. And so for that reason people typically keep the full self attention. Um, but then at the end of the day, you you need to try it out and see how it works. But yeah, typically people keep the self attention. Yeah. 
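To get a feel for what full self-attention over space-time patches costs, here is a rough count. The latent shape of 21x60x104 and the patch size of 1x2x2 are assumptions for illustration, not values taken from a specific model.

```python
def spacetime_tokens(latent_shape, patch_shape):
    """Number of space-time patches cut from a latent of shape (frames, height, width)."""
    frames, height, width = latent_shape
    pt, ph, pw = patch_shape
    return (frames // pt) * (height // ph) * (width // pw)


n_tokens = spacetime_tokens((21, 60, 104), (1, 2, 2))
n_pairs = n_tokens * n_tokens  # full self-attention lets every patch look at every patch

print(n_tokens)  # 32760
print(n_pairs)   # 1073217600
```

The number of attention pairs grows with the square of the token count, which is one reason the temporal compression in the VAE matters so much for tractability.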
Okay, great. So that model that we just saw is a model that is open weights. It’s called Wan. There are actually multiple models out there, and this list of readings is by no means comprehensive. So in case you’re interested, feel free to just take a look at those. So Wan is one such model. I believe there’s also LTX as well. I highly recommend reading the papers. I would say that given the fact that you know how image generation models work, it’s now much easier to understand these papers. So yeah, hopefully that’s going to be helpful. Cool. So video generation was one adjacent field that I wanted us to touch on. The second topic that I wanted us to talk about is image editing. So let’s imagine that you have an image and what you want to do is to edit that image. So you can say, okay, make this image black and white. So you could very well use what we have learned, which is you put that into your text-and-image-to-image model, so TI2I. So how do you do that? Well, you represent your condition, which is the text and the image, as part of, let’s say, your MMDiT, and you inject that and you can obtain an output image. Well, the problem here is that you’re considering this problem as a from-scratch generation problem. So you’re telling your model, make this image black and white. And what you’re expecting is that exact same image but black and white. And you can just ask yourself, okay, is what I’m doing actually optimal? So what I’m doing is I’m asking a model to generate a whole thing from scratch and I have no guarantee that it’s the same image. So the question here is, for image editing tasks, specifically in cases where you want the input image to be preserved, is there a better way to edit those images than to consider this as a from-scratch generation problem? Well, in this case, you see that in the black and white image the teddy bear raised its right arm, and this is not what you want. 
So people have been thinking about this problem from another angle. So what they’re saying is instead of considering this as an image generation problem, let’s actually think about this problem as an image editing problem. And by editing I mean in the sense of performing some operations on the image. So one idea could be to have your image to have your prompt be fed into a VLM. So vision language model which is an MLM multimodel large language model in order for you to receive editing actions that could be maybe okay um decrease the let’s say brightness by I don’t know 50% just increase the luminosity by x%. that you could then interact with an editing software like Photoshop in order to get your final image. So the nice thing with this approach is well first of all you have some guarantee that you are preserving your initial image. If let’s say you’re constraining the sets of actions that you’re taking to let’s say an allow list of actions that here can only be harmless like coloring actions. But one challenge is for your VLM to know your the set of possible actions well enough for the editing action to actually make sense. So that’s the main challenge and so that’s why some people I mean when I say some people like a lot of people are trying to think about how to do this. So there’s a very short list of papers that um try to resolve that. But there is one method that I want to mention here which is how do we want our output like how can we make sure that our output is something that is aligned with something that makes sense. Well, a possible method is to look at the logs of people who are actually making edits. So, let’s assume you go to your favorite editing software, you have an input image, and then you make some edits and then you have a final image. Well, the sequence of edits that you made is something that your model can learn from. 
Well, the only thing that your model doesn’t really know is what your intent was because you typically don’t tell your editing software, well, I want to um make my image more like this. You don’t say that. So these papers so most of them what they’re trying to do is to come up with pairs of initial image edited image along with a user intent that is inferred by these two images let’s say from annotators and what they’re doing is they’re tuning the VLM on these golden sets in order to have it be more have it behave more like something that would tell you actions that would correspond to your intent. So that’s why you have papers on this is just because that in order for your model to learn how to do these things, you need to do some work in terms of which data to use, how you want to train these things. And that’s why they’re here. So yeah, highly recommend reading them. It’s an open area of research. If you take a look at the years they were published, I mean they were published 2024, 2025, 2026. So extremely hot. Any questions? Yeah. So the question is how can the loss reflect the user intent? Well, the thing is you are not using the loss to infer the user intent. So, one possible way of inferring user intent is to feed an offthe-shelf VLM an initial image and an output image and to tell your VLM, well, tell me what changed between these two images in order to infer this box, you know, make this image black and white. So, I’ll give you an example. If we feed a VLM the colored image and the black and white image and we ask the VLM well come up with something that tells me what changed between this and that image. It gives you the user intent and then what you can do is use that user intent along with the initial image and then you can of course collect the editing action that the user took and tune your VLM to output these editing actions given this input. So yeah, that’s one popular way I’ve been saying this papers. Yeah, cool. Great. 
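The idea of constraining a VLM to an allow list of harmless editing actions can be sketched as a simple validation step between the model and the editing software. Everything here is hypothetical: the action names, the parameter names, and the dictionary format are invented for the example and do not correspond to any real editing tool or paper.

```python
# Hypothetical allow list: action name -> parameter names it may carry.
ALLOWED_ACTIONS = {
    "desaturate": {"amount"},
    "brightness": {"delta"},
    "contrast": {"delta"},
}


def validate_actions(actions):
    """Split proposed actions into those on the allow list and those that are not."""
    accepted, rejected = [], []
    for action in actions:
        name = action.get("name")
        params = set(action.get("params", {}))
        if name in ALLOWED_ACTIONS and params <= ALLOWED_ACTIONS[name]:
            accepted.append(action)
        else:
            rejected.append(action)
    return accepted, rejected


# Pretend these came back from the VLM for the prompt "make this image black and white".
proposed = [
    {"name": "desaturate", "params": {"amount": 1.0}},
    {"name": "regenerate_region", "params": {"prompt": "raise the right arm"}},
]
accepted, rejected = validate_actions(proposed)
print([a["name"] for a in accepted])  # ['desaturate']
print([a["name"] for a in rejected])  # ['regenerate_region']
```

Only the accepted actions would be forwarded to the editing software, so the input image can only be changed in the ways the allow list permits.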
So, now we’re in the last part of what I wanted to talk to you about, and that part is actually an area that I think is a very hot area, which is how can we apply diffusion to the field of LLMs. So as you know, there’s been a lot of transfer of knowledge from what has worked well in the text world to the vision world. So in the text world, as you know, in 2017 we had the transformer that was initially designed for translation tasks. So let’s suppose you want to translate from English to French. What you do is you use this transformer to take some English text as input in order to have generated French text as output. And as you know, people in the vision world adapted the architecture in a way that brings some of the scalability benefits of the transformer to the vision world. And in particular there is one architecture we have spent a bunch of time on, which is the diffusion transformer, which relies on that. So that’s one thing. The next thing is something we have briefly talked about during, I believe, lecture six, which is around post-training approaches, and in particular the fact of injecting negative signals into our model. So there is this tuning method called DPO, which stands for direct preference optimization, from the LLM world, that was adapted to the diffusion world. So that’s one, and another one that you may be aware of is GRPO, which is now widely in use in LLMs. That is also something that people have experimented with in the vision world, with Flow-GRPO. So now you can ask yourself, okay, text has given so much to us, what can we give back to text? Well, I know this class is not about text, so I will not go into the details, but one thing I want to talk about is that in the text world, we are doing things mostly in an autoregressive way. 
A little bit like when you and I talk, what we do is we generate one word at a time, and whatever comes out is a function of whatever came before. Just to illustrate this: you have your LLM, which here we say is autoregressive. Let’s call it an ARM LLM, for autoregressive model. So here we have, let’s say, a token that says okay, we’re starting the sentence. So we say okay, "a", and then "a" goes into the model along with everything that came before, and we get "a teddy", and we repeat it again and again. "Teddy bear is reading." We’re generating sentences by sequentially inputting tokens one at a time and having the model look at everything that it has generated before in order to generate the next token. Well, the problem with that is that if your output sentence or your output response is long, then you will take a lot of time outputting that response, because the number of iterations here is in O of the number of tokens that you output. So in particular, I’m not sure if you do a lot of coding on a day-to-day basis, but let’s suppose you want to code a huge thing, like a thousand lines of code. Well, what that model is doing is really outputting things one at a time. It takes a long time, and of course you want this to be quick. Well, one idea is to borrow this idea of diffusion for text. So here what you can think of is, instead of doing things one at a time, how about we do everything at once, and we start from noise, and what we do is we progressively denoise in order to obtain our final output. Why wouldn’t we do that? So, by the way, it sounds very unnatural, because I think you’re very used to saying things in a sequential way, but one analogy I want to point out here is, let’s suppose you’re writing a speech. So first of all, I’m not sure if you’ve ever written a speech, but if you were to write a speech, you typically don’t write things in a very sequential way. You first start with a draft. 
You have some skeleton of the bullet points you want to talk about. And then what you do is you refine, going from coarse to fine-grained. So what I’m saying is this idea of doing everything at once, something that goes from noise to clean, although it’s not natural, is not completely unreasonable either. Do you agree with what I’m saying? Yeah. So let’s suppose we do that. Then what we would do is reduce the complexity from having the number of iterations being in O of the number of output tokens to being in O of the number of diffusion steps. So let’s see how we want to do that. So let’s assume we take text with random noise. And by the way, I’m not defining noise just yet, but just assume that the text is noised in a way that we will see later. Let’s suppose we have a diffusion-based LLM. So what we’re saying is that we have this text with less and less noise, up until having the clean text, exactly in the same way we have done for images. So I just want to make sure this analogy is very clear here. We’re denoising the whole input text in the same way that we’re denoising the whole image all at once. Well, a natural question is: we have seen that we can think of noise in the image world as Gaussian noise. But for text it’s a little bit trickier. And here’s the reason. In the text world, things are discrete. So you have words, you have tokens, you have this, you have that. You’re not in a continuous world. So in the image world, you would represent your noise with a continuous multivariate variable, typically Gaussian noise. But what would you do with text? So one idea you can have is, okay, let’s just throw in whatever token we have. That could be one option. Well, the problem is each token has a semantic meaning, and so if you just take any token and you use it as a way to noise your input, then you may give your input a meaning that is not a meaning you want your input to have. 
So for that reason, one thing that the field is starting to align on, which I believe is the most common thing, is to have an actual dedicated token, typically a mask token, that would just represent a text token that you do not know. So a noised text token. So let’s assume you do that. Then how would we apply what we have seen so far? Well, we have seen that for training. So I will spare you all the mathematical derivations, which some of these papers go into. We will not go into that right now. But let’s assume that we want to mimic the training that we have done for images. How would we do that? Well, we would take a clean sentence, as in an unmasked sentence, and what we would do is corrupt the text according to a noise level. So here what does that mean? That means that we would have, let’s say, a probability t, let’s suppose t equal to 0.5, so a 50% chance, of a token in that sentence being masked. So what we’re saying is, let’s suppose t equal to 0.5, we’re masking 50% of the tokens on average, and then our training strategy can be to reconstruct the tokens that were masked as a function of whatever was in the input that was not masked. So that can be your training objective. So I know this is not a text class, but for those of you who know about BERT, which is an encoder-only architecture, it also has something similar as a pre-training task. The only difference is that here the masking scheme is done with respect to a noise level that can vary. So you can have a high t, so a lot of noise, so almost everything masked, versus a little bit masked. Whereas for BERT it’s always a fixed proportion of masked tokens. I just want to call that out. So let’s suppose you have that. Then what you’re doing is you are teaching your model to fill in the blanks. Well, now how would you go about actually using such a model at inference time? So what you would do is, same as what we did for images, start with completely noised inputs. 
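The corruption step used for training can be sketched in a few lines. This is a minimal illustration under stated assumptions: the mask string, the function name, and the choice to put a loss only on masked positions are illustrative, and real systems operate on token ids rather than strings.

```python
import random

MASK = "[MASK]"  # illustrative mask token


def mask_tokens(tokens, t, rng):
    """Replace each token by MASK independently with probability t (the noise level)."""
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < t:
            noisy.append(MASK)
            targets.append(tok)   # the model is trained to recover this token
        else:
            noisy.append(tok)
            targets.append(None)  # unmasked positions carry no loss in this sketch
    return noisy, targets


rng = random.Random(0)
sentence = ["a", "teddy", "bear", "is", "reading"]
noisy, targets = mask_tokens(sentence, t=0.5, rng=rng)
print(noisy)
print(targets)
```

With t close to 0 almost nothing is masked, and with t close to 1 almost everything is, which mirrors low and high noise levels in the image setting.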
So here, a sequence full of mask tokens, and then what you would do is just predict what would be hiding behind those mask tokens all at once. But then you want to make sure that you have enough buffer for you to make corrections. So you may want to revise some tokens that you have predicted. So the field has several methods to do that, one of which is just remasking some tokens in a random way. Some other methods remask tokens with respect to some confidence scores. For instance, if you have a token that you’re unmasking that you’re not very sure of, then you may want to just do another try. So here you would take some tokens that, let’s say, you’re not sure of and remask them. And what you would do is take that new sequence and perform the same operation, meaning trying to see what is behind those mask tokens, and you do that again and again until obtaining your final output. So this is a very high-level view of how such a paradigm would work for text. Well, the good thing is there is very active research on it, and I was reading one of the papers: you can have speedups of up to 10x compared to the traditional autoregressive way of doing things, and it is particularly useful for tasks that require you to have a response as quickly as possible. I know a lot of us are using some form of coding agents to do our coding work these days, and I’m sure you also want your agent to finish as quickly as possible. Well, this paradigm would allow you to speed up that process significantly, and that is the reason why coding is a very good use case for that paradigm. So on the one hand for the speed, but on the other hand also because in coding you are naturally interested in fill-in-the-middle kinds of tasks, because if you look at your code base you’re rarely trying to generate things in a sequential way, in an autoregressive way. 
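The predict, remask, and repeat loop can be sketched as follows. The `toy_predict` function is a hypothetical stand-in for a diffusion LLM (it simply returns a fixed sentence with made-up confidences), and the linear remasking schedule is one assumed choice among several that are possible.

```python
MASK = "[MASK]"  # illustrative mask token


def diffusion_decode(length, predict, num_steps):
    """Start from an all-mask sequence and unmask it over num_steps rounds.

    predict(seq) must return one (token, confidence) pair per position.
    """
    seq = [MASK] * length
    for step in range(num_steps):
        guesses = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        for i in masked:
            seq[i] = guesses[i][0]  # fill every masked position at once
        # Linear schedule: how many positions should still be masked after this step.
        keep_masked = length * (num_steps - step - 1) // num_steps
        # Remask the least confident of the positions filled in this step.
        for i in sorted(masked, key=lambda i: guesses[i][1])[:keep_masked]:
            seq[i] = MASK
    return seq


def toy_predict(seq):
    """Hypothetical model: always guesses the same sentence, surer about early words."""
    target = ["a", "teddy", "bear", "is", "reading"]
    return [(target[i], 1.0 / (1 + i)) for i in range(len(seq))]


result = diffusion_decode(5, toy_predict, num_steps=3)
print(result)  # ['a', 'teddy', 'bear', 'is', 'reading']
```

Here five tokens are produced in three model calls instead of five, which is the O(diffusion steps) versus O(output tokens) trade described earlier. Note that this sketch only remasks tokens guessed in the current round; tokens kept in an earlier round are never revised, which is a simplification.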
Sometimes you have a bunch of code and you just want to generate code let’s say in the middle of two functions. So for fill in the middle tasks that paradigm is actually better compared to auto reggressive tasks. But and we will not have time to go into details but training is much more expensive. Uh particularly because in the traditional way the way training works is you input your whole input and then what you’re doing is you’re training your model to predict the next token and you can parallelize that very well. And this is unfortunately not something that you can do as well with the diffusion scheme. Then the second thing is there’s so many techniques that now are specific to how auto reggressive models work that you would need to adapt. And there are also ways to combine the benefits of the two approaches. And for instance, block diffusion is an example of it. So I just wanted to leave you with this screenshots that just tells you of the many headlines that you see around companies that are betting on this kind of technology. So you have uh company one such company called inception that’s actually founded by a professor at Stanford um that’s doing great and I will just leave you with some uh suggested readings in case you want and yeah it’s a great question. So the question is how do you handle the variable output length? It’s a great question. There are a few ways you can deal with that. But in practice you indeed need to set a given length for your output. And what you do is you stop whatever is after a special token called end of sentence token. Now you can imagine cases where you’re you’re expecting a way bigger output but you always have shorter output so you’re wasting a lot of computations. So that’s why like the last uh method I mentioned is something that can be useful. So block diffusion is one approach that’s being worked on where you’re actually generating text block by block. 
So you’re saying, okay, I’m going to generate text of size, let’s say, 100, and I’m going to use diffusion for that, and once you generate such an output, if you didn’t finish the output, then you will use that as a condition to perform another round of diffusion for the second block, and so on and so forth until hitting an end-of-sentence token. So you touched on a very good point, which is that there are some differences between text and image, and the variable length is one of them. Yeah. Yep.
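To make the block-by-block idea concrete, here is a minimal Python sketch of the outer loop only. `toy_denoise_block` is a hypothetical placeholder for a full diffusion pass over one block (here it simply copies the next tokens of a fixed reply and pads with the end-of-sentence token), and the block size and block limit are arbitrary assumptions.

```python
EOS = "<eos>"

def toy_denoise_block(context, block_size, reply):
    """Hypothetical stand-in for one round of diffusion over a block,
    conditioned on the tokens generated so far (`context`).
    It copies the next tokens of a fixed reply, then pads with EOS."""
    start = len(context)
    padded = reply + [EOS] * block_size
    return padded[start:start + block_size]

def generate_blockwise(reply, block_size=4, max_blocks=16):
    out = []
    blocks_run = 0
    for _ in range(max_blocks):
        block = toy_denoise_block(out, block_size, reply)
        blocks_run += 1
        if EOS in block:
            # Keep what precedes the end-of-sentence token and stop.
            out.extend(block[:block.index(EOS)])
            break
        out.extend(block)  # unfinished: this block conditions the next round
    return out, blocks_run

reply = "the cat sat on the mat and then fell asleep".split()  # 10 tokens
tokens, blocks_run = generate_blockwise(reply, block_size=4)
```

With a 10-token reply and blocks of 4, the loop stops after three blocks, so 12 positions are denoised in total; a single fixed window sized for the longest expected answer would have spent compute on every padded position.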
Yeah. Yeah. So the question is, can you apply this to other discrete things other than text? I’m sure yes. We will not cover this today, but the way people have derived a tractable mathematical way to do that in the text world is something that involves transcribing things for discrete items, so I’m sure that it can be done. So yeah, I would actually highly recommend the first paper for you to get a sense of how people go from continuous to discrete. Yeah. Great question. So the question is, can you not consider text as being images of text and use some kind of OCR mechanism to process that? So yeah, I think it’s definitely a great approach. There’s actually a paper that was published a few months ago, DeepSeek-OCR, that goes into that. I do think it’s promising. I believe there are some savings with respect to how many tokens that fills compared to how many tokens the actual text would fill. So yeah, it’s definitely another direction that I think is promising. Cool. With that, I’ll give it to Shervine. Thank you.
So welcome to the last stretch of CME 296. So we’re going to see together some closing thoughts on what lies ahead as well as some challenges that we see. So first let’s look at today. So when you look at existing top models, what is the typical price you have to pay for a perfect image? So I did the research for you. I looked at major labs across the board and it seems that the price per megapixel is about 10 cents. And I think that’s going to be a useful metric to track over time to see the transition from this being a nice thing to a commodity. And of course in reality you wouldn’t use these top models that I listed here but more of a distilled version. But still the upper bound gives a good view of what people would be ready to pay for perfect quality. But then let’s look at the kinds of challenges that could be tackled in the near term. So one thing that you notice is that images are doing well on their own, but there are wins in other modalities that you are inclined to include in order to make generation more powerful. So one example that I have in mind is reasoning with images. So right now, anecdotally, if you ask for a diagram, you might get with most models a projection of what you asked on an image, which is a nice start, but it’s not quite the same level of refinement as the one you might get in the text world, where you ask a very precise question and the model goes searching for the meaning behind it and then projects the exact answer in a concise way. So that seems like a good area of research. And then one topic that I’ve mentioned here. So for image editing use cases, it seems that we’re being too hard on ourselves. We’re doing the whole generation process, which is by nature unconstrained. And it seems we have an opportunity to use existing tools and wins in the text world, to use agents and existing human expertise, to make editing happen in a constrained and tractable way.
And then you have other use cases that could seem attractive to you all. For example, learning about a class, you have several streams of information that come from text, for example the slides, from the lecture video, and also from the audio track. So, while you can put the pieces together one by one today, it seems that there is an opportunity to bring all modalities together in a consistent and coherent way and generate a synthesis out of this. And in the longer run you can see the consequences of all these benefits across industries that might be lagging because of some inertia factor. So you can think of the field of robotics, which seems to be the next area that could benefit from a revolution. So in text, we can generate text now, like roughly solved; in images, now we see great images, roughly solved; and it seems we have an opportunity to feed all these learnings into the ingredients that could make robotics pick up, for example environments, but also medicine and other desk jobs that might need their own process of approvals to actually incorporate these wins. And maybe one day you won’t need to come to lecture at all and there will be some model gathering all the pieces for you in a coherent and pedagogic way. But I think we might still be far from that, because the fact of transmitting knowledge still requires that taste and opinion that even the text world doesn’t have. So I’m quite curious to see when that will happen. On the challenges side: as we have seen, there are costs, and you can use distillation to alleviate that a bit, but there is also research on the hardware side. Right now hardware is built around matrix multiplies, but as you have seen, the building block of today’s models is the transformer, which relies on the concept of attention, and this attention has a very precise succession of operations that you could think of simplifying in an analog-to-numeric way.
So this research paper could be one such pointer. But one other concern is the data quality side, because today you see all these generated images that are out in the wild. So for the next generation of models you might encounter the issue of not finding that true distribution P data as we had assumed we had. And you have some papers that study the phenomenon of so-called model collapse, where feeding back what has been generated looks like an echo chamber of mistakes that keep growing, and you see that at the end, if you represent embeddings of generations somewhere, you could clearly cluster the two, real and generated. But how could we deal with that? So that’s the kind of trust point which is key in our society. You want to be able to trust images. And these days, with how similar the true data distribution and generated images are, it seems like the border is becoming thin. So two ways to counter that. There is a standard called C2PA that has been gaining traction across software companies. So you might see this AI info label popping up. It’s because images that you create with AI have some sort of history attached to them that surfaces it. So that could be one way to tackle the data collection quality issue that I mentioned before. But the thing is, if you take a screenshot of this, the metadata disappears. So are we doomed? Actually no. There are other ways that we could employ to get around this. One example is watermarking, with the example of SynthID from Google DeepMind, which hides patterns in the pixels that reveal the image’s origin. So beyond that you also have the topic of safety, which is very important, and creating images that are harmful to others can have societal implications. So there are two parts to fixing this. One is on the model side, where each company has its own policies to guard against such generations, but also the law is catching up. And what can you do from there? So let’s say this class ends, how can you stay updated?
So one suggestion is to take a look at the relevant arXiv section on computer vision, but you might see that the number of submissions every day is in the hundreds. So this might not be tractable, but on the other hand some venues do a good job at distilling some of these works, not all, but they can give you an idea of where the field is heading. And something else that I want to tell you is, in all these papers we saw a bunch of formulas all over the place, and sometimes you just want to know how they work. And a great pattern that we see in academia and in industry is that there is a code release of what was used to design a given method. So I highly encourage cloning the GitHub repo and playing with an AI coding assistant, just having it guide you through the flow, and you will learn a ton just by doing that. And beyond that you have other resources. So on Twitter you have a growing community of people who talk about these topics. So if your For You page turns out to be in the clusters of these topics, you can get a lot of interesting recommendations. And another highlight is other Stanford courses that talk about vision, like CS231N, that I recommend. And there is this study guide that we shared at the beginning of the class that we aim at keeping up to date across the years. But beyond all that, Afshine and I wanted to say that we’re very happy to have come here every Friday to teach this class and spend time with you. It has been a pleasure for us, and thank you to the hidden heroes who came here in the evenings and asked so many great questions. You were one of the reasons why the class was so dynamic and generated insights that could then be useful to everyone. Thank you also to those watching online. Thank you for the great questions on Ed, and I hope the class was of value to you.
And if you are around in the fall, Afshine and I will be teaching a similar class but in the text world, which will be in its second edition, and if you are in for a new ride then we’d love to see you again. And our favorite teddy bear is coming one last time to wish you a great summer and all the best. Thank you.