[SPEAKER_00]: I want to apologize in advance: I'm attending only the first part of the session, so the first talk and the break. I do apologize that I'm not going to be around for the full day. And I actually stole the title from the BMVA day's programme. I think the session after this speaker's is called 3D Video and Multimodal Understanding, and I said, OK, that's a perfect title; that's exactly what I'm going to tell you about.

[SPEAKER_00]: I will take you through five recent papers that we published in the past couple of years: on unique captioning, on visual instructions, on object tracking out of the field of view, which we call out of sight, not out of mind, and on hand tracking, also out of the field of view.

[SPEAKER_00]: And I'll end with another kitchen-based dataset we've published, called HD-EPIC, to structure it around what you're interested in here. So I'll be talking about video and language in the first two works, video and 3D in the second two, and then I'll bring these together, video and language and 3D, when talking about the HD-EPIC dataset.

[SPEAKER_00]: Good. Let's start with the first one, which is unique captioning. This is a paper we presented at the last ACCV, where it won the best paper award. Of course, as usual with these works, it was rejected the first time around, resubmitted to the next conference, and then won best paper. I think it's good to have strong opinions about your work, whatever they are. The work is based on the idea that life is very repetitive.

[SPEAKER_00]: If we record video, and these are some of our ego videos, you see people doing stuff over time, and typically they're doing the same thing on and on and on. This is a construction worker, part of the Ego4D dataset, and they keep doing the basic construction work they would do: they go up the ladder, do something, come down the ladder, get some stuff, and keep repeating it.

[SPEAKER_00]: So if you were to use the current magic of VLMs and caption each clip independently, it will keep telling you: climbing down the ladder, climbing down the ladder, climbing down the ladder. These models typically generate the same caption for clips that are visually not exactly the same, but very similar. And that's not really helpful: the captions are going to be the same, and you're going to have 1,000 captions saying, let's say, going down the ladder.

[SPEAKER_00]: Really, is that the best way of captioning? Can we generate a unique caption for every clip in the video that is different from all other examples of this repetitive action, whether in a long video or in a set? Practically, we want to take this space, which typically maps different visual inputs to the same output in the text space, and separate it, so that we have a one-to-one mapping between the visual space and the text space.

[SPEAKER_00]: Sometimes this is fairly easy. You can look at these three examples and say: OK, in one of them the person is actually holding a screwdriver, so out of these three I can produce the unique caption: the person is climbing the ladder while holding the screwdriver. Sometimes, typically due to limitations of current captioning models, this is impossible, as in the captioner cannot find anything unique.

[SPEAKER_00]: And in this case, we go a little bit further in the video and describe what happens next, right?
So you're going down the ladder and then doing X, or then doing Y. Eventually, we will have this one-to-one mapping between every clip in the video and a corresponding caption. How can you do that? And of course, we don't have a large dataset to train a new captioner.

[SPEAKER_00]: So we want to make the most of what is out there. This is the current regime: you have a video, you put it through some captioner magic, and in an autoregressive manner it produces a sentence describing what's happening in the video. What we are after is a prompting technique, which we call a discriminative prompt: something we can add, in addition to the video, to describe this clip in a unique manner.

[SPEAKER_00]: And for it to be discriminative, we need to make it different from all other examples. For example, if you have two examples and a bank of possible prompts, some of these prompts will not be unique: you can add them, but they will not produce an addition that makes this clip different from the others. Other prompts will be unique but might not be ideal; they might not be the ones you really want for describing something different from the others, especially in terms of actions. And maybe this is the one you want:

[SPEAKER_00]: what the person is actually holding as they perform the action. So what we propose is very simple. Can we have a bank of discriminative prompts, add them to the clips, consider the clips jointly, and eventually produce a unique caption for every clip?

[SPEAKER_00]: In principle, this is easy to do but very expensive: you would need to take every clip with every possible discriminative prompt and keep captioning until you find this set of unique ones. So while trivial, the question is how you can do it without that huge expense. And that's where you can introduce a bit of training magic.

[SPEAKER_00]: We train a network that takes a set of clips and decides, for each of them, which prompt from the bank is likely to produce a discrimination.

[SPEAKER_00]: So instead of doing it iteratively, we train a model on a large set of training data, again from Ego4D, and we ask: if you take this clip alongside those other ones, which of our discriminative prompts is likely to produce an eventual unique caption? And as we said, it's not always possible for the captioner to find something. The same network should then say: for this particular example, I can't find a prompt; go further in the video.

[SPEAKER_00]: And that's the whole idea. If you start with a captioner trained for egocentric video, like LaViLa, and then you add this discriminative prompting, then without advancing in the video at all, you increase the ability to get this one-to-one correspondence from 37% to 45%. And if you then allow the network to advance in time, interestingly, the standard captioner doesn't benefit, but CDP benefits hugely, going all the way to uniquely discriminating 76% of the data. So you basically have a one-to-one mapping between your video and the captions.

[SPEAKER_00]: Let's go back to our example. That's the output. Typically we feed ten examples, but this is just a visualization. In this case, the network has decided the first clip is about holding, so we add this "hold" discriminative prompt, and the captioner fills it in. For the second one, we add a "picking" prompt; for the third one, a "holding" prompt. As a result, we have these three unique captions for these clips.
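As a rough illustration of the select-then-caption idea described above, here is a minimal Python sketch. All names are hypothetical stand-ins, not the paper's actual implementation: `captioner(clip, prompt)` is any off-the-shelf captioner that accepts an extra text prompt, and `selector` plays the role of the trained network that picks a discriminative prompt or asks to advance in the video.

```python
# Minimal sketch of captioning by discriminative prompting (CDP), as described
# in the talk. `selector(clip, others, bank)` stands in for the trained network
# that predicts which prompt from the bank is likely to yield a unique caption,
# or ADVANCE when no prompt can discriminate this clip from the others.

ADVANCE = "<advance>"

def unique_captions(clips, bank, captioner, selector, max_advances=3):
    captions = []
    for i, clip in enumerate(clips):
        others = clips[:i] + clips[i + 1:]
        for _ in range(max_advances):
            prompt = selector(clip, others, bank)
            if prompt != ADVANCE:
                break
            clip = clip.extended()  # hypothetical: append the next chunk of video
        else:
            prompt = None  # give up; fall back to a plain, possibly non-unique caption
        captions.append(captioner(clip, prompt=prompt))
    return captions

# The brute-force alternative would caption every (clip, prompt) pair and keep
# the first prompt whose caption differs from all others: correct, but it costs
# on the order of len(clips) * len(bank) captioner calls, which is exactly the
# expense the trained selector avoids.
```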
This is a harder one,

[SPEAKER_00]: where, you know, the person is just looking around some space and some shelves. So the standard caption is "looking around the shelves", while CDPNet actually adds additional detail, the last caption even being able to say what the person is going to pick after looking around. I'm giving you a tour of a number of papers, so I'm not going to say more; the details are in the paper, and hopefully you got the idea. I'll jump into the second paper, presented at CVPR,

[SPEAKER_00]: called ShowHowTo: generating scene-conditioned step-by-step visual instructions. The starting point is that you have an input image and a set of steps that you want to perform. Imagine yourself in your kitchen: you have a set of ingredients, and you would like to know how to perform a recipe. What if we could generate a set of images that are based on the ingredients and the equipment that you actually have in your kitchen,

[SPEAKER_00]: as opposed to based on some magic that assumes you have tools that you don't actually have? Every image you see here with this little symbol is a generated image; everything without the symbol is a real image. So we're conditioning the generation on this real image and on the set of steps. And as you can see,

[SPEAKER_00]: the outcome actually uses the same set of tools and the same ingredients you have in your kitchen, and you can see it's the same person doing the action. This is evident, of course, if we try something else. What if I give you a different recipe starting from the same input image? Again, you produce a different set of instructions. And what if I start from a different starting point?

[SPEAKER_00]: Let's see someone doing the same task outdoors. You can see that all the generations change. The network still produces new things that you might not have: for example, there is no grill here, but it adds a grill because you definitely need something, and it tries to do so in a reasonable manner. An outdoor grill is probably different from the indoor pan that you might have.

[SPEAKER_00]: Of course, like every bit of magic we're doing at the moment, to solve this you need a combination of a method and a dataset. Where would you get a dataset from? You need a dataset where images are actually consistent over time, and because we are video experts, that's where we get these consistent images from: video.

[SPEAKER_00]: We start with HowTo100M, which is this huge dataset of instructional videos. It's a large dataset which we filter anew, because the LLMs are improving: we run a new speech transcription, we filter on whether the video is instructional, and we ask the LLM to give us the steps and where the steps are in the video.

[SPEAKER_00]: Then, within each step, we find the most discriminative keyframe for that step, producing as a result 578,000 sequences, 4.5 million steps, that are sentences associated with keyframes. The outcome looks like this: coming out of a video, you have a sentence and keyframes, across a diverse set of cooking and non-cooking tasks. So that's the dataset.

[SPEAKER_00]: For the method, we start from pre-trained video models, because they have this notion of temporal correspondence. They are typically conditioned on an image and text, and they have temporal layers, so they basically share weights across time. Different from standard video generation, we have different conditioning per frame, because every frame is a new step in your recipe. And we also change the model so that it handles variable sequence lengths, because we do have different numbers of steps across sequences.
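Here is a shape-level sketch of that per-frame conditioning change, under the assumption of a generic image-and-text-conditioned video diffusion backbone; `backbone` and its signature are hypothetical, not the released ShowHowTo code.

```python
# A standard video diffusion model broadcasts ONE text embedding to all frames;
# here each generated frame is a different instruction step, so we pass one text
# embedding per frame, and the number of steps N varies per sample.

import torch

def denoise_step(backbone, noisy_frames, t, scene_image, step_texts):
    # noisy_frames: (B, N, C, H, W)  N = number of steps, variable across samples
    # scene_image:  (B, C, H, W)     the real input image of the user's kitchen
    # step_texts:   (B, N, D)        one text embedding PER frame, instead of the
    #                                usual single (B, D) prompt shared by all frames
    B, N = noisy_frames.shape[:2]
    scene = scene_image.unsqueeze(1).expand(-1, N, -1, -1, -1)
    x = torch.cat([scene, noisy_frames], dim=2)  # condition every frame on the scene
    # temporal layers inside the backbone share weights across the N positions
    return backbone(x, t, text=step_texts)
```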
[SPEAKER_00]: The outcome really works well. If you compare this to other methods on the multiple things that matter here (are we generating the right step, are we generating it in the right scene, and are we faithful to the task?), we consistently outperform others. More importantly, if you ask humans whether they prefer these generations over others, ours are overwhelmingly preferred, at times even over the original videos,

[SPEAKER_00]: because sometimes the videos don't have a very good viewpoint, and what we generate is a clearer viewpoint of what the person is doing. These are some of the outcomes. What's really important to see is the consistency:

[SPEAKER_00]: you keep the same consistent tools that you see in the first image. Also note that this works outside cooking; most prior work was really restricted to cooking-based tasks because of the amount of data out there. These are more examples, with diverse tasks and also diverse lengths of step sequences.

[SPEAKER_00]: And this is a long example. You can produce really long sequences, and it's really nice that after so many steps, the model can bring back the same bowl that you've seen in the first image. Of course, there are limitations to these methods. Failure cases tend to be things that are very hard to generate, like parts of an engine; I don't think these generative models know how to do that. And the most interesting one is this: the model really doesn't maintain object states. It goes from raw to cooked, and then when you say, now cut the cubes, it kind of goes back to a raw

[SPEAKER_00]: version of the chicken, because it's more common that you cut chicken when it's raw than when it's not. So there is still a lot to do, particularly for people interested in the state understanding of objects. I do speak fast, so I am going to overwhelm you, but hopefully one of these things will be interesting. Now I'm moving from video and language to video plus 3D. For those of you familiar with EPIC-KITCHENS, you know that this dataset has been around for a while. OK, this is not good news:

[SPEAKER_00]: the video is looping. OK. Let me see if this will work again.

[SPEAKER_00]: No, it doesn't like it, so I'm going to skip this first video. It just shows that we have managed, after some iterations, to lift the cameras out of EPIC-KITCHENS, so we have good camera estimates for our original videos. And in this work, which we call Out of Sight, Not Out of Mind, our idea is to try to understand the videos not only within the field of view, but across the whole scene.

[SPEAKER_00]: This visualization shows the person starting, and it produces this small neon marker that says there is an object in motion. We don't know the objects in advance. What we want is to track these objects in 3D, so that once you put an object down, which typically you would do, and in ego there is a lot of head motion as you can imagine,

[SPEAKER_00]: the system knows that the object is still there. So even if you turn around, the system still knows the object is there: out of sight is not out of mind. This is very different from current tracking approaches, where once something goes out of the field of view, the assumption is that you forget it and re-ID it when it comes back. But that's not how humans operate. If you look over there, you're not going to ignore this and re-ID it; you know where the objects are in space, and you can even anticipate their motion.

[SPEAKER_00]: So in this work, we lift these objects from 2D to 3D through recent approaches to depth estimation. We estimate the dynamic objects, because we already build the scene: the static scene is standard SLAM, and now we want to put the dynamic objects into it.

[SPEAKER_00]: Current metric depth estimators still don't work really well, so we look at the scene and register the image against it, allowing us to better map the current single-frame depth estimates to the actual depth of these objects, and thus lift the objects to 3D. We do not have ground-truth 3D objects, because these are natural environments,

[SPEAKER_00]: but we have tested how good this is by looking at multiple examples where we see objects from different views, showing that our error is typically within 3 centimeters. For those interested in NeRF and exact pixels, that's not good enough; for us it is, because we're really interested in where objects are, and knowing where an object is to within 3 centimeters is perfectly OK for these applications.

[SPEAKER_00]: So we lift objects to 3D, we match them over time using a combination of 2D and 3D information, and we keep them in memory, allowing us to answer questions like: is this object in view? Is it in view but occluded, as in it's there but I can't see it, it's behind something? Is it within reach but out of my field of view, so I can stretch my hand and grab it, or do I need to walk to get it? These are questions we couldn't answer before.
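These queries map naturally onto a small object memory. Below is a hedged, hypothetical simplification of the idea: it assumes objects have already been lifted to world coordinates, and `camera` and `scene` are stand-in helpers for the estimated camera pose and the reconstructed static scene, not the paper's actual code.

```python
# Sketch of the 3D object memory and the queries it makes answerable.

from dataclasses import dataclass
import numpy as np

@dataclass
class TrackedObject:
    track_id: int
    position: np.ndarray  # (3,) last known location in world coordinates

def status(obj, camera, scene, reach_radius=0.8):
    """Classify an object as in view, occluded, within reach, or out of reach."""
    uv, depth = camera.project(obj.position)  # hypothetical projection helper
    if depth > 0 and camera.in_frame(uv):
        # in the frustum: occluded if the scene surface sits in front of it
        return "occluded" if scene.depth_at(uv) < depth else "in view"
    dist = np.linalg.norm(obj.position - camera.center())
    return "within reach" if dist < reach_radius else "out of reach"
```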
[SPEAKER_00]: Building on this idea of what's happening outside the field of view, we wanted to think about hands outside the field of view.

[SPEAKER_00]: For those of you not familiar with this, we now have lots of ways to lift these egocentric views. What you see is just the ego view, and on the left the scene is being reconstructed. We can now even estimate the full body. This is work from Berkeley presented at CVPR, called EgoAllo, that allows you, through training, to estimate the full pose of the person outside the camera's field of view. Great.

[SPEAKER_00]: We wanted to use this to anticipate hands and where hands are, particularly because, as you can see in the examples, hands in ego video tend to be outside the field of view very often. As opposed to previous hand tracking and forecasting works, which only start estimating the hand's information when it's in view, we want to use this knowledge of the body to estimate where the hand is, conditioned on the body, when it's outside the field of view; hence the title, The Invisible EgoHand.

[SPEAKER_00]: The methods here are well-known methods for estimating trajectories, so it's diffusion-based. We start with noise, and we condition the observation period on the camera pose, on whether the hand is in view or not (by detecting hands), and on image features, because knowing what the person is doing, just the activity, helps you know where the hands are. The future steps are just noise.

[SPEAKER_00]: Then you denoise all these steps, both the observation and the future. We observe for two seconds and forecast for one second, in line with prior work. The output is that we estimate all the joints, the body joints and the hand joints, hence estimating the hand on the body. If the hand is in view, we project it and check that it still matches the in-view observation, and we also check that you can still estimate the visual understanding of the scene. These are additional losses that consistently improve the results.
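A minimal sketch of that denoising setup follows, with hypothetical shapes and API rather than the released code; the point is that observation and future are denoised jointly under the stated conditioning.

```python
# The model denoises the WHOLE window: two seconds of observation plus one
# second of future, conditioned on camera poses, per-frame hand visibility with
# any detected 2D hands, and image features.

import torch

def forecast(model, cam_poses, hands_2d, img_feats, n_obs, n_future, n_steps=50):
    B = cam_poses.shape[0]
    J = 52  # illustrative joint count: body plus both hands
    x = torch.randn(B, n_obs + n_future, J, 3)  # start from noise everywhere
    for t in reversed(range(n_steps)):
        # even "observed" frames are denoised rather than given as clean joints,
        # since the hand may be out of view during the observation window too
        eps = model(x, t, cam=cam_poses, hands=hands_2d, feats=img_feats)
        x = x - eps  # schematic update; a real sampler uses a noise schedule
    return x[:, n_obs:]  # the forecast second

# Training adds the auxiliary losses mentioned in the talk: when a hand is in
# view, its predicted 3D joints are projected through the camera and penalized
# against the 2D observation, which consistently improves results.
```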
[SPEAKER_00]: So we have this new benchmark that allows you to do hand trajectory forecasting, but also hand pose forecasting, for all the frames: practically, for the frames where the hands are in view and, importantly, for the first time, for the frames where the hands are out of view. In this work we use Ego-Exo4D, a dataset captured with ego and exo cameras, so we know where the hand is even when it's out of the ego field of view.

[SPEAKER_00]: Masashi, who led this work, created a very nice ablation, which you can check to see the value of every step in this pipeline. But visualizations are always better. This is the camera estimate, and this is where the hand is, even when completely out of the field of view. You can see the ground truth versus the prediction. When the hand is in view, we can also show you the pose, so you can see the performance of the pose estimate during observation and forecasting.

[SPEAKER_00]: OK. I told you I'd tell you about video and language, then video and 3D, and now I'll take you through this big exercise we've done called HD-EPIC. This is a huge collaboration of a large number of people who have contributed to EPIC throughout the years and are now at their own universities, and more. We really wanted a highly detailed video dataset, and our trigger points were multiple: one is doing 3D, and the other is our annoyance that people think VLMs work, and we wanted to prove them wrong.

[SPEAKER_00]: So we gave people this very nice backpack. Like in everything we've done, we want people to record natural interactions. They take it home, they open it, and they find a pair of Aria glasses, a multi-sensor research device, along with instructions on how to record and a power bank.

[SPEAKER_00]: And we wish them luck. We tell them: could you please start recording every time you walk into the kitchen, and stop recording when you walk out, for three days; don't change what you're doing otherwise. We do work with a subset of people, those without kids, et cetera, for consent reasons, but we don't want them to change anything they do. So they choose when to record, eventually giving us data that looks like this. There's something wrong with the videos again, but: every time they walk into the kitchen, a video starts.

[SPEAKER_00]: I think this happened to me once before; it's something to do with HDMI compatibility, but hopefully we can get this working. We only worked with nine people. It's a very detailed validation dataset, yet it still spans 41 hours across the nine participants. And these participants worked with us for a really long time: on average, they committed 50 hours each, because they not only collected the data, but also reviewed the footage and gave us further information about what they'd done.

[SPEAKER_00]: For example, they gave us the recipes that they cooked.
We didn't tell them to cook anything specific, but after they finished, we asked them what they actually cooked. We then went and labeled, across the videos, when they performed every step, and also which additional actions were relevant to that step. They also weighed every ingredient on screen, and we have an annotation of when they added it to every recipe.

[SPEAKER_00]: This prep-step annotation is really something new and novel. For example, if your step is chopping a tomato, there are relevant prep actions: retrieving the tomato from storage, washing it, getting the knife, getting the chopping board. Typically these are not labeled. We do have these labels, so you know all the actions people took to perform a step.

[SPEAKER_00]: For example, for the first step, which is cooking the pasta, the person starts seven minutes earlier by picking up the kettle to fill it with water. And we have this connection labeled, interestingly.

[SPEAKER_00]: Every ingredient has been weighed, most of them on screen. We really wanted to prove a point here: everybody is doing ego, and all these companies promise things like, you'll be doing this and we will tell you the weight. We wanted, again, to prove that this is not currently possible. So things are weighed on screen, and people add the stuff, so we know exactly when an ingredient is added to the recipe, allowing us to check

[SPEAKER_00]: the progressive ingredient information of a recipe over time: as you add ingredients, we know what the recipe contains. This is very different from the recipe videos we took from HowTo100M, which are one video, one recipe. In HD-EPIC you usually have multiple recipes per video, but also the same recipe can span multiple videos, sometimes multiple days, if you marinate the day before and then cook. So it's a very different way of looking at recipes and cooking.

[SPEAKER_00]: So that's the first layer. We call it HD because it's highly detailed, not high definition. The second layer is narrations. We describe these videos in very much detail, along with start and end times, and we parse the narrations in a way that tells you what the action is, what the object is, which hand is used, and why the person is performing the step. We also label all the audio associated with the dataset. So if you're familiar with narrations... OK, I really need this one to play; if not, I will describe it.

[SPEAKER_00]: This is not good. Let me see if I can play it like this. No, it doesn't like it either. Pity.

[SPEAKER_00]: Okay, I'll refer you to the paper to see how detailed it is. We actually have a long paragraph per action, saying you're doing this, in this way, because I'm doing that. So this is very highly detailed information about how you do the action. Okay, it is stuck.

[SPEAKER_00]: The third layer is the 3D, which is what we really wanted to do. Every object is annotated over time. We track the objects first in 2D over time; we worked with a great annotation company so that every time an object, small or large, is moved, there is a small trajectory with a start bounding box and an end bounding box.

[SPEAKER_00]: That means we have all movements of objects, which are then connected through annotations. And these annotations are very efficient, because for every new track we have all previous tracks: whenever you move an object, we compare it to all objects annotated before, and a human then only has to check that this is indeed the same object, allowing us to track objects across the long video. And then we lift these objects to 3D.
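A sketch of how such efficient track association could look, with hypothetical structures rather than the actual HD-EPIC annotation tooling: each movement is a short track, and a new track is matched against previously confirmed objects, with a human verifying the proposal.

```python
# Each annotated movement is a time span plus start and end bounding boxes
# (with manual masks). Association only proposes candidates; identity is
# confirmed by a human, which is what makes long-video object identity
# affordable to annotate.

from dataclasses import dataclass

@dataclass
class MoveTrack:
    t_start: float
    t_end: float
    bbox_start: tuple  # (x, y, w, h) where the object is picked up
    bbox_end: tuple    # (x, y, w, h) where it is put down

def propose_identity(new_track, known_objects, similarity, threshold=0.5):
    """Suggest the most similar previously seen object for human verification."""
    best_id, best_sim = None, threshold
    for obj_id, tracks in known_objects.items():  # {object_id: [MoveTrack, ...]}
        sim = max(similarity(new_track, t) for t in tracks)
        if sim > best_sim:
            best_id, best_sim = obj_id, sim
    return best_id  # None means: propose a brand-new object, also human-checked
```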
[SPEAKER_00]: OK? We also take all these kitchens and annotate, from the point cloud, all the kitchen fixtures. We worked with a Blender artist who annotated all the drawers and cupboards and everything in the kitchens, allowing us to have a digital twin of the environment. So we have these types of information: every cupboard, drawer, and object is annotated from beginning to end.

[SPEAKER_00]: I told you that we wanted to prove VLMs don't work, so we first annotated the dataset in detail, and then we created a corresponding VQA. We took our annotations and started asking questions about recipes, ingredients, fine-grained actions, and also 3D perception and object movement.

[SPEAKER_00]: And the outcome, of course, is not very favorable to VLMs. The more advanced ones, like Gemini, know a little bit about recipes, but the moment you ask them about object motion, they are close to random; they're at about 20%.

[SPEAKER_00]: There are interesting questions like: which of these ingredients has more fat? These are designed so that if you don't understand what you're looking at, you can be cheated: it's a very small piece of butter but a lot of pine nuts, so the answer is indeed the pine nuts, if you look at the quantities. There are questions about how an action was done, where all the candidate answers are about washing cutting boards,

[SPEAKER_00]: but it's exactly this particular way of washing the cutting board that you're interested in. OK, this is playing. So you really need to select the answer where the person is rotating the chopping board to wash the back side, and current VLMs don't know this distinction. Or questions like: how exactly did you pick up this bowl? In this case, you're removing the fork from it first. Again, these VLMs cannot see these details; all the candidate answers are true examples of someone picking up a bowl.

[SPEAKER_00]: Or questions like: how many times did I open this specific cupboard? Throughout this 40-minute video, it's opened five times, and again, models cannot do that.

[SPEAKER_00]: If you're interested in HD-EPIC, have a look; all the annotations have been available since the summer. And I will stop on this: there is a way for you to explore it. In collaboration with VGG, there is an interface that lets you search the dataset and see what it contains. You can ask about actions and objects and see what the dataset contains. It's a very nice, fun interface for looking through HD-EPIC.

[SPEAKER_00]: Okay, so I'm done. I told you about unique captioning, visual instructions, my recent interest in 3D, and how we can bring video, language, and 3D together. I get lots of invites to present this work, but it is the work of a large team to whom I'm very grateful. Everybody in Bristol is working on something related to video, language, and 3D understanding, so it's a huge collaborative effort. I'm happy to take questions.

[SPEAKER_00]: So when you prompt the VLMs with videos, how do you do the sampling? What's the strategy? Do you give all the frames? Some of them natively accept video, right?
So currently, if you use off-the-shelf APIs, they go with one FPS. We go with as many frames as the context allows. In the case of the unique captioning videos, these are short clips, so we're actually feeding almost all the frames

[SPEAKER_00]: to the VLM. But for the HD-EPIC VQA questions, and I think this was one of our struggles, we had to really adapt depending on the question. If you're asking a question about 40 minutes of video, with current models you cannot do more than one FPS; but if you're asking about something fine-grained, we go with as many frames as possible rather than one FPS. So you select that particular interval? Exactly. You select the... You adjust to maximize the context you have.

[SPEAKER_00]: And do you resize the images anyway? You have to, I guess. They all resize and batch in the same manner, which is indeed one of the problems, actually. I think we'll go for it. Is that the question? Yes.

[SPEAKER_05]: Yeah, hi, thanks a lot for the talk. The question that I have is: with the videos that you're giving, you're also giving... wait, I'm assuming you're giving a lot of detailed narrations and detailed captions with them? Where, when we do the VQA? Yeah. No, no, we just give the video. No, no, when you're training them. During training, they're trained with some caption or some text.

[SPEAKER_00]: OK, so the VQA on HD-EPIC takes off-the-shelf Gemini, without training it in any way, and asks it these questions. The dataset comes with detailed narrations, but they have not been used to fine-tune any model.

[SPEAKER_05]: Okay. The reason I ask is that, in my short experience, I've seen they do tend to align more towards the text; they try to find answers in the text instead of the frames and images. Well, I'm going to point you to Sam's talk a little later; he'll tell you that this is indeed the case. Video and images are underexplored. Okay, thank you so much.

Yeah. Thank you for the talk. One of my questions is:

[SPEAKER_00]: in part of your presentation, you showed joints of hands. Yes. I was wondering if there is any dataset available for that in particular? Yeah, not many, but the Ego-Exo4D dataset has been released with joints. It has been labeled with manual and automatic joints; we only use the manual joints for evaluation, but also the automatic ones for training, because they have multiple cameras, so they interpolate. Only for hands? All the body: they have full body and hands.

[SPEAKER_00]: There is also the Nymeria dataset you might want to look at if you're interested in body and hand joints in ego. Yeah, regarding the hands, the left hand and the right hand, are they separated? I believe they are, yes, of course. On HD-EPIC, we also have masks of hands, but we don't have joints.

[SPEAKER_01]: Yeah, actually, I've worked with HD-EPIC; it's quite a cool dataset, thank you for providing it. But regarding the key objects in the scene, for each narration I was only able to find the bounding box of the object in one frame. I was wondering if I did something wrong. So in HD-EPIC, what we've done is: you have a start and an end time of the object moving, so we have a duration in which the object is in motion.

[SPEAKER_00]: Right? And then we have a bounding box at the beginning and a bounding box at the end. Oh, so not in between? Not in between. You can train something like SAM to densify that, but it's very expensive to do it continuously. But remember that you have the duration: you do know that this object is in motion at this time. So indeed, we have the starting bounding box with a manual mask and the ending bounding box with a manual mask; we don't have the dense ones. Yeah, thank you. Thank you.
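A hedged sketch of the "densify it yourself" suggestion in that answer: HD-EPIC provides the motion interval plus a manual mask at each end, and a promptable video segmentation model in the spirit of SAM could fill in the frames in between. `propagate` and `iou` are hypothetical helpers, not a real SAM API.

```python
# Densify a movement annotation from its two manually annotated endpoints.

def densify(video, track, propagate, iou):
    frames = video.between(track.t_start, track.t_end)  # object in motion here
    fwd = propagate(frames, seed=track.mask_start)            # forward from pickup
    bwd = propagate(frames[::-1], seed=track.mask_end)[::-1]  # backward from put-down
    # prefer the forward estimate where the two passes agree; otherwise fall
    # back to the backward one (illustrative reconciliation rule)
    return [f if iou(f, b) >= 0.5 else b for f, b in zip(fwd, bwd)]
```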
There was a question there.

[SPEAKER_00]: In your view, what's the biggest challenge that limits your research currently? Challenge... Nothing works in video, so plenty of opportunities. There are huge opportunities in video. What we're showing is that there is a lot of research in images, of course, and we now know that VLMs are doing very well with objects, but we also know they're not doing very well with actions.

[SPEAKER_00]: Actually, actions are poorly defined; there are a lot of questions about what an action even is. VLMs don't do very well at forecasting or at understanding what a person's intent is. Also, this idea of understanding the full 3D perspective is, for me, a huge opportunity. There are things around

[SPEAKER_00]: pose, hand-object interaction, and mapping this to robotics that are big at the moment. But none of these are solved problems. What is the set of solved problems? Captioning a nice image, I think, is nearly solved, right? But the moment you move to video, I don't think anyone would say anything is solved. So you have free rein over the problem you want to pick and try to act on it.

[SPEAKER_03]: You mentioned the problems; I'm asking about the things that limit your research. Is it the algorithms? Is it the computing resources? Or is it the datasets?

[SPEAKER_00]: It's more data. Yeah, it's data and benchmarks that are limiting the research. Most of our benchmarks at the moment are very easy to solve, so people think things are working, but we know they're not. That's what makes the goal unclear. But also, video data is much scarcer than image data. People collect videos for specific reasons:

[SPEAKER_00]: there are lots of talking-head videos, lots of movies and TV, but there aren't videos of someone cleaning, because why would anyone put that online? So videos are a very biased representation of what we want. Images have the same issue, but it's a bigger problem in video. So: benchmarks and data. And I think once we get those, we will be able to advance further. Thank you so much.

[SPEAKER_00]: The chair should tell me if there's more time for questions. Yes, go ahead. Oh, thank you. Thanks for the talk. I was wondering: have you tried fine-tuning any VLM on HD-EPIC, or not yet? So, we have designed HD-EPIC to be a validation dataset, and we don't want people to fine-tune on it, otherwise all the effort to collect the detailed annotations will go to waste. So we would hope that people are just not overfitting to it.

[SPEAKER_00]: I would also say: it's 41 hours, so what do you fine-tune? Have people fine-tuned VLMs and made things any better? I am not aware of that; I think they would overfit to a domain. But I think

[SPEAKER_00]: a huge amount of effort and time is being spent in these big companies on post-training that keeps improving things one at a time.
But yeah, I don't think you can just fine-tune and solve everything at the moment. No, definitely you're not going to solve everything. But we didn't want people to fine-tune on HD-EPIC. I think that was the last question. There are too many questions, so you have to tell me.

[SPEAKER_04]: Yeah, two. There are two, here and here. Go ahead. Thank you for the nice talk. There was the Out of Sight work, where you were doing something structure-from-motion style: making a point cloud, keeping it in memory, and so on. Yes. And then you were doing question answering. My question is: how do you condition your model on that point cloud? How do you extract features, or how do you condition it? What's the method there?

[SPEAKER_00]: No good answer at the moment. In line with all our previous work, we really publish as soon as we have something, and we are working on methods to do exactly that. So I would say: by the end of the year, keep an eye on what we put out. We're proposing some ideas, but I'm hoping others are proposing ideas too. At the moment, there isn't a clear way to ingest 3D information into VLMs, and they're not good at it. Humans are, but VLMs are not. Thank you.

[SPEAKER_02]: Thank you. I was thinking about the first paper you presented. If I understood it correctly, the problem was that if you try to map a clip to a caption,

[SPEAKER_02]: it's non-unique. So you then build another model which maps a clip to a prompt, and then the prompt plus the clip to a unique caption. So ultimately you've got the target, which is a unique caption. But isn't the caption

[SPEAKER_02]: problem-specific? If I had an image of a cloud, then a weather forecaster might want one kind of caption, whereas a defense contractor might want a very different caption, saying no target in sight. So how do you add the objective function of the caption into the process?

[SPEAKER_00]: At the moment, you can easily add it through your choice of discriminative prompts: what is the set of prompts you can choose from? Currently, we built this set from n-grams of our captions, saying these are the things one would want to know because they are present in our captions. But the easy trick is to adapt the pool of prompts you can select from for every application,

[SPEAKER_00]: as a starting point. Beyond that, you could explore changing the architecture itself, adding more information about intent to change what is added to the caption. Thank you very much. I am here until the end of the coffee break, and I do apologize that I have to leave after that. Thank you.

[SPEAKER_04]: OK, let's resume. Thanks again for the excellent talk and discussion. We have time for one more session before lunch, and I think it connects quite nicely to some of the questions that already came up around evaluation, data, and what we are actually measuring in long-form video understanding.

[SPEAKER_04]: So our next speaker is going to talk about benchmarking and evaluation for long-horizon multimodal reasoning, especially in settings where models need to combine temporal evidence rather than answer from a single salient frame.

[SPEAKER_06]: Thank you. I'll try to keep this fairly focused. The motivation is simple.
We have made a lot of progress on image-based evaluation, and we also have progress on short video question answering, but we still don't have a very satisfying way of evaluating whether a model actually reasons over long video.

[SPEAKER_06]: What happens in practice is that many benchmarks can be partially solved by shortcuts. Sometimes the shortcut is language priors. Sometimes it is a single key frame. Sometimes it is a very local cue near the end of the clip. And then we conclude that a model understands video, when really it has just become very good at exploiting the benchmark format.

[SPEAKER_06]: So the first question we asked was: what would a benchmark look like if we explicitly wanted to penalize shortcut behavior? In our case, we focused on tasks where the evidence is distributed over time. That means you cannot answer correctly unless you connect multiple moments in the clip.

[SPEAKER_06]: The easiest toy example is something like: where was the object before it ended up here? But in real data it is often more subtle. It could be whether a step was completed before another step, whether a tool was reused later, whether an ingredient was added and then removed, or whether an action failed and had to be repeated.

[SPEAKER_06]: We built the benchmark around these temporal dependencies rather than around nouns. There are still objects, actions, and scenes, but the annotation target is really the relation between events.

[SPEAKER_06]: One thing we learned quickly is that annotation becomes difficult the moment you move from local recognition to temporal structure. People are usually comfortable labeling what is visible now. They are much less consistent when labeling what matters over the next three minutes.

[SPEAKER_06]: So a large part of the project was not just collecting questions, but defining annotation protocols that reduce ambiguity. For example, what counts as the start of a step? What counts as a failed attempt? If someone reaches for a utensil and then changes their mind, is that part of the plan or just incidental motion?

[SPEAKER_06]: These decisions affect everything downstream. If the definitions are vague, the model looks noisy but the benchmark is actually the noisy part.

[SPEAKER_06]: We ended up with three families of questions. The first is event linking: connect two moments that belong to the same underlying process. The second is state tracking: maintain what changed, what remained available, and what is still pending. The third is counterfactual or contrastive understanding: why this answer rather than another plausible one that appears visually similar.

[SPEAKER_06]: For evaluation, we did not want a pure multiple-choice setup because models can guess and exploit answer style. But free-form generation also makes scoring very unstable. So we used a hybrid design where the model first produces a short rationale or evidence trace and then selects or composes the answer in a constrained format.

[SPEAKER_06]: This doesn't solve everything, but it at least lets us inspect whether the model looked in approximately the right place for the right reason.

[SPEAKER_06]: The other issue is context budget. If you give the model the whole video densely sampled, you exceed practical limits very quickly. If you subsample too aggressively, you remove the very evidence that defines the task. So benchmarking long video is also benchmarking the sampling strategy, whether you admit it or not.

[SPEAKER_06]: We therefore report two settings. One is a fixed-budget setting, where every model gets the same visual budget. The other is an adaptive setting, where the model is allowed to retrieve more evidence but has to pay for it in the metric. We think this is closer to deployment anyway.
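An illustrative formulation of those two reporting regimes follows; this is a sketch of the accounting idea, not the benchmark's published metric, and `model`, `q.sample`, and `q.gold` are hypothetical.

```python
# Fixed budget: every model answers from the same number of sampled frames.
# Adaptive: a model may retrieve more evidence, but the extra budget discounts
# its score, so retrieval has to justify itself through performance.

def fixed_budget_score(model, questions, budget):
    correct = sum(model.answer(q, frames=q.sample(budget)) == q.gold
                  for q in questions)
    return correct / len(questions)

def adaptive_score(model, questions, base_budget, lam=0.1):
    total = 0.0
    for q in questions:
        answer, used = model.answer_adaptive(q, start_budget=base_budget)
        overspend = max(0, used - base_budget) / base_budget
        total += (answer == q.gold) / (1.0 + lam * overspend)  # pay for evidence
    return total / len(questions)
```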
[SPEAKER_06]: The broad result is not very surprising but still useful: current systems are much better at identifying salient events than at maintaining coherent state over time. They can often tell you what is happening now, and sometimes what just happened, but they struggle when the evidence is distributed and partially interrupted.

[SPEAKER_06]: Another interesting result is that stronger language models do improve answer plausibility, but this can hide failures in grounding. So the output sounds more reasonable, yet the temporal evidence is still wrong or incomplete.

[SPEAKER_06]: I will stop there and take questions, especially if people disagree with the benchmark design, because in my view that is the part worth arguing about.

[SPEAKER_04]: Thanks. Questions?

[SPEAKER_01]: Thanks, this was very interesting. On the annotation side, how do you deal with disagreement between annotators when the boundary of an event is fuzzy? For example, the start of preparing an item versus the moment the item becomes relevant to the actual task.

[SPEAKER_06]: Yes, that's a very common issue. We do two things. First, we separate temporal anchoring from semantic labeling. So annotators first identify a rough span, and only then decide what role that span plays. Second, for ambiguous cases, we preserve the ambiguity in the metadata rather than forcing a fake consensus.

[SPEAKER_06]: In other words, we would rather say this question has softer boundaries than pretend that human judgment is perfectly sharp when it is not.

[SPEAKER_05]: Do you then exclude those ambiguous cases from evaluation?

[SPEAKER_06]: Not always. Sometimes ambiguity is actually part of the task. But we do distinguish between questions where there is one intended answer with high agreement and questions where the acceptable evidence span is broader. We score those differently.

[SPEAKER_02]: I have a question about the hybrid evaluation you mentioned. If the model gives a rationale first, don't you risk rewarding fluent but fabricated evidence?

[SPEAKER_06]: Absolutely, yes. That is why we do not score the rationale as open-ended prose quality. We score it against evidence anchors. So the rationale is only useful if it points to relevant moments, states, or transitions that match the annotated structure.

[SPEAKER_02]: So it is not chain-of-thought scoring in the usual sense.

[SPEAKER_06]: Correct. We are not trying to reward verbosity. We are trying to see whether the model can expose where its answer comes from.

[SPEAKER_00]: Can I ask a related question? In your adaptive setting, are retrieval decisions part of the benchmark, or are they part of the model design? Because the moment you let people retrieve selectively, you are mixing evaluation and system engineering.

[SPEAKER_06]: I think that's unavoidable to some extent. If long-video understanding under realistic budgets is the target, then evidence selection is part of the capability. But we try to make the accounting explicit. So a model is not rewarded simply for asking for more frames or more clips. It has to justify that extra budget through performance.

[SPEAKER_00]: Right, so you are effectively evaluating answer quality under a resource policy.

[SPEAKER_06]: Exactly.
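Picking up the earlier exchange about scoring rationales against evidence anchors rather than as prose: here is a hypothetical formulation of such a scorer, not the benchmark's actual one, where rationale spans and gold anchors are (start, end) times in the video.

```python
# The rationale earns credit only where it points at annotated evidence; fluent
# prose that cites the wrong part of the video earns nothing.

def anchor_recall(rationale_spans, gold_anchors, min_overlap=0.3):
    def hits(span, anchor):
        inter = max(0.0, min(span[1], anchor[1]) - max(span[0], anchor[0]))
        return inter / (anchor[1] - anchor[0]) >= min_overlap
    found = sum(any(hits(s, a) for s in rationale_spans) for a in gold_anchors)
    return found / len(gold_anchors)
```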
[SPEAKER_03]: I wanted to ask whether synthetic data helps here. Because if the issue is temporal structure, one could imagine generating controlled sequences where the dependencies are known exactly.

[SPEAKER_06]: It helps for diagnosis, definitely. Synthetic data is very good when you want clean control over what changed and when. What it is not always good at is capturing the messiness of real human activity, where actions overlap, are interrupted, resumed, or only partially completed.

[SPEAKER_06]: So my view is: synthetic for isolating a failure mode, real data for measuring whether the failure mode still matters outside the lab.

[SPEAKER_07]: I have a question on benchmark lifespan. We have seen repeatedly that once a benchmark becomes popular, models optimize to it and it stops telling us much. Do you have a plan to avoid that?

[SPEAKER_06]: Only a partial one. We are trying to design the benchmark as a framework rather than a frozen leaderboard. So the structure of the task remains the same, but the actual instances and temporal compositions can evolve. Otherwise, yes, it will get saturated and we will repeat the same cycle.

[SPEAKER_01]: Following up on that, do you think the field has over-indexed on static benchmarks because they are easier to compare?

[SPEAKER_06]: Yes, absolutely. Static benchmarks are convenient, and convenience has shaped a lot of the research culture. But the moment the task involves long context, interaction, or selective evidence gathering, static one-shot evaluation becomes less informative.

[SPEAKER_05]: This may be a bit practical, but how expensive is this kind of annotation really? If someone wanted to build a smaller domain-specific version, is that realistic for an academic lab?

[SPEAKER_06]: It depends on how much structure you want. If you only want coarse event links, it is manageable. If you want state tracking with explicit evidence anchors and ambiguity handling, it becomes much more labor intensive. My honest answer is that protocol design matters as much as annotation volume. A bad protocol wastes annotation very quickly.

[SPEAKER_00]: I like that point a lot. I think people underestimate how much of "model progress" is actually just annotation design and benchmark design catching up.

[SPEAKER_06]: Yes, and also task definition. If you ask a vague question, you get vague progress.

[SPEAKER_04]: We have time for maybe two short questions.

[SPEAKER_08]: Very short one. Did you compare human performance under the same visual budget constraints? Because some of these tasks sound difficult even for people if you force them to watch sparse samples.

[SPEAKER_06]: Yes, and that was actually quite revealing. Humans remain much more robust than models when the sampling is imperfect, provided they still have enough temporal anchors. But once you make the sampling too sparse, human performance also drops sharply. So the budget setting is not just a nuisance variable. It changes the nature of the task.

[SPEAKER_08]: That makes sense, thanks.

[SPEAKER_09]: Last question. Do you think these benchmarks should remain in the video domain, or do they point toward a more general multimodal evaluation problem with memory and selective evidence across time?

[SPEAKER_06]: I strongly believe it is the latter. Video just makes the issue unavoidable. But the real question is broader: how does a system maintain and update grounded state under limited context, and how do we evaluate that without confusing fluency for understanding?

[SPEAKER_04]: Very nice.
I think that's a good place to stop. Thanks again.

[SPEAKER_04]: We will break for lunch now and come back at two. If you still have questions, please catch the speaker outside.