Khoury News
As his awards list grows, Lorenzo Torresani crafts AI assistants for help with everyday tasks
Khoury College's newest Aoun Chair is a veteran of the computer vision field, a recipient of multiple honors at EgoVis 2025, and a believer in the power of camera-enabled, AI-powered assistants to make daily life easier.
Misplacing your keys can be frustrating. The last thing anyone wants to do, especially when running late, is turn over couch cushions and look under furniture.
Now imagine a future where instead of frantically searching the house, you simply ask an AI assistant “Where did I leave my keys?” and it tells you “I last saw them on the windowsill.”
This is just one example of the kinds of problems that Lorenzo Torresani believes can be solved with the help of perceptual AI agents, which he envisions as personal assistants built into wearable camera devices, able to see what we see and hear what we hear. They'd help us with chores, manage our schedules, and even guide us as we learn new skills. In the not-so-distant future, he expects robot assistants to be a common feature in many households.
Torresani's perspective on perceptual AI agents grows out of more than 20 years of work in computer vision, a subfield of AI that focuses on getting computers to understand and interpret images. He has conducted research at Meta, Microsoft, and Dartmouth College, and joined Khoury College this fall as the President Joseph E. Aoun Chair. He is the second Khoury professor to earn the honor, after Tina Eliassi-Rad became the first honoree in 2023.
Torresani works with cutting-edge multimodal models that can understand and interpret both video and audio, but his focus has always been on the humans this technology is meant to serve.
“It’s all about humans,” he says. “At the end of the day, that’s all we care about. We want technology that makes daily life easier, more effective, and more productive.”
Torresani's research has been widely lauded, earning him a National Science Foundation CAREER Award, a Google Faculty Research Award, three Facebook Faculty Awards, and a Fulbright US Scholar Award. Over the summer, he added several more honors at the Egocentric Vision (EgoVis) Workshop at the Conference on Computer Vision and Pattern Recognition, where his work on video and audio understanding in machine learning models earned first place in the Ego4D EgoSchema Challenge as well as three Distinguished Paper Awards.
EgoSchema is a leading benchmark for testing long-video understanding and episodic memory retrieval. The challenge requires AI models to answer multiple-choice questions about three-minute video clips depicting natural human behavior.
“The questions vary quite a bit, so we may have things like memory retrieval, which we call ‘needle in a haystack’ questions. You may be given a very long video, but the relevant segment to answer is very short,” Torresani explains. “Other questions involve hopping through different segments of a video and piecing evidence together.”
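To make that format concrete, here is a minimal sketch of what one long-video multiple-choice item might look like as a data record. The field names and example values are illustrative placeholders, not the actual EgoSchema schema.

```python
from dataclasses import dataclass

@dataclass
class LongVideoQuestion:
    """Illustrative long-video multiple-choice item.

    Field names are hypothetical stand-ins, not the real EgoSchema schema.
    """
    video_id: str        # identifier of a roughly three-minute egocentric clip
    question: str        # natural-language question about the clip
    options: list[str]   # candidate answers the model must choose from
    answer_index: int    # index of the correct option (hidden at test time)

# A memory-retrieval ("needle in a haystack") style example: the evidence
# needed to answer may occupy only a few seconds of the full clip.
example = LongVideoQuestion(
    video_id="clip_0001",
    question="What was the person ultimately trying to accomplish?",
    options=[
        "Repairing a bicycle tire",
        "Preparing a vegetable stir-fry",
        "Organizing a toolbox",
        "Assembling a bookshelf",
        "Painting a wall",
    ],
    answer_index=1,
)
```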
One of Torresani’s award-winning papers featured Video ReCap, a system he and his colleagues developed that uses machine learning to craft detailed captions for videos ranging from one second to two hours in length. This recursive captioning model starts with short segments of a few seconds and feeds that information through multiple hierarchy levels to develop broader contextual understanding.
“The model sees that in the previous level, for example, I picked up an apple and then put down an okra packet. In the higher level, it can understand that you’re shopping around a supermarket,” Torresani says.
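As a rough illustration of that recursive idea, the sketch below shows how captions from one temporal level can feed the next, coarser level until a single video-level summary remains. The `caption_segment` function is a hypothetical stand-in for a multimodal model call; this is a schematic of the hierarchical approach, not the actual Video ReCap implementation.

```python
# Schematic sketch of hierarchical (recursive) video captioning in the spirit
# of Video ReCap. `caption_segment` is a stand-in for a multimodal model call;
# the real system's architecture and interfaces differ.

def caption_segment(clip_features=None, context_captions=None):
    # Stub: a real model would generate a caption from video features and/or
    # the captions produced at the level below.
    if context_captions is None:
        return f"short caption for clip {clip_features}"
    return f"summary of {len(context_captions)} lower-level captions"

def chunk(items, size):
    # Group a list into consecutive chunks of `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]

def hierarchical_captions(clip_ids, group_size=60):
    """Caption short clips first, then recursively summarize groups of
    captions into coarser levels until one video-level summary remains."""
    levels = [[caption_segment(clip_features=c) for c in clip_ids]]
    while len(levels[-1]) > 1:
        levels.append([
            caption_segment(context_captions=group)
            for group in chunk(levels[-1], group_size)
        ])
    return levels  # levels[-1][0] is the whole-video summary

# Example: 7,200 one-second clips (a two-hour video) collapse into a few
# levels of progressively broader captions.
levels = hierarchical_captions(range(7200))
print([len(level) for level in levels])  # [7200, 120, 2, 1]
```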
Part of the novelty of Torresani's work is his focus on egocentric video: video captured from a first-person perspective through a wearable camera like those in augmented reality glasses. The length of the videos he works with is also novel.
"For the last two decades, most of our research community has focused on short video understanding — looking at brief snippets and determining what's happening within a few seconds," Torresani notes. "But now we're moving toward a finer-grained understanding. You have videos that may last several minutes, and questions that require really piecing together evidence."
Torresani’s work on long-form video understanding is a crucial step in creating perceptual AI agents, which would need to interpret video all day long.
“The wearable camera will always be on, which means they’ll see everything you see. If you’re cooking and they see that you are adding salt to a dish, they can tell you, ‘You already added salt,’” says Torresani.
While losing keys or getting distracted while cooking may seem like mild annoyances for most of us, for some, they’re challenges that make living independently a constant struggle.
“I’m really, really interested in developing this technology for assisting people with disabilities in their daily activities, empowering them to cook for themselves and navigate their environments,” he says.
Torresani also hopes to develop models that provide “proactive assistance in complex tasks.”
“Maybe you want to learn to play tennis or violin but you need high-level coaches to assist you in picking up these skills. Learning these skills is really expensive; it’s almost an elite thing,” he notes, adding that with wearable camera technology powered by perceptual AI agents, “It would be like having your own personal coach.”
If you wanted to improve your golf swing, for instance, the AI could provide you with a first-person perspective of how to swing the club and a trajectory along which to move your arms.
“I think this could be really disruptive and potentially accelerate our ability to learn skills, as well as raise the ceiling so that people can achieve even higher levels of proficiency in different disciplines,” Torresani says. “Through this technology, you can really democratize learning, which is very powerful.”