My 5-week broad survey of research at UPenn
I spent the last 5 weeks at UPenn doing a broad survey of the research landscape. Despite completing five semesters, I had no real data points on what university research actually looked like.
Doron Ravid from Michigan on Agreeing to Implement at Micro Theory Seminar (Feb 3)
- I was really lost during this seminar. Did not help that I joined half an hour late. I didn’t realize an Economics seminar had so much notation — a lot of it flew over my head. I spent a lot of time asking LLMs for help during this talk.
- The focus was a particular type of game: a principal agent suggests a set of strategies. Non-principal agents are agreeable or not. Agreeability is something like being predictable, or being self-interested. How should the principal act given they have little information on the population of agents? There was some notion of rationalizability that I also didn’t quite understand.
- The topic was far from anything practical and actionable, kind of like advanced math. But there were applications to other abstract fields.
- I think I was the only undergraduate in the room.
Jiafei Duan from UW on Building Robotics Foundation Models with Reasoning in the Loop at GRASP SFI (Feb 4)
- This guy talked fast and was precise. His presentation moved quickly. He knew each slide very well.
- He offered a new paradigm for generalist robot policies called “reasoning-in-the-loop”. I guess most generalist policies look at input tokens, do some thinking, then produce action tokens. His approach was combining the steps so that the action tokens and reasoning tokens are produced … together? In some kind of latent space. Didn’t catch the exact architecture.
- Most policies also use action tokens that are discrete, produced by something like bucketing. He mentioned using a continuous action token space instead.
- He didn’t love end-to-end policies because they are not interpretable. You can’t get good diagnostics on why things are failing. Isn’t the trend in ML towards end-to-end?
- Also mentioned adding reasoning data (text tokens) to failure cases in the dataset, alongside the normal binary success/failure flag. This helps the model learn more from each data point.
- This was the first robotics seminar I sat in on. The crowd was young — lots of Master’s and PhD students, some undergrads. Probably 4 or so professors.
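The discrete-vs-continuous action point is easy to make concrete. Many generalist policies quantize each continuous action dimension into a fixed number of bins so actions can be predicted like text tokens. A minimal sketch of that bucketing idea — the bin count, ranges, and function names are my own assumptions, not his architecture:

```python
import numpy as np

def discretize_actions(actions, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer bin tokens."""
    # Normalize to [0, 1], then scale up to bin indices.
    norm = (np.clip(actions, low, high) - low) / (high - low)
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

def undiscretize_actions(tokens, low, high, n_bins=256):
    """Recover the bin-center continuous action from each token."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# A 2-DoF action with each dimension in [-1, 1].
a = np.array([0.3, -0.7])
tokens = discretize_actions(a, low=-1.0, high=1.0)
recovered = undiscretize_actions(tokens, low=-1.0, high=1.0)
```

The round trip loses up to half a bin width of precision per dimension, which is the quantization error a continuous action head avoids.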
Roni Sengupta from UNC on Understanding and Manipulating Physics from Images at GRASP (Feb 6)
- An endoscopy is a medical procedure where you put a camera into a GI tract or into someone’s lungs to image for cancer, tumors, and other diseases. Roni is working on the problem of reconstructing a 3D map of the organ given this video data. There’s huge clinical demand for this and it’s pretty unsolved.
- Foundation models are weak because the type of video data is out of distribution. Normal video models aren’t built for dynamic lighting, reflections, small spaces, etc. You need good physical intuitions to make sense of the space from video. For example, color isn’t static: there is a near-field-light effect where color depends on the angle and proximity of the camera. This can confuse normal models.
- She also presented a model where you input (1) an image and (2) scribbles on the image. The model produces a new image where the scribbled regions become highly lit. This uses a diffusion/encoder architecture. She presented a similar model for VFX, so you could add dust, splashes, and interactions between foreground and background from scribbles. Also presented a model that does forward and reverse aging with impressive results.
- What if you changed the endoscopic camera to capture more data or better data? She’s less interested in this problem because it would require medical facilities to change their equipment. The highest leverage thing is to just make the models smarter.
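The near-field-light effect she described is, at its core, the physics of a point light riding on the camera: observed brightness falls off with the square of the distance to the tissue and with the angle of incidence, so the same surface changes apparent color as the scope moves. A toy sketch of that falloff (simple Lambertian shading — my own illustration, not her model):

```python
import numpy as np

def observed_intensity(albedo, distance, incidence_deg, light_power=1.0):
    """Brightness of a diffuse surface lit by a point light at the camera.

    Inverse-square falloff in distance, cosine falloff in the angle
    between the surface normal and the light direction.
    """
    cos_term = np.cos(np.radians(incidence_deg))
    return light_power * albedo * np.maximum(cos_term, 0.0) / distance**2

# The same surface point, viewed close up (1 cm) vs. farther away (3 cm).
near = observed_intensity(albedo=0.8, distance=0.01, incidence_deg=10)
far = observed_intensity(albedo=0.8, distance=0.03, incidence_deg=10)
```

Tripling the distance cuts brightness by 9x, which is exactly the kind of view-dependent appearance change that confuses models trained on statically lit video.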
Chris Paxton on How Close Are We To Generalist Humanoid Robots? at GRASP SFI (Feb 11)
- I see this guy on my Twitter feed a lot. He also runs a podcast reviewing new robotics papers.
- Humans manipulate their entire body to do tasks efficiently (ex: leaning backwards to open a fridge). How can we get robots to be as efficient?
- SOTA robotic policies use 5 orders of magnitude less data than good LLMs (like Qwen). So maybe robotics is just facing a data problem. How do we get cheap data? For context, in 15 minutes, a human can do some activity 51 times. When they use a UMI interface, they can do it 35 times. With a full teleoperation rig, only 11 times.
- So teleoperation data is slow and thus expensive. The slowness might also infect the policy itself, since it learns from fundamentally slow demonstrations. Egocentric data is not good for dexterity. Sim is cheap but faces the sim-to-real transfer problem.
- What about world models? He showed a cool paper that used CS:GO recordings to generate a world that an agent could interact with. It uses inverse dynamics models to get action tokens from video tokens. A model trained this way may develop better depth and spatial intuitions.
- Evals suck right now! They’re not standardized. Anyone can manipulate a demo video to make a robot look smart. It’s really important that we have ways to test models on our own. Although more labs are using similar hardware so we’re making some progress.
- What can we learn from recent success in self driving? Is a data flywheel from profitable deployment the ultimate unlock? Also, humans use more than vision (tactile, orientation) — could this be an unlock?
- Overall this was a great overview of the current state of robotics. It didn’t go deep in any single direction.
- Many speakers have incentives to overstate the success of their work. Chris doesn’t because he’s already well established in the field. I got a sense he was skeptical on the rate of progress in general robotics. This was a good counterbalance to everything else I heard.
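The inverse-dynamics idea from the world-model discussion can be sketched in a linear toy setting: if consecutive states evolve as s_{t+1} = s_t + B·a_t, then the hidden actions can be recovered from observed state pairs alone by least squares — which is roughly the trick for extracting action labels from raw video. The linear dynamics and all names here are my assumptions, a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown ground-truth dynamics: s_{t+1} = s_t + B @ a_t.
B = rng.normal(size=(4, 2))          # state dim 4, action dim 2
actions = rng.normal(size=(100, 2))  # the "hidden" actions we want back
states = np.zeros((101, 4))
for t in range(100):
    states[t + 1] = states[t] + B @ actions[t]

# Inverse dynamics: regress actions from observed state deltas.
deltas = states[1:] - states[:-1]                         # (100, 4)
A_hat, *_ = np.linalg.lstsq(deltas, actions, rcond=None)  # delta -> action map
recovered = deltas @ A_hat
```

In this noiseless linear case the recovery is exact; the real video-token version replaces the least-squares map with a learned network, but the supervision signal is the same.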
George Konidaris from Brown on Unifying the Stack: A Principled Structuralist Approach to Intelligent Robot Control at GRASP (Feb 13)
- What a brilliant presentation. Most researchers just show their work. Here’s what I did! George told a story, synthesizing the last few decades of robotics research, leading to a single concern about the field. Here’s the story of robotics — and here’s what we’re missing.
- Nature is made up of robots, not programs. Intelligence is embodied.
- The old age of robotics was as follows: we theorized about how intelligence works (ex: sensing then planning then action) and built architectures that implemented those routines. Importantly, these systems are intelligible. The new age of robotics is just throwing a bunch of data at a huge transformer and expecting results.
- One problem with the old version is that it was too fragmented. There were all these different techniques for different problems (kinematics, motion planning, world modeling, reasoning, locomotion, grasping, UAVs, SLAM, nano robotics, vision, humanoids, etc.). Robotics has no organizing principle or unifying framework, only subfields. He calls the attempt to unify robotics “structuralism.”
- Decision processes are one mathematical object that can be used to describe all these subfields. It’s close to the standard RL model: agent, environment, rewards, policies.
- Intelligence is compositional; he presents a stack of different subparts that each answer a different question. Top down, it looks something like: tasks, objects, space, then sensorimotor.
- Each component needs to throw away excess data and build useful representations to answer the question it cares about. This is compression. The point of building abstractions is to compress. You reduce the size of the search space.
- I got lost in the middle of the presentation. The start was fantastic and the end was strong, but I couldn’t fully follow how he connected the research conclusions to the argument he was building.
- Is his argument for an old version of robotics just a form of self-protection? If all we need is data + transformer + compute, what value is there in the 10,000s of hours he’s spent meticulously designing architectures and subroutines for robots? George is very anti-bitter-lesson-pilled.
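The “decision processes as a unifying object” claim maps onto the standard MDP tuple (states, actions, transitions, rewards, discount). A minimal sketch — a two-state toy MDP solved by value iteration; all the numbers are my own illustration, not anything from the talk:

```python
import numpy as np

# MDP: 2 states, 2 actions. T[s, a, s'] = transition probability,
# R[s, a] = expected reward, gamma = discount factor.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1
])
R = np.array([
    [0.0, 1.0],   # rewards for actions taken in state 0
    [2.0, 0.0],   # rewards for actions taken in state 1
])
gamma = 0.9

# Value iteration: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(500):
    V = np.max(R + gamma * T @ V, axis=1)
policy = np.argmax(R + gamma * T @ V, axis=1)
```

The appeal of the formalism is that grasping, locomotion, SLAM-as-state-estimation, and task planning can all be posed as instances of this one object, just with different state and action spaces.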