Zekai Yin

I am a Research Assistant at the H2X Lab at Boston University, working with Professor Eshed Ohn-Bar. My research interests include computer vision, robotics, and machine learning. I am particularly focused on visual odometry, 6-DoF object pose estimation, and human-robot interaction.

Prior to this, I was a research assistant at the PKU-Agibot Lab of Peking University, where I worked with Professor Hao Dong. My work there focused on camera-to-robot pose estimation and robotic manipulation.

(I am currently seeking full-time opportunities as a Machine Learning Engineer, Research Scientist, or Software Development Engineer.)

Publications

* indicates equal contribution

ZeroVO: Visual Odometry with Minimal Assumptions

Lei Lai*, Zekai Yin*, Eshed Ohn-Bar

We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement over prior methods on three standard benchmarks—KITTI, nuScenes, and Argoverse 2—as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.

Pipeline for ZeroVO: Visual Odometry with Minimal Assumptions
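For context, the classical two-frame baseline that ZeroVO moves beyond assumes known camera intrinsics. Below is a minimal OpenCV sketch of that conventional setup, shown only for contrast (this is a generic illustration, not the ZeroVO pipeline; the function and parameter names are my own):

```python
import cv2
import numpy as np

def two_frame_relative_pose(img1, img2, K):
    """Classical two-frame relative pose with KNOWN intrinsics K.
    ZeroVO's setting removes this calibration requirement; this
    baseline is shown purely for contrast."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # recoverPose yields rotation and a unit-norm translation: metric scale is
    # unobservable from two monocular views, which is one reason metric-depth
    # priors (as used in ZeroVO) are valuable.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```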

Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation

Yang Tian*, Jiyao Zhang*, Zekai Yin*, Hao Dong

In this work, we tackle the problem of online camera-to-robot pose estimation from single-view successive frames of an image sequence, a crucial task for robots to interact with the world. The primary obstacles of this task are the robot's self-occlusions and the ambiguity of single-view images. This work demonstrates, for the first time, the effectiveness of temporal information and the robot structure prior in addressing these challenges. Given the successive frames and the robot joint configuration, our method learns to accurately regress the 2D coordinates of the predefined robot keypoints (e.g., joints). With the camera intrinsics and robot joint states known, we obtain the camera-to-robot pose using a Perspective-n-Point (PnP) solver. We further improve the camera-to-robot pose iteratively using the robot structure prior. To train the whole pipeline, we build a large-scale synthetic dataset generated with domain randomization to bridge the sim-to-real gap. Extensive experiments on synthetic and real-world datasets and on a downstream robotic grasping task demonstrate that our method achieves new state-of-the-art performance and outperforms traditional hand-eye calibration algorithms in real time (36 FPS).

Pipeline for Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation
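The PnP step described in the abstract can be sketched as follows: a minimal OpenCV illustration assuming the regressed 2D keypoints, the joints' 3D positions from forward kinematics, and the intrinsics K are given (names are illustrative; this is not the paper's full pipeline):

```python
import cv2
import numpy as np

def camera_to_robot_pose(keypoints_2d, joints_3d, K):
    """Solve the camera-to-robot pose from regressed 2D keypoints and the
    robot joints' 3D positions (from forward kinematics) via PnP.
    keypoints_2d: (N, 2) pixel coordinates; joints_3d: (N, 3) in the
    robot base frame; K: (3, 3) camera intrinsics."""
    ok, rvec, tvec = cv2.solvePnP(
        joints_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)  # rotation: robot base frame -> camera frame
    return R, tvec
```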

Research Experience

H2X Lab

Boston University

January 2024 - Present

  • Proposed ZeroVO (Zero-Shot Visual Odometry) as a co-first author: a novel visual odometry algorithm with minimal assumptions, requiring only two images and no camera intrinsics. The method achieves zero-shot generalization across diverse cameras and environments, overcoming limitations tied to specific sensors and environments. This paper was accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025).
  • Fused metric-scale language and geometric information with correspondence features, achieving lower translation and rotation errors than prior methods without further alignment.
  • Investigated the influence of highly generalizable scene descriptions derived from Vision-Language Models, pseudo-LiDAR generated from metric depth, and optical-flow-guided scene flow on the visual odometry task.
  • Collected a dataset using Grand Theft Auto with 600 on-road and 600 off-road driving videos (300,000 images) in various driving conditions (speed, weather, driving style, camera parameters, etc.).
  • Proposed a Gaussian Mixture Model (GMM)-based diffusion model designed to explicitly capture human-like, multimodal driving decisions in diverse contexts, achieving state-of-the-art performance on current benchmarks and revealing weaknesses in standard evaluation practices.
  • Developed a human-in-the-loop simulation benchmark with an improved IDM-based reactive driving simulation (see the sketch after this list), combining virtual reality and racing-wheel controllers. Reconstructed the validation set of the nuScenes dataset using NeRF-based methods (Nerfacto) and Gaussian-splatting-based models (OmniRe, HUGSIM) and conducted user studies on it. Collected over 20,000 feasible and diverse driving trajectories.
  • Helped collect action data from visually impaired people and their guide dogs using the Xsens motion capture system, enriching the diversity of the navigation dataset for real-world applications. Created animated 3D visualizations using Python-scripted Blender.
  • Integrated the ViperX 300 S robotic arm into a Vision-Language-Action manipulation model pipeline using ROS2 and Python, collected real-world data to fine-tune the model, and conducted real-world experiments.
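Below is a minimal sketch of the Intelligent Driver Model (IDM) underlying the reactive simulation mentioned above. The parameter defaults are common textbook values, not the ones used in the benchmark:

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0,    # desired free-flow speed (m/s)
                     T=1.5,      # desired time headway (s)
                     a_max=1.5,  # maximum acceleration (m/s^2)
                     b=2.0,      # comfortable deceleration (m/s^2)
                     s0=2.0,     # minimum standstill gap (m)
                     delta=4.0): # acceleration exponent
    """IDM: acceleration of a following vehicle given its speed v, the
    lead vehicle's speed v_lead, and the bumper-to-bumper gap (m)."""
    dv = v - v_lead  # closing speed (positive when approaching the leader)
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```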

PKU-Agibot Lab

Peking University

July 2022 - May 2023

  • Proposed SGTAPose (Structure-Guided Temporal Attention Pose) as a co-first author, tackling camera-to-robot pose estimation from single-view successive frames of an image sequence with temporal cross-attention, running in real time and achieving higher precision than traditional hand-eye calibration. This paper was accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023).
  • Developed the first synthetic video dataset for camera-to-robot pose estimation using Blender (180,000 images).
  • Used ROS, libfranka, PyBullet, Franka-Control, and ctypes to improve the control and motion planning system for the Franka Emika Panda robotic arm for real-world experiments.
  • Designed the Refiner module, which uses the Levenberg-Marquardt algorithm to refine the PnP solution by weighting each point according to its reprojection error (a sketch follows this list).
  • Combined SAM (Segment Anything Model) with a 6-DoF pose estimation model to build a pipeline for the xArm6 robotic arm that stacks objects, pours water, and hands over objects.
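The Refiner idea above can be sketched with SciPy's Levenberg-Marquardt solver. The inverse-error weighting here is illustrative; the paper's exact formulation may differ:

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def refine_pose(rvec, tvec, joints_3d, keypoints_2d, K):
    """Refine an initial PnP pose with Levenberg-Marquardt,
    down-weighting keypoints that have large reprojection error."""
    proj, _ = cv2.projectPoints(joints_3d, rvec, tvec, K, None)
    err = np.linalg.norm(proj.reshape(-1, 2) - keypoints_2d, axis=1)
    w = 1.0 / (1.0 + err)  # larger reprojection error -> smaller weight

    def residuals(x):
        # x[:3] is the axis-angle rotation, x[3:] the translation
        p, _ = cv2.projectPoints(joints_3d, x[:3], x[3:], K, None)
        return (w[:, None] * (p.reshape(-1, 2) - keypoints_2d)).ravel()

    x0 = np.hstack([rvec.ravel(), tvec.ravel()])
    sol = least_squares(residuals, x0, method="lm")  # Levenberg-Marquardt
    return sol.x[:3], sol.x[3:]
```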

Education

Master of Science in Artificial Intelligence

Boston University

Sep 2023 - Jan 2025

Advisor: Eshed Ohn-Bar

Bachelor of Science in Data Science and Big Data Technology

Peking University

Sep 2019 - Jul 2023

Advisor: Hao Dong

Core Competencies

Languages & Frameworks

Python, C++, C, MATLAB, PyTorch, TensorFlow, Scikit-learn, JAX, Keras, HuggingFace

ML Engineering

MLflow, Weights & Biases, Docker, Kubernetes, ONNX, TensorRT, Ray

Data Processing

NumPy, Pandas, Matplotlib, SciPy, Dask, Spark, Luigi

Computer Vision

OpenCV, Detectron2, YOLO, SAM, NeRF, 3D Reconstruction, Gaussian Splatting, Pose Estimation

Robotics, Simulation & Mechanics

ROS, ROS2, Blender, PyBullet, Libfranka, Franka-Control, CAD, Fusion 360, 3D Printing, Carpentry

Work Experience

Machine Learning Engineer Intern

Nanjing Zealen Technology

Feb 2023 - May 2023

  • Designed time-series forecasting models using ST-GCN and PyTorch for 24-hour wind power prediction.
  • Implemented Temporal Fusion Transformer and XGBoost models for long-term pollution trend prediction tasks.
  • Created a feature engineering pipeline for meteorological data, improving model accuracy by 18% over baselines.
  • Developed an ML pipeline with MLflow tracking, optimizing hyperparameters through Bayesian optimization (see the sketch below).
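A minimal sketch of such a tracked hyperparameter search, using Optuna's default TPE sampler as a stand-in Bayesian optimizer and a toy scikit-learn regressor (illustrative only; the production pipeline differed):

```python
import mlflow
import optuna
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Toy stand-in for the forecasting data
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
    }
    with mlflow.start_run(nested=True):  # one tracked run per trial
        mlflow.log_params(params)
        score = cross_val_score(GradientBoostingRegressor(**params), X, y, cv=3).mean()
        mlflow.log_metric("cv_r2", score)
    return score

with mlflow.start_run(run_name="hparam_search"):
    study = optuna.create_study(direction="maximize")  # TPE sampler by default
    study.optimize(objective, n_trials=20)
```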

Software Development Intern

Beijing Siling Robot Technology

Jan 2021 - Feb 2021

  • Developed interaction interfaces and API components using C++ and Qt for robotic control systems.
  • Created cross-platform communication protocols for the robot operating system with low-latency performance requirements.
  • Implemented real-time data processing modules for sensor fusion, optimizing robot arm control response times.
  • Assisted with system integration testing, achieving 99% reliability in production deployment environments.

Leadership & Initiatives

Tutor and Course Organizer

Yuanpei College, Peking University

Mar 2020 - Jul 2023

  • Founded a carpentry course, developing its curriculum and safety protocols.
  • Instructed 150+ students; won the 2022 Yuanpei Special Contribution Award.

Founder and Leader

Yuanpei College 3D Printing and Designing Lab

Feb 2023 - Jul 2023

  • Established the college's first 3D printing lab with a CAD-to-fabrication workflow.
  • Designed graduation gifts using parametric modeling and additive manufacturing.