Teaching a robot to see a tennis ball

Tennis-ball detection sounds like the easiest computer-vision problem on earth. The balls are unnaturally yellow, roughly the same size, and contrasted against a court that — by design — is either green, blue, or red. So why did it take us eighteen months to get it right?

The short answer: court conditions are nothing like the dataset.

The dataset is a lie

Off-the-shelf object-detection models trained on COCO or OpenImages have plenty of tennis-ball examples. They almost all look the same: a single ball, well-lit, sitting on something. Real court conditions are completely different. The ball is rolling, half-occluded by a player's foot, in the shadow of the net post, or smashed into a corner where the wall meets the floor. Sometimes it's in a pile of fifteen other balls.

We started with a pre-trained YOLOv5 model and got 72% precision at 0.5 IoU on a held-out set of real court images. Sounds OK. In practice it meant the robot chased shadows about a third of the time.

What actually worked

Three changes moved the needle, in roughly equal proportion.

1. Real-court training data. We mounted a GoPro on a tripod at four Toronto public courts over four months — different times of day, different weather, different surfaces. We labelled 12,000 frames. That single act took us from 72% to 89%.

2. Synthetic motion blur. The deployment model runs at 30 FPS on a Jetson-class board. At that frame rate, a fast-moving ball is a yellow stripe, not a circle. We augmented the training set with motion-blur kernels at varying angles and intensities. Another 4 points of precision.

3. Temporal filtering, not just per-frame detection. Instead of acting on every frame independently, we run a small Kalman filter on top of the per-frame detections. A "ball" is only a ball if we've seen it in three out of five frames. False-positives dropped to under 2% in field tests.

The board

The whole pipeline runs on-device. No cloud, no Wi-Fi dependency, no latency budget eaten by a round trip. We use a Jetson-class compute module (the exact part is on the roadmap to swap as NVIDIA's lineup evolves), a wide-angle monocular camera, and the gripper motors. Total parts cost for the perception stack: under $200 USD at the volumes we currently buy.

Next milestone: 30+ metre detection range, up from 12 m today. That's a different model architecture and a different camera — probably a follow-up post in a few months.

— Javad

Teaching a robot to see a tennis ball

The dataset is a lie

What actually worked

The board

More from the lab

Five generations of gripper, in five pictures

Computer vision on a $200 board

Field testing on Toronto public courts