To appear at CVPR 2011, Colorado Springs, USA, June 21-23
Computer vision has hit the mainstream with applications such as cars that detect pedestrians, motion capture for animation, and apps that let you cash a cheque by snapping a picture with your mobile phone. A great example of computer vision in the consumer market is Microsoft's Kinect gaming system, which can accurately detect the pose of one or more individuals, allowing gameplay to be controlled using just the body. Such a system must be able to detect pose reliably under a wide variety of conditions - different players, unusual clothing, poor lighting, cluttered backgrounds, and other sources of variation.
One way to perform pose estimation is to keep a large database of examples of people in a variety of poses, along with labels indicating the configuration of the body in 2D or 3D. When presented with a new example (without labels), we can compare it against the database to find the best match and then assign the labels of that match to the new example. However, the matching (or similarity) problem is a very tough one, largely because of the input variability introduced by the factors described above. If we had many examples of people in similar pose but under differing conditions, we could use machine learning to construct an algorithm that matches based on the important information (e.g. pose) and ignores the distracting information (e.g. lighting, clothing, background, etc.).
But how do we collect such data? In a somewhat unusual move for computer scientists, we turned to the Dutch progressive-electro band C-Mon and Kypski. Their music video/crowdsourcing project "One Frame of Fame" asks people on the web to replace one frame of the band's music video for the song "More or Less" with a capture from a webcam. A visitor to the band's website is shown a single frame of the video and asked to imitate it in front of the camera. The new contribution is spliced into the video, which updates once an hour.
This turns out to be the perfect data source for learning an algorithm to compute similarity based on pose. Armed with the band's data and a few machine learning tricks up our sleeves, we built a system that is highly effective at matching people in similar pose but under widely different settings.
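To make the database lookup idea concrete, here is a minimal sketch of matching a new example against labelled examples and copying over the label of the best match. It is only an illustration, not the system described here: the feature vectors, the pose label strings, the function names and the use of plain Euclidean distance are all assumptions. In practice the distance would be computed in a learned space rather than on raw features.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_pose(query: Vector,
                  database: List[Tuple[Vector, str]]) -> Tuple[str, float]:
    """Return the label of the closest database entry and its distance."""
    if not database:
        raise ValueError("database is empty")
    best_features, best_label = min(
        database, key=lambda entry: euclidean(query, entry[0])
    )
    return best_label, euclidean(query, best_features)


# Toy database: hypothetical 3-D features with made-up pose labels.
database = [
    ([0.0, 0.0, 1.0], "arms_up"),
    ([1.0, 0.0, 0.0], "arms_out"),
    ([0.0, 1.0, 0.0], "crouch"),
]

# The unlabelled query inherits the label of its nearest neighbour.
label, dist = transfer_pose([0.9, 0.1, 0.0], database)
print(label, round(dist, 3))
```

The quality of such a lookup depends entirely on the distance function. If lighting or clothing dominates the features, the nearest neighbour will share those nuisance factors rather than the pose, which is why a similarity measure learned from data is attractive.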
Supervised methods for learning an embedding aim to map high-dimensional images to a space in which perceptually similar observations have high measurable similarity. Most approaches rely on binary similarity, typically defined by class membership where labels are expensive to obtain and/or difficult to define. In this paper we propose crowd-sourcing similar images by soliciting human imitations. We exploit temporal coherence in video to generate additional pairwise graded similarities between the user-contributed imitations. We introduce two methods for learning nonlinear, invariant mappings that exploit graded similarities. We learn a model that is highly effective at matching people in similar pose. It exhibits remarkable invariance to identity, clothing, background, lighting, shift and scale.
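One way to picture graded similarities derived from temporal coherence is sketched below. Each imitation is tagged with the index of the video frame it imitates. Imitations of the same frame get similarity 1, and imitations of nearby frames get a smaller positive value. The exponential decay, the `tau` and `window` parameters and the function names are illustrative assumptions, not the weighting used in the paper.

```python
import math
from typing import Dict, List, Tuple


def graded_similarity(frame_a: int, frame_b: int,
                      tau: float = 2.0, window: int = 5) -> float:
    """Similarity in [0, 1] that decays with the gap between frame indices.

    Assumed form: exp(-gap / tau) inside the window, 0 outside it.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    gap = abs(frame_a - frame_b)
    if gap > window:
        return 0.0
    return math.exp(-gap / tau)


def pairwise_similarities(frame_ids: List[int], tau: float = 2.0,
                          window: int = 5) -> Dict[Tuple[int, int], float]:
    """Map each pair (i, j), i < j, with non-zero similarity to its weight."""
    sims: Dict[Tuple[int, int], float] = {}
    for i in range(len(frame_ids)):
        for j in range(i + 1, len(frame_ids)):
            s = graded_similarity(frame_ids[i], frame_ids[j], tau, window)
            if s > 0.0:
                sims[(i, j)] = s
    return sims


# Hypothetical contributions: two imitations of frame 10, one of frame 12,
# and one of the distant frame 40.
frame_ids = [10, 10, 12, 40]
sims = pairwise_similarities(frame_ids)
for pair, weight in sorted(sims.items()):
    print(pair, round(weight, 3))
```

Compared with binary similar/dissimilar labels, weights like these let a learner pull imitations of adjacent frames moderately close together while treating imitations of distant frames as unrelated.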