In applications ranging from web search through spam filtering to image recognition, machine learning algorithms are widely used to automatically learn and improve performance. These algorithms require large collections of examples to learn from, which can often be conveniently obtained using crowdsourcing services. For example, to train image recognition algorithms, millions of images can be labeled at modest cost by hiring tens of thousands of workers on an ad-hoc basis through an online platform.
However, a persistent issue with such data is the varying quality of the workers: many of the results are unreliable or even randomly generated, which can cripple the learning process. The usual remedies are to repeatedly query many workers for labels on the same example, or to use a separate test set labeled by experts. Unfortunately, neither is always possible, and both make the process considerably more expensive.
Results obtained by WIS scientists, starting in 2009, show that it is possible to learn from such data without *any* repeated queries or expert-labeled test data, exploiting only the internal statistical relations between the answers of different workers. This has the potential to significantly improve the scalability and applicability of learning from crowdsourced data.
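To illustrate the idea of exploiting only internal statistical relations, here is a minimal sketch (not the authors' exact method) under simplifying assumptions: binary labels, balanced classes, and workers who answer independently given the true label. Under this model, the covariance between any two workers' answers factors as the product of their individual reliabilities, so the off-diagonal of the covariance matrix is approximately rank one, and its leading eigenvector estimates each worker's reliability with no ground truth at all. The simulated accuracies and vote weights below are illustrative choices, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: ground-truth labels in {-1, +1}, balanced classes,
# worker i answers correctly with probability accuracy[i], independently
# of the other workers given the true label.
n_workers, n_items = 10, 5000
truth = rng.choice([-1, 1], size=n_items)
accuracy = rng.uniform(0.3, 0.9, size=n_workers)  # some workers are worse than random

correct = rng.random((n_workers, n_items)) < accuracy[:, None]
answers = np.where(correct, truth, -truth)  # worker i agrees with truth w.p. accuracy[i]

# Key statistical fact under this model: for i != j,
#   Cov(answers_i, answers_j) = (2 p_i - 1)(2 p_j - 1),
# a rank-one structure whose leading eigenvector is proportional to the
# reliability vector v_i = 2 p_i - 1 -- learnable without any true labels.
C = np.cov(answers)
np.fill_diagonal(C, 0.0)  # the diagonal does not follow the rank-one pattern
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]            # eigenvector of the largest eigenvalue
v *= np.sign(v.sum())         # fix the global sign: assume most workers beat chance

# Use the estimated reliabilities as vote weights.
weighted_vote = np.sign(v @ answers)
majority_vote = np.sign(answers.sum(axis=0))

print("estimated vs true reliability correlation:",
      np.corrcoef(v, 2 * accuracy - 1)[0, 1])
print("weighted-vote accuracy:", (weighted_vote == truth).mean())
print("plain-majority accuracy:", (majority_vote == truth).mean())
```

The weighted vote down-weights (or sign-flips) unreliable workers, typically beating a plain majority vote, even though no example was ever labeled twice by design and no expert test set was used.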