Data Science at Home


Representative Subsets For Big Data Learning: A New Podcast Episode

May 03, 2016

How would you perform accurate classification on a very large dataset by looking at just a sample of it?

In this episode I interview my friend and colleague Rocco Langone, a machine learning researcher at the University of Leuven, Belgium.
One of his recent papers deals with big data and similarity metrics.

In this work Rocco proposes a deterministic method for extracting subsets from Big Data that are representative of the inherent structure of the data itself. This makes it possible to work with a subset rather than the entire dataset while still achieving accuracy as high as, if not higher than, traditional (e.g. random) sampling.
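To give a feel for what "deterministic subset selection on a k-NN graph" can look like, here is a minimal sketch. It is not the algorithm from the paper: it assumes a simple greedy heuristic of my own choosing (pick the points with the highest in-degree in the k-NN graph, and skip the neighbours of points already picked), and the names `knn_graph` and `representative_subset` are hypothetical. The brute-force neighbour search is only suitable for toy data.

```python
import math


def knn_graph(points, k):
    """For each point index, return the indices of its k nearest neighbours (brute force)."""
    neighbours = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p; ties are broken by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbours.append(others[:k])
    return neighbours


def representative_subset(points, k, m):
    """Greedily pick up to m hub points from the k-NN graph (illustrative heuristic only)."""
    neighbours = knn_graph(points, k)
    # In-degree: how many other points list i among their k nearest neighbours.
    in_degree = [0] * len(points)
    for nbrs in neighbours:
        for j in nbrs:
            in_degree[j] += 1
    # Visit points from highest to lowest in-degree; ties are broken by index,
    # so the same input always yields the same subset (no random sampling).
    order = sorted(range(len(points)), key=lambda i: (-in_degree[i], i))
    selected, blocked = [], set()
    for i in order:
        if len(selected) == m:
            break
        if i in blocked:
            continue
        selected.append(i)
        # Block the neighbours of a selected point so the subset spreads over the data.
        blocked.update(neighbours[i])
    return selected


# Toy data: two small, well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(representative_subset(points, k=2, m=2))  # prints [0, 3]
```

On this toy input the sketch picks one point from each group. Unlike random sampling, running it again on the same data returns exactly the same subset.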

As you can see, there is always a solution in Big Data. More details in this episode.
Enjoy!

Show notes

Representative Subsets For Big Data Learning using k-NN Graphs [paper]

Algorithm Flowchart

Distributed Map-Reduce graph construction