Data Science at Home
Representative Subsets For Big Data Learning: A New Podcast Episode
How would you perform accurate classification on a very large dataset by looking at just a sample of it?
In this episode I interview my friend and colleague Rocco Langone, Machine Learning Researcher at the University of Leuven, Belgium.
One of his recent papers is about big data and similarity metrics.
In this work Rocco proposes a deterministic method for extracting subsets from Big Data that are representative of the inherent structure of the data itself. This makes it possible to work with only a subset of the entire dataset while matching, if not exceeding, the accuracy obtained with traditional (e.g. random) sampling.
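To make the idea of deterministic, structure-aware sampling more concrete, here is a minimal sketch. It is not the algorithm from the paper: it assumes a simple greedy heuristic in which points that appear most often in other points' k-NN lists are picked first, and their neighbours are then skipped so the selection spreads across the data. The function names `knn_graph` and `select_representatives`, and the parameters `k` and `m`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_graph(X, k):
    """Return an (n, k) array; row i holds the indices of the k nearest neighbours of X[i]."""
    # Brute-force squared Euclidean distances: O(n^2) memory, fine for a toy example only.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    # A stable sort keeps the result deterministic when distances tie.
    return np.argsort(d2, axis=1, kind="stable")[:, :k]


def select_representatives(X, k, m):
    """Greedy, deterministic selection of at most m indices (illustrative heuristic only)."""
    n = X.shape[0]
    nbrs = knn_graph(X, k)
    # In-degree: how many other points list this point among their k nearest neighbours.
    in_degree = np.bincount(nbrs.ravel(), minlength=n)
    # Visit points from highest to lowest in-degree; ties are broken by index.
    order = np.argsort(-in_degree, kind="stable")
    covered = np.zeros(n, dtype=bool)
    chosen = []
    for i in order:
        if len(chosen) == m:
            break
        if covered[i]:
            continue
        chosen.append(int(i))
        covered[i] = True
        covered[nbrs[i]] = True  # skip the neighbours of an already chosen point
    # Fewer than m indices are returned if every point is covered first.
    return chosen
```

Because no random numbers are involved, calling `select_representatives(X, k=5, m=8)` twice on the same array returns the same indices, which is the property that distinguishes this kind of selection from random sampling.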
As you can see, there is always a solution in Big Data. More details in this episode.
Enjoy!
Show notes
Representative Subsets For Big Data Learning using k-NN Graphs [paper]
Algorithm Flowchart
Distributed Map-Reduce graph construction