Machine Learning with Small Data
Machine learning and big data are widely believed to be synonymous. The story goes that large amounts of training data are needed for algorithms to discern signal from noise. As a result, machine learning techniques have been adopted most heavily by web companies with troves of user data. For Google, Facebook, Microsoft, Amazon, and Apple (the “Frightful Five,” as Farhad Manjoo of the New York Times has dubbed them), obtaining large amounts of user data is no issue. Data usage policies have grown increasingly broad, allowing these companies to make use of everything from our keystrokes to our locations as we use their products. As a result, web companies have been able to offer very useful, if intrusive, products and services that rely on large datasets. For these companies, datasets with billions or even trillions of datapoints are not unusual.
In the academic world, however, machine learning has been making significant inroads into the sciences, and the data situation there is very different: large amounts of scientific or medical data are not easy to obtain. The largest barrier is cost. Traditionally, researchers have relied on tools like Amazon’s Mechanical Turk to harvest data. There, low-paid workers (rates average out to something like $1/hr, far below the US federal minimum wage) perform repetitive tasks such as labeling objects and faces in images or annotating speakers in text. These tasks rely on fundamental human skills typically mastered by kindergarten. Performing scientific experiments, however, requires significantly more expertise. As a result, going rates for experimental workers are much higher than for Mechanical Turk workers.
One way around this problem is to brute-force a solution with money. Google recently published a landmark study on building deep learning systems to identify signs of diabetic retinopathy in eye scans. To obtain data for the study, Google paid trained physicians to annotate large amounts of data, work that likely cost hundreds of thousands, if not millions, of dollars. For Google, the expenditure would have amounted to a rounding error in its financials. For academic researchers, performing such a study would have required securing a large grant from funding agencies. Needless to say, in today’s troubled scientific funding environment, few researchers can hope to obtain such resources.
What does this state of affairs entail? Are we doomed to live in a world where the best research can only be performed by large corporations with the requisite monetary resources? Money will always provide an advantage, but perhaps the situation is not as dire as it seems. Recently, there has been a surge of work on low-data machine learning. Work from MIT a few years ago demonstrated that it was possible to build “one-shot” image recognition systems, capable of learning new classes of visual objects from a single example, using probabilistic programming. Follow-up work from DeepMind demonstrated that standard deep-learning toolkits like TensorFlow could replicate the feat. Since then, one-shot learning has been extended to drug discovery (work by my collaborators and me), robotics, and other areas.
The emerging theme is that it is sometimes possible to transfer information between datasets. Even when only limited data is available for a particular machine learning problem, if large amounts of data exist for related problems, clever techniques can allow a model to transfer useful information from one to the other. Such techniques may help scientific machine learning overcome its low-data problem by transferring knowledge from data-rich to data-poor problem spaces.
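To make the idea concrete, here is a minimal toy sketch of transfer learning in the simplest possible setting: linear models. It is not any of the systems mentioned above, and all names and numbers in it are illustrative. Two tasks share the same hidden low-dimensional structure; a feature map learned on the data-rich source task lets the data-poor target task (just 10 labeled examples) be fit with only two parameters, where fitting all twenty input dimensions directly would badly overfit.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 2  # input dimension, shared latent dimension

# Hypothetical setup: both tasks depend on the same k latent
# directions of a d-dimensional input.
W_true = rng.normal(size=(d, k))

def make_task(n, coef):
    """Generate n labeled points whose target depends only on the latent features."""
    X = rng.normal(size=(n, d))
    y = X @ W_true @ coef + 0.01 * rng.normal(size=n)
    return X, y

# Data-rich source task: plentiful labels that expose the latent features.
X_src = rng.normal(size=(5000, d))
Y_src = X_src @ W_true + 0.01 * rng.normal(size=(5000, k))

# Data-poor target task: only 10 labeled examples, plus a held-out test set.
X_tgt, y_tgt = make_task(10, coef=np.array([0.5, 3.0]))
X_test, y_test = make_task(1000, coef=np.array([0.5, 3.0]))

# "Pretrain": learn the shared feature map from the source data.
# (Here via least squares; in practice this would be a neural network.)
W_hat, *_ = np.linalg.lstsq(X_src, Y_src, rcond=None)  # shape (d, k)

# Transfer: fit only k coefficients on the learned features.
c_hat, *_ = np.linalg.lstsq(X_tgt @ W_hat, y_tgt, rcond=None)
transfer_err = np.mean((X_test @ W_hat @ c_hat - y_test) ** 2)

# Baseline: fit all d coefficients directly on the 10 target examples.
w_direct, *_ = np.linalg.lstsq(X_tgt, y_tgt, rcond=None)
direct_err = np.mean((X_test @ w_direct - y_test) ** 2)

print(f"transfer error: {transfer_err:.4f}, direct error: {direct_err:.4f}")
```

With 10 examples in 20 dimensions, the direct fit is underdetermined and generalizes poorly, while the transferred features reduce the target task to a well-posed two-parameter problem. Real one-shot learning systems are far more sophisticated, but the underlying logic is the same: learn representations where data is plentiful, then reuse them where it is scarce.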
To gain an intuitive understanding of how these techniques work, let’s consider the fable of the baby and the giraffe.