In an earlier article on deep learning, we discussed how inference workloads – the use of already-trained neural networks to analyze data – can run on fairly cheap hardware, but running the training workload in which the neural network “learns” is orders of magnitude more expensive.
In particular, the more potential inputs you have to an algorithm, the more out of control your scaling problem gets when analyzing its problem space. This is where MACH, a research project authored by Tharun Medini and Anshumali Shrivastava of Rice University, comes in. MACH is an acronym for Merged Average Classifiers via Hashing, and according to lead researcher Shrivastava, “[its] training times are about 7-10 times faster, and … memory footprints are 2-4 times smaller” than those of previous large-scale deep learning techniques.
In describing the scale of extreme classification problems, Medini refers to online shopping search queries, noting that “there are easily more than 100 million products online.” If anything, this is conservative – one data company claimed that Amazon US alone sold 606 million separate products, with the entire company offering more than three billion products worldwide. Another company puts the US product count at 353 million. Medini continues: “A neural network that takes search input and predicts from 100 million outputs, or products, will usually end up with around 2,000 parameters per product. So you multiply those, and the final layer of the neural network is 200 billion parameters. I am talking about a very, very dead simple neural network model.”
At this scale, a supercomputer would probably need terabytes of memory just to store the model. The memory problem gets even worse when you bring GPUs into the picture. GPUs can process neural network workloads faster than general-purpose CPUs, but each GPU has a relatively small amount of RAM – even the most expensive Nvidia Tesla GPUs have only 32 GB of RAM. Medini says: “Training such a model is prohibitive because of the enormous communication between GPUs.”
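The arithmetic behind those figures can be sketched in a few lines. This is only a back-of-the-envelope estimate: the product count, parameters per product, and 32 GB of GPU RAM come from the text above, while the 4 bytes per parameter (32-bit floats) is an assumption, and the function names are made up for illustration.

```python
import math

NUM_PRODUCTS = 100_000_000    # from Medini's example
PARAMS_PER_PRODUCT = 2_000    # from Medini's example
GPU_RAM_GB = 32               # top-end Nvidia Tesla figure quoted above
BYTES_PER_PARAM = 4           # ASSUMPTION: weights stored as 32-bit floats


def final_layer_params(num_outputs, params_per_output):
    """Parameter count of the final layer: one block of weights per output."""
    return num_outputs * params_per_output


def weights_size_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Size of the raw weights in (decimal) gigabytes."""
    return num_params * bytes_per_param / 1e9


def gpus_to_hold(size_gb, gpu_ram_gb=GPU_RAM_GB):
    """Minimum number of GPUs whose combined RAM holds size_gb of weights."""
    return math.ceil(size_gb / gpu_ram_gb)


params = final_layer_params(NUM_PRODUCTS, PARAMS_PER_PRODUCT)
size_gb = weights_size_gb(params)
gpus = gpus_to_hold(size_gb)

print(f"final layer: {params:,} parameters")
print(f"raw weights: {size_gb:,.0f} GB")
print(f"GPUs needed just to hold the weights: {gpus}")
```

Under that assumption the weights alone come to several hundred gigabytes. Training typically also keeps gradients and optimizer state in memory, so the real working set is plausibly a multiple of this, and splitting one layer across that many GPUs is where the communication cost Medini mentions comes from.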
Instead of training on the full 100 million outcomes – product purchases, in this example – MACH divides them into three “buckets,” each containing 33.3 million randomly selected outcomes. MACH then creates another “world,” and in that world the 100 million outcomes are again randomly sorted into three buckets. Crucially, the random sorting is done separately in World One and World Two – they each have the same 100 million outcomes, but their random distribution into buckets is different in each world.
With each world set up, a search is fed to both a “World One” classifier and a “World Two” classifier, each with only three possible outcomes. “What is this person thinking of?” Shrivastava asks. “The most likely class is something that is common between these two buckets.”
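The two-world scheme can be illustrated with a toy sketch. None of this is the researchers' code: the 12-class setup, the `fake_classifier` stand-in for a trained per-world model, and the choice to merge by averaging bucket probabilities (suggested by the "Merged Average" in the name, but not spelled out here) are all assumptions made for illustration.

```python
import random

NUM_CLASSES = 12   # toy stand-in for the 100 million products
NUM_BUCKETS = 3
NUM_WORLDS = 2


def make_world(num_classes, num_buckets, rng):
    """Randomly split the classes into equally sized buckets for one world."""
    order = list(range(num_classes))
    rng.shuffle(order)
    assignment = [0] * num_classes
    for position, cls in enumerate(order):
        assignment[cls] = position % num_buckets
    return assignment


def fake_classifier(true_bucket, num_buckets, confidence=0.8):
    """HYPOTHETICAL stand-in for a trained per-world classifier's output."""
    rest = (1.0 - confidence) / (num_buckets - 1)
    return [confidence if b == true_bucket else rest for b in range(num_buckets)]


def merge_scores(worlds, bucket_probs):
    """Average, per class, the probability of the bucket it sits in in each world."""
    scores = []
    for cls in range(len(worlds[0])):
        total = sum(probs[world[cls]] for world, probs in zip(worlds, bucket_probs))
        scores.append(total / len(worlds))
    return scores


rng = random.Random(0)  # fixed seed so the toy run is repeatable
worlds = [make_world(NUM_CLASSES, NUM_BUCKETS, rng) for _ in range(NUM_WORLDS)]

true_class = 7
bucket_probs = [fake_classifier(world[true_class], NUM_BUCKETS) for world in worlds]

scores = merge_scores(worlds, bucket_probs)
best = max(scores)
candidates = [cls for cls, score in enumerate(scores) if score == best]
print("top-scoring classes:", candidates)
```

The top-scoring classes are exactly those that share a bucket with the true class in both worlds. With only two worlds and three buckets, several classes can tie. Adding worlds shrinks that overlap.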
At this point, there are nine possible outcomes: three buckets in World One times three buckets in World Two. But MACH only needed to create six classes – the three buckets from World One plus the three buckets from World Two – to model that nine-outcome search space. This advantage grows as more “worlds” are created; a three-world approach yields 27 outcomes from just nine created classes, a four-world setup produces 81 outcomes from 12 classes, and so on. “I pay a linear cost and I get an exponential improvement,” says Shrivastava.
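The linear-versus-exponential trade-off is easy to check numerically. The sketch below simply reproduces the counts given above for three buckets per world; the function names are illustrative only.

```python
def classes_trained(buckets, worlds):
    """Classes that must actually be modeled: one set of buckets per world."""
    return buckets * worlds


def outcomes_distinguished(buckets, worlds):
    """Distinct bucket combinations across all worlds."""
    return buckets ** worlds


BUCKETS = 3
for worlds in range(2, 5):
    trained = classes_trained(BUCKETS, worlds)
    outcomes = outcomes_distinguished(BUCKETS, worlds)
    print(f"{worlds} worlds: {trained} classes trained, {outcomes} outcomes")
```

Each extra world adds a fixed number of classes to train but multiplies the number of distinguishable outcomes by the bucket count.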
Better yet, MACH lends itself well to distributed computing on smaller individual instances. The worlds “don’t even have to talk to each other,” says Medini. “In principle, you could train every [world] on a single GPU, which you could never do with a non-independent approach.” In the real world, the researchers applied MACH to a 49-million-product Amazon training database, randomly sorting the products into 10,000 buckets in each of 32 separate worlds. That reduced the required parameters in the model by more than an order of magnitude – and, according to Medini, training the model took both less time and less memory than some of the best reported training times on models with comparable parameters.
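A rough sense of where that reduction comes from can be had by counting output units alone. This is a sketch under a strong assumption: it compares only the number of final-layer outputs and assumes each output costs the same number of parameters in both setups. The actual model sizes depend on layers not described here.

```python
NUM_PRODUCTS = 49_000_000   # Amazon training database, from the text
NUM_BUCKETS = 10_000        # buckets per world, from the text
NUM_WORLDS = 32             # independent worlds, from the text


def output_units_flat(num_products):
    """One output per product in a single conventional classifier."""
    return num_products


def output_units_mach(num_buckets, num_worlds):
    """One output per bucket, summed over all independent worlds."""
    return num_buckets * num_worlds


flat = output_units_flat(NUM_PRODUCTS)
mach = output_units_mach(NUM_BUCKETS, NUM_WORLDS)
reduction = flat / mach  # ASSUMES equal parameter cost per output unit

print(f"flat classifier outputs: {flat:,}")
print(f"MACH outputs in total:   {mach:,} ({NUM_BUCKETS:,} per world)")
print(f"output-unit reduction:   about {reduction:.0f}x")
```

Because each world only has 10,000 outputs of its own, a single world's classifier is small enough to sit on one GPU, which fits Medini's point about training the worlds independently.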
Of course, this wouldn’t be an Ars article about deep learning if we didn’t end it with a cynical reminder of unintended consequences. The unspoken reality is that the neural network isn’t actually learning to show shoppers what they asked for. Instead, it’s learning how to turn searches into purchases. The neural network doesn’t know or care what the person was actually looking for; it only has an idea of what that person is most likely to buy – and without adequate supervision, systems trained to increase those odds can end up suggesting baby products to women who have had a miscarriage, or worse.