Here are my notes from ICML Day 3 (Tuesday).
Lots of interesting tracks (running in parallel) to choose from: Fisher approximations, Continuous optimization, RNNs, Reinforcement learning, Probabilistic inference, Clustering, Deep learning analysis, Game theory, and more.
The day kicked off with the “Test of Time Award” presentation. Each year the committee looks back ~10 years and chooses the paper that has proven to be most impactful. This time it was “Combining Online and Offline Knowledge in UCT” – the paper that laid the foundation for AlphaGo’s success. The original idea of MoGo was to combine Reinforcement Learning with Monte-Carlo Tree Search; AlphaGo added a Deep Learning kick to it. Back in 2007 the authors made bets/predictions about the future of their algorithm, and beating Go’s world champion within 10 years was one of them.
Several policy evaluation approaches were discussed.
Data-Efficient Policy Evaluation Through Behavior Policy Search. The key idea is adapting the behavior policy parameters with gradient descent on the MSE of the importance-sampling estimator. The approach shows especially good results in high-variance settings.
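To make the idea concrete, here is a minimal toy sketch for a one-step (bandit) case with a softmax behavior policy. Everything here is my own simplification, not code from the paper; the useful fact is that the MSE gradient of the importance-sampling estimate has a score-function form, so we can descend it from samples alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
true_rewards = np.array([1.0, 0.2, 0.1])   # unknown to the estimator
pi_target = np.array([0.8, 0.1, 0.1])      # the policy we want to evaluate

theta = np.zeros(n_actions)                # behavior policy logits
lr = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(2000):
    pi_b = softmax(theta)
    a = rng.choice(n_actions, p=pi_b)
    r = true_rewards[a] + rng.normal(scale=0.1)
    g = pi_target[a] / pi_b[a] * r          # single-sample IS estimate

    # MSE gradient has the score-function form -g^2 * grad log pi_b(a);
    # stepping against it adapts the behavior policy to cut variance.
    grad_log = -pi_b
    grad_log[a] += 1.0
    theta += lr * (g ** 2) * grad_log       # i.e. theta -= lr * grad_MSE
```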
Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. A new SWITCH estimator is proposed. For each action there is literally a switch depending on the importance weight: when it is small, the IPS (inverse propensity scoring) or DR (doubly robust) estimator is used; otherwise a model-based estimate is used. Performs quite well in practice.
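A rough sketch of the switching logic as I understood it (simplified IPS variant; the function names and data layout are mine, not the paper's):

```python
def switch_ips(logged, pi_e, pi_b, r_hat, actions, tau):
    """logged: list of (x, a, r) tuples collected under logging policy pi_b.
    pi_e(a, x), pi_b(a, x): action probabilities; r_hat(a, x): reward model;
    tau: importance-weight threshold controlling the switch."""
    total = 0.0
    for x, a, r in logged:
        w = pi_e(a, x) / pi_b(a, x)
        if w <= tau:                      # low-variance region: trust the IPS term
            total += w * r
        # high-variance region: fall back to the model-based term
        total += sum(pi_e(ap, x) * r_hat(ap, x)
                     for ap in actions
                     if pi_e(ap, x) / pi_b(ap, x) > tau)
    return total / len(logged)
```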
This track drives the point that deep models can learn expressive features of nodes in a relational graph.
Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs. A deep RNN learns non-linearly evolving entity representations over time. The network ingests new facts and updates the embeddings of the affected entities. It can predict the time when a fact may occur, and supports prediction over unseen entities.
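A toy sketch of the general idea (my simplification, not Know-Evolve's actual architecture): keep one embedding per entity and, whenever a new timestamped fact (subject, relation, object, time) arrives, push the two affected entities through an RNN cell so their representations evolve.

```python
import torch

dim = 32
n_relations = 10                                    # assumed number of relation types
entity_emb = {}                                     # entity id -> evolving embedding
relation_emb = torch.nn.Embedding(n_relations, dim)
cell = torch.nn.GRUCell(2 * dim + 1, dim)

def observe_fact(subj, rel, obj, t):
    for e in (subj, obj):
        entity_emb.setdefault(e, torch.zeros(dim))
    rel_vec = relation_emb(torch.tensor([rel]))[0]
    # Each affected entity's state is updated from its partner entity,
    # the relation, and the event time.
    for e, other in ((subj, obj), (obj, subj)):
        inp = torch.cat([entity_emb[other], rel_vec, torch.tensor([float(t)])])
        entity_emb[e] = cell(inp.unsqueeze(0), entity_emb[e].unsqueeze(0)).squeeze(0)

observe_fact(subj="Alice", rel=3, obj="Bob", t=0.5)
```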
Recurrent neural networks
Lots of generated music in this track.
Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control. A combination of RNNs with Reinforcement Learning (RL).
Results are demonstrated on generating new music melodies and molecules. RNN #1 is pre-trained on data, and the probability distribution over the next token in the sequence learned by this model is treated as a prior policy. Then RL is applied: RNN #2 is trained to generate new outputs using the prior policy of RNN #1.
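A compressed sketch of the idea as I understood it (my own simplification, not the paper's exact objective): the pre-trained RNN's next-token log-probability acts as a prior, and the RL reward mixes the hand-designed task reward with staying close to that prior.

```python
def kl_control_reward(task_reward, log_p_prior, c=0.5):
    # Higher c puts more weight on the task reward (e.g. music-theory rules);
    # the log-prior term keeps generated sequences close to the training data.
    return c * task_reward + log_p_prior
```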
Deep Voice: Real-time Neural Text-to-Speech.
Five subcomponents were presented (a rough sketch of how they fit together follows the list):
- Segmentation model (finds phoneme boundaries)
- Grapheme-to-phoneme model
- Phoneme duration prediction model (RNN sequence regression)
- Fundamental frequency (F0) prediction model
- Audio synthesis model
400x speedup over the previous implementation (thus, real time).
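Here is the inference-time data flow as I understood the talk; every function below is a dummy placeholder (not the paper's API), only meant to show how the subcomponents connect. The segmentation model is only needed at training time, to find phoneme boundaries for aligning audio with labels.

```python
def grapheme_to_phoneme(text):
    return list(text)                        # placeholder: one "phoneme" per char

def predict_durations(phonemes):
    return [0.1] * len(phonemes)             # placeholder: 100 ms per phoneme

def predict_f0(phonemes, durations):
    return [120.0] * len(phonemes)           # placeholder: flat 120 Hz contour

def synthesize_audio(phonemes, durations, f0):
    return b""                               # placeholder for the vocoder-style model

def deep_voice_tts(text):
    phonemes = grapheme_to_phoneme(text)     # grapheme-to-phoneme model
    durations = predict_durations(phonemes)  # duration model (RNN sequence regression)
    f0 = predict_f0(phonemes, durations)     # fundamental frequency (F0) model
    return synthesize_audio(phonemes, durations, f0)  # audio synthesis model
```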
Evaluation: by humans. If both F0 and duration are synthesized, the score given by people is around 2 (pretty bad, actually). If F0 and duration are taken from the ground truth, the results get much more realistic, with a human score of 3.8+.
DeepBach: a Steerable Model for Bach Chorales Generation. Pretty impressive demos. About 50% of human voters (a mixed audience) couldn't tell whether a chorale was real Bach or a generated one.
For the music lovers out there: they showed how a four-voice chorale is encoded for the NN to learn.
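My rough reconstruction of that kind of encoding (not necessarily the exact scheme from the slides): each of the four voices becomes one token per sixteenth-note time step, with "__" meaning "hold the previous note", plus extra metadata streams such as fermatas.

```python
chorale = {
    "soprano": ["G4", "__", "__", "__", "A4", "__", "B4",  "__"],
    "alto":    ["D4", "__", "__", "__", "D4", "__", "D4",  "__"],
    "tenor":   ["B3", "__", "G3", "__", "F#3", "__", "G3", "__"],
    "bass":    ["G2", "__", "__", "__", "D3", "__", "G2",  "__"],
    # metadata accompanying the notes at every time step
    "fermata": [0, 0, 0, 0, 0, 0, 1, 1],
}
```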
Deep learning analysis
A Closer Look at Memorization in Deep Networks. This work focuses on the difference between learning on noise vs real data. The main takeaways are (nothing really eye-opening, I think):
- DNNs do not just memorize data (phew!)
- Regularization helps reduce memorization
- DNNs learn simple patterns first
Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study.
They took developmental psychology methods that explain how children learn word labels for objects, and applied that analysis to DNNs.
Result: state-of-the-art one-shot learning models trained on ImageNet exhibit a similar bias to that observed in humans: they prefer to categorize objects according to shape rather than color.
Axiomatic Attribution for Deep Networks. The idea is to examine the gradients at inputs obtained by interpolating on a straight-line path between the input at hand and a baseline input (e.g. a black image), and then aggregate these gradients together.
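A minimal PyTorch sketch of that aggregation (my own paraphrase of the method): `model` is whatever classifier you want to explain and `target` is the class index of interest; both are placeholders here.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    total_grads = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)   # point on the straight path
        point = point.clone().requires_grad_(True)
        score = model(point)[..., target].sum()           # class score at that point
        score.backward()
        total_grads += point.grad
    # Average gradient along the path, scaled by the input-baseline difference
    return (x - baseline) * total_grads / steps
```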
Applications are not limited to image classification (though it would definitely be of great use in domains such as healthcare). Code is available.
On Calibration of Modern Neural Networks. The authors highlight a problem with the latest DNNs: they are overconfident when misclassifying; earlier, simpler networks such as LeNet did not have this issue. Factors contributing to such miscalibration are increased network capacity, batch norm, and less regularization (weight decay).
Simple trick: temperature scaling. It works well and does not require architectural changes.
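For reference, a quick sketch of temperature scaling in PyTorch (a generic version, not the authors' code): fit a single scalar T on held-out validation logits, then divide the logits by T at prediction time; the predicted class never changes, only the confidence.

```python
import torch

def fit_temperature(logits, labels, steps=200):
    # logits: [N, C] validation logits; labels: [N] true class indices
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()   # calibrated probabilities: softmax(logits / T)
```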