The last 2 days of the conference were workshops, which honestly had less rock-star content.

Overall, ICML this year was well organized (well, minus the pass holders that emit constant cow-bell-like tinkling) and rich in content. I have not noticed any breakthrough papers, though. Lots of RNNs, LSTMs, language/speech-related work, GANs, and Reinforcement Learning.

Toolset-wise it “feels” like mostly TensorFlow, Caffe, and PyTorch; even Matlab was mentioned a few times.

**Principled Approaches to Deep Learning**

This track was about theoretical understanding of DNN architectures.

__Do GANs actually learn the distribution?__ I personally had higher expectations of this talk. The main point was that yes, it is problematic to quantify the success of a GAN training algorithm, and that mode collapse is a problem. That's pretty much it.

__Towards a deeper understanding of quantized networks__. DNNs are big, and there is a subset of quantized/low-precision networks that are fast (no multiplication), have low storage and power consumption, and thus can be used on low-power devices. During the talk the author explored the nuances of training quantized networks.

Their experiments showed that floating-point networks go from Exploration to Exploitation as the learning rate shrinks: they become really focused on finding the optimum. Networks with Stochastic Rounding (SR), however, tend to stay stuck in the Exploration phase. That is a problem because SGD relies on the Exploitation phase, so a network trained with SR will likely run into convergence issues.
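The stochastic-rounding behaviour itself is easy to sketch. Below is a minimal NumPy illustration (not the speaker's code; the step size and toy data are assumptions): each value rounds up with probability equal to its fractional distance to the upper quantization level, so the error is zero in expectation.

```python
import numpy as np

def stochastic_round(x, step=2**-8):
    """Quantize to a fixed grid with stochastic rounding: round up with
    probability equal to the fractional distance to the upper level,
    so the quantization error is unbiased in expectation."""
    scaled = x / step
    lower = np.floor(scaled)
    frac = scaled - lower                     # distance to the lower level
    rng = np.random.default_rng(0)
    up = rng.random(x.shape) < frac           # round up with prob = frac
    return (lower + up) * step

x = np.random.default_rng(1).normal(size=100_000)
q = stochastic_round(x)
print(abs((q - x).mean()) < 1e-3)   # unbiased in expectation -> True
```

The per-element error is bounded by one quantization step, but unlike nearest rounding it never systematically pulls small gradient updates to zero, which is exactly why its training dynamics differ.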

__SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement__. A vector-themed approach to understanding how a NN trains: neurons are vectors, layers are subspaces. By applying SVD and CCA we can compare the representational similarity of layers.

Some observations:

- Layers converge bottom-up (the ones close to the input solidify first) => lower layers can be frozen earlier in training, which improves generalization
- Interpretability: by applying SVCCA to a layer and the output we can measure sensitivity to different classes throughout the network.
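The core mechanics can be sketched in a few lines of NumPy (a simplified illustration of the idea, not the authors' implementation; the variance threshold and toy data are assumptions): SVD-reduce each layer's activation matrix, then read the canonical correlations off the cross-product of the orthonormalized subspaces.

```python
import numpy as np

def svd_reduce(acts, keep=0.99):
    # acts: (neurons, datapoints); keep enough singular directions
    # to explain `keep` of the variance
    acts = acts - acts.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(acts, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
    return s[:k, None] * vt[:k]

def cca_correlations(a, b):
    # after orthonormalizing each subspace, the singular values of the
    # cross-product are the canonical correlations
    qa = np.linalg.svd(a, full_matrices=False)[2]
    qb = np.linalg.svd(b, full_matrices=False)[2]
    return np.linalg.svd(qa @ qb.T, compute_uv=False)

# sanity check: a layer and an invertible linear mix of it span the same
# subspace, so the canonical correlations should all be ~1
rng = np.random.default_rng(0)
layer = rng.normal(size=(10, 500))
mixed = rng.normal(size=(10, 10)) @ layer
corrs = cca_correlations(svd_reduce(layer), svd_reduce(mixed))
print(corrs.mean() > 0.98)   # -> True
```

The invariance to invertible linear transforms is the point: SVCCA compares what a layer represents, not the particular basis its neurons happen to use.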

__The Sum-Product Theorem: A Foundation for Learning Tractable Models__. The 15-page masterpiece by A. Friesen and P. Domingos is here.

The paper describes a unifying framework for learning tractable models for a variety of problems, including optimization, satisfiability, constraint programming, inference, etc. The unification is achieved by considering sum-product functions over a semiring of sum and product operators. When the product operator is decomposable, it is shown that any problem that boils down to summing out the variables of the sum-product function can be solved in time linear in the size of the representation of that function.
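The semiring idea can be made concrete with a toy example (an illustrative sketch with made-up weights, not from the paper): the same decomposable sum-product expression answers different queries depending on which pair of operators you plug in.

```python
# A tiny sum-product function over two binary variables x, y:
#   F = (w1*[x=0] + w2*[x=1]) * (w3*[y=0] + w4*[y=1])
# The product is decomposable (factors share no variables), so summing
# out all variables is linear in the size of the expression -- and the
# same structure answers different queries under different semirings.

def sum_out(add, mul, wx, wy):
    sx = add(wx[0], wx[1])   # "sum out" x under the chosen semiring
    sy = add(wy[0], wy[1])   # "sum out" y
    return mul(sx, sy)       # decomposable product

wx, wy = (0.3, 0.7), (0.6, 0.4)

# sum-product semiring (+, *): partition function / marginal inference
z = sum_out(lambda a, b: a + b, lambda a, b: a * b, wx, wy)

# max-product semiring (max, *): most probable assignment (optimization)
best = sum_out(max, lambda a, b: a * b, wx, wy)

print(z)     # -> 1.0   (the weights form a normalized distribution)
print(best)  # -> 0.42  (probability of the best assignment: x=1, y=0)
```

Swapping in other semirings (e.g. OR/AND for satisfiability, min/+ for constraint optimization) gives the other problem classes the paper covers, all with the same linear-time evaluation.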

__LibSPN__. A library for learning and inference with Sum-Product Networks (SPNs), based on TensorFlow (multi-GPU).

SPNs are viewed as promising networks, combining the benefits of Deep Learning and probabilistic modeling. SPNs learn from high-dimensional, noisy data.
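As a toy illustration of what an SPN computes (the structure and weights below are made up for illustration; LibSPN itself builds such networks as TensorFlow graphs): sum nodes are mixtures, product nodes combine independent scopes, and a single bottom-up pass yields an exact joint probability.

```python
def spn_joint(x1, x2):
    """Evaluate a minimal SPN over two binary variables bottom-up:
    leaves -> product nodes (independent scopes) -> root sum node (mixture)."""
    def bern(p, x):                           # leaf: P(X=x) under Bernoulli(p)
        return p if x == 1 else 1 - p
    comp_a = bern(0.9, x1) * bern(0.2, x2)    # product node, component A
    comp_b = bern(0.1, x1) * bern(0.7, x2)    # product node, component B
    return 0.6 * comp_a + 0.4 * comp_b        # sum node with mixture weights

# validity check: the probabilities of all states sum to 1
total = sum(spn_joint(a, b) for a in (0, 1) for b in (0, 1))
print(round(total, 6))  # -> 1.0
```

Marginals and MPE queries reuse the same structure (set a leaf to 1 to marginalize a variable, or swap sums for maxes), which is where the "tractable probabilistic modeling" appeal comes from.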

Promising results: speech/language modeling, robotics, image classification and completion.

**Visualization for Deep Learning**

A few interesting gradient-based methods were discussed, mostly related to image classification/segmentation/captioning.

__Interpreting Deep Visual Representations.__ Goal: quantify the interpretability of latent representations of CNNs.

Solution: evaluate units for semantic segmentation (the paper and blog have really interesting charts). The authors created the Broden dataset, heavily annotated (63K+ images, 1K+ visual concepts); most examples are segmented down to the pixel level, except textures.

The network dissection method evaluates every individual convolutional unit in a CNN as a solution to a binary segmentation task for every visual concept in the Broden dataset.

The interpretability of units within all the convolutional layers is analyzed: color and texture concepts dominate at lower layers, while more object and part detectors emerge at higher layers.
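The per-unit scoring reduces to an intersection-over-union between two binary masks. A minimal NumPy sketch of that step (an illustration of the idea with a toy activation map, not the authors' Caffe code; the threshold is an assumption, where the paper derives it from activation quantiles):

```python
import numpy as np

def unit_concept_iou(activation, concept_mask, threshold):
    """Score one conv unit against one visual concept, network-dissection
    style: binarize the unit's (upsampled) activation map at a threshold
    and compute IoU with the concept's ground-truth segmentation mask."""
    unit_mask = activation > threshold
    inter = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return inter / union if union else 0.0

# toy example: a unit that fires exactly on the concept region scores 1.0
act = np.zeros((8, 8)); act[2:6, 2:6] = 5.0
mask = np.zeros((8, 8), dtype=bool); mask[2:6, 2:6] = True
print(unit_concept_iou(act, mask, threshold=1.0))  # -> 1.0
```

A unit is then labeled with whichever Broden concept gives it the highest IoU, and a layer's interpretability is summarized by how many units exceed a score cutoff.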

Network architectures are analyzed.

The authors also explore how training conditions like the number of iterations, dropout, and batch norm affect the representation learning of neural networks.

A nice video shows how emergent concepts appear while training a model.

Code (Caffe).

__Grad-CAM: Visual explanations from NNs__ (paper, blog, cool demos). Approach: use the gradients flowing into the final convolutional layer to produce a localization map highlighting the regions of the image that are important for predicting the concept. Overall a known approach (based on Guided Backprop), but with a more refined technique and extensive human-based studies.
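The computation is compact enough to sketch in NumPy (a minimal illustration of the Grad-CAM recipe on toy tensors, assuming the feature maps and their gradients have already been extracted from a network):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM sketch: weight each last-conv channel by the global average
    of the class-score gradient flowing into it, sum the channels, ReLU.

    feature_maps: (C, H, W) activations of the last conv layer
    gradients:    (C, H, W) d(class score)/d(feature_maps)
    """
    weights = gradients.mean(axis=(1, 2))              # alpha_k: GAP of grads
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted channel sum
    return np.maximum(cam, 0)                          # ReLU keeps positive evidence

# toy run: two channels, gradients favour channel 0
feats = np.stack([np.ones((4, 4)), -np.ones((4, 4))])
grads = np.stack([np.full((4, 4), 0.5), np.zeros((4, 4))])
cam = grad_cam(feats, grads)
print(cam.shape)  # -> (4, 4)
```

The resulting coarse map is upsampled to image size and, in the full method, multiplied element-wise with Guided Backprop to get the high-resolution "Guided Grad-CAM" visualizations.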

Image captioning explanations were shown.

The attention-map analysis for Visual Question Answering was interesting.

Negative explanations highlight support for the regions that would make the network predict a different class.

Code (Caffe).

**Time Series**

During this track it was emphasized that time series are a super popular practical application of ML (finance, economics, climate, IoT, web, healthcare, energy, astrophysics, traffic, etc.).

2 types of approaches:

- Extensive feature engineering + classical time-series ML algorithms
- When a huge amount of homogeneous time series is available, DNNs can be applied (and the laborious feature engineering can be omitted)
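The first, classical route can be sketched quickly (a minimal illustration on synthetic data; the lag count and the least-squares regressor are arbitrary choices standing in for real feature engineering and a real model):

```python
import numpy as np

def lag_features(series, n_lags=3):
    """Turn a series into a supervised table of lagged values, so any
    standard regressor can be fit on it (here: plain least squares)."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# toy AR(1) series: s[t] = 0.8 * s[t-1] + noise
rng = np.random.default_rng(0)
s = np.zeros(200)
for t in range(1, 200):
    s[t] = 0.8 * s[t - 1] + rng.normal(scale=0.1)

X, y = lag_features(s)
X1 = np.column_stack([X, np.ones(len(X))])       # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # fit; coef[2] ~ 0.8 (lag 1)
print(X.shape, y.shape)  # -> (197, 3) (197,)
```

Real pipelines add calendar features, rolling statistics, holiday flags, and so on; the point of the second route is that with enough homogeneous series a DNN can learn such features itself.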

__Visualizing and forecasting big time series data__. The talk was presented by Rob Hyndman (website), author of many time-series-related books and R packages. Time series visualization, automatic forecasting, and hierarchical and grouped time series were discussed.

The following R packages were used:

- forecast
- anomalous
- hts

A secret link to v2 of the free book “Forecasting: Principles and Practice” is here.

PS: baby kangaroo photo as a bonus 😉