NIPS 2017 — notes and thoughs

Last week Long Beach, CA was hosting annual NIPS (Neural Information Processing Systems) Conference with record breaking (8000+) number of attendees.  This conference is consider once of the biggest events in ML\DNN Research community.

Below are thoughts and notes related to what was going on at NIPS. Hopefully those brief (and sometimes abrupt) statements will be intriguing enough to inspire your further research ;).

Key trends

    1. Deep learning everywhere – pervasive across the other topics listed below. Lots of vision/image processing applications. Mostly CNNs and variations thereof. Two new developments: Capsule Networks and WaveNet.
    2. Reinforcement Learning – strong come-back with multiple sessions and papers on Deep RL and multi-arm bandits.
    3. Meta-Learning and One-Shot learning are often mention in Robotics and RL context.
    4. GANs – still popular, with several variations to speed up training / conversion and address  mode collapse problem. Variational Auto-Encoders also popular.
    5. Bayesian  NNs are area of active research
    6. Fairness in ML – Keynote and several papers on dealing with / awareness of bias in models, approaches to generate explanations.
    7. Explainable ML — lots of attention to it.
    8. Tricks, approaches to speed up SGD.
    9. Graphic models are back! Deep learning meets graphical probabilistic modeling.

Thoughts on Interpretability

    1. Interpretability vs justification: Are we seeking to understand the real reasons why (or why not) a prediction was made, or merely seeking to justify the prediction and be able to sell its reasonability?
      • Most work on interpretability tries to approximate the prediction outcome to highlight some features that were likely responsible for the prediction, i.e., feature selection with post-prediction explanations.
      •  Interpretability of image models highlights image segments and/or derived maps (heat maps, differential features, ..) that were likely distinguishable for the outcome given.
    2. Interpretability and fairness are inherently related: How biased are the “explanations” given for a prediction? Are the explanations self-reinforcing a model’s implicit/explicit biases? Does interpretability help or hurt the quest for fairness in ML?
    3. Interpretability is still limited to a unidirectional explanation, i.e., assuming that the explanation will be consumed (or rejected) in a decision-making process assuming a uniform interpretation of the “interpretable explanation”. The state-of-the-art does not account for diversity of interpretations of the interpretable explanation, nor does it account for the dynamics resulting from human’s reactions and/or adaptations to a prediction + explanation. Model interpretability will be a constantly evolving process due to the human’s perception, adaptation and reactions to the models, similar to the evolution in Web search and social media. ML models are new agents injected in the workplace, society, community,  day-to-day life, etc., where they will be subjected to and expected to react to similar dynamics exhibited by their human agent counterparts under a diversity of circumstances.
    4. Interesting talk by Kilian Weinberger about “The (hidden) Cost of Calibration”. It discusses the positive effects of calibration and its potential impact on interpretability, as well as notions of fairness/unfairness and dangerous side effects. Kilian also shows how Platt’s good old SVM calibration can be applied to the prior-to-last layer weights in deep networks to produce calibrated predictions and overcome model over-confidence.
    5. FICO is introducing an explainability challenge aiming to produce inherently interpretable credit risk scoring models.

Overview of talks

Below are notes that give quick gist about what a particular session was about.

A zip archive of all NIPS 2017 papers is available here,  good luck with dig into details ;).

Session Reinforcement Learning for the People
Most impressive breakthrough from the session Lots of applications  that could benefit from RL.
Architecture and tech details Robust and Efficient Transfer Learning with Hidden-Parameter Markov Decision Processes:  optimal policies are adapted to subtle variations within tasks in an efficient and robust manner. Bayesian Neural Network both learns the common transition dynamics for a family of tasks and models how the unique variations of a particular instance impact the instance’s overall dynamics.

Sample Efficient Policy Search for Optimal Stopping Domains: can leverage full trajectories to retrospectively evaluate  alternate optimal stopping policies; substantial increase in data efficiency.

Deep RL Transfer: find good shared representations, fast transfer by encouraging shared representations learning across tasks.

Inverse Reward Design —  alleviate negative side effects of misspecified reward functions and mitigate reward hacking.


Application (potential) in industry Shopping optimization (buying airline ticket).

Tutoring games at school.

Siemens for industrial environments (“A Benchmark Environment Motivated by Industrial Control Problems”, git here)

Medical: prosthesis

Other thoughts Talk gave RL generic overview, the RL for and by the people were discussed.

Here are Stanford slides on “Human in loop RL“, that’s been partially used in the talk.

Tutorial (section 4) has info on RL by people.

Interesting problem on building “Beyond Expectations” systems: policy must be risk sensitive and robust and the same time; safe exploration should be permitted (robotics, self-driving cars).

Counterfactual Estimation: active area of research, lots of focus on this in RL works (sequential decision making).

Offline policy evaluation across representations with applications to educational games — policy developed for tutoring game by RL was better than policy developed by human.

Humans provide demonstrations — enormously influential in robotics; recent tutorial .

Humans teaching agents needs prep (learn how to do it right).


Session Fairness in ML
Most impressive breakthrough from the session We need to be careful when building models for  fairness-sensitive domains (like credit, education, employment, housing, marketing). See legally protected classes.

Observational criteria can help discover discrimination, but are insufficient on their own.

Causal viewpoint can help articulate problems, organize assumptions.


Architecture and tech details How machines learn to discriminate

·        Skewed sample

·        Tainted examples

·        Limited features

·        Sample size disparity

·        Proxies

B, Selbst (2016)

Attacking discrimination with smarter machine learning — threshold classifier, how to turn an unfair classifier into a fairer one.

Algorithmic decision making and the cost of fairness — Neither calibration nor equality of false positive rates rule out blatantly unfair practices

Avoiding Discrimination through Causal Reasoning – Consider proxies instead of underlying sensitive attributes.

Application (potential) in industry ML in credit, education, employment, housing, marketing verticals.
Other thoughts Slides for the talk:

Discrimination is domain and feature specific.

The incidence and persistence of discrimination: 1) Callback rate 50% higher for applicants with white names than equally qualified applicants with black names Bertrand, Mullainathan (2004) 2) No change in the degree of discrimination experienced by black job applicants over the past 25 years Quillian, Pager, Hexel, Midtbøen (2017)


Answer to substantive social questions not always provided by observational data, for example the two scenarios below admit identical joint distributions:



==>  motivates causal reasoning

Session Powering  the next 100 years/John Platt
Most impressive breakthrough from the session Google AI does really great investments on tacking global problems.
Overview We will need 0.2 Yottajoules (0.2 * 10^24 Joules) over next 100 years, 5x amount of energy consumed by humans so far. Fossil fuel does not scale. Electricity generation is limited by economics . — modeling e2e system costs.

==> Need to change the game: Fusion energy for example.

Google AI collaborates with TAE, helps them on exploration and inference.

Application (potential) in industry Energy sector
Session Reprogramming Human Genome
Most impressive breakthrough from the session ML helped to find super effective drug for Spinal Muscular Atrophy — SPINRAZA.

Current drug development process in ineffective bugdet wise and ML can change that by enabling finding drugs that are more effective and doing that in shorter time (speed up reesearch and clinical trials).

Architecture and tech details Use CNNs (not super complex architecture) to find pathlogic mutations.

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.


What data is used


Application (potential) in industry Medical
Other thoughts Presenter several times mentioned that Biology is too complex for researchers to understand fully.

Described methods genome reprogramming deal with Exons (aka print statements, end up in protein) and Introns(control how Exons are put together). — using AI to build new universe of digital medicine.

SPINRZA: mutation in Exon (C->T) causes exclusion of this Exon from protein; SPINRAZA modifies sequence to get this EXON included.

Testing compounds in cloud lab: Python script with experiment config tells robots how to conduct the experiment.

Earning trust and thoughts on Explainability: presenter considers “black box ML” problem to be red herring  sometimes. Groundbreking results and awesome perf results that ML approach  is super helpful with earning trust from stakeholders.


Session Test of time award:  Random Features for Large-Scale Kernel Machines (NIPS 2007)
Most impressive breakthrough from the session Presenter made conversial statement that current state of ML is similar to Alchemistry — it’s works and often useful but people do not always fully understand how it works. DNN tools are fragile.

Advances in “black-box” and more difficult to interpret algorithms come at a price. Need to reinstate rigor in scientific methods. Simplicity does not imply less power.

Overview Gist of the paper that got award: most of the kernels used in 10 years ago could be approximated as straight sum of product kernel functions => approximation gives linear combination with less coefficients to solve for and that  was what optimizers at that time could handle.

Do not underestimate the effect of compounded errors/approximations on model training (e.g., SGD computation on different hardware)


Session Robust Optimization for Non-Convex Objectives
Most impressive breakthrough from the session Potentially promising approach to deal with noisy data (image related).
Architecture and tech details Link to paper. Goal is to find a minimax solution that optimizes in the worst case over a given family of functions.

Example: character classification over images that are subject to various distortions(patterned backgrounds or low-resolution, or rotations). Instead of training a separate classifier for each possible scenario this approch aims to optimize performance in the worst case over different forms of corruption. Those corruptions (functions)are provided to the trainer as black-boxes.


What data is used MNIST
Application (potential) in industry Robust neural network training. Presenters demoed character classification task with adversarial distortion (corrupted, noisy data)
Other thoughts All demos were MNIST related (basic).


Session The Trouble with Bias
Most impressive breakthrough from the session Author showed overwhelming example of biases we come across in Search, Recommender System, Computer vision and other ML related scenarios.

Biases reflect our culture and history of the society. Datasets reflect who is in power. Datsets get outdated.

Overview Bias\fairness theme is often mentioned at the conference.


Harms of allocation — we’re  here now.

— stereotyping,  race related issues with computer vision, denigration (jews, skin color), underrepresentation (women execs)

Harms of representation — what we’re missing.

Celebrity face  dataset: 70% men, 80% of those are white man,  the most popular person is J. Bush (dataset was built of news articles).

What can we do?

– Forensics of the data and it’s lifecycle.

– Working with people who are not in our group

–  Be mindful of ethics of classification

Video of the talk here.

Application (potential) in industry Well, everyone where personal data (including computer vision) is involved.
Session Streaming Weak Submodularity: Interpreting Neural Networks on the Fly
Most impressive breakthrough from the session Much faster (10x?) than LIME
Architecture and tech details Paper is here. Git is here.

Segment an image into N superpixels. Then, for a subset S of those regions create a new

image that contains only these regions and feed this into the black-box classifier. For a given model M, an

input image I, and a label L1 we ask for an explanation: why did model M label image I with label L1.


Application (potential) in industry Computer vision.
Session A Unified Approach To Interpreting Model Predictions
Most impressive breakthrough from the session Claims to combine the best of 6  model explainability approaches. Has integration with xgboost.

Show improved computational performance and/or better consistency with human intuition than

previous approaches.

Architecture and tech details Paper is here.  Git is here. , SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction.
What data is used For computer vision: MNIST
Other thoughts Feature independence is assumed.


Session Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Most impressive breakthrough from the session Combines object detection, image semantic segmentation, and word embeddings to capture spatial-image relationships and provide interpretability. Searches all possible bounding boxes efficiently.

Approach outperforms current methods  accuracy by 3%+.

Architecture and tech details Project page here.


Words priors: the average bounding box location per word.

Ws,c — connects word s  to and image concept c.

Intuition: if word s and concept c  are related , then |Ws,c| is large.


Φc — accumulated value withing bouding box y for image concept c.

Learning params: Structured SVM

What data is used Flickr 30k Entities and the ReferItGame database
Application (potential) in industry Joint understanding of language and image.
Other thoughts Success case:


Interpreting learned params — word-concept relationships |Ws,c|. Concepts are from object detection.


Grounding image and text concepts is bi-directional: Ground text to image concepts (this paper) or ground image features to text concepts (VQA paper above). Paper compares to other techniques, including CCA (note to self: check CCA previous work and possibility of bi-directional concept grounding)

Session Label Efficient Learning of Transferable Representations across Domains and Tasks
Most impressive breakthrough from the session Learns a representation transferable across different domains and tasks in a label efficient manner.

Model is simultaneously optimized on labeled source data and unlabeled or sparsely labeled data in the target domain.

Architecture and tech details  transfRep
What data is used Transfer learning SVHN 0-4 –> MNIST 5-9

Transfer learning ImageNet object recognition –> video action recognition

Session Towards Accurate Binary Convolutional Neural Network (ABC-Net)
Most impressive breakthrough from the session Authors claim that this is the first time a binary neural network achieves prediction accuracy

comparable to its full-precision counterpart on ImageNet.

Architecture and tech details Paper is here.

Binaries NNs (BNNs) have extreme case of quantization: binary weight  and activation +-1 using sign() function. Competitive on MNIST, falls apart on ImageNet (18% accuracy loss).

Fix: approximate weights and activations more precisely.

1.      Approximate full-precision weights with the linear combination of multiple binary weight bases. The weights values of CNNs are constrained to {−1, +1}, which means convolutions can be implemented by only addition and subtraction (without multiplication), or bitwise operation when activations are binary as well. Paper shows 3∼5 binary weight bases are adequate to well approximate the full-precision weights.

2.      Introduce multiple binary activations. Employing five binary activations reduces the Top-1 and Top-5 accuracy degradation caused by binarization to around 5% on ImageNet compared to the full precision counterpart.


What data is used ImageNet
Application (potential) in industry Computer vision on smaller devices (drones, mobile)
Other thoughts Main motivation for the paper was to run CNNs on DJI Drones, that have limited resources in computation and power.
Session The Unreasonable Effectiveness of Structure
Most impressive breakthrough from the session Key question of the talk: Is there structure in the problem and how can I exploit it?

PSL framework is introduced

PSL has produced state-of-the-art results in many areas spanning natural language processing, social-network analysis, and computer vision.

Exploiting structures in inputs and outputs of ML models

·        Data is multi-modal, multi-relational, spatio-temporal

·        Dependencies and structure in outputs are often ignored, thereby making incorrect independence assumptions.

Architecture and tech details Probabilistic Soft Logic (PSL) is a machine learning framework for developing probabilistic models.

PSL Program = Rules + Input DB

PSL program:


Application (potential) in industry Natural language processing, social-network analysis, computer vision (image reconstruction, activity recognition in videos), entity resolution, recommendation systems, knowledge discovery (graph construction, knowledge completion), debate stance classification,  trust relationships.
Other thoughts Structured prediction patters:

·        Collective Classification

Votes(X,P) & Friends(X,Y) -> Votes(Y,P)

·        Link Prediction

Likes (User, Item1) &SimilarItem(Item1, Item2) -> Likes(User, Item2)

·        Entity Resolution

Determine which nodes refer to same underlaying entity


Dangers of ignoring structure:

·        Privacy:

1.      Many approaches consider only individuals’ attribute data

2.       Don’t take into account what can be inferred from relational context


1.      Impartial decision making

2.      Need to take into account structural patterns

Opportunity for ML methods that can mix:

·        Structured and unstructured approaches

·        Probabilistic & logical inference

·        Data-driven &knowledge driven modelling

Session Engineering and Reverse-Engineering Intelligence Using Probabilistic Programs, Program Induction, and Deep Learning
Most impressive breakthrough from the session Deep integration of probabilistic programming with deep learning to capture “common sense” inferences, inverse planning, inverse graphics (inferring 3d scene structure from 2d images)
Architecture and tech details Many languages for probabilistic modeling. Tutorial uses Venture.
Application (potential) in industry Robotics, computer vision (scene understanding), interactive (thinking) systems
Other thoughts Got AI tools and techniques, but not real AI.

AI and robotics hasn’t reach intelligence of an 18-months old baby yet.

Engineering common sense using probabilistic models.


Session Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
Most impressive breakthrough from the session A method to overcome limitations of temporal ensembling. Averages model weights instead of label predictions. Achieves good test results with fewer labeled examples. Works with residual networks too.
Overview Semi-supervised learning using two models: student and teacher, that improve each other during training. Target-generating teacher model. Improves results on CIFAR-10 and ImageNet. Can be potentially combined with a virtual adversarial technique to further improve results by improving target generation.
What data is used CIFAR-10 and ImageNet


Session Attention is all you need
Most impressive breakthrough from the session Proposes a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
Architecture and tech details
Application (potential) in industry Machine translation Potentially any application involving sequence learning, assuming that components are independent


Session A simple neural network module for relational reasoning
Most impressive breakthrough from the session Introduces Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning.
Architecture and tech details  relatReas
What data is used CLEVR, Sort-of-CLEVR, bAbl A simulated dataset of dynamic physical systems
Application (potential) in industry Visual question answering

Text-based question answering

Potentially any application that requires learning relations between features (uni or multi-modal features)

Session Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Most impressive breakthrough from the session Technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model.
Architecture and tech details  deepVoice
What data is used VCTK and Audiobooks
Application (potential) in industry Single and multi-speaker text-to-speech systems Natural sounding digital assistants and animated audiobook readings
Other thoughts Shows that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Session Modulating early visual processing by language
Most impressive breakthrough from the session Visual question answering. E.g.: Answer question about an image “is the umbrella upside down?”

Modulates entire visual processing by a linguistic input.

Architecture and tech details  visproc
What data is used VQA and GuessWhat?!
Application (potential) in industry Visual question answering
Other thoughts Inspiration related to another paper which grounds image-based features using text concept words: Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Session TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
Most impressive breakthrough from the session Speed up to up to 1.5-2x for distributed training.

Gradient is quantized to three levels {-1, 0,1}

Architecture and tech details Git repo and slides.

Only exchange gradients.

Gradient quantization reduces communication in both direction.


Application (potential) in industry Distributed DNN training in cloud.
Other thoughts TernGard give higher speed up when:

·        Using more workers

·        Using smaller communication bandwidth

·        Training DNN with more fully connected layers


Session Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Most impressive breakthrough from the session Investigate different techniques to train models in the large-batch regime and present a novel algorithm named “Ghost Batch Normalization” which enables significant decrease in the generalization gap (= persistent degradation in generalization performance)  without increasing the number of updates.
Architecture and tech details Paper here. Git here.

·        Experiments show noticeable generalization gap between small and large batch (Resnet on Cifar10: 92.8% vs 86.1%).

·        After adapting learning rate results improved but gap remained (92.8 vs 89.3)

·        Ghost batch norm: mimic small batch statistic; also reduces communication bandwidth (92.8 vs 90.5). Acquiring the statistics on small virtual (ghost) batches instead of the real large batch helps to reduce the generalization error (Note: use  full batch statistic  for the inference phase)

·        Then train longer in plateau region (Regime Adaptation)-> generalization gap vanishes (92.8 vs 93.7).


What data is used MNIST, CIFAR-10, CIFAR-100 and ImageNet
Application (potential) in industry Significant speedup are possible:

“Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” (paper) uses 8K batch size,

adopts a linear scaling rule for adjusting learning rates as a function of minibatch size and develops a new warmup scheme that overcomes optimization challenges early in training.

Other thoughts Motivation: Can we improve batch size and improve parallelization?
Session Deep Learning for Robotics
Most impressive breakthrough from the session Recurring theme: Meta-Learning. Enables discovering algos that are powered by data\experience (vs just human integrity)
Overview Current areas of research\not solved yet:

·        Faster RL

·        Long horizon reasoning

·        Trackability

·        Lifelong learning

·        Leverage simulation

·        Maximize Signa Extracted from Real World Experience

Overview of Deep RL success stories.

See list of Berkeley papers here.

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML, paper)  has been presented at ICML and mentioned here again several times.

Lots of work on applying meta-learning to Optimization (task distribution: different NNS, weight inits, diff loss funcs).

Lots of work on applying meta-learning to Classification (task distribution: diff classification datasets).

Lots of work on applying meta-learning to RL (task distribution: different envs)


Session Learning State Representations
Most impressive breakthrough from the session “Shallow learning, deep representations” – sophisticated learning of problem representation makes learning task simple
Overview Working on AI that is more like human intelligence (crossing street vs chess). “simple” tasks with little data.

Key question: How do we learn a representation of the world that supports decision making?

Hypothesis: people group similar environments as originating from same cause: Bayesian inference with infinite capacity prior – latent causes, number unbounded.

Real world learning is clustering with growing number of clusters

Orbitofrontal cortex in humans implicated in making “good life choices”, but can still function normally.

Hypothesis is that it encodes hidden causes.

What data is used Experiments: guessing number of circles; direction of line on screen; Face/House-Young/Old task


Session AlphaZero – mastering games without human knowledge
Most impressive breakthrough from the session AlphaGo Zero learns no human data – just from rules + self-play, starting random.

No human features – just raw board

Single policy/value network

No Monte-Carlo – just NN evals.

“Less complexity => more generality”

Search-based policy iteration

1.      Search-based policy improvement (instead of greedy)

2.      Search-based policy evaluation (this allows evaluation of improved policy, giving even better training signal)

Now recently applied to Chess. Despite structural differences, used exact same model for chess. Within 4 hrs outperformed world champion.

Specialization hurts generalization ability.

Architecture and tech details MCTS = Monte Carlo Tree Search works better than Alpha-Beta search.
Other thoughts Advance over AlphaGo.

Go is 3k years old, 40M players, 10^70 positions

Intractible to search because of ^200 branching

2 networks:

1.      Move – generates distro over moves – reduce breadth

2.      Value – predicts winner given position – reduce depth

During world champion game saw “systematic delusion”. Fixed in new version using “principled RL”

Autonomously discovered well known openings used by humans. Also discovered new patterns adding to human knowledge.

Chess – current state-of-the-art superhuman. Alpha-beta search based on human expertise.


1.      Long list of specialized features

That be all what have been captures which is tiny part of what’s being covered at the conference :).



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s