CVPR 2018 — recap, notes and trends

This year's CVPR (Computer Vision and Pattern Recognition) conference accepted 900+ papers. This blog post gives an overview of some of them: notes captured together with my amazing colleague Tingting Zhao.

The main conference ran for three days and had the following presentation tracks:

  • Special session: Workshop Competitions
  • Object Recognition and Scene Understanding
  • Analyzing Humans in Images
  • 3D Vision
  • Machine Learning for Computer Vision
  • Video Analytics
  • Computational Photography
  • Image Motion and Tracking
  • Applications

Below are some trends and topics worth mentioning:

  • Video analysis: captioning, action classification, predicting in which direction a person (pedestrian) will move.
  • Visual sentiment analysis.
  • Agent orientation in space (a room), virtual room datasets: topics related to enabling machines to perform tasks.
  • Person re-identification in video feeds.
  • Style transfer (GANs) is still a theme.
  • Adversarial attack analysis.
  • Image enhancement: removing raindrops, removing shadows.
  • NLP + computer vision.
  • Image and video saliency.
  • Efficient computation on edge devices.
  • Weakly supervised learning for computer vision.
  • Domain adaptation.
  • Interpretable machine learning.
  • Applications of reinforcement learning to CV: optimizing the network, the data, and the NN learning process.
  • Lots of interest in the data-labeling area.

The notes below are loosely grouped into subsections.

Here is a nice compilation of person re-identification papers (in Mandarin; online translators do an OK job 🙂).

For more info, please dig into the presentations and workshops archive.

Videos from sessions are here.

Session Title Special Session: Workshop Competitions
Most impressive breakthrough from the session First time at CVPR.

Many researchers and ML practitioners overfit when participating in Kaggle-like competitions, and that gets caught by the test dataset. How many published papers and solutions deployed to production are in fact overfitting?

Architecture and tech details Competitions are very powerful at extracting all the signal from a dataset.

Competitions show what is possible.

Competitions draw focus to an area.

Limitations:

  • poor design (wrong questions) makes participants optimize for the wrong things
  • they do not allow iterating on the problem
  • significant time spent on minimal gain
  • winning models are often heavy ensembles
  • not a fit for all problems

Scene Analysis, Question Answering

Session Title Embodied Question Answering
Most impressive breakthrough from the session A step towards embodied agents that can SEE, TALK, ACT and REASON.
Architecture and tech details Paper here.

Model for vision:

CNN as encoder, multi-task pixel-to-pixel prediction.


Model for Language:

2-layer LSTMs

Model for Navigation:

‘Planner’ selects actions (forward, left, right); ‘controller’ executes these primitive actions a variable number of times (1, 2, ...) and returns control back to the planner.

~ hierarchical RL.


Answering model:


Examines the last 5 frames, computes an attention-pooled visual encoding based on image-question similarity, combines this with an LSTM encoding of the question, and outputs a softmax over the space of 172 possible answers.
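To make the answering step concrete, here is a minimal PyTorch sketch of such an attention-pooled answering head. The dimensions, embedding sizes and layer names are my assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAnswerer(nn.Module):
    """Sketch of an EQA-style answering head: attention-pool the last few
    frame encodings by similarity to the question, then classify the answer."""
    def __init__(self, img_dim=128, q_dim=128, num_answers=172):
        super().__init__()
        # img_dim must equal q_dim here so the dot-product similarity is defined
        self.q_encoder = nn.LSTM(64, q_dim, batch_first=True)   # word embeddings -> question encoding
        self.classifier = nn.Linear(img_dim + q_dim, num_answers)

    def forward(self, frame_feats, q_embeds):
        # frame_feats: (B, 5, img_dim) CNN encodings of the last 5 frames
        # q_embeds:    (B, T, 64) question word embeddings
        _, (h, _) = self.q_encoder(q_embeds)
        q = h[-1]                                                # (B, q_dim)
        scores = torch.bmm(frame_feats, q.unsqueeze(2)).squeeze(2)   # image-question similarity
        attn = F.softmax(scores, dim=1)                          # (B, 5)
        pooled = (attn.unsqueeze(2) * frame_feats).sum(dim=1)    # attention-pooled visual encoding
        return F.log_softmax(self.classifier(torch.cat([pooled, q], dim=1)), dim=1)
```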

What data is used EQA Dataset (questions in environments): RGB images, semantic segmentation masks, depth maps, top-down maps.

12 room types (kitchen, living room, …)

50 object types

Programmatically generated questions (similar to CLEVR)

https://github.com/facebookresearch/House3D

Application (potential) in industry A physical agent capable of taking actions in the world and talking to humans in natural language.
Other thoughts Project page: https://embodiedqa.org/

Code https://github.com/facebookresearch/EmbodiedQA

Session Title Learning by Asking Questions (LBA)
Most impressive breakthrough from the session LBA / interactive agents: decide what info they need and how to get it.

Outperforms passive supervision.

Architecture and tech details Given a set of images, the agent asks questions, gets answers, and acquires supervision.

No human needed as oracle.


The question generation model is an image captioning model that uses an LSTM conditioned on image features (as the first hidden input) to generate a question.

The question answering module is a standard VQA model.

What data is used CLEVR: 70k images, 700 image-QA
Other thoughts https://research.fb.com/publications/learning-by-asking-questions/
Session Title Im2Flow: Motion Hallucination from Static Images for Action Recognition
Most impressive breakthrough from the session Translates a static image into an accurate flow map, and thus predicts the unobserved future motion implied by a single snapshot. This helps static-image action recognition.
Architecture and tech details Paper is here. Project page and github here.

Encoder-decoder CNN and a novel optical flow encoding that can translate a static image into a flow map.


What data is used Trained on videos from UCF-101 and HMDB-51 (700K sampled frames)
Application (potential) in industry Image/video analysis, captioning, recognition of actions and dynamic scenes.
Other thoughts Aside from human motions, our model can also predict scene motions, such as the falling waves in the ocean.

Model can infer the motion potential (score) of a novel image—that is, the strength of movement and activity that is poised to happen.

Session Title Actor and Action Video Segmentation from a Sentence
Most impressive breakthrough from the session Actions are specified by a natural language sentence (vs. a predefined vocabulary of actions).

Any actor (vs. human-only approaches).

Architecture and tech details Paper here. Code tbd here.

The RGB model for actor and action video segmentation from a natural language sentence consists of 3 components:

  • CNN to encode the expression
  • 3D CNN to encode the video
  • Decoder that performs a pixel-wise segmentation by convolving dynamic filters generated from the encoded textual representation with the encoded video representation. The same model is applied to the Flow input.
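To illustrate the dynamic-filter idea from the last bullet, here is a minimal PyTorch sketch. The layer sizes and the tanh/normalization choices are my assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterSegHead(nn.Module):
    """Generate a 1x1 'dynamic filter' from the sentence encoding and apply it
    to the encoded video features to get a pixel-wise segmentation response."""
    def __init__(self, text_dim=1024, feat_dim=256):
        super().__init__()
        self.filter_gen = nn.Linear(text_dim, feat_dim)  # text encoding -> filter weights

    def forward(self, video_feats, text_code):
        # video_feats: (B, feat_dim, H, W) features from the (3D) video encoder
        # text_code:   (B, text_dim) sentence encoding from the text encoder
        filt = F.normalize(torch.tanh(self.filter_gen(text_code)), dim=1)        # (B, feat_dim)
        # per-sample 1x1 convolution == dot product of the filter with each spatial feature vector
        response = (video_feats * filt.unsqueeze(-1).unsqueeze(-1)).sum(dim=1)   # (B, H, W)
        return response  # upsample + sigmoid for the final segmentation mask
```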


What data is used Extend two popular actor and action datasets with more than 7,500 natural language descriptions.
Application (potential) in industry Video analysis, indexing, captioning.
Other thoughts The intersection-over-union (IoU) metric was used to measure segmentation quality.

Sentence awareness is beneficial for actor and action specification.

Video awareness is beneficial for more accurate segmentation.

The reported results look quite good.

Session Title Egocentric Activity Recognition (EAR) on a Budget
Most impressive breakthrough from the session Uses RL to learn policies with different energy profiles.
Architecture and tech details Paper here.

Smart glasses have limited battery and processing power.


What data is used Dataset http://sheilacaceres.com/dataego/.

Benchmark dataset — Multimodal.

Application (potential) in industry Use AI for assisted living and nursing services (activity tracking and classification using data from smart glasses). EAR can provide automatic reminders/warnings and help people with cognitive impairments avoid hazardous situations.
Other thoughts Learning the user's context is key for trading off energy between motion-based and vision-based methods.
Session Title Emotional Attention: A Study of Image Sentiment and Visual Attention
Most impressive breakthrough from the session The first study to focus on the relation between emotional properties of an image and visual attention.

Creates the EMOtional attention dataset (EMOd).

Architecture and tech details Project page, paper, code here.

Design a DNN for saliency prediction, which includes a novel subnetwork that learns the spatial and semantic context of the image scene.

CASNet: a channel weighting subnetwork computes a set of 1024-dimensional feature weights for each image to capture the relative importance of the semantic features of that particular image.

The relative saliency of different regions within an image is modified through this subnetwork.
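The channel-weighting idea is essentially a squeeze-and-excitation-style gate over the semantic features. A minimal PyTorch sketch follows; the reduction ratio and layer names are my assumptions, not CASNet's exact design:

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Sketch of a channel-weighting subnetwork: squeeze the semantic feature
    map globally, predict one weight per channel, and rescale the channels."""
    def __init__(self, channels=1024):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, 1024, H, W) semantic features of the image
        w = self.fc(feats.mean(dim=(2, 3)))           # (B, 1024) per-image channel weights
        return feats * w.unsqueeze(-1).unsqueeze(-1)  # reweighted features fed to the saliency decoder
```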


What data is used 3 eye-tracking datasets with emotional content: EMOd (1019 images), NUSEF (751 images), CAT2000 (2000 images).
Application (potential) in industry Video surveillance, captioning
Other thoughts Emotional objects attract attention not only strongly, but also briefly.

The emotion prioritization effect  is stronger for human-related objects than objects unrelated to humans.

Session Title Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation
Most impressive breakthrough from the session Detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions.
Architecture and tech details Paper is here.

Combine the ideas of object detection and semantic segmentation and apply them in an alternative way.

Given an image, the network outputs corner points and segmentation maps by corner detection and position-sensitive segmentation. Then candidate boxes are generated by sampling and grouping corner points. Finally, those candidate boxes are scored by segmentation maps and suppressed by NMS.
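As a reminder of that final suppression step, below is a standard greedy, IoU-based NMS sketch in NumPy. It assumes axis-aligned boxes for simplicity, whereas the paper handles rotated rectangles and quadrangles, and the scores here would come from the segmentation maps:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes that overlap it too much, repeat."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the best box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # keep only boxes with low overlap
    return keep
```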


What data is used  ICDAR2013, ICDAR2015, MSRA-TD500, MLT and COCO-Text
Application (potential) in industry Extracting textual information from natural scene images: product search, image retrieval, autonomous driving
Other thoughts Compared with general object detection, scene text detection is more complicated because:

  1. Scene text may exist in natural images with arbitrary orientation, so the bounding boxes can also be rotated rectangles or quadrangles;
  2. The aspect ratios of bounding boxes of scene text vary significantly;
  3. Since scene text can be in the form of characters, words, or text lines, algorithms might be confused when locating the boundaries.
Session Title Neural Baby Talk
Most impressive breakthrough from the session Achieves captions that are both natural-sounding and visually grounded by reconciling the classical “slot filling” approach with a neural method.

Generates exponentially many possible templates; decouples detection from captioning; incorporates different kinds of supervision.

Neural Baby Talk: a novel framework for visually grounded image captioning that explicitly localizes objects in the image while generating free-form natural language descriptions.

Architecture and tech details Github

Paper


What data is used COCO dataset
Application (potential) in industry Image captioning tasks
Other thoughts A further step after object detection, coupled with NLP.

 

Various NN architectures for CV

Session Title Deep Layer Aggregation
Most impressive breakthrough from the session Better accuracy, fewer params.

Deep layer aggregation is a general and effective extension to deep visual architectures.

Architecture and tech details Paper is here. Git is here.

Learn to aggregate layer outputs.

Any block structure.

More expressive layer inputs.

Faster aggregation.
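A minimal PyTorch sketch of an aggregation node in this spirit is below. The channel counts and the 1x1-conv/BN/ReLU fusion are assumptions on my side; the paper builds iterative and hierarchical aggregation structures out of such nodes:

```python
import torch
import torch.nn as nn

class AggregationNode(nn.Module):
    """Fuse several layer outputs (already at the same spatial resolution)
    with a 1x1 conv + BN + ReLU, in the spirit of deep layer aggregation."""
    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(sum(in_channels_list), out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, *feature_maps):
        # concatenate along channels, then learn how to mix them
        return self.relu(self.bn(self.conv(torch.cat(feature_maps, dim=1))))

# e.g. fuse a 64-channel and a 128-channel map (after upsampling) into one 128-channel map
node = AggregationNode([64, 128], 128)
```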


Application (potential) in industry Image recognition, segmentation
Other thoughts The talk mentioned 2 trends in image recognition:

  1. Better building blocks.
  2. Skip connections.

=>

How can we make the two trends more compatible? How can we achieve the accuracy of DRN (dilated residual networks) with efficient skip connections?

The author mentioned/promoted this open-source labeling tool (BSD license): http://www.scalabel.ai/

Session Title Practical Block-wise Neural Network Architecture Generation
Most impressive breakthrough from the session Provides a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-learning paradigm with an epsilon-greedy exploration strategy.

Distributed asynchronous framework (for speedup).
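For reference, a bare-bones tabular Q-learning loop with epsilon-greedy exploration looks like the sketch below. In BlockQNN the states/actions encode block structure and the reward is the validation accuracy of the trained network; the `env` interface here is purely hypothetical:

```python
import random
from collections import defaultdict

def epsilon_greedy_q_learning(env, episodes=1000, alpha=0.1, gamma=1.0, eps=0.9, eps_min=0.1):
    """Minimal tabular Q-learning with epsilon-greedy exploration.
    `env` is an abstract environment exposing reset(), actions(state) and step(state, action)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:
                action = random.choice(env.actions(state))                          # explore
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])       # exploit
            next_state, reward, done = env.step(state, action)
            best_next = max((Q[(next_state, a)] for a in env.actions(next_state)), default=0.0)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        eps = max(eps_min, eps * 0.995)   # anneal exploration over time
    return Q
```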


Architecture and tech details Paper is here.
What data is used CIFAR, ImageNet
Other thoughts Image classification (CIFAR): it takes 3 days with 32 GPUs to generate a network, which is way less than NASv1 from Google (800 GPUs, 28 days).
Session Title Relation Networks for Object Detection
Most impressive breakthrough from the session The proposed object relation module (ORM) can be embedded into existing object detection pipelines (like Faster R-CNN) and improves mAP (+0.5 to 2).

The module processes a set of objects simultaneously through interaction between their appearance feature and geometry.

Architecture and tech details Paper here.  Github here.

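A simplified, appearance-only sketch of the relation idea is below. The real module also modulates the attention weights with a geometric term and uses multiple relation heads; the dimensions here are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRelationModule(nn.Module):
    """Simplified object relation module: each object's feature is augmented with an
    attention-weighted sum of the other objects' features (appearance term only)."""
    def __init__(self, dim=1024, key_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, key_dim)
        self.k = nn.Linear(dim, key_dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, obj_feats):
        # obj_feats: (N, dim) appearance features of the N detected objects
        attn = F.softmax(self.q(obj_feats) @ self.k(obj_feats).t() / self.q.out_features ** 0.5, dim=1)
        relation = attn @ self.v(obj_feats)     # (N, dim) relation features
        return obj_feats + relation             # residual addition, so it slots into an existing detector
```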

Other thoughts The authors claim a gain of +2.3 mAP for Faster R-CNN by inserting 2 ORMs.

Network learns:

  • object pairs with high relation weight
  • class co-occurrence info
Session Title DeepGlobe: A Challenge for Parsing the Earth through Satellite Images
Most impressive breakthrough from the session http://deepglobe.org/

D-LinkNet: winner for road extraction.

Dense Fusion: winner for land cover classification.

Multi-task U-Net: winner for building detection.

Architecture and tech details D-LinkNet:

Architecture: the network is built on the LinkNet architecture and has dilated convolution layers in its center part. The LinkNet architecture is efficient in computation and memory. Dilated convolution is a powerful tool that can enlarge the receptive field of feature points without reducing the resolution of the feature maps.

Loss function and optimizer: BCE (binary cross entropy) + dice coefficient loss as loss function and Adam as optimizer.

Data augmentation: test-time augmentation (TTA), including horizontal flip, vertical flip, and diagonal flip (predicting each image 2 × 2 × 2 = 8 times), then restoring the outputs to match the original images.
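A minimal PyTorch sketch of the BCE + dice loss combination mentioned above; the equal weighting and the smoothing constant are my assumptions:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, targets, smooth=1.0):
    """Binary cross-entropy plus soft dice loss for binary segmentation masks.
    logits, targets: (B, 1, H, W); targets are 0/1 road masks."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)   # soft dice coefficient per sample
    return bce + (1.0 - dice.mean())
```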

Dense Fusion: Dense Fusion Classmate Network (DFCNet)


Multi-task U-net:


What data is used high-resolution satellite image datasets (courtesy of DigitalGlobe) and the corresponding training data
Application (potential) in industry Three challenge tracks: road extraction, building detection, land cover classification.
Other thoughts For instance segmentation, I only saw one paper using Mask R-CNN. U-Nets are very popular: stacked U-Net, NU-Net, multi-task U-Net, etc.

Paper list: http://openaccess.thecvf.com/CVPR2018_workshops/CVPR2018_W4.py

Session Title Interpretable Machine Learning for Computer Vision
Most impressive breakthrough from the session Interpretability is NOT about understanding all bits and bytes of the model for all data points. It’s about knowing enough for your goals/downstream tasks.
Architecture and tech details Slides: Intro to interpretable ML

https://interpretablevision.github.io/

Application (potential) in industry When presenting an AI project to others, most people still think AI is a black box. This session provides insights on how to address that.
Other thoughts When we need interpretability: it can help when we cannot formalize the ideas we care about.

When we do not need interpretability: when predictions are all you need; when the problem is sufficiently well studied; when objectives are mismatched.

Examples of interpretability: EDA; explanation via rules, examples, sparsity and monotonicity; ablation tests, input-feature importance, concept importance.

How to evaluate: human experiment and ground-truth experiment.

t-SNE visualization (I did not get the slides), but you can Google an article on t-SNE visualization: “Using T-SNE to Visualise how your Model thinks”.

Session Title What do deep networks like to see?
Most impressive breakthrough from the session Cross-feeding reconstructions to other classifiers.
Architecture and tech details  Project page.


What data is used  YFCC100m, Imagenet
Application (potential) in industry  Understand CNNs
Other thoughts More understanding in deep neural network layers. May be useful in choosing which layers to cut from, extracting features and training a new model using these features.
Session Title Context Encoding for Semantic Segmentation
Most impressive breakthrough from the session The context encoding module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. It selectively highlights the class-dependent feature maps and simplifies the problem for the network. The model achieves a final score of 0.5567 on the ADE20K test set, which surpasses the winning entry of the COCO Challenge 2017. It also improves the feature representation of relatively shallow networks for image classification on the CIFAR-10 dataset.
Architecture and tech details Github

Paper

Code


Key contributions:

  1. Semantic Encoding Loss (SE-loss): a simple unit to leverage the global scene context information.
  2. The design and implementation of a new semantic segmentation framework Context Encoding Network (EncNet): augment a pre-trained deep residual network.
What data is used PASCAL-Context, PASCAL VOC 2012, ADE20K, CIFAR-10
Application (potential) in industry Semantic segmentation
Session Title Learning to See in the Dark
Most impressive breakthrough from the session A pipeline for processing low-light images based on end-to-end training of a fully convolutional network.
Architecture and tech details Github

Paper


What data is used Data collected by the authors, including both indoor and outdoor images at night: 5094 raw short-exposure images in total.
Application (potential) in industry Image preprocessing
Other thoughts I wonder whether it can be exposed as an API.

Image Enhancements and Manipulations

Session Title xUnit: Learning a Spatial Activation Function for Efficient Image Restoration
Most impressive breakthrough from the session Significantly decreases the number of learned parameters: for super-resolution and denoising NNs the parameter count was cut by more than half.
Architecture and tech details Paper is here. Github.

“In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance”
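Roughly, an xUnit-style activation gates the input feature map with a learned, spatially-aware map instead of applying a per-pixel ReLU. Below is a rough PyTorch sketch based on my reading of the paper; the kernel size and exact layer ordering are assumptions:

```python
import torch
import torch.nn as nn

class XUnit(nn.Module):
    """Rough sketch of an xUnit-style spatial gating activation: a learnable spatial
    branch produces a per-pixel gate in (0, 1] that multiplies the input feature map."""
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            # depthwise conv gives each channel its own learned spatial support
            nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2, groups=channels),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        gate = torch.exp(-self.branch(x) ** 2)   # Gaussian-shaped gate in (0, 1]
        return x * gate
```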


What data is used BSD68, Rain12.
Application (potential) in industry Super resolution, denoising.
Session Title Deformation Aware Image Compression
Most impressive breakthrough from the session The encoder need not invest bits in describing the exact geometry of fine structures; more bits are invested in the important parts, leading to much better detail preservation (confirmed by user studies).
Architecture and tech details Paper here.

Easy to incorporate in any CODEC.

As human observers are indifferent to slight local translations, authors propose a deformation insensitive version of the SSD (sum of squared differences) measure: deformation aware SSD (DASSD).

What data is used Berkeley segmentation dataset, Kodak dataset
Application (potential) in industry Compression
Other thoughts The presented visual results are quite impressive.
Session Title Residual Dense Network for Image Super-Resolution
Most impressive breakthrough from the session The goal is to make full use of the hierarchical features from the original low-resolution (LR) images.
Architecture and tech details Paper here. Github here.

The network is built from residual dense blocks (RDBs); see the architecture figures in the paper.

What data is used DIV2K, Set5, Set14, B100, Urban100, Manga109
Application (potential) in industry Photo enhancement apps
Session Title Attentive Generative Adversarial Network for Raindrop Removal from a Single Image
Most impressive breakthrough from the session Injection of visual attention into both the generative and discriminative networks is proposed. In this use case, special attention is given to the raindrop regions.
Architecture and tech details Paper here.

A mix of GANs with LSTMs and a U-Net.


What data is used The authors create their own dataset (1K image pairs). To prepare it, they used two pieces of exactly the same glass when taking photos: one sprayed with water, the other left clean.
Application (potential) in industry Photo editing apps
Other thoughts Quite impressive results.

The authors use a VGG-16 pre-trained on ImageNet.

Session Title Burst Denoising with Kernel Prediction Networks
Most impressive breakthrough from the session CNN predicts spatially varying kernels that can both align and denoise frames.

Interesting approach for training data generation.

Architecture and tech details Paper here. Github here.


What data is used Training data is synthesized using images from the Open Images dataset: images are modified to introduce synthetic misalignment and noise approximating the characteristics of real image bursts. To generate a synthetic burst of N frames, the authors take a single image and generate N cropped patches with misalignment.
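A rough NumPy sketch of that data-generation recipe; the offset magnitudes, patch size and the simple Gaussian noise model are my assumptions, not the paper's exact parameters:

```python
import numpy as np

def synthetic_burst(image, n_frames=8, patch=128, max_shift=2, noise_sigma=0.05):
    """Turn a single clean image into a noisy, misaligned burst:
    crop N patches with small random offsets, then add noise to each frame."""
    h, w = image.shape[:2]
    cy = np.random.randint(patch // 2 + max_shift, h - patch // 2 - max_shift)
    cx = np.random.randint(patch // 2 + max_shift, w - patch // 2 - max_shift)
    burst = []
    for i in range(n_frames):
        dy, dx = (0, 0) if i == 0 else np.random.randint(-max_shift, max_shift + 1, size=2)
        y0, x0 = cy + dy - patch // 2, cx + dx - patch // 2
        frame = image[y0:y0 + patch, x0:x0 + patch].astype(np.float32)
        frame += np.random.normal(0.0, noise_sigma, frame.shape)   # crude stand-in for sensor noise
        burst.append(frame)
    return np.stack(burst)  # (N, patch, patch[, C]); frame 0 is the reference frame
```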
Application (potential) in industry Photo apps
Other thoughts The authors report good performance on a real dataset: burst photos taken with a Nexus 6P cellphone under dim lighting (note: the model was trained on the synthetic dataset). Benchmarking on the real dataset is done after some image preprocessing: subtracting the black level, suppressing hot pixels, and coarse whole-pixel alignment of alternate frames.
Session Title Crafting a Toolchain for Image Restoration by Deep Reinforcement Learning
Most impressive breakthrough from the session A toolbox of small-scale CNNs specialized in different tasks, combined with RL: learn a policy to select the appropriate tools to restore a corrupted image.
Architecture and tech details Project page is here. Github is here.


What data is used DIV2K; for training, a set of distortions is added synthetically.
Application (potential) in industry Photo apps
Other thoughts Decent results on real data as well; the agent decides on its own when to stop restoration, so the framework has the potential to deal with real distortions.

  

 Goal-Driven Navigation, Indoor 3D Scenes

Session Title Density Adaptive Point Set Registration
Most impressive breakthrough from the session Successfully handles severe density variations commonly encountered in terrestrial Lidar applications

 

Architecture and tech details Paper here.

Model the underlying structure of the scene as a latent probability distribution, and thereby induce invariance to point set density changes. Both the probabilistic model of the scene and the registration parameters are inferred by minimizing the Kullback-Leibler divergence in an Expectation Maximization based framework.

Introduce the observation weight function.

What data is used Synthetic dataset: construct synthetic point clouds by performing point sampling on a polygon mesh that simulates an indoor 3D scene.

Virtual Photo Sets, ETH TLS

Application (potential) in industry  Lidar applications, 3D mapping, scene understanding
Other thoughts Authors perform extensive experiments on several challenging real-world Lidar datasets.

Model both the underlying structure of the 3D scene and the acquisition process to obtain robustness to density variations.

Session Title Im2Pano3D: Extrapolating 360° Structure and Semantics Beyond the Field of View
Most impressive breakthrough from the session To ease the prediction of 3D structure, the authors parameterize 3D surfaces with their plane equations and train the model to predict these parameters directly.
Architecture and tech details Project page here. Github here.

Key idea: indoor environments are highly structured. By learning statistics over many typical scenes, the model should be able to leverage strong contextual cues to predict what is beyond the FoV.

Use multiple loss functions: pixel-wise accuracy, mid-level contextual consistency using Patch-GAN (adversarial) loss, and global scene consistency measured by scene category and object distributions. The final loss for each channel is a weighted sum of the three level losses.


What data is used 3D house datasets: synthetic houses (SUNCG) and real houses (Matterport3D)
Application (potential) in industry Robotics, goal-driven navigation, next-best-view approximation
Other thoughts Im2Pano3D is able to predict the semantics and 3D structure of the unobserved scene with more than 56% pixel accuracy and less than 0.52m average distance error.

Pre-training on SUNCG (synthetic dataset) significantly improves the model’s performance.

The paper also presents results on how humans coped with scene completion tasks; the NN results are not as good as humans', but quite promising.

Session Title Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Most impressive breakthrough from the session Avg instruction length: 29 words (in natural language).

Proposes the first benchmark dataset for navigation in real buildings.

Architecture and tech details Project page.

Agent: RNN (Seq2Seq LSTM)

What data is used Matterport3D

Room-to-Room (R2R)

Test set: previously unseen buildings.

Application (potential) in industry Robotics
Other thoughts Success: if stopped within 3m of the goal (pretty far).

Authors introduce Matterport3D Simulator, a new large-scale visual RL simulation environment for the research and development of intelligent agents based on the Matterport3D dataset

Session Title Sim2Real View Invariant Visual Servoing by Recurrent Control
Most impressive breakthrough from the session Visual servoing system uses its memory of past movements to understand how the actions affect the robot motion from the current viewpoint, correcting mistakes and gradually moving closer to the target.

This recurrent controller is learnt using simulated data and a reinforcement learning objective.

Architecture and tech details Paper here. Project page with demo here.

Visual servoing: moving a tool or end-point to a desired location using primarily visual feedback.

The goal is indicated by an image of the query object, and the network must both figure out where this object is in the image and how to move to reach it.


What data is used Synthesize the strongly supervised training data by generating a large set of episodes with varied camera locations, objects, and textures.
Application (potential) in industry Robotics
Other thoughts Interesting, but not ready to be deployed in production tomorrow.

 

People Related Analysis

Session Title Divide and Grow: Capturing Huge Diversity in Crowd Images With Incrementally Growing CNN
Most impressive breakthrough from the session Recursive CNN structure for crowd count estimation
Architecture and tech details Paper here.

After pretraining of the base CNN, a CNN tree is progressively built where each node represents a regressor finetuned on a subset of the dataset. This is done by replicating each regressor at the tree leaves into two and specializing the child network with differential training.


What data is used Shanghaitech, UCF CC 50, World Expo’10
Application (potential) in industry Surveillance
Other thoughts Metrics: mean absolute error and mean squared error.

The regressors at the leaf nodes of the tree are finer experts on certain specialties mined without any manually specified criteria

Session Title Connecting Pixels to Privacy and Utility: Automatic Redaction of Private Information in Images
Most impressive breakthrough from the session Propose the first sizable dataset of private images “in the wild” annotated with pixel and instance level labels across a broad range of privacy classes.

The first model for automatic redaction of diverse private information.

Architecture and tech details Project page here. Github here.

Challenges: attributes occurring across multiple modalities (textual, visual, multimodal).

Redaction performed using ensemble (SEQ, FCIS. WSL:I)


What data is used New dataset that extends the Visual Privacy (VISPR) dataset to include high-quality pixel and instance-level annotations. To this end, authors propose a dataset containing 8.5k images annotated with 47.6k instances over 24 privacy attributes.
Application (potential) in industry Data cleaning for privacy, GDPR compliance
Other thoughts The proposed approach is effective at achieving various privacy-utility trade-offs, reaching within 83% of the performance of manual redaction.


Session Title Fashion AI
Most impressive breakthrough from the session Alibaba Fashion AI
Architecture and tech details Paper: Creating Capsule Wardrobes from Fashion Images.


What data is used For 1, a lot of images from Taobao.

For 2, keyword search with the attribute name on Google to get image data; finally 100-300 images per attribute, 12K images in total.

Application (potential) in industry Alibaba Fashion AI explores automated outfit matching, acting as a stylist. They explored multiple images and tried to identify different styles of clothes as a first step. Later, on their website, given one t-shirt it would recommend a matching pair of pants or a skirt. Pretty early stage; not customized to each customer yet.

 

 

Efficient DNNs

Session Title Efficient and accurate CNN Models at Edge compute platforms 
Most impressive breakthrough from the session Real-time object detection running on DeepLens can go faster on the CPU (XNOR.AI optimized model) than on the GPU (see demo).

XNOR's $5 deep learning machine on a Raspberry Pi Zero (see demo).

Architecture and tech details Presented by XNOR.AI team.

The growing demand for edge devices is highlighted: they preserve privacy, security, and bandwidth.

Solutions:

  1. Lower precision (quantization): fixed point, binary (XNOR-Net); see the sketch after this list.
  2. Sparse models: lookup-based CNN, factorization.
  3. Compact network design: MobileNet.
  4. How to improve accuracy? => Label Refinery.
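As a concrete example of item 1, XNOR-/BWN-style weight binarization approximates each filter by its sign times a per-filter scale. A minimal sketch; training details (such as the straight-through estimator) are only hinted at in the trailing comment:

```python
import torch

def binarize_weights(weight):
    """Binarize convolution weights as W ~= alpha * sign(W),
    where alpha is the mean absolute value per output filter.
    weight: (out_ch, in_ch, kH, kW) full-precision convolution weights."""
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)   # per-filter scaling factor
    return alpha * torch.sign(weight)

# During training the binarized weights are used in the forward/backward pass,
# while full-precision weights are kept and updated by the optimizer
# (straight-through estimator); full XNOR-Net also binarizes the activations.
```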
Application (potential) in industry Edge devices
Other thoughts Challenges with current labeling (like ImageNet):

1) Labels are misleading.

Example: a “Persian cat” image shown in the talk.

2) Random cropping can produce training data that lacks the needed context.

3) The same penalization is applied whether a chihuahua gets misclassified as a cat or as a car.

Labels should be:

  1. Soft: violin (40%), roses (30%), cat (30%).
  2. Informative: an image of a big Newfoundland dog can be labeled as dog (60%), cat (10%), bear (30%).
Session Title Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications
Most impressive breakthrough from the session DNNs models and neural net accelerators should be co-designed.
Architecture and tech details Paper here.

Popular Nets and their computational requirements for object detection (source MobileNetV2 paper)


Speed is more related to memory access than operations.

Energy consumption is more related to  memory accesses than computations.


SqueezeNext: Hardware-Aware Neural Network Design (paper)


Application (potential) in industry Edge devices
Other thoughts The key to efficient DNN computation is data reuse.

Different CNN layers have different patterns of data reuse.

Different NN accelerator architectures favor different types of reuse (outputs vs. weights).

Session Title Intel deployment tutorial
Most impressive breakthrough from the session OpenCV is the undisputed champion of the CV world; classical components are widely used.
Architecture and tech details OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit, formerly the Intel Computer Vision Toolkit.

Includes traditional computer vision (such as OpenCV) and deep learning toolkits across CPU, GPU, FPGA, VPU, IPU.

  1. Enables CNN-based deep learning inference on the edge.
  2. Supports heterogeneous execution across Intel's CV accelerators, using a common API for the CPU, Intel® Integrated Graphics, Intel® Movidius™ Neural Compute Stick, and FPGA.
  3. Speeds time-to-market through an easy-to-use library of CV functions and pre-optimized kernels.
  4. Optimized calls for CV standards, including OpenCV*, OpenCL™, and OpenVX*.

What data is used
Application (potential) in industry This tutorial raised OpenCV's profile again. Future CV projects may consider OpenCV more.
Other thoughts A development toolkit for high-performance CV and DL inference; a set of libraries to solve CV/DL deployment problems.

Document CV

Session Title DocUNet: Document Image Unwarping via a Stacked U-Net
Most impressive breakthrough from the session Flattens a document image when the physical document sheet is folded or curved. The paper implements a stacked U-Net with intermediate supervision to directly predict the forward mapping from a distorted image to its rectified version.
Architecture and tech details Paper


Data augmentation to improve generalization ability: the Describable Texture Dataset (DTD) is used to produce various background textures; jitter is added in the HSV color space to magnify illumination and paper color variations; a projective transform copes with viewpoint change.
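A rough OpenCV/NumPy sketch of that augmentation recipe; the jitter ranges, the compositing rule (zero pixels treated as background) and the perspective magnitude are my assumptions:

```python
import cv2
import numpy as np

def augment_synthetic_page(img, background, max_jitter=0.1, max_persp=0.05):
    """Composite a warped page onto a texture background, jitter HSV,
    and apply a small random projective transform."""
    h, w = img.shape[:2]
    bg = cv2.resize(background, (w, h))
    # crude composite: pixels that are all-zero are assumed to be background
    out = np.where(img.sum(axis=2, keepdims=True) > 0, img, bg)

    hsv = cv2.cvtColor(out, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1:] *= 1.0 + np.random.uniform(-max_jitter, max_jitter, size=2)  # saturation/value jitter
    out = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)

    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + np.random.uniform(-max_persp, max_persp, size=(4, 2)) * [w, h]).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)                                 # random viewpoint change
    return cv2.warpPerspective(out, M, (w, h))
```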

What data is used Synthetic Dataset


Application (potential) in industry Text analytics
Other thoughts Trained on a synthetic dataset but tested on real data.


Session Title Document enhancement using visibility detection
Most impressive breakthrough from the session

Architecture and tech details Paper


Application (potential) in industry Text image binarization, Document unshadowing
Other thoughts Can be used to preprocess documents before passing the images to OCR; it could even be combined with the paper above to deal with distorted document images under bad lighting.

Data and CV

Session Title Generate To Adapt: Aligning Domains using Generative Adversarial Network
Most impressive breakthrough from the session Leverages unsupervised data to bring the source and target distributions closer in a learned joint feature space using GANs.
Architecture and tech details Paper is here. Github is here.

Source domain updates: the F and C networks are updated using a supervised classification loss; F, G, D are updated using an adversarial loss to produce class-consistent source-like images.

Target domain updates: the F network is updated so that the target embeddings (when passed through the GAN) produce source-like images; this loss aligns the source and target feature representations.


What data is used DIGITS, OFFICE

Domain adaptation from synthetic to real data: CAD to Pascal, VISDA

Application (potential) in industry Improve CNN performance on unseen data.
Other thoughts The quality of image generation is better in the digits experiments compared to the Office experiments.

The generator is able to produce source-like images for both the source and target inputs in a class-consistent manner.

There is mode collapse in the generations produced in the Office experiments.

The difficulty GANs have in generating realistic images on the Office and synthetic-to-real datasets makes it significantly hard for methods that use cross-domain image generation as a data augmentation step. The presented approach relies on image generation as a means of deriving rich gradients for the feature extraction network, so the method works well even in the presence of severe mode collapse and poor generation quality.

Session Title COCO-Stuff: Thing and Stuff Classes in Context
Most impressive breakthrough from the session COCO is missing stuff annotations. This paper augments COCO by adding dense pixel-wise stuff annotations. Since COCO is about complex, yet natural scenes containing substantial areas of stuff, COCO-Stuff enables the exploration of rich relations between things and stuff.

COCO-Stuff offers a valuable stepping stone towards complete scene understanding.

Architecture and tech details Paper
What data is used 164K images of COCO 2017 dataset with pixel-wise annotations for 91 stuff classes

COCO-Stuff contains 172 classes: 80 thing, 91 stuff, and 1 class unlabeled. The 80 thing classes are the same as in COCO [35]. The 91 stuff classes are curated by an expert annotator. The class unlabeled is used in two situations: if a label does not belong to any of the 171 predefined classes, or if the annotator cannot infer the label of a pixel.

Application (potential) in industry Improve the performance of semantic segmentation
Other thoughts
Session Title Workshop session: vision with sparse and scarce data
Most impressive breakthrough from the session Make your data count: sharing information across domains and tasks — by Judy Hoffman (keynote speaker)

ImageNet-style datasets are biased, as the images mainly come from social media; e.g., people tend to shoot a dog's face head-on for a dog picture. In real life, if we are given a short video with low resolution, motion blur and pose variety, the model will not perform well.

Architecture and tech details How can we learn to generalize when the visual environment changes or bias exists? => Learn a representation that cannot distinguish domains.

Two approaches:

1) Deep domain adaptation


2) Domain adversarial adaptation


What data is used ImageNet, SYNTHIA dataset
Application (potential) in industry Cross-city adaptation: train in Germany but test in San Francisco (signs, tunnels, size of roads).

Cross-season adaptation: SYNTHIA dataset.

Cross-season pixel adaptation: generate winter from fall.

Synthetic-to-real pixel adaptation:

Train on GTA (synthetic), test on German streets.

Adaptation can make a big difference in performance.

Other thoughts Some slides from Judy Hoffman: they are from the GANs tutorial, but some of them also cover this talk.

This is a common issue in computer vision. We have mostly used transfer learning to deal with such problems, but this talk gives insights on using cross-domain adaptation.

If you have read this far — I hope you’ve found something useful!

 
