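[Overview] IEEE CVPR, one of the most influential academic conferences in computer vision, will be held June 18-22, 2018, in Salt Lake City, USA. According to the CVPR website, this year's conference received more than 3,300 paper submissions, of which 979 were accepted; compared with last year's 783 accepted papers, that is an increase of nearly 25%.
The full list of accepted papers was recently published; see: http://cvpr2018.thecvf.com/files/cvpr_2018_final_accept_list.txt
https://github.com/amusi/daily-paper-computer-vision/blob/master/2018/cvpr2018-paper-list.csv
▌Paper list: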
Single-Shot Refinement Neural Network for Object Detection
Video Captioning via Hierarchical Reinforcement Learning
DensePose: Multi-Person Dense Human Pose Estimation In The Wild
Frustum PointNets for 3D Object Detection from RGB-D Data
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Shape from Shading through Shape Evolution
A High-Quality Denoising Dataset for Smartphone Cameras
Improving Color Reproduction Accuracy in the Camera Imaging Pipeline
End-to-End Dense Video Captioning with Masked Transformer
pOSE: Pseudo Object Space Error for Initialization-Free Bundle Adjustment
Learning to Segment Every Thing
Density-aware Single Image De-raining using a Multi-stream Dense Network
Densely Connected Pyramid Dehazing Network
Embodied Question Answering
TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays
Towards Open-Set Identity Preserving Face Synthesis
Baseline Desensitizing In Translation Averaging
Learning from the Deep: A Revised Underwater Image Formation Model
Context Encoding for Semantic Segmentation
Deep Texture Manifold for Ground Terrain Recognition
DS*: Tighter Lifting-Free Convex Relaxations for Quadratic Matching Problems
Sparse, Smart Contours to Represent and Edit Images
Every Smile is Unique: Landmark-guided Diverse Smile Generation
Generative Non-Rigid Shape Completion with Graph Convolutional Autoencoders
Learning a Discriminative Prior for Blind Image Deblurring
Attentional ShapeContextNet for Point Cloud Recognition
Learning Superpixels with Segmentation-Aware Affinity Loss
Real-World Repetition Estimation by Div, Grad and Curl
Recurrent Saliency Transformation Network: Incorporating Multi-Stage Visual Cues for Small Organ Segmentation
MegaDepth: Learning Single-View Depth Prediction from Internet Photos
Learning Intrinsic Image Decomposition from Watching the World
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Human-centric Indoor Scene Synthesis Using Stochastic Grammar
Learning by Asking Questions
Instance Embedding Transfer to Unsupervised Video Object Segmentation
Detect-and-Track: Efficient Pose Estimation in Videos
Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval
Guided Proofreading of Automatic Segmentations for Connectomics
Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation
Context-aware Synthesis for Video Frame Interpolation
2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning
NAG: Network for Adversary Generation
LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation
Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration
Multi-view Harmonized Bilinear Network for 3D Object Recognition
Tangent Convolutions for Dense Prediction in 3D
Semi-parametric Image Synthesis
Interactive Image Segmentation with Latent Diversity
3D Hand Pose Estimation: From Current Achievements to Future Goals
W2F: A Weakly-Supervised to Fully-Supervised Framework for Object Detection
BlockDrop: Dynamic Inference Paths in Residual Networks
MapNet: Geometry-Aware Learning of Maps for Camera Localization
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning
Salient Object Detection Driven by Fixation Prediction
3D Object Detection with Latent Support Surfaces
Practical Block-wise Neural Network Architecture Generation
Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points
Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning
Visual Grounding via Accumulated Attention
Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors
ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing
Perturbative Neural Networks: Rethinking Convolution in CNNs
Nonlinear 3D Face Morphable Model
Neural Baby Talk
Towards Pose Invariant Face Recognition in the Wild
MoNet: Deep Motion Exploitation for Video Object Segmentation
Exploring Disentangled Feature Representation Beyond Face Identification
Towards Effective Low-bitwidth Convolutional Neural Networks
Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering
Few-Shot Image Recognition by Predicting Parameters from Activations
Single-Shot Object Detection with Enriched Semantics
Unifying Identification and Context Learning for Person Recognition
Separating Self-Expression and Visual Content in Hashtag Supervision
Multi-Cue Correlation Filters for Robust Visual Tracking
Beyond Trade-off: Accelerate FCN-based Face Detection with Higher Accuracy
On the Robustness of Semantic Segmentation Models to Adversarial Attacks
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
Illuminant Spectra-based Source Separation Using Flash Photography
Tracking Multiple Objects Outside the Line of Sight using Speckle Imaging
Improved Human Pose Estimation through Adversarial Data Augmentation
Generative Adversarial Learning Towards Fast Weakly Supervised Detection
Audio to Body Dynamics
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Frame-Recurrent Video Super-Resolution
Deep Mutual Learning
Real-world Anomaly Detection in Surveillance Videos
Soccer on Your Tabletop
Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification
HashGAN: Deep Learning to Hash with Pair Conditional Wasserstein GAN
Excitation Backprop for RNNs
Dynamic-Structured Semantic Propagation Network
Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
SPLATNet: Sparse Lattice Networks for Point Cloud Processing
Video Representation Learning Using Discriminative Pooling
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Human Pose Estimation with Parsing Induced Learner
4D Human Body Correspondences from Panoramic Depth Maps
Recognizing Human Actions as Evolution of Pose Estimation Maps
GraphBit: Bitwise Interaction Mining via Deep Reinforcement Learning
Deep Adversarial Metric Learning
Revisiting Video Saliency: A Large-scale Benchmark and a New Model
Graph-Cut RANSAC
Five-point Fundamental Matrix Estimation for Uncalibrated Cameras
Hashing as Tie-Aware Learning to Rank
Optimizing Local Feature Descriptors for Nearest Neighbor Matching
Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies
Consensus Maximization for Semantic Region Correspondences
ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing
Motion-Guided Cascaded Refinement Network for Video Object Segmentation
Zigzag Learning for Weakly Supervised Object Detection
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
VITON: An Image-based Virtual Try-on Network
Cross-Domain Self-supervised Multi-task Feature Learning Using Synthetic Game Imagery
LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image
Thoracic Disease Identification and Localization with Limited Supervision
Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks
Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation
Deep End-to-End Time-of-Flight Imaging
Fast and Accurate Online Video Object Segmentation via Tracking Parts
Min-Entropy Latent Model for Weakly Supervised Object Detection
Future Frame Prediction for Anomaly Detection - A New Baseline
Face Aging with Identity-Preserved Conditional Generative Adversarial Networks
Learning to Compare: Relation Network for Few-Shot Learning
Deep Layer Aggregation
Style Aggregated Network for Facial Landmark Detection
M3: Multimodal Memory Modelling for Video Captioning
Classification Driven Dynamic Image Enhancement
Generative Image Inpainting with Contextual Attention
Iterative Visual Reasoning Beyond Convolutions
Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification
Textbook Question Answering under Teacher Guidance with Memory Networks
Multi-Level Factorisation Net for Person Re-Identification
Functional Map of the World
A Two-Step Disentanglement Method
Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Left-Right Comparative Recurrent Model for Stereo Matching
Analytic Expressions for Probabilistic Moments of PL-DNN with Gaussian Input
Zero-Shot Sketch-Image Hashing
Interpretable Convolutional Neural Networks
Reconstructing Thin Structures of Manifold Surfaces by Integrating Spatial Curves
Enhancing the Spatial Resolution of Stereo Images using a Parallax Prior
Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB
Generating Synthetic X-ray Images of a Person from the Surface Geometry
Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification
Unsupervised CCA
Discovering Point Lights with Intensity Distance Fields
Universal Denoising Networks: A Novel CNN-based Network Architecture for Image Denoising
Easy Identification from Better Constraints: Multi-Shot Person Re-Identification from Reference Constraints
Recurrent Pixel Embedding for Instance Grouping
Recurrent Scene Parsing with Perspective Understanding in the Loop
Learning to Hash by Discrepancy Minimization
Fast End-to-End Trainable Guided Filter
Disentangling Structure and Aesthetics for Content-aware Image Completion
An Analysis of Scale Invariance in Object Detection - SNIP
CSGNet: Neural Shape Parser for Constructive Solid Geometry
Finding Tiny Faces in the Wild with Generative Adversarial Network
SSNet: Scale Selection Network for Online 3D Action Prediction
Integrated facial landmark localization and super-resolution of real-world very low resolution faces in arbitrary poses with GANs
The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks
Deep Cross-media Knowledge Transfer
Coupled End-to-end Transfer Learning with Generalized Fisher Information
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
Viewpoint-aware Attentive Multi-view Inference for Vehicle Re-identification
MatNet: Modular Attention Network for Referring Expression Comprehension
CBMV: A Coalesced Bidirectional Matching Volume for Disparity Estimation
NISP: Pruning Networks using Neuron Importance Score Propagation
Who Let The Dogs Out? Modeling Dog Behavior From Visual Data
Efficient Video Object Segmentation via Network Modulation
Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision
Feedback-prop: Convolutional Neural Network Inference under Partial Evidence
A Memory Network Approach for Story-based Temporal Summarization of 360° Videos
Improving Occlusion and Hard Negative Handling for Single-Stage Object Detectors
UV-GAN: Adversarial Facial UV Map Completion for Pose-invariant Face Recognition
Learning a Toolchain for Image Restoration
Learning to Act Properly: Predicting and Explaining Affordances from Images
Learning a Discriminative Feature Network for Semantic Segmentation
Optimizing Video Object Detection via a Scale-Time Lattice
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Cascaded Pyramid Network for Multi-Person Pose Estimation
Seeing Temporal Modulation of Lights from Standard Cameras
Point-wise Convolutional Neural Networks
Fine-grained Video Captioning for Sports Narrative
Dense 3D Regression for Hand Pose Estimation
Missing Slice Recovery for Tensors Using a Low-rank Model in Embedded Space
Learning Convolutional Networks for Content-weighted Image Compression
Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking
Deep Cost-Sensitive and Order-Preserving Feature Learning for Cross-Population Age Estimation
First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations
Hand PointNet: 3D Hand Pose Estimation using Point Sets
Recovering Realistic Texture in Image Super-resolution by Spatial Feature Modulation
Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos
A Face to Face Neural Conversation Model
SurfConv: Bridging 3D and 2D Convolution for RGBD Images
Dynamic Video Segmentation Network
Multiple Granularity Group Interaction Prediction
Visual Question Reasoning on General Dependency Tree
From Lifestyle VLOGs to Everyday Interactions
COCO-Stuff: Thing and Stuff Classes in Context
GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB
Non-local Neural Networks
Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs
Taskonomy: Disentangling Task Transfer Learning
Embodied Real-World Active Perception
SfSNet: Learning Shape, Reflectance and Illuminance of Faces 'in the wild'
End-to-end Recovery of Human Shape and Pose
Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction
A Fast Resection-Intersection Method for the Known Rotation Problem
Image Generation from Scene Graphs
What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets
PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video
Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatio-temporal Patterns
Kernelized Subspace Pooling for Deep Local Descriptors
Video Rain Removal By Multiscale Convolutional Sparse Coding
Learning from Millions of 3D Scans for Large-scale 3D Face Recognition
Referring Relationships
Improving Object Localization with Fitness NMS and Bounded IoU Loss
Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization
Visual Question Generation as Dual Task of Visual Question Answering
Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi- Supervised Semantic Segmentation
Learning Dual Convolutional Neural Networks for Low-Level Vision
Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation
MegDet: A Large Mini-Batch Object Detector
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
TOM-Net: Learning Transparent Object Matting from a Single Image
End-to-End Deep Kronecker-Product Matching for Person Re-identification
Semantic Visual Localization
Joint Cuts and Matching of Partitions in One Graph
Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions
Crowd Counting via Adversarial Cross-Scale Consistency Pursuit
Deep Group-shuffling Random Walk for Person Re-identification
Learning to Detect Features in Texture Images
Transferable Joint Attribute-Identity Deep Learning for Unsupervised Person Re-Identification
CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles
Context-aware Deep Feature Compression for High-speed Visual Tracking
Deep Material-aware Cross-spectral Stereo Matching
Deep Extreme Cut: From Extreme Points to Object Segmentation
Label Denoising Adversarial Network (LDAN) for Inverse Lighting of Face Images
Harmonious Attention Network for Person Re-Identification
Unsupervised Deep Generative Adversarial Hashing Network
Pseudo-Mask Augmented Object Detection
LSTM stack-based Neural Multi-sequence Alignment TeCHnique (NeuMATCH)
Adversarial Complementary Learning for Weakly Supervised Object Localization
Unsupervised Discovery of Object Landmarks as Structural Representations
DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map
Monocular Relative Depth Perception with Web Stereo Data Supervision
Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification
Objects as context for detecting their semantic parts
Camera Style Adaptation for Person Re-identification
Conditional Generative Adversarial Network for Structured Domain Adaptation
Rotation-sensitive Regression for Oriented Scene Text Detection
Residual Parameter Transfer for Deep Domain Adaptation
SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation
Weakly Supervised Instance Segmentation using Class Peak Response
Robust Facial Landmark Detection via a Fully-Convolutional Local-Global Context Network
Rotation Averaging and Strong Duality
PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Feature Quantization for Defending Against Distortion of Images
End-to-end weakly-supervised semantic alignment
PointGrid: A Deep Network for 3D Shape Understanding
Imagine it for me: Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts
A Minimalist Approach to Type-Agnostic Detection of Quadrics in Point Clouds
A Benchmark for Articulated Human Pose Estimation and Tracking
Boosting Self-Supervised Learning via Knowledge Transfer
PPFNet: Global Context Aware Local Features for Robust 3D Point Matching
Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Fast Video Object Segmentation by Reference-Guided Mask Propagation
Super-Resolving Very Low-Resolution Face Images with Supplementary Attributes
Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding
One-shot Action Localization by Sequence Matching Network
Efficient Subpixel Refinement with Symbolic Linear Predictors
Distort-and-Recover: Color Enhancement using Deep Reinforcement Learning
Group Consistent Similarity Learning via Deep CRFs for Person Re-Identification
Single Image Reflection Separation with Perceptual Losses
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Recognize Actions by Disentangling Components of Dynamics
Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains
Attention-aware Compositional Network for Person Re-Identification
HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification
Mask-guided Contrastive Attention Model for Person Re-Identification
Pose-Guided Photorealistic Face Rotation
Automatic 3D Indoor Scene Modeling from Single Panorama
SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion
A Biresolution Spectral framework for Product Quantization
Dynamic Zoom-in Network for Fast Object Detection in Large Images
On the Importance of Label Quality for Semantic Segmentation
EPINET: A Fully-Convolutional Neural Network for Light Field Depth Estimation by Using Epipolar Geometry
A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
Erase or Fill? Deep Joint Recurrent Rain Removal and Reconstruction in Videos
Scalable and Effective Deep CCA via Soft Decorrelation
High-order tensor regularization with application to attribute ranking
3D-RCNN: Instance-level 3D Scene Understanding via Render-and-Compare
FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds
Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Fully Convolutional Network
Decorrelated Batch Normalization
Unsupervised Textual Grounding: Linking Words to Image Concepts
Scale-recurrent Network for Deep Image Deblurring
Low-Shot Recognition with Imprinted Weights
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation
Facelet-Bank for Fast Portrait Manipulation
Duplex Generative Adversarial Network for Unsupervised Domain Adaptation
Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation
Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks
Structure Preserving Video Prediction
Tagging Like Humans: Diverse and Distinct Image Annotation
Learning to Sketch with Shortcut Cycle Consistency
GroupCap: Group-based Image Captioning with Structured Relevance and Diversity Constraints
Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks
Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning
Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
Detecting and Recognizing Human-Object Interactions
Augmenting Crowd-Sourced 3D Reconstructions using Semantic Detections
Visual Relationship Learning with a Factorization-based Prior
Re-weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation
Flow Guided Recurrent Neural Encoder for Video Salient Object Detection
Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment
Progressive Attention Guided Recurrent Network for Salient Object Detection
Answer with Grounding Snippets: Focal Visual-Text Attention for Visual Question Answering
Unsupervised Learning of Depth and Egomotion from Monocular Video Using 3D Geometric Constraints
Repulsion Loss: Detecting Pedestrians in a Crowd
PU-Net: Point Cloud Upsampling Network
Video Object Segmentation via Inference in A CNN-Based Higher-Order Spatio-Temporal MRF
PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Gated Fusion Network for Single Image Dehazing
Interleaved Structured Sparse Convolutional Neural Networks
Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks
End-to-end Flow Correlation Tracking with Spatial-temporal Attention
Left/Right Asymmetric Layer Skippable Networks
Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation
VITAL: VIsual Tracking via Adversarial Learning
RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints
Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints
Squeeze-and-Excitation Networks
Edit Probability for Scene Text Recognition | ||
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | ||
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | ||
Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning | ||
Learning to Localize Sound Source in Visual Scenes | ||
Dynamic Few-Shot Visual Learning without Forgetting | ||
Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features | ||
SINT++: Robust Visual Tracking via Adversarial Hard Positive Generation | ||
Real-Time Monocular Depth Estimation using Synthetic Data with Domain Adaptation via Image Style Transfer | ||
Fast and Accurate Single Image Super-Resolution via Information Distillation Network | ||
Low-Latency Video Semantic Segmentation | ||
Domain Adaptive Faster R-CNN for Object Detection in the Wild | ||
DoubleFusion: Real-time Capture of Human Performance with Inner Body Shape from a Single Depth Sensor | ||
Lean Multiclass Crowdsourcing | ||
Tell Me Where To Look: Guided Attention Inference Network | ||
Residual Dense Network for Image Super-Resolution | ||
Look at Boundary: A Boundary-Aware Face Alignment Algorithm | ||
Imagination-IQA: No-reference Image Quality Assessment via Adversarial Learning | ||
Memory Matching Networks for One-Shot Image Recognition | ||
3D Human Pose Estimation in the Wild by Adversarial Learning | ||
Unsupervised Training for 3D Morphable Model Regression | ||
Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective | ||
IQA: Visual Question Answering in Interactive Environments | ||
Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking | ||
Low-shot Learning from Imaginary Data | ||
Deep Regression Forests for Age Estimation | ||
Partial Transfer Learning with Selective Adversarial Networks | ||
A Bi-directional Message Passing Model for Salient Object Detection | ||
Transductive Unbiased Embedding for Zero-Shot Learning | ||
Scale-Transferrable Object Detection | ||
Crowd Counting with Deep Negative Correlation Learning | ||
Deep Cauchy Hashing for Hamming Space Retrieval | ||
Demo2Vec: Reasoning Object Affordances from Online Videos | ||
GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition | ||
An End-to-End TextSpotter with Explicit Alignment and Attention | ||
Stereoscopic Neural Style Transfer | ||
Bootstrapping the Performance of Webly Supervised Semantic Segmentation | ||
Learning Markov Clustering Networks for Scene Text Detection | ||
Collaborative and Adversarial Network for Unsupervised Domain Adaptation | ||
Reflection Removal for Large-Scale 3D Point Clouds | ||
Pose Transferrable Person Re-Identification | ||
Learning to Adapt Structured Output Space for Semantic Segmentation | ||
Efficient Diverse Ensemble for Discriminative Co-Tracking | ||
Learning a Single Convolutional Super-Resolution Network for Multiple Degradations | ||
Probabilistic Plant Modeling via Multi-View Image-to-Image Translation | ||
Learning to Parse Wireframes in Images of Man-Made Environments | ||
A Variational U-Net for Conditional Appearance and Shape Generation | ||
Learning to Find Good Correspondences | ||
Actor and Action Video Segmentation from a Sentence | ||
Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks | ||
Weakly-supervised Deep Convolutional Neural Network Learning for Facial Action Unit Intensity Estimation | ||
Maximum Classifier Discrepancy for Unsupervised Domain Adaptation | ||
Because of WeChat's length limit, the full list is not shown here. For the complete list, see the compilation by Amusi:
https://github.com/amusi/daily-paper-computer-vision