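[Introduction] IEEE CVPR, one of the most influential academic conferences in computer vision, will be held in Salt Lake City, USA, from June 18 to 22, 2018. According to the CVPR website, this year's conference received more than 3,300 paper submissions, of which 979 were accepted; compared with last year's 783 accepted papers, that is an increase of about 25%.
The full list of accepted papers was released recently; see: http://cvpr2018.thecvf.com/files/cvpr_2018_final_accept_list.txt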
https://github.com/amusi/daily-paper-computer-vision/blob/master/2018/cvpr2018-paper-list.csv
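The figures quoted in the introduction can be sanity-checked with a few lines of Python. This is only a rough sketch: the function names are illustrative, and because the source says "more than 3,300" submissions, treating that figure as exactly 3,300 is an assumption that makes the computed acceptance rate an upper bound rather than an official number.

```python
# Figures quoted in the introduction.
SUBMITTED_2018 = 3300  # assumption: "more than 3,300" is treated as a lower bound
ACCEPTED_2018 = 979
ACCEPTED_2017 = 783


def growth(new: int, old: int) -> float:
    """Relative increase from old to new, as a fraction."""
    return (new - old) / old


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Share of submissions that were accepted, as a fraction."""
    return accepted / submitted


if __name__ == "__main__":
    # Year-over-year growth in accepted papers.
    print(f"growth: {growth(ACCEPTED_2018, ACCEPTED_2017):.1%}")
    # Upper bound on the acceptance rate, since submissions exceed 3,300.
    print(f"acceptance rate (at most): {acceptance_rate(ACCEPTED_2018, SUBMITTED_2018):.1%}")
```

Under these assumptions the growth works out to roughly a quarter, consistent with the "about 25%" quoted above, and the acceptance rate comes out just under 30%.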
▌Paper list:
| Single-Shot Refinement Neural Network for Object Detection | ||
| Video Captioning via Hierarchical Reinforcement Learning | ||
| DensePose: Multi-Person Dense Human Pose Estimation In The Wild | ||
| Frustum PointNets for 3D Object Detection from RGB-D Data | ||
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | ||
| Rethinking the Faster R-CNN Architecture for Temporal Action Localization | ||
| Shape from Shading through Shape Evolution | ||
| A High-Quality Denoising Dataset for Smartphone Cameras | ||
| Improving Color Reproduction Accuracy in the Camera Imaging Pipeline | ||
| End-to-End Dense Video Captioning with Masked Transformer | ||
| pOSE: Pseudo Object Space Error for Initialization-Free Bundle Adjustment | ||
| Learning to Segment Every Thing | ||
| Density-aware Single Image De-raining using a Multi-stream Dense Network | ||
| Densely Connected Pyramid Dehazing Network | ||
| Embodied Question Answering | ||
| TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays | ||
| Towards Open-Set Identity Preserving Face Synthesis | ||
| Baseline Desensitizing In Translation Averaging | ||
| Learning from the Deep: A Revised Underwater Image Formation Model | ||
| Context Encoding for Semantic Segmentation | ||
| Deep Texture Manifold for Ground Terrain Recognition | ||
| DS*: Tighter Lifting-Free Convex Relaxations for Quadratic Matching Problems | ||
| Sparse, Smart Contours to Represent and Edit Images | ||
| Every Smile is Unique: Landmark-guided Diverse Smile Generation | ||
| Generative Non-Rigid Shape Completion with Graph Convolutional Autoencoders | ||
| Learning a Discriminative Prior for Blind Image Deblurring | ||
| Attentional ShapeContextNet for Point Cloud Recognition | ||
| Learning Superpixels with Segmentation-Aware Affinity Loss | ||
| Real-World Repetition Estimation by Div, Grad and Curl | ||
| Recurrent Saliency Transformation Network: Incorporating Multi-Stage Visual Cues for Small Organ Segmentation | ||
| MegaDepth: Learning Single-View Depth Prediction from Internet Photos | ||
| Learning Intrinsic Image Decomposition from Watching the World | ||
| Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering | ||
| Human-centric Indoor Scene Synthesis Using Stochastic Grammar | ||
| Learning by Asking Questions | ||
| Instance Embedding Transfer to Unsupervised Video Object Segmentation | ||
| Detect-and-Track: Efficient Pose Estimation in Videos | ||
| Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval | ||
| Guided Proofreading of Automatic Segmentations for Connectomics | ||
| Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation | ||
| Context-aware Synthesis for Video Frame Interpolation | ||
| 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning | ||
| NAG: Network for Adversary Generation | ||
| LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation | ||
| Avatar-Net: Multi-scale Zero-shot Style Transfer by Feature Decoration | ||
| Multi-view Harmonized Bilinear Network for 3D Object Recognition | ||
| Tangent Convolutions for Dense Prediction in 3D | ||
| Semi-parametric Image Synthesis | ||
| Interactive Image Segmentation with Latent Diversity | ||
| 3D Hand Pose Estimation: From Current Achievements to Future Goals | ||
| W2F: A Weakly-Supervised to Fully-Supervised Framework for Object Detection | ||
| BlockDrop: Dynamic Inference Paths in Residual Networks | ||
| MapNet: Geometry-Aware Learning of Maps for Camera Localization | ||
| BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning | ||
| Salient Object Detection Driven by Fixation Prediction | ||
| 3D Object Detection with Latent Support Surfaces | ||
| Practical Block-wise Neural Network Architecture Generation | ||
| Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points | ||
| Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning | ||
| Visual Grounding via Accumulated Attention | ||
| Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors | ||
| ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing | ||
| Perturbative Neural Networks: Rethinking Convolution in CNNs | ||
| Nonlinear 3D Face Morphable Model | ||
| Neural Baby Talk | ||
| Towards Pose Invariant Face Recognition in the Wild | ||
| MoNet: Deep Motion Exploitation for Video Object Segmentation | ||
| Exploring Disentangled Feature Representation Beyond Face Identification | ||
| Towards Effective Low-bitwidth Convolutional Neural Networks | ||
| Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries | ||
| Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering | ||
| Few-Shot Image Recognition by Predicting Parameters from Activations | ||
| Single-Shot Object Detection with Enriched Semantics | ||
| Unifying Identification and Context Learning for Person Recognition | ||
| Separating Self-Expression and Visual Content in Hashtag Supervision | ||
| Multi-Cue Correlation Filters for Robust Visual Tracking | ||
| Beyond Trade-off: Accelerate FCN-based Face Detection with Higher Accuracy | ||
| On the Robustness of Semantic Segmentation Models to Adversarial Attacks | ||
| PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume | ||
| Illuminant Spectra-based Source Separation Using Flash Photography | ||
| Tracking Multiple Objects Outside the Line of Sight using Speckle Imaging | ||
| Improved Human Pose Estimation through Adversarial Data Augmentation | ||
| Generative Adversarial Learning Towards Fast Weakly Supervised Detection | ||
| Audio to Body Dynamics | ||
| The Unreasonable Effectiveness of Deep Features as a Perceptual Metric | ||
| Frame-Recurrent Video Super-Resolution | ||
| Deep Mutual Learning | ||
| Real-world Anomaly Detection in Surveillance Videos | ||
| Soccer on Your Tabletop | ||
| Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification | ||
| HashGAN: Deep Learning to Hash with Pair Conditional Wasserstein GAN | ||
| Excitation Backprop for RNNs | ||
| Dynamic-Structured Semantic Propagation Network | ||
| Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation | ||
| SPLATNet: Sparse Lattice Networks for Point Cloud Processing | ||
| Video Representation Learning Using Discriminative Pooling | ||
| Attend and Interact: Higher-Order Object Interactions for Video Understanding | ||
| Human Pose Estimation with Parsing Induced Learner | ||
| 4D Human Body Correspondences from Panoramic Depth Maps | ||
| Recognizing Human Actions as Evolution of Pose Estimation Maps | ||
| GraphBit: Bitwise Interaction Mining via Deep Reinforcement Learning | ||
| Deep Adversarial Metric Learning | ||
| Revisiting Video Saliency: A Large-scale Benchmark and a New Model | ||
| Graph-Cut RANSAC | ||
| Five-point Fundamental Matrix Estimation for Uncalibrated Cameras | ||
| Hashing as Tie-Aware Learning to Rank | ||
| Optimizing Local Feature Descriptors for Nearest Neighbor Matching | ||
| Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies | ||
| Consensus Maximization for Semantic Region Correspondences | ||
| ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing | ||
| Motion-Guided Cascaded Refinement Network for Video Object Segmentation | ||
| Zigzag Learning for Weakly Supervised Object Detection | ||
| Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models | ||
| VITON: An Image-based Virtual Try-on Network | ||
| Cross-Domain Self-supervised Multi-task Feature Learning Using Synthetic Game Imagery | ||
| LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image | ||
| Thoracic Disease Identification and Localization with Limited Supervision | ||
| Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks | ||
| Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation | ||
| Deep End-to-End Time-of-Flight Imaging | ||
| Fast and Accurate Online Video Object Segmentation via Tracking Parts | ||
| Min-Entropy Latent Model for Weakly Supervised Object Detection | ||
| Future Frame Prediction for Anomaly Detection - A New Baseline | ||
| Face Aging with Identity-Preserved Conditional Generative Adversarial Networks | ||
| Learning to Compare: Relation Network for Few-Shot Learning | ||
| Deep Layer Aggregation | ||
| Style Aggregated Network for Facial Landmark Detection | ||
| M3: Multimodal Memory Modelling for Video Captioning | ||
| Classification Driven Dynamic Image Enhancement | ||
| Generative Image Inpainting with Contextual Attention | ||
| Iterative Visual Reasoning Beyond Convolutions | ||
| Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification | ||
| Textbook Question Answering under Teacher Guidance with Memory Networks | ||
| Multi-Level Factorisation Net for Person Re-Identification | ||
| Functional Map of the World | ||
| A Two-Step Disentanglement Method | ||
| Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization | ||
| Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? | ||
| Left-Right Comparative Recurrent Model for Stereo Matching | ||
| Analytic Expressions for Probabilistic Moments of PL-DNN with Gaussian Input | ||
| Zero-Shot Sketch-Image Hashing | ||
| Interpretable Convolutional Neural Networks | ||
| Reconstructing Thin Structures of Manifold Surfaces by Integrating Spatial Curves | ||
| Enhancing the Spatial Resolution of Stereo Images using a Parallax Prior | ||
| Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB | ||
| Generating Synthetic X-ray Images of a Person from the Surface Geometry | ||
| Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification | ||
| Unsupervised CCA | ||
| Discovering Point Lights with Intensity Distance Fields | ||
| Universal Denoising Networks: A Novel CNN-based Network Architecture for Image Denoising | ||
| Easy Identification from Better Constraints: Multi-Shot Person Re-Identification from Reference Constraints | ||
| Recurrent Pixel Embedding for Instance Grouping | ||
| Recurrent Scene Parsing with Perspective Understanding in the Loop | ||
| Learning to Hash by Discrepancy Minimization | ||
| Fast End-to-End Trainable Guided Filter | ||
| Disentangling Structure and Aesthetics for Content-aware Image Completion | ||
| An Analysis of Scale Invariance in Object Detection - SNIP | ||
| CSGNet: Neural Shape Parser for Constructive Solid Geometry | ||
| Finding Tiny Faces in the Wild with Generative Adversarial Network | ||
| SSNet: Scale Selection Network for Online 3D Action Prediction | ||
| Integrated facial landmark localization and super-resolution of real-world very low resolution faces in arbitrary poses with GANs | ||
| The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation | ||
| In-Place Activated BatchNorm for Memory-Optimized Training of DNNs | ||
| Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks | ||
| Deep Cross-media Knowledge Transfer | ||
| Coupled End-to-end Transfer Learning with Generalized Fisher Information | ||
| Knowledge Aided Consistency for Weakly Supervised Phrase Grounding | ||
| Viewpoint-aware Attentive Multi-view Inference for Vehicle Re-identification | ||
| MatNet: Modular Attention Network for Referring Expression Comprehension | ||
| CBMV: A Coalesced Bidirectional Matching Volume for Disparity Estimation | ||
| NISP: Pruning Networks using Neuron Importance Score Propagation | ||
| Who Let The Dogs Out? Modeling Dog Behavior From Visual Data | ||
| Efficient Video Object Segmentation via Network Modulation | ||
| Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision | ||
| Feedback-prop: Convolutional Neural Network Inference under Partial Evidence | ||
| A Memory Network Approach for Story-based Temporal Summarization of 360° Videos | ||
| Improving Occlusion and Hard Negative Handling for Single-Stage Object Detectors | ||
| UV-GAN: Adversarial Facial UV Map Completion for Pose-invariant Face Recognition | ||
| Learning a Toolchain for Image Restoration | ||
| Learning to Act Properly: Predicting and Explaining Affordances from Images | ||
| Learning a Discriminative Feature Network for Semantic Segmentation | ||
| Optimizing Video Object Detection via a Scale-Time Lattice | ||
| ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices | ||
| Cascaded Pyramid Network for Multi-Person Pose Estimation | ||
| Seeing Temporal Modulation of Lights from Standard Cameras | ||
| Point-wise Convolutional Neural Networks | ||
| Fine-grained Video Captioning for Sports Narrative | ||
| Dense 3D Regression for Hand Pose Estimation | ||
| Missing Slice Recovery for Tensors Using a Low-rank Model in Embedded Space | ||
| Learning Convolutional Networks for Content-weighted Image Compression | ||
| Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking | ||
| Deep Cost-Sensitive and Order-Preserving Feature Learning for Cross-Population Age Estimation | ||
| First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations | ||
| Hand PointNet: 3D Hand Pose Estimation using Point Sets | ||
| Recovering Realistic Texture in Image Super-resolution by Spatial Feature Modulation | ||
| Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos | ||
| A Face to Face Neural Conversation Model | ||
| SurfConv: Bridging 3D and 2D Convolution for RGBD Images | ||
| Dynamic Video Segmentation Network | ||
| Multiple Granularity Group Interaction Prediction | ||
| Visual Question Reasoning on General Dependency Tree | ||
| From Lifestyle VLOGs to Everyday Interactions | ||
| COCO-Stuff: Thing and Stuff Classes in Context | ||
| GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB | ||
| Non-local Neural Networks | ||
| Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs | ||
| Taskonomy: Disentangling Task Transfer Learning | ||
| Embodied Real-World Active Perception | ||
| SfSNet: Learning Shape, Reflectance and Illuminance of Faces 'in the wild' | ||
| End-to-end Recovery of Human Shape and Pose | ||
| Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene | ||
| Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction | ||
| A Fast Resection-Intersection Method for the Known Rotation Problem | ||
| Image Generation from Scene Graphs | ||
| What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets | ||
| PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation | ||
| High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs | ||
| Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks | ||
| Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | ||
| Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Video | ||
| Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatio-temporal Patterns | ||
| Kernelized Subspace Pooling for Deep Local Descriptors | ||
| Video Rain Removal By Multiscale Convolutional Sparse Coding | ||
| Learning from Millions of 3D Scans for Large-scale 3D Face Recognition | ||
| Referring Relationships | ||
| Improving Object Localization with Fitness NMS and Bounded IoU Loss | ||
| Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination | ||
| CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization | ||
| Visual Question Generation as Dual Task of Visual Question Answering | ||
| Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation | ||
| Learning Dual Convolutional Neural Networks for Low-Level Vision | ||
| Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation | ||
| MegDet: A Large Mini-Batch Object Detector | ||
| AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks | ||
| TOM-Net: Learning Transparent Object Matting from a Single Image | ||
| End-to-End Deep Kronecker-Product Matching for Person Re-identification | ||
| Semantic Visual Localization | ||
| Joint Cuts and Matching of Partitions in One Graph | ||
| Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions | ||
| Crowd Counting via Adversarial Cross-Scale Consistency Pursuit | ||
| Deep Group-shuffling Random Walk for Person Re-identification | ||
| Learning to Detect Features in Texture Images | ||
| Transferable Joint Attribute-Identity Deep Learning for Unsupervised Person Re-Identification | ||
| CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles | ||
| Context-aware Deep Feature Compression for High-speed Visual Tracking | ||
| Deep Material-aware Cross-spectral Stereo Matching | ||
| Deep Extreme Cut: From Extreme Points to Object Segmentation | ||
| Label Denoising Adversarial Network (LDAN) for Inverse Lighting of Face Images | ||
| Harmonious Attention Network for Person Re-Identification | ||
| Unsupervised Deep Generative Adversarial Hashing Network | ||
| Pseudo-Mask Augmented Object Detection | ||
| LSTM stack-based Neural Multi-sequence Alignment TeCHnique (NeuMATCH) | ||
| Adversarial Complementary Learning for Weakly Supervised Object Localization | ||
| Unsupervised Discovery of Object Landmarks as Structural Representations | ||
| DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map | ||
| Monocular Relative Depth Perception with Web Stereo Data Supervision | ||
| Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification | ||
| Objects as context for detecting their semantic parts | ||
| Camera Style Adaptation for Person Re-identification | ||
| Conditional Generative Adversarial Network for Structured Domain Adaptation | ||
| Rotation-sensitive Regression for Oriented Scene Text Detection | ||
| Residual Parameter Transfer for Deep Domain Adaptation | ||
| SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation | ||
| Weakly Supervised Instance Segmentation using Class Peak Response | ||
| Robust Facial Landmark Detection via a Fully-Convolutional Local-Global Context Network | ||
| Rotation Averaging and Strong Duality | ||
| PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning | ||
| Im2Flow: Motion Hallucination from Static Images for Action Recognition | ||
| Feature Quantization for Defending Against Distortion of Images | ||
| End-to-end weakly-supervised semantic alignment | ||
| PointGrid: A Deep Network for 3D Shape Understanding | ||
| Imagine it for me: Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts | ||
| A Minimalist Approach to Type-Agnostic Detection of Quadrics in Point Clouds | ||
| A Benchmark for Articulated Human Pose Estimation and Tracking | ||
| Boosting Self-Supervised Learning via Knowledge Transfer | ||
| PPFNet: Global Context Aware Local Features for Robust 3D Point Matching | ||
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | ||
| Fast Video Object Segmentation by Reference-Guided Mask Propagation | ||
| Super-Resolving Very Low-Resolution Face Images with Supplementary Attributes | ||
| Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding | ||
| One-shot Action Localization by Sequence Matching Network | ||
| Efficient Subpixel Refinement with Symbolic Linear Predictors | ||
| Distort-and-Recover: Color Enhancement using Deep Reinforcement Learning | ||
| Group Consistent Similarity Learning via Deep CRFs for Person Re-Identification | ||
| Single Image Reflection Separation with Perceptual Losses | ||
| AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | ||
| Recognize Actions by Disentangling Components of Dynamics | ||
| Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains | ||
| Attention-aware Compositional Network for Person Re-Identification | ||
| HATS: Histograms of Averaged Time Surfaces for Robust Event-based Object Classification | ||
| Mask-guided Contrastive Attention Model for Person Re-Identification | ||
| Pose-Guided Photorealistic Face Rotation | ||
| Automatic 3D Indoor Scene Modeling from Single Panorama | ||
| SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion | ||
| A Biresolution Spectral framework for Product Quantization | ||
| Dynamic Zoom-in Network for Fast Object Detection in Large Images | ||
| On the Importance of Label Quality for Semantic Segmentation | ||
| EPINET: A Fully-Convolutional Neural Network for Light Field Depth Estimation by Using Epipolar Geometry | ||
| A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking | ||
| Erase or Fill? Deep Joint Recurrent Rain Removal and Reconstruction in Videos | ||
| Scalable and Effective Deep CCA via Soft Decorrelation | ||
| High-order tensor regularization with application to attribute ranking | ||
| 3D-RCNN: Instance-level 3D Scene Understanding via Render-and-Compare | ||
| FoldingNet: Interpretable Unsupervised Learning on 3D Point Clouds | ||
| Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Fully Convolutional Network | ||
| Decorrelated Batch Normalization | ||
| Unsupervised Textual Grounding: Linking Words to Image Concepts | ||
| Scale-recurrent Network for Deep Image Deblurring | ||
| Low-Shot Recognition with Imprinted Weights | ||
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | ||
| Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation | ||
| Facelet-Bank for Fast Portrait Manipulation | ||
| Duplex Generative Adversarial Network for Unsupervised Domain Adaptation | ||
| Quantization of Fully Convolutional Networks for Accurate Biomedical Image Segmentation | ||
| Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks | ||
| Structure Preserving Video Prediction | ||
| Tagging Like Humans: Diverse and Distinct Image Annotation | ||
| Learning to Sketch with Shortcut Cycle Consistency | ||
| GroupCap: Group-based Image Captioning with Structured Relevance and Diversity Constraints | ||
| Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks | ||
| Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning | ||
| Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective | ||
| NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning | ||
| Detecting and Recognizing Human-Object Interactions | ||
| Augmenting Crowd-Sourced 3D Reconstructions using Semantic Detections | ||
| Visual Relationship Learning with a Factorization-based Prior | ||
| Re-weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation | ||
| Flow Guided Recurrent Neural Encoder for Video Salient Object Detection | ||
| Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment | ||
| Progressive Attention Guided Recurrent Network for Salient Object Detection | ||
| Answer with Grounding Snippets: Focal Visual-Text Attention for Visual Question Answering | ||
| Unsupervised Learning of Depth and Egomotion from Monocular Video Using 3D Geometric Constraints | ||
| Repulsion Loss: Detecting Pedestrians in a Crowd | ||
| PU-Net: Point Cloud Upsampling Network | ||
| Video Object Segmentation via Inference in A CNN-Based Higher-Order Spatio-Temporal MRF | ||
| PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection | ||
| Gated Fusion Network for Single Image Dehazing | ||
| Interleaved Structured Sparse Convolutional Neural Networks | ||
| Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks | ||
| End-to-end Flow Correlation Tracking with Spatial-temporal Attention | ||
| Left/Right Asymmetric Layer Skippable Networks | ||
| Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation | ||
| VITAL: VIsual Tracking via Adversarial Learning | ||
| RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints | ||
| Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints | ||
| Squeeze-and-Excitation Networks | ||
| Edit Probability for Scene Text Recognition | ||
| Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | ||
| Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning | ||
| Learning to Localize Sound Source in Visual Scenes | ||
| Dynamic Few-Shot Visual Learning without Forgetting | ||
| Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features | ||
| SINT++: Robust Visual Tracking via Adversarial Hard Positive Generation | ||
| Real-Time Monocular Depth Estimation using Synthetic Data with Domain Adaptation via Image Style Transfer | ||
| Fast and Accurate Single Image Super-Resolution via Information Distillation Network | ||
| Low-Latency Video Semantic Segmentation | ||
| Domain Adaptive Faster R-CNN for Object Detection in the Wild | ||
| DoubleFusion: Real-time Capture of Human Performance with Inner Body Shape from a Single Depth Sensor | ||
| Lean Multiclass Crowdsourcing | ||
| Tell Me Where To Look: Guided Attention Inference Network | ||
| Residual Dense Network for Image Super-Resolution | ||
| Look at Boundary: A Boundary-Aware Face Alignment Algorithm | ||
| Imagination-IQA: No-reference Image Quality Assessment via Adversarial Learning | ||
| Memory Matching Networks for One-Shot Image Recognition | ||
| 3D Human Pose Estimation in the Wild by Adversarial Learning | ||
| Unsupervised Training for 3D Morphable Model Regression | ||
| Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective | ||
| IQA: Visual Question Answering in Interactive Environments | ||
| Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking | ||
| Low-shot Learning from Imaginary Data | ||
| Deep Regression Forests for Age Estimation | ||
| Partial Transfer Learning with Selective Adversarial Networks | ||
| A Bi-directional Message Passing Model for Salient Object Detection | ||
| Transductive Unbiased Embedding for Zero-Shot Learning | ||
| Scale-Transferrable Object Detection | ||
| Crowd Counting with Deep Negative Correlation Learning | ||
| Deep Cauchy Hashing for Hamming Space Retrieval | ||
| Demo2Vec: Reasoning Object Affordances from Online Videos | ||
| GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition | ||
| An End-to-End TextSpotter with Explicit Alignment and Attention | ||
| Stereoscopic Neural Style Transfer | ||
| Bootstrapping the Performance of Webly Supervised Semantic Segmentation | ||
| Learning Markov Clustering Networks for Scene Text Detection | ||
| Collaborative and Adversarial Network for Unsupervised domain adaptation | ||
| Reflection Removal for Large-Scale 3D Point Clouds | ||
| Pose Transferrable Person Re-Identification | ||
| Learning to Adapt Structured Output Space for Semantic Segmentation | ||
| Efficient Diverse Ensemble for Discriminative Co-Tracking | ||
| Learning a Single Convolutional Super-Resolution Network for Multiple Degradations | ||
| Probabilistic Plant Modeling via Multi-View Image-to-Image Translation | ||
| Learning to Parse Wireframes in Images of Man-Made Environments | ||
| A Variational U-Net for Conditional Appearance and Shape Generation | ||
| Learning to Find Good Correspondences | ||
| Actor and Action Video Segmentation from a Sentence | ||
| Towards a Mathematical Understanding of the Difficulty in Learning with Feedforward Neural Networks | ||
| Weakly-supervised Deep Convolutional Neural Network Learning for Facial Action Unit Intensity Estimation | ||
| Maximum Classifier Discrepancy for Unsupervised Domain Adaptation | ||
Because of WeChat's character limit, the full list is not shown here. For the complete list, see the one compiled by Amusi:
https://github.com/amusi/daily-paper-computer-vision