Inspired by the conclusion that human choose the visual cortex regions which corresponding to the real size of the object to analyze the features of the object, when realizing the objects in the real world. This paper presents a framework -- SizeNet which based on both the real sizes and the features of objects, to solve objects recognition problems. SizeNet was used for the objects recognition experiments on the homemade Rsize dataset, and compared with State-of-the-art Methods AlexNet, VGG-16, Inception V3, Resnet-18 DenseNet-121. The results show that SizeNet provides much higher accuracy rates for the objects recognition than the other algorithms. SizeNet can solve the two problems that correctly recognize the objects whose features are highly similar but the real sizes are obviously different from each other, and correctly distinguish the target object from the interference objects whose real sizes are obviously different from the target object. This is because SizeNet recognizes the object based not only the features, but also the real size. The real size of the object can help to exclude the interference object categories whose real size ranges do not match the real size of the object, which greatly reducing the object categories' number in the label set used for the downstream object recognition based on object features. SizeNet is of great significance to the study of interpretable computer vision. Our code and dataset will be made public.
An increasing number of applications in the computer vision domain, specially, in medical imaging and remote sensing, are challenging when the goal is to classify very large images with tiny objects. More specifically, these type of classification tasks face two key challenges: $i$) the size of the input image in the target dataset is usually in the order of megapixels, however, existing deep architectures do not easily operate on such big images due to memory constraints, consequently, we seek a memory-efficient method to process these images; and $ii$) only a small fraction of the input images are informative of the label of interest, resulting in low region of interest (ROI) to image ratio. However, most of the current convolutional neural networks (CNNs) are designed for image classification datasets that have relatively large ROIs and small image size (sub-megapixel). Existing approaches have addressed these two challenges in isolation. We present an end-to-end CNN model termed Zoom-In network that leverages hierarchical attention sampling for classification of large images with tiny objects using a single GPU. We evaluate our method on two large-image datasets and one gigapixel dataset. Experimental results show that our model achieves higher accuracy than existing methods while requiring less computing resources.
This paper presents a new proposal of an efficient computational model of face and object recognition which uses cues from the distributed face and object recognition mechanism of the brain, and by gathering engineering equivalent of these cues from existing literature. Three distinct and widely used features, Histogram of Oriented Gradients, Local Binary Patterns, and Principal components extracted from target images are used in a manner which is simple, and yet effective. Our model uses multi-layer perceptrons (MLP) to classify these three features and fuse them at the decision level using sum rule. A computational theory is first developed by using concepts from the information processing mechanism of the brain. Extensive experiments are carried out using fifteen publicly available datasets to validate the performance of our proposed model in recognizing faces and objects with extreme variation of illumination, pose angle, expression, and background. Results obtained are extremely promising when compared with other face and object recognition algorithms including CNN and deep learning based methods. This highlights that simple computational processes, if clubbed properly, can produce competing performance with best algorithms.
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code will be made publicly available in PyTorch.
State-of-the-art deep convolutional networks (DCNs) such as squeeze-and- excitation (SE) residual networks implement a form of attention, also known as contextual guidance, which is derived from global image features. Here, we explore a complementary form of attention, known as visual saliency, which is derived from local image features. We extend the SE module with a novel global-and-local attention (GALA) module which combines both forms of attention -- resulting in state-of-the-art accuracy on ILSVRC. We further describe ClickMe.ai, a large-scale online experiment designed for human participants to identify diagnostic image regions to co-train a GALA network. Adding humans-in-the-loop is shown to significantly improve network accuracy, while also yielding visual features that are more interpretable and more similar to those used by human observers.
Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness since it cannot be influenced by illumination variations. However, there are many unwanted attributes that constrain the potentials of TIR tracking, such as the absence of visual color patterns and low resolutions. Recently, structured output support vector machine (SOSVM) and discriminative correlation filter (DCF) have been successfully applied to visible object tracking, respectively. Motivated by these, in this paper, we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve the tracking performance, we employ the spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to significantly update the operators. Our approach not only inherits the advantage of the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and more dense samples. To the best of our knowledge, we are the first to incorporate the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e. VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness with sufficient frame rate.
Classifying large scale networks into several categories and distinguishing them according to their fine structures is of great importance with several applications in real life. However, most studies of complex networks focus on properties of a single network but seldom on classification, clustering, and comparison between different networks, in which the network is treated as a whole. Due to the non-Euclidean properties of the data, conventional methods can hardly be applied on networks directly. In this paper, we propose a novel framework of complex network classifier (CNC) by integrating network embedding and convolutional neural network to tackle the problem of network classification. By training the classifiers on synthetic complex network data and real international trade network data, we show CNC can not only classify networks in a high accuracy and robustness, it can also extract the features of the networks automatically.
This paper proposes an Agile Aggregating Multi-Level feaTure framework (Agile Amulet) for salient object detection. The Agile Amulet builds on previous works to predict saliency maps using multi-level convolutional features. Compared to previous works, Agile Amulet employs some key innovations to improve training and testing speed while also increase prediction accuracy. More specifically, we first introduce a contextual attention module that can rapidly highlight most salient objects or regions with contextual pyramids. Thus, it effectively guides the learning of low-layer convolutional features and tells the backbone network where to look. The contextual attention module is a fully convolutional mechanism that simultaneously learns complementary features and predicts saliency scores at each pixel. In addition, we propose a novel method to aggregate multi-level deep convolutional features. As a result, we are able to use the integrated side-output features of pre-trained convolutional networks alone, which significantly reduces the model parameters leading to a model size of 67 MB, about half of Amulet. Compared to other deep learning based saliency methods, Agile Amulet is of much lighter-weight, runs faster (30 fps in real-time) and achieves higher performance on seven public benchmarks in terms of both quantitative and qualitative evaluation.
Inspired by predictive coding in neuroscience, we designed a bi-directional and recurrent neural net, namely deep predictive coding networks (PCN). It uses convolutional layers in both feedforward and feedback networks, and recurrent connections within each layer. Feedback connections from a higher layer carry the prediction of its lower-layer representation; feedforward connections carry the prediction errors to its higher-layer. Given image input, PCN runs recursive cycles of bottom-up and top-down computation to update its internal representations to reduce the difference between bottom-up input and top-down prediction at every layer. After multiple cycles of recursive updating, the representation is used for image classification. In training, the classification error backpropagates across layers and in time. With benchmark data (CIFAR-10/100, SVHN, and MNIST), PCN was found to always outperform its feedforward-only counterpart: a model without any mechanism for recurrent dynamics, and its performance tended to improve given more cycles of computation over time. In short, PCN reuses a single architecture to recursively run bottom-up and top-down process, enabling an increasingly longer cascade of non-linear transformation. For image classification, PCN refines its representation over time towards more accurate and definitive recognition.
Visual Question Answering (VQA) is a novel problem domain where multi-modal inputs must be processed in order to solve the task given in the form of a natural language. As the solutions inherently require to combine visual and natural language processing with abstract reasoning, the problem is considered as AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction we decided to develop a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm the promises and show significant, few percent improvements of accuracy on the complex "counting" task.
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.