TCN v2 + 3D conv motion information

January 8, 2019, CreateAMind

tcn v2

tcn v1

Time Contrastive Networks

This implements "Time Contrastive Networks", which is part of the larger Self-Supervised Imitation Learning project.

Contacts

Maintainers of TCN:

  • Corey Lynch: github, twitter

  • Pierre Sermanet: github, twitter


  • Getting Started

    • Install Dependencies

    • Download the Inception v3 Checkpoint

    • Run all the tests

  • Concepts

    • Multi-view Webcam Video

    • Data Pipelines

    • Estimators

    • Models

    • Losses

    • Inference

    • Configuration

    • Monitoring Training

      • KNN Classification Error

      • Multi-view Alignment

    • Visualization

      • Nearest Neighbor Imitation Videos

      • PCA & T-SNE Visualization

  • Tutorial Part I: Collecting Multi-View Webcam Videos

    • Collect Webcam Videos

    • Create TFRecords

  • Tutorial Part II: Training, Evaluation, and Visualization

    • Download Data

    • Download the Inception v3 Checkpoint

    • Define a Config

    • Train

    • Evaluate

    • Monitor training

    • Visualize

      • Generate Imitation Videos

      • Run PCA & T-SNE Visualization

Getting started

Install Dependencies

  • TensorFlow nightly build, or install via pip install tf-nightly-gpu.

  • Bazel

  • matplotlib

  • sklearn

  • opencv

Download Pretrained InceptionV3 Checkpoint

Run the script that downloads the pretrained InceptionV3 checkpoint:

cd tensorflow-models/tcn

Run all the tests

bazel test :all


Concepts

Multi-View Webcam Video

We provide utilities to collect your own multi-view videos in dataset/. See the webcam tutorial for an end-to-end example of how to collect multi-view webcam data and convert it to the TFRecord format expected by this library.

Data Pipelines

We use the API to construct input pipelines that feed training, evaluation, and visualization. These pipelines are defined in


Estimators

We define training, evaluation, and inference behavior using the tf.estimator.Estimator API. See estimators/ for an example of how multi-view TCN training, evaluation, and inference are implemented.


Models

Different embedder architectures are implemented in We used the InceptionConvSSFCEmbedder in the pouring experiments, but we're also evaluating ResNet embedders.


Losses

We use the tf.contrib.losses.metric_learning library's implementations of triplet loss with semi-hard negative mining and npairs loss. In our experiments, npairs loss converges better empirically and produces the best qualitative visualizations, so it will likely be our choice for future experiments. See the paper for details on the algorithm.
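To make the npairs idea concrete, here is a minimal sketch in plain Python. It is not the tf.contrib.losses.metric_learning implementation: it assumes every (anchor, positive) pair in the batch belongs to its own class, omits any regularization term, and the function name npairs_loss is only illustrative. Each anchor is scored against every positive in the batch, and the loss is the softmax cross-entropy of picking its own positive.

```python
import math


def npairs_loss(anchors, positives):
    """Mean softmax cross-entropy where row i's correct class is column i.

    anchors, positives: lists of equal-length float vectors, where
    positives[i] is the positive example for anchors[i] (assumption).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Dot-product similarity of anchor i to every positive in the batch.
        logits = [sum(a * p for a, p in zip(anchor, pos)) for pos in positives]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy against the "diagonal" target.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Well-aligned pairs give a loss near zero; swapped pairs give a large loss.
aligned = npairs_loss([[5.0, 0.0], [0.0, 5.0]], [[5.0, 0.0], [0.0, 5.0]])
swapped = npairs_loss([[5.0, 0.0], [0.0, 5.0]], [[0.0, 5.0], [5.0, 0.0]])
```

The other pairs in the batch act as negatives for free, which is one reason this family of losses needs no explicit negative mining.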


Inference

We support 3 modes of inference for trained TCN models:

  • Mode 1: Input is a tf.Estimator input_fn (see this for details). Output is an iterator over embeddings and additional metadata. See for a usage example.

  • Mode 2: Input is a TFRecord (or list of TFRecords). This returns an iterator over tuples of (embeddings, raw_image_strings, sequence_name), where embeddings is the [num views, sequence length, embedding size] numpy array holding the full embedded sequence (for all views), raw_image_strings is a [num views, sequence length] string array holding the jpeg-encoded raw image strings, and sequence_name is the name of the sequence. See for a usage example.

  • Mode 3: Input is a numpy array of size [num images, height, width, num channels]. This returns a tuple of (embeddings, raw_image_strings), where embeddings is a 2-D float32 numpy array holding [num_images, embedding_size] image embeddings, and raw_image_strings is a 1-D string numpy array holding [batch_size] jpeg-encoded image strings. This can be used as follows:

    images = np.random.uniform(0, 1, size=(batch_size, 1080, 1920, 3))
    embeddings, _ = estimator.inference(
        images, checkpoint_path=checkpoint_path)

See estimators/ for details.


Configuration

Data pipelines, training, eval, and visualization are all configured using key-value parameters passed as YAML files. Configurations can be nested, e.g.:

learning:
  optimizer: 'adam'
  learning_rate: 0.001

T objects

YAML configs are converted to LuaTable-like T objects (see utils/), which behave like a Python dict but allow you to use dot notation to access (nested) keys. For example, we could access the learning rate in the above config snippet via config.learning.learning_rate.
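As an illustration of the dot-notation behavior described above, the following is a minimal sketch of a dict subclass with attribute access. The class name DotDict is hypothetical and this is not the library's T object; it only shows one way such an object could behave.

```python
class DotDict(dict):
    """Hypothetical stand-in for a T object: a dict with attribute access."""

    def __init__(self, mapping=None):
        super().__init__()
        # Convert nested dicts recursively so chained dot access works.
        for key, value in (mapping or {}).items():
            self[key] = DotDict(value) if isinstance(value, dict) else value

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .items() keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        self[key] = value


config = DotDict({"learning": {"optimizer": "adam", "learning_rate": 0.001}})
rate = config.learning.learning_rate      # dot notation
same_rate = config["learning"]["learning_rate"]  # regular dict access
```

Because the object is still a dict, it can be serialized or iterated like any other mapping.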

Multiple Configs

Multiple configs can be passed to the various binaries as a comma-separated list of config paths via the --config_paths flag. This allows us to specify a default config that applies to all experiments (e.g. how often to write checkpoints, default embedder hyperparams) and one config per experiment holding just the hyperparams specific to that experiment (path to data, etc.).

See configs/tcn_default.yml for an example of our default config and configs/pouring.yml for an example of how we define the pouring experiments.

Configs are applied left to right. For example, consider two config files:


default.yml

learning:
  learning_rate: 0.001 # Default learning rate.
  optimizer: 'adam'

myexperiment.yml

learning:
  learning_rate: 1.0 # Experiment learning rate (overwrites default).
data:
  training: '/path/to/myexperiment/training.tfrecord'

Running

bazel run --config_paths='default.yml,myexperiment.yml'

results in a final merged config called final_training_config.yml:

learning:
  optimizer: 'adam'
  learning_rate: 1.0
data:
  training: '/path/to/myexperiment/training.tfrecord'

which is created automatically and stored in the experiment log directory alongside model checkpoints and tensorboard summaries. This gives us a record of the exact configs that went into each trial.
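The left-to-right merge described above can be sketched as a recursive dictionary update. This is an illustrative sketch, not the library's code: merge_configs is a hypothetical helper, and the configs are written as already-parsed Python dicts rather than loaded from YAML files.

```python
import copy


def _merge_into(base, override):
    """Recursively copy values from override into base, in place."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides are nested sections: merge key by key.
            _merge_into(base[key], value)
        else:
            # Scalars and new sections overwrite whatever came before.
            base[key] = copy.deepcopy(value)


def merge_configs(configs):
    """Merge nested dicts left to right; later configs win on conflicts."""
    merged = {}
    for config in configs:
        _merge_into(merged, config)
    return merged


default = {"learning": {"learning_rate": 0.001, "optimizer": "adam"}}
experiment = {
    "learning": {"learning_rate": 1.0},
    "data": {"training": "/path/to/myexperiment/training.tfrecord"},
}
final = merge_configs([default, experiment])
```

Here final keeps the default optimizer, takes the experiment's learning rate, and gains the experiment's data section, matching the merged config shown above.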

Monitoring training

We usually look at two validation metrics during training: KNN classification error and multi-view alignment.

KNN-Classification Error

In cases where we have labeled validation data, we can compute the average cross-sequence KNN classification error (1.0 - recall@k=1) over all embedded labeled images in the validation set. See
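One plausible reading of this metric, sketched in plain Python: for every labeled embedding, find its nearest neighbor among embeddings from other sequences and count a miss when the labels differ. The function name, the squared Euclidean distance, and the flat list inputs are assumptions for illustration; the library's own evaluation code may aggregate differently.

```python
def knn_classification_error(embeddings, labels, sequence_ids):
    """Cross-sequence 1-NN error, i.e. 1.0 - recall@1.

    embeddings: list of float vectors; labels and sequence_ids: parallel lists.
    """
    errors = 0
    counted = 0
    for i, query in enumerate(embeddings):
        best_dist, best_label = None, None
        for j, candidate in enumerate(embeddings):
            if sequence_ids[j] == sequence_ids[i]:
                continue  # Cross-sequence: skip the query's own sequence.
            dist = sum((q - c) ** 2 for q, c in zip(query, candidate))
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        if best_dist is None:
            continue  # No frames from any other sequence to compare against.
        counted += 1
        errors += int(best_label != labels[i])
    if counted == 0:
        raise ValueError("need labeled embeddings from at least two sequences")
    return errors / counted


embeddings = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.0]]
sequence_ids = ["a", "a", "b", "b"]
error = knn_classification_error(embeddings, [0, 1, 0, 1], sequence_ids)
```

Excluding same-sequence neighbors matters because adjacent frames of one video are nearly identical and would make the error look artificially low.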

Multi-view Alignment

In cases where there is no labeled validation data, we can look at how well our model aligns multiple views of the same embedded validation sequence. That is, for each embedded validation sequence, for all cross-view pairs, we compute the scaled absolute distance between ground truth time indices and KNN time indices. See
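A minimal sketch of the alignment computation for one multi-view sequence follows. The scaling by sequence length, the squared Euclidean distance, and the function names are assumptions made for illustration; the text above only states that the distance between time indices is scaled.

```python
def alignment_error(view_a, view_b):
    """Mean |t - nn(t)| / len(view_b) for frames of view_a matched into view_b."""
    n = len(view_b)
    total = 0.0
    for t, query in enumerate(view_a):
        # Time index of the nearest neighbor of frame t in the other view.
        nearest_t = min(
            range(n),
            key=lambda j: sum((q - c) ** 2 for q, c in zip(query, view_b[j])),
        )
        total += abs(t - nearest_t) / n
    return total / len(view_a)


def multiview_alignment(views):
    """Average alignment error over all ordered cross-view pairs."""
    pairs = [
        (a, b) for a in range(len(views)) for b in range(len(views)) if a != b
    ]
    return sum(alignment_error(views[a], views[b]) for a, b in pairs) / len(pairs)


# Two views whose embeddings evolve in step are perfectly aligned.
in_step = multiview_alignment([[[0.0], [1.0], [2.0]], [[0.1], [1.1], [2.1]]])
# A time-reversed second view is badly aligned.
reversed_view = multiview_alignment([[[0.0], [1.0], [2.0]], [[2.0], [1.0], [0.0]]])
```

A value near 0 means that frames captured at the same moment from different cameras land close together in embedding space, which is the property the multi-view objective is meant to encourage.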


Visualization

We visualize the embedding space learned by our models in two ways: nearest neighbor imitation videos and PCA/T-SNE.

Nearest Neighbor Imitation Videos

One of the easiest ways to evaluate the understanding of your model is to see how well it can semantically align two videos via nearest neighbors in embedding space.

Consider the case where we have multiple validation demo videos of a human or robot performing the same task. For example, in the pouring experiments, we collected many different multiview validation videos of a person pouring the contents of one container into another, then setting the container down. If we'd like to see how well our embeddings generalize across viewpoint, object/agent appearance, and background, we can construct what we call "Nearest Neighbor Imitation" videos, by embedding some validation query sequence i from view 1, and finding the nearest neighbor for each query frame in some embedded target sequence j filmed from view 1. Here's an example of the final product.

See for details.

PCA & T-SNE Visualization

We can also embed a set of images taken randomly from validation videos and visualize the embedding space using PCA projection and T-SNE in the tensorboard projector. See for details.

Tutorial Part I: Collecting Multi-View Webcam Videos

Here we give an end-to-end example of how to collect your own multiview webcam videos and convert them to the TFRecord format expected by training.

Note: This was tested with up to 8 concurrent Logitech c930e webcams extended with Plugable 5 Meter (16 Foot) USB 2.0 Active Repeater Extension Cables.

Collect webcam videos

Go to dataset/

  1. Plug your webcams in and run

    ls -ltrh /dev/video*

    You should see one device listed per connected webcam.

  2. Define some environment variables describing the dataset you're collecting.

    dataset=tutorial  # Name of the dataset.
    mode=train  # E.g. 'train', 'validation', 'test', 'demo'.
    num_views=2 # Number of webcams.
    viddir=/tmp/tcn/videos # Output directory for the videos.
    tmp_imagedir=/tmp/tcn/tmp_images # Temp directory to hold images.
    debug_vids=1 # Whether or not to generate side-by-side debug videos.
    export DISPLAY=:0.0  # This allows real time matplotlib display.

  3. Run the script.

    bazel build -c opt --copt=-mavx webcam && \
    bazel-bin/webcam \
    --dataset $dataset \
    --mode $mode \
    --num_views $num_views \
    --tmp_imagedir $tmp_imagedir \
    --viddir $viddir \
    --debug_vids 1
  4. Hit Ctrl-C when done collecting, upon which the script will compile videos for each view and optionally a debug video concatenating multiple simultaneous views.

  5. If --seqname flag isn't set, the script will name the first sequence '0', the second sequence '1', and so on (meaning you can just keep rerunning step 3.). When you are finished, you should see an output viddir with the following structure:

    videos/N_viewM.mov

    for N sequences and M webcam views.

Create TFRecords

Use dataset/ to convert the directory of videos into a directory of TFRecords files, one per multi-view sequence.

videos=$viddir/$dataset
bazel build -c opt videos_to_tfrecords && \
bazel-bin/videos_to_tfrecords --logtostderr \
--input_dir $videos/$mode \
--output_dir ~/tcn_data/$dataset/$mode \
--max_per_shard 400

Setting --max_per_shard > 0 allows you to shard training data. We've observed that sharding long training sequences provides better performance in terms of global steps/sec.

This should be left at the default of 0 for validation / test data.
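To illustrate what a per-shard limit means, here is a small sketch that splits one sequence of frames into chunks of at most max_per_shard frames. How the conversion tool actually splits sequences is not described here; contiguous chunking and the helper name shard_sequence are assumptions.

```python
def shard_sequence(frames, max_per_shard=0):
    """Split frames into contiguous shards of at most max_per_shard items.

    A value of 0 (or less) keeps the whole sequence in a single shard,
    mirroring the default recommended for validation and test data.
    """
    frames = list(frames)
    if max_per_shard <= 0:
        return [frames]
    return [
        frames[start:start + max_per_shard]
        for start in range(0, len(frames), max_per_shard)
    ]


# A 1000-frame sequence with a limit of 400 frames per shard.
shards = shard_sequence(range(1000), max_per_shard=400)
shard_sizes = [len(shard) for shard in shards]
```

With these numbers the sequence ends up in three shards, the last one shorter than the others.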

You should now have a directory of TFRecords files with the following structure:


1 TFRecord file for each of N multi-view sequences.

Now we're ready to move on to part II: training, evaluation, and visualization.

Tutorial Part II: Training, Evaluation, and Visualization

Here we give an end-to-end example of how to train, evaluate, and visualize the embedding space learned by TCN models.

Download Data

We will be using the 'Multiview Pouring' dataset, which can be downloaded using the script here.

The rest of the tutorial will assume that you have your data downloaded to a folder at ~/tcn_data.

mkdir ~/tcn_data
mv ~/Downloads/ ~/tcn_data

You should now have the following path containing all the data:

ls ~/tcn_data/multiview-pouring
labels  README.txt  tfrecords  videos

Download Pretrained Inception Checkpoint

If you haven't already, run the script that downloads the pretrained InceptionV3 checkpoint:


Define A Config

For our experiment, we create 2 configs:

  • configs/tcn_default.yml: This contains all the default hyperparameters that generally don't vary across experiments.

  • configs/pouring.yml: This contains all the hyperparameters that are specific to the pouring experiment.

Important note about configs/pouring.yml:

  • data.eval_cropping: We use 'pad200' for the pouring dataset, which was filmed rather close up on iPhone cameras. A better choice for data filmed on webcam is likely 'crop_center'. See for options.


Train

Run the training binary:

logdir=/tmp/tcn/pouring
c=configs
configs=$c/tcn_default.yml,$c/pouring.yml
bazel build -c opt --copt=-mavx --config=cuda train && \
bazel-bin/train \
--config_paths $configs --logdir $logdir


Evaluate

Run the binary that computes running validation loss. Set export CUDA_VISIBLE_DEVICES= to run on CPU.

bazel build -c opt --copt=-mavx eval && \
bazel-bin/eval \
--config_paths $configs --logdir $logdir

Run the binary that computes running validation cross-view sequence alignment. Set export CUDA_VISIBLE_DEVICES= to run on CPU.

bazel build -c opt --copt=-mavx alignment && \
bazel-bin/alignment \
--config_paths $configs --checkpointdir $logdir --outdir $logdir

Run the binary that computes running labeled KNN validation error. Set export CUDA_VISIBLE_DEVICES= to run on CPU.

bazel build -c opt --copt=-mavx labeled_eval && \
bazel-bin/labeled_eval \
--config_paths $configs --checkpointdir $logdir --outdir $logdir

Monitor training

Run tensorboard --logdir=$logdir. After a bit of training, you should see curves that look like this:

[Figures: Training loss; Validation loss; Validation Alignment; Average Validation KNN Classification Error; Individual Validation KNN Classification Errors]

Visualize

To visualize the embedding space learned by a model, we can:

Generate Imitation Videos

# Use the automatically generated final config file as config.
configs=$logdir/final_training_config.yml
# Visualize checkpoint 40001.
checkpoint_iter=40001
# Use validation records for visualization.
records=~/tcn_data/multiview-pouring/tfrecords/val
# Write videos to this location.
outdir=$logdir/tcn_viz/imitation_vids
bazel build -c opt --config=cuda --copt=-mavx generate_videos && \
bazel-bin/generate_videos \
--config_paths $configs \
--checkpointdir $logdir \
--checkpoint_iter $checkpoint_iter \
--query_records_dir $records \
--target_records_dir $records \
--outdir $outdir

After the script completes, you should see a directory of videos with names like:


that look like this:

T-SNE / PCA Visualization

Run the binary that generates embeddings and metadata.

bazel build -c opt --config=cuda --copt=-mavx visualize_embeddings && \
bazel-bin/visualize_embeddings \
--config_paths $configs \
--checkpointdir $logdir \
--checkpoint_iter $checkpoint_iter \
--embedding_records $records \
--outdir $outdir \
--num_embed 1000 \
--sprite_dim 64

Run tensorboard, pointed at the embedding viz output directory.

tensorboard --logdir=$outdir

You should see something like this in tensorboard.


