Abstract reasoning is a key indicator of intelligence. The ability to hypothesise, to develop abstract concepts from concrete observations, and to apply these hypotheses to justify future actions has been paramount in human development. An existing line of research on equipping intelligent machines with abstract reasoning capabilities revolves around Raven's Progressive Matrices (RPM), a multiple-choice visual puzzle in which one must identify the missing component that completes the pattern. Recent years have seen many breakthroughs in supervised approaches to solving RPM. However, since supervised training relies on external assistance in the form of labelled answers, we cannot claim that machines trained this way have achieved reasoning ability comparable to humans. In contrast, once the RPM rule that relations can only exist row-wise or column-wise is properly introduced, humans can solve RPM problems without supervision or prior experience. In this paper, we introduce the pairwise relations discriminator (PRD), a technique for developing unsupervised models with sufficient reasoning ability to tackle RPM problems. PRD reframes the RPM problem as a relation comparison task, which can be solved without labelling the RPM problems. By adapting the application of PRD to the RPM problem, we can identify the optimal candidate. The previous state-of-the-art approach "mcpt" in this domain achieved 28.5% accuracy on the RAVEN dataset "drt", a standard dataset for computational work on RPM. Our approach, the PRD, establishes a new state-of-the-art benchmark with an accuracy of 50.74% on the same dataset, a significant improvement and a step forward in equipping machines with abstract reasoning.
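To make the candidate-selection idea concrete, the sketch below shows how a trained pairwise relation discriminator could be adapted to score RPM answers: each candidate completes the last row, and the candidate whose completed row relates most strongly to the two context rows wins. The function names, the numeric "panels", and the toy constant-row-sum discriminator are illustrative assumptions, not the paper's actual network or data representation.

```python
import numpy as np

def select_answer(context_rows, incomplete_row, candidates, discriminator):
    """Pick the candidate whose completed row relates most strongly
    to the complete context rows, as judged by the discriminator.

    context_rows: the two complete rows of the 3x3 matrix.
    incomplete_row: the last row, missing its final panel.
    candidates: the answer panels to choose from.
    discriminator: callable (row_a, row_b) -> score, high when the two
        rows appear to follow the same relation.
    """
    scores = [
        np.mean([discriminator(row, incomplete_row + [cand])
                 for row in context_rows])
        for cand in candidates
    ]
    return int(np.argmax(scores))

# Toy stand-in for a trained discriminator: panels are plain numbers and
# the hidden relation is "constant row sum" (purely illustrative).
def toy_discriminator(row_a, row_b):
    return float(np.exp(-abs(sum(row_a) - sum(row_b))))

# Rows [1, 2, 3] and [2, 2, 2] both sum to 6, so the correct candidate
# for the partial row [1, 1] is 4 (index 1 of the candidate list).
best = select_answer([[1, 2, 3], [2, 2, 2]], [1, 1], [0, 4, 9],
                     toy_discriminator)
```

The key property this illustrates is that no answer labels are needed: the discriminator only ever judges whether two rows share a relation, a task for which training pairs can be generated from unlabelled matrices.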
An important aspect of developing dialogue systems is how to evaluate and compare the performance of different systems. Existing automatic evaluation metrics are based on turn-level quality evaluation and use average scores for system-level comparison. In this paper, we propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations. Specifically, two distribution-wise metrics, FBD and PRD, are developed and evaluated. Experiments on several dialogue corpora show that our proposed metrics correlate better with human judgments than existing metrics.
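One standard way to realise a distribution-wise metric of the kind described above is a Fréchet distance between Gaussians fitted to the embeddings of generated and real conversations. The sketch below assumes the embeddings have already been computed (e.g. by a sentence encoder); it is an illustration of the distribution-wise idea, not necessarily the exact formulation of FBD or PRD in the paper.

```python
import numpy as np

def _psd_sqrt(m):
    """Square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    x, y: (n_samples, dim) arrays, e.g. embeddings of a system's
    generated conversations and of real-world conversations.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^{1/2}) computed via a symmetric reformulation,
    # which keeps everything in real PSD matrix square roots.
    s = _psd_sqrt(cov_y)
    cross = np.trace(_psd_sqrt(s @ cov_x @ s))
    return float(((mu_x - mu_y) ** 2).sum()
                 + np.trace(cov_x) + np.trace(cov_y) - 2.0 * cross)
```

Unlike a turn-level score averaged over examples, this quantity compares the two sets of conversations as whole distributions: it is zero only when means and covariances match, so it can penalise a system whose individual turns look plausible but whose outputs lack the diversity of real dialogue.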