Urban perception is increasingly modelled from streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three-stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed with an open-set Panoptic Scene Graph model (OpenPSG) to extract object-predicate-object triplets. In the second stage, compact scene-level embeddings are learned with a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) it maintains strong generalization in cross-city prediction tasks. In addition, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as "graffiti on wall" and "car parked on sidewalk". Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
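To make the shape of the pipeline concrete, the following is a minimal PyTorch sketch. The `extract_triplets` and `encode_scene_graph` functions are hypothetical stand-ins for the OpenPSG and GraphMAE stages (the real APIs of those projects differ); the embedding size, layer widths, and dropout rate are illustrative assumptions, not the configuration reported in the study.

```python
"""Hedged sketch of the three-stage pipeline. Stages one and two are
hypothetical stubs; only the stage-three prediction head is concrete."""
import torch
import torch.nn as nn


def extract_triplets(image):
    """Stage 1 (hypothetical stand-in): run an open-set PSG model and return
    (subject, predicate, object) triplets, e.g. ('graffiti', 'on', 'wall')."""
    raise NotImplementedError("wrap an OpenPSG checkpoint here")


def encode_scene_graph(triplets) -> torch.Tensor:
    """Stage 2 (hypothetical stand-in): build a heterogeneous graph from the
    triplets and pool a GraphMAE-style encoder into one scene-level embedding."""
    raise NotImplementedError("wrap a GraphMAE encoder here")


class PerceptionHead(nn.Module):
    """Stage 3: map a scene-level embedding to six perception scores.
    Dimensions below are assumptions for illustration."""

    def __init__(self, embed_dim: int = 256, num_indicators: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_indicators),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, embed_dim) scene embeddings from stage two
        return self.net(z)  # (batch, 6) perception scores


if __name__ == "__main__":
    head = PerceptionHead()
    z = torch.randn(8, 256)  # stand-in for stage-two embeddings
    print(head(z).shape)     # torch.Size([8, 6])
```

A regression head of this kind would be trained with a standard loss (e.g. MSE against crowd-sourced perception ratings); the choice of head is deliberately simple so that the predictive signal can be attributed to the graph-based representation rather than to the predictor.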