In recent years, artificial intelligence has made tremendous advances across domains, particularly in computer vision and natural language processing, largely owing to progress in deep learning. Despite the impressive performance of these systems on numerous tasks, open questions remain about the depth of AI's understanding and its ability to explain its decisions. Our project addresses these critical issues, with a particular focus on the problem of anomaly detection in images.

The project's main goal is to develop multimodal models that not only detect IF something is wrong in an image and pinpoint WHERE, but also understand and explain WHY. This involves integrating visual and linguistic information to tackle three key areas of contemporary AI: (image) understanding, multimodality, and explainability.

The first research challenge, Semantic Image Understanding, targets the deficiencies of surface anomaly detection models in recognizing complex logical anomalies. The objective is to enhance the semantic comprehension of these models, enabling them to discern intricate visual compositions and structural variations.

The second challenge, Multimodal Image Understanding, aims to enrich vision-based anomaly detection with linguistic information. Our goal is to develop a zero-shot anomaly detection method that identifies surface anomalies without prior exposure to images of the specific object classes, relying instead on the concept of an anomaly encoded in vision-language models (a minimal sketch of this idea follows below). Additionally, we plan to create methods that incorporate textual descriptions of anomalies at both the task and instance levels, complementing the visual data.

In the third challenge, Multimodal Explanations, we will focus on enriching visual explanations, such as anomaly heatmaps and segmentation maps, with accompanying textual descriptions.

MUXAD therefore aims to elevate anomaly detection to a more advanced level. By harnessing multimodal AI, it seeks to create models that are not only effective but also intuitive and explainable, marking a pivotal shift towards more transparent and understandable AI systems.
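To make the zero-shot idea from the second challenge concrete, the following is a minimal sketch, not the project's actual method: a CLIP-style vision-language model compares an image against "normal" and "anomalous" text prompts, and the probability mass assigned to the anomalous prompts serves as an anomaly score. The checkpoint name and prompt wording are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; MUXAD's actual models may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_anomaly_score(image: Image.Image, object_name: str) -> float:
    """Return the probability mass CLIP assigns to the 'anomalous' prompts."""
    normal = [f"a photo of a flawless {object_name}",
              f"a photo of a {object_name} in perfect condition"]
    anomalous = [f"a photo of a damaged {object_name}",
                 f"a photo of a {object_name} with a defect"]
    inputs = processor(text=normal + anomalous, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    # Anomaly score: summed probability of the anomalous prompt group.
    return probs[len(normal):].sum().item()

# Example: no images of this object class were seen during training.
# score = zero_shot_anomaly_score(Image.open("screw.png"), "screw")
```

Note that the method never sees examples of the object class itself; all class-specific knowledge enters through the text prompts, which is what makes the approach zero-shot.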
The project's work packages and their realization over time:
WP1: Semantic image understanding for anomaly detection
WP2: Multimodal image understanding
WP3: Multimodal explanations
WP4: Applied use cases: Manufacturing visual inspection and Medical imaging interpretation
Year 1: · Local and global appearance learning · Object composition learning · Dataset curation
Year 2: · Zero-shot anomaly detection · Text-based knowledge injection · Text-based weakly labelled supervision · Manufacturing visual inspection
Year 3: · Text-driven explanations · Uncertainty in vision-language models
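As a toy illustration of the "text-driven explanations" item planned for Year 3, the sketch below turns an anomaly heatmap into a one-sentence textual description by thresholding it and naming the affected region. The threshold, region naming, and phrasing are illustrative assumptions, not the project's method.

```python
import numpy as np

def describe_heatmap(heatmap: np.ndarray, threshold: float = 0.5) -> str:
    """heatmap: 2-D array of per-pixel anomaly scores in [0, 1]."""
    mask = heatmap >= threshold
    if not mask.any():
        return "No anomalous region detected."
    ys, xs = np.nonzero(mask)
    h, w = heatmap.shape
    # Name the region by the centroid of the thresholded area.
    row = "top" if ys.mean() < h / 3 else "bottom" if ys.mean() > 2 * h / 3 else "middle"
    col = "left" if xs.mean() < w / 3 else "right" if xs.mean() > 2 * w / 3 else "center"
    area = 100.0 * mask.mean()  # fraction of anomalous pixels, as a percentage
    return (f"An anomalous region covering {area:.1f}% of the image "
            f"was found in its {row}-{col} part.")

# Usage:
# print(describe_heatmap(np.load("heatmap.npy")))
```

Such a description accompanies, rather than replaces, the visual heatmap, pairing the WHERE (the highlighted pixels) with a human-readable summary.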