Semantic Segmentation with Dense Prediction Transformers via Grasshopper
The Ambrosinus-Toolkit v1.2.9 implements Semantic Segmentation, a new DPT tool inspired by René Ranftl's research at Intel. It is another AI tool that brings artificial-intelligence power inside the Grasshopper platform, and it completes (for now) my original idea of bringing DPT technology into ATk.
Dense Prediction Transformers (DPT) are a deep-learning architecture used in computer-vision tasks such as semantic segmentation, object detection, and instance segmentation. The fundamental concept of DPT is generating dense image labels by incorporating both global and local context information. An extensive explanation of the Ambrosinus Toolkit and the integration of AI Diffusion Models into AEC creative design processes can be found in the official research document within the Coding Architecture book chapter 😉.

DPT represents an advanced architecture in the field of visual processing: it breaks away from traditional convolutional networks (CNNs) to embrace transformers, which are known for their effectiveness in processing sequential data. These models are particularly well suited to dense prediction tasks, such as monocular depth estimation and semantic segmentation, thanks to their ability to handle high-resolution representations and capture global context within images. By combining token representations at different resolutions and using a convolutional decoder to synthesize full-resolution predictions, DPTs provide more precise and detailed results. This technology has already been shown to significantly improve performance in various computer-vision tasks, setting new benchmarks in the field.

Imagine you need to label objects in an image. Here is how the two approaches work:
Convolutional Networks (CNNs): CNNs are like artists who specialize in analyzing specific parts of an image. They scan the image with filters to detect edges, textures, and shapes. Example: if they see a car, they might say, "There's a car!"
Dense Prediction Transformers (DPT): DPTs are like poets who understand the image as a whole. They look at the image and create a detailed map of what they see. Example: if they see a road, they might say, "This is an asphalt road, with trees on the sides and a parked car."
In short, CNNs focus on specific parts, while DPTs understand the image as a whole.
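To make this concrete, here is a minimal inference sketch of DPT semantic segmentation in Python. It relies on the Hugging Face transformers port of Intel's DPT and the Intel/dpt-large-ade checkpoint rather than the toolkit's own code, and the input file name is illustrative:

```python
from PIL import Image
import torch
from transformers import DPTImageProcessor, DPTForSemanticSegmentation

# Load the ADE20K-trained DPT-Large checkpoint published by Intel
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade")
model.eval()

image = Image.open("street.jpg")                 # illustrative input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, 150, h, w)

pred = logits.argmax(dim=1)[0]                   # per-pixel ADE20K class IDs
print(pred.shape, torch.unique(pred))            # which classes were detected
```

Note how the output is dense: one of the 150 ADE20K labels is predicted for every pixel, which is exactly the "detailed map" described above.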
Requirements
To run the DPTSemSeg component (subcategory AI), some Python libraries are required, as for the other AI tools. For this reason, I have shared a "requirements.txt" file that lets the designer complete this step with a single command line from cmd.exe (Windows OS side). After downloading the file to a custom folder (I suggest C:/CustomFolder or something similar), open cmd.exe, navigate into the "CustomFolder", run pip install -r requirements.txt, and wait until the start prompt string appears again (see the image below on the right).
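Once the installation finishes, a quick sanity check from a Python prompt confirms that the core libraries are importable. The module names below are assumptions based on the usual public MiDaS/DPT dependencies; the shipped requirements.txt remains the authoritative list:

```python
# Quick post-install sanity check (run from any Python prompt).
# torch, opencv-python (cv2) and timm are assumed from the public
# MiDaS/DPT requirements; adjust to match the shipped requirements.txt.
import torch
import cv2
import timm

print("torch:", torch.__version__)
print("opencv:", cv2.__version__)
print("timm:", timm.__version__)
print("CUDA available:", torch.cuda.is_available())
```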
If you have already used/installed the DPTto3D component, follow these instructions:
From your CMD window, you can simply launch this command: pip install atoolkitdpt (I recommend this option) – this way, all the necessary (MiDaS and DPT) libraries will be installed on your machine. For a clean installation, you can first uninstall the previous version from the CMD window: pip uninstall atoolkitdpt.
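If you are unsure whether the package is already on your machine, a two-line Python check avoids guessing:

```python
import importlib.util

# True once "pip install atoolkitdpt" has completed successfully
print("atoolkitdpt installed:", importlib.util.find_spec("atoolkitdpt") is not None)
```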
Python libraries required
The following Python libraries will be installed:
[Table: required Python libraries]
In particular, I have created the atoolkitdpt Python library to run DPT estimation and Semantic Segmentation. I have added all the MiDaS and Dense Prediction Transformers functions developed by the Intelligent Systems Lab Org (Intel Labs) to the atoolkitdpt 0.0.2 library. In this way, I have integrated the possibility of exploiting the MiDaS pre-trained models and the DPT large and hybrid models shared by Intel researchers directly inside Grasshopper. This Python package is available directly from my PyPI repository page at this link, and all future updates will be publicly announced on my GitHub page AToolkitDpt. It is fundamental to download the two weights files (dpt_large-ade20k, ~1.3 GB, and dpt_hybrid-ade20k, ~400 MB) shared by Intel researchers: these pre-trained models can generate segmented images (see the aforementioned GitHub page for details). Finally, GH_CPython is still necessary to run the "DPTSemSeg" component properly, as for all the other AI tools coded in Python.
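For readers curious about what happens under the hood, the sketch below shows how these weights are typically loaded, based on Intel's public isl-org/DPT repository (run_segmentation.py). The atoolkitdpt package wraps similar logic, so its exact entry points may differ, and the file path is hypothetical:

```python
import torch
from dpt.models import DPTSegmentationModel  # module layout from isl-org/DPT

# Path to the downloaded weights file (hypothetical location; adjust to yours)
model_path = r"C:\CustomFolder\dpt_large-ade20k.pt"

model = DPTSegmentationModel(
    150,                    # the 150 ADE20K classes used by Intel Labs
    path=model_path,
    backbone="vitl16_384",  # use "vitb_rn50_384" for the dpt_hybrid weights
)
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```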
What is the ADE20K dataset?
The ADE20K dataset is a large-scale semantic segmentation dataset used for computer vision research. It comprises more than 27,000 images from the SUN and Places databases. Here are some key details about ADE20K:
- Annotations: Images are fully annotated with pixel-level object and object-part labels.
- Object Categories: The dataset spans over 3,000 object categories, including both stuff (e.g., sky, road, grass) and discrete objects (e.g., person, car, bed). Intel Labs used an ADE20K model covering 150 classes.
- Additional Information: Many images also contain object parts and parts of parts.
- Anonymization: Faces and license plates are blurred in the images.
- Structure: Each image has associated annotations, including pixel-wise object and instance information.
For more details and starter code to explore the data, check out the official repository. Researchers often use ADE20K for semantic segmentation and scene-parsing tasks.
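As a toy illustration of what pixel-level labels mean in practice, the snippet below maps a per-pixel class-ID array (the raw output of a 150-class ADE20K model) to an RGB mask using a made-up palette:

```python
import numpy as np

NUM_CLASSES = 150                                   # the ADE20K subset used here
rng = np.random.default_rng(0)
palette = rng.integers(0, 256, size=(NUM_CLASSES, 3), dtype=np.uint8)  # made-up colors

def colorize(label_map: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs in [0, 149] to an (H, W, 3) RGB mask."""
    return palette[label_map]

labels = rng.integers(0, NUM_CLASSES, size=(4, 6))  # dummy per-pixel prediction
print(colorize(labels).shape)                       # -> (4, 6, 3)
```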
Parameters
- BaseIMG: the image source file (you can also pass a folder containing a set of images to run a batch process; see the sketch after this list);
- DirPath: the folder where all generated files will be stored;
- ModFile: the dataset file (.pt format) selected to run the Semantic Segmentation process; more info on the GitHub page AToolkitDpt;
- Optimize: a DPT function that optimizes the semantic segmentation (True/False);
- Run: starts the process.
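As a rough sketch of the batch behaviour described for BaseIMG (the function name and the supported extensions are assumptions, not the toolkit's actual API):

```python
import os

IMG_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp")  # assumed supported formats

def collect_images(base_img):
    """Return the images to process: a single file, or every image
    found inside BaseIMG when a folder is supplied (batch mode)."""
    if os.path.isdir(base_img):
        return [os.path.join(base_img, name)
                for name in sorted(os.listdir(base_img))
                if name.lower().endswith(IMG_EXTENSIONS)]
    return [base_img]

# Both calls are valid inputs for BaseIMG:
print(collect_images(r"C:\CustomFolder\street.jpg"))  # single image
print(collect_images(r"C:\CustomFolder"))             # whole folder, batch run
```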
DPTSemSeg component in action
Main features
The DPTSemSeg component, utilizing DPT technology, can perform semantic segmentation directly from a 2D RGB image. The ADE20K dataset includes 150 classes (label IDs), and a common issue is that two specific categories, road and skyscraper, share an identical color in the palette. I have implemented an internal prediction re-mapping to address this issue (hopefully successfully; see the sketch after this list). However, detection of the skyscraper class remains imperfect: skyscrapers are only partially recognized and are often conflated with the building category (which is not too problematic, in my opinion). Summarizing, the component can:
- Generate the segmented image overlay on the original BaseIMG
- Generate the segmented image mask
- Correct the greyish image produced by the model by setting the "Optimize" option to False (alternatively, the dpt_hybrid model can be used, though it is less accurate)
- Generate a TXT file containing all the data extracted from the segmented image, such as RGB colors, the percentages of the detected classes, and label IDs. This file will be saved in the "DirPath" folder.
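To illustrate the re-mapping and the TXT report mentioned above, here is a small, purely illustrative sketch; the label IDs and colors are made up:

```python
import numpy as np

palette = np.zeros((150, 3), dtype=np.uint8)        # stand-in ADE20K palette
ROAD_ID, SKYSCRAPER_ID = 6, 48                      # hypothetical label IDs
palette[ROAD_ID] = palette[SKYSCRAPER_ID] = (140, 140, 140)  # the color clash

# Re-map one of the clashing classes to a unique RGB before drawing the mask
palette[SKYSCRAPER_ID] = (255, 0, 255)

def class_percentages(label_map):
    """Per-class pixel percentages, the kind of data written to the TXT report."""
    ids, counts = np.unique(label_map, return_counts=True)
    return {int(i): 100.0 * c / label_map.size for i, c in zip(ids, counts)}

labels = np.random.default_rng(1).integers(0, 150, size=(64, 64))  # dummy prediction
for class_id, pct in list(class_percentages(labels).items())[:3]:
    print(class_id, palette[class_id].tolist(), f"{pct:.2f}%")
```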
Contents of the "DirPath" folder after a couple of tests
Video demo
As always, a video is worth more than a thousand words. If you watch it on my YouTube channel, you can jump among the different highlights 😉