Update README.md
README.md
- **Codebase:** [https://github.com/Pointcept/Pointcept](https://github.com/Pointcept/Pointcept)
- **Inference:** [https://github.com/Pointcept/Concerto](https://github.com/Pointcept/Concerto)
## Models
The default models (`concerto_large`/`base`/`small`/`tiny`) are a pre-release version of our next work and can handle input without color and normal attributes. We pre-release them for general public use because many tasks lack such information. The original Concerto model is `concerto_base_origin.pth`.
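
As a quick sanity check before wiring a checkpoint into the inference code linked above, it can help to open it with plain PyTorch and list a few parameter shapes. This is only a rough sketch that assumes each `.pth` file is a standard `torch.save`d state dict (possibly nested under a `state_dict` key) and has been downloaded into the working directory; the actual loading logic lives in the linked Pointcept/Concerto code.

```python
# Minimal sketch, not the official loading path: assumes the checkpoint is a
# plain PyTorch state dict, possibly wrapped under a "state_dict" key.
import torch

CHECKPOINT = "concerto_base_origin.pth"  # or one of the default model files

obj = torch.load(CHECKPOINT, map_location="cpu")
state_dict = obj.get("state_dict", obj) if isinstance(obj, dict) else obj

# Print a few parameter names and shapes to confirm the checkpoint loaded.
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")
print(f"total tensors: {len(state_dict)}")
```
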
## Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto learns emergent spatial representations with superior fine-grained geometric and semantic consistency.
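
As a rough illustration of the "translator" mentioned in the abstract, the sketch below shows what linearly projecting per-point features into CLIP's text-embedding space and scoring them against class prompts could look like. The tensors, feature dimensions, and random initialization here are placeholders, not the released translator weights or the CLIP setup actually used; see the inference repository for the real pipeline.

```python
# Illustrative sketch only: a linear map from (assumed 512-d) point features
# into an (assumed 768-d) CLIP text-embedding space for open-vocabulary scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

POINT_DIM = 512   # assumed width of the 3D backbone features
CLIP_DIM = 768    # assumed width of the CLIP text embeddings

translator = nn.Linear(POINT_DIM, CLIP_DIM, bias=False)  # the "linear translator"

# Placeholder inputs: N per-point features from the 3D model and K text
# embeddings (e.g., one prompt per class, encoded beforehand with CLIP).
point_feats = torch.randn(10_000, POINT_DIM)
text_embeds = F.normalize(torch.randn(20, CLIP_DIM), dim=-1)

# Project points into CLIP space, then assign each point its best-matching
# prompt by cosine similarity.
proj = F.normalize(translator(point_feats), dim=-1)
similarity = proj @ text_embeds.T      # (N, K) cosine similarities
labels = similarity.argmax(dim=-1)     # per-point open-vocabulary predictions
print(labels.shape)
```
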