PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

A benchmark for explanatory part segmentation and a new segmenting LMM, PLUM.

1University of Illinois Urbana-Champaign, 2University of California, Los Angeles
*Equal Contribution

Abstract

We introduce PARTONOMY, a benchmark and task suite for explanatory part segmentation, where a model must (1) identify visible object parts, (2) compare/contrast parts across objects, and (3) perform part–whole reasoning—while grounding its textual answer with pixel-level segmentations. PARTONOMY integrates prior datasets and contributes an evaluation-only PARTONOMY-Core split with 534 object and 862 part labels, focusing on specialized, object-centric images (e.g., agricultural airplanes, combat drones).

We further propose PLUM, a segmenting LMM that fixes two limitations in existing approaches: reliance on new “[SEG]” tokens that cause distribution shift, and discarding past masks during decoding. PLUM uses BIO span tagging to select segmentation-relevant text spans (no new tokens) and a mask feedback loop to condition future masks on previous predictions. Pretrained PLUM outperforms prior segmenting LMMs on reasoning segmentation, VQA, and hallucination; when finetuned on PARTONOMY, it is competitive with models trained on far more mask data.

Benchmark & Task

Explanatory Part Segmentation Overview
Figure 1: PARTONOMY tasks: Part Identification, Part Comparison (Intersection/Difference), and Part–Whole Reasoning. Models must select the correct textual response and ground the parts it mentions with pixel masks.

Explanatory Part Segmentation

PARTONOMY-Core (Eval)

534 object labels
862 distinct part labels
1,068 images, object-centric
4,968 pixel masks

Answer choices are produced by mutating the ground-truth part list (adding, removing, or replacing parts), yielding challenging yet plausible distractors.
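This mutation scheme can be sketched as below; the function name, part vocabulary, and sampling details are illustrative, not taken from the benchmark's actual generation code:

```python
import random

def mutate_parts(parts, vocabulary, rng=random):
    """Make a plausible distractor by adding, removing, or replacing one part."""
    op = rng.choice(["add", "remove", "replace"])
    distractor = list(parts)
    unused = [p for p in vocabulary if p not in distractor]
    if op == "add" and unused:
        distractor.append(rng.choice(unused))
    elif op == "remove" and len(distractor) > 1:
        distractor.pop(rng.randrange(len(distractor)))
    elif unused:  # replace one part with an unused one
        distractor[rng.randrange(len(distractor))] = rng.choice(unused)
    return distractor

parts = ["wing", "fuselage", "propeller"]
vocab = parts + ["rotor", "landing gear", "tail fin"]
print(mutate_parts(parts, vocab, random.Random(0)))
```

Because distractors reuse real part names from the same vocabulary, they stay superficially plausible while differing from the ground truth by exactly one edit.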

PLUM: Part-Level Understanding LMM

PLUM Overview
Figure 2: PLUM avoids special tokens via BIO span tagging and conditions on prior masks via a feedback loop.

Key Ideas

  • BIO span tagging: segmentation-relevant spans are selected directly from the generated text with begin/inside/outside tags, so no new "[SEG]" tokens (and no resulting distribution shift) are needed.
  • Mask feedback loop: previously predicted masks are fed back into the decoder, so each new mask is conditioned on the ones before it.
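Recovering the tagged spans from per-token B/I/O labels is the standard BIO-decoding step; a minimal sketch (tokenization and example sentence are illustrative, not from the paper's code):

```python
def extract_spans(tokens, tags):
    """Recover segmentation-relevant spans from per-token BIO tags.

    tags[i] in {"B", "I", "O"}: a span starts at "B" and extends over "I".
    """
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["The", "agricultural", "airplane", "has", "a", "spray", "boom", "."]
tags   = ["O",   "B",            "I",        "O",   "O", "B",     "I",    "O"]
print(extract_spans(tokens, tags))  # → ['agricultural airplane', 'spray boom']
```

Each recovered span is then passed to the mask decoder, so the model grounds exactly the phrases it tagged rather than emitting a special token per mask.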

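The mask feedback loop can be sketched as sequential decoding in which each step sees an accumulated memory of earlier predictions. Everything here (`decode_with_feedback`, the decoder callback, the union-as-memory choice) is a hypothetical illustration of the idea, not the paper's implementation:

```python
import numpy as np

def decode_with_feedback(image_feats, n_parts, decode_mask):
    """Decode part masks one at a time, feeding earlier masks back in.

    `decode_mask(image_feats, mask_memory)` stands in for the model's mask
    decoder; `mask_memory` holds the union of the masks predicted so far.
    """
    H, W = image_feats.shape[:2]
    mask_memory = np.zeros((H, W), dtype=np.float32)
    masks = []
    for _ in range(n_parts):
        m = decode_mask(image_feats, mask_memory)  # conditioned on past masks
        masks.append(m)
        mask_memory = np.maximum(mask_memory, m)   # accumulate the prediction
    return masks
```

Conditioning on prior masks lets the decoder avoid re-segmenting regions already claimed by earlier parts, which is exactly the information a decoder that discards past masks throws away.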
Results

PARTONOMY-Core (Segmentation gIoU)

| Method            | Identification (micro / macro) | Intersection (micro / macro) | Difference (micro / macro) |
|-------------------|--------------------------------|------------------------------|----------------------------|
| LISA-13B (0-shot) | 5.9 / 7.0                      | 7.1 / 7.5                    | 6.1 / 7.1                  |
| GLaMM (0-shot)    | 5.3 / 5.9                      | 5.9 / 6.2                    | 5.2 / 6.0                  |
| PLUM (0-shot)     | 14.5 / 27.4                    | 23.7 / 29.9                  | 14.9 / 24.8                |
| LISA-13B (ft)     | 33.6 / 35.4                    | 37.0 / 38.4                  | 30.4 / 31.6                |
| GLaMM (ft)        | 36.6 / 38.8                    | 40.3 / 42.1                  | 33.6 / 34.8                |
| PLUM (ft)         | 36.2 / 41.6                    | 42.1 / 45.9                  | 33.0 / 39.4                |

Numbers adapted from the paper’s Table 2.
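For reference, the micro vs. macro distinction can be sketched as follows, assuming gIoU here denotes the mean of per-instance IoUs (as in LISA) and that macro averaging groups by part label first; this is an illustration, not the benchmark's evaluation code:

```python
import numpy as np

def iou(pred, gt):
    """IoU of two boolean masks; empty-vs-empty counts as a perfect match."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def micro_macro_giou(samples):
    """samples: iterable of (part_label, pred_mask, gt_mask) boolean arrays.

    micro: mean IoU over all predictions (common parts dominate);
    macro: mean of per-label mean IoUs (long-tail parts count equally).
    """
    per_label, all_ious = {}, []
    for label, pred, gt in samples:
        v = iou(pred, gt)
        all_ious.append(v)
        per_label.setdefault(label, []).append(v)
    micro = float(np.mean(all_ious))
    macro = float(np.mean([np.mean(vs) for vs in per_label.values()]))
    return micro, macro
```

The gap between PLUM's micro and macro numbers above is consistent with macro averaging rewarding accuracy on rare part labels rather than on frequent ones.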

Part–Whole Reasoning (Core, gIoU on parts)

| Method            | Part→Whole (micro / macro) | Whole→Part (micro / macro) |
|-------------------|----------------------------|----------------------------|
| LISA-13B (0-shot) | 5.7 / 6.6                  | 6.0 / 6.8                  |
| GLaMM (0-shot)    | 4.8 / 5.6                  | 4.9 / 5.8                  |
| PLUM (0-shot)     | 14.3 / 26.8                | 15.4 / 27.5                |
| GLaMM (ft)        | 36.1 / 38.5                | 35.7 / 38.0                |
| PLUM (ft)         | 36.7 / 40.8                | 36.2 / 39.8                |

Predicting the object first (Whole→Part) tends to improve subsequent part masks.

Generalization to Other Tasks

  • ReasonSeg: PLUM-13B (ft) 57.3 gIoU vs. LISA-13B (ft) 56.2.
  • VQA / Hallucination: avoiding special tokens prevents the performance collapse they cause; PLUM beats LLaVA-13B on TextVQA (+31.8% relative) and POPE (+8.9% relative).
  • Zero-shot on public part datasets: large macro-gIoU gains on PACO-LVIS, PartImageNet, and PASCAL-Part.

Ablations (What matters?)

  • Mask Feedback Loop: removing it costs 9.6% micro / 8% macro gIoU.
  • BIO Tagging: eliminates the distribution shift introduced by new tokens; largest macro gains on long-tail parts.
  • KL Weight: trades segmentation vs. reasoning; λKL=0.1 is a good balance.
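A sketch of how such a KL-weighted objective might combine its terms; the exact loss terms and the direction of the KL are assumptions for illustration, not taken from the paper:

```python
import math

def kl_divergence(p_logprobs, q_logprobs):
    """KL(P || Q) for discrete distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq)
               for lp, lq in zip(p_logprobs, q_logprobs))

def combined_objective(text_loss, seg_loss, tuned_logprobs, base_logprobs,
                       lambda_kl=0.1):
    """Sketch: LM loss + mask loss + lambda_KL * KL(tuned || base).

    A larger lambda_kl keeps the tuned model closer to the base LM
    (preserving reasoning/VQA ability) at the cost of segmentation
    specialization; lambda_kl = 0.1 is the reported sweet spot.
    """
    return text_loss + seg_loss + lambda_kl * kl_divergence(
        tuned_logprobs, base_logprobs)
```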

Resources

Code

Training/eval code for PLUM and the PARTONOMY data pipeline.

GitHub

Paper

Preprint under review. Link coming soon.

Coming soon

Dataset

Instructions and scripts to build PARTONOMY and PARTONOMY-Core.

Instructions

Citation

@misc{blume2025partonomy,
  title        = {PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding},
  author       = {Ansel Blume* and Jeonghwan Kim* and Hyeonjeong Ha and Elen Chatikyan and
                  Xiaomeng Jin and Khanh Duy Nguyen and Nanyun Peng and Kai-Wei Chang and
                  Derek Hoiem and Heng Ji},
  year         = {2025},
  note         = {Preprint. Under review. Code: https://github.com/AnselBlume/partonomy}
}