Abstract

Learning medical visual representations through vision-language pre-training has achieved remarkable progress. Despite the promising performance, it still faces challenges: local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence>, and fully utilize each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, treating regions and sentences as the minimum semantic units to explore fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks, covering five public benchmarks. Experimental results demonstrate that our method outperforms state-of-the-art methods. Our code is available at https://asgmvlp.github.io.
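
To make the soft-label contrastive idea in the abstract more concrete, the short Python sketch below shows one way such an objective could be implemented: multi-hot tag vectors parsed from each report define soft targets over the batch, and a KL divergence pulls the image-report similarity distribution towards those targets. This is a minimal sketch under our own assumptions (function and variable names, the temperature value, cosine similarity over tag vectors); it is not the authors’ implementation.

import torch
import torch.nn.functional as F

def soft_label_contrastive_loss(img_emb, txt_emb, tag_vectors, temperature=0.07):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of a batch of image-report pairs;
    tag_vectors: (B, T) multi-hot <finding, existence> tags parsed from the reports."""
    # Predicted distribution: softmax over image-to-report similarities.
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    log_pred = F.log_softmax(logits, dim=1)

    # Soft targets: cosine similarity between tag vectors, normalized per row,
    # so that samples sharing findings are treated as partial positives.
    tag_norm = F.normalize(tag_vectors, dim=1)
    tag_sim = tag_norm @ tag_norm.t()
    targets = tag_sim / tag_sim.sum(dim=1, keepdim=True)    # (B, B)

    # KL divergence between the soft targets and the predicted distribution.
    return F.kl_div(log_pred, targets, reduction="batchmean")

# Toy usage with random tensors.
B, D, T = 4, 128, 40
img = F.normalize(torch.randn(B, D), dim=1)
txt = F.normalize(torch.randn(B, D), dim=1)
tags = (torch.rand(B, T) > 0.8).float() + 1e-6  # small offset avoids all-zero rows
print(soft_label_contrastive_loss(img, txt, tags))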

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2014_paper.pdf

SharedIt Link: https://rdcu.be/dV573

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_8

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2014_supp.pdf

Link to the Code Repository

https://github.com/ASGMVLP/ASGMVLP_CODE

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_Anatomical_MICCAI2024,
        author = { Li, Qingqiu and Yan, Xiaohan and Xu, Jilan and Yuan, Runtian and Zhang, Yuejie and Feng, Rui and Shen, Quanli and Zhang, Xiaobo and Wang, Shujun},
        title = { { Anatomical Structure-Guided Medical Vision-Language Pre-training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {80 -- 90}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses vision-language pre-training on chest X-rays. The authors propose two contributions: Anatomical Region-Sentence Alignment (ARSA), and Internal and External Representation Learning (IERL).
    First, in ARSA, the authors propose an automated assignment between relations in the RadGraph extractions and the detected regions via a rule-based system, which makes it possible to integrate anatomical locality into the alignment procedure. Second, IERL utilizes extracted class tags to generate soft labels based on tag similarity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper utilizes anatomical features provided by the ChestImaGenome dataset to generate better-aligned feature representations, although it is missing comparisons against methods such as [1]. While several works have utilized these annotations for disease localization & classification or report generation [2,3], this has seemingly not been done for pre-training. It has to be noted, however, that sentence-wise pre-training is not a novel concept, as it has been done implicitly in [4,5].
    • The paper shows slight improvements over related work on all considered datasets, apart from the COVIDx dataset, where the improvements become more noticeable.
    • The paper provides source code.

    [1] GLIP: Grounded Language-Image Pre-training
    [2] AnaXNet: Anatomy Aware Multi-label Finding Classification in Chest X-ray
    [3] Anatomy-Guided Weakly-Supervised Abnormality Localization in Chest X-rays
    [4] Joint Learning of Localized Representations from Medical Images and Reports
    [5] Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training
    [6] Interactive and Explainable Region-guided Radiology Report Generation

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The technical contribution is limited, as introducing an additional contrastive term based on pre-extracted labels is basically supervised contrastive training [2]. The way the soft labels have been built is also reminiscent of [6]. The performance boost seems only minor, and no standard deviations are reported for the experiments.

    Minor:

    • Training details: – The pre-training image size has a big effect on performance, but the size used is not named in the paper. – It is not stated whether the encoder is already pre-trained on something like ImageNet.
    • Evaluation – The baseline performance seems off: even a randomly initialized R50 easily achieves >75% AUC on NIH X-Ray (CXR14) with 100% data, and 55.3%/71% AUC is worse than Wang et al.’s [1] original performance (it is also not stated whether the baseline performance is based on R50 or ViT-B). – Only fine-tuning performance is shown, but not zero-shot performance, similar to [3,4]. – It is not shown how the method performs for pathologies that are not part of the chosen tags, i.e., the CLiP dataset [5].
    • Comparisons – Since IERL already relies on pre-extracted tags, there should be a comparison against a simple additional classification head, as this seems to be the most similar mechanism with far less complexity. Comparisons to [2] and [6] are also missing.

    [1] ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
    [2] Supervised Contrastive Learning
    [3] Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training
    [4] Xplainer: From x-ray observations to explainable zero-shot diagnosis
    [5]
    [6] Ranking info noise contrastive estimation: Boosting contrastive learning via ranked positives

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper seems overall well structured, although the method design does not seem well investigated. Although the different choices that went into the method have been ablated, it is hard to see whether these were the best possible choices. There have not been many comparisons to similar related choices; adding these comparisons might strengthen the paper overall.

    Similarly, adding evaluations in terms of zero-shot classification and localization on datasets such as ChestX-ray8 (+ bounding boxes) would provide value.

    Also, there have been no evaluations as to whether ChestImaGenome is the best data source for gathering anatomy annotations, as datasets such as JSRT [1] and PAX-Ray [2] also exist.

    [1] http://db.jsrt.or.jp/eng.php [2] Detailed Annotations of Chest X-Rays via CT Projection for Report Understanding

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work overall seems rather incremental. While improvements have been achieved, it is hard to estimate the long-term impact of this work due to missing comparisons to related work in terms of method design.
    In general, it seems that a limited amount of work went into investigating the field, as even closely related work in terms of method design has not been cited (i.e., supervised contrastive learning by Khosla et al., 2020).

    As it stands, I tend towards rejecting this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    While I agree that the use of anatomical features is a notable direction to take, I still stand by the sentiment that the work provides limited novelty. Overall, this work has a limited consideration of related work in this direction that would validate its different design choices. I do, however, notice the effort that went into this work (i.e., letting radiologists relabel reports, etc.), which, paired with the other reviewers’ assessments, leads me to change my assessment to a weak accept.



Review #2

  • Please describe the contribution of the paper

    This paper proposes to pre-train medical vision-language models on X-ray and report datasets. The authors propose to (1) align anatomical regions (as opposed to tiled image patches) between image and report, (2) add a classification loss for common pathologies, and (3) use soft labels computed from pathology labels to mitigate false negatives. The authors demonstrate that their proposed pipeline improves upon baselines for classification and segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Alignment between Anatomical Sets is Useful: This paper proposes to establish the connection between the sets of possible anatomical regions extracted from the X-ray image and the report using expert radiologists’ knowledge. This mapping can be useful for other researchers in this area.
    • Comprehensive Experiments: The authors did a pretty comprehensive job evaluating different baselines as well as proper ablations, although the paper could benefit from further evaluations, e.g., retrieval and grounding, as done in [1].

    [1] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Marginal Performance Improvements: In many cases, the proposed method only marginally improves upon the evaluated baselines. This is problematic since the proposed pipeline is heavily tuned, e.g., with four different losses and their hyperparameters.
    • Some Important Results Missing: The reviewer thinks this paper can be strengthened by adding one more baseline, e.g., BioViL [1], and reporting zero-shot classification & segmentation performance. More details in the detailed comments.
    • Missing Important References: The authors should discuss the very relevant prior works on using soft-labels for CLIP-like pre-training [2-3] and on alternative ways to prevent false negatives in the contrastive loss [4]. More details in detailed comments.

    [1] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing
    [2] PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
    [3] SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger
    [4] Sample-Specific Debiasing for Better Image-Text Models

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Major

    • There is existing work that aims to correct for false negatives in the context of training multimodal models over X-ray and report datasets [1]. The authors might want to take a look at that paper, especially the baselines it compares to, to get an idea of the relevant work in this area. The authors should discuss how the proposed soft-label approach to reducing false negatives is related to and differs from existing work in this area.
    • I would suggest that the authors add BioViL [2] to their baselines to strengthen their claim that their method is indeed stronger. BioViL is tuned to both classification and segmentation tasks, and it would be interesting to see how the proposed method compares with this fairly standard baseline.
    • I would suggest that the authors focus on zero-shot classification/segmentation performance, as it helps distinguish the different methods better. With enough fine-tuning data, different pre-training schemes do not matter that much. In fact, it seems the proposed method provides the largest gain when using 1% of the data for fine-tuning.

    Minor

    • Explain what is used to model f_enc. Is it a transformer?
    • Explain why KL divergence is used in Equation (6). How does it compare with alternative loss functions such as MSE?

    [1] Sample-Specific Debiasing for Better Image-Text Models [2] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major

    • There is existing work that aims to correct for false negatives in the context of training multimodal models over X-ray and report datasets [1]. The authors might want to take a look at that paper, especially the baselines it compares to, to get an idea of the relevant work in this area. The authors should discuss how the proposed soft-label approach to reducing false negatives is related to and differs from existing work in this area.
    • I would suggest that the authors add BioViL [2] to their baselines to strengthen their claim that their method is indeed stronger. BioViL is tuned to both classification and segmentation tasks, and it would be interesting to see how the proposed method compares with this fairly standard baseline.
    • I would suggest that the authors focus on zero-shot classification/segmentation performance, as it helps distinguish the different methods better. With enough fine-tuning data, different pre-training schemes do not matter that much. In fact, it seems the proposed method provides the largest gain when using 1% of the data for fine-tuning.

    Minor

    • Explain what is used to model f_enc. Is it a transformer?
    • Explain why KL divergence is used in Equation (6). How does it compare with alternative loss functions such as MSE?

    [1] Sample-Specific Debiasing for Better Image-Text Models [2] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    On the positive side, this paper provides an alignment of anatomical regions in image and text relying on expert knowledge that may prove useful to other researchers. However, the paper’s performance improvement is somewhat marginal given the complexity of the setup (e.g., four different losses). Key contributions, e.g., the use of soft labels, are not new and lack discussion with respect to relevant references. Some key baselines and evaluation metrics are missing. I would be inclined to raise the rating if the authors can address my concerns.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors partially addressed my concerns/questions, e.g., they added related references on soft labels for CLIP-like training and promised to add a BioViL baseline. Therefore, I increase the rating to weak accept.

    However, there are issues that are not sufficiently addressed in the rebuttal. For example, the comment on the marginal performance gain compared to existing baselines still stands. In addition, it is quite possible that the proposed approach will be worse than BioViL. The authors also can no longer claim the previously stated key contribution of using soft labels for contrastive learning, which reduces the contribution of this paper.



Review #3

  • Please describe the contribution of the paper

    The paper proposes an Anatomical Structure-Guided framework for medical vision-language pre-training. The framework utilizes an automatic anatomical region-sentence alignment paradigm to better align image and text pairs, and adopts soft labels for contrastive learning to optimize internal and external representation learning, thereby enabling the pre-trained model to achieve stronger representation capabilities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper focuses on the lack of clinical relevance and the existence of false-negative samples in the pre-training of existing medical vision-language models, and proposes corresponding solutions, which can to a certain extent alleviate the relative scarcity of medical pre-training samples. The method of image-text alignment based on prior knowledge of anatomical structures has a certain degree of novelty and feasibility, and its effectiveness has also been demonstrated in experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The soft-label method proposed in this paper is similar to that in MedCLIP, which was proposed at EMNLP 2022. Are the authors sure they did not reference this paper’s methodology? MedCLIP is a well-known medical pre-trained vision-language model. It is suggested that the authors compare with this work and explore the methodological differences to highlight the innovation of the method in this paper. (Wang Z, Wu Z, Agarwal D, et al. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022: 3876-3887.)
    2. In the introduction, the authors state that they “design an automatic anatomical region-sentence alignment paradigm, which aligns with radiologists’ reading workflow and enhances interpretability.” However, there is little evidence that the interpretability is enhanced, either quantitatively or qualitatively, and the authors should provide some validation or be more rigorous in their wording.
    3. There are inaccuracies in the summary of existing methods. For example, GLoRIA is not a patch-word alignment method as the authors state in the paper; please pay attention to rigour when citing and summarising existing literature.
    4. As can be seen from Table 1, the results using the CNN-based backbone show little or even negligible advantage in terms of classification performance, especially on the NIH X-ray, CheXpert and RSNA datasets.
    5. The authors do not provide an analysis of the reasons for the less satisfying classification performance of their own model on the CheXpert and RSNA datasets (they only mention the advantages of the MGCA model but do not mention the problems of their own model).
    6. In the ablation study, the authors did not clarify what baseline was used for the corresponding modules in the model variants without ARSA and IERL.
    7. Internal representation learning and External representation learning seem to have two independent losses. However, the authors did not verify whether the contribution of the IERL module to the performance improvement was due to the internal or external components in the ablation study.
    8. The proposed method performs well on the segmentation task, but not well on the classification task. Could the authors try to analyze the reasons?
    9. Why do some datasets use the AUC metric and others use the ACC metric? Why is there such a difference? Are there any special considerations?
    10. Some other small details: (1) Figure 1 is not referenced in the paper. (2) There are some spelling mistakes, such as AUC misspelled as AUR, etc.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    There are some results in Table 2 that were not reported in the cited original paper, and it is not known whether they were reproduced by the authors themselves. These experimental details are not described in the paper and need to be clarified so that readers can better reproduce these results. Also, it is recommended that the source code be provided with more adequate instructions so that readers can better reproduce the model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is suggested that the paper be compared with MedCLIP method (including methods and performance).
    2. It is recommend that authors provide verification of interpretability or modify the wording of interpretability in the paper to make it more rigorous.
    3. It is suggested that the authors analyze the reasons why the proposed method does not perform well in classification performance.
    4. It is recommended that authors separately verify the role of Internal representation learning and External representation learning in ablation studies.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The perspective of the paper is forward-looking and the methodology is somewhat innovative, but the experimental support is insufficient and needs to be supplemented. Moreover, the improvement in classification performance from the proposed method is so small as to be almost negligible.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors partially addressed my questions, but there are issues that are not sufficiently addressed.

    For example, the authors do not provide an analysis of the reasons for the less satisfying classification performance of their own model on the CheXpert and RSNA datasets (they only mention the advantages of the MGCA model but do not mention the problems of their own model).

    In the ablation study, the authors did not clarify what baseline was used for the corresponding modules in the model variants without ARSA and IERL.

    Internal representation learning and External representation learning seem to have two independent losses. However, the authors did not verify whether the contribution of the IERL module to the performance improvement was due to the internal or external components in the ablation study.

    Why do some datasets use the AUC metric and others use the ACC metric? Why is there such a difference? Are there any special considerations?




Author Feedback

We thank the reviewers for their constructive feedback.

#R1 [ARSA’s Contribution] ARSA, our key contribution, is the first to use anatomical regions and sentences as the minimum semantic units in MedVLP. Its design is in line with the reading process of X-rays, as validated by professional radiologists. AnaXNet and AGXNet only use single-modal information, and RGRG uses Chest ImaGenome’s annotations. What we want to emphasize is that we redesign an automatic alignment, which has advantages over Chest ImaGenome in ① removing redundancy (Supp. Tab. 1) and ② strict semantic alignment; e.g., in Chest ImaGenome, the bbox “left lung” maps to the sentence “lungs are clear”, which is inaccurate and illusory. We provide two solutions: split the sentence or merge the bboxes (Fig. 1).

#R1,R3,R4 [ERL’s Contribution and Comparison with Previous Methods] We use soft labels to further enhance our method. Comparison: ① Methods in the natural image domain (e.g., SoftCLIP) generally rely on image tags output by object detectors as the pseudo-labels for contrastive learning, which can lead to error accumulation due to the detector’s inaccurate predictions. In contrast, thanks to the detailed nature of medical reports, the soft labels we use are directly parsed from reports, which is more accurate and computationally efficient. ② Compared with existing methods in the medical domain, our method is superior in both the accuracy and the granularity of the predefined categories, leading to better performance. For granularity, MedCLIP only adopts 14 categories to construct soft labels, which is insufficient to distinguish subtle differences in X-rays; in contrast, we expand them to 40 categories defined by radiologists. For accuracy, the pseudo-labels produced by MedCLIP include uncertainty (labeled as -1); in contrast, we ask radiologists to re-label these cases to obtain more accurate soft labels. The entire process is natural since we already obtain pseudo-labels in Sec. 2.2, which is why we do not use other methods to reduce false negatives. We will add the comparison in the revised version.

#R1,R3,R4 [IRL’s Contribution and Zero-Shot Performance] IRL not only strengthens the image encoder but also aligns with the patterns of downstream classification tasks. This is an important reason why our 1% fine-tuning significantly outperforms previous methods. Similarly, we validate our model on zero-shot tasks, surpassing SOTA methods. Due to space limits in the paper, we follow the experimental setup of MGCA (NIPS 22) and omit these results. We will add them in the revised version.

#R1,R3,R4 [Performance Improvement Analysis] We survey recent MedVLP methods and observe that our method achieves comparable performance gains on classification and much larger gains on segmentation (thanks to ARSA). Our model has appropriate standard deviations under different data proportions. We will add them in the revised version.

#R1 [Baseline Performance] The experimental setups (ChestX-ray8: fine-tuning; ours: linear probe) and data splits are different, so the metrics differ significantly.

#R1,R3 [Training Details] We use the same settings and metrics as MGCA. f_dec is a transformer decoder. We have clarified all details on our GitHub.

#R3 [Interpretability] In ARSA, we treat anatomical regions and sentences as the minimum semantic units, which is more interpretable than prior patch-word alignment, as patches lack anatomical information. We provide heat maps (Fig. 3) for further proof, superior to those in MGCA.

#R3 [GLoRIA] The sub-region referred to is a split of the feature map (effectively a patch).

#R3,R4 [Loss Design and Ablation] We parse raw reports into triplets, and all modules make full and reasonable use of this structured information, introducing the corresponding losses. This is not a forced combination to improve the metrics. The ablation shows that each loss contributes to the performance gains. Since the BCE loss and the soft-label loss are both based on <finding, existence> and bring similar improvements, we consider them together as IERL.

#R4 [Reference] In the revised version, we will compare with the strong baseline BioViL.
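
For readers who want a concrete picture of the region-sentence alignment idea discussed in the ARSA response above, the following Python toy sketches one possible, heavily simplified variant: sentences are matched to detected anatomical regions through a small hand-written keyword map, and a composite mention such as “lungs” is resolved to the union of its constituent region boxes (the “merge bbox” case). The region names, boxes, and keyword map are purely illustrative assumptions and are not the authors’ actual alignment rules.

from typing import Dict, List, Tuple

# Hypothetical detected regions: name -> bounding box (x1, y1, x2, y2).
REGIONS: Dict[str, Tuple[int, int, int, int]] = {
    "left lung": (10, 20, 120, 220),
    "right lung": (130, 20, 240, 220),
    "cardiac silhouette": (90, 120, 170, 230),
}

# Hand-written mapping from report phrases to region names (illustrative only).
KEYWORD_MAP: Dict[str, List[str]] = {
    "left lung": ["left lung"],
    "right lung": ["right lung"],
    "lungs": ["left lung", "right lung"],  # composite mention -> merge two boxes
    "heart": ["cardiac silhouette"],
}

def merge_boxes(boxes: List[Tuple[int, int, int, int]]) -> Tuple[int, int, int, int]:
    """Union of axis-aligned boxes, used when one sentence covers several regions."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def align(sentence: str) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Map a report sentence to (phrase, box) pairs; multi-region mentions get one merged box."""
    lowered = sentence.lower()
    results = []
    for phrase, region_names in KEYWORD_MAP.items():
        if phrase in lowered:
            results.append((phrase, merge_boxes([REGIONS[name] for name in region_names])))
    return results

print(align("Lungs are clear without focal consolidation."))
print(align("The heart size is normal."))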




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers were all quite positive about the main idea presented by the paper. Two reviewers even changed their scores post-rebuttal. However, there are still multiple points, raised post-rebuttal, that need to be addressed by the authors, namely: the ablation on the IERL losses, details on the metrics used, the marginal gain compared to baselines, etc. These can easily be addressed in the camera-ready version of the paper or perhaps in future work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The reviewers were all quite positive about the main idea presented by the paper. Two reviewers even changed their scores post-rebuttal. However, there are still multiple points, raised post-rebuttal, that need to be addressed by the authors, namely: the ablation on the IERL losses, details on the metrics used, the marginal gain compared to baselines, etc. These can easily be addressed in the camera-ready version of the paper or perhaps in future work.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers changed their rating to WA after the rebuttal. However, in their post-rebuttal comments, they all noted that there are still issues related to the novelty and the marginal performance gain of the paper. Overall, considering the reviewers’ final ratings and the authors’ promise in the confidential comments that they “will make all alignment rules, code, and re-labeled dataset (1.3M region-sentence pairs for 217k samples) publicly available”, I recommend acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers changed their rating to WA after the rebuttal. However, in their post-rebuttal comments, they all noted that there are still issues related to the novelty and the marginal performance gain of the paper. Overall, considering the reviewers’ final ratings and the authors’ promise in the confidential comments that they “will make all alignment rules, code, and re-labeled dataset (1.3M region-sentence pairs for 217k samples) publicly available”, I recommend acceptance.


