Abstract

Cervical cancer remains a leading cause of cancer-related mortality among women globally, with diagnosis primarily relying on multi-sequence magnetic resonance imaging (MRI). However, existing Multi-modal Large Language Models (MLLMs) struggle to process 3D multi-sequence inputs due to high computational complexity and inefficient long-sequence modeling. To this end, we present Cervical-RG, to the best of our knowledge the first framework that utilizes 3D multi-sequence MRI images for automated report generation. International Federation of Gynecology and Obstetrics (FIGO) staging, which plays a critical role in cervical cancer management, is also incorporated into the report. The workflow consists of (1) image diagnosis generation, (2) Chain-of-Thought (CoT)-guided FIGO staging with rationale, and (3) cross-stage consistency verification; the entire pipeline simulates the collaborative diagnostic process of multi-disciplinary experts in clinical practice. In addition, we propose a novel model to handle multi-sequence inputs, comprising a volumetric multi-sequence encoder and a Mamba-Transformer hybrid decoder, which integrates global attention with selective state-space modeling to effectively handle long-range dependencies and spatial relationships. To validate our method, we curate Cervical-MD, a multi-modal dataset comprising 3,137 volumetrically aligned MRI-report pairs across five sequences (ADC, T1CA, T1CS, T2A, T2S), annotated by two radiologists. Experimental results demonstrate state-of-the-art performance in automated cervical cancer report generation. Our code will be open-sourced soon.
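For readers unfamiliar with the hybrid design, the sketch below shows one plausible decoder layer that interleaves global self-attention with a selective state-space (Mamba) block. The layer widths, normalization placement, and interleaving pattern are illustrative assumptions, not the published architecture.

```python
# Minimal sketch (PyTorch) of a Mamba-Transformer hybrid decoder layer.
# The exact widths, norm placement, and branch order are assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA required)

class HybridDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        # Global attention captures long-range token interactions.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Selective SSM models long sequences in linear time.
        self.ssm = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), fused visual + text tokens
        attn_out, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x),
                                need_weights=False)
        x = x + attn_out                  # residual attention branch
        x = x + self.ssm(self.norm2(x))   # residual Mamba branch
        return x
```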

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0177_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/LongYu-LY/Cervical-RG

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ZhaHan_CervicalRG_MICCAI2025,
        author = { Zhang, Hanwen and Long, Yu and Fan, Yimeng and Wang, Yu and Zhan, Zhaoyi and Wang, Sen and Jiang, Yuncheng and Sun, Rui and Xing, Zheng and Li, Zhen and Duan, Xiaohui and Zhao, Weibing},
        title = { { Cervical-RG: Automated Cervical Cancer Report Generation from 3D Multi-sequence MRI via CoT-guided Hierarchical Experts } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {77 -- 87}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    They propose Cervical-RG, a cervical cancer report generation framework specifically designed for processing multi-sequence 3D MRI images, enabling cross-sequence interactive learning to enhance report quality.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed framework utilizes 3D multi-sequence MRI images for automated report generation, and FIGO staging is incorporated into the report. To enhance clinical interpretability, the framework divides the report generation process into three phases: image diagnosis generation, FIGO staging, and cross-stage consistency verification. They also collected a multi-modal dataset for cervical cancer, the Cervical-MD dataset, which includes 3,177 cases of cervical cancer patients from seven tertiary hospitals. Each case consists of five 3D MRI sequences, along with the corresponding imaging diagnosis and staging information labeled according to the FIGO 2018 criteria.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper lacks implementation details for the method as well as the experimental settings: for example, how to integrate images of different modalities; how to align image and text… The description of the collected dataset is unclear, too.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (1) Strong Reject — must be rejected due to major flaws

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The submission does not provide sufficient information for reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I agree to accept it as a poster paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new framework, Cervical-RG, to produce diagnostic reports from 3D multi-sequence MRI images. The authors also collected a large-scale multi-modal dataset for model training. To improve performance, the pipeline integrates FIGO staging with chain-of-thought reasoning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper proposes to incorporate FIGO staging for CoT reasoning. This step generates intermediate descriptions that provide evidence for high-quality report generation and better interpretability.

    2. The overall hierarchical experts mechanism breaks down the complex diagnostic process into easier tasks, aligned with the diagnostic process of physicians.

    3. The experiments show that the proposed framework clearly outperforms previous methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In Fig. 1, the LLM generates an image diagnosis in Phase I, and then the LLM generates coarse staging via CoT based on the embedding of the image diagnosis. The image diagnosis here is unclear: what is included in the generated image diagnosis? Why is it necessary to generate this intermediate diagnosis rather than generating coarse staging via CoT directly?

    2. In the pipeline, the review specialist stage relies on GPT-4o, which is a proprietary model. This may raise privacy issues when uploading the generated diagnosis report to GPT. In addition, many hospitals require deployment without Internet access. The reliance on GPT-4o will limit the practical utility of the proposed approach.

    3. In Sec. 3.3, the authors mention that they conduct clinician evaluation in the experiments. However, it is only used for the ablation study. It would be more reasonable to use clinician evaluation for the comparison with previous methods (Table 1).

    4. The F-Acc in Table 2 is very low. Though the authors argue that it is still acceptable for clinicians, the explanation is very vague. I wonder what criteria were used to make this judgement. What is the minimum accuracy that meets diagnostic acceptability criteria?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Though this paper presents a new pipeline for 3D multi-sequence MRI report generation, the proposed method relies on the proprietary model GPT-4o, which is not practical for real-world use. There are also some problems in the experiments, specified under weaknesses. Thus I tend to give a “weak reject” to the current draft.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thanks for the rebuttal and the reviews of the other reviewers. The authors make some promises about providing additional physician evaluation and replacing GPT-4o with an open-source model. I do not think these big changes can be completed within the camera-ready preparation period. I suggest the authors update the paper and submit it again.



Review #3

  • Please describe the contribution of the paper

    Cervical-MD: the development and curation of a new multi-modal dataset containing 3,177 patient cases from seven hospitals. Each case includes five 3D MRI sequences (ADC, T1CA, T1CS, T2A, T2S) paired with clinical imaging reports and FIGO 2018 staging information.

    Cervical-RG: a proposed “Hierarchical Experts” framework mimicking the clinical workflow, with three phases at a high level: Phase I (‘Initial Radiologist’) generates the diagnostic report; Phase II (‘Follow-up Oncologist’) performs FIGO staging using Chain-of-Thought (CoT) reasoning guided by specific tags; Phase III (‘Report Reviewing Specialist’) verifies cross-stage consistency.

    An empirical demonstration showing that Cervical-RG, which is the first to integrate FIGO staging into report generation, achieves state-of-the-art performance compared to several baseline methods (including 2D and 3D MLLMs) on their Cervical-MD dataset, evaluated using standard NLP metrics for report generation and accuracy metrics for FIGO staging.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clinical Significance and Novel Framework: The paper addresses a highly relevant clinical problem – automating the complex task of generating comprehensive cervical cancer reports and FIGO stages from multi-sequence MRI. The proposed hierarchical expert framework is conceptually appealing as it attempts to mimic the collaborative clinical decision-making process, potentially enhancing interpretability.

    Valuable Dataset Contribution: The creation and description of the Cervical-MD dataset represent a significant strength. Multi-institutional, well-annotated, multi-sequence 3D MRI datasets with corresponding clinical reports and staging are rare publicly. This dataset, if released as promised, would be a valuable resource for the research community.

    Advanced Technical Approach: The paper incorporates recent advancements in vision-language modeling. Using a dedicated 3D encoder for the data is appropriate. The Mamba-Transformer hybrid decoder is a technically interesting choice for potentially balancing performance and efficiency in generating long text sequences from complex visual inputs. The integration of CoT for staging is conceptually sound and relatively new for emulating clinical reasoning in medical imaging domain.

    Comprehensive Evaluation: The authors perform a thorough empirical evaluation on their dataset. They compare Cervical-RG against a relevant set of baseline methods using a wide array of standard metrics for both report generation (BLEU, ROUGE, METEOR, BertScore, RadGraph, RadCliQ) and staging classification (C-Acc, F-Acc). The ablation studies provide insights into the contributions of different components (3D vs. 2D, multi-sequence, multi-stage training, hierarchical experts).

    Timely novel application: The paper addresses an important and previously unfilled gap: automated report generation for pelvic MRI in cervical cancer. This is clinically significant because MRI is crucial for cervical cancer staging and treatment planning, while radiology reporting is time-consuming. By focusing on multi-sequence 3D data, the work pushes beyond prior radiology report generation which mostly dealt with 2D images or single-modality scans. The inclusion of FIGO staging is a valid practical contribution, as staging determines management; automating its inference (with a reported accuracy presumably high for certain stages) could directly assist oncologic decision-making.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Ambiguity and Reproducibility Concerns Regarding Phase III (Report Reviewing Specialist): The paper states Phase III uses GPT-4o as an “external decision-making model” for consistency verification. This creates significant ambiguity:

    • Is GPT-4o part of the core Cervical-RG inference pipeline? If this is the case, the framework’s performance seems to rely heavily on a large, proprietary, external model, undermining claims of a self-contained novel method. This raises serious reproducibility issues (access, cost, potential API changes) and makes it difficult to isolate the contribution of the authors’ proposed architecture.

    • Is GPT-4o used only for post-hoc evaluation or correction? If so, this needs to be explicitly stated. Currently, its role seems integrated to me (“The workflow consists of… (3) cross-stage consistency verification”). I failed to see the clarification, which makes it hard for me to assess the true capabilities and limitations of the Cervical-RG model itself in ensuring coherent outputs. The lack of detail violates the reproducibility principle, as researchers cannot replicate this crucial step without access to the specific prompts and methodology used with GPT-4o.

    Insufficient Detail on Chain-of-Thought (CoT) Implementation: While incorporating CoT for the ‘Follow-up Oncologist’ (Phase II) is interesting, the implementation lacks crucial details necessary for understanding and reproduction.

    • The method section introduces six reasoning tags but does not explain how the model generates the reasoning steps associated with these tags. Is it generating a free-form natural-language rationale, or is it simply predicting these tags as intermediate outputs in a multi-task setup?

    • It’s also unclear how these generated reasoning steps or tags mechanistically guide the final FIGO stage prediction. Without these details, it’s difficult to assess the actual novelty and effectiveness of the CoT implementation beyond potentially being a structured intermediate prediction task. This lack of clarity hinders the ability to evaluate the claimed benefit of CoT for improving interpretability and accuracy in this context.

    Lacking Detail and Validation for FIGO Staging:

    • The model’s staging accuracy is not compared against the original radiologists’ performance, only against the recorded clinician label. Furthermore, there is no analysis of the types of staging errors made (e.g., tendencies to under-stage or over-stage specific conditions), which is crucial for understanding clinical applicability. Although the reported fine-grained accuracy (F-Acc = 0.209) is significantly higher than the baselines’, it is still low, highlighting the need for deeper analysis beyond simple accuracy scores.

    Limited Dataset Characterization and Potential Bias Concerns: Although the dataset size (3,177 cases) is fine here, the characterization provided is insufficient to fully assess its scope and potential limitations.

    • Lack of detail on patient demographics (age distribution, etc.), distribution across different FIGO stages (especially rarer stages), and variations in MRI scanner manufacturers/models/protocols across the seven hospitals. This information is crucial for understanding potential biases in the dataset and the model trained on it.

    • The test set size (105 cases, ~3.3% of the data) seems relatively small, which may limit the statistical significance of performance differences between models, especially when analyzing performance across subgroups (e.g., different FIGO stages).

    • No information on inter-rater reliability for the report annotations and FIGO staging provided by the two radiologists. Knowing the level of agreement/disagreement in the ground truth is essential for interpreting the model’s performance metrics realistically.

    Generalizability Remains Unproven: The evaluation is conducted solely on a held-out test set derived from the same data pool as the training set (albeit from multiple institutions). This setup does not adequately assess the model’s ability to generalize to data from entirely new institutions or settings, which may have different imaging protocols, patient populations, or annotation styles (domain shift). While collecting data from seven hospitals helps diversity, true clinical applicability requires demonstrating robustness on completely external datasets. Without such validation, the claimed state-of-the-art performance might not translate to real-world deployment. Given the lack of other datasets in this domain, acknowledging this limitation is necessary.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Will the dataset be publicly available?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The recommendation is ‘Weak Accept’ (4), acknowledging the paper’s significant strengths but contingent on satisfactory author rebuttal addressing the major weaknesses.

    I believe this paper has high clinical relevance, a novel hierarchical framework concept, a substantial and valuable dataset contribution (Cervical-MD), appropriate 3D modeling with an advanced decoder architecture, and a comprehensive empirical evaluation showing promising results on their data.

    Weaknesses need to be addressed during rebuttal: The score is moderated by significant concerns regarding clarity and reproducibility. Specifically:

    • The ambiguity surrounding the role and implementation of GPT-4o in Phase III is a major issue impacting the understanding of the core method’s capabilities and reproducibility.

    • Lack of detail on the CoT implementation and FIGO staging issues prevent a clear assessment of its mechanism and contribution.

    • Limited dataset characterization and the lack of external validation restrict confidence in the generalizability of the reported results.

    The paper presents a potentially impactful contribution. However, the ambiguities regarding key components (Phase III, CoT) and validation limitations currently prevent a stronger recommendation. A convincing rebuttal that clearly explains the GPT-4o role, details the CoT mechanism, and acknowledges limitations appropriately could potentially raise the score. Without such clarification, the paper’s claims are harder to fully verify and accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed most of my concerns, some more comprehensively than others:

    Role and Ambiguity of GPT-4o (W1 - Major Concern): This was my primary concern, and the authors’ clarification in Q3 is significant. They state that GPT-4o is not the main contributor, has a marginal effect on F-Acc (a 2.8% improvement), and is used only as a fallback for rare cases in Stage 3. Crucially, they mention it can be replaced by open-source models like Qwen3-4B. This alleviates concerns about reliance on a proprietary model and about reproducibility. This concern is addressed. A lingering question is the exact definition of “rare cases” and whether these results (with the potential fallback) are what is reported, but the core method’s independence is clearer.

    Insufficient Detail on CoT Implementation (W2 - Major Concern): The authors’ response in Q1 clarifies that the CoT implementation involves decomposing FIGO staging into six key indicators, and that during data construction each indicator is annotated with structured tags (e.g., ...). This structured format guides the model. This explanation makes the “how” of CoT more tangible: it is guided by structured annotation during training. This concern is partially addressed. While the annotation strategy is clearer, the mechanism by which the model learns to generate the intermediate reasoning text, beyond just predicting tags, could still be more detailed, but the current explanation is an improvement.

    Lacking Detail and Validation for FIGO Staging (W3): In Q2, the authors provide context for the F-Acc (20.9%), emphasizing the high C-Acc (60.9%) for major stages and that misclassifications are often between adjacent sub-stages, which is clinically more tolerable. They also point out the difficulty of FIGO staging directly from MRI. In Q4, they promise full expert scores in the final version’s Table 1. This concern is partially addressed and the justification for F-Acc is reasonable. The promise to complete Table 1 is expected.

    Limited Dataset Characterization (W4): Q4 provides some more details (age range, high-risk group representation) and promises detailed distributions forthcoming. This is a standard response. Inter-rater reliability for annotations wasn’t explicitly mentioned. This concern is partially addressed.

    Generalizability Unproven (W5): In Q6, the authors refer to Sec. 3.1 (data from seven hospitals) as evidence of generalizability and note the difficulty of finding comparable public datasets. This is a fair point; data from seven hospitals does provide a degree of diversity. This concern is partially addressed.

    Reviewer #1’s primary concerns were the lack of implementation details and unclear dataset description, leading to a “Strong Reject.” The authors’ rebuttal (Q1, Q4) seems to adequately address these general points, making R1’s initial assessment appear somewhat harsh post-rebuttal.

    Reviewer #2’s concerns included: the role of “Image Diagnosis” (addressed in Q5); reliance on GPT-4o (a key point, addressed in Q3, aligning with my major concern); clinician evaluation used only for the ablation study (the authors promise in Q4 to add full expert scores to Table 1, which would address this); and the low F-Acc (addressed in Q2, similar to my W3).

    Key shared weakness and rebuttal impact: the most critical point, shared by me (R3) and R2, was the role of GPT-4o. The authors’ clarification that it is a minor, replaceable fallback is a game-changer for the paper’s perceived core contribution and reproducibility, and strengthens the paper as the authors claim.

    New insights/agreement: I agree with R2 that the low F-Acc, while understandable in context, remains a limitation to acknowledge, which the authors do. The promise to include full expert evaluation scores in the main comparison (Table 1), as suggested by R2’s critique and addressed in Q4 of the rebuttal, would be a valuable addition to the final paper. The authors’ response addressed the majority of the major concerns raised across reviews.

    Given the authors’ clarifications, particularly regarding (i) the minimal and replaceable role of GPT-4o (resolving a major ambiguity and reproducibility concern), (ii) the improved explanation of the CoT implementation via structured annotation, (iii) the contextualization of the F-Acc for FIGO staging, and (iv) the promise to include full expert evaluation in the main comparison table, the paper’s contribution, in terms of a novel framework for a challenging clinical task (cervical cancer report generation with FIGO staging from multi-sequence 3D MRI) and the valuable Cervical-MD dataset, is now clearer. The core methodology seems less dependent on external proprietary models than initially perceived. I would suggest “Accept”.




Author Feedback

We sincerely appreciate all reviewers for their constructive suggestions and for recognizing the key strengths of our work, as follows. Clinical impact: addresses a “highly relevant clinical problem” (R3) with a framework “aligned with physicians’ workflow” (R2). Technical innovation: FIGO-CoT reasoning and a “technically interesting” Mamba decoder “provide evidence for high-quality reports” (R2, R3). Dataset value: Cervical-MD is a “valuable resource” that “fills a critical gap” (R1, R3). Empirical validation: “significantly outperforms prior methods” (R2), with ablations offering “insights into components” (R3).

Q1: Method Implementation (R1, R3): The visual features are projected into the textual semantic space via an MLP layer, enabling cross-modal feature alignment. We then replace the placeholder tokens in the text inputs with the projected visual representations, accomplishing end-to-end feature fusion. For the implementation of Chain-of-Thought (CoT) reasoning, following the FIGO staging criteria, we decompose the clinical staging decision process into six key indicators. During data construction, each indicator is annotated with structured tags marking the beginning and end of its reasoning segment (e.g., a tumor-size segment such as “The tumor size is 34mm*12mm*10mm” enclosed in its opening and closing tags). This structured CoT format guides the model to learn inter-indicator relationships, thereby enhancing interpretability and improving the robustness of inference.
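As a concrete illustration of the projection-and-placeholder fusion described in Q1, here is a minimal sketch; the placeholder token id, projector widths, and helper names are illustrative assumptions, not the released implementation.

```python
# Sketch of MLP projection plus placeholder-token replacement for
# cross-modal fusion. IMG_PLACEHOLDER_ID and the projector shape are
# hypothetical, chosen only to make the mechanism concrete.
import torch
import torch.nn as nn

IMG_PLACEHOLDER_ID = 32000  # hypothetical id of the image placeholder token

class VisualProjector(nn.Module):
    def __init__(self, vis_dim: int = 1024, txt_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (n_visual_tokens, vis_dim) from the volumetric encoder
        return self.mlp(vis_feats)

def fuse(input_ids, text_embeds, vis_embeds):
    # Overwrite each placeholder position with one projected visual token.
    fused = text_embeds.clone()
    mask = input_ids == IMG_PLACEHOLDER_ID  # (seq_len,) boolean mask
    assert mask.sum() == vis_embeds.shape[0], "one visual token per placeholder"
    fused[mask] = vis_embeds
    return fused
```

For the structured CoT annotation, a training target might look like `<tumor_size>The tumor size is 34mm*12mm*10mm</tumor_size>`, where the tag name itself is hypothetical, since the paper's exact tag vocabulary is not given here.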

Q2: F-Acc metrics of FIGO staging in Tab. 2 (R2, R3): Our model achieves a C-Acc of 60.9% for the 4 major stages and an F-Acc of 20.9% for the 19 minor stages, both outperforming existing state-of-the-art approaches. The model’s clinical acceptability stems from its high accuracy in major-stage classification, with rare errors across distant stages (e.g., I vs. IV). Misclassifications mainly occur between adjacent sub-stages (e.g., IIA vs. IIB), which are generally tolerable in clinical practice. Importantly, our model infers FIGO staging directly from MRI scans instead of WSI, a highly challenging and under-explored task; we believe our work provides a strong baseline for this new direction and acknowledge that further improvements are needed.
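To make the two metrics in Q2 concrete, the sketch below computes coarse (major-stage) and fine (sub-stage) accuracy, assuming FIGO labels are strings such as "IB2" or "IIIA" whose leading Roman numeral denotes the major stage; this parsing rule is an assumption, not the paper's stated protocol.

```python
# Minimal sketch of C-Acc (major stage) vs. F-Acc (exact sub-stage),
# assuming string labels with a leading Roman-numeral major stage.
import re

def major_stage(label: str) -> str:
    # Try IV before III before II before I so the longest prefix wins.
    m = re.match(r"(IV|III|II|I)", label)
    return m.group(1) if m else label

def staging_accuracies(preds, golds):
    n = len(golds)
    c_acc = sum(major_stage(p) == major_stage(g) for p, g in zip(preds, golds)) / n
    f_acc = sum(p == g for p, g in zip(preds, golds)) / n  # exact match
    return c_acc, f_acc

# "IIA1" vs "IIB" agrees on major stage II but not the sub-stage:
print(staging_accuracies(["IIA1", "IB2"], ["IIB", "IB2"]))  # (1.0, 0.5)
```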

Q3: Role of GPT-4o (R2, R3): We clarify that GPT-4o is not the main contributor to our model’s performance. As shown in Tab. 1 (row 4), GPT-4o alone achieves a BLEU-2 score of 0.029, and adding it only improves F-Acc by 2.8% (Tab. 2), indicating a marginal effect. The key gains come from our multi-MRI sequence modeling and CoT mechanism. GPT-4o is used only in Stage 3 as a fallback for rare cases and can be replaced by open-source models like Qwen3-4B for local deployment. This will be clarified in the final version to avoid confusion.
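The fallback role described in Q3 suggests a backend-agnostic consistency check; the sketch below shows one way to structure it so that GPT-4o and a local open-source model are interchangeable. The prompt wording, the trigger rule, and the `ask` interface are hypothetical, not the authors' actual Stage 3 implementation.

```python
# Sketch of a backend-agnostic Phase III consistency check. The prompt
# text and the `ask` callback are illustrative assumptions.
from typing import Callable

CHECK_PROMPT = (
    "Report:\n{report}\n\nPredicted FIGO stage: {stage}\n"
    "Does the stage contradict the report findings? Answer YES or NO."
)

def verify_consistency(report: str, stage: str,
                       ask: Callable[[str], str]) -> bool:
    """Return True if the stage is judged consistent with the report.
    `ask` wraps any chat backend: GPT-4o via API, or a local open-source
    model (e.g., a Qwen-class LLM) for air-gapped hospital deployment."""
    answer = ask(CHECK_PROMPT.format(report=report, stage=stage))
    return answer.strip().upper().startswith("NO")  # "NO" = no contradiction

# Usage: plug in whichever backend the deployment allows, e.g.
#   verify_consistency(report_text, "IIB", ask=local_llm_ask)
#   verify_consistency(report_text, "IIB", ask=gpt4o_ask)
```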

Q4: Dataset Details and Expert Evaluation (R1, R2, R3): The Cervical-MD dataset includes five MRI sequences, radiology reports, and FIGO labels per patient (ages 20–90), with the 40–60 high-risk group most represented. Limited data led us to allocate more samples for training; both the dataset and the test set are being expanded, with detailed distributions forthcoming. Two radiologists evaluated model outputs using standardized criteria to reduce subjectivity. To reduce the evaluation burden, only partial results appear in Tab. 1; full expert scores will be added in the final version.

Q5: Clarification on “Image Diagnosis” (R2): “Image diagnosis” refers to the medical report generated in Phase 1; Fig. 3 offers an example.

Q6: External Validation (R3): As noted in Sec. 3.1, experiments on data from seven hospitals demonstrate strong generalizability. Due to the richness of Cervical-MD, no public dataset resembles its input format. Additional data from other institutions are being collected for further validation, and results will be included in the final version.

Upon acceptance, we will release all code, and researchers can apply for access to the Cervical-MD dataset by contacting us via email.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a chain-of-thought–guided framework for report generation from 3D MRI. It is a well presented application study that demonstrates clear clinical utility and integrates FIGO staging for cervical cancer into the workflow. The dataset comprises clinically validated pairs of five-sequence MRIs and their corresponding reports. The method outperforms baselines in both diagnostic report generation and FIGO staging classification, with a thorough evaluation. To further strengthen the work, the suggested expert evaluation should be included in the final version. The comparison against UNETR and Swin-UNETR for FIGO staging classification is limited and not entirely convincing.


