Abstract
User research is increasingly recognized as an essential strategy for ensuring the usability, safety, and effectiveness of emerging technologies in surgery. From a human-centered perspective, user studies are key to evaluating how technology-assisted interventions affect human behavior and system perceptions. For feasibility and scalability, these studies are typically conducted in controlled, desk-based lab settings. However, these settings often lack ecological validity, raising questions about how well they capture the actual surgical environment’s emotional, perceptual, and interactive complexities. Previous work in human-centered assurance for image-based navigation, for example, described office-like laboratory studies where participants were asked to assess the adequacy of image-based 2D/3D registration, revealing that evaluators struggled to identify misalignments reliably. For that same task in robotic surgery, this study investigates whether–and how–the environment in which user studies are administered influences user behavior and performance. Specifically, we compare a conventional office-like lab to a high-fidelity mock operating room (mock OR) with an active robotic system, where the latter is contextually more relevant to the surgical task. Twenty-one participants first trained in an office, then were randomly assigned to either return to the office or proceed to the mock OR. Although task performance did not differ significantly, likely due to task difficulty, participants in the mock OR showed significantly higher interaction, perceived stakes, and NASA-TLX workload changes, despite completing the same task. These findings suggest that realistic, contextually relevant environments modulate user responses and behavior, with important implications for how user studies are designed, interpreted, and applied in computer-assisted interventions.
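As a purely illustrative aside (not taken from the paper, which does not publish its analysis code here), a between-subjects comparison of NASA-TLX workload change like the one summarized above could be sketched as follows, assuming a nonparametric Mann-Whitney U test with a rank-biserial effect size; the group data below are hypothetical placeholders.

```python
# Illustrative sketch only: compare NASA-TLX workload change (Session 2 minus
# Session 1) between the mock-OR and office groups with a Mann-Whitney U test.
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant workload changes (not the study's data).
tlx_change_mock_or = np.array([6.5, 4.0, 8.0, 5.5, 7.0, 3.5, 9.0, 6.0, 4.5])
tlx_change_office = np.array([2.0, -1.5, 0.5, 3.0, 1.0, -0.5, 2.5, 1.5, 0.0])

u_stat, p_value = mannwhitneyu(tlx_change_mock_or, tlx_change_office,
                               alternative="two-sided")

# Rank-biserial correlation from U1: positive values indicate the first group
# (mock OR) tends to show larger workload changes.
n1, n2 = len(tlx_change_mock_or), len(tlx_change_office)
rank_biserial = 2.0 * u_stat / (n1 * n2) - 1.0

print(f"U = {u_stat:.1f}, p = {p_value:.4f}, rank-biserial r = {rank_biserial:.2f}")
```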
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4505_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/4505_supp.zip
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ChoSue_Feeling_MICCAI2025,
author = { Cho, Sue Min and Wu, Winnie and Kilmer, Ethan and Taylor, Russell H. and Unberath, Mathias},
title = { { Feeling the Stakes: Realism and Ecological Validity in User Research for Computer-Assisted Interventions } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper studied the impact of the ecological validity of a simulated environment on user behavior and performance. Specifically, the authors compared a conventional office-like lab to a high-fidelity mock operating room with an active robotic system. Twenty-one participants were randomly assigned to either return to the office or proceed to the mock OR to perform a basic task. The authors measured task performance, subjective ratings of perceived stakes and stress, and NASA-TLX workload changes, and showed differences between the groups.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Ecological validity of user conditions is crucial and rarely studied.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Figure 1 clearly demonstrates how unfair it is to compare two such different conditions. The evaluation metrics are a bit weak. There is more to measure. The participants should be professionals. The notion of ecological validity sounds different for this population. The results would be different and also very interesting.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The evaluation metrics are a bit weak. There is more to measure. The participants should be professionals. The notion of ecological validity sounds different for this population. The results would be different and also very interesting.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The evaluation metrics are a bit weak. There is more to measure. The participants should be professionals. The notion of ecological validity sounds different for this population. The results would be different and also very interesting.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Rebuttal satisfactory
Review #2
- Please describe the contribution of the paper
The paper investigates how the testing environment influences user behavior and subjective perceptions in a computer-assisted intervention task. By comparing an office-like laboratory setting with a high-fidelity mock operating room (mock OR) equipped with an active robotic system, the study evaluates differences in objective performance, interaction metrics (click counts), and subjective workload/stress ratings during a 2D/3D registration assessment task.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The work directly addresses the critical issue of ecological validity in user studies for surgical navigation, emphasizing the importance of realistic testing environments.
- The study employs a controlled experimental design, comparing performance and subjective measures between two distinctly different environments.
- Both objective measures (accuracy, sensitivity, click counts) and subjective measures (NASA-TLX, perceived stakes, stress) are reported, providing a multi-faceted evaluation of user response.
- Appropriate statistical tests and effect size calculations are applied to evaluate differences between groups.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- With only 18 participants after exclusions, the small sample size may limit the generalizability and statistical power of the findings.
- The experimental task (a 2D/3D registration assessment) is narrowly defined and may not capture the full complexity of real surgical procedures or other computer-assisted intervention tasks.
- While the mock OR provides valuable contextual realism, such setups are resource-intensive and may not be widely replicable, potentially limiting the study’s practical applicability.
- The study focuses on a single registration task without assessing how prolonged exposure to realistic environments might affect performance or learning curves over time.
- Although subjective and objective metrics are reported, the paper could benefit from a deeper discussion of specific user errors or potential usability challenges that arise uniquely in realistic versus lab environments.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The study makes an important contribution by highlighting the impact of realistic testing environments on user engagement and workload in computer-assisted interventions. However, the limited sample size, narrow task focus, and reliance on a resource-intensive mock OR setup constrain the generalizability of the findings. A broader investigation across multiple tasks or extended study periods would strengthen the conclusions. The paper provides useful insights, yet its current scope and limitations suggest that further work is needed before it can have a wide clinical or research impact.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Despite the authors’ thoughtful rebuttal, I remain unconvinced that the current version of the paper warrants acceptance. While the study addresses an important and underexplored question about ecological validity in CAI research, the fundamental limitations—namely the small sample size, lack of expert participants, narrowly defined task, and reliance on a resource-intensive environment—substantially limit the generalizability, reproducibility, and broader impact of the findings. The rebuttal appropriately acknowledges these concerns but frames them largely as directions for future work. However, MICCAI accepts papers based on completed, compelling contributions, not primarily on potential. Without a more diverse subject pool or task complexity to strengthen the conclusions, the paper remains premature for publication at this stage.
Review #3
- Please describe the contribution of the paper
The authors explore the idea that the realism of the setting in which a user study takes place affects the results, even if the interface being investigated and the task itself do not change. The specific task used here was to determine the usability and acceptability of a particular registration algorithm for spinal images in robot-assisted surgery.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The idea is interesting and extremely relevant as to how CAI research (and other MICCAI research more generally) is performed.
- The results are fairly convincing and proper statistics have been performed, even given the relatively low number of subjects (18 novices). The structure of the hypotheses and experiments is described very clearly and evaluated in good faith.
- The MockOR environment looks good and the discussion on using realistic virtual reality environments is warranted given the subject and results of the paper.
- The authors perform good experimental sanity checks, such as verifying the perceived stakes of the MockOR, rather than just leaving it as an assumption. The end hypotheses, that the environment would change the interaction behaviour and the subjective assessment of the task, largely line up.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The use of only novices is problematic as it is unclear if expert surgeons also have a similar distinction between mock real-world and more abstracted office environments, especially given their increased familiarity with real operating rooms. This is nevertheless an important initial step.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a highly structured and enjoyable paper with a clearly defined message that should resonate with many people in the CAI community designing user study experiments. I feel that it will be very well received by the community in general and would generate interesting discussion. No, the paper does not have “technological novelty” but such a thing would only distract from their message and their experiment for testing said message.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I still strongly believe that such a study, even if preliminary, is important to generate discussion about how CAI research is performed. I understand R2 wants expert participants but I don’t think this should preclude accepting this study, given how common it is for initial and prototypical research in CAI to use novice participants.
Author Feedback
We thank all reviewers for their constructive feedback. We are encouraged by the recognition that our study addresses an “interesting and extremely relevant” (R1) problem, with a “clearly defined message” and “controlled experimental design” (R3) that provides “useful insights” (R3) into the ecological validity of user studies in computer-assisted interventions (CAI)—a topic R2 notes is “crucial and rarely studied.” Below, we respond to the main concerns raised.
<Study Scope, Design, and Population (R1, R2, R3)> Our study builds on prior work in the CAI community that established baselines for how well humans may detect registration misalignments (e.g., [4]). However, that work was conducted in a standard office-like environment, leaving open how environmental realism may influence user perception and behavior—precisely the gap we address. Given that human assurance of automated algorithms, such as those for 2D/3D registration, is still a relatively new area—and that no standard definition of “expertise” for assurance exists—we selected a well-defined task grounded in prior CAI work (R1, R3). All participants received standardized training (Session 1), and only the environment varied in Session 2, using a between-subjects design. Even with a small sample size, we observed significant effects with substantial effect sizes (R3). We believe that, as a foundational investigation (recognized by R1), the design choices offer guidance for future studies by identifying key effects that can inform power analyses and the design of larger-scale experiments. We agree that future work should explore more complex tasks (R3) and involve expert participants (R1, R2)—particularly for workflows where novice/expert distinctions are well established. We will revise the Discussion to clarify these choices, position this study as a starting point for future research, and outline how our framework may extend to more complex surgical scenarios.
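(Editorial illustration, not part of the authors' rebuttal: the point that observed effects can "inform power analyses" for larger-scale experiments could be sketched as below, where the effect size, alpha, and power target are hypothetical placeholders rather than the study's reported values.)

```python
# Illustrative sketch only: estimate the per-group sample size needed in a
# follow-up between-subjects study, given a pilot effect size.
from statsmodels.stats.power import TTestIndPower

effect_size = 0.8   # hypothetical Cohen's d from a pilot comparison
alpha = 0.05        # two-sided significance level
power = 0.8         # desired statistical power

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                   power=power, alternative="two-sided")
print(f"Approximately {n_per_group:.0f} participants per group needed.")
```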
<Environmental Contrast and Replicability (R2, R3)> We appreciate the concern about the stark contrast between the office and mock OR environments (R2). This contrast was intentional and central to our research question: whether contextual realism meaningfully affects user behavior, even when the task and algorithmic performance remain constant. Participants were randomly assigned to one of the two environments to ensure balanced group comparison. We also acknowledge that mock OR setups may be resource-intensive and not available at all institutions (R3), but believe this concern is not central to evaluating this manuscript. As we better understand the factors most pertinent to user behavior—through studies like this one—we can work toward scalability. Indeed, as noted in our Discussion and supported by R1’s comment on the potential value of realistic virtual environments, we see strong potential in translating key aspects of realism into more scalable formats, such as virtual reality. We will revise the Discussion to clarify our motivation and future directions.
<Evaluation Metrics (R2)> In our study, we focused on a core set of objective and subjective measures to evaluate performance and perception. These were selected to align with our research questions and provide interpretable insights. That said, we agree there is value in expanding the framework. Future work may incorporate eye-tracking or physiological data to capture more granular indicators of cognitive and affective state. We will revise the Discussion to reflect this direction.
<Error Analysis and Usability (R3)> Our focus was on high-level performance and subjective responses to isolate the effect of environmental context. A detailed error analysis was beyond our scope, but would be informative in future work. While this study was not a usability study, the findings have broader implications for how usability evaluations are designed and interpreted, especially in human-centered CAI. We will revise the Discussion to clarify this contribution.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The paper investigates the impact of ecological validity in user studies for computer-assisted interventions by comparing task performance and subjective perceptions in a conventional office-like lab versus a high-fidelity mock operating room.
Overall, the reviewers agreed that the paper presents a well-structured and clearly written study, with a highly relevant premise. The use of a mock OR to explore differences in task perception and performance represents a valuable contribution, especially for researchers designing user studies in medical settings. R1 rated the paper highly, highlighting the clarity of the experimental design, thoughtful statistical analysis, and the importance of the message for the broader MICCAI community.
R2 and R3 appreciated the paper’s focus and methodological care but raised concerns. Both reviewers questioned the generalizability of the results due to the limited sample size and the use of novices rather than clinical professionals. They noted that expert users might react differently to ecological variations due to their familiarity with operating room environments. R3 also expressed concerns about the scalability of the mock OR setup and the narrowness of the task used. R2 felt that the comparison between the lab and OR conditions may not be sufficiently controlled, pointing to Figure 1 as highlighting the stark differences between the environments.
Given the diversity of views, I would recommend the authors submit a rebuttal. In your response, please address the reviewers’ concerns, including:
- Justify the use of novice participants and comment on how expert involvement might affect the findings.
- Respond to concerns about the generalizability of the task and the potential for broader application of the conclusions.
- Clarify the rationale behind the experimental design choices, including the level of realism and the selected evaluation metrics.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Definitely not my area, but the reviewers seem to be enthusiastic, except for R3, with whom I actually agree w.r.t. the value of ‘preliminary’ claims at MICCAI. The authors should take that feedback into account and perhaps rephrase.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This AC recognizes the concerns raised by R3 post-rebuttal. However, as a CAI-focused paper, it should be evaluated differently than a MIC-focused paper, where issues such as the number of participants may be of more importance.