Performance of ChatGPT and DeepSeek in the Management of Postprostatectomy Urinary Incontinence

ABSTRACT

Purpose:  Artificial intelligence (AI) continues to evolve as a tool in clinical decision support. Large language models (LLMs), such as ChatGPT and DeepSeek, are increasingly used in medicine to provide fast, accessible information. This study aimed to compare the performance of ChatGPT and DeepSeek in generating recommendations for the management of postprostatectomy urinary incontinence (PPUI), based on the AUA/SUFU guideline.

Materials and Methods:  A total of 20 questions (10 conceptual and 10 case-based) were developed by three urologists with expertise in PPUI, following the AUA/SUFU guideline. Each question was submitted in English using zero-shot prompting to ChatGPT-4o and DeepSeek R1. Responses were limited to 200 words and graded independently as correct (1 point), partially correct (0.5), or incorrect (0). Total and domain-specific scores were compared.

Results:  ChatGPT achieved 19 out of 20 points (95.0%), while DeepSeek scored 14.5 (72.5%; p = 0.031). In conceptual questions, scores were 9.0 (ChatGPT) and 8.0 (DeepSeek; p = 0.50). In case-based scenarios, ChatGPT scored 10.0 versus 6.5 for DeepSeek (p = 0.08). ChatGPT outperformed DeepSeek across all guideline domains. DeepSeek made critical errors in the treatment domain, such as recommending a male sling for radiated patients.

Conclusion:  ChatGPT demonstrated superior performance in providing guideline-based recommendations for PPUI. However, both models should be used under expert supervision, and future research is needed to optimize their safe integration into clinical workflows.

Keywords:
Artificial Intelligence; Urinary Incontinence; Clinical Decision-Making

INTRODUCTION

Artificial intelligence (AI) has emerged as a transformative force in healthcare, particularly through large language models (LLMs) capable of processing and generating complex medical content. ChatGPT, developed by OpenAI, has become one of the most widely used tools, designed to simulate human-like dialogue and provide accurate, contextually relevant responses (1). Since its release in 2022, ChatGPT has shown substantial potential in supporting healthcare professionals by enabling rapid access to clinical information, assisting in medical decision-making, and improving patient care workflows (2-5).

More recently, other LLMs such as Perplexity, Gemini 2.0, and Copilot have entered the field, offering unique features and expanding the landscape of AI in medicine (6). In mid-January 2025, DeepSeek-R1 was released, an innovative open-source, cost-free LLM that has rapidly gained prominence worldwide thanks to its strong performance and broad accessibility (7). Its disruptive potential lies in combining high-quality outputs with zero user cost, making it especially appealing in low-resource settings or for professionals seeking efficient, freely available AI solutions.

As the use of these tools becomes more widespread in clinical and academic settings, their evaluation is essential to ensure safe and practical application in real-world scenarios. For instance, recent studies have evaluated the accuracy and reproducibility of ChatGPT in answering questions related to various urological diagnoses, highlighting areas where its responses can be incomplete or misleading (8). Physicians increasingly turn to LLMs such as ChatGPT, DeepSeek, and Gemini to answer questions across medical specialties, including disease management strategies (4-6, 8). These platforms offer structured, on-demand access to medical knowledge, assisting with treatment algorithms, diagnostic decisions, and drug information retrieval during clinical routines (9, 10).

Urinary incontinence after prostate surgery – particularly postprostatectomy urinary incontinence (PPUI) – remains a common and challenging condition (11). Its management often requires individualized approaches based on various patient factors and may prompt complex clinical questions. To support clinicians, the Incontinence After Prostate Treatment guideline developed by the American Urological Association (AUA) and the Society of Urodynamics, Female Pelvic Medicine and Urogenital Reconstruction (SUFU) provides structured, evidence-based recommendations (12).

Our group previously assessed the performance of ChatGPT versions 3.5 and 4 in delivering guideline-based recommendations for PPUI management, using the AUA/SUFU document as a reference standard (5). Building on this foundation, we postulated that ChatGPT-4o, given its advanced development and optimization, would outperform the recently introduced open-source model DeepSeek R1 in delivering recommendations consistent with established clinical guidelines. Consequently, the present study compares the most recent version of ChatGPT (4o) with DeepSeek (R1), focusing on their ability to deliver accurate and clinically relevant guidance for evaluating and treating PPUI.

MATERIALS AND METHODS

This research presents a comparative assessment of two widely utilized large language models – ChatGPT-4o and DeepSeek R1 – focusing on their capacity to deliver recommendations aligned with established guidelines for managing PPUI. To accomplish this, a set of questions was constructed by three urologists, each possessing over two decades of clinical experience and specialized expertise in PPUI. The questions were derived from the Incontinence After Prostate Treatment: AUA/SUFU Guideline, ensuring that each inquiry corresponded to a well-defined and non-controversial answer (12). The guideline itself is structured into several key domains: (a) considerations before prostate treatment, (b) management following prostate treatment, (c) evaluation strategies for post-treatment incontinence, (d) therapeutic interventions, (e) surgical complications, and (f) complex scenarios, including compromised urethral integrity, bladder neck strictures, and approaches to complications arising from surgical management of PPUI (12).

Test 1 – Conceptual questions: Ten conceptual questions were prepared based on the Guideline recommendations, divided into pre-prostate treatment (two questions), post-prostate treatment (two questions), evaluation of incontinence after prostate treatment (three questions), and treatment options (three questions).

Test 2 – Case-based questions: To analyze the ability of ChatGPT and DeepSeek to apply knowledge and critical thinking skills, 10 questions were created using real or hypothetical clinical cases and grounded in the concepts and recommendations provided by the AUA guideline. These questions were divided into post-prostate treatment (one question), evaluation of incontinence after prostate treatment (two questions), treatment options (four questions), surgical complications (two questions), and special situations (one question). A list of the questions used in tests 1 and 2 can be found in the Supporting Information.

All questions were open-ended and descriptive. They were entered individually and anonymously (without IP tracking) into ChatGPT-4o and DeepSeek R1 in February 2025. A single investigator submitted all prompts in English and instructed the AI engines to provide specific, concise answers limited to 200 words. The AI models were not prompted to incorporate any particular guideline. Each question was submitted independently using the "New Chat" function to ensure a zero-shot format, meaning that no prior context or sequential prompting was used.
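For illustration of the zero-shot format, the submission step can be approximated programmatically. The sketch below is not the procedure used in this study, which relied on the public web interfaces; the API client, the model identifier, and the ask_zero_shot helper are assumptions introduced only to make the protocol explicit.

  # Illustrative sketch only: the study used the web chat interfaces, not an API.
  # Assumes the OpenAI Python client (openai>=1.0) and an API key in the environment.
  from openai import OpenAI

  client = OpenAI()

  INSTRUCTION = "Provide a specific, concise answer limited to 200 words."

  def ask_zero_shot(question: str, model: str = "gpt-4o") -> str:
      """Submit a single question with no prior context (analogous to a fresh 'New Chat')."""
      response = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": f"{INSTRUCTION}\n\n{question}"}],
      )
      return response.choices[0].message.content

  # Each of the 20 questions would be sent in its own independent call, so no
  # conversational context carries over between questions (zero-shot prompting).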

The same three expert urologists who formulated the questions independently evaluated the AI-generated responses. Each answer was graded using the following system:

  • (A) Correct (1 point)

  • (B) Mixed: includes correct and incorrect or outdated information (0.5 points)

  • (C) Incorrect (0 points)

The overall performance of each model—ChatGPT and DeepSeek—was assessed for both conceptual (Test 1) and case-based (Test 2) questions, with a maximum score of 10 points per test.
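As a simple illustration of how the grading scheme maps onto the reported scores, the sketch below aggregates a hypothetical set of per-question grades (1, 0.5, or 0); the example values mirror ChatGPT's reported grade distribution but are shown only to make the scoring arithmetic explicit.

  # Hypothetical grades for illustration; 1 = correct, 0.5 = mixed, 0 = incorrect.
  # The distribution mirrors ChatGPT's reported results (8 correct + 2 mixed
  # conceptual answers; 10 correct case-based answers).
  grades = {
      "conceptual (Test 1)": [1, 1, 0.5, 1, 1, 1, 0.5, 1, 1, 1],
      "case-based (Test 2)": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  }

  for test, values in grades.items():
      print(f"{test}: {sum(values)}/10")

  total = sum(sum(values) for values in grades.values())
  print(f"Overall: {total}/20 ({total / 20:.1%})")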

This study was exempt from IRB review as no patient-level data were used.

Statistical analysis

Quantitative variables were expressed as absolute values, percentages, or proportions. We compared categorical variables using the Chi-square test or Fisher's exact test. All tests were two-sided, and statistical significance was set at p < 0.05. The analysis was performed using GraphPad Prism, version 10.0.01, for Windows.
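For transparency, this type of comparison can be reproduced with standard statistical software. The sketch below uses SciPy rather than GraphPad Prism and applies Fisher's exact test to one plausible dichotomization of the overall results (fully correct versus not fully correct responses); the contingency tables actually used in the analysis are not reported, so these counts are an assumption, although this particular table yields a two-sided p value of approximately 0.03.

  # Illustrative sketch: Fisher's exact test on a 2x2 table built from the reported
  # counts of fully correct answers (18/20 for ChatGPT, 11/20 for DeepSeek).
  # The dichotomization is an assumption; the published analysis used GraphPad Prism.
  from scipy.stats import fisher_exact

  #            fully correct, partially correct or incorrect
  chatgpt  = [18, 2]
  deepseek = [11, 9]

  odds_ratio, p_value = fisher_exact([chatgpt, deepseek], alternative="two-sided")
  print(f"Odds ratio: {odds_ratio:.2f}, two-sided p = {p_value:.3f}")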

RESULTS

ChatGPT scored 19 out of 20 points (95.0% accuracy). In comparison, DeepSeek scored 14.5 out of 20 points (72.5%; p = 0.031). Tables 1 and 2 show examples of errors and differences in performance between the two AI models. They also show the domains and reasons for all incorrect or partially correct responses from ChatGPT and DeepSeek.

Table 1
Sample Questions Showing Divergent Performance Between ChatGPT-4o and DeepSeek in PPUI Recommendations.
Table 2
Performance of ChatGPT and DeepSeek according to guideline domains.

Test 1: In the conceptual questions, ChatGPT provided accurate answers to eight questions and partially correct answers to two, resulting in a final score of 9.0. No incorrect responses were recorded. In contrast, DeepSeek provided six correct answers and four partially correct responses, with a final score of 8.0 (p = 0.50).

Test 2: ChatGPT provided fully correct answers to all ten case-based questions, achieving a perfect score of 10.0. In contrast, DeepSeek provided five correct answers, three partially correct responses, and two incorrect answers, resulting in a final score of 6.5 (p = 0.08).

Tables 2 and 3 show the differences in performance between ChatGPT and DeepSeek, according to the different domains of the guideline. ChatGPT outperformed DeepSeek in all domains. Its two partially correct responses occurred in the Evaluation and Treatment Options domains. One of ChatGPT's partially correct answers was in response to a question about the need for urethrocystoscopy. It stated that the exam should be considered in patients being evaluated for surgical treatment of PPUI and correctly listed its principal utilities in this context; however, it did not classify the procedure as mandatory. The other partially correct response involved selecting patients with PPUI for sling placement. Although ChatGPT appropriately identified the incontinence as mild to moderate and noted the absence of prior pelvic radiotherapy, it failed to mention the time interval between surgery and treatment, which is considered mandatory in the preoperative evaluation.

Table 3
Wrong answers from ChatGPT and DeepSeek.

DeepSeek presented partial or complete errors across all six guideline domains. Its poorest performance was in Treatment Options, where it provided two partially correct answers and one incorrect response. The other incorrect answer occurred in the Pre-prostate Treatment domain. Additionally, two partially correct answers were given in the Evaluation domain, and one partially correct response was observed in each of the following domains: Pre-prostate Treatment, Complications, and Special Situations.

Neither chatbot informed users that its answers may be based on general training data rather than specific, up-to-date, high-quality medical content, nor did either provide warnings about the limitations of its training data or the potential for inaccuracies. In a few responses (one conceptual and two case-based), DeepSeek highlighted the importance of shared decision-making, noting that surgical treatment involves risks and requires realistic expectations regarding outcomes. It also recommended early referral to a specialist in male incontinence to optimize the timing and appropriateness of the intervention. ChatGPT did not include explicit recommendations to consult a physician or healthcare professional, although some of its responses advised that a thorough discussion of risks, benefits, and the potential need for lifelong device management is essential to support better outcomes.

DISCUSSION

This study aimed to evaluate and compare the accuracy of two LLMs, ChatGPT-4o and DeepSeek R1, in generating guideline-concordant recommendations for the management of PPUI, using the AUA/SUFU guideline (12) as a benchmark. This direct, head-to-head comparison of these two contemporary LLMs in a specialized urological context provides novel insights into their relative clinical utility and potential safety implications. ChatGPT outperformed DeepSeek, achieving an overall accuracy of 95.0% versus 72.5%, respectively (p = 0.031). This performance gap was especially pronounced in case-based scenarios, where ChatGPT scored a perfect 10/10, while DeepSeek obtained 6.5/10. These results reinforce previous findings from our group (5) and from an independent study (13), where ChatGPT-4 significantly surpassed version 3.5 in similar clinical contexts.

ChatGPT demonstrated consistently high performance across all evaluated guideline domains, particularly in complex areas such as treatment selection and surgical complication management. In contrast, DeepSeek produced partially or fully incorrect responses in every domain, with its weakest performance in the "treatment options" section. Notably, many of these inaccuracies were critical rather than minor; for example, recommending a male sling for a radiated patient represents a potentially hazardous clinical decision, not a trivial factual slip, thereby heightening safety concerns. While DeepSeek occasionally incorporated ethical elements, such as promoting shared decision-making and specialist referral, these did not offset its factual inaccuracies. This divergence likely reflects differences in training architecture, particularly reinforcement learning from human feedback, which has been associated with improved factual reliability in newer LLMs (1).

Both models were tested under a zero-shot prompt configuration without prior context or instruction to follow a specific guideline. This design was intentional, simulating real-world interactions where users input spontaneous questions. Under these conditions, ChatGPT's strong contextual alignment with evidence-based recommendations suggests real-world utility in clinical support or educational use. Nonetheless, performance was not flawless. For instance, ChatGPT failed to state that cystoscopy is mandatory in the preoperative workup and incorrectly portrayed urodynamic testing as universally required, demonstrating that even high-performing LLMs may falter in nuanced or infrequently referenced scenarios.

The study methodology included domain-stratified question design by experienced urologists and structured, independent evaluation. Although the assessment framework was rigorous, the absence of a blinded review may have introduced bias. Additionally, as all prompts were in English, the findings may not generalize to other languages or cultural contexts.

ChatGPT's high accuracy in guideline-based PPUI scenarios supports its potential as a supplementary tool in urologic clinical practice. While not a substitute for clinical judgment, it may aid physicians in patient education, draft preparation, or as a second opinion generator. Similar performance advantages have been noted in other urologic conditions – including benign prostatic hyperplasia, prostate cancer, and urolithiasis – where ChatGPT-generated materials have been favorably rated compared to traditional resources (14).

These findings also underscore the broader potential of LLMs to improve healthcare delivery across diverse settings (15). In resource-limited environments, where access to trained specialists is constrained but digital connectivity is increasingly available, LLMs may serve as a valuable tool to support triage and immediate clinical decision-making (16). Beyond access, these models may enhance healthcare system efficiency by organizing and synthesizing complex clinical information, facilitating the monitoring of patients with multiple comorbidities, and reducing the administrative workload that often burdens clinicians (1).

Nonetheless, current LLMs lack key capabilities for individualized care. They do not reliably account for patient-specific variables such as comorbidities, psychosocial context, or long-term treatment trajectories. Moreover, neither model provided disclaimers regarding the scope or limitations of their training data, nor did they express uncertainty or confidence levels in their outputs. These omissions pose risks, particularly in unsupervised or low-resource environments. As static systems, LLMs that do not incorporate real-time updates may become outdated or misleading, further underscoring the need for continuous monitoring and recalibration (17).

The limitations observed in this study align with the broader literature on LLM performance in Medicine. Recent evaluations in pediatric urology have demonstrated that while ChatGPT offers general knowledge, its responses can be incomplete, ambiguous, and even misleading, particularly when compared to formal guidelines for conditions like phimosis (18). While ChatGPT generally outperforms models such as Gemini and Bard in urologic domains, all tend to simplify or misrepresent complex clinical cases (4, 6). Moreover, our comparison involved the premium version of ChatGPT and the free version of DeepSeek, which may have influenced the overall performance gap. Another limitation is the potential language bias, as both AI models were evaluated exclusively in English. DeepSeek's performance may differ – and possibly improve – in Chinese, the language in which it was originally developed and likely optimized. Future studies should explore multilingual evaluations, particularly in Chinese, to more accurately assess each model's full capabilities and real-world applicability across diverse linguistic contexts.

Another significant limitation is the relatively small sample size, as only 20 questions were used to assess model performance. Although the questions were carefully developed based on established clinical guidelines and reviewed by experienced urologists, this limited number may not fully represent the wide spectrum of real-world clinical decision-making scenarios. Small differences in interpretation or phrasing could disproportionately influence the results.

CONCLUSIONS

While ChatGPT demonstrates superior performance over DeepSeek in the context of PPUI, both models currently lack the ability to autonomously guide clinical decision-making. Our findings support the view that AI can effectively augment — but not replace — physician expertise in Urology, highlighting the need for transparent, continuously updated, and supervised integration of LLMs into clinical workflows to ensure their safe and effective use.

APPENDIX

QUESTIONS WITH ANSWERS (Feb 3rd, 2025)

REFERENCES

  1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40. doi:10.1038/s41591-023-02448-8
  2. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. doi:10.3389/frai.2023.1169595
  3. Moons P, Van Bulck L. ChatGPT: can artificial intelligence language models be of value for cardiovascular nurses and allied health professionals. Eur J Cardiovasc Nurs. 2023;22:e55–e59. doi:10.1093/eurjcn/zvad022
  4. Haver HL, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Appropriateness of Breast Cancer Prevention and Screening Recommendations Provided by ChatGPT. Radiology. 2023;307:e230424. doi:10.1148/radiol.230424
  5. Pinto VBP, Azevedo MF de, Wroclawski ML, Gentile G, Jesus VLM, de Bessa Junior JB, et al. Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence. Neurourol Urodyn. 2024;43:935–41. doi:10.1002/nau.25442
  6. Seth I, Marcaccini G, Lim K, Castrechini M, Cuomo R, Ng SKH, et al. Management of Dupuytren's Disease: a multi-centric comparative analysis between experienced hand surgeons versus artificial intelligence. Diagnostics (Basel). 2025;15:587. doi:10.3390/diagnostics15050587
  7. Normile D. Chinese firm's large language model makes a splash. Science. 2025;387(6731):238. doi:10.1126/science.adv9836
  8. Braga AVNM, Nunes NC, Santos EN, Veiga ML, Braga AANM, Abreu GE de, et al. Use of ChatGPT in Urology and its Relevance in Clinical Practice: Is it useful? Int Braz J Urol. 2024;50:192–8. doi:10.1590/S1677-5538.IBJU.2023.0570
  9. Fattah FH, Salih AM, Salih AM, Asaad SK, Ghafour AK, Bapir R, et al. Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: a scoping review. Front Digit Health. 2025;7:1482712. doi:10.3389/fdgth.2025.1482712
  10. Alhur A. Redefining healthcare with artificial intelligence (AI): the contributions of ChatGPT, Gemini, and Co-pilot. Cureus. 2024;16:e57795. doi:10.7759/cureus.57795
  11. Hakozaki K, Takeda T, Yasumizu Y, Tanaka N, Matsumoto K, Morita S, et al. Predictors of urinary function recovery after laparoscopic and robot-assisted radical prostatectomy. Int Braz J Urol. 2023;49:50–60. doi:10.1590/S1677-5538.IBJU.2022.0362
  12. Sandhu JS, Breyer B, Comiter C, Eastham JA, Gomez C, Kirages DJ, et al. Incontinence after prostate treatment: AUA/SUFU guideline. J Urol. 2019;202:369–78. doi:10.1097/JU.0000000000000314
  13. Banerjee A, Chatterjee M, Goyal K, Sarangi PK. Performance of ChatGPT-3.5 and ChatGPT-4 in solving questions based on core concepts in cardiovascular physiology. Cureus. 2025;17:e43314. doi:10.7759/cureus.43314
  14. Shah YB, Ghosh A, Hochberg AR, Rapoport E, Lallas CD, Shah MS, et al. Comparison of ChatGPT and traditional patient education materials for men's health. Urol Pract. 2024;11:87–94. doi:10.1097/UPJ.0000000000000490
  15. Sarangi PK, Mondal H. Response generated by large language models depends on the structure of the prompt. Indian J Radiol Imaging. 2024;34:574–5. doi:10.1055/s-0044-1782165
  16. Eckrich J, Ellinger J, Cox A, Stein J, Ritter M, Blaikie A, et al. Urology consultants versus large language models: potentials and hazards for medical advice in urology. BJUI Compass. 2024;5:552–8. doi:10.1002/bco2.359
  17. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233–9. doi:10.1056/NEJMsr2214184
  18. Salvador Junior ES, Santos CS, Holanda VJO, Corrêa BM, Favorito LA. Can ChatGPT provide reliable technical medical information about phimosis? Int Braz J Urol. 2024;50:651–4. doi:10.1590/S1677-5538.IBJU.2024.9913

Publication Dates

  • Publication in this collection
    10 Nov 2025
  • Date of issue
    Nov-Dec 2025

History

  • Received
    21 June 2025
  • Accepted
    30 June 2025
  • Published
    10 Aug 2025