pyDeid: an improved, fast, flexible, and generalizable rule-based approach for deidentification of free-text medical records.
Further information
Objectives: Deidentification of personally identifiable information in free-text clinical data is fundamental to making these data broadly available for research. However, extant tools leave gaps in functionality and flexibility, and they make suboptimal tradeoffs between deidentification accuracy and speed. To address these gaps and tradeoffs, we develop pyDeid, a new Python-based deidentification tool.
Materials and Methods: pyDeid deidentifies free-text data using a combination of regular expression-based rules, fixed exclusion lists, and inclusion lists. Optional configurations add named entity recognition and custom name lists. We measure its deidentification performance and speed on 700 admission notes from a Canadian hospital, the publicly available n2c2 benchmark dataset of American discharge notes, and a synthetic dataset of artificial intelligence (AI)-generated admission notes. We also compare its performance with the PhysioNet De-identification Software and the popular open-source Philter tool.
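The rule-based approach described above can be illustrated with a minimal sketch. This is not pyDeid's implementation or API: the patterns, the list contents, the placeholder tags, and the function name `deidentify` are all hypothetical, chosen only to show how regular expressions, an inclusion list, and an exclusion list might interact.

```python
import re

# Hypothetical regular-expression rules (assumptions, not pyDeid's actual rules).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

# Hypothetical inclusion list: words that should be treated as names.
NAME_INCLUSION = {"smith", "nguyen"}

# Hypothetical exclusion list: medical terms that look like names and must be kept.
EXCLUSION = {"parkinson", "foley"}

WORD = re.compile(r"\b[A-Za-z]+\b")


def deidentify(text: str) -> str:
    """Replace pattern matches and listed names with placeholder tags."""
    # Step 1: apply each regular-expression rule in turn.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)

    # Step 2: check each word against the lists; exclusion takes priority.
    def replace_name(match: re.Match) -> str:
        word = match.group(0)
        key = word.lower()
        if key in EXCLUSION:
            return word
        if key in NAME_INCLUSION:
            return "<NAME>"
        return word

    return WORD.sub(replace_name, text)


note = "Mr. Smith (Parkinson disease) seen 2021-03-04, call 416-555-0199."
print(deidentify(note))
```

In this sketch the exclusion list is checked first, so a clinical term such as "Parkinson" survives even if a name list would otherwise match it. Whether pyDeid resolves such conflicts in the same order is not stated in the abstract.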
Results: Different configurations of pyDeid outperformed the other tools on various metrics, achieving a best accuracy of 0.988, best precision of 0.889, best recall of 0.950, and best F1 score of 0.904. All configurations of pyDeid were significantly faster than Philter and the PhysioNet De-identification Software, with the fastest configuration deidentifying at 0.48 s per note.
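For readers less familiar with the reported metrics, the sketch below shows how accuracy, precision, recall, and F1 are conventionally computed from confusion counts. The function name `deid_metrics` and the example counts are hypothetical, and counting at the token level is an assumption; the abstract does not state the unit of evaluation. Because the best values above are reported across different configurations, the best F1 score need not follow from the best precision and best recall.

```python
def deid_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard classification metrics from confusion counts.

    tp: identifiers correctly removed      fp: non-identifiers wrongly removed
    fn: identifiers missed                 tn: non-identifiers correctly kept
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only; these are not results from the study.
print(deid_metrics(tp=90, fp=10, fn=10, tn=890))
```

Accuracy is typically high in deidentification because most tokens are not identifiers, which is why precision, recall, and F1 are reported alongside it.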
Discussion and Conclusions: pyDeid offers the flexibility to trade off performance against speed, and precision against recall, while addressing some of the gaps in functionality left by other tools. pyDeid also generalizes to domains outside of clinical data and can be further customized for specific contexts or workflows.
(© The Author(s) 2025. Published by Oxford University Press on behalf of the American Medical Informatics Association.)
Fahad Razak and Amol A. Verma are part-time employees of Ontario Health (Provincial Clinical Leads) beyond the scope of this work. Other authors have no competing interests to declare.