Title:
pyDeid: an improved, fast, flexible, and generalizable rule-based approach for deidentification of free-text medical records.
Authors:
Sundrelingam V; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada.
Parimoo S; Centre for Computational Medicine, Hospital for Sick Children, Toronto, ON M5G 0A4, Canada.
Pogacar F; MAP Centre for Urban Health Solutions, St Michael's Hospital, Toronto, ON M5B 1T8, Canada.
Koppula R; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada.
Shin S; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada.
Pou-Prom C; Data Science and Advanced Analytics, Unity Health Toronto, Toronto, ON M5C 2T2, Canada.
Roberts SB; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada; Institute of Health Policy, Management and Evaluation, University of Toronto, ON M5T 3M6, Canada.
Verma AA; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada; Institute of Health Policy, Management and Evaluation, University of Toronto, ON M5T 3M6, Canada; Department of Medicine, University of Toronto, Toronto, ON M5S 3H2, Canada; Division of General Internal Medicine, Unity Health, Toronto, ON M5B 1W8, Canada.
Razak F; Li Ka Shing Knowledge Institute, St Michael's Hospital, Toronto, ON M5B 1T8, Canada; Institute of Health Policy, Management and Evaluation, University of Toronto, ON M5T 3M6, Canada; Department of Medicine, University of Toronto, Toronto, ON M5S 3H2, Canada; Division of General Internal Medicine, Unity Health, Toronto, ON M5B 1W8, Canada.
Source:
JAMIA open [JAMIA Open] 2025 Jan 22; Vol. 8 (1), pp. ooae152. Date of Electronic Publication: 2025 Jan 22 (Print Publication: 2025).
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Oxford University Press on behalf of the American Medical Informatics Association
Country of Publication: United States
NLM ID: 101730643
Publication Model: eCollection
Cited Medium: Internet
ISSN: 2574-2531 (Electronic)
Linking ISSN: 25742531
NLM ISO Abbreviation: JAMIA Open
Subsets: PubMed not MEDLINE
Imprint Name(s):
Original Publication: [Cary, NC] : Oxford University Press on behalf of the American Medical Informatics Association, [2018]-
Contributed Indexing:
Keywords: deidentification; personal health information; privacy; software validation
Entry Date(s):
Date Created: 20250123
Latest Revision: 20250206
Update Code:
20250206
PubMed Central ID:
PMC11752853
DOI:
10.1093/jamiaopen/ooae152
PMID:
39845288
Database:
MEDLINE

Further Information

Objectives: Deidentification of personally identifiable information in free-text clinical data is fundamental to making these data broadly available for research. However, existing deidentification tools have gaps in functionality and flexibility, and they make suboptimal tradeoffs between deidentification accuracy and speed. To address these gaps and tradeoffs, we developed pyDeid, a new Python-based deidentification software package.
Materials and Methods: pyDeid uses a combination of regular expression-based rules, fixed exclusion lists, and inclusion lists to deidentify free-text data. Additional configurations of pyDeid include optional named entity recognition and custom name lists. We measured its deidentification performance and speed on 700 admission notes from a Canadian hospital, the publicly available n2c2 benchmark dataset of American discharge notes, and a synthetic dataset of artificial intelligence (AI)-generated admission notes. We also compared its performance with the PhysioNet De-identification Software and the popular open-source Philter tool.
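The rule-plus-list approach described above can be illustrated with a minimal sketch. This is not pyDeid's actual API or rule set: the patterns, the list contents, and the function name `deidentify` are all hypothetical, chosen only to show how regular expression rules, an inclusion list, and an exclusion list might interact.

```python
import re

# Hypothetical pattern rules (illustrative only, not pyDeid's real patterns).
RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

# Hypothetical inclusion list: lowercase tokens treated as possible names.
NAME_INCLUSION = {"jane", "smith", "parkinson"}

# Hypothetical exclusion list: tokens that are never redacted,
# e.g. eponymous medical terms that collide with surnames.
EXCLUSION = {"parkinson"}

WORD = re.compile(r"[A-Za-z]+")


def deidentify(text: str) -> str:
    """Replace rule matches and listed names with category placeholders."""
    # Step 1: apply each regular expression rule in turn.
    for label, pattern in RULES.items():
        text = pattern.sub(f"<{label}>", text)

    # Step 2: redact words on the inclusion list unless the exclusion list protects them.
    def replace_name(match: re.Match) -> str:
        token = match.group(0)
        key = token.lower()
        if key in NAME_INCLUSION and key not in EXCLUSION:
            return "<NAME>"
        return token

    return WORD.sub(replace_name, text)


note = "Jane Smith seen 2024-03-05 for Parkinson disease, call 416-555-0199."
print(deidentify(note))
```

In this sketch the exclusion list takes priority over the inclusion list, which favors precision (fewer clinical terms wrongly redacted) at some cost to recall; whether pyDeid resolves such conflicts the same way is not stated in the abstract.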
Results: Different configurations of pyDeid outperformed other tools on various metrics, with a best accuracy of 0.988, best precision of 0.889, best recall of 0.950, and best F1 score of 0.904. All configurations of pyDeid were significantly faster than Philter and the PhysioNet De-identification Software, with the fastest configuration deidentifying at 0.48 s per note.
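For readers less familiar with these metrics, the sketch below shows the standard definitions of accuracy, precision, recall, and F1 computed from confusion-matrix counts. The counts are invented for illustration and are not data from the study, and the function name `deid_metrics` is hypothetical. Because the reported values are each the best across different configurations, they need not satisfy the F1 formula jointly.

```python
def deid_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts.

    tp: identifiers correctly redacted
    fp: non-identifiers wrongly redacted
    fn: identifiers missed
    tn: non-identifiers correctly left alone
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented token-level counts, for illustration only.
metrics = deid_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy tends to be high in deidentification tasks because most tokens are not identifiers, which is why precision, recall, and F1 are usually reported alongside it.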
Discussion and Conclusions: pyDeid gives users the flexibility to trade off performance against speed, and precision against recall, while addressing some of the gaps in functionality left by other tools. pyDeid is also generalizable to domains outside of clinical data and can be further customized for specific contexts or particular workflows.
(© The Author(s) 2025. Published by Oxford University Press on behalf of the American Medical Informatics Association.)

Fahad Razak and Amol A. Verma are part-time employees of Ontario Health (Provincial Clinical Leads) beyond the scope of this work. Other authors have no competing interests to declare.