Automated extraction of information from free text of Spanish oncology pathology reports

Juan Sebastian Moreno; Juan Carlos Bravo-Ocaña; Alvaro José Riascos; Angela Regina Zambrano; Diana Marcela Mendoza-Urbano; Johan Felipe Garcia; Sergio I Prada

https://doi.org/10.25100/cm.v54i1.5300

Published: 2023-05-16

Keywords:

Artificial intelligence, linguistics, biomarkers, natural language processing, pathology reports, deep learning, information extraction, multitask learning, free-form text

Received 2022-06-11
Accepted 2023-03-01

Updated: 2023-05-16

Issue: Vol. 54 No. 1 (2023)

Section: Original Articles

Authors

Juan Sebastian Moreno, MSc Quantil SAS. Bogotá, Colombia. Centro de Analítica para Políticas Públicas. Bogotá, Colombia

Juan Carlos Bravo-Ocaña, MD Fundación Valle del Lili; Departamento de Patología, Cali, Colombia

Alvaro José Riascos, PhD Quantil SAS. Bogotá, Colombia. Centro de Analítica para Políticas Públicas. Bogotá, Colombia. Universidad de los Andes, Facultad de Economía. Bogotá, Colombia

Angela Regina Zambrano, MD Fundación Valle del Lili; Departamento de Hemato-Oncología, Cali, Colombia

Diana Marcela Mendoza-Urbano, MD Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia

Johan Felipe Garcia, PhD Quantil SAS. Bogotá, Colombia

Sergio I Prada, MPA, PhD Fundación Valle del Lili, Centro de Investigaciones Clínicas, Cali, Colombia. Universidad Icesi, Centro PROESA, Cali, Colombia

Abstract

Background:
Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text, with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient, high-quality format is essential to implementing and establishing a hospital-based cancer registry.

Objective:
This study describes the implementation of a natural language processing algorithm that extracts information from oncology pathology reports.

Methods:
An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on the successive matching of regular expressions.
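As a rough illustration of what successive regular-expression matching can look like, the sketch below tries an ordered list of patterns against a report and returns the first captured value. The patterns, the `extract_first` helper, and the sample report text are hypothetical and are not the rules or data used in the study.

```python
import re

# Hypothetical patterns, tried in order; the first one that matches wins.
# They are illustrative only and are NOT the rules used in the study.
TOPOGRAPHY_PATTERNS = [
    re.compile(r"localizaci[oó]n\s*:\s*(?P<value>[^.\n]+)", re.IGNORECASE),
    re.compile(r"biopsia\s+de\s+(?P<value>[^.,\n]+)", re.IGNORECASE),
]


def extract_first(patterns, text):
    """Return the value captured by the first matching pattern, or None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group("value").strip()
    return None


# Made-up report fragment, used only to exercise the sketch.
report = "BIOPSIA DE MAMA DERECHA, cuadrante superior. Carcinoma ductal infiltrante."
topography = extract_first(TOPOGRAPHY_PATTERNS, report)
```

Ordering the patterns from most to least specific lets an explicit field label take precedence over looser phrasing, and a `None` result flags reports that need manual review.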

Results:
Validation was performed on 140 pathology reports. Topography was identified in all reports by both the human reviewers and the algorithm. Morphology was identified by the human reviewers in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for topography and 89.5 for morphology.
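The abstract does not say which fuzzy matching metric was used. As a minimal sketch, assuming a 0-100 string similarity scale, the example below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in and averages the score over pairs of human and algorithm extractions. The function names and the sample pairs are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(human, machine):
    """Similarity of two strings on a 0-100 scale, ignoring case and outer spaces.

    Assumption: SequenceMatcher.ratio() stands in for the study's metric,
    which the abstract does not specify.
    """
    a = human.lower().strip()
    b = machine.lower().strip()
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def average_score(pairs):
    """Mean fuzzy score over (human, machine) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    scores = [fuzzy_score(h, m) for h, m in pairs]
    return sum(scores) / len(scores)


# Made-up pairs, used only to exercise the sketch.
pairs = [("Mama", "mama"), ("abc", "xyz")]
mean_score = average_score(pairs)
```

A score of 100 means the two extractions are identical after normalization, and lower averages indicate that the algorithm's text diverges more from the human annotation.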

Conclusion:
A preliminary validation of the algorithm against human extraction was performed on a small set of reports, with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.

Author Biographies

Juan Sebastian Moreno, MSc, Quantil SAS. Bogotá, Colombia. Centro de Analítica para Políticas Públicas. Bogotá, Colombia

https://orcid.org/0009-0004-3487-0458

Juan Carlos Bravo-Ocaña, MD, Fundación Valle del Lili; Departamento de Patología, Cali, Colombia

https://orcid.org/0000-0003-3880-0751

Alvaro José Riascos, PhD, Quantil SAS. Bogotá, Colombia. Centro de Analítica para Políticas Públicas. Bogotá, Colombia. Universidad de los Andes, Facultad de Economía. Bogotá, Colombia

https://orcid.org/0000-0002-6325-5559

Angela Regina Zambrano, MD, Fundación Valle del Lili; Departamento de Hemato-Oncología, Cali, Colombia

https://orcid.org/0000-0003-0846-8129

Diana Marcela Mendoza-Urbano, MD, Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia

https://orcid.org/0000-0002-8642-8272

Johan Felipe Garcia, PhD, Quantil SAS. Bogotá, Colombia

https://orcid.org/0000-0002-6126-702X

Sergio I Prada, MPA, PhD, Fundación Valle del Lili, Centro de Investigaciones Clínicas, Cali, Colombia. Universidad Icesi, Centro PROESA, Cali, Colombia

https://orcid.org/0000-0001-7986-0959

How to Cite

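Moreno, J. S., Bravo-Ocaña, J. C., Riascos, A. J., Zambrano, A. R., Mendoza-Urbano, D. M., Garcia, J. F., & Prada, S. I. (2023). Automated extraction of information from free text of Spanish oncology pathology reports. Colombia Médica, 54(1), e2035300. https://doi.org/10.25100/cm.v54i1.5300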


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

The copyright of the articles published in Colombia Médica belongs to the Universidad del Valle. The contents of the articles that appear in the Journal are exclusively the responsibility of the authors and do not necessarily reflect the opinions of the Editorial Committee of the Journal. Material published in Colombia Médica may be reproduced without prior authorization for non-commercial use.
