MattStammers committed
Commit 067c9fa
1 Parent(s): 4b17270

readme improved

Files changed (1):
  app.py +79 -9
app.py CHANGED
@@ -1,3 +1,5 @@
  import re
  from pathlib import Path
 
@@ -8,7 +10,17 @@ import seaborn as sns
  import yaml
 
  import pteredactyl as pt
- from pteredactyl.defaults import change_model # Ensure this import is correct
 
  sample_text = """
  1. Dr. Huntington (Patient No: 1234567890) diagnosed Ms. Alzheimer with Alzheimer's disease during her last visit to the Huntington Medical Center on 12/12/2023. The prognosis was grim, but Dr. Huntington assured Ms. Alzheimer that the facility was well-equipped to handle her condition despite the lack of a cure for Alzheimer's.
@@ -68,6 +80,9 @@ def redact(text: str, model_name: str):
 
      model_path = model_paths.get(model_name, "StanfordAIMI/stanford-deidentifier-base")
 
      if model_path:
          change_model(model_path)
      else:
@@ -256,6 +271,12 @@ def redact_and_visualize(text: str, model_name: str):
          reference_text, redacted_text
      )
 
      # Count entities and compute metrics
      tp_count, fn_count, fp_count, tn_count = count_entities_and_compute_metrics(
          reference_text_with_fn, redacted_text_with_fp
@@ -282,21 +303,70 @@ def redact_and_visualize(text: str, model_name: str):
 
 
  hint = """
- # Guide/Instructions
 
- ## How the tool works:
 
- When the input text is entered, the tool redacts the entered text with labelled masking tokens and then assesses the models results. You can test the text against different models by selecting from the dropdown.
 
  ### Strengths
- - The Stanford De-Identifier Base Model is 99% accurate on our test set of radiology reports. The others are really to illustrate its superiority.
 
- - This test set here was derived after lots of experimentation to make the challenge as hard as possible. It is the toughest PII benchmark we have seen so far.
 
  ### Limitations
- - The tool was not designed initially to redact clinic letters as it was developed primarily on radiology reports in the US. We have made some augmentations to cover postcodes but these might not always work.
 
- - It may overly aggressively redact text because it was built as a research tool where precision is prized > recall but the recall is also high.
  """
 
  description = """
@@ -304,7 +374,7 @@ description = """
 
  *Version:* **1.0** - Working Proof of Concept Demo with API option and webapp demonstration.
 
- *Authors:* **Cai Davis, Michael George, Matt Stammers**
  """
 
  iface = gr.Interface(
 
+ import logging
+ import logging.config
  import re
  from pathlib import Path
 
 
  import yaml
 
  import pteredactyl as pt
+ from pteredactyl.defaults import change_model
+
+ # Logging configuration. This is only done at root level
+ logging_config = yaml.safe_load(Path("logging.yaml").read_text())
+ logging.config.dictConfig(logging_config)
+
+ # Get the logger
+ log = logging.getLogger(__name__)
+
+ # Log that the app is starting
+ log.info("Starting App")
 
  sample_text = """
  1. Dr. Huntington (Patient No: 1234567890) diagnosed Ms. Alzheimer with Alzheimer's disease during her last visit to the Huntington Medical Center on 12/12/2023. The prognosis was grim, but Dr. Huntington assured Ms. Alzheimer that the facility was well-equipped to handle her condition despite the lack of a cure for Alzheimer's.
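The hunk above reads `logging.yaml` with `yaml.safe_load` and passes the result to `logging.config.dictConfig`. The YAML file itself is not part of this diff, so the following is only a sketch of an equivalent configuration, written as the Python dict that `yaml.safe_load` would produce; the handler, formatter, and level choices are assumptions, not the repo's actual settings.

```python
import logging
import logging.config

# Illustrative stand-in for the contents of logging.yaml after yaml.safe_load.
# The real file in the repository may define different handlers and levels.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "standard": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "standard",
            "level": "DEBUG",
        },
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING_CONFIG)
log = logging.getLogger("app")
log.info("Starting App")  # emitted, since INFO meets the root level
```

With the root level at INFO, the `log.debug` calls added later in this diff are suppressed unless the level is lowered to DEBUG.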
 
 
      model_path = model_paths.get(model_name, "StanfordAIMI/stanford-deidentifier-base")
 
+     # Log the model being changed to
+     log.info(f"Changing to model: {model_path}")
+
      if model_path:
          change_model(model_path)
      else:
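The `model_paths` dict consulted above is defined earlier in app.py and not shown in this hunk. A minimal sketch of the lookup pattern follows; only the Stanford model id is confirmed by the diff, and the other dictionary entry and the helper name are purely illustrative.

```python
# Hypothetical mapping from dropdown labels to Hugging Face model ids.
# Only the Stanford id appears in the diff; the second entry is made up.
model_paths = {
    "Stanford Base De-Identifier": "StanfordAIMI/stanford-deidentifier-base",
    "Deberta PII": "example-org/deberta-pii",
}

def resolve_model(model_name: str) -> str:
    # dict.get falls back to the Stanford model for unrecognised labels,
    # so the result is always truthy and the `else` branch is rarely hit.
    return model_paths.get(model_name, "StanfordAIMI/stanford-deidentifier-base")
```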
 
          reference_text, redacted_text
      )
 
+     # Log the final texts with flags for debugging
+     log.debug("Final Reference Text with False Negatives:")
+     log.debug(reference_text_with_fn)
+     log.debug("\nFinal Redacted Text with False Positives:")
+     log.debug(redacted_text_with_fp)
+
      # Count entities and compute metrics
      tp_count, fn_count, fp_count, tn_count = count_entities_and_compute_metrics(
          reference_text_with_fn, redacted_text_with_fp
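`count_entities_and_compute_metrics` is defined elsewhere in app.py and not shown in this diff, but the metric arithmetic behind the results table later in this commit can be sketched from the four confusion-matrix counts it returns. The function name and example counts below are illustrative, not the app's actual helper or data.

```python
def confusion_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics, guarding zero denominators."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# e.g. 90 entities caught, 6 missed, 9 false alarms, 895 tokens left alone
m = confusion_metrics(tp=90, fn=6, fp=9, tn=895)
```

Note how accuracy can look flattering when true negatives dominate, which is why the app also reports precision, recall, and F1.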
 
 
 
  hint = """
+ ## Pteredactyl Gradio Webapp and API
+
+ Clinical patient identifiable information (cPII) presents a significant challenge in natural language processing (NLP) that has yet to be fully resolved, but significant progress is being made [1,2].
+
+ This is why we created [Pteredactyl](https://pypi.org/project/pteredactyl/) - a Python module to help with the redaction of clinical free text.
+
+ ## Tool Usage Instructions
+
+ When input text is entered, the tool redacts the cPII from it using NLP with labelled masking tokens and then assesses the model's results. You can test the text against different models by selecting from the dropdown.
+
+ ## Deployment Options
+
+ This webapp is available online as a Gradio app on Hugging Face: [Huggingface Gradio App](https://huggingface.co/spaces/MattStammers/pteredactyl_PII). It is also available as [source](https://github.com/SETT-Centre-Data-and-AI/PteRedactyl) or as a Docker image: [Docker Image](https://registry.hub.docker.com/r/mattstammers/pteredactyl). All are MIT licensed.
+
+ Please note that the Docker image binds to port 7860. The image can also be built and run from source using the following commands:
 
+ ```bash
+ docker build -t pteredactyl:latest .
+ docker run -d -p 7860:7860 --name pteredactyl-app pteredactyl:latest
+ ```
 
+ ## Information
+
+ A lot of work and experimentation has gone into the development of this tool. Because we believe in being fully transparent, further details are given below.
+
+ ### Methods
+
+ We evaluated three open-source models from Huggingface: Stanford Base De-Identifier, Deberta PII, and Nikhilrk De-Identify, using our Clinical_PII_Redaction_Test dataset. The text was tokenised, and all entities such as [PERSON], [ID], and [LOCATION] were tagged in the gold standard. Each model redacted cPII from clinical texts, and outputs were compared to the gold-standard template to calculate the confusion matrix, accuracy, precision, recall, and F1 score.
+
+ ### Results
+
+ The full results of the tool are given in <i>Table 1</i> below.
+
+ | Metric    | Stanford Base De-Identifier | Deberta PII | Nikhilrk De-Identify |
+ |-----------|-----------------------------|-------------|----------------------|
+ | Accuracy  | 0.98                        | 0.85        | 0.68                 |
+ | Precision | 0.91                        | 0.93        | 0.28                 |
+ | Recall    | 0.94                        | 0.16        | 0.49                 |
+ | F1 Score  | 0.93                        | 0.28        | 0.36                 |
+ <small><i>Table 1: Summary of Model Performance Metrics</i></small>
 
  ### Strengths
+ - The test benchmark [Clinical_PII_Redaction_Test](https://huggingface.co/datasets/MattStammers/Clinical_PII_Redaction_Test) intentionally exploits commonly observed weaknesses in NLP cPII token-masking systems, such as clinician/patient/diagnosis name similarity and common ID/username and location/postcode issues.
+
+ - [The Stanford De-Identifier Base Model](https://huggingface.co/StanfordAIMI/stanford-deidentifier-base)[1] is 99% accurate on our test set of radiology reports and achieves an F1 score of 93% on our challenging open-source benchmark. The other models are included mainly to demonstrate the potential of Pteredactyl to deploy any transformer model.
 
+ - We have submitted the code to [OHDSI](https://www.ohdsi.org/) as an abstract and aim to incorporate this into a wider open-source effort to solve intractable clinical informatics problems.
 
  ### Limitations
+ - The tool was not designed initially to redact clinic letters, as it was developed primarily on radiology reports in the US. We have made some augmentations to cover elements like postcodes using checksums, but these might not always work. The same is true of NHS numbers, as illustrated above.
+
+ - It may redact text overly aggressively because it was built as a research tool in which precision is prized over recall. However, in our experience this is uncommon enough that it is still very useful.
+
+ - This is very much a research tool and should not be relied upon as a catch-all in any production-type capacity. The app makes the limitations transparently obvious via the attached confusion matrix.
+
+ ### Conclusion
+ The validation cohort introduced in this study proves to be a highly effective tool for discriminating the performance of open-source cPII redaction models. Intentionally exploiting common weaknesses in cNLP token-masking systems offers a more rigorous cPII benchmark than many larger datasets provide.
+
+ We invite the open-source community to collaborate with us to improve the present results and enhance the robustness of cPII redaction methods by building on the work we have begun [here](https://github.com/SETT-Centre-Data-and-AI/PteRedactyl).
+
+ ### References
+ 1. Chambon PJ, Wu C, Steinkamp JM, Adleberg J, Cook TS, Langlotz CP. Automated deidentification of radiology reports combining transformer and "hide in plain sight" rule-based methods. J Am Med Inform Assoc. 2023 Feb 1;30(2):318–28.
+ 2. Kotevski DP, Smee RI, Field M, Nemes YN, Broadley K, Vajdic CM. Evaluation of an automated Presidio anonymisation model for unstructured radiation oncology electronic medical records in an Australian setting. Int J Med Inf. 2022 Dec 1;168:104880.
  """
 
  description = """
 
  *Version:* **1.0** - Working Proof of Concept Demo with API option and webapp demonstration.
 
+ *Authors:* **Matt Stammers🧪, Cai Davis🥼 and Michael George🩺**
  """
 
  iface = gr.Interface(