Organizations across industries face compliance challenges when handling documents that contain personally identifiable information (PII) and other sensitive data. Whether it’s healthcare PHI under HIPAA, payment card data under PCI DSS, or legal documents with confidential information, manual redaction is time-consuming, error-prone, and costly. VLM Run’s document redaction capability automatically detects and redacts sensitive information in these documents. For healthcare specifically, it follows HIPAA’s Safe Harbor method and detects all 18 types of PHI for de-identification, while preserving document readability and structure.

Figure: Example of document redaction applied to a medical form (original document with visible PII, and the redacted document with PII blurred).

Here’s a step-by-step guide on how to redact sensitive information from documents:
1. Upload Document

Use the /v1/files endpoint to upload the document containing sensitive information that you want to redact.
from vlmrun.client import VLMRun
from vlmrun.client.types import FileResponse
from pathlib import Path

# Initialize the client
client = VLMRun(api_key="<your-api-key>")

# Upload the file
response: FileResponse = client.files.upload(
    file=Path("<path/to/sensitive_document.pdf>")
)
print(f"Uploaded file:\n {response.model_dump()}")
You should see a response like this:
Uploaded file:
{
  'id': '1e76cfd9-ba99-49b2-a8fe-2c8efaad2649',
  'filename': 'file-20240815-7UvOUQ-sensitive_document.pdf',
  'bytes': 62430,
  'purpose': 'assistants',
  'created_at': '2024-08-15T02:22:06.716130',
  'object': 'file'
}
2. Submit the Document Redaction Job

Submit the uploaded file to the /v1/document/execute endpoint to start the document redaction job using the healthcare/phi-redaction agent.
from vlmrun.client.types import PredictionResponse

# Submit the document for redaction
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=[response.id],  # Use the file ID from the upload step
    batch=True
)
print(f"Document redaction job submitted:\n {response.model_dump()}")
You should see a response like this:
Document redaction job submitted:
{
  "id": "052cf2a8-2b84-45f5-a385-ccac2aae13bb",
  "created_at": "2024-08-15T02:22:09.157788",
  "response": null,
  "status": "pending"
}
3. Wait for the Job to Complete

You can now wait for the job to complete by calling the predictions.wait method:
# Wait for the job to complete
response: PredictionResponse = client.predictions.wait(
    id=response.id,
    timeout=120,
)
print(f"Job completed:\n {response.model_dump()}")
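
Conceptually, waiting for a job is a polling loop. The sketch below is not the SDK’s implementation of predictions.wait; it is a minimal illustration that assumes a hypothetical fetch_status callable returning a dict with a status field like the responses shown above. The names poll_until_complete, interval, and sleep are illustrative only.

```python
import time
from typing import Any, Callable, Dict


def poll_until_complete(
    fetch_status: Callable[[], Dict[str, Any]],  # hypothetical: returns the latest job payload
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status() until it reports "completed" or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") == "completed":
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still {payload.get('status')!r} after {timeout} seconds"
            )
        # Pause between checks so the API is not queried in a tight loop
        sleep(interval)
```

In practice, fetch_status would wrap whatever call retrieves the prediction by ID. The predictions.wait method shown above already does this for you, so the sketch is mainly useful for understanding the timeout behavior or for adding your own logging between checks.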

Industry Use Cases

VLM Run’s document redaction capability serves multiple industries with specific compliance requirements:

Healthcare - PHI Redaction

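For healthcare organizations, VLM Run follows HIPAA’s Safe Harbor method and automatically detects and redacts all 18 types of Protected Health Information (PHI):
  • Names
  • Geographic subdivisions smaller than a state
  • Date elements (other than year) directly related to an individual
  • Telephone numbers
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate and license numbers
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers and serial numbers
  • Web URLs
  • IP addresses
  • Biometric identifiers, including finger and voice prints
  • Full-face photographs and comparable images
  • Any other unique identifying number, characteristic, or code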

Other Industries

Legal Documents
  • Attorney-client privileged information
  • Case numbers and court identifiers
  • Witness names and contact information
  • Settlement amounts and financial details
Financial Services
  • Account numbers and routing numbers
  • Credit card and payment information
  • Social Security Numbers
  • Investment account details
Human Resources
  • Employee personal information
  • Salary and compensation data
  • Performance review details
  • Background check information
Real Estate
  • Property addresses and legal descriptions
  • Purchase prices and financial terms
  • Personal information of buyers/sellers
  • Loan and mortgage details
Insurance
  • Policy numbers and claim identifiers
  • Personal health information
  • Financial and asset information
  • Beneficiary details

Example Response

Here’s an example of the structured response you’ll receive after document redaction is complete:
{
  "id": "052cf2a8-2b84-45f5-a385-ccac2aae13bb",
  "created_at": "2024-08-15T02:22:09.157788",
  "status": "completed",
  "response": {
    "redacted_uri": "https://storage.googleapis.com/vlm-userdata/agents/healthcare/phi-redaction/redacted-sensitive_document-20240815-a1b2c3d4.pdf",
    "redacted_items": [
      {
        "phi_type": "name"
      },
      {
        "phi_type": "date_elements"
      },
      {
        "phi_type": "telephone_number"
      },
      {
        "phi_type": "email_address"
      },
      {
        "phi_type": "medical_record_number"
      },
      {
        "phi_type": "geographic_subdivision"
      }
    ]
  }
}
The response includes:
  • redacted_uri: A secure, time-limited URL to download the redacted document with all sensitive information visually obscured
  • redacted_items: A list of all sensitive information types that were detected and redacted in the document
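
For audit or review purposes, it can help to summarize which PHI types were redacted. Here’s a minimal sketch that works on a payload shaped like the example response above; summarize_redactions and missing_phi_types are hypothetical helper names, not part of the SDK, and the example payload is made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Set


def summarize_redactions(prediction: Dict[str, Any]) -> Counter:
    """Count redacted items per phi_type in a prediction payload."""
    # "response" is null while the job is pending, so fall back to an empty dict
    response = prediction.get("response") or {}
    return Counter(item["phi_type"] for item in response.get("redacted_items", []))


def missing_phi_types(prediction: Dict[str, Any], expected: Iterable[str]) -> Set[str]:
    """Return the expected phi_type values that do not appear in the payload."""
    return set(expected) - set(summarize_redactions(prediction))


# Illustrative payload following the structure of the example response
example = {
    "status": "completed",
    "response": {
        "redacted_uri": "https://example.com/redacted.pdf",
        "redacted_items": [
            {"phi_type": "name"},
            {"phi_type": "date_elements"},
            {"phi_type": "name"},
        ],
    },
}

print(summarize_redactions(example))  # Counter({'name': 2, 'date_elements': 1})
print(missing_phi_types(example, ["name", "email_address"]))  # {'email_address'}
```

A missing type does not necessarily indicate a problem, since the document may simply not contain that kind of information. It is a prompt for manual review.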

Complete Example

Here’s a complete example showing how to redact sensitive information from a document:
from vlmrun.client import VLMRun
from vlmrun.client.types import PredictionResponse, FileResponse
from pathlib import Path

# Initialize the client
client = VLMRun(api_key="<your-api-key>")

# Upload the document
file_response: FileResponse = client.files.upload(
    file=Path("path/to/sensitive_document.pdf")
)

# Submit for redaction
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=[file_response.id],
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=120)

# Access the redacted document
redacted_uri = completed_response.response["redacted_uri"]
redacted_items = completed_response.response["redacted_items"]

print(f"Redacted document available at: {redacted_uri}")
print(f"Sensitive information types redacted: {[item['phi_type'] for item in redacted_items]}")
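
Because the redacted_uri is time-limited, you will likely want to save the redacted document soon after the job completes. The sketch below uses only the Python standard library and assumes the URL can be fetched with a plain HTTP GET and no extra headers; local_name_for and download_redacted are hypothetical helper names.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_name_for(redacted_uri: str, default: str = "redacted_document.pdf") -> str:
    """Derive a local filename from the URI path, ignoring any query string."""
    name = Path(unquote(urlparse(redacted_uri).path)).name
    return name or default


def download_redacted(redacted_uri: str, out_dir: Path = Path(".")) -> Path:
    """Stream the redacted document into out_dir and return the saved path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / local_name_for(redacted_uri)
    # Assumption: the URL needs no authentication beyond what is embedded in it
    with urllib.request.urlopen(redacted_uri, timeout=60) as resp, open(target, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return target
```

With the complete example above, this would be called as download_redacted(redacted_uri, Path("redacted")).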

Using URLs Instead of File Upload

You can also process documents directly from URLs without uploading them first:
# Process a document from URL
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    urls=["https://example.com/sensitive_document.pdf"],
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=120)
print(completed_response.response)
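
Since these documents contain sensitive data, it can be worth checking URLs before submitting them. The sketch below rejects anything that is not an absolute HTTPS URL. Requiring HTTPS is a policy choice assumed here for illustration, not a documented requirement of the API, and validate_document_urls is a hypothetical helper name.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def validate_document_urls(urls: Iterable[str]) -> List[str]:
    """Return the URLs unchanged, raising ValueError for any non-HTTPS or relative URL."""
    checked: List[str] = []
    for url in urls:
        parts = urlparse(url)
        # Assumed policy: sensitive documents should only be fetched over HTTPS
        if parts.scheme != "https" or not parts.netloc:
            raise ValueError(f"Not an absolute HTTPS URL: {url!r}")
        checked.append(url)
    return checked
```

The validated list can then be passed as the urls argument shown above, for example urls=validate_document_urls(["https://example.com/sensitive_document.pdf"]).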

Batch Processing Multiple Documents

To redact sensitive information from multiple documents in a single job, upload each file and pass all of the file IDs together:
# Upload multiple documents
file_paths = [
    "path/to/document1.pdf",
    "path/to/document2.pdf",
    "path/to/document3.pdf"
]

file_ids = []
for file_path in file_paths:
    file_response = client.files.upload(file=Path(file_path))
    file_ids.append(file_response.id)

# Submit batch redaction job
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=file_ids,
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=300)

# Process results
for redacted_item in completed_response.response["redacted_items"]:
    print(f"Redacted sensitive information type: {redacted_item['phi_type']}")
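
When uploading many files, a single missing or unreadable file would otherwise abort the loop above. Here’s a minimal sketch that collects failures instead. upload_documents is a hypothetical helper, and the upload callable is injected so the logic can be exercised without network access.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def upload_documents(
    upload: Callable[[Path], str],  # takes a path, returns the uploaded file ID
    paths: Iterable[str],
) -> Tuple[List[str], Dict[str, str]]:
    """Upload each existing file; return (file_ids, failures keyed by path)."""
    file_ids: List[str] = []
    failures: Dict[str, str] = {}
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            failures[raw] = "file not found"
            continue
        try:
            file_ids.append(upload(path))
        except Exception as exc:
            # Record the error and continue so one bad file does not abort the batch
            failures[raw] = str(exc)
    return file_ids, failures
```

With the client from the earlier steps, this could be called as file_ids, failures = upload_documents(lambda p: client.files.upload(file=p).id, file_paths), and the job submitted only if failures is empty.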

Compliance and Data Security

VLM Run’s document redaction is designed with enterprise compliance at its core:
  • 🏥 Healthcare Compliance: Follows HIPAA Safe Harbor method for PHI de-identification
  • 🏛️ Legal Compliance: Supports attorney-client privilege and confidentiality requirements
  • 💰 Financial Compliance: Meets PCI DSS and financial data protection standards
  • 🔒 Secure Processing: All documents are processed in compliant infrastructure with encryption in transit and at rest
  • 🔑 Access Controls: Robust authentication and authorization mechanisms protect sensitive data
  • 📝 Audit Logging: Comprehensive audit trails for all redaction activities
  • Secure URLs: Redacted documents are provided via time-limited, secure URLs that automatically expire

Use Cases

VLM Run’s document redaction capability can be applied across various industries and scenarios:

Healthcare

  • 📋 Research Data Preparation: Automatically redact PHI from medical records before sharing with research teams
  • 🔄 Document Sharing: Safely share patient documents with external providers and insurance companies
  • 💾 Data Archival: Prepare historical medical records for long-term storage while removing PHI
  • 🏥 Quality Assurance: Create de-identified versions for training and educational purposes

Legal

  • ⚖️ Discovery Process: Redact privileged information from documents during legal discovery
  • 📄 Public Filing: Prepare court documents for public filing by removing sensitive client information
  • 🤝 Client Confidentiality: Protect attorney-client privilege in shared documents

Financial Services

  • 💳 Compliance Reporting: Remove PII from financial reports and audit documents
  • 🏦 Data Sharing: Safely share customer data with third-party vendors and partners
  • 📊 Analytics: Enable financial analytics while protecting customer privacy

Human Resources

  • 👥 Employee Records: Redact personal information from HR documents for compliance
  • 📈 Performance Analysis: Analyze workforce data while protecting employee privacy
  • 🔍 Background Checks: Prepare sanitized versions of background check reports

Supported Document Types

Document redaction works with various document formats across industries:
  • PDF Documents - Reports, contracts, legal briefs, medical records
  • Scanned Images - Faxed documents, handwritten forms, ID cards, insurance documents
  • Multi-page Documents - Complete case files, comprehensive reports, patient histories
  • Mixed Content - Documents containing both text and images, forms with signatures

Try our Document -> JSON API today

Head over to our Document -> JSON page to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.