Document Redaction

Organizations across industries face compliance challenges when handling documents that contain personally identifiable information (PII) and other sensitive data. Whether it is healthcare PHI under HIPAA, financial data under PCI DSS, or legal documents with confidential information, manual redaction is time-consuming, error-prone, and costly.

VLM Run’s document redaction capability automatically detects and redacts sensitive information in documents across these industries. For healthcare specifically, it follows HIPAA’s Safe Harbor method and detects all 18 types of protected health information (PHI) for de-identification, so that documents stay compliant while keeping their readability and structure.

Figure: Original document and redacted document. Example of document redaction applied to a medical form.

Here’s a step-by-step guide on how to redact sensitive information from documents:

Upload Document

Use the /v1/files endpoint to upload the document containing sensitive information that you want to redact.

from vlmrun.client import VLMRun
from vlmrun.client.types import FileResponse
from pathlib import Path

# Initialize the client
client = VLMRun(api_key="<your-api-key>")

# Upload the file
response: FileResponse = client.files.upload(
    file=Path("<path/to/sensitive_document.pdf>")
)
print(f"Uploaded file:\n {response.model_dump()}")

You should see a response like this:

Uploaded file:
{
  'id': '1e76cfd9-ba99-49b2-a8fe-2c8efaad2649',
  'filename': 'file-20240815-7UvOUQ-sensitive_document.pdf',
  'bytes': 62430,
  'purpose': 'assistants',
  'created_at': '2024-08-15T02:22:06.716130',
  'object': 'file'
}

Submit the Document Redaction Job

Submit the uploaded file to the /v1/document/execute endpoint to start the document redaction job using the healthcare/phi-redaction agent.

from vlmrun.client.types import PredictionResponse

# Submit the document for redaction
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=[response.id],  # Use the file ID from the upload step
    batch=True
)
print(f"Document redaction job submitted:\n {response.model_dump()}")

You should see a response like this:

Document redaction job submitted:
{
  "id": "052cf2a8-2b84-45f5-a385-ccac2aae13bb",
  "created_at": "2024-08-15T02:22:09.157788",
  "response": null,
  "status": "pending"
}

Wait for the Job to Complete

You can now wait for the job to complete by calling the predictions.wait method:

# Wait for the job to complete
response: PredictionResponse = client.predictions.wait(
    id=response.id,
    timeout=120,
)
print(f"Job completed:\n {response.model_dump()}")
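For readers who want to see what waiting on a job involves, here is a minimal sketch of a polling loop with a timeout. It is not the SDK's implementation of predictions.wait. `fetch_status` is a hypothetical callable standing in for whatever call returns the job's current status, and `"failed"` is an assumed terminal status (only `"pending"` and `"completed"` appear in this guide).

```python
import time
from typing import Callable

# Assumed terminal states; only "pending" and "completed" appear in this guide.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_status(
    fetch_status: Callable[[], str],
    timeout: float = 120.0,
    poll_interval: float = 2.0,
) -> str:
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        time.sleep(poll_interval)


# Demo with a fake job that reports "pending" twice, then "completed".
_fake_statuses = iter(["pending", "pending", "completed"])
final_status = wait_for_status(
    lambda: next(_fake_statuses), timeout=5, poll_interval=0.01
)
print(final_status)
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments.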

Industry Use Cases

VLM Run’s document redaction capability serves multiple industries with specific compliance requirements:

Healthcare - PHI Redaction

For healthcare organizations, VLM Run follows HIPAA’s Safe Harbor method and automatically detects and redacts all 18 types of Protected Health Information (PHI), which fall into the following categories:

Personal Identifiers

Contact Information

Dates and Ages

Technical Identifiers

Other Industries

Legal Documents

Attorney-client privileged information
Case numbers and court identifiers
Witness names and contact information
Settlement amounts and financial details

Financial Services

Account numbers and routing numbers
Credit card and payment information
Social Security Numbers
Investment account details

Human Resources

Employee personal information
Salary and compensation data
Performance review details
Background check information

Real Estate

Property addresses and legal descriptions
Purchase prices and financial terms
Personal information of buyers/sellers
Loan and mortgage details

Insurance

Policy numbers and claim identifiers
Personal health information
Financial and asset information
Beneficiary details

Example Response

Here’s an example of the structured response you’ll receive after document redaction is complete:

{
  "id": "052cf2a8-2b84-45f5-a385-ccac2aae13bb",
  "created_at": "2024-08-15T02:22:09.157788",
  "status": "completed",
  "response": {
    "redacted_uri": "https://storage.googleapis.com/vlm-userdata/agents/healthcare/phi-redaction/redacted-sensitive_document-20240815-a1b2c3d4.pdf",
    "redacted_items": [
      {
        "phi_type": "name"
      },
      {
        "phi_type": "date_elements"
      },
      {
        "phi_type": "telephone_number"
      },
      {
        "phi_type": "email_address"
      },
      {
        "phi_type": "medical_record_number"
      },
      {
        "phi_type": "geographic_subdivision"
      }
    ]
  }
}

The response includes:

redacted_uri: A secure, time-limited URL to download the redacted document with all sensitive information visually obscured
redacted_items: A list of all sensitive information types that were detected and redacted in the document
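As a rough illustration of working with this structure, the sketch below parses a copy of the example response above and counts the redacted items by `phi_type`. The helper name `summarize_redactions` is hypothetical and not part of the SDK; the field names are taken from the example response.

```python
import json
from collections import Counter

# Copy of the example response shown above.
raw = """
{
  "id": "052cf2a8-2b84-45f5-a385-ccac2aae13bb",
  "created_at": "2024-08-15T02:22:09.157788",
  "status": "completed",
  "response": {
    "redacted_uri": "https://storage.googleapis.com/vlm-userdata/agents/healthcare/phi-redaction/redacted-sensitive_document-20240815-a1b2c3d4.pdf",
    "redacted_items": [
      {"phi_type": "name"},
      {"phi_type": "date_elements"},
      {"phi_type": "telephone_number"},
      {"phi_type": "email_address"},
      {"phi_type": "medical_record_number"},
      {"phi_type": "geographic_subdivision"}
    ]
  }
}
"""


def summarize_redactions(prediction: dict) -> Counter:
    """Count redacted items per phi_type (hypothetical helper, not an SDK call)."""
    if prediction.get("status") != "completed":
        raise ValueError(f"Job is not completed: {prediction.get('status')!r}")
    items = prediction["response"]["redacted_items"]
    return Counter(item["phi_type"] for item in items)


prediction = json.loads(raw)
counts = summarize_redactions(prediction)
for phi_type, count in sorted(counts.items()):
    print(f"{phi_type}: {count}")
```

Checking `status` first avoids indexing into `response` while it is still null, as it is in the pending job shown earlier.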

Complete Example

Here’s a complete example showing how to redact sensitive information from a document:

from vlmrun.client import VLMRun
from vlmrun.client.types import PredictionResponse, FileResponse
from pathlib import Path

# Initialize the client
client = VLMRun(api_key="<your-api-key>")

# Upload the document
file_response: FileResponse = client.files.upload(
    file=Path("path/to/sensitive_document.pdf")
)

# Submit for redaction
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=[file_response.id],
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=120)

# Access the redacted document
redacted_uri = completed_response.response["redacted_uri"]
redacted_items = completed_response.response["redacted_items"]

print(f"Redacted document available at: {redacted_uri}")
print(f"Sensitive information types redacted: {[item['phi_type'] for item in redacted_items]}")
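Because `redacted_uri` is described as time-limited, you may want to save the redacted file locally soon after the job completes. The following is a minimal sketch using only the Python standard library. The helper name `download_redacted` and the destination path are hypothetical, and it assumes the URL can be fetched with a plain GET request without extra headers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_redacted(redacted_uri: str, destination: Path) -> Path:
    """Stream the file at redacted_uri to destination and return the saved path."""
    # Create the target folder if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large PDFs are not held fully in memory.
    with urllib.request.urlopen(redacted_uri, timeout=60) as remote:
        with open(destination, "wb") as local:
            shutil.copyfileobj(remote, local)
    return destination


# Usage (hypothetical destination path), with redacted_uri from the example above:
# saved = download_redacted(redacted_uri, Path("redacted/sensitive_document.pdf"))
# print(f"Saved redacted document to {saved}")
```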

Using URLs Instead of File Upload

You can also process documents directly from URLs without uploading them first:

# Process a document from URL
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    urls=["https://example.com/sensitive_document.pdf"],
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=120)
print(completed_response.response)

Batch Processing Multiple Documents

To redact sensitive information from multiple documents in one job, upload each file and pass all of the file IDs in a single request:

# Upload multiple documents
file_paths = [
    "path/to/document1.pdf",
    "path/to/document2.pdf",
    "path/to/document3.pdf"
]

file_ids = []
for file_path in file_paths:
    file_response = client.files.upload(file=Path(file_path))
    file_ids.append(file_response.id)

# Submit batch redaction job
response: PredictionResponse = client.document.execute(
    name="healthcare/phi-redaction",
    version="latest",
    file_ids=file_ids,
    batch=True
)

# Wait for completion
completed_response = client.predictions.wait(response.id, timeout=300)

# Process results
for redacted_item in completed_response.response["redacted_items"]:
    print(f"Redacted sensitive information type: {redacted_item['phi_type']}")
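Instead of hard-coding `file_paths`, the list can be built from a folder. The sketch below is one way to do that with `pathlib`; the helper name `collect_documents`, the folder path, and the `*.pdf` pattern are illustrative assumptions rather than anything the SDK requires.

```python
from pathlib import Path


def collect_documents(folder: Path, pattern: str = "*.pdf") -> list[Path]:
    """Return files in folder matching pattern, sorted by name for a stable order."""
    return sorted(p for p in folder.glob(pattern) if p.is_file())


# Usage (hypothetical folder path):
# file_paths = collect_documents(Path("path/to/documents"))
# The resulting paths can then be uploaded in the loop shown above.
```

Sorting the paths keeps the order of uploaded file IDs reproducible between runs, which makes it easier to match results back to source files.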

Compliance and Data Security

VLM Run’s document redaction is designed with enterprise compliance at its core:

🏥 Healthcare Compliance: Follows HIPAA Safe Harbor method for PHI de-identification
🏛️ Legal Compliance: Supports attorney-client privilege and confidentiality requirements
💰 Financial Compliance: Meets PCI DSS and financial data protection standards
🔒 Secure Processing: All documents are processed in compliant infrastructure with encryption in transit and at rest
🔑 Access Controls: Robust authentication and authorization mechanisms protect sensitive data
📝 Audit Logging: Comprehensive audit trails for all redaction activities
⏰ Secure URLs: Redacted documents are provided via time-limited, secure URLs that automatically expire

Use Cases

VLM Run’s document redaction capability can be applied across various industries and scenarios:

Healthcare

📋 Research Data Preparation: Automatically redact PHI from medical records before sharing with research teams
🔄 Document Sharing: Safely share patient documents with external providers and insurance companies
💾 Data Archival: Prepare historical medical records for long-term storage while removing PHI
🏥 Quality Assurance: Create de-identified versions for training and educational purposes

Legal

⚖️ Discovery Process: Redact privileged information from documents during legal discovery
📄 Public Filing: Prepare court documents for public filing by removing sensitive client information
🤝 Client Confidentiality: Protect attorney-client privilege in shared documents

Financial Services

💳 Compliance Reporting: Remove PII from financial reports and audit documents
🏦 Data Sharing: Safely share customer data with third-party vendors and partners
📊 Analytics: Enable financial analytics while protecting customer privacy

Human Resources

👥 Employee Records: Redact personal information from HR documents for compliance
📈 Performance Analysis: Analyze workforce data while protecting employee privacy
🔍 Background Checks: Prepare sanitized versions of background check reports

Supported Document Types

Document redaction works with various document formats across industries:

PDF Documents - Reports, contracts, legal briefs, medical records
Scanned Images - Faxed documents, handwritten forms, ID cards, insurance documents
Multi-page Documents - Complete case files, comprehensive reports, patient histories
Mixed Content - Documents containing both text and images, forms with signatures

Related Guides

Parsing Intake Forms

Learn how to extract structured data from healthcare intake forms.

Classifying Documents

Classify healthcare documents by type before processing.

Visual Grounding

Learn about visual grounding for document verification.

Agent Execution

Complete API reference for agent execution.

Try our Document -> JSON API today

Head over to our Document -> JSON API to start building your own document processing pipeline with VLM Run. Sign up for access on our platform.
