Sensitive Information De-identification
Using Google Cloud Data Loss Prevention with Salesforce for Sensitive Data Handling

This insight discusses the transition from detecting and classifying sensitive data to preventing data loss using Google Cloud Data Loss Prevention (DLP). Salesforce is used as the data source to demonstrate how personal, health, credential, and financial information can be de-identified in unstructured data in near real time.

Overview of Google Cloud DLP

Google Cloud DLP is a fully managed service designed to help discover, classify, and protect sensitive data. It supports the move from detection to prevention by offering services that mask sensitive information and measure re-identification risk.

Objective

The goal was to demonstrate the ability to redact sensitive information in unstructured data at scale. Specifically, the test aimed to determine whether sensitive data, such as credit card numbers, tax file numbers, and health care numbers, entered into Salesforce communications (Emails, Files, and Chatter) could be detected and redacted.

Constraints Tested

De-identifying Data with Google Cloud DLP API

Instead of detailing the setup, this section focuses on the key areas of design.

Google Design Decisions

Supporting Disparate Data Sources with Multiple Integration Patterns and Redundant Design

Salesforce Data Source

De-identification targets include email addresses, Australian Medicare card numbers, GCP API keys, passwords, and credit card numbers. Credit card numbers are masked with asterisks, while other sensitive data is replaced with its information type for readability (e.g., an email address becomes [redacted-email-address]).
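The masking and replacement rules described above can be sketched as a request body for the DLP `content:deidentify` REST method. This is a minimal illustration, not the request used in the original test: the helper name `build_deidentify_request` is hypothetical, and the infoType names (`CREDIT_CARD_NUMBER`, `EMAIL_ADDRESS`, `AUSTRALIA_MEDICARE_NUMBER`, `GCP_API_KEY`, `PASSWORD`) are assumed to match the built-in detectors and should be checked against the infoType reference. Note that `replaceWithInfoTypeConfig` substitutes the detector name (for example `[EMAIL_ADDRESS]`), so a label such as `[redacted-email-address]` would presumably need a different replacement setting.

```python
import json

# Assumption: built-in infoType names; verify against the DLP infoType reference.
MASKED_INFO_TYPES = ["CREDIT_CARD_NUMBER"]
REPLACED_INFO_TYPES = [
    "EMAIL_ADDRESS",
    "AUSTRALIA_MEDICARE_NUMBER",
    "GCP_API_KEY",
    "PASSWORD",
]


def _info_types(names):
    """Convert a list of infoType names into the API's object form."""
    return [{"name": name} for name in names]


def build_deidentify_request(text):
    """Build a content:deidentify request body (hypothetical helper).

    Credit card numbers are masked with asterisks; the other infoTypes
    are replaced with their infoType name.
    """
    return {
        "item": {"value": text},
        "inspectConfig": {
            "infoTypes": _info_types(MASKED_INFO_TYPES + REPLACED_INFO_TYPES)
        },
        "deidentifyConfig": {
            "infoTypeTransformations": {
                "transformations": [
                    {
                        "infoTypes": _info_types(MASKED_INFO_TYPES),
                        "primitiveTransformation": {
                            "characterMaskConfig": {"maskingCharacter": "*"}
                        },
                    },
                    {
                        "infoTypes": _info_types(REPLACED_INFO_TYPES),
                        "primitiveTransformation": {
                            "replaceWithInfoTypeConfig": {}
                        },
                    },
                ]
            }
        },
    }


if __name__ == "__main__":
    body = build_deidentify_request("Please charge card 4111 1111 1111 1111")
    print(json.dumps(body, indent=2))
```

The body would be sent with an authenticated POST to the project's `content:deidentify` endpoint; keeping the request builder separate from the HTTP call makes the transformation rules easy to review and test without network access.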
Sample Requests to Google De-identification Service

JSON Structure to De-identify Text Using Google Cloud DLP API

{ // JSON structure }

JSON Structure to De-identify Images Using Google Cloud DLP API

{ // JSON structure }

Salesforce Design Decisions

Redundancy and Batch Processing

A scheduled batch job provides recovery by polling for unprocessed records. To handle large data volumes (e.g., 360,000 records over 5 days), the Salesforce Bulk API is used to run queries and updates in large batches, reducing the number of API calls.

Sensitive Information De-identification

Google Cloud Data Loss Prevention allows detecting and protecting assets with sensitive information, supporting a wide range of use cases across an enterprise.

Proven Capabilities:

Considerations and Lessons Learned

Enhanced Email: Redaction covers both tasks and EmailMessage records; read-only EmailMessage records are handled by deleting and recreating them.
Files: The architecture assumes files with sensitive data can be deleted and replaced with redacted versions.
Audit Fields: When records are recreated, set the CreatedDate and LastModifiedDate fields from the original record's dates.
Field History Tracking: Avoid tracking fields intended for de-identification; track shadow fields instead.
Image De-identification: Limited to JPEG, BMP, and PNG formats, with DOCX and PDF not yet supported.
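The image-format limitation noted under Considerations and Lessons Learned suggests checking a file's type before sending it for image redaction. The sketch below shows one way that routing might look. The function names are hypothetical, the check relies only on the file extension (an assumption; a production job would likely inspect the content type as well), and the `IMAGE_JPEG`, `IMAGE_BMP`, and `IMAGE_PNG` values are assumed to correspond to the DLP byte-content type names.

```python
from pathlib import PurePath
from typing import Dict, List, Optional

# Formats the article reports as supported for image de-identification.
# Assumption: values mirror the DLP byte-content type names.
SUPPORTED_IMAGE_TYPES: Dict[str, str] = {
    ".jpg": "IMAGE_JPEG",
    ".jpeg": "IMAGE_JPEG",
    ".bmp": "IMAGE_BMP",
    ".png": "IMAGE_PNG",
}


def image_bytes_type(filename: str) -> Optional[str]:
    """Return the bytes type for a supported image, or None otherwise."""
    suffix = PurePath(filename).suffix.lower()
    return SUPPORTED_IMAGE_TYPES.get(suffix)


def partition_files(filenames: List[str]) -> Dict[str, List[str]]:
    """Split filenames into those that can be redacted and those that cannot."""
    result: Dict[str, List[str]] = {"redactable": [], "unsupported": []}
    for name in filenames:
        key = "redactable" if image_bytes_type(name) else "unsupported"
        result[key].append(name)
    return result


if __name__ == "__main__":
    print(partition_files(["scan.PNG", "contract.pdf", "notes.docx", "id.jpeg"]))
```

Files in the unsupported group (such as DOCX and PDF) could be flagged for manual review rather than silently skipped, so that unredacted content is not mistaken for processed content.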