SC-401 Data Classification | Microsoft Purview | JavaInUse

SC-401 - Data Classification

Data Classification Overview

Data classification in Microsoft Purview is the process of understanding what sensitive data exists in your organization and where it lives. Classification is the foundation for all downstream protection actions - you cannot protect what you cannot identify.

The Microsoft Purview compliance portal provides three main classification approaches:

  • Sensitive Information Types (SITs) - pattern-based (regex + keywords) detection for structured data
  • Trainable classifiers - machine learning models for unstructured data (documents about certain topics)
  • Exact Data Match (EDM) - database-driven exact-token matching for known data sets

Classification outputs feed directly into sensitivity label auto-labeling, DLP policies, Communication Compliance, and Insider Risk Management. Accurate classification reduces both over-protection (business friction) and under-protection (data exposure).

Sensitive Information Types (SITs)

A Sensitive Information Type defines a pattern - typically a regular expression combined with supporting evidence (keywords, checksums) - that Purview uses to detect sensitive content.

Microsoft provides 300+ built-in SITs covering common regulated data elements:

SIT Category | Examples | Detection Method
Financial | Credit Card Number, SWIFT Code, ABA Routing Number | Regex + Luhn algorithm checksum
Identity | US SSN, UK National Insurance Number, EU Passport | Regex + keyword proximity
Health | ICD codes, Drug Enforcement Agency (DEA) number | Regex + supporting keyword
Credentials | Azure SAS Token, AWS Secret Key, Generic Password | Regex + entropy analysis
Named entities | Person names, Physical addresses | ML-based named entity recognition
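
As a rough illustration of the "regex + checksum + keyword proximity" idea, the following Python sketch detects 16-digit card-like numbers. It is not Purview's built-in definition: the pattern `CARD_RE`, the keyword list `KEYWORDS`, the 300-character `WINDOW`, and the function names are all assumptions chosen for the example.

```python
import re

# Hypothetical pattern: 16 digits, optionally grouped in fours by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")
# Hypothetical supporting keywords and proximity window (in characters).
KEYWORDS = ("credit card", "card number", "visa", "mastercard")
WINDOW = 300


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[dict]:
    """Return checksum-valid matches, flagged for nearby keyword evidence."""
    results = []
    lowered = text.lower()
    for m in CARD_RE.finditer(text):
        if not luhn_valid(m.group()):
            continue  # the checksum filters out arbitrary 16-digit strings
        start = max(0, m.start() - WINDOW)
        nearby = lowered[start : m.end() + WINDOW]
        has_keyword = any(k in nearby for k in KEYWORDS)
        results.append({"match": m.group(), "keyword_nearby": has_keyword})
    return results
```

In this sketch the regex is the primary pattern, the Luhn check removes most accidental matches, and the keyword flag plays the role of supporting evidence.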

Custom SITs

When built-in SITs do not match your data, create custom SITs using:

  • Pattern-based - define regex + optional keyword lists and confidence levels
  • EDM-based - built on top of Exact Data Match for per-record exact matching
  • Document fingerprinting-based - detects forms based on document structure

Confidence Levels

Each SIT pattern is assigned one of three confidence levels, depending on how much supporting evidence accompanies the primary match:

Level | Meaning | Typical Use
Low (65%) | Primary pattern only, no supporting evidence | Broad detection, more false positives
Medium (75%) | Primary pattern + 1 corroborating element | Balanced precision/recall
High (85%) | Primary pattern + multiple corroborating elements | High-confidence block actions

DLP policies and sensitivity label auto-labeling rules allow you to specify the minimum confidence level and the instance count (e.g., "more than 5 credit card numbers at medium confidence") before triggering an action.
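
The interaction between confidence level and instance count can be sketched in a few lines of Python. This is only a model of the logic described above, not how Purview evaluates policies internally; the `Rule` class and both function names are hypothetical.

```python
from dataclasses import dataclass

# Confidence values as listed in the table above.
LOW, MEDIUM, HIGH = 65, 75, 85


def confidence_for(corroborating_elements: int) -> int:
    """Map the number of supporting elements found near a match to a level."""
    if corroborating_elements >= 2:
        return HIGH
    if corroborating_elements == 1:
        return MEDIUM
    return LOW


@dataclass
class Rule:
    """Hypothetical rule shape: minimum confidence plus minimum instance count."""
    min_confidence: int
    min_instances: int


def rule_triggers(rule: Rule, evidence_counts: list[int]) -> bool:
    """evidence_counts holds one corroborating-element count per detected match."""
    qualifying = sum(
        1 for n in evidence_counts if confidence_for(n) >= rule.min_confidence
    )
    return qualifying >= rule.min_instances
```

Under these assumptions, "more than 5 credit card numbers at medium confidence" would be written as `Rule(min_confidence=MEDIUM, min_instances=6)`; matches found with no supporting evidence would not count toward the six.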

Exact Data Match (EDM)

EDM enables detection of sensitive data that exactly matches records in a database - such as specific employee IDs, patient record numbers, or customer account numbers. Unlike regex SITs that use patterns, EDM compares content against actual data values.

EDM Architecture

Component | Description
EDM Schema | Defines the structure of your data table (column names, which fields are searchable)
Sensitive Data Table | CSV file containing the actual sensitive records (max 100M rows)
EDM Hash | The data table is hashed (salted SHA-256) before upload - actual values never leave the organization
EDM SIT | A custom SIT that uses the EDM schema to detect exact matches in content

EDM Setup Process

  1. Create the EDM schema in the Purview compliance portal (define columns, mark searchable fields)
  2. Prepare the CSV data file with sensitive records
  3. Hash and upload the data using the EDM Upload Agent tool
  4. Create an EDM-based SIT referencing the schema
  5. Use the EDM SIT in DLP policies, auto-labeling, or Communication Compliance

The hash/upload process must be repeated whenever the source database changes - typically automated via a scheduled task. The EDM Upload Agent can run on-premises and only sends the hashed token index to Microsoft, not the raw data values.
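
To make the "hash locally, match on hashes" idea concrete, here is a minimal Python sketch. It does not reproduce the EDM Upload Agent or its file format; the normalization step (trim and lowercase), the function names, and the sample column name `PatientID` are assumptions made for illustration.

```python
import csv
import hashlib
import io


def hash_value(value: str, salt: bytes) -> str:
    """Salted SHA-256 of a normalized cell value (normalization is an assumption)."""
    normalized = value.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).hexdigest()


def build_hashed_index(csv_text: str, searchable: str, salt: bytes) -> set[str]:
    """Hash the searchable column of every row; raw values are not retained."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {hash_value(row[searchable], salt) for row in reader}


def content_matches(tokens: list[str], index: set[str], salt: bytes) -> list[str]:
    """Return the tokens whose salted hash appears in the index."""
    return [t for t in tokens if hash_value(t, salt) in index]


# Usage: only the hashed index would need to leave the machine holding the CSV.
salt = b"demo-salt"
table = "PatientID,Name\nP-1001,Ann\nP-1002,Bob\n"
index = build_hashed_index(table, "PatientID", salt)
hits = content_matches(["p-1001", "P-9999"], index, salt)
```

Because the comparison happens on hashes, the detecting side can recognize a value from the table without ever holding the table itself, which is the property the setup process above relies on.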

Trainable Classifiers

Trainable classifiers use machine learning to identify documents by their content and context rather than by pattern matching. They are suitable for unstructured content categories like "HR documents," "source code," or "project contracts."

Types of Classifiers

Type | Description | License
Pre-trained (built-in) | Microsoft-trained models for categories like Harassment, Threat, Profanity, Resumes, Source Code, Financial Statements, Tax Documents | E3 and above
Custom trainable | You provide seed content samples, Purview trains a model, and you refine it by marking positive/negative examples | E5 or compliance add-on

Custom Classifier Training Process

  1. Seed content - provide 50-500 representative positive examples in SharePoint
  2. Training phase - Purview learns the model (may take 24+ hours)
  3. Test phase - provide 200+ items (positive and negative) to evaluate accuracy
  4. Publish - deploy the classifier for use in policies after achieving acceptable accuracy metrics

Trainable classifiers work best for semantically consistent content categories. They are not suitable for detecting specific data values (use SITs or EDM instead). They support only SharePoint Online, Exchange, and Teams as locations - not on-premises files or third-party systems directly.
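
The "acceptable accuracy metrics" in the test phase are usually discussed in terms of precision and recall. The following Python sketch shows how those two numbers could be computed from reviewed test items; it is a generic calculation, not Purview's reporting code, and the function name `evaluate` is hypothetical.

```python
def evaluate(predictions: list[bool], labels: list[bool]) -> dict[str, float]:
    """Compare classifier predictions with reviewer labels for the same items."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)        # correctly flagged
    fp = sum(1 for p, l in pairs if p and not l)    # flagged, but not a match
    fn = sum(1 for p, l in pairs if not p and l)    # missed matches
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```

Low precision means the classifier flags too much unrelated content; low recall means it misses documents that belong to the category.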

OCR and Document Fingerprinting

Optical Character Recognition (OCR)

Purview can extract text from images (JPG, PNG, TIFF, BMP) and PDFs to apply classification. OCR enables DLP and auto-labeling to detect sensitive information in scanned documents and screenshots.

  • Enabled per-location in DLP and auto-labeling policies
  • Supports images embedded in Office documents and PDFs
  • Adds processing overhead - consider enabling selectively for high-risk locations
  • Maximum image file size for OCR: 20 MB

Document Fingerprinting

Document fingerprinting converts a standard form or template into a SIT. Any document with a similar layout and filled-in content will be detected as a match.

Use cases:

  • W-2 tax forms, patent applications, employee onboarding forms
  • Any regulatory form that employees fill in with sensitive data

Fingerprinting works by hashing the structure of the blank template (its word patterns, not the filled-in content).

Document fingerprinting works for Exchange Online (email attachments) and SharePoint/OneDrive. It does NOT currently support detecting filled-in PDFs - it works best with Word document templates. Also, fingerprinting cannot detect forms that have been substantially modified from the original template structure.
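
One way to picture matching on word patterns is with hashed word shingles, as in the Python sketch below. This is a swapped-in, simplified technique for illustration only and should not be read as Microsoft's fingerprinting algorithm; the shingle size of 3 and the function names are assumptions.

```python
import hashlib
import re


def shingles(text: str, size: int = 3) -> set[str]:
    """Hash every run of `size` consecutive words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i : i + size]) for i in range(len(words) - size + 1))
    return {hashlib.sha256(g.encode("utf-8")).hexdigest() for g in grams}


def template_coverage(template: str, document: str, size: int = 3) -> float:
    """Fraction of the template's word-pattern hashes that reappear in the document."""
    fingerprint = shingles(template, size)
    if not fingerprint:
        return 0.0
    return len(fingerprint & shingles(document, size)) / len(fingerprint)


# Usage: a filled-in form keeps part of the template's patterns; unrelated text keeps none.
template = "Employee name: Date of birth: Social security number:"
filled = "Employee name: Jane Doe Date of birth: 1990-01-01 Social security number: 123"
filled_score = template_coverage(template, filled)
unrelated_score = template_coverage(template, "Quarterly sales grew in every region")
```

The sketch also shows why heavily modified forms are hard to detect: each change to the template wording removes shared patterns and lowers the coverage score.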

Content Explorer and Activity Explorer

Both tools are found under Data Classification in the Microsoft Purview compliance portal and provide different views of your classified data estate:

Tool | What It Shows | Lookback Period | Required Role
Content Explorer | Current inventory of items with sensitivity labels or detected SITs, browsable by location and classification | Current state (snapshot) | Content Explorer List Viewer + Content Explorer Content Viewer
Activity Explorer | Historical timeline of label and DLP activity: labeling events, policy matches, endpoint file activities | Up to 30 days | Activity Explorer Viewer (or Compliance Admin)

Key Activity Explorer event types tracked:

  • Label applied / changed / removed
  • File created / modified / deleted (SharePoint, OneDrive)
  • DLP policy matched
  • Endpoint activities: file copied to removable media, printed, accessed by unallowed app
  • Email sent with label, Teams message labeled

Activity Explorer is the go-to tool for proving that labeling policies are working as intended and for investigating specific data-handling events. Use Content Explorer to answer "how much sensitive data do we have and where?" and Activity Explorer to answer "what has been done with our sensitive data?"
