SC-401 - Data Classification

Quick Navigation

Data Classification Overview
Sensitive Information Types (SITs)
Exact Data Match (EDM)
Trainable Classifiers
OCR and Document Fingerprinting
Content Explorer and Activity Explorer

Data Classification Overview

Data classification in Microsoft Purview is the process of understanding what sensitive data exists in your organization and where it lives. Classification is the foundation for all downstream protection actions - you cannot protect what you cannot identify.

The Microsoft Purview compliance portal provides three main classification approaches:

Sensitive Information Types (SITs) - pattern-based (regex + keywords) detection for structured data
Trainable classifiers - machine learning models for unstructured data (documents about certain topics)
Exact Data Match (EDM) - database-driven exact-token matching for known data sets

Classification outputs feed directly into sensitivity label auto-labeling, DLP policies, Communication Compliance, and Insider Risk Management. Accurate classification reduces both over-protection (business friction) and under-protection (data exposure).

Sensitive Information Types (SITs)

A Sensitive Information Type defines a pattern - typically a regular expression combined with supporting evidence (keywords, checksums) - that Purview uses to detect sensitive content.

Microsoft provides 300+ built-in SITs covering common regulated data elements:

SIT Category	Examples	Detection Method
Financial	Credit Card Number, SWIFT Code, ABA Routing Number	Regex + Luhn algorithm checksum
Identity	US SSN, UK National Insurance Number, EU Passport	Regex + keyword proximity
Health	ICD codes, Drug Enforcement Agency (DEA) number	Regex + supporting keyword
Credentials	Azure SAS Token, AWS Secret Key, Generic Password	Regex + entropy analysis
Named entities	Person names, Physical addresses	ML-based named entity recognition
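
To make the "regex + checksum + keyword proximity" idea concrete, here is a minimal Python sketch of how a credit-card style detector could combine the three signals. It is an illustration only, not Purview's implementation: the regex, the keyword list, and the 300-character proximity window are all assumptions chosen for the example.

```python
import re

# Hypothetical, simplified pattern: 16 digits with optional space/hyphen separators.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")
# Hypothetical supporting keywords (assumed for this example).
KEYWORDS = ("credit card", "card number", "visa", "mastercard")
PROXIMITY = 300  # assumed window, in characters, searched on each side of a match


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[str, bool]]:
    """Return (digits, keyword_nearby) for every checksum-valid pattern match."""
    results = []
    lowered = text.lower()
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue  # pattern matched, but the checksum rules it out
        window = lowered[max(0, m.start() - PROXIMITY): m.end() + PROXIMITY]
        results.append((digits, any(k in window for k in KEYWORDS)))
    return results
```

A match with a nearby keyword would correspond to stronger supporting evidence than a bare pattern match, which is the basis for the confidence levels described later in this section.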

Custom SITs

When built-in SITs do not match your data, create custom SITs using:

Pattern-based - define regex + optional keyword lists and confidence levels
EDM-based - built on top of Exact Data Match for per-record exact matching
Document fingerprinting-based - detects forms based on document structure

Confidence Levels

Every SIT match is reported at one of three confidence levels:

Level	Meaning	Typical Use
Low (65%)	Primary pattern only, no supporting evidence	Broad detection, more false positives
Medium (75%)	Primary pattern + 1 corroborating element	Balanced precision/recall
High (85%)	Primary pattern + multiple corroborating elements	High-confidence block actions

DLP policies and sensitivity label auto-labeling rules allow you to specify the minimum confidence level and the instance count (e.g., "more than 5 credit card numbers at medium confidence") before triggering an action.
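
The threshold logic can be sketched in a few lines of Python. This is a hedged illustration of the concept, not how Purview evaluates rules internally; `Match` and `rule_triggers` are hypothetical names, and "more than 5" is expressed as a minimum count of 6.

```python
from dataclasses import dataclass

# Confidence values as listed in the table above.
LOW, MEDIUM, HIGH = 65, 75, 85


@dataclass
class Match:
    sit_name: str
    confidence: int  # 65, 75, or 85


def rule_triggers(matches, sit_name, min_confidence, min_count):
    """True when at least min_count matches of sit_name meet min_confidence."""
    qualifying = [
        m for m in matches
        if m.sit_name == sit_name and m.confidence >= min_confidence
    ]
    return len(qualifying) >= min_count


matches = (
    [Match("Credit Card Number", HIGH)] * 4
    + [Match("Credit Card Number", MEDIUM)] * 2
    + [Match("Credit Card Number", LOW)] * 3
)
print(rule_triggers(matches, "Credit Card Number", MEDIUM, 6))  # True: 6 matches at medium or higher
print(rule_triggers(matches, "Credit Card Number", HIGH, 6))    # False: only 4 matches at high
```

Raising the minimum confidence shrinks the set of qualifying matches, so the same content can trigger a broad audit rule while staying below the threshold of a stricter block rule.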

Exact Data Match (EDM)

EDM enables detection of sensitive data that exactly matches records in a database - such as specific employee IDs, patient record numbers, or customer account numbers. Unlike pattern-based SITs, which flag anything that fits a pattern, EDM compares content against the actual data values.

EDM Architecture

Component	Description
EDM Schema	Defines the structure of your data table (column names, which fields are searchable)
Sensitive Data Table	CSV file containing the actual sensitive records (max 100M rows)
EDM Hash	The data table is hashed (salted SHA-256) before upload - actual values never leave the organization
EDM SIT	A custom SIT that uses the EDM schema to detect exact matches in content

EDM Setup Process

Create the EDM schema in the Purview compliance portal (define columns, mark searchable fields)
Prepare the CSV data file with sensitive records
Hash and upload the data using the EDM Upload Agent tool
Create an EDM-based SIT referencing the schema
Use the EDM SIT in DLP policies, auto-labeling, or Communication Compliance

The hash/upload process must be repeated whenever the source database changes - typically automated via a scheduled task. The EDM Upload Agent can run on-premises and only sends the hashed token index to Microsoft, not the raw data values.
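
The idea of matching against hashes rather than raw values can be illustrated with a short Python sketch. The real hashing is performed by the EDM Upload Agent; the normalization step, the salt handling, and the function names below are assumptions made for this example only.

```python
import csv
import hashlib
import io


def hash_value(value: str, salt: bytes) -> str:
    """Salted SHA-256 of a normalized value (normalization is an assumption)."""
    normalized = value.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


def build_hashed_index(csv_text: str, searchable: str, salt: bytes) -> set[str]:
    """Hash the searchable column of the data table; raw values are not kept."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {hash_value(row[searchable], salt) for row in reader}


def content_matches(tokens, index, salt):
    """Return the tokens whose salted hash appears in the hashed index."""
    return [t for t in tokens if hash_value(t, salt) in index]


salt = b"demo-salt"  # hypothetical salt for illustration
table = "PatientID,Name\nP-1001,Ann\nP-1002,Bob\n"
index = build_hashed_index(table, "PatientID", salt)
print(content_matches(["hello", "p-1001", "P-9999"], index, salt))  # ['p-1001']
```

Because only the hashes are stored in the index, a lookup can confirm that a token matches a known record without the index itself exposing the original values.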

Trainable Classifiers

Trainable classifiers use machine learning to identify documents by their content and context rather than by pattern matching. They are suitable for unstructured content categories like "HR documents," "source code," or "project contracts."

Types of Classifiers

Type	Description	License
Pre-trained (built-in)	Microsoft-trained models for categories like Harassment, Threat, Profanity, Resumes, Source Code, Financial Statements, Tax Documents	E3 and above
Custom trainable	You supply seed content, Purview trains a model, and you then review positive and negative test items to refine it	E5 or compliance add-on

Custom Classifier Training Process

Seed content - provide 50-500 representative positive examples in SharePoint
Training phase - Purview trains the model on the seed content (may take 24+ hours)
Test phase - provide 200+ items (positive and negative) to evaluate accuracy
Publish - deploy the classifier for use in policies after achieving acceptable accuracy metrics
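
One common way to quantify "acceptable accuracy" in the test phase is to compute precision and recall over the reviewed test items. The Python sketch below assumes you have, for each test item, the classifier's prediction and your own verdict; the function name and the sample data are hypothetical.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compute (precision, recall) from parallel lists of predictions and verdicts."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # flagged, but not a real match
    fn = sum(1 for p, a in pairs if a and not p)    # real match that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical review results for five test items.
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Low precision suggests the classifier will over-protect (false positives), while low recall suggests it will miss content that should be classified.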

Trainable classifiers work best for semantically consistent content categories. They are not suitable for detecting specific data values (use SITs or EDM instead). They support only SharePoint Online, Exchange, and Teams as locations - not on-premises files or third-party systems directly.

OCR and Document Fingerprinting

Optical Character Recognition (OCR)

Purview can extract text from images (JPG, PNG, TIFF, BMP) and PDFs to apply classification. OCR enables DLP and auto-labeling to detect sensitive information in scanned documents and screenshots.

Enabled per-location in DLP and auto-labeling policies
Supports images embedded in Office documents and PDFs
Adds processing overhead - consider enabling selectively for high-risk locations
Maximum image file size for OCR: 20 MB

Document Fingerprinting

Document fingerprinting converts a standard form or template into a SIT. Any document with a similar layout and filled-in content will be detected as a match.

Use cases:

W-2 tax forms, patent applications, employee onboarding forms
Any regulatory form that employees fill in with sensitive data

Fingerprinting works by hashing the structure of the blank template (its word patterns, not the filled-in content).

Document fingerprinting works for Exchange Online (email attachments) and SharePoint/OneDrive. It does NOT currently support detecting filled-in PDFs - it works best with Word document templates. Also, fingerprinting cannot detect forms that have been substantially modified from the original template structure.
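
The general idea of matching on template word patterns can be sketched with hashed word shingles. This is a stand-in technique for illustration, not Purview's actual fingerprinting algorithm; the shingle size, the function names, and the sample form are assumptions.

```python
import hashlib
import re


def shingles(text: str, size: int = 2) -> set[str]:
    """Hash every run of `size` consecutive words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def template_coverage(template: str, document: str, size: int = 2) -> float:
    """Fraction of the template's hashed shingles that also occur in the document."""
    template_hashes = shingles(template, size)
    if not template_hashes:
        return 0.0
    return len(template_hashes & shingles(document, size)) / len(template_hashes)


# Hypothetical blank form and a filled-in copy of it.
blank = ("Employee onboarding form. Full legal name: Social security number: "
         "Date of hire: Home address:")
filled = ("Employee onboarding form. Full legal name: Ann Lee "
          "Social security number: 078-05-1120 Date of hire: 2024-01-15 "
          "Home address: 1 Main St")
unrelated = "Quarterly sales report for the northern region"

print(template_coverage(blank, filled) > template_coverage(blank, unrelated))  # True
```

The filled-in form keeps most of the template's word sequences, so its coverage stays high, whereas a heavily rewritten form would lose those sequences and drop below any reasonable matching threshold, consistent with the limitation noted above.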

Content Explorer and Activity Explorer

Both tools are found under Data Classification in the Microsoft Purview compliance portal and provide different views of your classified data estate:

Tool	What It Shows	Lookback Period	Required Role
Content Explorer	Current inventory of items with sensitivity labels or detected SITs, browsable by location and classification	Current state (snapshot)	Content Explorer List Viewer + Content Explorer Content Viewer
Activity Explorer	Historical timeline of label and DLP activity: labeling events, policy matches, endpoint file activities	Up to 30 days	Activity Explorer Viewer (or Compliance Admin)

Key Activity Explorer event types tracked:

Label applied / changed / removed
File created / modified / deleted (SharePoint, OneDrive)
DLP policy matched
Endpoint activities: file copied to removable media, printed, accessed by unallowed app
Email sent with label, Teams message labeled

Activity Explorer is the go-to tool for proving that labeling policies are working as intended and for investigating specific data-handling events. Use Content Explorer to answer "how much sensitive data do we have and where?" and Activity Explorer to answer "what has been done with our sensitive data?"
