SC-401 - Data Classification
Data Classification Overview
Data classification in Microsoft Purview is the process of understanding what sensitive data exists in your organization and where it lives. Classification is the foundation for all downstream protection actions - you cannot protect what you cannot identify.
The Microsoft Purview compliance portal provides three main classification approaches:
- Sensitive Information Types (SITs) - pattern-based (regex + keywords) detection for structured data
- Trainable classifiers - machine learning models for unstructured data (documents about certain topics)
- Exact Data Match (EDM) - database-driven exact-token matching for known data sets
Sensitive Information Types (SITs)
A Sensitive Information Type defines a pattern - typically a regular expression combined with supporting evidence (keywords, checksums) - that Purview uses to detect sensitive content.
Microsoft provides 300+ built-in SITs covering common regulated data elements:
| SIT Category | Examples | Detection Method |
|---|---|---|
| Financial | Credit Card Number, SWIFT Code, ABA Routing Number | Regex + Luhn algorithm checksum |
| Identity | US SSN, UK National Insurance Number, EU Passport | Regex + keyword proximity |
| Health | ICD codes, Drug Enforcement Agency (DEA) number | Regex + supporting keyword |
| Credentials | Azure SAS Token, AWS Secret Key, Generic Password | Regex + entropy analysis |
| Named entities | Person names, Physical addresses | ML-based named entity recognition |
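The "Regex + Luhn algorithm checksum" and "Regex + keyword proximity" methods in the table can be sketched in a few lines. This is a minimal illustration, not Purview's actual definition: the regex, the keyword list, and the 300-character proximity window are all assumptions chosen for the example.

```python
import re

# Hypothetical pattern and keyword list; not Purview's real SIT definition.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")
KEYWORDS = ("credit card", "card number", "visa", "mastercard")
PROXIMITY = 300  # characters searched on each side of a match (assumed window)


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[dict]:
    """Find 16-digit candidates, keep Luhn-valid ones, and note nearby keywords."""
    hits = []
    lowered = text.lower()
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue  # primary pattern matched but the checksum failed
        window = lowered[max(0, m.start() - PROXIMITY): m.end() + PROXIMITY]
        hits.append({
            "value": digits,
            "has_keyword": any(k in window for k in KEYWORDS),
        })
    return hits
```

The checksum removes random 16-digit strings that merely look like card numbers, while the keyword flag is the kind of supporting evidence that raises confidence in a match.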
Custom SITs
When built-in SITs do not match your data, create a custom SIT of one of these types:
- Pattern-based - define regex + optional keyword lists and confidence levels
- EDM-based - built on top of Exact Data Match for per-record exact matching
- Document fingerprinting-based - detects forms based on document structure
Confidence Levels
A SIT can define up to three confidence levels, and each match is reported at the level whose evidence requirements it meets:
| Level | Meaning | Typical Use |
|---|---|---|
| Low (65%) | Primary pattern only, no supporting evidence | Broad detection, more false positives |
| Medium (75%) | Primary pattern + 1 corroborating element | Balanced precision/recall |
| High (85%) | Primary pattern + multiple corroborating elements | High-confidence block actions |
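The table can be read as a simple decision rule. The sketch below is a simplification for study purposes: real SITs define their own evidence requirements per pattern, and the `should_block` helper is a hypothetical stand-in for a policy condition.

```python
from typing import Optional

# Thresholds from the table above; the evidence-count rule is a simplification.
LOW, MEDIUM, HIGH = 65, 75, 85


def confidence(primary_match: bool, supporting_elements: int) -> Optional[int]:
    """Map a primary match plus corroborating evidence to a confidence level."""
    if not primary_match:
        return None  # without the primary pattern there is no match at all
    if supporting_elements >= 2:
        return HIGH
    if supporting_elements == 1:
        return MEDIUM
    return LOW


def should_block(level: Optional[int], minimum: int = HIGH) -> bool:
    """Hypothetical policy check: act only at or above a chosen confidence."""
    return level is not None and level >= minimum
```

Lowering `minimum` widens detection at the cost of more false positives, which is why block actions are usually tied to high confidence.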
Exact Data Match (EDM)
EDM enables detection of sensitive data that exactly matches records in a database - such as specific employee IDs, patient record numbers, or customer account numbers. Unlike regex SITs that use patterns, EDM compares content against actual data values.
EDM Architecture
| Component | Description |
|---|---|
| EDM Schema | Defines the structure of your data table (column names, which fields are searchable) |
| Sensitive Data Table | CSV file containing the actual sensitive records (max 100M rows) |
| EDM Hash | The data table is hashed (salted SHA-256) before upload - only the hashes are uploaded, so clear-text values never leave the organization |
| EDM SIT | A custom SIT that uses the EDM schema to detect exact matches in content |
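The idea behind the hashed table can be illustrated with a small sketch. In practice the EDM Upload Agent performs the hashing; the normalization step, the tokenization, and the function names below are assumptions made for the example, not the tool's actual behavior.

```python
import hashlib


def hash_value(value: str, salt: bytes) -> str:
    """Salted SHA-256 of a normalized value (normalization is an assumption)."""
    normalized = value.strip().lower()
    return hashlib.sha256(salt + normalized.encode("utf-8")).hexdigest()


def build_index(values: list[str], salt: bytes) -> set[str]:
    """Hash every sensitive value; only the hashes are kept."""
    return {hash_value(v, salt) for v in values}


def find_matches(text: str, index: set[str], salt: bytes) -> list[str]:
    """Hash each token in the content and look it up in the hashed index."""
    matches = []
    for raw in text.split():
        token = raw.strip(".,;:()")  # drop surrounding punctuation
        if token and hash_value(token, salt) in index:
            matches.append(token)
    return matches
```

Because both sides are hashed with the same salt, the comparison finds exact values without the detection service ever holding the clear-text table.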
EDM Setup Process
- Create the EDM schema in the Purview compliance portal (define columns, mark searchable fields)
- Prepare the CSV data file with sensitive records
- Hash and upload the data using the EDM Upload Agent tool
- Create an EDM-based SIT referencing the schema
- Use the EDM SIT in DLP policies, auto-labeling, or Communication Compliance
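Before hashing and uploading, it helps to confirm that the CSV actually agrees with the schema. The check below is a hypothetical pre-flight sketch: the function name and the rules it enforces (header matches the schema, searchable fields are not empty) are assumptions, not part of the EDM tooling.

```python
import csv
import io


def validate_table(csv_text: str, schema_columns: list[str],
                   searchable: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the table looks usable."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if header != schema_columns:
        problems.append(f"header {header} does not match schema {schema_columns}")
        return problems  # row checks are meaningless with the wrong columns
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for col in searchable:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: searchable field '{col}' is empty")
    return problems
```

Catching a mismatched header or an empty searchable field locally is cheaper than discovering it after a long hash-and-upload run.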
Trainable Classifiers
Trainable classifiers use machine learning to identify documents by their content and context rather than by pattern matching. They are suitable for unstructured content categories like "HR documents," "source code," or "project contracts."
Types of Classifiers
| Type | Description | License |
|---|---|---|
| Pre-trained (built-in) | Microsoft-trained models for categories like Harassment, Threat, Profanity, Resumes, Source Code, Financial Statements, Tax Documents | E5 or E5 Compliance add-on |
| Custom trainable | You provide seed content samples, Purview trains a model, and you then review positive and negative examples to refine it | E5 or compliance add-on |
Custom Classifier Training Process
- Seed content - provide 50-500 representative positive examples in SharePoint
- Training phase - Purview builds the model from the seed content (may take 24 hours or more)
- Test phase - provide 200+ items (positive and negative) to evaluate accuracy
- Publish - deploy the classifier for use in policies after achieving acceptable accuracy metrics
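"Acceptable accuracy metrics" in the test phase usually means precision and recall. The sketch below computes them from reviewed test items; the input shape (pairs of predicted and actual labels) is an assumption made for illustration and is not how Purview exposes its results.

```python
def evaluate(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute precision, recall, and F1.

    Each item is (predicted_positive, actually_positive) for one test document.
    """
    tp = sum(1 for pred, actual in results if pred and actual)
    fp = sum(1 for pred, actual in results if pred and not actual)
    fn = sum(1 for pred, actual in results if not pred and actual)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Low precision means the classifier flags too many unrelated documents, and low recall means it misses documents it should catch. Both are reasons to add more examples before publishing.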
OCR and Document Fingerprinting
Optical Character Recognition (OCR)
Purview can extract text from images (JPG, PNG, TIFF, BMP) and PDFs to apply classification. OCR enables DLP and auto-labeling to detect sensitive information in scanned documents and screenshots.
- Enabled per-location in DLP and auto-labeling policies
- Supports images embedded in Office documents and PDFs
- Adds processing overhead - consider enabling selectively for high-risk locations
- Maximum image file size for OCR: 20 MB
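The constraints above can be expressed as a quick eligibility check when estimating how much content OCR would cover. This is only a planning sketch: the extension set (including the `.jpeg` and `.tif` variants) and the reading of 20 MB as 20 × 1024 × 1024 bytes are assumptions.

```python
from pathlib import Path

# Image types listed above; the .jpeg and .tif variants are assumed.
OCR_IMAGE_TYPES = {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MB, interpreted as MiB (assumption)


def ocr_candidate(filename: str, size_bytes: int) -> bool:
    """Return True if an image file falls within the OCR type and size limits."""
    suffix = Path(filename).suffix.lower()
    return suffix in OCR_IMAGE_TYPES and size_bytes <= MAX_IMAGE_BYTES
```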
Document Fingerprinting
Document fingerprinting converts a standard form or template into a SIT. Any filled-in document based on that template is then detected as a match.
Use cases:
- W-2 tax forms, patent applications, employee onboarding forms
- Any regulatory form that employees fill in with sensitive data
Fingerprinting works by hashing the word pattern of the blank template rather than the content entered into it.
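To make the "word patterns, not content" idea concrete, here is a rough sketch that uses hashed word pairs (shingles) as a stand-in fingerprint. Purview's actual algorithm is not described in these notes, so the shingling technique, the function names, and the coverage score are all illustrative assumptions.

```python
import hashlib
import re


def shingles(text: str, size: int = 2) -> set[str]:
    """Hash every run of `size` consecutive words (a simple 'word pattern')."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def template_coverage(template: str, document: str, size: int = 2) -> float:
    """Fraction of the template's word patterns that also occur in the document."""
    fingerprint = shingles(template, size)
    if not fingerprint:
        return 0.0
    return len(fingerprint & shingles(document, size)) / len(fingerprint)
```

A filled-in form keeps most of the template's field labels in order, so its coverage stays high even though the entered values differ, while an unrelated document shares almost none of the patterns.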
Content Explorer and Activity Explorer
Both tools are found under Data Classification in the Microsoft Purview compliance portal and provide different views of your classified data estate:
| Tool | What It Shows | Lookback Period | Required Role |
|---|---|---|---|
| Content Explorer | Current inventory of items with sensitivity labels or detected SITs, browsable by location and classification | Current state (snapshot) | Content Explorer List Viewer + Content Explorer Content Viewer |
| Activity Explorer | Historical timeline of label and DLP activity: labeling events, policy matches, endpoint file activities | Up to 30 days | Activity Explorer Viewer (or Compliance Admin) |
Key Activity Explorer event types tracked:
- Label applied / changed / removed
- File created / modified / deleted (SharePoint, OneDrive)
- DLP policy matched
- Endpoint activities: file copied to removable media, printed, accessed by unallowed app
- Email sent with label, Teams message labeled
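The 30-day lookback can be pictured as a filter over a stream of events. The sketch below assumes events have already been exported into dictionaries; the `activity` and `timestamp` keys and the activity names used in the usage example are hypothetical, not the Activity Explorer schema.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

LOOKBACK = timedelta(days=30)  # Activity Explorer window from the table above


def recent_activity_counts(events: list[dict], now: datetime) -> Counter:
    """Count events per activity type, ignoring anything older than 30 days."""
    cutoff = now - LOOKBACK
    return Counter(e["activity"] for e in events if e["timestamp"] >= cutoff)


# Usage with made-up events:
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
events = [
    {"activity": "LabelApplied", "timestamp": now - timedelta(days=1)},
    {"activity": "DLPRuleMatch", "timestamp": now - timedelta(days=29)},
    {"activity": "LabelApplied", "timestamp": now - timedelta(days=31)},
]
counts = recent_activity_counts(events, now)
```

Anything that must be retained beyond the lookback window has to be captured elsewhere, since older events fall out of this view.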