Audio Deepfake Detection
Detect AI-generated voices, voice cloning, and audio manipulation to prevent social engineering attacks, CEO fraud, and phone-based identity theft.
Common Attack Vectors
Voice Cloning
AI Voice Synthesis:
- Clone someone's voice from 3-10 seconds of audio
- Tools: ElevenLabs, Descript, Play.ht
- Convincing enough to fool family members
- Used in CEO fraud, grandparent scams
How It Works:
- Attacker obtains voice sample (social media, voicemail, YouTube)
- Feed sample to text-to-speech AI model
- Model generates new speech in target's voice
- Call victim with cloned voice
Real-World Example:
- 2019: CEO voice cloned, $243,000 stolen from UK energy company
- 2020: Bank manager fooled by deepfake voice, transferred $35M
Text-to-Speech (TTS) Attacks
Synthetic Voice:
- AI-generated voice (no real person)
- Generic or custom voice profiles
- Realistic prosody and intonation
- Used in automated scam calls
Characteristics:
- Unnatural speech patterns
- Robotic transitions between words
- Consistent tone (lack of emotion)
- Background noise inconsistencies
Audio Splicing
Cut-and-Paste Audio:
- Splice together real audio clips
- Rearrange words/sentences
- Create fake statements from real recordings
- Detectable via frequency analysis
Background Noise Manipulation
Indicators of Fake Audio:
- Inconsistent background noise
- Abrupt changes in ambient sound
- Silence where noise expected
- Added artificial background to mask synthesis
Detection Capabilities
1. Spectral Analysis
Analyze frequency patterns to detect AI generation.
What We Check:
- Frequency Range: AI voices often lack full human frequency range
- Harmonic Patterns: Unnatural harmonic structures
- Spectral Anomalies: Artifacts specific to TTS models
- Noise Floor: Consistent vs. natural noise patterns
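The frequency-range check above can be reduced to a simple post-processing heuristic. This is an illustrative sketch, not part of the API: the function name and the 8000 Hz minimum span are assumptions, and the low/high bounds would come from your own measurement (or from parsing the `spectralAnalysis` findings).

```javascript
// Rough heuristic: natural speech spans roughly 80-12000 Hz, so a much
// narrower measured range (e.g. 200-3500 Hz) suggests band-limited or
// synthetic audio. The 8000 Hz span threshold is an illustrative assumption.
function isBandLimited(measuredLowHz, measuredHighHz, minNaturalSpanHz = 8000) {
  const span = measuredHighHz - measuredLowHz;
  return span < minNaturalSpanHz;
}

// The 200-3500 Hz range from the example below would be flagged:
// isBandLimited(200, 3500)  -> true
// isBandLimited(80, 12000)  -> false
```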
Detection Example:
{
"spectralAnalysis": {
"suspiciousPatterns": true,
"findings": [
"Limited frequency range (200-3500 Hz, natural is 80-12000 Hz)",
"Unnatural harmonic spacing",
"Consistent noise floor (indicates synthetic generation)"
],
"confidence": 87
}
}
2. Prosody Analysis
Analyze natural speech rhythm and intonation.
Human Speech:
- Variable pitch and tone
- Natural pauses and emphasis
- Emotion-driven variation
- Breathing patterns
Synthetic Speech:
- Consistent pitch/tone
- Mechanical pauses
- Lack of emotional variation
- No breathing sounds (or artificial breathing)
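The human-vs-synthetic contrast above can be sketched as a toy scoring rule over the fields returned in `prosodyAnalysis`. The thresholds (25 for pitch variation, 20 for emotional range) and the three-flag cutoff are illustrative assumptions, not documented API behavior:

```javascript
// Toy verdict rule over prosody fields: count synthetic-speech indicators
// and flag the sample when most of them fire. Thresholds are assumptions
// to be tuned against labelled data, not API defaults.
function prosodyVerdict({ pitchVariation, naturalPauses, emotionalRange, breathingDetected }) {
  let flags = 0;
  if (pitchVariation < 25) flags++;   // low pitch variation
  if (!naturalPauses) flags++;        // mechanical pauses
  if (emotionalRange < 20) flags++;   // flat emotional range
  if (!breathingDetected) flags++;    // no breathing sounds
  return flags >= 3 ? 'likely_synthetic' : 'likely_human';
}
```

Applied to the example response below (pitch variation 12, no natural pauses, emotional range 8, no breathing), all four indicators fire and the rule returns `likely_synthetic`.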
Analysis:
{
"prosodyAnalysis": {
"pitchVariation": 12, // 0-100, low = suspicious
"naturalPauses": false,
"emotionalRange": 8, // Very low
"breathingDetected": false,
"verdict": "likely_synthetic"
}
}
3. Voice Biometrics
Compare voice to known sample for speaker verification.
Use Case: Verify caller is who they claim to be
Process:
- User provides known voice sample (enrollment)
- Subsequent calls analyzed for voice match
- Deepfake detection + biometric matching
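A minimal client-side decision over the `voiceBiometrics` result might look like this. It is a sketch, not an API feature: the 85 cutoff comes from the comment in the example response below, and treating any deepfake indicator as an automatic reject is an assumed policy.

```javascript
// Combine biometric matching with deepfake indicators: any deepfake
// indicator rejects outright; otherwise accept only on a strong match.
// The 85 threshold mirrors the "should be 85+ for match" guidance.
function biometricDecision(result, minMatchScore = 85) {
  const { matchScore, deepfakeIndicators = [] } = result;
  if (deepfakeIndicators.length > 0) return 'reject';
  return matchScore >= minMatchScore ? 'accept' : 'no_match';
}
```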
Result:
{
"voiceBiometrics": {
"matchScore": 42, // 0-100, should be 85+ for match
"result": "no_match",
"reason": "Spectral characteristics differ from enrolled sample",
"deepfakeIndicators": [
"Synthetic voice detected",
"Voice characteristics inconsistent with enrollment"
]
}
}
4. Compression Artifacts
Detect digital manipulation through compression analysis.
Indicators:
- Multiple compression layers (edited audio)
- Inconsistent compression across file
- Splice points with compression mismatch
- Unusual codec combinations
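When acting on a compression report, you typically only care about splice points the detector is confident about. This helper is illustrative; the 85 default threshold is an assumption, and the field names mirror the example response below.

```javascript
// Extract splice-point timestamps whose detection confidence meets the
// threshold. Field names match the compressionAnalysis example response;
// the default threshold of 85 is an illustrative choice.
function highConfidenceSplices(compressionAnalysis, threshold = 85) {
  return (compressionAnalysis.splicePoints || [])
    .filter((p) => p.confidence >= threshold)
    .map((p) => p.timestamp);
}
```

For the example response below, both splice points (confidence 89 and 92) clear the threshold, yielding `['2.3s', '5.7s']`.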
Detection:
{
"compressionAnalysis": {
"multipleCompressionDetected": true,
"layers": 3, // Indicates editing
"splicePoints": [
{ "timestamp": "2.3s", "confidence": 89 },
{ "timestamp": "5.7s", "confidence": 92 }
],
"verdict": "edited_audio"
}
}
5. Background Consistency
Analyze ambient noise patterns.
Authentic Audio:
- Consistent background noise
- Natural environmental sounds
- Smooth transitions
Manipulated Audio:
- Abrupt background changes
- Silence between words (noise removed)
- Artificial background added
- Mismatched environmental acoustics
Analysis Process
Step 1: Upload Audio
const formData = new FormData();
formData.append('audio', audioFile);
formData.append('analysisType', 'voice_verification'); // or 'general'
formData.append('compareToSample', enrollmentAudioId); // Optional: voice biometrics
const job = await fetch('/api/v4/deepfake/analyse', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`
},
body: formData
});
// Returns job ID
{
"jobId": "job_audio123",
"status": "processing",
"estimatedTime": "8 seconds"
}
Step 2: Processing
Analysis Pipeline (5-10 seconds):
- Audio preprocessing (noise reduction, normalization)
- Spectral analysis
- Prosody and intonation analysis
- Compression artifact detection
- Background consistency check
- Voice biometrics (if enrollment sample provided)
- AI model fingerprinting
- Final risk scoring
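Because analysis takes a few seconds, results are fetched by polling the jobs endpoint from Step 3. This is a minimal polling sketch, not an official client: the interval, attempt limit, and terminal statuses (`completed`/`failed`) are assumptions, and `fetchFn` is injected only so the loop can be exercised without a live API (in production, pass the global `fetch`).

```javascript
// Poll GET /api/v4/deepfake/jobs/:jobId until the job reaches a terminal
// status or the attempt budget is exhausted. fetchFn is injectable for
// testing; interval and maxAttempts defaults are illustrative.
async function pollJob(jobId, fetchFn, { intervalMs = 1000, maxAttempts = 10 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchFn(`/api/v4/deepfake/jobs/${jobId}`);
    const body = await res.json();
    if (body.status === 'completed' || body.status === 'failed') return body;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}
```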
Step 3: Receive Results
GET /api/v4/deepfake/jobs/:jobId
{
"jobId": "job_audio123",
"status": "completed",
"audio": {
"filename": "phone_call.mp3",
"duration": "45 seconds",
"format": "MP3",
"sampleRate": "44100 Hz",
"bitrate": "128 kbps"
},
"result": {
"isDeepfake": true,
"confidence": 91,
"manipulationType": "voice_cloning",
"aiModel": "ElevenLabs-like (suspected)",
"riskScore": 88,
"recommendation": "reject"
},
"analysis": {
"spectralAnalysis": {
"suspiciousPatterns": true,
"confidence": 89
},
"prosodyAnalysis": {
"pitchVariation": 15,
"naturalPauses": false,
"verdict": "synthetic"
},
"compressionAnalysis": {
"multipleCompressionDetected": false
},
"backgroundConsistency": {
"consistent": false,
"issues": ["Abrupt silence between words"]
}
},
"segments": [
{
"start": "0.0s",
"end": "5.2s",
"deepfakeConfidence": 94,
"text": "This is John Smith calling about..."
},
{
"start": "5.2s",
"end": "12.8s",
"deepfakeConfidence": 88,
"text": "...the urgent transfer request..."
}
]
}
Confidence Scores
| Score | Assessment | Action |
|---|---|---|
| 90-100% | Highly likely deepfake | Reject/Hang up |
| 70-89% | Likely deepfake | Manual verification required |
| 40-69% | Uncertain | Additional authentication |
| 0-39% | Likely authentic | Proceed |
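The table maps directly to a small dispatch function. The action names below are illustrative identifiers; the score bands come straight from the table.

```javascript
// Map a deepfake confidence score (0-100) to the recommended action
// from the confidence-score table above.
function recommendedAction(confidence) {
  if (confidence >= 90) return 'reject';
  if (confidence >= 70) return 'manual_verification';
  if (confidence >= 40) return 'additional_authentication';
  return 'proceed';
}
```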
Audio Requirements
Technical Requirements
| Requirement | Specification |
|---|---|
| Format | MP3, WAV, M4A, OGG |
| Min Sample Rate | 16 kHz |
| Recommended Sample Rate | 44.1 kHz |
| Min Duration | 3 seconds |
| Max Duration | 5 minutes |
| Max File Size | 20 MB |
| Min Bitrate | 64 kbps |
| Channel | Mono or Stereo |
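Enforcing these limits client-side before upload avoids wasting credits on files that will be rejected. A sketch, assuming you have already probed the file's metadata (e.g. with ffprobe or a decoder library); the error strings and function shape are illustrative, not API responses.

```javascript
// Pre-upload validation mirroring the technical-requirements table.
const LIMITS = {
  formats: ['mp3', 'wav', 'm4a', 'ogg'],
  minSampleRateHz: 16000,
  minDurationSec: 3,
  maxDurationSec: 300,          // 5 minutes
  maxFileBytes: 20 * 1024 * 1024, // 20 MB
  minBitrateKbps: 64,
};

function validateAudio({ format, sampleRateHz, durationSec, fileBytes, bitrateKbps }) {
  const errors = [];
  if (!LIMITS.formats.includes(format.toLowerCase())) errors.push('unsupported format');
  if (sampleRateHz < LIMITS.minSampleRateHz) errors.push('sample rate below 16 kHz');
  if (durationSec < LIMITS.minDurationSec) errors.push('shorter than 3 seconds');
  if (durationSec > LIMITS.maxDurationSec) errors.push('longer than 5 minutes');
  if (fileBytes > LIMITS.maxFileBytes) errors.push('larger than 20 MB');
  if (bitrateKbps < LIMITS.minBitrateKbps) errors.push('bitrate below 64 kbps');
  return errors; // empty array means the file is acceptable
}
```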
Quality Checks
Reject If:
- Sample rate < 16 kHz (too low for analysis)
- Duration < 3 seconds (insufficient data)
- Heavy noise (SNR < 10 dB)
- Clipped/distorted audio
- No speech detected
Use Cases
Phone-Based Verification
Scenario: Bank customer calls to authorize wire transfer
Process:
- Customer enrolled voice sample on file
- Customer calls and speaks passphrase
- Voice biometrics: Verify speaker identity
- Deepfake detection: Check for voice cloning
- Approve/Deny based on combined score
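The approve/deny step can be sketched as a function over the two scores, using the `deepfakeThreshold` and `biometricThreshold` values from the configuration example below. The function itself and its return values are illustrative, not part of the API:

```javascript
// Combine deepfake detection and voice biometrics into one decision.
// Thresholds mirror the configuration example (deepfakeThreshold: 70,
// biometricThreshold: 85); the decision logic is an illustrative policy.
function phoneVerificationDecision(
  { deepfakeConfidence, biometricScore },
  { deepfakeThreshold = 70, biometricThreshold = 85 } = {}
) {
  if (deepfakeConfidence >= deepfakeThreshold) return 'deny'; // likely cloned voice
  if (biometricScore < biometricThreshold) return 'deny';     // speaker does not match enrollment
  return 'approve';
}
```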
Implementation:
{
"verificationType": "phone",
"enrollmentSampleId": "enroll_abc123",
"passphrase": "My voice is my password",
"requirePassphraseMatch": true,
"deepfakeThreshold": 70,
"biometricThreshold": 85
}
CEO Fraud Prevention
Scenario: Employee receives call from "CEO" requesting urgent wire transfer
Red Flags:
- Urgency and secrecy requested
- Unusual request (CEO doesn't normally call)
- Poor call quality (may hide deepfake artifacts)
Verification:
- Record call audio
- Run deepfake detection
- If suspicious, call CEO back on known number
- Implement dual-approval for unusual requests
Voice Authentication
Scenario: Customer service voice verification
Multi-Factor Check:
- Knowledge: Answer security questions
- Voice Biometrics: Match enrolled voice
- Deepfake Detection: Ensure voice is real
- Behavioral: Analyze speech patterns
Best Practices
- Enroll voice samples - Collect clean voice sample during onboarding
- Set 70%+ deepfake threshold - Balance false positives vs. fraud
- Combine with other factors - Voice + knowledge questions + SMS code
- Use passphrases - Harder to clone specific phrases
- Monitor call quality - Poor quality may hide artifacts
- Train staff - Recognize social engineering red flags
- Callback verification - For high-risk requests, call back on known number
- Time delays - Implement cooling-off period for unusual requests
Limitations
Detection Accuracy
Current Performance:
- Known TTS models: 95%
- Voice cloning: 90%
- Audio splicing: 93%
- Overall: 90%
Challenges:
- High-quality voice clones (10+ minutes of training data)
- Professional audio editing
- Low-quality phone calls (masks artifacts)
- Background noise interference
False Positives
Common Causes:
- Poor phone connection (adds artifacts)
- Background noise (hides natural speech patterns)
- Non-native speakers (different prosody)
- Medical conditions (affects voice characteristics)
- Emotions (crying, stress alters voice)
Mitigation:
- Use 70%+ threshold
- Manual review for 70-89% range
- Request callback if uncertain
- Document legitimate reasons for unusual characteristics
Emerging Threats
Real-Time Voice Conversion
Technology:
- Real-time voice-to-voice transformation
- Latency < 100ms (imperceptible)
- Maintains emotion and prosody
- Very sophisticated
Detection:
- Spectral anomalies still present
- Slight latency in responses
- Background noise inconsistencies
Emotion Synthesis
New Capability:
- AI models that add emotion to synthetic speech
- Crying, laughing, stress
- Makes voice clones more convincing
Counter-Measures:
- Analyze emotional transitions (synthetic often too perfect)
- Check for natural vocal strain
- Verify emotional context matches content
API Reference
Analyze Audio
POST /api/v4/deepfake/analyse
// multipart/form-data
{
"audio": File,
"analysisType": "voice_verification" | "general",
"enrollmentSampleId": string, // Optional: for biometrics
"returnSegmentAnalysis": boolean // Per-segment deepfake scores
}
// Returns
{
"jobId": "job_audio123",
"status": "processing",
"estimatedTime": "8 seconds"
}
Enroll Voice Sample
POST /api/v4/voice/enroll
// multipart/form-data
{
"audio": File,
"applicantId": string,
"passphrase": string // Optional: specific phrase
}
// Returns enrollment ID for future biometric matching
{
"enrollmentId": "enroll_abc123",
"quality": 94,
"status": "active"
}
Verify Speaker
POST /api/v4/voice/verify
{
"enrollmentId": "enroll_abc123",
"audioSampleId": "job_audio123",
"includeDeepfakeCheck": true
}
// Returns combined biometric + deepfake result
{
"biometricMatch": {
"score": 92,
"result": "match"
},
"deepfakeDetection": {
"isDeepfake": false,
"confidence": 18
},
"overallResult": "verified",
"confidence": 89
}
Pricing
| Service | Processing Time | Cost |
|---|---|---|
| Audio deepfake detection | 5-10 seconds | 3 Credits |
| Voice enrollment | 3-5 seconds | 3 Credits |
| Voice verification (biometric + deepfake) | 8-12 seconds | Contact us |
Regulatory Landscape
Biometric Data Privacy
GDPR (EU):
- Voice is biometric data
- Explicit consent required
- Right to erasure applies
- Encryption required
BIPA (Illinois, US):
- Written policy required
- User consent before collection
- Cannot sell biometric data
CCPA (California):
- Privacy notice required
- Opt-out right
- Deletion right
Deepfake Regulations
US State Laws:
- California AB 730: Criminalizes deepfake videos in elections
- Texas HB 3004: Criminalizes deepfake videos without disclosure
- Virginia HB 2678: Criminalizes non-consensual deepfake pornography
EU AI Act:
- Disclosure requirements for deepfakes
- Transparency obligations
- High-risk AI system regulations
Next Steps
Ready to get started?
Start with our free plan. No credit card required.