Audio Deepfake Detection
Detect AI-generated voices, voice cloning, and audio manipulation to prevent social engineering attacks, CEO fraud, and phone-based identity theft.
Common Attack Vectors
Voice Cloning
AI Voice Synthesis:
- Clone someone's voice from 3-10 seconds of audio
- Tools: ElevenLabs, Descript, Play.ht
- Convincing enough to fool family members
- Used in CEO fraud, grandparent scams
How It Works:
- Attacker obtains voice sample (social media, voicemail, YouTube)
- Feed sample to text-to-speech AI model
- Model generates new speech in target's voice
- Call victim with cloned voice
Real-World Example:
- 2019: CEO voice cloned, $243,000 stolen from UK energy company
- 2020: Bank manager fooled by deepfake voice, transferred $35M
Text-to-Speech (TTS) Attacks
Synthetic Voice:
- AI-generated voice (no real person)
- Generic or custom voice profiles
- Realistic prosody and intonation
- Used in automated scam calls
Characteristics:
- Unnatural speech patterns
- Robotic transitions between words
- Consistent tone (lack of emotion)
- Background noise inconsistencies
Audio Splicing
Cut-and-Paste Audio:
- Splice together real audio clips
- Rearrange words/sentences
- Create fake statements from real recordings
- Detectable via frequency analysis
Background Noise Manipulation
Indicators of Fake Audio:
- Inconsistent background noise
- Abrupt changes in ambient sound
- Silence where noise expected
- Added artificial background to mask synthesis
Detection Capabilities
1. Spectral Analysis
Analyze frequency patterns to detect AI generation.
What We Check:
- Frequency Range: AI voices often lack full human frequency range
- Harmonic Patterns: Unnatural harmonic structures
- Spectral Anomalies: Artifacts specific to TTS models
- Noise Floor: Consistent vs. natural noise patterns
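The frequency-range check above can be reduced to a simple post-processing heuristic. This is an illustrative sketch, not part of the API: the function name and the 8000 Hz minimum span are assumptions, and the low/high bounds would come from your own measurement (or from parsing the `spectralAnalysis` findings).

```javascript
// Rough heuristic: natural speech spans roughly 80-12000 Hz, so a much
// narrower measured range (e.g. 200-3500 Hz) suggests band-limited or
// synthetic audio. The 8000 Hz span threshold is an illustrative assumption.
function isBandLimited(measuredLowHz, measuredHighHz, minNaturalSpanHz = 8000) {
  const span = measuredHighHz - measuredLowHz;
  return span < minNaturalSpanHz;
}

// The 200-3500 Hz range from the example below would be flagged:
// isBandLimited(200, 3500)  -> true
// isBandLimited(80, 12000)  -> false
```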
Detection Example:
{
"spectralAnalysis": {
"suspiciousPatterns": true,
"findings": [
"Limited frequency range (200-3500 Hz, natural is 80-12000 Hz)",
"Unnatural harmonic spacing",
"Consistent noise floor (indicates synthetic generation)"
],
"confidence": 87
}
}
2. Prosody Analysis
Analyze natural speech rhythm and intonation.
Human Speech:
- Variable pitch and tone
- Natural pauses and emphasis
- Emotion-driven variation
- Breathing patterns
Synthetic Speech:
- Consistent pitch/tone
- Mechanical pauses
- Lack of emotional variation
- No breathing sounds (or artificial breathing)
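The human-vs-synthetic contrast above can be sketched as a toy scoring rule over the fields returned in `prosodyAnalysis`. The thresholds (25 for pitch variation, 20 for emotional range) and the three-flag cutoff are illustrative assumptions, not documented API behavior:

```javascript
// Toy verdict rule over prosody fields: count synthetic-speech indicators
// and flag the sample when most of them fire. Thresholds are assumptions
// to be tuned against labelled data, not API defaults.
function prosodyVerdict({ pitchVariation, naturalPauses, emotionalRange, breathingDetected }) {
  let flags = 0;
  if (pitchVariation < 25) flags++;   // low pitch variation
  if (!naturalPauses) flags++;        // mechanical pauses
  if (emotionalRange < 20) flags++;   // flat emotional range
  if (!breathingDetected) flags++;    // no breathing sounds
  return flags >= 3 ? 'likely_synthetic' : 'likely_human';
}
```

Applied to the example response below (pitch variation 12, no natural pauses, emotional range 8, no breathing), all four indicators fire and the rule returns `likely_synthetic`.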
Analysis:
{
"prosodyAnalysis": {
"pitchVariation": 12, // 0-100, low = suspicious
"naturalPauses": false,
"emotionalRange": 8, // Very low
"breathingDetected": false,
"verdict": "likely_synthetic"
}
}
3. Voice Biometrics
Compare voice to known sample for speaker verification.
Use Case: Verify caller is who they claim to be
Process:
- User provides known voice sample (enrollment)
- Subsequent calls analyzed for voice match
- Deepfake detection + biometric matching
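A minimal client-side decision over the `voiceBiometrics` result might look like this. It is a sketch, not an API feature: the 85 cutoff comes from the comment in the example response below, and treating any deepfake indicator as an automatic reject is an assumed policy.

```javascript
// Combine biometric matching with deepfake indicators: any deepfake
// indicator rejects outright; otherwise accept only on a strong match.
// The 85 threshold mirrors the "should be 85+ for match" guidance.
function biometricDecision(result, minMatchScore = 85) {
  const { matchScore, deepfakeIndicators = [] } = result;
  if (deepfakeIndicators.length > 0) return 'reject';
  return matchScore >= minMatchScore ? 'accept' : 'no_match';
}
```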
Result:
{
"voiceBiometrics": {
"matchScore": 42, // 0-100, should be 85+ for match
"result": "no_match",
"reason": "Spectral characteristics differ from enrolled sample",
"deepfakeIndicators": [
"Synthetic voice detected",
"Voice characteristics inconsistent with enrollment"
]
}
}
4. Compression Artifacts
Detect digital manipulation through compression analysis.
Indicators:
- Multiple compression layers (edited audio)
- Inconsistent compression across file
- Splice points with compression mismatch
- Unusual codec combinations
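When acting on a compression report, you typically only care about splice points the detector is confident about. This helper is illustrative; the 85 default threshold is an assumption, and the field names mirror the example response below.

```javascript
// Extract splice-point timestamps whose detection confidence meets the
// threshold. Field names match the compressionAnalysis example response;
// the default threshold of 85 is an illustrative choice.
function highConfidenceSplices(compressionAnalysis, threshold = 85) {
  return (compressionAnalysis.splicePoints || [])
    .filter((p) => p.confidence >= threshold)
    .map((p) => p.timestamp);
}
```

For the example response below, both splice points (confidence 89 and 92) clear the threshold, yielding `['2.3s', '5.7s']`.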
Detection:
{
"compressionAnalysis": {
"multipleCompressionDetected": true,
"layers": 3, // Indicates editing
"splicePoints": [
{ "timestamp": "2.3s", "confidence": 89 },
{ "timestamp": "5.7s", "confidence": 92 }
],
"verdict": "edited_audio"
}
}
5. Background Consistency
Analyze ambient noise patterns.
Authentic Audio:
- Consistent background noise
- Natural environmental sounds
- Smooth transitions
Manipulated Audio:
- Abrupt background changes
- Silence between words (noise removed)
- Artificial background added
- Mismatched environmental acoustics
Analysis Process
Step 1: Upload Audio
const formData = new FormData();
formData.append('audio', audioFile);
formData.append('analysisType', 'voice_verification'); // or 'general'
formData.append('compareToSample', enrollmentAudioId); // Optional: voice biometrics
const job = await fetch('/api/v4/deepfake/analyse', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`
},
body: formData
});
// Returns job ID
{
"jobId": "job_audio123",
"status": "processing",
"estimatedTime": "8 seconds"
}
Step 2: Processing
Analysis Pipeline (5-10 seconds):
- Audio preprocessing (noise reduction, normalization)
- Spectral analysis
- Prosody and intonation analysis
- Compression artifact detection
- Background consistency check
- Voice biometrics (if enrollment sample provided)
- AI model fingerprinting
- Final risk scoring
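Because analysis takes a few seconds, results are fetched by polling the jobs endpoint from Step 3. This is a minimal polling sketch, not an official client: the interval, attempt limit, and terminal statuses (`completed`/`failed`) are assumptions, and `fetchFn` is injected only so the loop can be exercised without a live API (in production, pass the global `fetch`).

```javascript
// Poll GET /api/v4/deepfake/jobs/:jobId until the job reaches a terminal
// status or the attempt budget is exhausted. fetchFn is injectable for
// testing; interval and maxAttempts defaults are illustrative.
async function pollJob(jobId, fetchFn, { intervalMs = 1000, maxAttempts = 10 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchFn(`/api/v4/deepfake/jobs/${jobId}`);
    const body = await res.json();
    if (body.status === 'completed' || body.status === 'failed') return body;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}
```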
Step 3: Receive Results
GET /api/v4/deepfake/jobs/:jobId
{
"jobId": "job_audio123",
"status": "completed",
"audio": {
"filename": "phone_call.mp3",
"duration": "45 seconds",
"format": "MP3",
"sampleRate": "44100 Hz",
"bitrate": "128 kbps"
},
"result": {
"isDeepfake": true,
"confidence": 91,
"manipulationType": "voice_cloning",
"aiModel": "ElevenLabs-like (suspected)",
"riskScore": 88,
"recommendation": "reject"
},
"analysis": {
"spectralAnalysis": {
"suspiciousPatterns": true,
"confidence": 89
},
"prosodyAnalysis": {
"pitchVariation": 15,
"naturalPauses": false,
"verdict": "synthetic"
},
"compressionAnalysis": {
"multipleCompressionDetected": false
},
"backgroundConsistency": {
"consistent": false,
"issues": ["Abrupt silence between words"]
}
},
"segments": [
{
"start": "0.0s",
"end": "5.2s",
"deepfakeConfidence": 94,
"text": "This is John Smith calling about..."
},
{
"start": "5.2s",
"end": "12.8s",
"deepfakeConfidence": 88,
"text": "...the urgent transfer request..."
}
]
}
Confidence Scores
| Score | Assessment | Action |
|---|---|---|
| 90-100% | Highly likely deepfake | Reject/Hang up |
| 70-89% | Likely deepfake | Manual verification required |
| 40-69% | Uncertain | Additional authentication |
| 0-39% | Likely authentic | Proceed |
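The table maps directly to a small dispatch function. The action names below are illustrative identifiers; the score bands come straight from the table.

```javascript
// Map a deepfake confidence score (0-100) to the recommended action
// from the confidence-score table above.
function recommendedAction(confidence) {
  if (confidence >= 90) return 'reject';
  if (confidence >= 70) return 'manual_verification';
  if (confidence >= 40) return 'additional_authentication';
  return 'proceed';
}
```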
Audio Requirements
Technical Requirements
| Requirement | Specification |
|---|---|
| Format | MP3, WAV, M4A, OGG |
| Min Sample Rate | 16 kHz |
| Recommended Sample Rate | 44.1 kHz |
| Min Duration | 3 seconds |
| Max Duration | 5 minutes |
| Max File Size | 20 MB |
| Min Bitrate | 64 kbps |
| Channel | Mono or Stereo |
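Enforcing these limits client-side before upload avoids wasting credits on files that will be rejected. A sketch, assuming you have already probed the file's metadata (e.g. with ffprobe or a decoder library); the error strings and function shape are illustrative, not API responses.

```javascript
// Pre-upload validation mirroring the technical-requirements table.
const LIMITS = {
  formats: ['mp3', 'wav', 'm4a', 'ogg'],
  minSampleRateHz: 16000,
  minDurationSec: 3,
  maxDurationSec: 300,          // 5 minutes
  maxFileBytes: 20 * 1024 * 1024, // 20 MB
  minBitrateKbps: 64,
};

function validateAudio({ format, sampleRateHz, durationSec, fileBytes, bitrateKbps }) {
  const errors = [];
  if (!LIMITS.formats.includes(format.toLowerCase())) errors.push('unsupported format');
  if (sampleRateHz < LIMITS.minSampleRateHz) errors.push('sample rate below 16 kHz');
  if (durationSec < LIMITS.minDurationSec) errors.push('shorter than 3 seconds');
  if (durationSec > LIMITS.maxDurationSec) errors.push('longer than 5 minutes');
  if (fileBytes > LIMITS.maxFileBytes) errors.push('larger than 20 MB');
  if (bitrateKbps < LIMITS.minBitrateKbps) errors.push('bitrate below 64 kbps');
  return errors; // empty array means the file is acceptable
}
```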
Quality Checks
Reject If:
- Sample rate < 16 kHz (too low for analysis)
- Duration < 3 seconds (insufficient data)
- Heavy noise (SNR < 10 dB)
- Clipped/distorted audio
- No speech detected
Use Cases
Phone-Based Verification
Scenario: Bank customer calls to authorize wire transfer
Process:
- Customer enrolled voice sample on file
- Customer calls and speaks passphrase
- Voice biometrics: Verify speaker identity
- Deepfake detection: Check for voice cloning
- Approve/Deny based on combined score
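The approve/deny step can be sketched as a function over the two scores, using the `deepfakeThreshold` and `biometricThreshold` values from the configuration example below. The function itself and its return values are illustrative, not part of the API:

```javascript
// Combine deepfake detection and voice biometrics into one decision.
// Thresholds mirror the configuration example (deepfakeThreshold: 70,
// biometricThreshold: 85); the decision logic is an illustrative policy.
function phoneVerificationDecision(
  { deepfakeConfidence, biometricScore },
  { deepfakeThreshold = 70, biometricThreshold = 85 } = {}
) {
  if (deepfakeConfidence >= deepfakeThreshold) return 'deny'; // likely cloned voice
  if (biometricScore < biometricThreshold) return 'deny';     // speaker does not match enrollment
  return 'approve';
}
```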
Implementation:
{
"verificationType": "phone",
"enrollmentSampleId": "enroll_abc123",
"passphrase": "My voice is my password",
"requirePassphraseMatch": true,
"deepfakeThreshold": 70,
"biometricThreshold": 85
}
CEO Fraud Prevention
Scenario: Employee receives call from "CEO" requesting urgent wire transfer
Red Flags:
- Urgency and secrecy requested
- Unusual request (CEO doesn't normally call)
- Poor call quality (may hide deepfake artifacts)
Verification:
- Record call audio
- Run deepfake detection
- If suspicious, call CEO back on known number
- Implement dual-approval for unusual requests
Voice Authentication
Scenario: Customer service voice verification
Multi-Factor Check:
- Knowledge: Answer security questions
- Voice Biometrics: Match enrolled voice
- Deepfake Detection: Ensure voice is real
- Behavioral: Analyze speech patterns
Best Practices
- Enroll voice samples - Collect clean voice sample during onboarding
- Set 70%+ deepfake threshold - Balance false positives vs. fraud
- Combine with other factors - Voice + knowledge questions + SMS code
- Use passphrases - Harder to clone specific phrases
- Monitor call quality - Poor quality may hide artifacts
- Train staff - Recognize social engineering red flags
- Callback verification - For high-risk requests, call back on known number
- Time delays - Implement cooling-off period for unusual requests
Limitations
Detection Accuracy
Current Performance:
- Known TTS models: 95%
- Voice cloning: 90%
- Audio splicing: 93%
- Overall: 90%
Challenges:
- High-quality voice clones (10+ minutes of training data)
- Professional audio editing
- Low-quality phone calls (masks artifacts)
- Background noise interference
False Positives
Common Causes:
- Poor phone connection (adds artifacts)
- Background noise (hides natural speech patterns)
- Non-native speakers (different prosody)
- Medical conditions (affects voice characteristics)
- Emotions (crying, stress alters voice)
Mitigation:
- Use 70%+ threshold
- Manual review for 70-89% range
- Request callback if uncertain
- Document legitimate reasons for unusual characteristics
Emerging Threats
Real-Time Voice Conversion
Technology:
- Real-time voice-to-voice transformation
- Latency < 100ms (imperceptible)
- Maintains emotion and prosody
- Very sophisticated
Detection:
- Spectral anomalies still present
- Slight latency in responses
- Background noise inconsistencies
Emotion Synthesis
New Capability:
- AI models that add emotion to synthetic speech
- Crying, laughing, stress
- Makes voice clones more convincing
Counter-Measures:
- Analyze emotional transitions (synthetic often too perfect)
- Check for natural vocal strain
- Verify emotional context matches content
API Reference
Analyze Audio
POST /api/v4/deepfake/analyse
// multipart/form-data
{
"audio": File,
"analysisType": "voice_verification" | "general",
"enrollmentSampleId": string, // Optional: for biometrics
"returnSegmentAnalysis": boolean // Per-segment deepfake scores
}
// Returns
{
"jobId": "job_audio123",
"status": "processing",
"estimatedTime": "8 seconds"
}
Enroll Voice Sample
POST /api/v4/voice/enroll
// multipart/form-data
{
"audio": File,
"applicantId": string,
"passphrase": string // Optional: specific phrase
}
// Returns enrollment ID for future biometric matching
{
"enrollmentId": "enroll_abc123",
"quality": 94,
"status": "active"
}
Verify Speaker
POST /api/v4/voice/verify
{
"enrollmentId": "enroll_abc123",
"audioSampleId": "job_audio123",
"includeDeepfakeCheck": true
}
// Returns combined biometric + deepfake result
{
"biometricMatch": {
"score": 92,
"result": "match"
},
"deepfakeDetection": {
"isDeepfake": false,
"confidence": 18
},
"overallResult": "verified",
"confidence": 89
}
Pricing
| Service | Processing Time | Cost |
|---|---|---|
| Audio deepfake detection | 5-10 seconds | 3 Credits |
| Voice enrollment | 3-5 seconds | 3 Credits |
| Voice verification (biometric + deepfake) | 8-12 seconds | Contact us |
Regulatory Landscape
Biometric Data Privacy
GDPR (EU):
- Voice is biometric data
- Explicit consent required
- Right to erasure applies
- Encryption required
BIPA (Illinois, US):
- Written policy required
- User consent before collection
- Cannot sell biometric data
CCPA (California):
- Privacy notice required
- Opt-out right
- Deletion right
Deepfake Regulations
US State Laws:
- California AB 730: Criminalizes deepfake videos in elections
- Texas HB 3004: Criminalizes deepfake videos without disclosure
- Virginia HB 2678: Criminalizes non-consensual deepfake pornography
EU AI Act:
- Disclosure requirements for deepfakes
- Transparency obligations
- High-risk AI system regulations
Next Steps
Ready to get started?
Start with our free plan. No credit card required.