
StressLess Voice Stress Detection Pipeline

PlantUML Diagrams

System Architecture Diagram

StressLess Android Voice Stress Detection - System Architecture

  Android App Layer
    • MainActivity: analyzeAudioFile(), showStressResult(), handlePermissions()
    • VoiceCheckViewModel: uiState: StateFlow<VoiceCheckState>; startRecording(), analyzeRecording()

  Business Logic Layer
    • StressAnalysisManager: analyzeWavFile(wavFilePath: String): StressAnalysisResult; generateStressRecommendations(): List<String>; validateInput(); aggregateResults()

  ML Processing Layer
    • AudioProcessor: loadWavFile(filePath: String): FloatArray; parseWavFile(): FloatArray; convertBytesToFloat(): FloatArray; removeNoiseAndSilence(): FloatArray
    • MFCCExtractor: extractMFCCFeatures(audio: FloatArray): FloatArray; applyPreEmphasis(): FloatArray; createFrames(): Array<FloatArray>; computeSpectrogram(): Array<FloatArray>; applyMelFilterBank(): Array<FloatArray>; computeDCT(): Array<FloatArray>; computeDeltaFeatures(): Array<FloatArray>
    • ECAPAStressEngine: ecapaInterpreter: Interpreter; stressClassifier: Interpreter; initialize(); extractStressEmbeddings(mfcc: FloatArray): FloatArray; classifyStress(embeddings: FloatArray): StressClassification

  Infrastructure Layer
    • ModelLoader: createInterpreter(assetPath: String): Interpreter; loadModel(assetPath: String): ByteBuffer; verifyModelChecksum(); preloadModels()
    • NPUDelegate: getDelegate(): Delegate; detectNPUType(): NPUType; createOptimalDelegate(): Delegate; createQualcommNPUDelegate(): Delegate; createGPUFallback(): Delegate

  Data Layer
    • LocalStressRepository: saveStressAssessment(result: StressAnalysisResult); getAssessmentById(id: String): StressAnalysisResult?; getAllAssessments(): Flow<List<StressAnalysisResult>>; exportAllData(): String
    • StressLessDatabase: getStressAssessmentDao(); clearAllTables()
    • EncryptionManager: encrypt(data: String): String; decrypt(encryptedData: String): String; clearAllKeys()

  TensorFlow Lite Models
    • EcapaModel: ecapa_tdnn_stress.tflite (192-dimensional embeddings)
    • ClassifierModel: stress_classifier.tflite (10-level stress classification)

Stress Analysis Pipeline Flow

Voice Stress Detection Pipeline Flow

Input: WAV audio file (16kHz mono preferred); the user selects a WAV file.

  Stage 1: Audio Preprocessing
    • Load WAV file from storage
    • Parse WAV header (sample rate, channels, bit depth)
    • Convert bytes to normalized float array
    • Apply noise reduction & silence removal
    • Validate audio quality: if acceptable, continue processing; otherwise return a quality warning

  Stage 2: Feature Extraction (MFCC)
    • Apply pre-emphasis filter (α = 0.97)
    • Create overlapping frames (512 samples, 256 hop)
    • Apply Hamming window
    • Compute FFT spectrogram
    • Apply Mel filter bank (26 filters, 300-8000Hz)
    • Compute DCT for MFCC (13 coefficients)
    • Calculate delta and delta-delta features
    • Combine into 39-dimensional features

  Stage 3: ECAPA-TDNN Embedding
    • Initialize ECAPA-TDNN model (TensorFlow Lite)
    • Prepare input tensor (batch_size=1, features=39*frames)
    • Run ECAPA inference (SE-Res2Net blocks), with NPU acceleration if available (Qualcomm QNN)
    • Apply attention pooling
    • Extract 192-dim embeddings

  Stage 4: Stress Classification
    • Load stress classifier model
    • Input embeddings to classifier
    • Compute softmax probabilities (10 stress levels)
    • Select max probability index and convert to stress level (1-10)
    • Calculate confidence score

  Stage 5: Result Processing
    • Generate contextual recommendations based on stress level
    • Create StressAnalysisResult object
    • Encrypt sensitive data and save to local database
    • Log performance metrics

Output: stress analysis result (level 1-10, confidence, recommendations).

MFCC Feature Extraction Process

MFCC Feature Extraction Detailed Process

Input: raw audio signal as FloatArray (16kHz mono).

  1. Apply Pre-emphasis Filter: y[n] = x[n] - 0.97 * x[n-1] (balances the frequency spectrum and improves SNR)

  2. Frame the Signal: frame size 512 samples, hop length 256 samples, 50% overlap (creates overlapping windows for temporal analysis)

  3. Apply Hamming Window: w[n] = 0.54 - 0.46*cos(2πn/(N-1)) (reduces spectral leakage from windowing)

  4. Compute FFT: X[k] = Σ(x[n] * e^(-j2πkn/N)) (converts to the frequency domain; magnitude spectrum)

  5. Apply Mel Filter Bank: 26 triangular filters over the 300Hz-8000Hz range (mimics human auditory perception via the mel scale)

  6. Take Logarithm: log(mel_spectrum) (compresses dynamic range; models human hearing)

  7. Compute DCT: MFCC[i] = Σ(log_mel[j] * cos(i*(j+0.5)*π/J)), yielding 13 MFCC coefficients (decorrelates the coefficients)

  8. Calculate Delta Features: δ[t] = (c[t+1] - c[t-1]) / 2 (13 first-order derivative coefficients)

  9. Calculate Delta-Delta Features: δδ[t] = (δ[t+1] - δ[t-1]) / 2 (13 second-order derivative coefficients)

  10. Combine Features: [MFCC(13) + Delta(13) + Delta-Delta(13)], 39 features per frame, flattened for model input

Output: a 39-dimensional feature vector per audio frame.

ECAPA-TDNN Architecture

ECAPA-TDNN Model Architecture for Stress Detection

Input: MFCC features [batch_size, feature_dim].

  ECAPA-TDNN Encoder
    • Conv1D layer (channels=512, kernel=5, dilation=1)
    • SE-Res2Net Block 1: residual connection, Squeeze-and-Excitation (channel attention), Res2Net convolution (dilation=1), batch normalization, ReLU activation
    • SE-Res2Net Block 2: same structure with dilation=2
    • SE-Res2Net Block 3: same structure with dilation=3
    • Attentive statistics pooling: compute attention weights, then weighted mean and std pooling
    • Dense layer (1536 → 192 dimensions) with L2 normalization
    • Output: speaker embeddings (192-dimensional vector)

  Stress Classifier
    • Input: embeddings (192-dim)
    • Dense layer (192 → 64), ReLU activation, dropout (0.3)
    • Dense layer (64 → 10), softmax activation
    • Output: stress probabilities (10 levels); select the maximum probability and convert to a stress level (1-10)

Data Flow Timeline

StressLess Processing Timeline and Data Flow (timeline in milliseconds; ticks at 0, 100, 300, 600, 1200, 1800, 2200, 2500, 2700, 2900, 3000)

  • User Interaction: Select WAV file → Processing... → Showing results → Results displayed
  • Audio Processing: Idle → Loading WAV → Parsing audio → Noise reduction → Complete
  • Feature Extraction: Idle → Starting MFCC → Computing FFT → Mel filtering → DCT + deltas → Complete
  • ML Inference: Models loaded → Waiting → ECAPA inference → Classification → Complete
  • Result Generation: Idle → Starting → Generating advice → Complete
  • Database Storage: Ready → Saving → Saved

Pipeline Description

Overview

The StressLess voice stress detection pipeline is a sophisticated on-device machine learning system that analyzes human voice patterns to detect stress levels on a scale of 1-10. The pipeline combines traditional signal processing techniques with a modern deep learning architecture, specifically an adapted ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in TDNN) model.

Stage-by-Stage Breakdown

Stage 1: Audio Preprocessing (300-600ms)

Purpose: Prepare raw WAV audio for feature extraction
Input: WAV file (any sample rate, preferably 16kHz mono)
Output: Normalized float array

Key Operations:

  1. WAV File Parsing: Custom Kotlin implementation to read WAV headers and extract PCM data

  2. Sample Rate Normalization: Convert to 16kHz if needed for model compatibility

  3. Noise Reduction: Apply spectral subtraction and noise gating (threshold = 0.01)

  4. Silence Removal: Trim leading/trailing silence using energy-based detection

  5. Amplitude Normalization: Scale to [-1, 1] range to prevent saturation

Performance Target: <600ms for a 30-second audio file
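
A minimal Kotlin sketch of the parsing and normalization steps (operations 1 and 5 above), assuming a canonical 44-byte WAV header and 16-bit little-endian PCM; loadWavAsFloats is an illustrative name, and the production AudioProcessor additionally handles resampling, noise gating, and silence trimming:

    import java.io.File
    import java.nio.ByteBuffer
    import java.nio.ByteOrder

    fun loadWavAsFloats(path: String): FloatArray {
        val bytes = File(path).readBytes()
        require(bytes.size > 44 && String(bytes, 0, 4) == "RIFF" &&
                String(bytes, 8, 4) == "WAVE") { "Not a WAV file" }
        val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
        val sampleRate = buf.getInt(24)                 // drives resampling in the full pipeline; models expect 16 kHz
        val bitsPerSample = buf.getShort(34).toInt()
        require(bitsPerSample == 16) { "This sketch handles 16-bit PCM only" }
        // Normalize each 16-bit sample to [-1, 1] to prevent saturation downstream.
        return FloatArray((bytes.size - 44) / 2) { i -> buf.getShort(44 + i * 2) / 32768f }
    }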

Stage 2: MFCC Feature Extraction (1200-1900ms)

Purpose: Convert time-domain audio to perceptually relevant features
Input: Normalized audio signal
Output: 39-dimensional feature vector per frame

Mathematical Process:

  1. Pre-emphasis Filter: y[n] = x[n] - 0.97 * x[n-1], which balances the frequency spectrum

  2. Windowing: 512-sample Hamming windows with a 256-sample hop (50% overlap)

  3. FFT: Compute the magnitude spectrum for each frame

  4. Mel Filter Bank: 26 triangular filters spanning 300-8000Hz, the perceptually most important range for speech

  5. Logarithm: log(mel_spectrum), which compresses the dynamic range

  6. DCT: Extract 13 MFCC coefficients, which decorrelates the features

  7. Delta Features: First derivatives, δ[t] = (c[t+1] - c[t-1]) / 2

  8. Delta-Delta Features: Second derivatives capturing temporal dynamics

Scientific Basis: MFCCs model human auditory perception and have proven effective for speech emotion recognition with 70-80% accuracy in stress detection tasks.
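The first steps of this process are compact enough to sketch in Kotlin; preEmphasis and hammingFrames are illustrative names, with the frame size and hop matching the pipeline parameters (512 samples, 256 hop); the FFT, mel, and DCT stages would follow:

    import kotlin.math.PI
    import kotlin.math.cos

    fun preEmphasis(x: FloatArray, alpha: Float = 0.97f): FloatArray {
        val y = FloatArray(x.size)
        if (x.isNotEmpty()) y[0] = x[0]
        for (n in 1 until x.size) y[n] = x[n] - alpha * x[n - 1]   // y[n] = x[n] - 0.97*x[n-1]
        return y
    }

    fun hammingFrames(x: FloatArray, size: Int = 512, hop: Int = 256): List<FloatArray> {
        // w[n] = 0.54 - 0.46*cos(2πn/(N-1)), applied to each overlapping frame.
        val window = FloatArray(size) { n ->
            (0.54 - 0.46 * cos(2.0 * PI * n / (size - 1))).toFloat()
        }
        return (0..x.size - size step hop).map { start ->
            FloatArray(size) { i -> x[start + i] * window[i] }
        }
    }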

Stage 3: ECAPA-TDNN Embedding Extraction (400-600ms)

Purpose: Extract high-level speaker and stress-related representations
Input: 39-dimensional MFCC features
Output: 192-dimensional embedding vector

Architecture Details:

  • SE-Res2Net Blocks: Three blocks with dilations [1, 2, 3] for multi-scale temporal modeling

  • Channel Attention: Squeeze-and-Excitation modules for feature refinement

  • Temporal Modeling: TDNN layers capture long-range dependencies

  • Attentive Pooling: Weighted statistics pooling over temporal dimension

Adaptation for Stress: Originally designed for speaker verification, the model is fine-tuned on stress-labeled speech data to capture stress-related vocal characteristics such as pitch variation, speaking rate, and voice quality changes.
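
A hedged sketch of the embedding call using the standard TensorFlow Lite Interpreter API; the tensor shapes ([1, frames * 39] in, [1, 192] out) are assumed from the description above and depend on how the model was exported:

    import org.tensorflow.lite.Interpreter

    fun extractEmbeddings(ecapa: Interpreter, features: FloatArray): FloatArray {
        val input = arrayOf(features)             // assumed shape [1, frames * 39]
        val output = arrayOf(FloatArray(192))     // assumed shape [1, 192]
        ecapa.run(input, output)                  // single-input/single-output inference
        return output[0]
    }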

Stage 4: Stress Classification (100-200ms)

Purpose: Convert embeddings to stress probability distribution
Input: 192-dimensional embeddings
Output: Stress level (1-10) with confidence score

Classification Network:

  • Input Layer: 192 dimensions (ECAPA embeddings)

  • Hidden Layer: 64 neurons with ReLU activation and 30% dropout

  • Output Layer: 10 neurons (stress levels) with softmax activation

  • Decision: stress_level = argmax(probabilities) + 1
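The decision step reduces to an argmax plus a lookup of the winning probability; a minimal sketch (toStressLevel is an illustrative name):

    // Map 10 softmax outputs to a 1-10 stress level and a confidence score.
    fun toStressLevel(probabilities: FloatArray): Pair<Int, Float> {
        require(probabilities.size == 10) { "Expected 10 stress-level probabilities" }
        val maxIndex = probabilities.indices.maxByOrNull { probabilities[it] }!!
        return Pair(maxIndex + 1, probabilities[maxIndex])   // level, confidence
    }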

Stage 5: Result Processing and Storage (100-200ms)

Purpose: Generate actionable insights and persist results
Input: Classification results
Output: Complete StressAnalysisResult with recommendations

Recommendation Engine:

  • Low Stress (1-3): Positive reinforcement and maintenance strategies

  • Moderate Stress (4-6): Basic stress management techniques (breathing exercises)

  • High Stress (7-10): Immediate intervention suggestions and professional guidance
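A sketch of the banding logic described above; the recommendation strings are placeholders, not the app's actual copy:

    fun recommendationsFor(level: Int): List<String> = when (level) {
        in 1..3 -> listOf("Keep up your current routine",          // low: reinforcement
                          "Maintain regular sleep and exercise")
        in 4..6 -> listOf("Try a slow breathing exercise",         // moderate: basic techniques
                          "Take a short walk or stretch break")
        else    -> listOf("Consider a guided relaxation session",  // high: intervention
                          "If high stress persists, seek professional guidance")
    }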

Technical Specifications

Performance Metrics

  • Accuracy: ~77.5% (based on ECAPA-TDNN research for stress detection)

  • Processing Time: <3 seconds total pipeline

  • Memory Usage: <200MB peak during inference

  • Battery Impact: <2% per analysis session

  • Model Size: ~15MB total (ECAPA + Classifier)

Hardware Acceleration

  • NPU Support: Qualcomm Snapdragon (QNN delegate), MediaTek (NeuroPilot)

  • GPU Fallback: Mali, Adreno graphics processors

  • CPU Fallback: ARM Cortex-A series with NEON optimization
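A sketch of the GPU/CPU end of this fallback chain using the stock TensorFlow Lite GPU delegate; the Qualcomm QNN path is assumed to live behind the app's NPUDelegate helper and is omitted here:

    import org.tensorflow.lite.Interpreter
    import org.tensorflow.lite.gpu.CompatibilityList
    import org.tensorflow.lite.gpu.GpuDelegate

    fun interpreterOptions(): Interpreter.Options {
        val options = Interpreter.Options()
        if (CompatibilityList().isDelegateSupportedOnThisDevice) {
            options.addDelegate(GpuDelegate())    // GPU fallback (Mali/Adreno)
        } else {
            options.setNumThreads(4)              // CPU fallback with multithreading
        }
        return options
    }

These options would then be passed when constructing each Interpreter, e.g. Interpreter(modelBuffer, interpreterOptions()).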

Data Privacy & Security

  • On-Device Processing: No voice data transmitted over network

  • Encrypted Storage: SQLCipher for assessment results

  • GDPR Compliance: Export/delete functionality built-in

  • Model Security: Checksum verification for model integrity
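The model-integrity check can be sketched with the JDK's MessageDigest; verifyModelChecksum mirrors the ModelLoader method name, but this body and the SHA-256 choice are assumptions, and the expected digest would ship with the app:

    import java.security.MessageDigest

    fun verifyModelChecksum(modelBytes: ByteArray, expectedHex: String): Boolean {
        // Compare the SHA-256 of the bundled .tflite asset against a known-good digest.
        val digest = MessageDigest.getInstance("SHA-256").digest(modelBytes)
        val actualHex = digest.joinToString("") { "%02x".format(it) }
        return actualHex.equals(expectedHex, ignoreCase = true)
    }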

Scientific Foundation

The pipeline is based on established research in computational paralinguistics and affective computing:

  1. MFCC Features: Proven in emotion recognition with 70-85% accuracy rates

  2. ECAPA-TDNN: State-of-the-art speaker verification architecture adapted for stress

  3. Temporal Modeling: Dilated convolutions capture stress-related temporal patterns

  4. Attention Mechanisms: Focus on stress-relevant spectral and temporal regions

Integration Points

Android Framework Integration

  • AudioRecord: Real-time audio capture in 16kHz PCM format

  • TensorFlow Lite: Optimized inference with hardware acceleration

  • Room Database: Encrypted local storage for assessment history

  • WorkManager: Background processing for batch analysis
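A minimal sketch of the 16kHz mono PCM capture setup listed above; runtime permission handling (RECORD_AUDIO) and recorder lifecycle management are omitted:

    import android.media.AudioFormat
    import android.media.AudioRecord
    import android.media.MediaRecorder

    fun createRecorder(): AudioRecord {
        val sampleRate = 16000   // matches the pipeline's expected input rate
        val minBuf = AudioRecord.getMinBufferSize(
            sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT)
        // Requires the RECORD_AUDIO permission; otherwise the recorder stays uninitialized.
        return AudioRecord(
            MediaRecorder.AudioSource.MIC, sampleRate,
            AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, minBuf * 2)
    }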

Quality Assurance

  • Audio Quality Assessment: Real-time SNR and clipping detection

  • Model Validation: Continuous accuracy monitoring with known test samples

  • Performance Monitoring: Latency and memory usage tracking

  • Error Recovery: Graceful degradation for poor quality inputs
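A rough sketch of such a quality gate, combining a clipping-ratio check with a crude energy-based proxy; the thresholds (1% clipped samples, minimum RMS of 0.01) are illustrative, not the production values:

    import kotlin.math.abs
    import kotlin.math.sqrt

    fun audioQualityOk(samples: FloatArray): Boolean {
        if (samples.isEmpty()) return false
        val clipped = samples.count { abs(it) > 0.99f }          // near-full-scale samples
        val rms = sqrt(samples.map { it * it }.average())        // overall signal energy
        return clipped < samples.size / 100 && rms > 0.01
    }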

Performance Timeline Summary

Total Processing Time: ~3 seconds

  • Audio loading: 0.3s

  • Feature extraction: 1.9s

  • ML inference: 0.6s

  • Result generation: 0.2s

Key Features Summary

ECAPA-TDNN Architecture Benefits:

  • SE-Res2Net blocks with dilated convolutions

  • Attention-based temporal pooling

  • 192-dimensional speaker embeddings

  • Adapted for stress pattern recognition

  • TensorFlow Lite optimized for mobile

This pipeline represents a production-ready implementation of voice stress analysis technology, balancing accuracy, performance, and privacy requirements for mobile deployment.
