信息处理 - 验证 MFCC 最终结果 - 吾爱随笔录

我为语音识别的 MFCC 计算创建了一个 Octave 代码，它以 wav 文件作为输入，然后将 MFCC 作为输出。但是最终的结果有点混乱，所以我无法验证它们是否正确。

我的过程：-

阅读文件，提取样本，频率。
通过将数据传递给alpha=0.95的高通滤波器来执行预加重
数据的框架。我的帧大小是 256 和重叠 128。
创建一个大小为 256 的汉明窗口，将其与每一帧相乘。
计算每一帧的 FFT，然后找到它的绝对值。并且只选择前半部分，因为它指的是正频率。
创建了20 个 Mel 滤波器，低频率=300Hz和高频率=8000Hz。现在将其应用于每一帧，以获得每帧 20 个数组。对每个数组求和，每帧只有 20 个值。
每帧取这 20 个值中的每一个以10 为底。
每帧对这 20 个值执行DCT 。为此，我使用dct2八度音阶的方法。

我的代码：-

#{
Perform MFCC on helloworld wav file.

#}


clear all;
close all;
#{Step 0: Reading the File & initializing the Time and Freq.
#}
    [x,fs,nbits]=wavread('helloworld.wav');
    ts=1/fs;
    N=length(x);
    Tmax=(N-1)*ts;
    fsu=fs/(N-1);
    t=(0:ts:Tmax);
    f=(-fs/2:fsu:fs/2);
    figure, subplot(411),plot(t,x),xlabel('Time'),title('Original Speech');
    subplot(412),plot(f,fftshift(abs(fft(x)))),xlabel('Freq (Hz)'),title('Frequency Spectrum');

#{
Step 1: Pre-Emphasis

#}

    a=[1];
    b=[1 -0.95];
    y=filter(b,a,x);
    subplot(413),plot(t,y),xlabel('Time'),title('Signal After High Pass Filter - Time Domain');
    subplot(414),plot(f,fftshift(abs(fft(y)))),xlabel('Freq (Hz)'),title('Signal After High Pass Filter - Frequency Spectrum');

#{
Step 2: Frame Blocking

#}
    frameSize=256;
    frameOverlap=128;
    frames=enframe(y,frameSize,frameOverlap);
    NumFrames=size(frames,1);

#{
Step 3: Hamming Windowing

#}
    hamm=hamming(256)';
    for i=1:NumFrames
    windowed(i,:)=frames(i,:).*hamm;
    end



#{
Step 4: FFT 
Taking only the positive values in the FFT that is the first half of the frame after being computed. 
#}

    for i=1:NumFrames
    ft(i,:)=abs(fft(windowed(i,:))(1:frameSize/2));     
    end


#{
Step 5: Mel Filterbanks
Lower Frequency = 300Hz
Upper Frequency = fs/2
With a total of 22 points we can create 20 filters.
#}
    Nofilters=20;
    lowhigh=[300 fs/2];
    %Here logarithm is of base 'e'
    lh_mel=1125*(log(1+lowhigh/700));
    mel=linspace(lh_mel(1),lh_mel(2),Nofilters+2);
    melinhz=700*(exp(mel/1125)-1);
    %Converting to frequency resolution
    fres=floor(((frameSize)+1)*melinhz/fs); 
    %Creating the filters
    for m =2:length(mel)-1
        for k=1:frameSize/2
     if k<fres(m-1)
        H(m-1,k) = 0;
    elseif (k>=fres(m-1)&&k<=fres(m))
        H(m-1,k)= (k-fres(m-1))/(fres(m)-fres(m-1));
    elseif (k>=fres(m)&&k<=fres(m+1))
       H(m-1,k)= (fres(m+1)-k)/(fres(m+1)-fres(m));
    elseif k>fres(m+1)
        H(m-1,k) = 0;    
    endif
        end
        end
    %H contains the 20 filterbanks, we now apply it to the
    %processed signal.
    for i=1:NumFrames
    for j=1:Nofilters
        bankans(i,j)=sum(ft(i,:).*H(j,:));
    end
    end


#{
Step 6: Nautral Log and DCT
#}
    pkg load signal
    %Here logarithm is of base '10'
    logged=log(bankans)/log(10);
    for i=1:NumFrames

        lnd(i,:)=dct2(logged(i,:));
    end

%plotting the MFCC

    figure 
    hold on
    for i=1:NumFrames
        plot(lnd(i,:));
    end
    hold off

MFCC 情节：-

我的问题：-

根据语音识别标准，我的所有步骤对于 MFCC 计算是否有效且正确？
有些教程说在第5步之后取平方值，这里有必要吗？
我的输出 MFCC 的第一帧的 20 个值是 -Infinity ,Nan,Nan .... 如果我的第一帧最初在时域中仅包含“0”值，是否有可能发生？
总体而言，我的代码是否正确，我不确定，因为我已经阅读了每个帧的 MFCC 的第一个元素的值比其他元素高得多，但对我来说，在某些情况下，第一个 MFCC 元素是负数，第二个是正数，这完全违背了既定的规则。

对不起，如果问题是业余的。我刚刚开始学习 DSP 和语音处理概念。

谢谢你。