Implementing a pitch shifter

signal-processing · audio · pitch · speech
2022-02-01 12:56:52

I am trying to implement a pitch shifter similar to this one in MATLAB, for quick prototyping.

My goal is to do the opposite, but I think it is a similar problem.

Implementation details are missing, so I am trying to piece them together.

It looks like the phase vocoder and SOLA methods are the preferred approaches, so I did some research there.

I found this, which comes with MATLAB code and a sample chapter explaining how it works.

I understand most of the code, except that during phase reconstruction there is an odd choice of π or 0 depending on whether the bin index is odd:

                pizero=pi*ones(1,length(freqind));
                pcent=peaksort(tk,2)-peaksort(tk,1)+1;
                indpc=(2-mod(pcent,2)):2:length(freqind);
                pizero(indpc)=zeros(1,length(indpc));
                phout(freqind)=phadvance(sk,peaksort(tk,2))+pizero;

The accompanying text in the book is just as cryptic, and the reference only points to a figure:

If these are chosen to be ... the burst generated by the IFFT will be windowed with tapered ends

Do we know why these choices are made? What are tapered ends? Why do these choices produce them? And why are tapered ends preferable?
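
To make the question concrete, here is a small sketch of what the alternating 0/π appears to do (my own construction, not from the book): multiplying the spectrum by (-1)^k is the DFT property that circularly shifts the IFFT output by half a frame, so the synthesized burst sits in the middle of the frame, decaying toward both ends, instead of wrapping around the frame edges.

    % Sketch only: compare the IFFT of a narrow spectral bump with all bins
    % in phase against the same bump with pi added to every other bin.
    nfft = 1024;
    k0 = 100;                               % an arbitrary peak bin
    bins = (k0-3):(k0+3);                   % a few bins around the peak

    X1 = zeros(1, nfft);
    X1(bins+1) = hanning(length(bins))';    % all peak bins share one phase
    X2 = X1;
    X2(bins+1) = X2(bins+1) .* (-1).^bins;  % add pi to every other bin

    x1 = real(ifft(X1));                    % burst piles up at the frame edges
    x2 = real(ifft(X2));                    % burst centered, tapered ends
    subplot(2,1,1); plot(x1); title('all peak bins in phase');
    subplot(2,1,2); plot(x2); title('alternating 0/\pi across peak bins');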

I also found this, which also seems to be doing a phase vocoder but is otherwise a rather different implementation; in particular, I cannot find any peak finding being done in that code.

This makes me wonder: what exactly is a phase vocoder? Is it the first link, the second link, or a class of algorithms?

When the author says pitch shifting he actually seems to be doing pitch scaling, and it looks like he does not care whether some frequency content goes out of range; he simply ignores it.

            /* ***************** PROCESSING ******************* */
            /* this does the actual pitch shifting */
            memset(gSynMagn, 0, fftFrameSize*sizeof(float));
            memset(gSynFreq, 0, fftFrameSize*sizeof(float));
            for (k = 0; k <= fftFrameSize2; k++) { 
                index = k*pitchShift;
                if (index <= fftFrameSize2) { 
                    gSynMagn[index] += gAnaMagn[k]; 
                    gSynFreq[index] = gAnaFreq[k] * pitchShift; 
                } 
            }

Suppose I apply this to a male-to-female conversion: we know the frequency range of the male voice is 85 to 180 Hz and that of the female voice is 165 to 255 Hz, so simply scaling could produce, say, 300 Hz components that do not exist in female voices.

I am wondering whether the following makes sense:

  • Scale 0 to 85 Hz onto 0 to 165 Hz.
  • Scale 85 to 180 Hz onto 165 to 255 Hz.
  • Scale 180 Hz up to the maximum onto 255 Hz up to the maximum.

so that the bandwidth changes as little as possible.
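
For example, under this mapping a 120 Hz component would land at (120 − 85)/(180 − 85) · (255 − 165) + 165 ≈ 198 Hz, and the last segment still maps the Nyquist frequency onto itself, so nothing is pushed out of range.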

I would also like to do the formant trick from the first link.

Formant analysis seems to be done mostly with either LPC or the cepstrum; I went with LPC.

I found this, which helped me estimate formants using LPC, and I can reproduce that experiment without problems. But I am wondering:

how do I implement formant shifting inside a phase vocoder?
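
For reference, the formant-estimation part I can already do looks roughly like this (a minimal sketch of the usual lpc/roots recipe; the segment choice, the model-order rule of thumb, and the frequency/bandwidth thresholds are my own assumptions):

    % Sketch: estimate formant candidates for one voiced segment with LPC.
    [x, fs] = wavread('male.wav');
    x = x(:,1);                                 % mono
    seg = x(1:round(0.03*fs));                  % a 30 ms segment (assumed voiced)
    seg = seg .* hamming(length(seg));
    seg = filter([1 -0.63], 1, seg);            % pre-emphasis

    p = 2 + round(fs/1000);                     % rule-of-thumb LPC order
    a = lpc(seg, p);
    rts = roots(a);
    rts = rts(imag(rts) > 0);                   % one root per conjugate pair
    frq = atan2(imag(rts), real(rts)) * fs/(2*pi);
    bw  = -0.5*(fs/(2*pi))*log(abs(rts));       % pole bandwidths in Hz

    [frq, idx] = sort(frq);
    bw = bw(idx);
    formants = frq(frq > 90 & bw < 400)         % plausible formant candidates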

Apologies for such a long and convoluted question; this is hard for me, as I am only just getting started with audio signal processing. If I understood all of this better I could split it into separate, clear questions, but at this point it all looks like one big tangle to me.

Disclaimer: this is not homework, just a side-project learning exercise.

Update:

Here is the code I have now. It does generally change the spectrum the way I want, but the result is far from the pitch-shifted version in the experiment I am trying to reproduce. I suspect the high-frequency noise I observe here is the culprit.

% This implements a phase vocoder for time-stretching/compression
% Put the file name of the input file in "infile"
%     the file to be created is "outfile"
% The amount of time-stretch is determined by the ratio of the hopin
% and hopout variables. For example, hopin=242 and hopout=161.3333
% (integers are not required) increases the tempo by
% hopin/hopout = 1.5. To slow down a comparable amount,
% choose hopin = 161.3333, hopout = 242.
% 5/2005 Bill Sethares

clf
infile='male.wav';
outfile='female.wav';
time=0;                                    % total time to process
hopin=121;                                 % hop length for input
hopout=121;                                % hop length for output

all2pi=2*pi*(0:100);                       % all multiples of 2 pi (used in PV-style freq search)
max_peak=50;                               % parameters for peak finding: number of peaks
eps_peak=0.005;                            % height of peaks
nfft=2^12; nfft2=nfft/2;                   % fft length
win=hanning(nfft)';                        % windows and windowing variables
% y are the samples, sr is the sampling rate.
[y,sr]=wavread(infile);                    % read song file
siz=wavread(infile,'size');                % length of song in samples
stmo=min(siz); leny=max(siz);              % stereo (stmo=2) or mono (stmo=1)
if siz(2)<siz(1), y=y'; end
if time==0, time = 100000; end

% time is in unit seconds, and therefore time * sr equals to the number
% of samples for time seconds.
% in any case, we will not be dealing with 100,000 seconds anyway.
% it looks like a precaution to avoid very long loop only.
tt=zeros(stmo,ceil(hopout/hopin)*min(leny,time*sr)); % place for output

% Suppose hopin is an integer and we are advancing window by hopin
% every time, the first window is [1..nfft],
% the next window is [hopin + 1..hopin + nfft]
% and so on, so the last window should be
% [k * hopin + 1 .. k * hopin + nfft]
% and k * hopin + nfft should be closest to leny
% That is probably how lenseg is determined.
lenseg=floor((min(leny,time*sr)-nfft)/hopin) % number of nfft segments to process

% The fastest wave in discrete time takes two samples. Therefore, the
% period is 2 * sampling period. The associated frequency is sampling rate / 2
% which is also equal to sr * nfft2/nfft
% the remaining frequencies are scaled linearly.
ssf=sr*(0:nfft2)/nfft;                     % frequency vector

% ph stands for phase
phold=zeros(stmo,nfft2+1); phadvance=zeros(stmo,nfft2+1);
outbeat=zeros(stmo,nfft); pold1=[]; pold2=[];
% hopin is the number of samples to advance, therefore dtin is the
% amount of time to advance
dtin=hopin/sr;                             % time advances dt per hop for input
dtout=hopout/sr;                           % time advances dt per hop for output
% for k=1239:1239
for k=1:lenseg-1                           % main loop - process each beat separately
% for k=1:3                           % main loop - process each beat separately
    if k/1000==round(k/1000), disp(k), end % for display so I know where we are
    % These are the indexes of the frame
    indin=round(((k-1)*hopin+1):((k-1)*hopin+nfft));
    for sk=1:stmo                          % do L/R channels separately
        s=win.*y(sk,indin);                % get this frame and take FFT
        ffts=fft(s);
        mag=abs(ffts(1:nfft2+1));
        ph=angle(ffts(1:nfft2+1));

        % find peaks to define spectral mapping
        % peaks is going to be a matrix with 3 columns, one row for
        % each peak, the first column and the last column represents
        % the bin for the peak, and the second column is index of the
        % peak
        peaks=findPeaks4(mag, max_peak, eps_peak, ssf);
        % inds(1) is the index of the maximum peak, the mag is a
        % vector, not a function
        [dummy,inds]=sort(mag(peaks(:,2)));
        % peaksort are the peaks, the columns are interpreted just like
        % peaks, and the rows are sorted in ascending order
        peaksort=peaks(inds,:);
        % pc is the second column, that correspond to the indices of
        % the peak, sorted in ascending order
        pc=peaksort(:,2);

        bestf=zeros(size(pc));
        for tk=1:length(pc)                % estimate frequency using PV strategy

            % tk is which peak
            % pc(tk) is the index of the peak
            % ph(pc(tk)) is the phase of the peak
            %
            % the phold term means the same thing for the last frame
            % MATLAB allows adding a scalar to a vector, it means
            % applying to all elements
            %
            % so dtheta is a set of phase differences
            dtheta=(ph(pc(tk))-phold(sk,pc(tk)))+all2pi;

            % In a period, the phasor runs 2pi radian
            % In time dt, the phasor runs 2 pi dt/T = 2pi dt f radian
            % And we know the phase differences are dtheta
            % so fest are all the possible frequencies causing that
            % peak.
            fest=dtheta./(2*pi*dtin);      % see pvanalysis.m for same idea

            % ssf(pc(tk)) correspond to the frequency of the coefficient.
            % This implies we wanted the frequency that is closest
            [er,indf]=min(abs(ssf(pc(tk))-fest));

            % Now we have the best frequency found for the peak
            bestf(tk)=fest(indf);          % find best freq estimate for each row
        end

        % generate output mag and phase

        magout=mag; phout=ph;
        for tk=1:length(pc)
            % We have only one best frequency for a peak
            fin=bestf(tk);
            % Here are all the input spectral indexes 
            inbinfrom = peaksort(tk,1);
            inbinto = peaksort(tk,3);
            freqindin=(inbinfrom:inbinto);  
            
            fdes = map_frequency(fin, sr);
            inbinfromfreq = inbinfrom * sr/nfft;
            inbintofreq = inbinto * sr/nfft;
            
            outbinfromfreq = map_frequency(inbinfromfreq, sr);
            outbintofreq = map_frequency(inbintofreq, sr);
            
            outbinfrom = round(outbinfromfreq * nfft / sr);
            outbinto = round(outbintofreq * nfft / sr);

            freqindout=(outbinfrom:outbinto);

            % specify magnitude and phase of each partial

            % The output magnitude is interpolated
            inputx = (0:length(freqindin)-1)/(length(freqindin));
            inputy = mag(freqindin);
            outputx = (0:length(freqindout)-1)/(length(freqindout));
            outputy = interp1(inputx, inputy, outputx,'linear','extrap');
            magout(freqindout)= outputy;

            % We know the real frequency, so that phase should increase by
            % 2 * pi * fdes * dtout
            % We are estimating only a single phadvance for the whole bin
            phadvance(sk,peaksort(tk,2))=phadvance(sk,peaksort(tk,2))+2*pi*fdes*dtout;
            
            % These weird expression zero outs some entries in pizero. This
            % is a mechanism that is not totally understood around the idea
            % of tapered end as in the notes
            pizero=pi*ones(1,length(freqindout));
            pcent=peaksort(tk,2)-peaksort(tk,1)+1;
            indpc=(2-mod(pcent,2)):2:length(freqindout);
            pizero(indpc)=zeros(1,length(indpc));

            % Now we can set the phase for the bin
            phout(freqindout)=phadvance(sk,peaksort(tk,2))+pizero;
        end
        
        % reconstruct time signal (stretched or compressed)
        compl=magout.*exp(sqrt(-1)*phout);
        compl(nfft2+1)=ffts(nfft2+1);
        compl=[compl,fliplr(conj(compl(2:(nfft2))))];
        wave=real(ifft(compl));
        outbeat(sk,:)=wave;
        phold(sk,:)=ph;
        

    end % end stereo
    indout=round(((k-1)*hopout+1):((k-1)*hopout+nfft));
    tt(:,indout)=tt(:,indout)+outbeat;
end

tt=0.8*tt/max(max(abs(tt)));
[rtt,ctt]=size(tt); if rtt==2, tt=tt'; end
wavwrite(tt,sr,16,outfile);
fclose('all');

The findPeaks4 function was provided; all I did was add some comments documenting my understanding.

function  peaks = findPeaks4( Amp, MAX_PEAK, EPS_PEAK, SSF )

%   This version modified from findPeaks.m by P. Moller-Nielson 
%   28.3.03, pm-n. ( see http://www.daimi.au.dk/~pmn/sound/ )

SPECTRUM_SIZE=length(Amp);
% The very first entry of the spectrum is the DC value, we do not consider
% that. Therefore only [3..SPECTRUM_SIZE-1] can be a local maximum
% The first two comparison make sure it is a local maximum, and the last
% multiplication extract the value of that local maximum.
%
% The resulting vector index 1 correspond to index 3 of the Amp vector (*)
peakAmp = ( Amp(3:SPECTRUM_SIZE-1) > Amp(2:SPECTRUM_SIZE-2) ) .* ...
          ( Amp(3:SPECTRUM_SIZE-1) > Amp(4:SPECTRUM_SIZE) ) .* ...
          Amp(3:SPECTRUM_SIZE-1);
peakPos = zeros( MAX_PEAK, 1);

maxAmp = max( peakAmp );
nPeaks = 0;
for p = 1 : MAX_PEAK
  % The maximum value is m, the index of that maximum value is b
  [m, b] = max( peakAmp );
  if m <= ( EPS_PEAK * maxAmp )
    break;
  end;
  % The + 2 is due to (*)
  peakPos(p) = b+2;
  % By setting peakAmp(b) to 0, the max in the next iteration will find the
  % next max
  peakAmp(b) = 0;
  nPeaks = p;
end;

peakPos = sort( peakPos );
peaks = zeros( nPeaks, 3 );

% b stands for bin. WTF! Saving just 2 characters to let me guess?
last_b = 1;
for p = 1 : nPeaks
  b = peakPos(MAX_PEAK-nPeaks+p);
  first_b = last_b+1;
  if p == nPeaks
    last_b = SPECTRUM_SIZE;
  else
    next_b = peakPos(MAX_PEAK-nPeaks+p+1);
    % Between the peaks, find the minimum
    % note that rel_min is the index starting from b
    % rel_min = 1 meaning Amp(b) is minimal
    [dummy, rel_min] = min( Amp(b:next_b));
    % Therefore last_b is the index of the minimum
    last_b = b+rel_min-1;
  end;
  peaks(p,1) = first_b;
  peaks(p,2) = b;
  peaks(p,3) = last_b;
end;

The map_frequency function is what you would expect:

function fdes = map_frequency(fin, sr)
    fmax = sr/2;
    if fin < 85
        fdes = fin / 85 * 165;
    else
        if fin < 180
            fdes = (fin - 85)/(180-85) * (255 - 165) + 165;
        else
            fdes = (fin - 180)/(fmax-180) * (fmax - 255) + 255;
        end
    end

2 Answers

There is a lot that you want in this question, and it is a lot to expect someone to read through. My understanding is that you want someone to explain what a phase vocoder is, and that alone is a lot to cover.

With the SOLA technique you will need a very good pitch detector. That technique chops the input vocal waveform into small wavelets or grains, and then emits those grains (overlapping them) at the rate of the desired output pitch. If the grains are not stretched or squashed in time, the formants do not move even though the pitch does. If the grains are squashed (compressed) in time, the formants move up; if the grains are stretched in time, the formants move down.
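
Very roughly, the grain idea looks like the sketch below (mine, under strong simplifying assumptions: a constant, known, integer pitch period and two-period Hann grains; in practice a pitch detector has to supply per-cycle pitch marks). Because each grain is re-emitted unmodified and only the emission rate changes, the formants stay where they were:

    % Sketch of grain re-emission (PSOLA-style). T_in/T_out are the input and
    % desired output pitch periods in samples (assumed constant and integer).
    function y = grain_shift_sketch(x, T_in, T_out)
        x = x(:);
        grainLen = 2*T_in;                          % two-period grains
        w = hanning(grainLen);
        marks = 1:T_in:(length(x) - grainLen);      % crude "pitch marks"
        y = zeros(length(marks)*T_out + grainLen, 1);
        for g = 1:length(marks)
            grain = x(marks(g):marks(g)+grainLen-1) .* w;   % grain is NOT
                                                            % time-scaled, so
                                                            % formants stay put
            pos = (g-1)*T_out + 1;                  % re-emit at the new period
            y(pos:pos+grainLen-1) = y(pos:pos+grainLen-1) + grain;
        end
    end

For a roughly 110 Hz male voice at sample rate sr, something like grain_shift_sketch(x, round(sr/110), round(sr/165)) raises the pitch toward 165 Hz while leaving the spectral envelope alone.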

If you want to do this with a phase vocoder instead of SOLA, you will probably need some LPC analysis to obtain the formant filter and flatten the spectrum before the pitch shift, and then apply that spectral envelope to the shifted output.
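
A minimal per-frame sketch of that flatten/re-apply step (my sketch, not code from either of the linked implementations; it assumes s is one windowed analysis frame, mag holds its magnitudes over bins 0..nfft/2, sr and nfft are as in your script, and shift_mags is a hypothetical stand-in for whatever peak-based shifting you already do):

    p = 2 + round(sr/1000);                 % rule-of-thumb LPC order (assumed)
    a = lpc(s, p);                          % all-pole model of the frame
    f = sr*(0:length(mag)-1)/nfft;          % bin frequencies in Hz
    env = abs(freqz(1, a, f, sr));          % spectral envelope at those bins
    env = env(:).' * (max(mag)/max(env));   % row vector, crude gain alignment

    flatmag = mag ./ max(env, eps);         % whitened (formant-flattened) magnitudes
    shifted = shift_mags(flatmag);          % hypothetical: your existing peak shifting
    magout  = shifted .* env;               % re-impose the original formants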

These are all very big topics, and not easy to cover fully on an SE site.

I will only address the phase-vocoder / pitch-shifter part of your question.

This makes me wonder: what exactly is a phase vocoder? Is it the first link, the second link, or a class of algorithms?

The pitch-shifting code you link to is in fact a simple implementation of a phase vocoder. A phase vocoder is a process/tool in which you first transform the time-domain signal into the frequency domain (typically via the short-time Fourier transform) and then use the magnitude and/or phase information to modify the original signal in some way (pitch shifting, time stretching, frequency boosting or attenuation, and so on).
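
As a concrete illustration of that analyze/modify/resynthesize loop, here is a bare-bones time-stretch skeleton (my own sketch, not the code from either of your links; it assumes a mono column-vector input and integer hop sizes). The ratio hopS/hopA sets the stretch factor, and resampling the stretched output afterwards turns the time stretch into a pitch shift:

    function y = pv_sketch(x, nfft, hopA, hopS)
        win = hanning(nfft);
        nFrames = floor((length(x) - nfft)/hopA);
        y = zeros(nFrames*hopS + nfft, 1);
        phA = zeros(nfft, 1); phS = zeros(nfft, 1);
        omega = 2*pi*(0:nfft-1)'/nfft;              % bin centers, rad/sample
        for m = 0:nFrames-1
            X = fft(win .* x(m*hopA+1 : m*hopA+nfft));
            mag = abs(X); ph = angle(X);
            dphi = ph - phA - omega*hopA;           % deviation from bin center
            dphi = dphi - 2*pi*round(dphi/(2*pi));  % wrap to [-pi, pi]
            trueFreq = omega + dphi/hopA;           % instantaneous frequency
            phS = phS + trueFreq*hopS;              % advance synthesis phase
            phA = ph;
            frame = real(ifft(mag .* exp(1i*phS)));
            idx = m*hopS+1 : m*hopS+nfft;
            y(idx) = y(idx) + win .* frame;         % overlap-add
        end
        y = y / (sum(win.^2)/hopS);                 % approximate COLA gain fix
    end

For example, pv_sketch(x, 1024, 256, 384) stretches by a factor of 1.5; resampling that output back to the original duration then gives a pitch shift of 1.5.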

When the author says pitch shifting he actually seems to be doing pitch scaling, and it looks like he does not care whether some frequency content goes out of range; he simply ignores it.

Yes, that pitch-shifting algorithm discards any frequency content that ends up above the Nyquist frequency (when the pitch-shift factor is > 1). To preserve the entire spectrum you would first need to resample the signal to a higher sampling rate. That said, the human voice is relatively narrowband (traditional telephone lines are 4 kHz, if I remember correctly), so one could argue that implementing the pitch shifter without resampling is good enough to get decent results, at least for smaller pitch-shift factors, say 1.0 to 2.0.
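
If you do want to keep everything, the workaround is simply to move to a higher sample rate before shifting, for example (a sketch; my_pitch_shift is a hypothetical stand-in for whichever shifter you use):

    [x, sr] = wavread('male.wav');
    xu = resample(x, 2, 1);                 % work at 2*sr, so content shifted by
                                            % 1.5 still sits below the new Nyquist
    yu = my_pitch_shift(xu, 1.5);           % hypothetical shifter
    wavwrite(yu, 2*sr, 16, 'shifted_up.wav');   % keep the higher rate on output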

The problem, however, is that pitch-shifting the entire spectrum will not make a male voice sound like a female voice. You need to shift the formants to match female formants, and even that alone will probably still not be enough. Female voices often have qualities (breathiness, for example) that male voices lack, so you may need to emulate those as well to get good results.