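Higham's algorithm looks very useful for online algorithms or for algorithms with limited storage (edge processing, for example). In those situations it is probably always worth the cost.
However, to address the question you asked, I implemented an SSE2 version which I think captures your question: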
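For reference, and assuming the update meant here is the usual running-mean recurrence, the mean after the k-th sample is obtained from the previous mean alone, so only the current mean and a counter have to be kept:

```latex
M_1 = x_1, \qquad M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k} \quad (k = 2, \dots, n)
```

Here $x_k$ is the k-th sample and $M_k$ is the mean of the first k samples; the symbols are my own notation, not something taken from the question.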
    #include <chrono>
    #include <cstdlib>
    #include <iomanip>
    #include <iostream>
    #include <xmmintrin.h>

    int main(){
        const int num_count = 100000;  // must be a multiple of 4 for the SSE loop below
        alignas(16) float data[num_count];
        for(int i=0;i<num_count;i++)
            data[i] = rand()%100000+rand()/(double)RAND_MAX;

        {
            // Simple mean: accumulate in a float, divide once at the end.
            const auto t1 = std::chrono::high_resolution_clock::now();
            float sum=0;
            for(int i=0;i<num_count;i++){
                sum += data[i];
            }
            const float mean=sum/num_count;
            const auto t2 = std::chrono::high_resolution_clock::now();
            const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
            std::cout << "Exec time:\t" << time_span1.count() << " s\n";
            std::cout << "Simple Mean = "<<std::setprecision(20)<<mean<<std::endl;
        }

        {
            // Running-mean update, four independent lanes at a time.
            const auto t1 = std::chrono::high_resolution_clock::now();
            __m128 mean = _mm_load_ps(&data[0]);  // each lane starts from its first sample
            for (int i=4;i<num_count;i+=4){
                const __m128 x    = _mm_load_ps(&data[i]);
                const __m128 diff = _mm_sub_ps(x,mean);
                // Number of samples each lane has seen, including this one
                // (the initial load counts as the first).
                const __m128 k    = _mm_set1_ps(static_cast<float>(i/4 + 1));
                const __m128 div  = _mm_div_ps(diff, k);
                mean = _mm_add_ps(mean, div);
            }
            alignas(16) float result[4];  // _mm_store_ps requires a 16-byte aligned address
            _mm_store_ps(result, mean);
            // All four lanes have seen the same number of samples, so the overall
            // mean is the plain average of the lane means.
            const float tmean = (result[0] + result[1] + result[2] + result[3])/4; //I'm suspicious about this step: probably throws away precision
            const auto t2 = std::chrono::high_resolution_clock::now();
            const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
            std::cout << "Exec time:\t" << time_span1.count() << " s\n";
            std::cout << "Higham Mean = "<<std::setprecision(20)<<tmean<<std::endl;
        }
    }
and observed:

    Exec time: 0.000225851 s
    Simple Mean = 49891.23046875
    Exec time: 0.0003759360000000000002 s
    Higham Mean = 49890.26171875

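The Higham mean takes longer to compute, and the two values differ by an amount that may not be negligible, though the accuracy you actually need depends on your implementation.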