最优线,使得最大点位于上下边界之间

计算科学 优化 Python
2021-12-24 07:37:20

我有一些 2D 数据,想找到一条线之间的区域内捕获数据中的最大点数请参阅图片作为示例(黄线是边界,其中边距 = 20)。y=mx+by=mx+b+marginy=mx+bmargin

显示边界的线性回归

如果有帮助,y 轴上的值必然是整数。

目前,我使用线性回归(来自 scikit-learn)来找到一条线。但是,这并不能保证在边界之间放置最多的点。我还研究了一些支持向量回归 (SVR),它似乎试图将每个点都放在该线的某个 epsilon 内,但我不确定这是否是正确的应用程序,或者 SVR 是否过大。

任何 Python 代码都会受到赞赏,但更一般地说,我只是想要一个概念性的解决方案。有没有人有任何见解可以帮助解决这个问题?谢谢!

编辑:点数在 5000-10000 的范围内,理论上我只需要计算一次。n

2个回答

时间内近似得出答案,是您希望测试的样本行数。为此:O(NlogN+kN)k

  1. 选择一组均匀的跨越范围 [0,π)。这些是您的示例行。kθ
  2. 的行开始循环( ) θ=0O(N)
    1. 的方向上得到一个单位向量( )uθO(N)
    2. 得到一个向量垂直于这个向量。( )u^O(N)
    3. 将所有点投影到上。( )u^O(N)
    4. 沿对点进行排序。( )u^O(N)
    5. 沿着已排序的点滑动一个与您的页边距大小相同的窗口,并注意最大数量的点何时在此窗口内以及发生这种情况时窗口的中点。( )O(N)
  3. 返回有关具有最大点数的行的信息。

该算法需要时间来对点进行初始排序。然而,在此之后,我们将这些点投影到一条缓慢旋转的线上。缓慢的旋转意味着每次重投影都会导致点已经非常接近它们之前排序的位置:O(NlogN)

旋转最小二乘动画

大多数排序算法能够在几乎线性的时间内处理这种情况。还要注意,随着样本数量的增加,点已经排序的程度也会增加。

完整的算法如下所示:

#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <iostream>
#include <random>
#include <vector>



class FracPoint {
 public:
  float frac = std::numeric_limits<float>::quiet_NaN();
  float x;
  float y;
  FracPoint(float x0, float y0) : x(x0), y(y0) {}
  bool operator<(const FracPoint &o) const {
    return frac<o.frac;
  }
  bool operator>(const FracPoint &o) const {
    return frac>o.frac;
  }
};

struct BestLine {
  int    count;
  double slope;
  double intercept;
};

typedef std::vector<FracPoint> Points;
typedef std::pair<int,double> CountIntercept;



///Projects points to their closest orthogonal point on a line at an angle theta
///to the x-axis. Sets each point's `frac` property to the number of unit
///vectors their projection is from the origin. This can be used to sort points
///along the length of the line.
void GetFractionalDistPointsProjectedToVector(
  Points &pts,
  const double theta
){
  const double tvx = std::cos(theta);
  const double tvy = std::sin(theta);

  //Get a line perpendicular to the one we just generated. This is the line we
  //sort the points along to find the maximal margin.
  const auto vx = -tvy;
  const auto vy = tvx;

  //Do a dot product (v dot s)/(s dot s) and multiply by s to get projection
  for(auto &pt: pts)
    pt.frac = (pt.x*vx+pt.y*vy)/(vx*vx+vy*vy);
}



CountIntercept FindBestIntercept(Points &pts, const double margin){
  Points::iterator upper=pts.begin();
  Points::iterator lower=pts.begin();
  int best_count = 0;
  double best_pos = 0;
  for(;upper!=pts.end();++upper){
    while(upper->frac-lower->frac>margin)
      ++lower;
    const auto count = upper-lower+1;
    if(count>best_count){
      best_count = count;
      best_pos = (upper->frac-lower->frac)/2;
    }
  }
  return CountIntercept(best_count,best_pos);
}



BestLine FindBestLine(Points &pts, const int samples, const double margin){
  int    best_count = 0;
  double best_theta;
  for(int i=0;i<samples;i++){
    const double theta = i*(M_PI/samples);
    //Project points onto the line defined by theta
    GetFractionalDistPointsProjectedToVector(pts, theta);
    //Since points are already mostly sorted from the last projection, use
    //insertion sort's O(N) best case scenario to make them totally sorted.
    // ska_sort(pts.begin(),pts.end(),[](const FracPoint &pt){return pt.frac;});
    std::sort(pts.begin(),pts.end());
    const auto count_intercept = FindBestIntercept(pts, margin);
    if(count_intercept.first>best_count){
      best_count = count_intercept.first;
      best_theta = theta;
    }
  }

  //Project points onto the line defined by the best theta
  GetFractionalDistPointsProjectedToVector(pts, best_theta);
  //Since points are being projected onto a "randomly" chosen line, we use an
  //O(N log N) sort
  std::sort(pts.begin(),pts.end());
  const auto count_intercept = FindBestIntercept(pts, margin);  

  //`count_intercept.second` isn't actually the intercept, it's the distance
  //along a line perpendicular to the one defined theta at which the best
  //intercept is located. Let's find the real intercept now.
  double intercept;

  {
    //Get vector along theta
    const double tvx = std::cos(best_theta);
    const double tvy = std::sin(best_theta);

    //Get perpendicular vector
    auto vx = -tvy;
    auto vy = tvx;

    //Multiply by `counter_intercept.second` to get absolute position of best
    //point.
    vx *= count_intercept.second;
    vy *= count_intercept.second;

    //Get slope of line
    double slope = std::tan(best_theta);
    //Use point-slope equation of line y-y1=m*(x-x1) and reformulate to y=m*x+b
    //to recover the actual intercept
    intercept = -slope*vx+vy;
  }

  return {best_count, std::tan(best_theta), intercept};
}



int main(int argc, char **argv){
  if(argc!=3){
    std::cout<<"Syntax: "<<argv[0]<<" <N> <Samples>"<<std::endl;
    return -1;
  }

  const int N       = std::stoi(std::string(argv[1]));
  const int samples = std::stoi(std::string(argv[2]));

  //Poor way of initializing PRNG. Fine for this application.
  std::mt19937 gen(1234567); 
  std::uniform_real_distribution<> dis(-100, 100);
  auto dice = std::bind(dis,gen);

  Points pts;
  for(int i=0;i<N;i++)
    pts.emplace_back(dice(),dice());

  const auto ret = FindBestLine(pts, samples, 25);

  std::cout<<"Count          = "<<ret.count<<std::endl;
  std::cout<<"Best slope     = "<<ret.slope<<std::endl;
  std::cout<<"Best intercept = "<<ret.intercept<<std::endl;

  return 0;
}

现在,我们为算法在不同数量的点上计时 100 个样本,结果如​​下:

1000,       0.014
10000,      0.083
50000,      0.445
100000,     0.943
500000,     5.615
1000000,   12.092
5000000,   68.748
10000000, 144.155

实际上,它几乎是线性的(在 R2=0.9962 的对数图上的斜率为 1.028)。

OpenMP 可用作并行化操作的简单方法:

  #pragma omp parallel for firstprivate(pts)
  for(int i=0;i<samples;i++){
    const double theta = i*(M_PI/samples);
    //Project points onto the line defined by theta
    GetFractionalDistPointsProjectedToVector(pts, theta);
    //Since points are already mostly sorted from the last projection, use
    //insertion sort's O(N) best case scenario to make them totally sorted.
    // ska_sort(pts.begin(),pts.end(),[](const FracPoint &pt){return pt.frac;});
    std::sort(pts.begin(),pts.end());
    const auto count_intercept = FindBestIntercept(pts, margin);
    #pragma omp critical
    if(count_intercept.first>best_count){
      best_count = count_intercept.first;
      best_theta = theta;
    }
  }

我不确定你的尺码是多少n是或您需要多久计算一次所需的线,但是应该非常接近最优的最简单的解决方案是简单地采用大小的点集n,构造行,然后在时间内测试每一行,看看每行有多好。使用这种简单的算法,我们有一个整体的运行时间,而实际上只使用了空间。(n2)O(n)O(n3)O(n)

如果你构建一个半空间范围搜索的数据结构,目标是计算有多少点与给定的半空间相交,你可以使用空间。时间内返回 2D 中半空间的交集数,您只需执行其中两个操作(一个半空间对应到每个上下边缘线)每行。因此,对于每行工作量为运行时。O(n2logn)O(n2)O(logn)O(n2)O(logn)O(n2logn)

我的直觉告诉我,可能有一些随机算法可以真正有效地近似解决这个问题,但我必须更多地考虑它。