Evaluating the performance of a random forest classifier

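I am using a random forest classifier (in R) to impute missing data in a dataset. Basically, I have a set of objects (companies) for which I want to guess one attribute (size) from other attributes (capital, owning_group, and state). The dependent attribute (size) is a categorical variable with three possible values (small | medium | large). A random forest on these three predictors (R package randomForest) gives the following output: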

ff = size ~ capital + owning_group + state

Call:
 randomForest(formula = ff, data = df, importance = T, ntree = ntree, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 1

        OOB estimate of  error rate: 32.41%
Confusion matrix:
       large medium small class.error
large    238     17   237  0.51626016
medium    80     25   322  0.94145199
small     73     30  1320  0.07238229
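
The class.error column and the OOB error rate can be recomputed from the confusion matrix itself, which shows where the 32.41% comes from. A minimal sketch in base R: the object name cm is an assumption for illustration, and the counts are copied from the output above, with rows read as actual classes and columns as predicted classes (which is consistent with class.error being a per-row figure).

```r
# Confusion matrix copied from the randomForest output above.
# Rows = actual class, columns = predicted class.
cm <- matrix(
  c(238, 17,  237,
     80, 25,  322,
     73, 30, 1320),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    actual    = c("large", "medium", "small"),
    predicted = c("large", "medium", "small")
  )
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: share of all observations off the diagonal.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

Read this way, 402 of the 427 medium companies are misclassified, 322 of them as small.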

  Overall Statistics

               Accuracy : 0.7297          
                 95% CI : (0.7112, 0.7476)
    No Information Rate : 0.8049          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.426           
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: large Class: medium Class: small
Sensitivity                0.7087       0.84211       0.7294
Specificity                0.8868       0.83981       0.8950
Pos Pred Value             0.5488       0.14988       0.9663
Neg Pred Value             0.9400       0.99373       0.4450
Prevalence                 0.1627       0.03245       0.8049
Detection Rate             0.1153       0.02733       0.5871
Detection Prevalence       0.2101       0.18232       0.6076
Balanced Accuracy          0.7977       0.84096       0.8122

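I interpret this output as saying that the model has an accuracy of 73%, and that the classifier makes many mistakes on medium and large but gets small mostly right. Does the P-value indicate that the model is not significant?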
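
For context on the P-value question: the Overall Statistics block matches the format printed by caret's confusionMatrix, whose documentation describes P-Value [Acc > NIR] as a one-sided test of whether the accuracy is better than the No Information Rate, taken to be the share of the largest class. The sketch below runs that kind of test with base R's binom.test. The sample size n = 2342 is an assumption (it is the total of the OOB confusion matrix; the statistics block does not state its own n), and n_correct is reconstructed from the rounded accuracy, so the result is only approximate.

```r
# Values copied from the Overall Statistics block above.
accuracy <- 0.7297
nir      <- 0.8049   # No Information Rate: prevalence of the largest class

# ASSUMPTION: n is taken from the OOB confusion matrix total; the
# statistics block itself does not report its sample size.
n <- 2342
n_correct <- round(accuracy * n)

# One-sided exact binomial test: is the accuracy greater than the NIR?
test <- binom.test(n_correct, n, p = nir, alternative = "greater")
print(test$p.value)
```

A p-value near 1 here means there is no evidence that the classifier is more accurate than always predicting the largest class. It is a comparison against that baseline, not a significance test of the model as a whole.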

Assuming this accuracy is acceptable in my context, how can I validate this model beyond these simple observations?
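
One common way to go beyond a single confusion matrix is stratified k-fold cross-validation with per-class recall, compared against a trivial baseline. The sketch below is an illustration under stated assumptions, not the pipeline used above: the helper names (stratified_folds, cross_validate, baseline_fit, baseline_pred) are hypothetical, the data frame is synthetic, and a majority-class baseline stands in for the model so that the example runs in base R. The commented lines show how a randomForest fit could be plugged in instead.

```r
# Assign each row to one of k folds so that every class is spread evenly.
stratified_folds <- function(y, k = 5, seed = 1) {
  set.seed(seed)
  folds <- integer(length(y))
  for (cls in unique(as.character(y))) {
    idx <- which(y == cls)
    # Shuffle the rows of this class, then cycle fold ids 1..k over them.
    folds[idx[sample.int(length(idx))]] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Fit on k-1 folds, predict the held-out fold, and pool the predictions
# into one confusion table (rows = actual, columns = predicted).
cross_validate <- function(df, target, fit, pred, k = 5) {
  y <- df[[target]]
  folds <- stratified_folds(y, k)
  out <- factor(rep(NA_character_, nrow(df)), levels = levels(y))
  for (i in seq_len(k)) {
    held_out <- folds == i
    model <- fit(df[!held_out, , drop = FALSE])
    out[held_out] <- as.character(pred(model, df[held_out, , drop = FALSE]))
  }
  table(actual = y, predicted = out)
}

# SYNTHETIC data, for illustration only (not the companies dataset).
set.seed(42)
n <- 300
df <- data.frame(
  size = factor(
    sample(c("large", "medium", "small"), n, replace = TRUE,
           prob = c(0.2, 0.2, 0.6)),
    levels = c("large", "medium", "small")
  ),
  capital = rlnorm(n)
)

# Majority-class baseline: always predict the most frequent training class.
baseline_fit  <- function(d) names(which.max(table(d$size)))
baseline_pred <- function(model, newdata) rep(model, nrow(newdata))

# To evaluate a random forest instead, swap in (requires randomForest):
# rf_fit  <- function(d) randomForest::randomForest(size ~ ., data = d)
# rf_pred <- function(model, newdata) predict(model, newdata)

cv_cm  <- cross_validate(df, "size", baseline_fit, baseline_pred)
recall <- diag(cv_cm) / rowSums(cv_cm)   # per-class recall
print(cv_cm)
print(round(recall, 3))
```

Comparing the per-class recall of the real model with that of the baseline shows whether the model adds anything for the minority classes, which overall accuracy alone can hide.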