quantificationlib.multiclass.knn module

PWKQuantifier: a quantifier based on K-Nearest Neighbors

class PWKQuantifier(n_neighbors=10, p=2, metric='minkowski', metric_params=None, verbose=0)[source]

Bases: WithoutClassifiers

Quantifier based on K-Nearest Neighbors, proposed by Barranquero et al. (2013)

It is an AC method in which the estimator is PWK, a weighted version of KNN in which each neighbor's weight depends on the proportion of its class in the training set. It is not derived from AC to allow decomposition.
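As an illustration of the weighting idea, the sketch below implements a class-weighted KNN in which each neighbor votes with a weight inversely proportional to its class's frequency in the training set. This is a minimal sketch of the concept, not the library's exact PWK estimator; the function name pwk_predict and the particular weighting scheme are illustrative assumptions:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def pwk_predict(X_train, y_train, X_test, n_neighbors=10):
        # Illustrative class-weighted KNN: rarer classes get larger votes.
        # The exact weighting used by the library's PWK estimator may differ.
        y_train = np.asarray(y_train)
        classes, counts = np.unique(y_train, return_counts=True)
        weights = counts.min() / counts                 # hypothetical weights
        label_to_idx = {c: i for i, c in enumerate(classes)}

        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
        _, neigh_idx = nn.kneighbors(X_test)            # (n_test, n_neighbors)

        preds = np.empty(len(X_test), dtype=classes.dtype)
        for i, idx in enumerate(neigh_idx):
            votes = np.zeros(len(classes))
            for j in idx:
                k = label_to_idx[y_train[j]]
                votes[k] += weights[k]
            preds[i] = classes[np.argmax(votes)]
        return preds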

Parameters:
  • n_neighbors (int, (default=10)) – Number of neighbors to use by default for kneighbors() queries.

  • p (int, default=2) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

  • metric (str or callable, default='minkowski') – The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. For a list of available metrics, see the documentation of DistanceMetric. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.

  • metric_params (dict, default=None) – Additional keyword arguments for the metric function.

  • verbose (int, optional, (default=0)) – The verbosity level. The default value, zero, means silent mode.
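For instance, a minimal instantiation sketch (the parameter values here are arbitrary):

    from quantificationlib.multiclass.knn import PWKQuantifier

    # Default behavior: Minkowski metric with p=2, i.e., Euclidean distance
    quantifier = PWKQuantifier()

    # Manhattan (l1) distance instead, with more neighbors
    quantifier_l1 = PWKQuantifier(n_neighbors=15, p=1)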

classes_

Class labels

Type:

ndarray, shape (n_classes, )

cm_

Confusion matrix. The true classes are in the rows and the predicted classes in the columns. So, for the binary case, the count of true negatives is cm_[0,0], false negatives is cm_[1,0], true positives is cm_[1,1], and false positives is cm_[0,1].

Type:

ndarray, shape (n_classes, n_classes)
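For example, for a binary problem this layout coincides with that of scikit-learn's confusion_matrix (rows are true classes, columns are predicted classes):

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 1]
    y_pred = [0, 1, 1, 1, 0]
    cm = confusion_matrix(y_true, y_pred)
    # cm[0, 0] -> true negatives   (1 here)
    # cm[0, 1] -> false positives  (1 here)
    # cm[1, 0] -> false negatives  (1 here)
    # cm[1, 1] -> true positives   (2 here)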

problem_

This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute will contain the corresponding cvxpy Problem object (if that library is used, i.e., in the case of ‘L1’ and ‘HD’). For the remaining testing bags, this object is reused to allow a warm start, which makes the solving process faster.

Type:

a cvxpy Problem object

verbose

The verbosity level

Type:

int

References

Jose Barranquero, Pablo González, Jorge Díez, Juan José del Coz: On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2):472-482, 2013.

fit(X, y)[source]

This method performs the following operations: 1) fits the estimators for the training set and the testing set (if needed), and 2) computes predictions_train_ (crisp values) if needed. Both operations are performed by the fit method of its superclass. Finally, the method computes the confusion matrix of the training set using predictions_train_.

Parameters:
  • X (array-like, shape (n_examples, n_features)) – Data

  • y (array-like, shape (n_examples, )) – True classes

predict(X)[source]

Predict the class distribution of a testing bag

The prevalences are computed by solving a system of linear scalar equations:

cm_.T * prevalences = CC(X)

For binary problems, the system is solved directly using the original AC algorithm proposed by Forman:

p = (p_0 - fpr) / (tpr - fpr)
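Here p_0 is the CC estimate, i.e., the fraction of examples in the bag predicted as positive. A minimal sketch of this adjustment, assuming the customary clipping of the result to [0, 1] (the function name forman_adjustment is illustrative):

    import numpy as np

    def forman_adjustment(cm, p_cc):
        # cm: 2x2 confusion matrix with true classes in rows.
        # p_cc: fraction of the bag predicted as positive (CC estimate).
        tpr = cm[1, 1] / (cm[1, 0] + cm[1, 1])
        fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])
        p = (p_cc - fpr) / (tpr - fpr)
        return float(np.clip(p, 0.0, 1.0))  # clipping to [0, 1] is an assumption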

For multiclass problems, the system may not have a solution. Thus, instead we propose to solve an optimization problem of this kind:

min distance(cm_.T * prevalences, CC(X))

s.t. sum(prevalences) = 1, prevalences_i >= 0

in which the distance is ‘L1’
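A sketch of this optimization with cvxpy is shown below, assuming cm is the row-normalized confusion matrix (each row giving P(predicted | true); normalize by row sums first if it holds raw counts). Building the Problem once and reusing it per bag mirrors the warm-start role of problem_; the helper name build_l1_problem is illustrative:

    import cvxpy as cp

    def build_l1_problem(C):
        # C: row-normalized confusion matrix, shape (n_classes, n_classes)
        n_classes = C.shape[0]
        p = cp.Variable(n_classes)          # prevalences to estimate
        cc = cp.Parameter(n_classes)        # CC(X) for the current bag
        objective = cp.Minimize(cp.norm1(C.T @ p - cc))
        constraints = [cp.sum(p) == 1, p >= 0]
        return cp.Problem(objective, constraints), p, cc

    # Per testing bag: set cc.value to CC(X), then call
    # problem.solve(warm_start=True) and read p.value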

Parameters:

X (array-like, shape (n_examples, n_features)) – Testing bag

Returns:

prevalences – Contains the predicted prevalence for each class

Return type:

ndarray, shape(n_classes, )
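A minimal end-to-end usage sketch with synthetic data (all data names are placeholders):

    import numpy as np
    from quantificationlib.multiclass.knn import PWKQuantifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 5))
    y_train = rng.integers(0, 3, size=200)   # three classes
    X_bag = rng.normal(size=(60, 5))         # a testing bag

    quantifier = PWKQuantifier(n_neighbors=10)
    quantifier.fit(X_train, y_train)
    prevalences = quantifier.predict(X_bag)  # ndarray, shape (n_classes,)
    # prevalences are nonnegative and sum to 1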