quantificationlib.multiclass.knn module
PWKQuantifier: a quantifier based on K-Nearest Neighbors
- class PWKQuantifier(n_neighbors=10, p=2, metric='minkowski', metric_params=None, verbose=0)
Bases: WithoutClassifiers
Quantifier based on K-Nearest Neighbors, proposed by (Barranquero et al., 2013)
It is an AC method in which the estimator is PWK, a weighted version of KNN whose weights depend on the proportion of each class in the training set (a toy sketch of this weighting idea follows the parameter list below). It is not derived from AC to allow decomposition.
- Parameters:
n_neighbors (int, default=10) – Number of neighbors to use by default for kneighbors() queries.
p (int, default=2) – Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric (str or callable, default='minkowski') – The distance metric to use for the tree. The default metric is minkowski, and with p=2 it is equivalent to the standard Euclidean metric. For a list of available metrics, see the documentation of DistanceMetric. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.
metric_params (dict, default=None) – Additional keyword arguments for the metric function.
verbose (int, optional, default=0) – The verbosity level. The default value, zero, means silent mode.
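To make the weighting idea above concrete, here is a minimal sketch of an inverse-frequency weighting scheme. It is illustrative only: the function name and the exact formula are assumptions, not necessarily the precise PWK weights of Barranquero et al. (2013).

    import numpy as np

    def inverse_frequency_weights(y):
        """One plausible inverse-frequency scheme: minority classes get
        larger weights so their neighbors are not outvoted.
        Illustrative only; not necessarily the exact PWK formula."""
        classes, counts = np.unique(y, return_counts=True)
        weights = counts.min() / counts
        return dict(zip(classes, weights))

    # Example: an imbalanced binary training set (80 negatives, 20 positives)
    y_train = np.array([0] * 80 + [1] * 20)
    print(inverse_frequency_weights(y_train))  # class 0 -> 0.25, class 1 -> 1.0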
- classes_
Class labels
- Type:
ndarray, shape (n_classes, )
- cm_
Confusion matrix. The true classes are in the rows and the predicted classes in the columns. So, for the binary case, the count of true negatives is cm_[0,0], false negatives is cm_[1,0], true positives is cm_[1,1], and false positives is cm_[0,1].
- Type:
ndarray, shape (n_classes, n_classes)
- problem_
This attribute is set to None in the fit() method. The first time a testing bag is predicted, this attribute will contain the corresponding cvxpy Problem object (if that library is used, i.e., in the case of ‘L1’ and ‘HD’). For subsequent testing bags, this object is passed again to allow a warm start, which makes the solving process faster.
- Type:
a cvxpy Problem object
- verbose
The verbosity level
- Type:
int
References
Jose Barranquero, Pablo González, Jorge Díez, Juan José del Coz: On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2):472-482, 2013.
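For orientation, a minimal end-to-end usage sketch follows. The import path is taken from this module's name; the synthetic data and variable names are illustrative assumptions, not part of the library's documentation.

    from sklearn.datasets import make_classification
    from quantificationlib.multiclass.knn import PWKQuantifier

    # Synthetic multiclass data; the second half plays the role of an
    # unlabeled testing bag whose class prevalences we want to estimate.
    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                               random_state=0)
    X_train, y_train = X[:600], y[:600]
    X_bag = X[600:]

    quantifier = PWKQuantifier(n_neighbors=10)
    quantifier.fit(X_train, y_train)
    prevalences = quantifier.predict(X_bag)  # ndarray, shape (n_classes,)
    print(prevalences, prevalences.sum())    # prevalences sum to 1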
- fit(X, y)
This method performs the following operations: 1) it fits the estimators for the training set and for the testing set (if needed), and 2) it computes predictions_train_ (crisp values) if needed. Both operations are performed by the fit method of its superclass. Finally, the method computes the confusion matrix of the training set using predictions_train_ (a toy illustration of this last step follows the parameter list).
- Parameters:
X (array-like, shape (n_examples, n_features)) – Data
y (array-like, shape (n_examples, )) – True classes
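The confusion-matrix step can be illustrated with scikit-learn's confusion_matrix, which uses the same rows-are-true / columns-are-predicted convention documented above. The toy arrays are assumptions standing in for the fitted attributes; this is a sketch of the idea, not the library's exact code.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Toy stand-ins for the fitted attributes described above.
    y_train = np.array([0, 0, 1, 1, 1, 2])
    predictions_train_ = np.array([0, 1, 1, 1, 0, 2])  # hypothetical crisp predictions

    classes_ = np.unique(y_train)
    cm_ = confusion_matrix(y_train, predictions_train_, labels=classes_)
    print(cm_)  # cm_[i, j]: examples of true class i predicted as class j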
- predict(X)
Predict the class distribution of a testing bag
The prevalences are computed solving a system of linear scalar equations:
cm_.T * prevalences = CC(X)
For binary problems, the system is solved directly using the original AC algorithm proposed by Forman:
p = (p_0 - fpr) / (tpr - fpr)
For multiclass problems, the system may not have a solution. Thus, instead, we propose to solve an optimization problem of this kind:
Min distance(cm_.T * prevalences, CC(X))
s.t. sum(prevalences) = 1, prevalences_i >= 0
in which the distance is ‘L1’ (a hedged sketch of both the binary and the multiclass case appears at the end of this section).
- Parameters:
X (array-like, shape (n_examples, n_features)) – Testing bag
- Returns:
prevalences – Contains the predicted prevalence for each class
- Return type:
ndarray, shape (n_classes, )
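To make both branches of predict concrete, here is a hedged sketch: the binary closed form above, and the multiclass L1 problem expressed with cvxpy (the library named in the problem_ attribute). The function names are assumptions, and the sketch assumes cm holds row-normalized rates P(predicted = j | true = i) and cc the classify-and-count prevalence estimate; it illustrates the idea, not the library's exact implementation.

    import numpy as np
    import cvxpy

    def adjust_binary(p_0, tpr, fpr):
        """Forman's adjustment: p = (p_0 - fpr) / (tpr - fpr), clipped to [0, 1]."""
        return float(np.clip((p_0 - fpr) / (tpr - fpr), 0.0, 1.0))

    def adjust_multiclass_l1(cm, cc):
        """Solve min ||cm.T @ p - cc||_1  s.t.  sum(p) = 1, p >= 0.

        cm: (n_classes, n_classes) row-normalized confusion matrix (assumption);
        cc: (n_classes,) classify-and-count estimate for the testing bag.
        """
        n_classes = cm.shape[0]
        p = cvxpy.Variable(n_classes)
        objective = cvxpy.Minimize(cvxpy.norm(cm.T @ p - cc, 1))
        constraints = [cvxpy.sum(p) == 1, p >= 0]
        cvxpy.Problem(objective, constraints).solve()
        return np.asarray(p.value).ravel()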