Title: | One-to-One Feature Matching |
---|---|
Description: | Statistical methods to match feature vectors between multiple datasets in a one-to-one fashion. Given a fixed number of classes/distributions, for each unit, exactly one vector of each class is observed without label. The goal is to label the feature vectors using each label exactly once so as to produce the best match across datasets, e.g. by minimizing the variability within classes. Statistical solutions based on empirical loss functions and probabilistic modeling are provided. The 'Gurobi' software and its 'R' interface package are required for one of the package functions (match.2x()) and can be obtained at <https://www.gurobi.com/> (free academic license). For more details, refer to Degras (2022) <doi:10.1080/10618600.2022.2074429> "Scalable feature matching for large data collections" and Bandelt, Maas, and Spieksma (2004) <doi:10.1057/palgrave.jors.2601723> "Local search heuristics for multi-index assignment problems with decomposable costs". |
Authors: | David Degras |
Maintainer: | David Degras <[email protected]> |
License: | GPL-2 |
Version: | 1.0 |
Built: | 2025-03-08 04:46:10 UTC |
Source: | https://github.com/cran/matchFeat |
This package serves to match feature vectors across a collection of datasets
in a one-to-one fashion. This task is formulated as a multidimensional assignment problem with decomposable costs (MDADC).
We propose fast algorithms with time complexity roughly linear in the number of datasets and space complexity a small fraction of the data size.
Initialization methods: match.rec (recursive) and match.template (template-based).
Main matching algorithms: match.bca, match.bca.gen (for unbalanced data), and match.kmeans (k-means matching).
Refinement methods (post-processing): match.2x (pairwise interchange) and match.gaussmix (Gaussian mixture model with permutation constraints).
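As a quick illustration, the following minimal sketch (assuming the optdigits example data shipped with the package; the control$sigma warm start follows the match.bca documentation below) initializes with a fast method and then refines by block coordinate ascent:

data(optdigits)
n <- length(unique(optdigits$unit))            # 100 statistical units
init <- match.template(optdigits$x, unit = n)  # fast template-based start
fit <- match.bca(optdigits$x, unit = n,
                 control = list(sigma = init$sigma))  # BCA refinement
fit$objective  # should be no larger than init$objective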
Author: David Degras
Maintainer: David Degras <[email protected]>
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
Wright (2015). Coordinate descent algorithms.
https://arxiv.org/abs/1502.04759
McLachlan and Krishnan (2008). The EM Algorithm and Extensions
This function implements the Pairwise Interchange Heuristic for the multidimensional assignment problem with decomposable costs (MDADC).
match.2x(x, sigma = NULL, unit = NULL, w = NULL, control = list())
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n), where m is the number of labels/classes, n the number of units, and p the number of variables
sigma: optional starting permutations (m x n matrix)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the loss function: single positive number, vector of length p, or p x p positive definite matrix
control: optional list of tuning parameters
Use of this function requires the GUROBI software and its R interface package to be installed. Both can be downloaded from https://www.gurobi.com after obtaining a license (free for academics).
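Since Gurobi may not be installed on every system, here is a hedged sketch of a guarded call; the helper run.match is hypothetical and simply falls back to match.bca, which has no external dependency:

## Hypothetical helper: use match.2x if the 'gurobi' package is available,
## otherwise fall back to match.bca.
run.match <- function(x) {
  if (requireNamespace("gurobi", quietly = TRUE)) {
    match.2x(x)
  } else {
    match.bca(x)
  }
}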
A list of class matchFeat with components
sigma: best assignment as a set of permutations (m x n matrix)
cluster: best assignment as a cluster membership vector
objective: minimum objective value
mu: mean vector for each class/label (p x m matrix)
V: covariance matrix for each class/label (p x p x m array)
call: function call
Degras (2022) "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
match.bca, match.bca.gen, match.gaussmix, match.kmeans, match.rec, match.template
if (require(gurobi)) {
  ## Generate small example
  m <- 3   # number of classes
  n <- 10  # number of statistical units
  p <- 5   # number of variables
  mu <- matrix(rnorm(p*m), p, m)  # mean vectors
  sigma <- 0.1                    # noise level
  x <- array(as.vector(mu) + rnorm(p*m*n, sd = sigma), c(p, m, n))
  ## Match all feature vectors
  result <- match.2x(x)
  ## Display results
  result$objective  # minimum objective value
  result$sigma      # solution permutations
  ## Matched feature vectors
  xmatched <- array(dim = dim(x))
  for (i in 1:n) xmatched[,,i] <- x[, result$sigma[,i], i]
}
This function solves the multidimensional assignment problem with decomposable costs (MDADC) by block coordinate ascent. The dissimilarity function is the squared Euclidean distance.
match.bca(x, unit = NULL, w = NULL, method = c("cyclical", "random"), control = list())
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the loss function: single positive number, vector of length p, or p x p positive definite matrix
method: sweeping method for block coordinate ascent: "cyclical" or "random"
control: optional list of tuning parameters
Given a set of n statistical units, each having m possibly mislabeled feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. The objective function to minimize is the sum of (weighted) squared Euclidean distances between all pairs of feature vectors having the same (new) label. This amounts to minimizing the sum of the within-label variances.
The sample means and sample covariances of the matched feature vectors are calculated as a post-processing step.
The block coordinate ascent (BCA) algorithm successively sweeps through the statistical units (= blocks), each time relabeling the feature vectors of one unit to best match those of the other n-1 units.
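For instance, this small sketch (assuming the optdigits data; objective.fun is documented later in this manual) compares the minimized objective with the objective of random permutations:

data(optdigits)
m <- 10; n <- 100
fit <- match.bca(optdigits$x, unit = n)
rand.sigma <- replicate(n, sample.int(m))  # random permutations
objective.fun(optdigits$x, fit$sigma, optdigits$unit)   # minimized value
objective.fun(optdigits$x, rand.sigma, optdigits$unit)  # typically much larger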
If x is a matrix, its rows should be sorted by increasing unit label, and unit should be a nondecreasing sequence of integers, for example 1, ..., 1, 2, ..., 2, ..., n, ..., n with each integer replicated m times.
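A minimal sketch of the two accepted data layouts (the dimensions are illustrative):

p <- 4; m <- 2; n <- 3
xa <- array(rnorm(p*m*n), c(p, m, n))  # array form: p x m x n
xm <- t(matrix(xa, p, m*n))            # matrix form: (mn) x p, rows sorted by unit
unit <- rep(1:n, each = m)             # nondecreasing unit labels
fit1 <- match.bca(xa)
fit2 <- match.bca(xm, unit = unit)     # same problem in matrix form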
The argument w can be specified as a vector of positive numbers (recycled to length p if needed) or as a positive definite matrix of size p x p.
The optional argument control is a list with three fields: sigma, a starting point for the optimization (m x n matrix of permutations); maxit, the maximum number of iterations; and equal.variance, a logical value specifying whether the returned sample covariance matrices V for matched features should be equal across labels/classes (TRUE) or label-specific (FALSE, default).
A list of class matchFeat with components
sigma: best set of permutations for the feature vectors (m x n matrix)
cluster: associated clusters (= inverse permutations)
objective: minimum objective value
mu: sample mean for each class/label (p x m matrix)
V: sample covariance for each class/label (p x p x m array, or p x p matrix if equal.variance is TRUE)
call: function call
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
Wright (2015). Coordinate descent algorithms. https://arxiv.org/abs/1502.04759
match.2x, match.bca.gen, match.gaussmix, match.kmeans, match.rec, match.template
data(optdigits)
m <- length(unique(optdigits$label))  # number of classes
n <- nrow(optdigits$x) / m            # number of units
## Use function with data in matrix form
fit1 <- match.bca(optdigits$x, unit = n)
## Use function with data in array form
p <- ncol(optdigits$x)
x <- t(optdigits$x)
dim(x) <- c(p, m, n)
fit2 <- match.bca(x)
This function solves the one-to-one feature matching problem by block coordinate ascent; unlike match.bca, it also handles unbalanced data.
match.bca.gen(x, unit = NULL, cluster = NULL, w = NULL, method = c("cyclical", "random"), control = list())
x: data matrix (rows = instances, columns = features)
unit: vector of unit labels (length = number of rows of x)
cluster: integer specifying the number of classes/clusters to assign the feature vectors to, OR integer vector specifying the initial cluster assignment
w: feature weights in the loss function; can be specified as a single positive number, a vector, or a positive definite matrix
method: sweeping method for block coordinate ascent: "cyclical" or "random"
control: optional list of tuning parameters
If cluster is an integer vector, it must have the same length as unit and its values must range between 1 and the number of clusters.
The list control can contain a field maxit, an integer that fixes the maximum number of algorithm iterations.
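As a hedged sketch (using the balanced optdigits data, whose rows come in blocks of 10 per unit, so cycling through 1..10 gives each unit distinct initial clusters), an explicit initial assignment can be passed in place of the number of clusters:

data(optdigits)
m <- 10
init.cluster <- rep_len(1:m, nrow(optdigits$x))  # naive but valid start
fit <- match.bca.gen(optdigits$x, optdigits$unit, init.cluster)
fit$objective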
A list of class matchFeat with components
cluster: integer vector of cluster assignments (length = nrow(x))
objective: minimum objective value
mu: sample mean for each cluster/class (feature-by-cluster matrix)
V: sample covariance for each cluster/class (feature-by-feature-by-cluster 3D array)
size: integer vector of cluster sizes
call: function call
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
Wright (2015). Coordinate descent algorithms.
https://arxiv.org/abs/1502.04759
match.2x, match.bca, match.gaussmix, match.kmeans, match.rec, match.template
data(optdigits)
nobs <- nrow(optdigits$x)                 # total number of observations
n <- length(unique(optdigits$unit))       # number of statistical units
rmv <- sample.int(nobs, n-1)              # remove (n-1) observations to make the data unbalanced
min.m <- max(table(optdigits$unit[-rmv])) # smallest valid number of clusters
                                          # (lower values will result in an error)
m <- min.m
result <- match.bca.gen(optdigits$x[-rmv,], optdigits$unit[-rmv], m)
This function performs maximum likelihood estimation (MLE) for the one-to-one feature matching problem represented as a multivariate Gaussian mixture model. MLE is carried out via the EM algorithm.
match.gaussmix(x, unit = NULL, mu = NULL, V = NULL, equal.variance = FALSE, method = c("exact", "approx"), fixed = FALSE, control = list())
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
mu: matrix of initial estimates of the mean vectors (dimension p x m)
V: array of initial estimates of the covariance matrices (dimension p x p x m)
equal.variance: logical; if TRUE, a common covariance matrix is assumed for all classes/labels
method: method for calculating the class probabilities of the feature vectors: "exact" or "approx"
fixed: logical; if TRUE, mu and V are kept fixed at their initial values and only the class probabilities are estimated
control: list of tuning parameters
Given a sample of n statistical units, each having m possibly mislabeled feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. This problem is sometimes referred to as "data association ambiguity".
The feature vectors of all units are represented as independent realizations of m multivariate normal distributions with unknown parameters. For each sample unit, exactly one vector from each distribution is observed and the m corresponding labels are randomly permuted. The goal is to estimate the true class of each feature vector, as well as the mean vector and covariance matrix of each distribution. These quantities are evaluated by ML estimation via the Expectation-Maximization (EM) algorithm.
If x is a matrix, its rows should be sorted by increasing unit label, and unit should be a nondecreasing sequence of integers, for example 1, ..., 1, 2, ..., 2, ..., n, ..., n with each integer replicated m times.
The arguments mu and V should be specified only if good guesses are available for these parameters. Otherwise, bad starting values may cause the EM algorithm to converge to a local maximum of the likelihood function far from the global maximum.
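One hedged way to obtain such starting values is to warm-start the EM algorithm from a fast combinatorial fit, e.g.:

set.seed(1)
p <- 5; m <- 3; n <- 50
mu0 <- matrix(rnorm(p*m), p, m)
x <- array(as.vector(mu0) + rnorm(p*m*n, sd = 0.1), c(p, m, n))
fit0 <- match.bca(x)  # fast combinatorial fit
## Use its sample means (p x m) and covariances (p x p x m) as EM start
fit <- match.gaussmix(x, mu = fit0$mu, V = fit0$V)
fit$loglik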
If method is set to "exact" (default), the class probabilities of the feature vectors (given the data) are calculated exactly at each iteration of the EM algorithm. This operation can be slow as it involves calculating matrix permanents. The argument method can be set to "approx" to speed up calculations, but this option is not recommended in general as the approximations used are very crude and may produce "bad" EM solutions.
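A small sketch contrasting the two settings on synthetic data (timings are indicative only):

set.seed(2)
p <- 3; m <- 3; n <- 20
x <- array(rnorm(p*m*n), c(p, m, n))
system.time(fit.exact <- match.gaussmix(x, method = "exact"))
system.time(fit.approx <- match.gaussmix(x, method = "approx"))
c(exact = fit.exact$loglik, approx = fit.approx$loglik)  # compare solutions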
The optional argument control can be specified with these fields: maxit, maximum number of EM iterations (default 1e4); eps, relative tolerance for EM convergence (default 1e-8), the EM algorithm stops if the relative increase in log-likelihood between two iterations is less than this tolerance; verbose, set to TRUE to display algorithm progress (default FALSE).
A list of class matchFeat with fields
sigma: permutations that best match the feature vectors across units (m x n matrix)
cluster: associated clusters (= inverse permutations)
P: conditional probability that a feature vector is assigned to its 'true' label (m x n matrix)
mu: MLE of the true mean vectors (p x m matrix)
V: MLE of the true covariance matrices (p x p x m array, or p x p matrix if equal.variance = TRUE)
loglik: maximum value of the log-likelihood
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
McLachlan and Krishnan (2008). The EM Algorithm and Extensions. Wiley.
match.2x, match.bca, match.bca.gen, match.kmeans, match.rec, match.template
data(optdigits)
x <- optdigits$x
label <- optdigits$label
m <- length(unique(label))
n <- length(unique(optdigits$unit))
## Randomly permute labels to make the problem harder
for (i in 1:n) {
  idx <- seq.int((i-1) * m + 1, i * m)
  sigma <- sample.int(m)
  x[idx,] <- x[idx[sigma],]
  label[idx] <- label[idx[sigma]]
}
## Fit Gaussian mixture model
fit <- match.gaussmix(x, unit = n)
## Calculate Rand index
Rand.index(fit$cluster, label)
This function matches collections of feature vectors in a one-to-one fashion using a k-means-like method.
match.kmeans(x, unit = NULL, w = NULL, method = c("hungarian", "bruteforce"), control = list())
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the (squared Euclidean) loss function; can be specified as a single positive number, a vector of length p, or a p x p positive definite matrix
method: method for the linear assignment problem: "hungarian" (default) or "bruteforce"
control: optional list of tuning parameters
Given a set of n units or datasets, each having m unlabeled feature vectors, the one-to-one matching problem is to find a set of labels that produce the best match of feature vectors across units. The objective function to minimize is the sum of (weighted) squared Euclidean distances between all pairs of feature vectors having the same label. This amounts to minimizing the sum of the within-label variances.
The sample means and sample covariances of the matched feature vectors are calculated as a post-processing step.
If x is a matrix, its rows should be sorted by increasing unit label, and unit should be a nondecreasing sequence of integers, for example 1, ..., 1, 2, ..., 2, ..., n, ..., n with each integer replicated m times.
The argument w can be specified as a vector of positive numbers (recycled to length p if needed) or as a positive definite matrix of size p x p.
The optional argument control is a list with three fields: sigma, a starting point for the optimization (m x n matrix of permutations); maxit, the maximum number of iterations; and equal.variance, a logical value specifying whether the returned sample covariance matrices V for matched features should be equal across labels/classes (TRUE) or label-specific (FALSE, default).
A list of class matchFeat with components
sigma: best set of permutations for the feature vectors (m x n matrix)
cluster: associated clusters (= inverse permutations)
objective: minimum objective value
mu: sample mean for each class/label (p x m matrix)
V: sample covariance for each class/label (p x p x m array)
call: function call
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/Assignment_problem
https://en.wikipedia.org/wiki/Hungarian_algorithm
match.2x, match.bca, match.bca.gen, match.gaussmix, match.rec, match.template
## Generate data
m <- 3
n <- 10
p <- 5
mu <- matrix(rnorm(p*m), p, m)
sigma <- 0.1
x <- array(as.vector(mu) + rnorm(p*m*n, sd = sigma), c(p, m, n))
## Match all feature vectors
result <- match.kmeans(x)
## Display results
result$objective  # objective value
## Re-arranged (matched) feature vectors
xmatched <- array(dim = dim(x))
for (i in 1:n) xmatched[,,i] <- x[, result$sigma[,i], i]
This function implements the RECUR1 algorithm of Bandelt et al. (2004) to find a starting point for the multidimensional assignment problem with decomposable costs (MDADC).
match.rec(x, unit = NULL, w = NULL, control = list())
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the loss function: single positive number, vector of length p, or p x p positive definite matrix
control: optional list of tuning parameters
A list of class matchFeat with components
sigma: best set of permutations for the feature vectors (m x n matrix)
cluster: associated clusters (= inverse permutations)
cost: minimum objective value
mu: sample mean for each class/label (p x m matrix)
V: sample covariance for each class/label (p x p x m array)
call: function call
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
Bandelt, Maas, and Spieksma (2004), "Local search heuristics for multi-index assignment problems with decomposable costs." doi:10.1057/palgrave.jors.2601723
match.2x, match.bca, match.gaussmix, match.kmeans, match.template
data(optdigits)
m <- length(unique(optdigits$label))  # number of classes
n <- nrow(optdigits$x) / m            # number of units
## Use function with data in matrix form
fit1 <- match.rec(optdigits$x, unit = n)
## Use function with data in array form
p <- ncol(optdigits$x)
x <- t(optdigits$x)
dim(x) <- c(p, m, n)
fit2 <- match.rec(x)
This function solves the multidimensional assignment problem with decomposable costs (MDADC) by matching the data to a pre-specified set of vectors (the template). The dissimilarity function is the squared Euclidean distance.
match.template(x, template = 1L, unit = NULL, w = NULL, method = c("hungarian","bruteforce"), equal.variance = FALSE)
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
template: integer (= which sample unit to take as template) or a matrix of feature vectors to use as the template
unit: integer (= number of units/datasets) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the loss function; can be specified as a single positive number, a vector of length p, or a p x p positive definite matrix
method: method for the linear assignment problem: "hungarian" (default) or "bruteforce"
equal.variance: logical; if TRUE, resp. FALSE, return a common, resp. label-specific, covariance of the matched features
Given n datasets or statistical units, each containing m feature vectors, the one-to-one matching problem is to find a set of n label permutations that produce the best match of feature vectors across units. The objective function to minimize is the sum of squared (Euclidean) distances between all feature vectors having the same (new) label. This amounts to minimizing the sum of the within-label variances.
The template-based method consists in successively relabeling each sample unit to best match a template matrix of feature vectors. This method is very fast, but its optimization performance is only as good as the template. For best results, the template should be representative of the collected data. A simple strategy for choosing a template is sketched below.
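Since performance depends on the template, a simple hedged strategy is to try a few units as templates and keep the fit with the smallest objective:

set.seed(3)
p <- 5; m <- 3; n <- 20
mu0 <- matrix(rnorm(p*m), p, m)
x <- array(as.vector(mu0) + rnorm(p*m*n, sd = 0.1), c(p, m, n))
## Try the first five units as templates and keep the best fit
fits <- lapply(1:5, function(i) match.template(x, template = i))
best <- fits[[which.min(sapply(fits, "[[", "objective"))]]
best$objective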
If x is a matrix, its rows should be sorted by increasing unit label, and unit should be a nondecreasing sequence of integers, for example 1, ..., 1, 2, ..., 2, ..., n, ..., n with each integer replicated m times.
The argument w can be specified as a vector of positive numbers (recycled to length p if needed) or as a positive definite matrix of size p x p.
A list of class matchFeat with fields
sigma: best assignment as a set of permutations (m x n matrix)
cluster: best assignment as cluster indicators (m x n matrix)
objective: minimum objective value
mu: mean vector for each class/label (p x m matrix)
V: covariance matrix for each class/label (p x p x m array if equal.variance is FALSE, p x p matrix otherwise)
call: function call
Degras (2022) "Scalable feature matching across large data collections."
doi:10.1080/10618600.2022.2074429
https://en.wikipedia.org/wiki/Assignment_problem
https://en.wikipedia.org/wiki/Hungarian_algorithm
match.2x, match.bca, match.bca.gen, match.gaussmix, match.kmeans, match.rec
## Generate data
n <- 10
k <- 3
d <- 5
mu <- matrix(1:k, nrow = d, ncol = k, byrow = TRUE)
sigma <- 0.3
x <- array(mu, c(d, k, n)) + rnorm(d*k*n, sd = sigma)
## Match all feature vectors with first unit as template
result <- match.template(x, 1)
## Display results
result$objective  # minimum objective value
## Re-arranged (matched) feature vectors
xmatched <- array(dim = dim(x))
for (i in 1:n) xmatched[,,i] <- x[, result$sigma[,i], i]
Calculates the objective value in the multidimensional assignment problem with decomposable costs (MDADC). The dissimilarity function used in this problem is the squared Euclidean distance.
objective.fun(x, sigma = NULL, unit = NULL, w = NULL)
x: data in matrix form (dimensions (mn) x p, rows sorted by unit) or 3D array form (dimensions p x m x n)
sigma: permutations (m x n matrix)
unit: integer (= number of units) or vector mapping the rows of x to units (only needed when x is a matrix)
w: weights for the loss function: single positive number, vector of length p, or p x p positive definite matrix
Given $n$ datasets, each containing $m$ vectors of the same dimension $p$, say $x_{i,l}$ for $i = 1, \dots, n$ and $l = 1, \dots, m$, and $n$ permutations $\sigma_1, \dots, \sigma_n$ of $\{1, \dots, m\}$, the function calculates

$$\sum_{i,j} \sum_{l} \left\| x_{i,\sigma_i(l)} - x_{j,\sigma_j(l)} \right\|^2$$

where $i$ and $j$ run from 1 to $n$ and $l$ runs from 1 to $m$. This is the objective value (1) of Degras (2022), up to a constant factor.
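A brute-force sketch of this double sum (proportional to the value returned by objective.fun, given the constant factor mentioned above):

p <- 2; m <- 3; n <- 4
x <- array(rnorm(p*m*n), c(p, m, n))
sigma <- matrix(1:m, m, n)  # identity permutations
total <- 0
for (i in 1:n) for (j in 1:n) for (l in 1:m)
  total <- total + sum((x[, sigma[l,i], i] - x[, sigma[l,j], j])^2)
total  # compare with objective.fun(x, sigma), up to a constant factor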
Objective value
Degras (2022) "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
data(optdigits)
m <- 10
n <- 100
sigma <- matrix(1:m, m, n)  # identity permutations
objective.fun(optdigits$x, sigma, optdigits$unit)
Calculates the objective value in the multidimensional assignment problem with decomposable costs (MDADC). The dissimilarity function used in this problem is the squared Euclidean distance. The data can be balanced OR unbalanced.
objective.gen.fun(x, unit, cluster)
x: data matrix with feature vectors in rows
unit: vector of unit labels (length should equal the number of rows in x)
cluster: vector of cluster labels (length should equal the number of rows in x)
See equation (2) in Degras (2022). This function gives the same value as objective.fun when the data are balanced.
Objective value
Degras (2022) "Scalable feature matching across large data collections." doi:10.1080/10618600.2022.2074429
data(optdigits)
m <- 10
n <- 100
## Balanced example: both 'objective.fun' and 'objective.gen.fun' work
sigma <- matrix(1:m, m, n)
cluster <- rep(1:m, n)
objective.fun(optdigits$x, sigma, optdigits$unit)
objective.gen.fun(optdigits$x, optdigits$unit, cluster)
## Unbalanced example
idx <- 1:999
objective.gen.fun(optdigits$x[idx,], optdigits$unit[idx], cluster[idx])
Digitized images of handwritten digits used in optical recognition tasks
data("optdigits")
data("optdigits")
The format is:
List of 3
$ x : int [1:1000, 1:64] 0 0 0 0 0 0 0 0 0 0 ...
$ label: int [1:1000] 0 1 2 3 4 5 6 7 8 9 ...
$ unit : int [1:1000] 1 1 1 1 1 1 1 1 1 1 ...
This is a subset of a larger dataset of handwritten digits contributed by 30 people on a preprinted form. The forms were converted to normalized bitmaps of size 32x32, which were divided into nonoverlapping blocks of size 4x4. The number of pixels was counted in each block, producing an 8x8 matrix with integer entries ranging from 0 to 16.
These matrices are vectorized in the rows of optdigits$x. The corresponding digits are in optdigits$label. 100 examples are available for each digit 0-9.
UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Alpaydin and Kaynak (1998). Cascading Classifiers. ftp://ftp.icsi.berkeley.edu/pub/ai/ethem/kyb.ps.Z
data(optdigits)
## Quick visualization
oldpar <- par(no.readonly = TRUE)
par(mfrow = c(2, 5))
for (i in 1:10) {
  mat <- matrix(optdigits$x[i,], 8, 8)
  image(mat[, 8:1], xaxt = "n", yaxt = "n")
  title(optdigits$label[i])
}
par(oldpar)
predict method for class "matchFeat"
## S3 method for class 'matchFeat'
predict(object, newdata, unit = NULL, ...)
object: an object of class "matchFeat"
newdata: new dataset of feature vectors
unit: unit labels for the new data; only needed if newdata is in matrix form
...: for compatibility with the generic predict method; not currently used
The function predict.matchFeat finds the best matching for new feature vectors relative to an existing set of cluster/class centers. If object results from a call to match.gaussmix, the same function is used for prediction (with fixed mean vectors and covariance matrices). In other cases, the function match.template is used for prediction.
A list of class matchFeat with fields
sigma: best matching as a set of permutations (m x n matrix)
cluster: best matching as cluster indicators (m x n matrix)
objective: minimum objective value
mu: mean vector for each class/label (p x m matrix)
V: covariance matrix for each class/label (p x p x m array if equal.variance is FALSE, p x p matrix otherwise)
call: function call
print.matchFeat, summary.matchFeat
data(optdigits)
train.result <- match.bca(optdigits$x[1:900,], optdigits$unit[1:900])
test.result <- predict(train.result, optdigits$x[901:1000,], optdigits$unit[901:1000])
test.result
print method for class "matchFeat"
## S3 method for class 'matchFeat'
print(x, ...)
x: an object of class "matchFeat"
...: for compatibility with the generic print method; not currently used
The function print.matchFeat concisely displays the information of an object of class "matchFeat". More precisely, it shows the function call and key information about the matching solution, such as the minimum objective value.
No return value, called for side effects
predict.matchFeat, summary.matchFeat
data(optdigits)
result <- match.bca(optdigits$x, optdigits$unit)
print(result)
Calculates the Rand Index between two partitions of a set
Rand.index(x, y)
x: first partition vector
y: second partition vector
The two vectors x and y must have equal length. Given a set S and two partitions X and Y of S, the Rand index is the proportion of pairs of elements of S (out of all pairs) that are either concordant in both X and Y (i.e., they belong to the same member of X and to the same member of Y) or discordant (i.e., not concordant) in both X and Y.
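A small sketch of this definition (rand.by.hand is a hypothetical helper, not part of the package), which should agree with Rand.index:

rand.by.hand <- function(x, y) {
  pairs <- lower.tri(diag(length(x)))
  same.x <- outer(x, x, "==")[pairs]  # pairs grouped together in x
  same.y <- outer(y, y, "==")[pairs]  # pairs grouped together in y
  mean(same.x == same.y)  # concordant or discordant in both partitions
}
x <- c(1, 1, 2, 2, 3)
y <- c(1, 2, 2, 2, 3)
rand.by.hand(x, y)  # should equal Rand.index(x, y)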
The Rand index (not adjusted for chance)
W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods." Journal of the American Statistical Association, 66(336), 846-850.
https://en.wikipedia.org/wiki/Rand_index
## Example 1
x <- sample.int(3, 20, replace = TRUE)
y <- sample.int(3, 20, replace = TRUE)
table(x, y)
Rand.index(x, y)

## Example 2
data(optdigits)
label <- optdigits$label
m <- length(unique(label))           # 10
n <- length(unique(optdigits$unit))  # 100
dim(label) <- c(m, n)
p <- ncol(optdigits$x)               # 64
x <- array(t(optdigits$x), c(p, m, n))
## Permute data and labels to make the problem harder
for (i in 1:n) {
  sigma <- sample.int(m)
  x[,,i] <- x[, sigma, i]
  label[,i] <- label[sigma, i]
}
## Compare Rand indices of matching methods
Rand.index(match.bca(x)$cluster, label)
Rand.index(match.rec(x)$cluster, label)
Rand.index(match.template(x)$cluster, label)
Rand.index(match.kmeans(x)$cluster, label)
summary method for class "matchFeat"
## S3 method for class 'matchFeat'
summary(object, ...)
object: an object of class "matchFeat"
...: additional arguments; not currently used
The function summary.matchFeat displays the contents of a matchFeat object in a readable format, including the function call and key information about the matching solution, such as the minimum objective value.
No return value, called for side effects
predict.matchFeat, print.matchFeat
data(optdigits)
result <- match.bca(optdigits$x, optdigits$unit)
summary(result)