Title: | Multidimensional Cluster Generation Using Support Lines |
---|---|
Description: | An implementation of the clugen algorithm for generating multidimensional clusters with arbitrary distributions. Each cluster is supported by a line segment, the position, orientation and length of which guide where the respective points are placed. This package is described in Fachada & de Andrade (2023) <doi:10.1016/j.knosys.2023.110836>. |
Authors: | Nuno Fachada [aut, cre, cph] |
Maintainer: | Nuno Fachada <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.3.9000 |
Built: | 2025-02-10 04:45:28 UTC |
Source: | https://github.com/clugen/clugenr |
Typically, the angle between two vectors v1 and v2 can be obtained with:
acos((v1 %*% v2) / (norm(v1, "2") * norm(v2, "2")))
However, this approach is numerically unstable. The version provided here is numerically stable and based on the Angle Between Vectors Julia package by Jeffrey Sarnoff (MIT license), implementing an algorithm provided by Prof. W. Kahan in these notes (see page 15).
angle_btw(v1, v2)
v1 | First vector. |
v2 | Second vector. |
Angle between v1 and v2 in radians.
angle_btw(c(1.0, 1.0, 1.0, 1.0), c(1.0, 0.0, 0.0, 0.0)) * 180 / pi
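The numerically stable approach described above can be sketched in base R. This is a minimal illustration, assuming the usual form of Kahan's formula (twice the arctangent of the ratio between the norms of the difference and the sum of the normalized vectors); `angle_btw_sketch` is a hypothetical name, not the package's implementation.

```r
# Hypothetical helper illustrating a Kahan-style stable angle computation.
angle_btw_sketch <- function(v1, v2) {
  vec_norm <- function(v) sqrt(sum(v^2)) # Euclidean norm of a vector
  u1 <- v1 / vec_norm(v1)                # Normalize both inputs
  u2 <- v2 / vec_norm(v2)
  # atan2 avoids a division by zero when the vectors are antiparallel
  2 * atan2(vec_norm(u1 - u2), vec_norm(u1 + u2))
}

angle_btw_sketch(c(1, 1, 1, 1), c(1, 0, 0, 0)) * 180 / pi # approximately 60
```

Unlike the acos form, this expression never feeds a value slightly outside [-1, 1] into an inverse cosine, which is where rounding errors usually cause trouble for nearly parallel vectors.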
Determine the angles between the average cluster direction and the
cluster-supporting lines. These angles are obtained from a wrapped normal
distribution (\(\mu=0\), \(\sigma=\) angle_disp) with support in the interval
\(\left[-\pi/2,\pi/2\right]\). Note this is different from the standard wrapped
normal distribution, the support of which is given by the interval
\(\left[-\pi,\pi\right]\).
angle_deltas(num_clusters, angle_disp)
num_clusters | Number of clusters. |
angle_disp | Angle dispersion, in radians. |
Angles between the average cluster direction and the cluster-supporting lines, given in radians in the interval \(\left[-\pi/2,\pi/2\right]\).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
arad <- angle_deltas(4, pi / 8) # Angle dispersion of 22.5 degrees
arad                            # What angle deltas did we get?
arad * 180 / pi                 # Show angle deltas in degrees
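One way to picture a normal distribution wrapped to \(\left[-\pi/2,\pi/2\right]\) is to draw ordinary normal angles and fold them back into that interval with a period of \(\pi\). The sketch below does this with a modulo operation; the function names are hypothetical and the wrapping strategy is an assumption, not necessarily what the package does internally.

```r
# Hypothetical helper: wrap angles (radians) into [-pi/2, pi/2) with period pi.
wrap_half_pi <- function(a) ((a + pi / 2) %% pi) - pi / 2

# Hypothetical sketch of drawing wrapped-normal angle deltas.
angle_deltas_sketch <- function(num_clusters, angle_disp) {
  wrap_half_pi(stats::rnorm(num_clusters, mean = 0, sd = angle_disp))
}

set.seed(123)
angle_deltas_sketch(4, pi / 8) * 180 / pi # Angle deltas in degrees
```

A period of \(\pi\) is used because a line at angle \(\theta\) and one at \(\theta+\pi\) have the same orientation.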
Determine cluster centers using the uniform distribution, taking into account
the number of clusters (num_clusters) and the average cluster separation
(clu_sep).
More specifically, let \(c=\) num_clusters, \(\mathbf{s}=\) clu_sep,
\(\mathbf{o}=\) clu_offset, and \(n=\) length(clu_sep) (i.e., the number of
dimensions). Cluster centers are obtained according to the following equation:
\[\mathbf{C}=c\mathbf{U}\cdot\operatorname{diag}(\mathbf{s}) + \mathbf{1}\,\mathbf{o}^T\]
where \(\mathbf{C}\) is the \(c \times n\) matrix of cluster centers, \(\mathbf{U}\) is a \(c \times n\) matrix of random values drawn from the uniform distribution between -0.5 and 0.5, and \(\mathbf{1}\) is a \(c \times 1\) vector with all entries equal to 1.
clucenters(num_clusters, clu_sep, clu_offset)
num_clusters | Number of clusters. |
clu_sep | Average cluster separation (\(n \times 1\) vector). |
clu_offset | Cluster offsets (\(n \times 1\) vector). |
A \(c \times n\) matrix containing the cluster centers.
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(321)
clucenters(3, c(30, 10), c(-50, 50))
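The definitions above (a \(c \times n\) uniform matrix scaled per dimension by the separation, then shifted by the offset) translate fairly directly into matrix operations. The following base R sketch assumes that reading; `clucenters_sketch` is a hypothetical name used only for illustration.

```r
# Hypothetical sketch: uniform cluster centers scaled by separation and offset.
clucenters_sketch <- function(num_clusters, clu_sep, clu_offset) {
  n <- length(clu_sep)
  # c x n matrix of uniform values in [-0.5, 0.5]
  U <- matrix(stats::runif(num_clusters * n, min = -0.5, max = 0.5),
              nrow = num_clusters, ncol = n)
  ones <- matrix(1, nrow = num_clusters, ncol = 1)
  # diag(..., nrow = n) keeps the 1D case a 1 x 1 matrix
  num_clusters * U %*% diag(clu_sep, nrow = n) + ones %*% t(clu_offset)
}

set.seed(321)
clucenters_sketch(3, c(30, 10), c(-50, 50))
```

Under this reading, each center coordinate in dimension \(j\) lies within \(c\,s_j/2\) of the offset \(o_j\).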
This is the main function of clugenr, and possibly the only function most users will need.
clugen(
  num_dims,
  num_clusters,
  num_points,
  direction,
  angle_disp,
  cluster_sep,
  llength,
  llength_disp,
  lateral_disp,
  allow_empty = FALSE,
  cluster_offset = NA,
  proj_dist_fn = "norm",
  point_dist_fn = "n-1",
  clusizes_fn = clusizes,
  clucenters_fn = clucenters,
  llengths_fn = llengths,
  angle_deltas_fn = angle_deltas,
  seed = NA
)
num_dims | Number of dimensions. |
num_clusters | Number of clusters to generate. |
num_points | Total number of points to generate. |
direction | Average direction of the cluster-supporting lines. Can be a vector of length |
angle_disp | Angle dispersion of cluster-supporting lines (radians). |
cluster_sep | Average cluster separation in each dimension (vector of length |
llength | Average length of cluster-supporting lines. |
llength_disp | Length dispersion of cluster-supporting lines. |
lateral_disp | Cluster lateral dispersion, i.e., dispersion of points from their projection on the cluster-supporting line. |
allow_empty | Allow empty clusters? |
cluster_offset | Offset to add to all cluster centers (vector of length |
proj_dist_fn | Distribution of point projections along cluster-supporting lines, with three possible values: |
point_dist_fn | Controls how the final points are created from their projections on the cluster-supporting lines, with three possible values: |
clusizes_fn | Distribution of cluster sizes. By default, cluster sizes are determined by the clusizes function, which uses the normal distribution (\(\mu=\) |
clucenters_fn | Distribution of cluster centers. By default, cluster centers are determined by the clucenters function, which uses the uniform distribution, and takes into account the |
llengths_fn | Distribution of line lengths. By default, the lengths of cluster-supporting lines are determined by the llengths function, which uses the folded normal distribution (\(\mu=\) |
angle_deltas_fn | Distribution of line angle differences with respect to |
seed | An integer used to initialize the PRNG, allowing for reproducible results. If specified, |
If a custom function was given in the clusizes_fn parameter, it is
possible that num_points may have a different value than what was
specified in the num_points parameter.
The terms "average" and "dispersion" refer to measures of central tendency and statistical dispersion, respectively. Their exact meaning depends on the optional arguments.
A named list with the following elements:
points: A num_points x num_dims matrix with the generated points for all clusters.
clusters: A num_points factor vector indicating which cluster each point in points belongs to.
projections: A num_points x num_dims matrix with the point projections on the cluster-supporting lines.
sizes: A num_clusters x 1 vector with the number of points in each cluster.
centers: A num_clusters x num_dims matrix with the coordinates of the cluster centers.
directions: A num_clusters x num_dims matrix with the final direction of each cluster-supporting line.
angles: A num_clusters x 1 vector with the angles between the cluster-supporting lines and the main direction.
lengths: A num_clusters x 1 vector with the lengths of the cluster-supporting lines.
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
# 2D example
x <- clugen(2, 5, 1000, c(1, 3), 0.5, c(10, 10), 8, 1.5, 2)
graphics::plot(x$points, col = x$clusters, xlab = "x", ylab = "y", asp = 1)
# 3D example
x <- clugen(3, 5, 1000, c(2, 3, 4), 0.5, c(15, 13, 14), 7, 1, 2)
Merges the fields (specified in fields) of two or more data sets (passed as
lists). The fields to be merged need to have the same number of columns. The
corresponding merged field will contain the rows of the fields to be merged,
and will have a common "supertype".
clumerge(..., fields = c("points", "clusters"), clusters_field = "clusters")
... | One or more cluster data sets (in the form of lists) whose |
fields | Fields to be merged, which must exist in the data sets given in |
clusters_field | Field containing the integer cluster labels. If specified, cluster assignments in individual datasets will be updated in the merged dataset so that clusters are considered separate. |
The clusters_field parameter specifies a field containing integers that
identify the cluster to which the respective points belong. If clusters_field
is specified (by default it is set to "clusters"), cluster assignments in
individual datasets will be updated in the merged dataset so that clusters are
considered separate. This parameter can be set to NA, in which case no field
will be treated as a special cluster assignments field.
This function can be used to merge data sets generated with the clugen
function, by default merging the points and clusters fields in those data
sets. It also works with arbitrary data by specifying alternative fields in
the fields parameter. It can be used, for example, to merge third-party
data with clugen-generated data.
A list whose fields consist of the merged fields in the original data sets.
a <- clugen(2, 5, 100, c(1, 3), 0.5, c(10, 10), 8, 1.5, 2)
b <- clugen(2, 3, 250, c(-1, 3), 0.5, c(13, 14), 7, 1, 2)
ab <- clumerge(a, b)
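The relabeling idea described above (keeping clusters from different data sets separate after merging) can be illustrated with a small base R sketch. It assumes plain integer labels starting at 1 and handles only two data sets with points and clusters fields; `clumerge_sketch` is a hypothetical name, and the package itself may handle labels differently.

```r
# Hypothetical sketch: merge two data sets, offsetting the second set's labels.
clumerge_sketch <- function(a, b) {
  list(
    points = rbind(a$points, b$points), # Stack rows; column counts must match
    clusters = c(a$clusters, b$clusters + max(a$clusters))
  )
}

a <- list(points = matrix(1:6, ncol = 2), clusters = c(1L, 1L, 2L))
b <- list(points = matrix(7:10, ncol = 2), clusters = c(1L, 2L))
ab <- clumerge_sketch(a, b)
ab$clusters # 1 1 2 3 4
```

Offsetting by the largest label seen so far is what keeps cluster 1 of the second data set from being confused with cluster 1 of the first.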
Each point is placed around its projection using the normal distribution
(\(\mu=0\), \(\sigma=\) lat_disp).
clupoints_n(projs, lat_disp, line_len, clu_dir, clu_ctr)
projs | Point projections on the cluster-supporting line (\(p \times n\) matrix). |
lat_disp | Standard deviation for the normal distribution, i.e., cluster lateral dispersion. |
line_len | Length of cluster-supporting line (ignored). |
clu_dir | Direction of the cluster-supporting line. |
clu_ctr | Center position of the cluster-supporting line (ignored). |
This function's main intended use is by the main clugen function,
generating the final points when the point_dist_fn parameter is set to "n".
Generated points (\(p \times n\) matrix).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
ctr <- c(0, 0)
dir <- c(1, 0)
pdist <- c(-0.5, -0.2, 0.1, 0.3)
proj <- points_on_line(ctr, dir, pdist)
clupoints_n(proj, 0.01, NA, dir, NA)
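Placing each point "around its projection" with a normal distribution amounts to adding independent normal noise to every coordinate of every projection. A minimal base R sketch of that idea follows; `clupoints_n_sketch` is a hypothetical name and takes only the two arguments that matter here.

```r
# Hypothetical sketch: add N(0, lat_disp) noise to each coordinate.
clupoints_n_sketch <- function(projs, lat_disp) {
  p <- nrow(projs) # Number of points
  n <- ncol(projs) # Number of dimensions
  noise <- matrix(stats::rnorm(p * n, mean = 0, sd = lat_disp),
                  nrow = p, ncol = n)
  projs + noise
}

set.seed(123)
proj <- cbind(c(-0.5, -0.2, 0.1, 0.3), 0) # Projections on the x axis
clupoints_n_sketch(proj, 0.01)
```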
Each point is placed on a hyperplane orthogonal to the cluster-supporting line
and centered at the point's projection, using the normal distribution
(\(\mu=0\), \(\sigma=\) lat_disp).
clupoints_n_1(projs, lat_disp, line_len, clu_dir, clu_ctr)
projs | Point projections on the cluster-supporting line (\(p \times n\) matrix). |
lat_disp | Standard deviation for the normal distribution, i.e., cluster lateral dispersion. |
line_len | Length of cluster-supporting line (ignored). |
clu_dir | Direction of the cluster-supporting line. |
clu_ctr | Center position of the cluster-supporting line (ignored). |
This function's main intended use is by the main clugen function,
generating the final points when the point_dist_fn parameter is set to "n-1".
Generated points (\(p \times n\) matrix).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
ctr <- c(0, 0)
dir <- c(1, 0)
pdist <- c(-0.5, -0.2, 0.1, 0.3)
proj <- points_on_line(ctr, dir, pdist)
clupoints_n_1(proj, 0.1, NA, dir, NA)
Generate points from their \(n\)-dimensional projections on a
cluster-supporting line, placing each point on a hyperplane orthogonal to
that line and centered at the point's projection. The function specified in
dist_fn is used to perform the actual placement.
clupoints_n_1_template(projs, lat_disp, clu_dir, dist_fn)
projs | Point projections on the cluster-supporting line (\(p \times n\) matrix). |
lat_disp | Dispersion of points from their projection. |
clu_dir | Direction of the cluster-supporting line (unit vector). |
dist_fn | Function to place points on a second line, orthogonal to the first. |
This function is used internally by clupoints_n_1 and may be useful for
constructing user-defined final point placement strategies for the
point_dist_fn parameter of the main clugen function.
Generated points (\(p \times n\) matrix).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
ctr <- c(0, 0)
dir <- c(1, 0)
pdist <- c(-0.5, -0.2, 0.1, 0.3)
proj <- points_on_line(ctr, dir, pdist)
clupoints_n_1_template(proj, 0, dir, function(p, l) stats::runif(p))
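The hyperplane placement described above can be sketched as: for each projection, pick a random unit vector orthogonal to the line, scale it by a distance drawn from dist_fn, and add it to the projection. The base R code below assumes that reading, assumes at least two dimensions, and assumes dist_fn takes the number of points and the lateral dispersion (as in the example above); all function names are hypothetical.

```r
# Hypothetical helper: random unit vector orthogonal to unit vector u (n >= 2).
rand_ortho_sketch <- function(u) {
  repeat {
    r <- stats::rnorm(length(u))
    o <- r - sum(r * u) * u # Remove the component of r along u
    len <- sqrt(sum(o^2))
    if (len > 1e-8) return(o / len) # Retry if r was (nearly) parallel to u
  }
}

# Hypothetical sketch of placing points on hyperplanes orthogonal to clu_dir.
clupoints_n_1_sketch <- function(projs, lat_disp, clu_dir, dist_fn) {
  p <- nrow(projs)
  dists <- dist_fn(p, lat_disp) # One signed distance per point
  dirs <- t(vapply(seq_len(p), function(i) rand_ortho_sketch(clu_dir),
                   numeric(length(clu_dir)))) # p x n orthogonal directions
  projs + dists * dirs # Row i is shifted by dists[i] along dirs[i, ]
}

set.seed(123)
proj <- cbind(c(-0.5, -0.2, 0.1, 0.3), 0, 0)
clupoints_n_1_sketch(proj, 0.1, c(1, 0, 0),
                     function(p, l) stats::rnorm(p, sd = l))
```

Because every displacement is orthogonal to the line, each generated point still projects onto the line exactly where its original projection was.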
Cluster sizes are determined using the normal distribution
(\(\mu=\) num_points \(/\) num_clusters, \(\sigma=\mu/3\)), and then ensuring
that the final cluster sizes add up to num_points via the fix_num_points
function.
clusizes(num_clusters, num_points, allow_empty)
num_clusters | Number of clusters. |
num_points | Total number of points. |
allow_empty | Allow empty clusters? |
Number of points in each cluster (vector of length num_clusters).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
sizes <- clusizes(4, 1000, TRUE)
sizes
sum(sizes)
Ensures that, given enough points, no clusters are left empty. This is done
by removing a point from the largest cluster and adding it to an empty
cluster while there are empty clusters. If the total number of points is
smaller than the number of clusters (or if the allow_empty parameter is set
to TRUE), this function does nothing.
fix_empty(clu_num_points, allow_empty = FALSE)
clu_num_points | Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters. |
allow_empty | Allow empty clusters? |
This function is used internally by clusizes and might be useful for custom
cluster sizing implementations given as the clusizes_fn parameter of the
main clugen function.
Number of points in each cluster, after being fixed by this function (vector of size \(c\)).
clusters <- c(3, 4, 5, 0, 0)   # A vector with some empty elements
clusters <- fix_empty(clusters) # Apply this function
clusters                        # Check that there are no more empty elements
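The procedure described above (move one point at a time from the largest cluster to an empty one) can be written as a short loop. This is a base R sketch under that description; `fix_empty_sketch` is a hypothetical name, and ties for the largest cluster are broken by taking the first one, which is an assumption.

```r
# Hypothetical sketch: fill empty clusters from the largest cluster.
fix_empty_sketch <- function(clu_num_points, allow_empty = FALSE) {
  # Do nothing if empties are allowed or there are too few points
  if (!allow_empty && sum(clu_num_points) >= length(clu_num_points)) {
    while (any(clu_num_points == 0)) {
      empty <- which(clu_num_points == 0)[1]  # First empty cluster
      largest <- which.max(clu_num_points)    # First largest cluster
      clu_num_points[largest] <- clu_num_points[largest] - 1
      clu_num_points[empty] <- clu_num_points[empty] + 1
    }
  }
  clu_num_points
}

fix_empty_sketch(c(3, 4, 5, 0, 0)) # 3 3 4 1 1
```

The loop terminates because, whenever an empty cluster exists and there are at least as many points as clusters, the largest cluster holds at least two points.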
Ensures that the values in the clu_num_points vector, i.e. the number of
points in each cluster, add up to num_points. If this is not the case,
clu_num_points is adjusted by incrementing the value corresponding to the
smallest cluster while sum(clu_num_points) < num_points, or decrementing the
value corresponding to the largest cluster while
sum(clu_num_points) > num_points.
fix_num_points(clu_num_points, num_points)
clu_num_points | Number of points in each cluster (vector of size \(c\)), where \(c\) is the number of clusters. |
num_points | The expected total number of points. |
This function is used internally by clusizes and might be useful for
custom cluster sizing implementations given as the clusizes_fn parameter of
the main clugen function.
Number of points in each cluster, after being fixed by this function.
clusters <- c(1, 6, 3)                   # 10 total points
clusters <- fix_num_points(clusters, 12) # But we want 12 total points
clusters                                 # Check that we now have 12 points
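The increment/decrement rule described above maps onto two simple loops. The following base R sketch follows that description; `fix_num_points_sketch` is a hypothetical name, and ties are resolved by taking the first smallest or largest cluster, which is an assumption.

```r
# Hypothetical sketch: adjust cluster sizes until they sum to num_points.
fix_num_points_sketch <- function(clu_num_points, num_points) {
  while (sum(clu_num_points) < num_points) {
    i <- which.min(clu_num_points) # Grow the (first) smallest cluster
    clu_num_points[i] <- clu_num_points[i] + 1
  }
  while (sum(clu_num_points) > num_points) {
    i <- which.max(clu_num_points) # Shrink the (first) largest cluster
    clu_num_points[i] <- clu_num_points[i] - 1
  }
  clu_num_points
}

fix_num_points_sketch(c(1, 6, 3), 12) # 3 6 3
```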
Line lengths are determined using the folded normal distribution
(\(\mu=\) llength, \(\sigma=\) llength_disp).
llengths(num_clusters, llength, llength_disp)
num_clusters | Number of clusters. |
llength | Average line length. |
llength_disp | Line length dispersion. |
Lengths of cluster-supporting lines (vector of size num_clusters).
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
set.seed(123)
llengths(4, 20, 3.5)
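A folded normal variate is the absolute value of a normal variate, so line lengths of this kind can be sketched in one line of base R. `llengths_sketch` is a hypothetical name used only for illustration.

```r
# Hypothetical sketch: folded normal = absolute value of a normal draw.
llengths_sketch <- function(num_clusters, llength, llength_disp) {
  abs(stats::rnorm(num_clusters, mean = llength, sd = llength_disp))
}

set.seed(123)
llengths_sketch(4, 20, 3.5)
```

Folding guarantees non-negative lengths even when llength_disp is large relative to llength.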
Determine coordinates of points on a line with center and direction,
based on the distances from the center given in dist_center.
This works by using the vector formulation of the line equation assuming
direction is an \(n\)-dimensional unit vector. In other words,
considering \(\mathbf{d}=\) as.matrix(direction) (\(n \times 1\) vector),
\(\mathbf{c}=\) as.matrix(center) (\(n \times 1\) vector), and
\(\mathbf{w}=\) as.matrix(dist_center) (\(p \times 1\) vector), the coordinates
of points on the line are given by:
\[\mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T\]
where \(\mathbf{P}\) is the \(p \times n\) matrix of point coordinates on the line, and \(\mathbf{1}\) is a \(p \times 1\) vector with all entries equal to 1.
points_on_line(center, direction, dist_center)
center | Center of the line (\(n\)-component vector). |
direction | Line direction (\(n\)-component unit vector). |
dist_center | Distance of each point to the center of the line (\(p\)-component vector, where \(p\) is the number of points). |
Coordinates of points on the specified line (\(p \times n\) matrix).
points_on_line(c(5, 5), c(1, 0), seq(-4, 4, length.out = 5)) # 2D, 5 points
points_on_line(c(-2, 0, 0, 2), c(0, 0, -1, 0), c(10, -10))   # 4D, 2 points
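The vector form of the line equation described above (each point equals the center plus its signed distance times the direction) can be expressed with two outer products. This base R sketch assumes that formulation; `points_on_line_sketch` is a hypothetical name.

```r
# Hypothetical sketch: P = 1 c^T + w d^T
points_on_line_sketch <- function(center, direction, dist_center) {
  ones <- matrix(1, nrow = length(dist_center), ncol = 1) # p x 1
  ones %*% t(center) + as.matrix(dist_center) %*% t(direction)
}

# 2D, 5 points: rows (1, 5), (3, 5), (5, 5), (7, 5), (9, 5)
points_on_line_sketch(c(5, 5), c(1, 0), seq(-4, 4, length.out = 5))
```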
Get a random unit vector orthogonal to u.
rand_ortho_vector(u)
u | A unit vector. |
A random unit vector orthogonal to u.
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
r <- stats::runif(3)       # Get a random 3D vector
r <- r / norm(r, "2")      # Normalize it
o <- rand_ortho_vector(r)  # Get a random unit vector orthogonal to r
r %*% o                    # Check that r and o are orthogonal (result should be ~0)
Get a random unit vector with num_dims components.
rand_unit_vector(num_dims)
num_dims | Number of components in vector (i.e. vector size). |
A random unit vector with num_dims components.
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
r <- rand_unit_vector(4)
norm(r, "2")
Get a random unit vector which is at angle radians from vector u.
Note that u is expected to be a unit vector itself.
rand_vector_at_angle(u, angle)
u | Unit vector with \(n\) components. |
angle | Angle in radians. |
Random unit vector with \(n\) components which is at angle radians with vector u.
This function is stochastic. For reproducibility set a PRNG seed with set.seed.
u <- c(1.0, 0, 0.5, -0.5)            # Define a 4D vector
u <- u / norm(u, "2")                # Normalize the vector
v <- rand_vector_at_angle(u, pi / 4) # Get a vector at 45 degrees
arad <- acos((u %*% v) / (norm(u, "2") * norm(v, "2"))) # Get angle in radians
arad * 180 / pi                      # Convert to degrees, should be close to 45 degrees
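One way to build a vector at a given angle from a unit vector u is to add to u a random orthogonal unit vector scaled by the tangent of the angle, then normalize. The base R sketch below assumes that construction, at least two dimensions, and an angle strictly between \(-\pi/2\) and \(\pi/2\); `rand_vector_at_angle_sketch` is a hypothetical name, and the package may treat edge cases differently.

```r
# Hypothetical sketch: unit vector at `angle` radians from unit vector u.
rand_vector_at_angle_sketch <- function(u, angle) {
  r <- stats::rnorm(length(u))
  o <- r - sum(r * u) * u  # Component of r orthogonal to u
  o <- o / sqrt(sum(o^2))  # Random orthogonal unit vector
  v <- u + tan(angle) * o  # tan(angle) sets the opening between u and v
  v / sqrt(sum(v^2))
}

set.seed(123)
u <- c(1.0, 0, 0.5, -0.5)
u <- u / sqrt(sum(u^2))
v <- rand_vector_at_angle_sketch(u, pi / 4)
acos(sum(u * v)) * 180 / pi # approximately 45
```

Since u and the orthogonal vector are both unit length, the cosine of the resulting angle is \(1/\sqrt{1+\tan^2(\text{angle})}\), which equals the cosine of the requested angle.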