
SOM + latent profile clustering pipeline (with AHP and distance baselines)
Source: R/Pipeline_SOMClust.R
Pipeline_SOMClust.Rd
End-to-end pipeline to:
Standardize variables using SciDataReportR::CalcZScore() or a supplied Z-score object.
Fit a Self-Organizing Map (SOM; kohonen) on complete cases.
Generate aweSOM visualizations (Circular, Line, Cloud) with optional relabeling using variable labels from the original data frame.
Cluster SOM codebook vectors using latent profile analysis (tidyLPA / mclust backend).
In method = "exploratory", fit a grid of models and select a recommended solution using an Analytic Hierarchy Process (AHP)-style index combining AIC, BIC, and Entropy.
In method = "finalize", fit a user-specified model and number of profiles.
Map node-level clusters and posterior probabilities back to individuals.
Missing data:
SOM and clustering are fit only on rows with complete Z-scores.
The returned df_with_clusters has the full original data with .scidr_rowid and a single cluster column appended; rows not used in SOM/LPA get NA.
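The complete-case behavior can be illustrated with a minimal base R sketch. It does not reproduce the pipeline's internals: the data frame `z`, its column names, and the placeholder labels in `fitted_labels` are hypothetical and stand in for real Z-scores and fitted clusters.

```r
# Hypothetical Z-score data; row 2 has a missing value.
z <- data.frame(
  Z_a = c(0.5, NA, -1.2, 0.3),
  Z_b = c(1.1, 0.2, -0.4, -0.9)
)

# Rows eligible for model fitting: every Z-score is present.
complete_rows <- stats::complete.cases(z)

# Placeholder labels standing in for fitted clusters (illustrative only),
# one per complete row.
fitted_labels <- c(1L, 2L, 1L)

# Build a full-length cluster column: NA wherever the row was not used.
cluster <- rep(NA_integer_, nrow(z))
cluster[complete_rows] <- fitted_labels

df_with_clusters <- cbind(z, Cluster = cluster)
print(df_with_clusters)
```

The point of the sketch is that the output keeps every input row, so the number of rows never changes between input and output.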
Stable row id:
.scidr_rowid is added to the input data and carried into df_with_clusters and ProbFit$individual.
ProbFit$individual$RowID is set equal to .scidr_rowid so merges do not rely on row order.
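As a sketch of why a stable id helps, the following base R example joins two tables on the id rather than on row position. Both data frames are hypothetical stand-ins built by hand; only the column names `.scidr_rowid` and `RowID` come from the description above, and `MaxProb` is an invented column used for illustration.

```r
# Hypothetical stand-in for df_with_clusters.
df_with_clusters <- data.frame(
  .scidr_rowid = 1:4,
  Cluster = c(1L, NA, 2L, 1L)
)

# Hypothetical stand-in for ProbFit$individual: deliberately shuffled,
# and missing row 2 because that row was not used in model fitting.
individual <- data.frame(
  RowID = c(4L, 1L, 3L),
  MaxProb = c(0.91, 0.88, 0.76)  # invented probability column
)

# Join on the id, not on position; all.x = TRUE keeps rows that were not fit.
merged <- merge(
  df_with_clusters, individual,
  by.x = ".scidr_rowid", by.y = "RowID",
  all.x = TRUE
)
merged <- merged[order(merged$.scidr_rowid), ]
print(merged)
```

Because the join key is explicit, reordering or filtering either table beforehand does not change which probability ends up next to which row.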
Z-score behavior:
ZScoreType = "Center and Scale" / "Center Only" / "Scale Only" computes Z-scores from df via CalcZScore().
ZScoreType = "ZScoreObj" projects Z-scores using an external ZScoreObj via Project_ZScore().
ZScoreType = "PreZScored" uses existing Z-score columns in df as-is and does not re-zscore.
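The sketch below shows one plausible reading of the three computed options and of projection, using base R only. It is an assumption about what the labels mean (centering on the mean, scaling by the standard deviation), not a description of how CalcZScore() or Project_ZScore() are implemented.

```r
x <- c(2, 4, 6, 8)

# ASSUMPTION: centering uses the mean and scaling uses the sample SD.
center_and_scale <- (x - mean(x)) / sd(x)  # mean 0, SD 1
center_only      <- x - mean(x)            # mean 0, original spread
scale_only       <- x / sd(x)              # SD 1, mean not shifted

# Projection reuses reference parameters instead of re-estimating them
# from the new data, which is the idea behind the "ZScoreObj" option.
ref_mean <- mean(x)
ref_sd   <- sd(x)
x_new     <- c(5, 10)
projected <- (x_new - ref_mean) / ref_sd

print(round(cbind(center_and_scale, center_only, scale_only), 3))
print(round(projected, 3))
```

The sketch divides by sd() explicitly because base::scale() with center = FALSE scales by the root mean square rather than the standard deviation.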
Usage
Pipeline_SOMClust(
  df,
  variables = NULL,
  method = c("exploratory", "finalize"),
  k_range = 2:15,
  models = c(1, 2, 3),
  final_k = NULL,
  final_model = NULL,
  ClusterName = "Cluster",
  ZScoreType = c("Center and Scale", "Center Only", "Scale Only", "ZScoreObj",
    "PreZScored"),
  ZScoreObject = NULL,
  som_xdim = NULL,
  som_ydim = NULL,
  som_topo = "hexagonal",
  som_neigh = "gaussian",
  seed_som = 934521L,
  seed_lpa = 93421L,
  Relabel = TRUE,
  ZScorePrefix = "Z_",
  ZScoreVars = NULL,
  id_col = NULL
)
Arguments
- df
Data frame containing the variables to be used in SOM and clustering.
- variables
Optional character vector of variable names. If NULL, numeric variables are auto-detected using SciDataReportR::getNumVars(df, Ordinal = FALSE). When ZScoreType = "PreZScored", this can also be NULL if you supply ZScoreVars or if Z-score columns can be auto-detected by prefix.
- method
One of "exploratory" (default) or "finalize". In "exploratory", a grid of models is fit and the AHP-style index selects the recommended solution. In "finalize", the user must specify final_k and final_model.
- k_range
Integer vector of numbers of clusters/profiles to consider in exploratory mode. Default 2:15.
- models
Integer vector of model specifications for tidyLPA (mclust backend). Default c(1, 2, 3).
- final_k
Integer; number of profiles for method = "finalize".
- final_model
Integer; model specification for method = "finalize" (should be one of models).
- ClusterName
Name of the cluster column in the output. Defaults to "Cluster". If this column already exists in df, it is overwritten (with a message).
- ZScoreType
One of:
"Center and Scale" (default)
"Center Only"
"Scale Only"
"ZScoreObj" (use an existing ZScore object)
"PreZScored" (use existing Z-score columns in df as-is)
- ZScoreObject
Optional ZScoreObj (from CalcZScore() or Project_ZScore()) to use when ZScoreType = "ZScoreObj".
- som_xdim, som_ydim
Optional integers for SOM grid dimensions. If NULL, a square grid with side length ceiling(n_complete^(1/3)) is used.
- som_topo
SOM topology for kohonen::somgrid(), default "hexagonal".
- som_neigh
SOM neighbourhood function, default "gaussian".
- seed_som, seed_lpa
Integer seeds for SOM and LPA steps (defaults 934521 and 93421).
- Relabel
Logical; if TRUE (default), aweSOM plots are relabeled using variable labels from the original df (via Hmisc or sjlabelled when available) by stripping the Z-score prefix.
- ZScorePrefix
Character prefix used for Z-score columns when ZScoreType = "PreZScored". Default "Z_".
- ZScoreVars
Optional character vector of Z-score column names to use when ZScoreType = "PreZScored". If NULL, the function attempts to infer them from variables or by detecting columns starting with ZScorePrefix.
- id_col
Optional character scalar. If provided and present in df, this column is carried into ProbFit$individual for convenience.
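To get a feel for the default grid size given under som_xdim, som_ydim, the rule ceiling(n_complete^(1/3)) can be evaluated directly. The helper name `default_som_side` is hypothetical and exists only for this illustration; the sample sizes are arbitrary.

```r
# Side length of the default square SOM grid, per the documented rule.
# The function name is hypothetical; only the formula comes from the docs.
default_som_side <- function(n_complete) {
  ceiling(n_complete^(1 / 3))
}

n_complete <- c(100, 500, 2000, 5000)
side <- default_som_side(n_complete)

# A square grid has side^2 nodes in total.
grid_sizes <- data.frame(n_complete, side, nodes = side^2)
print(grid_sizes)
```

Because the side length grows with the cube root of the number of complete rows, the node count grows more slowly than the sample size, so larger data sets place more observations on each node on average.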
Value
A list of class "Pipeline_SOMClust" with components:
method, vars_used, ZScoreType, ZScoreObject, ZScoreVars, ClusterName
complete_rows: logical vector (rows used for SOM/LPA)
df_with_clusters: original df with .scidr_rowid and only the cluster column appended
fit_plot: ggplot of AIC/BIC/Entropy vs k and model
ModelInfo_SOM: list with som_model, som_codes, som_grid, SOMFit (distance diagnostics, baselines, and per-cluster flags), plots (aweSOM plots)
ModelInfo_MClust: list with lpa_models, fit_table, and AHP information
ProbFit: list with node (node-level posterior probabilities), individual (per-person mapping and probabilities, including .scidr_rowid), and probability plots
Details
The AHP-style index is computed by:
Scaling AIC, BIC, and Entropy across candidate solutions. Because lower AIC and BIC indicate better fit, both are negated before scaling so that, for all three indices, a higher scaled score is preferred.
Taking the mean of the three scaled indices. The model with the highest AHP index is recommended.
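The two steps above can be sketched in base R as follows. The fit values are invented for illustration, and min-max scaling to [0, 1] is an assumption: the documentation says only that the indices are scaled, so the pipeline's actual scaling may differ. The helper `rescale01` is likewise hypothetical.

```r
# Invented fit statistics for four candidate solutions.
fit_table <- data.frame(
  model   = c(1, 1, 2, 2),
  k       = c(2, 3, 2, 3),
  AIC     = c(1520, 1480, 1510, 1475),
  BIC     = c(1560, 1535, 1565, 1545),
  Entropy = c(0.78, 0.85, 0.80, 0.82)
)

# ASSUMPTION: min-max scaling to [0, 1] across candidates.
rescale01 <- function(v) {
  rng <- range(v)
  if (rng[1] == rng[2]) return(rep(0.5, length(v)))  # guard: zero range
  (v - rng[1]) / (rng[2] - rng[1])
}

# Negate AIC and BIC so that higher scaled scores mean better fit.
fit_table$AIC_s     <- rescale01(-fit_table$AIC)
fit_table$BIC_s     <- rescale01(-fit_table$BIC)
fit_table$Entropy_s <- rescale01(fit_table$Entropy)

# Equal-weight mean of the three scaled indices.
fit_table$AHP <- rowMeans(fit_table[, c("AIC_s", "BIC_s", "Entropy_s")])

# Recommended solution: highest index.
best <- fit_table[which.max(fit_table$AHP), ]
print(best[, c("model", "k", "AHP")])
```

With equal weights, a solution that is best on one criterion but poor on the others can lose to one that is consistently good, which is the trade-off the combined index is meant to capture.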