
SOM + latent profile clustering pipeline (with AHP and distance baselines)
Source: R/Pipeline_SOMClust.R
CreateSOMClusterModel.Rd
End-to-end pipeline to:
Standardize variables using SciDataReportR::CreateZScoreObject() or a supplied Z-score object.
Fit a Self-Organizing Map (SOM; kohonen) on complete cases.
Generate aweSOM visualizations (Circular, Line, Cloud) with optional relabeling using variable labels from the original data frame.
Cluster SOM codebook vectors using latent profile analysis (tidyLPA / mclust backend).
In method = "exploratory", fit a grid of models and select a recommended solution using an Analytic Hierarchy Process (AHP)-style index combining AIC, BIC, and Entropy.
In method = "finalize", fit a user-specified model and number of profiles.
Map node-level clusters and posterior probabilities back to individuals.
Missing data:
SOM and clustering are fit only on rows with complete Z-scores.
The returned df_with_clusters has the full original data with .scidr_rowid and a single cluster column appended; rows not used in SOM/LPA get NA.
The returned ProbFit$individual is also full length, preserving one row per input row with NA posterior probabilities for rows excluded from SOM/LPA.
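The missing-data behavior above can be illustrated with a minimal base-R sketch. It is not the package's implementation: kmeans() stands in for the SOM/LPA fit, and the names z, fit, and out are hypothetical.

```r
# Toy Z-score table with a missing value in row 3 (hypothetical data).
z <- data.frame(Z_a = c(-1.2,  0.3,  NA, 0.9, 1.1, -0.8),
                Z_b = c( 0.5, -0.7, 0.2, 1.4, 1.0, -1.1))

# Only rows with complete Z-scores are used for fitting.
complete_rows <- stats::complete.cases(z)

# Stand-in for the SOM/LPA fit: any clusterer run on the complete rows only.
set.seed(1)
fit <- stats::kmeans(z[complete_rows, ], centers = 2)

# Full-length output: one row per input row, NA where the row was excluded.
out <- z
out$Cluster <- NA_integer_
out$Cluster[complete_rows] <- fit$cluster
```

The result keeps all six input rows; only the row with a missing Z-score has NA in the Cluster column.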
Stable row id:
.scidr_rowid is added to the input data and carried into df_with_clusters and ProbFit$individual.
ProbFit$individual$RowID is set equal to .scidr_rowid so merges do not rely on row order.
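As a rough illustration of why a stable id matters, the base-R sketch below joins per-row results back to the input by id rather than by position. The data frames df and probs and the column p_max are hypothetical; only the .scidr_rowid and RowID names come from the description above.

```r
# Hypothetical input; .scidr_rowid mimics the id column described above.
df <- data.frame(age = c(34, 51, 27, 45))
df$.scidr_rowid <- seq_len(nrow(df))

# Per-row results that arrive in a different order (for example after sorting).
probs <- data.frame(RowID = c(3L, 1L, 4L, 2L),
                    p_max = c(0.91, 0.77, NA, 0.64))

# Join on the id, not on position, so the shuffled order is harmless.
merged <- merge(df, probs, by.x = ".scidr_rowid", by.y = "RowID",
                all.x = TRUE, sort = TRUE)
```

Each probability ends up next to the row it belongs to, even though probs was not in input order.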
Z-score behavior:
ZScoreType = "Center and Scale" / "Center Only" / "Scale Only" computes Z-scores from df via CreateZScoreObject().
ZScoreType = "ZScoreObj" projects Z-scores using an external ZScoreObj via ProjectZScore().
ZScoreType = "PreZScored" uses existing Z-score columns in df as-is and does not re-zscore.
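The standardization modes can be sketched with plain arithmetic. This is an illustration only and does not call CreateZScoreObject() or ProjectZScore(); in particular, treating "Scale Only" as division by the sample standard deviation without centering is an assumption, and the package may define it differently.

```r
x <- c(2, 4, 4, 6, 9)            # hypothetical raw variable
m <- mean(x)
s <- stats::sd(x)

z_center_scale <- (x - m) / s    # "Center and Scale"
z_center_only  <- x - m          # "Center Only"
z_scale_only   <- x / s          # "Scale Only" (assumed: divide by SD, no centering)

# "ZScoreObj"-style projection: reuse m and s stored from a reference sample
# on new data instead of re-estimating them.
x_new <- c(5, 10)
z_projected <- (x_new - m) / s
```

Projection is the relevant mode when new data must be placed on the scale of an earlier sample, since re-estimating the mean and SD would shift the reference point.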
Usage
CreateSOMClusterModel(
  df,
  variables = NULL,
  method = c("exploratory", "finalize"),
  k_range = 2:15,
  models = c(1, 2, 3),
  final_k = NULL,
  final_model = NULL,
  ClusterName = "Cluster",
  ZScoreType = c("Center and Scale", "Center Only", "Scale Only", "ZScoreObj",
    "PreZScored"),
  ZScoreObject = NULL,
  som_xdim = NULL,
  som_ydim = NULL,
  som_topo = "hexagonal",
  som_neigh = "gaussian",
  seed_som = 934521L,
  seed_lpa = 93421L,
  Relabel = TRUE,
  ZScorePrefix = "Z_",
  ZScoreVars = NULL,
  id_col = NULL,
  lpa_progress = FALSE,
  lpa_em_itmax = 100L,
  lpa_em_tol = 1e-05,
  lpa_timeout_seconds = 120,
  lpa_drop_zero_sd = TRUE,
  lpa_zero_sd_tol = 1e-08,
  skip_model_after_n_failures = 2L,
  slow_fit_seconds = 120,
  min_nodes_per_cluster = 5
)
Arguments
- df
Data frame containing the variables to be used in SOM and clustering.
- variables
Optional character vector of variable names. If NULL, numeric variables are auto-detected using SciDataReportR::getNumVars(df, Ordinal = FALSE). In ZScoreType = "PreZScored", this can also be NULL if you supply ZScoreVars or if Z-score columns can be auto-detected by prefix.
- method
One of "exploratory" (default) or "finalize". In "exploratory", a grid of models is fit and AHP chooses the recommended solution. In "finalize", the user must specify final_k and final_model.
- k_range
Integer vector of numbers of clusters/profiles to consider in exploratory mode. Default 2:15.
- models
Integer vector of model specifications for tidyLPA (mclust backend). Default c(1, 2, 3).
- final_k
Integer; number of profiles for method = "finalize".
- final_model
Integer; model specification for method = "finalize" (should be one of models).
- ClusterName
Name of the cluster column in the output. Defaults to "Cluster". If this column already exists in df, it is overwritten (with a message).
- ZScoreType
One of:
"Center and Scale" (default)
"Center Only"
"Scale Only"
"ZScoreObj" (use an existing ZScore object)
"PreZScored" (use existing Z-score columns in df as-is)
- ZScoreObject
Optional ZScoreObj (from CreateZScoreObject() or ProjectZScore()) to use when ZScoreType = "ZScoreObj".
- som_xdim, som_ydim
Optional integers for SOM grid dimensions. If NULL, a square grid with side length ceiling(n_complete^(1/3)) is used.
- som_topo
SOM topology for kohonen::somgrid(), default "hexagonal".
- som_neigh
SOM neighbourhood function, default "gaussian".
- seed_som, seed_lpa
Integer seeds for SOM and LPA steps (defaults 934521 and 93421).
- Relabel
Logical; if TRUE (default), aweSOM plots are relabeled using variable labels from the original df (via Hmisc or sjlabelled when available) by stripping the Z-score prefix.
- ZScorePrefix
Character prefix used for Z-score columns when ZScoreType = "PreZScored". Default "Z_".
- ZScoreVars
Optional character vector of Z-score column names to use when ZScoreType = "PreZScored". If NULL, the function attempts to infer them from variables or by detecting columns starting with ZScorePrefix.
- id_col
Optional character scalar. If provided and present in df, this column is carried into ProbFit$individual for convenience.
- lpa_progress
Logical; if TRUE, print short progress messages while fitting model/profile combinations.
- lpa_em_itmax
Integer; maximum number of EM iterations passed to mclust::emControl(). Use NULL to leave mclust defaults unchanged.
- lpa_em_tol
Numeric; EM convergence tolerance passed to mclust::emControl(). Use NULL to leave mclust defaults unchanged.
- lpa_timeout_seconds
Optional timeout in seconds for individual LPA fits. Use NULL to disable timeouts.
- lpa_drop_zero_sd
Logical; if TRUE, remove SOM code dimensions with near-zero standard deviation before LPA.
- lpa_zero_sd_tol
Numeric tolerance used when lpa_drop_zero_sd = TRUE.
- skip_model_after_n_failures
Optional integer; skip a model family after this many failures.
- slow_fit_seconds
Optional runtime threshold used to flag slow LPA fits in diagnostics.
- min_nodes_per_cluster
Optional minimum average SOM nodes per cluster considered before attempting a candidate profile count.
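To make the default grid size and the min_nodes_per_cluster filter concrete, here is a small worked sketch. The grid formula is taken from the som_xdim, som_ydim entry; the filter rule (keep a candidate k only if the total number of nodes divided by k is at least the minimum) is one plausible reading of the description and has not been checked against the source. The value of n_complete is hypothetical.

```r
n_complete <- 500L                   # hypothetical number of complete rows
side <- ceiling(n_complete^(1/3))    # default side length when som_xdim/som_ydim are NULL
n_nodes <- side * side               # square grid

k_range <- 2:15
min_nodes_per_cluster <- 5

# Assumed rule: keep k only if the average number of nodes per cluster
# is at least min_nodes_per_cluster.
k_kept <- k_range[n_nodes / k_range >= min_nodes_per_cluster]
```

With 500 complete rows this gives an 8 x 8 grid of 64 nodes, so under the assumed rule candidate profile counts above 12 would not be attempted.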
Value
A list of class "Pipeline_SOMClust" with components:
method, vars_used, ZScoreType, ZScoreObject, ZScoreVars, ClusterName
complete_rows: logical vector (rows used for SOM/LPA)
df_with_clusters: original df with .scidr_rowid and only the cluster column appended
fit_plot: ggplot of AIC/BIC/Entropy vs k and model
ModelInfo_SOM: list with som_model, som_codes, som_grid, SOMFit (distance diagnostics, baselines, and per-cluster flags), plots (aweSOM plots)
ModelInfo_MClust: list with lpa_models, fit_table, AHP information, and diagnostics for LPA warnings, failures, runtimes, and preprocessing
ProbFit: list with node (node-level posterior probabilities), individual (full-length per-person mapping and probabilities, including .scidr_rowid), and probability plots
Details
The AHP-style index is computed by:
Scaling AIC, BIC, and Entropy across candidate solutions. AIC and BIC are negated first, because lower values indicate better fit, so that a higher scaled score is preferred for all three indices.
Taking the mean of the three scaled indices. The model with the highest AHP index is recommended.
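The two steps above can be sketched in base R. The fit table is made up, and min-max scaling to [0, 1] is an assumption, since the text does not say which scaling is used; the helper rescale01 is hypothetical and would divide by zero if all candidates tied on an index.

```r
# Hypothetical fit table: one row per (model, k) candidate.
fit_table <- data.frame(
  model   = c(1, 1, 2, 2),
  k       = c(2, 3, 2, 3),
  AIC     = c(520, 505, 510, 498),
  BIC     = c(540, 532, 536, 530),
  Entropy = c(0.80, 0.86, 0.78, 0.91)
)

# Min-max scaling to [0, 1]; this choice of scaling is an assumption.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Negate AIC and BIC so that "higher is better" holds for all three indices.
scaled <- data.frame(
  AIC     = rescale01(-fit_table$AIC),
  BIC     = rescale01(-fit_table$BIC),
  Entropy = rescale01(fit_table$Entropy)
)

# AHP-style index: unweighted mean of the three scaled indices.
fit_table$AHP <- rowMeans(scaled)
best <- fit_table[which.max(fit_table$AHP), c("model", "k", "AHP")]
```

In this toy table the model 2, k = 3 candidate has the lowest AIC and BIC and the highest Entropy, so it scores highest on every index and is the one selected.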
LPA model/profile combinations are fit one at a time so that failed or
warning-producing solutions are captured in diagnostics instead of blocking
the entire pipeline. Successful fits are retained and failed fits are listed
in ModelInfo_MClust$diagnostics.
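A minimal sketch of this fit-one-at-a-time pattern, using tryCatch() so that a single failure does not stop the loop. The function fit_one is a hypothetical stand-in for an LPA fit, the diagnostics columns are illustrative rather than the package's actual layout, and only errors are captured here (warnings would need a separate handler such as withCallingHandlers()).

```r
# Hypothetical stand-in for a single LPA fit; it fails for one combination on purpose.
fit_one <- function(model, k) {
  if (model == 2 && k == 3) stop("did not converge")
  list(model = model, k = k)
}

grid <- expand.grid(model = 1:2, k = 2:3)
fits <- list()
diagnostics <- data.frame()

for (i in seq_len(nrow(grid))) {
  m <- grid$model[i]
  k <- grid$k[i]
  started <- Sys.time()

  # Return the condition object instead of aborting the whole loop.
  res <- tryCatch(fit_one(m, k), error = function(e) e)
  failed <- inherits(res, "error")

  # Keep successful fits; record every attempt in the diagnostics table.
  if (!failed) fits[[paste(m, k, sep = "_")]] <- res
  diagnostics <- rbind(diagnostics, data.frame(
    model   = m,
    k       = k,
    failed  = failed,
    message = if (failed) conditionMessage(res) else NA_character_,
    seconds = as.numeric(difftime(Sys.time(), started, units = "secs"))
  ))
}
```

Three of the four combinations end up in fits, and the failing one appears in diagnostics with its error message and elapsed time.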