Use of consensus clustering to identify distinct subtypes of chronic kidney disease and associated mortality risk

Study population

The US National Health and Nutrition Examination Survey (NHANES) is an ongoing series of cross-sectional, national, stratified, multistage probability surveys of the civilian, noninstitutionalized US population26. In each 2-year survey cycle, about 10,000 individuals complete a household interview and undergo a physical examination. In the present study, we used data from NHANES 1999–2000 to 2017–2018, which comprised 10 survey cycles. All nonpregnant participants aged 20 years or older with CKD were included in the analysis. A detailed description of the NHANES database is publicly available (http://www.cdc.gov/nchs/nhanes.htm).

We combined ten consecutive survey cycles, which included 101,316 participants. Participants were excluded if they were younger than 20 years (n = 46,235), were pregnant at examination or had uncertain pregnancy status (n = 2,639), or had received dialysis treatment in the past 12 months (n = 162). Participants were also excluded because of missing mortality data or outlying values of the clustering variables. Finally, a total of 6,526 eligible subjects with CKD were included in the analysis (Supplementary Fig. 1). The NHANES protocol was approved by the National Center for Health Statistics (NCHS) Institutional Review Board, and written informed consent was obtained. Our study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline27.

Mortality data

NHANES participant records were linked to mortality data from the National Death Index based on death certificate data. International Classification of Diseases, Tenth Revision codes were used to identify the cause of death. Cardiovascular death included death due to diseases of the heart (I00-I09, I11, I13, I20-I51) and cerebrovascular diseases (I60-I69). Cancer death was classified using codes C00-C97. Follow-up time was calculated in person-months from the date of interview to the date of death or the most recent vital status record.
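The cause-of-death grouping described above can be illustrated with a minimal Python sketch. It assumes that a raw ICD-10 underlying-cause code is available as a string; the function name and the "other" fallback label are hypothetical and are not taken from the study, whose mortality files may instead supply pre-grouped cause categories.

```python
def classify_cause_of_death(icd10_code: str) -> str:
    """Map an ICD-10 code to 'cardiovascular', 'cancer', or 'other'.

    Ranges follow the text: heart disease I00-I09, I11, I13, I20-I51;
    cerebrovascular disease I60-I69; cancer C00-C97.
    """
    code = icd10_code.strip().upper().replace(".", "")
    # Expect one letter followed by at least two digits, e.g. "I219".
    if len(code) < 3 or not code[1:3].isdigit():
        return "other"
    letter, number = code[0], int(code[1:3])
    if letter == "I":
        heart = number <= 9 or number in (11, 13) or 20 <= number <= 51
        cerebrovascular = 60 <= number <= 69
        if heart or cerebrovascular:
            return "cardiovascular"
    if letter == "C" and number <= 97:
        return "cancer"
    return "other"
```

Note that I10 (essential hypertension) and I12 fall outside the listed ranges, so this sketch assigns them to "other".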

Variables

Sociodemographic characteristics, behavioral risk factors and history of diseases were collected by trained interviewers using questionnaires. The physical examinations and laboratory tests in NHANES took place in a mobile examination center using standardized protocols and calibrated equipment, and details on the data collection are described on the NHANES website. We selected 45 clinically available and novel factors from all variables that were measured at the NHANES study baseline. Variables were selected on the basis of a literature review for those most clinically relevant to CKD28,29,30,31,32. We excluded variables with over 10% missing data or small variability (e.g., binary variables with a prevalence < 5%). The 45 variables comprised sociodemographic characteristics (n = 9), behavioral risk factors (n = 5), biomarkers of metabolic status (n = 19), and history of diseases (n = 12).

The sociodemographic characteristics included ethnicity, education level, family income, marital status, citizenship status, housing, employment, health insurance, and regular health care access. Ethnicity was categorized as non-Hispanic white or non-white. Low educational attainment was defined as less than a high school education. The income-to-poverty ratio (annual family income divided by the poverty threshold adjusted for family size and inflation) was used as a measure of income, and a low income-to-poverty ratio was defined as less than 100%. Marital status was dichotomized as currently married or not married. Citizenship status had two categories: citizen by birth or naturalization, and not a citizen of the US. To assess housing status, participants were asked “Is this home owned, being bought, rented, or occupied by some other arrangement by you or someone else in your family?” Employment status was dichotomized as unemployed versus employed, student, or retired. Health insurance was dichotomized as with or without health insurance. To assess health care access, participants were asked “Is there a place that you usually go when you are sick or you need advice about your health?”

The behavioral risk factors included current smoking status, current drinking status, physical activity level, sleep duration and sodium intake. Current smoking was defined as having smoked at least 100 cigarettes in life and smoking at present. Current alcohol drinking was defined as having had at least 12 drinks of any type of alcoholic beverage in the last 12 months. Physical activity was assessed using a questionnaire based on the Global Physical Activity Questionnaire, which asks about the intensity, duration, and frequency of physical activity. Different physical activity assessment tools were used in NHANES 1999–2000 to 2005–2006 and in NHANES 2007–2008 to 2017–2018. In NHANES 1999–2000 to 2005–2006, the duration of physical activity was not ascertained; each physical activity was assigned an intensity value (metabolic equivalent of task) that represents the ratio of the energy expenditure of the activity to the basal metabolic rate. In NHANES 2007–2008 to 2017–2018, total metabolic equivalent minutes per week were calculated as the measure of physical activity level. A higher level of physical activity was defined as a metabolic equivalent value above the median for the participant's survey cycle. Usual sleep duration at night was recorded, and long sleep duration was defined as sleeping longer than 8 h. Sodium intake was collected through a dietary interview.
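Because the physical activity measure differs between survey periods, the "higher activity" flag is defined relative to each cycle's own median. The following Python sketch shows one way to compute such a flag; the record keys `cycle` and `met` and the function name are hypothetical, and the strict "greater than the median" comparison is an assumption about how ties were handled.

```python
from statistics import median


def flag_high_activity(records):
    """Return one boolean per record: True if the participant's metabolic
    equivalent value exceeds the median of their own survey cycle.

    records: list of dicts with hypothetical keys 'cycle' and 'met'.
    """
    values_by_cycle = {}
    for rec in records:
        values_by_cycle.setdefault(rec["cycle"], []).append(rec["met"])
    # One median per survey cycle, so each cycle is split on its own scale.
    medians = {cycle: median(vals) for cycle, vals in values_by_cycle.items()}
    return [rec["met"] > medians[rec["cycle"]] for rec in records]
```

Splitting within each cycle avoids comparing intensity scores from the earlier cycles directly with metabolic equivalent minutes per week from the later ones.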

Metabolic biomarkers were measured through physical examinations and laboratory tests using standardized protocols and assays, and included body mass index (BMI), waist circumference (WC), systolic blood pressure (SBP), diastolic blood pressure (DBP), HbA1c, fasting plasma glucose (FPG), 2 h postprandial glucose (2 h PG), alanine transaminase (ALT), aspartate transaminase (AST), γ-glutamyl transferase (GGT), triglyceride (TG), high-density lipoprotein cholesterol (HDL-cholesterol), total cholesterol, low-density lipoprotein cholesterol (LDL-cholesterol), C-reactive protein (CRP), serum albumin, uric acid (UA), urinary albumin-to-creatinine ratio (UACR), and eGFR.

Current use of prescribed medication for hypertension, diabetes, and hypercholesterolemia was recorded in the survey. The history of diseases, including congestive heart failure, coronary heart disease, heart attack, stroke, and cancer or malignancy, was also collected.

Definition of CKD

The eGFR was calculated using the 2009 Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) equation, which accounts for age, sex, serum creatinine level and race33. Albuminuria was assessed as the urinary albumin concentration divided by the urinary creatinine concentration in a morning spot urine sample. CKD was defined as an eGFR < 60 mL/min/1.73 m² or a UACR ≥ 30 mg/g. The equations for each sex and serum creatinine level are shown in Supplementary Table 1.
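As an illustration of the CKD definition, the sketch below implements the published 2009 CKD-EPI creatinine equation together with the eGFR and UACR thresholds stated above. The function and parameter names are hypothetical, serum creatinine is assumed to be in mg/dL, and the code is not taken from the study's own analysis scripts.

```python
def egfr_ckd_epi_2009(scr_mg_dl: float, age: float, female: bool, black: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine equation."""
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.329 if female else -0.411    # sex-specific exponent below kappa
    ratio = scr_mg_dl / kappa
    egfr = 141 * (min(ratio, 1.0) ** alpha) * (max(ratio, 1.0) ** (-1.209)) * (0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def has_ckd(egfr: float, uacr_mg_g: float) -> bool:
    """CKD as defined in the text: eGFR < 60 or UACR >= 30 mg/g."""
    return egfr < 60 or uacr_mg_g >= 30
```

For example, a participant with a normal eGFR but a UACR of 35 mg/g would still meet the CKD definition through the albuminuria criterion.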

Statistical analysis

We used multiple imputation for arbitrary missing-data patterns to correct for response bias under the assumption of missing at random and to make maximal use of the existing risk factor data. Continuous variables with skewed distributions were log-transformed to approximate normality in the imputation process. Linear regression was used to impute missing values for continuous variables, and logistic regression for variables with binary or ordinal responses. Each variable with missing data was imputed using the other variables as predictors.

We performed consensus clustering analysis on the participants with chronic kidney disease, with continuous variables standardized to a mean of 0 and a standard deviation of 1. The aim of the clustering algorithm is to maximize the number of clusters while maintaining high cluster consensus34. With the number of clusters prespecified as K = 2, 3, …, 7, the consensus clustering algorithm drew a random subset containing 80% of the data records without replacement, and this was repeated 100 times for each number of clusters. For each random subset, the K-means (Euclidean distance-based) algorithm was run and each individual was assigned to one of the clusters. After 100 iterations, the frequency with which each pair of individuals was grouped together was calculated for each K and used to construct a matrix of participants' pairwise consensus values34. In the consensus matrix, consensus values range from 0 (never clustered together) to 1 (always clustered together) and are shown on a color scale from white to bright blue. For each number of clusters, cluster memberships are marked by colored rectangles. The consensus matrix is ordered by the consensus clustering, which is displayed as a dendrogram atop the heatmap.
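The resampling procedure described above can be sketched in plain Python. This is an illustrative toy version and not the ConsensusClusterPlus implementation: the function names, the fixed number of K-means iterations, and the random initialization are assumptions, and the consensus value is normalized by the number of times a pair was sampled together, which is the usual definition of the consensus index.

```python
import random


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's K-means with squared Euclidean distance; returns labels."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(points, k, reps=100, frac=0.8, seed=0):
    """Fraction of subsamples in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j were co-clustered
    sampled = [[0] * n for _ in range(n)]   # times i and j were co-sampled
    size = max(k, round(frac * n))
    for _ in range(reps):
        idx = rng.sample(range(n), size)    # subsample without replacement
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += int(labels[a] == labels[b])
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

With two well-separated groups and K = 2, pairs within a group should have consensus values near 1 and pairs across groups values near 0, which is the block pattern the heatmap is meant to reveal.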

The optimal number of clusters was determined by inspecting the consensus matrix heat map, the within-cluster consensus scores, and the cumulative distribution function (CDF) plot (range 0–1)34. The CDF plot shows the CDF of the consensus values for each K; at the K where the area under the CDF reaches an approximate maximum, consensus and cluster confidence are at a maximum. The relative change in the area under the CDF curve between K and K − 1 was also used to determine the optimal number of clusters. The cluster consensus score, which ranges between 0 and 1, was defined as the average consensus value over all pairs of individuals belonging to the same cluster. A value approaching 1 indicates better cluster stability34.
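The three summary quantities above can be sketched as follows, given a consensus matrix stored as a list of lists. The function names are hypothetical, the area under the CDF is approximated with a simple Riemann sum over the unique off-diagonal pairs, and the details may differ from those used by the ConsensusClusterPlus package.

```python
def cdf_area(consensus, bins=100):
    """Approximate area under the empirical CDF of pairwise consensus values."""
    n = len(consensus)
    values = [consensus[i][j] for i in range(n) for j in range(i + 1, n)]
    area = 0.0
    for b in range(bins):
        x = (b + 1) / bins
        cdf_at_x = sum(v <= x for v in values) / len(values)
        area += cdf_at_x / bins
    return area


def relative_area_change(areas):
    """areas: dict mapping K to CDF area. Returns the relative change from K - 1 to K."""
    return {
        k: (areas[k] - areas[k - 1]) / areas[k - 1]
        for k in sorted(areas)
        if k - 1 in areas
    }


def cluster_consensus(consensus, labels, cluster):
    """Mean consensus over all pairs of individuals assigned to one cluster."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    pairs = [(i, j) for a, i in enumerate(members) for j in members[a + 1:]]
    if not pairs:
        return float("nan")  # undefined for clusters with fewer than two members
    return sum(consensus[i][j] for i, j in pairs) / len(pairs)
```

A K at which the relative change in area becomes small suggests that adding further clusters yields little gain in consensus.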

For continuous variables with a normal distribution, we calculated the mean and standard deviation; for continuous variables with skewed distributions, we calculated the median and interquartile range; and for categorical variables, we presented counts and percentages. To present the cluster profiles of the 45 variables, we graphically displayed the standardized means of continuous variables (metabolic risk factors and sodium intake) and the proportions of categorical variables (the other variables) by cluster. Mortality rates were calculated as the number of events divided by the person-months of observation, censored at the date of event occurrence, death, or follow-up visit, whichever came first. Adjusted Cox proportional hazards models were used to calculate hazard ratios (HRs) with 95% confidence intervals (CIs) for all-cause mortality, CVD mortality, cancer mortality and mortality due to other causes by cluster.
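The event-rate calculation can be written out as a short sketch. The record keys `died` and `months`, the function name, and the scaling to 1,000 person-months are hypothetical choices for illustration; the source states only that events were divided by person-months of observation.

```python
def mortality_rate(records, per=1000):
    """Events per `per` person-months of follow-up.

    records: list of dicts with hypothetical keys 'died' (bool) and
    'months' (follow-up time in months, already censored as appropriate).
    """
    events = sum(1 for rec in records if rec["died"])
    person_months = sum(rec["months"] for rec in records)
    if person_months <= 0:
        raise ValueError("total follow-up time must be positive")
    return per * events / person_months
```

Computing this separately within each cluster gives the crude rates that the adjusted Cox models then compare.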

All statistical analyses were conducted using R version 4.2.3. Consensus clustering analysis was performed using the ConsensusClusterPlus function (minimum K = 2, maximum K = 7, replications = 100, proportion of random subset = 0.8, Euclidean distance-based K-means algorithm) in the ‘ConsensusClusterPlus’ package.
