BIOSTATISTICS, EPIDEMIOLOGY, & RESEARCH DESIGN FORUM • Advances and Challenges in Wearables Research • Julia Wrobel, PhD, Keynote Speaker • Friday, November 3, 10:00 AM to 3:00 PM • REGISTER: bit.ly/BERD2023 • In-Person: Morehouse School of Medicine, Building A, 4th Floor • Sr. Biostatistician • Virtual: Zoom
Accelerometers • Physical activity is key to many health-related questions • Active individuals tend to live longer and healthier lives • Traditionally, physical activity has been measured using retrospective questionnaires • Accelerometers have become hugely popular • Objective • Collection “in the wild” • High resolution
Accelerometer data processing pipeline • PA measures: total steps / counts, MVPA minutes • Sedentary measures: sedentary time, number of sedentary bouts
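A minimal sketch of how the summary measures above might be computed from one day of minute-level activity counts. The cut points (100 and 2020 counts per minute), the 10-minute minimum bout length, and the function name `summarize_day` are illustrative assumptions, not values from the talk; real cut points depend on device, age group, and placement.

```python
from itertools import groupby

# Hypothetical cut points in counts per minute (assumptions for illustration).
SEDENTARY_MAX = 100   # counts below this are treated as sedentary
MVPA_MIN = 2020       # counts at or above this are treated as MVPA


def summarize_day(counts, min_bout_len=10):
    """Summarize one day of minute-level activity counts."""
    sedentary = [c < SEDENTARY_MAX for c in counts]
    # Lengths of consecutive runs of sedentary minutes.
    run_lengths = [
        sum(1 for _ in run) for is_sed, run in groupby(sedentary) if is_sed
    ]
    return {
        "total_counts": sum(counts),
        "mvpa_minutes": sum(1 for c in counts if c >= MVPA_MIN),
        "sedentary_minutes": sum(sedentary),
        # A bout is a sedentary run lasting at least min_bout_len minutes.
        "sedentary_bouts": sum(1 for n in run_lengths if n >= min_bout_len),
    }


# Toy day: 30 sedentary minutes, 5 MVPA minutes, then 12 sedentary minutes.
day = [0] * 30 + [3000] * 5 + [50] * 12
summary = summarize_day(day)
```

Changing either cut point changes every derived measure, which is one reason to keep the data in as raw a form as possible.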
Reproducibility and rigor • Much of this is still up for debate • Consider moderate-to-vigorous physical activity (MVPA) • How are “activity counts” generated? • How are cut points formed (no PA / light PA / MVPA)? • Are these consistent across devices? Age groups? Placements? • Some general recommendations • Keep data in the rawest form possible • Process using non-proprietary software
Functional data analysis (FDA) • Wearable devices record signals over 24-hour periods, the exact focus of FDA! • In FDA, the outcome is a curve or function $Y_i(t)$ • For accelerometer data, $Y_i(t)$ is a 24-hour activity profile [Figure: example 24-hour activity profile, $Y_i(t)$ against $t$ (hour)]
Uses for FDA in wearables • Less pre-processing of the raw data • Less information is discarded • Better ways of imputing data • Missing data is a big problem in wearables • Time-dependent interpretations • Timing and consistency • Does it matter when and how regularly someone moves?
FDA tools for massive accelerometer studies • Function-on-scalar regression (FoSR) • Functional outcome, scalar predictors (e.g. age) • UK Biobank Accelerometry Study • 80,000+ participants • Generalized functional principal components analysis (gFPCA) • National Health and Nutrition Examination Survey (NHANES) • 4,000+ participants (2011-2014 wave) • Registration • How does timing of wake/sleep, PA differ across people? • Baltimore Longitudinal Study of Aging (BLSA) • 500+ participants
FDA of 88,693 subjects from UK Biobank study • Average daily activity patterns across ages from functional regression • Left panel shows males, right panel shows females • J. Wrobel, J. Muschelli, and A. Leroux (2021). Sensors.
Generalized functional principal components analysis • Generalized FPCA and generalized regression model exponential family functional data using a generalized linear model (GLM)-like framework • $g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik}\phi_k(s)$ • $Y_i \sim$ Exponential Family; $g(\cdot)$ is a link function • $\beta_0(s)$ is a population mean function • $\phi_k(s)$ are population-level eigenfunctions • $\xi_{ik}$ are subject-specific scores
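To make the model concrete, the sketch below evaluates the linear predictor $\eta_i(s) = \beta_0(s) + \sum_k \xi_{ik}\phi_k(s)$ on a small grid and maps it to a probability. The logit link (binary data such as active vs. inactive), the function names, and all numeric values are assumptions chosen for illustration; the toy basis functions are not normalized eigenfunctions.

```python
import math


def linear_predictor(beta0, phi, xi):
    """Evaluate eta_i(s) = beta_0(s) + sum_k xi_ik * phi_k(s) on a grid.

    beta0: mean function values, one per grid point
    phi:   list of K basis functions, each a list of values on the same grid
    xi:    list of K subject-specific scores
    """
    return [
        b + sum(x * p[j] for x, p in zip(xi, phi))
        for j, b in enumerate(beta0)
    ]


def inverse_logit(eta):
    """Inverse of the logit link: maps eta to E[Y] for binary Y."""
    return [1.0 / (1.0 + math.exp(-e)) for e in eta]


# Toy inputs on a 4-point grid with K = 2 (all values hypothetical).
beta0 = [0.0, 0.5, 0.0, -0.5]
phi = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
xi_i = [0.5, -0.5]

eta_i = linear_predictor(beta0, phi, xi_i)   # latent scale
prob_i = inverse_logit(eta_i)                # probability scale
```

With a different exponential family member, only the inverse link changes (for example, `math.exp` for Poisson counts with a log link).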
The NHANES 2011-2014 accelerometer study • National Health and Nutrition Examination Survey • Accelerometer data from 2011-2014 wave released in 2021 • Accelerometer data over multiple days from > 4000 subjects • 1440 minutes per day of PA measurement • Goal is to understand population patterns in sedentary behavior • Existing FDA methods cannot handle data of this size • We proposed a fast, general-purpose algorithm for generalized FPCA
Four-step fast GFPCA algorithm
$g(E[Y_i(s)]) = \eta_i(s) = \beta_0(s) + b_i(s) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik}\phi_k(s)$
1. Bin the data along the functional domain $s$ into $L$ bins
2. Estimate separate local GLMMs in each bin to obtain $\eta_i(s_{m_l})$ at each bin midpoint
3. Estimate FPCA on the local latent estimates $\eta_i(s_{m_l})$ to obtain eigenfunctions $\boldsymbol{\phi}(s)$
4. Estimate the global model conditioning on eigenfunctions $\boldsymbol{\phi}(s)$ by re-estimating the subject-specific scores $\xi_{ik}$
A. Leroux, C. Crainiceanu, and J. Wrobel (2023+). Fast generalized functional principal components analysis. Under review.
fastGFPCA simulation results • Compared with two existing methods • Variational Bayes binary FPCA (Wrobel, 2019), bfpca • Cannot estimate Poisson or other distributions • Two-step conditional model (Gertheiss, 2017), tsGFPCA • Breaks for N > 100 • fastGFPCA is • More accurate than tsGFPCA for binary and Poisson data • An order of magnitude faster • As accurate as or more accurate than bfpca for binary data • Comparable computation time
GFPCA results for NHANES data • 4286 participants with 1440 observations each • 3-4 hours of computation time (step 4 is the slow step) • A subsampled version of step 4 reduced computation time to ~22 minutes
Misalignment in accelerometer data • Time variation: subjects start and end the day at different times • Activity level variation: people have higher or lower levels of activity
Registration methods align functional data by warping the domain • Most methods are computationally inefficient and handle only continuous data [Figure: panels labeled $\mu_i(t_i^*)$; $h_i^{-1}(t_i^*) = t$; $\mu_i(h_i^{-1}(t_i^*)) = \mu_i(t)$]
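The sketch below illustrates what warping the domain means: an inverse warping function $h_i^{-1}$ maps a subject's observed time $t_i^*$ to internal time $t$, so that $\mu_i(h_i^{-1}(t_i^*)) = \mu_i(t)$. The piecewise-linear form, the knot values, and the triangular template are assumptions for illustration only; this applies a given warp and does not estimate one.

```python
from bisect import bisect_right


def make_inverse_warp(knots_observed, knots_internal):
    """Build a piecewise-linear h_inv: observed time t* -> internal time t.

    Both knot lists must be strictly increasing and share their endpoints,
    so the warp is monotone and keeps the start and end of the day fixed.
    """
    if len(knots_observed) != len(knots_internal) or len(knots_observed) < 2:
        raise ValueError("need two knot lists of equal length >= 2")
    for knots in (knots_observed, knots_internal):
        if any(a >= b for a, b in zip(knots, knots[1:])):
            raise ValueError("knots must be strictly increasing")

    def h_inv(t_star):
        # Locate the segment containing t_star, clamped to valid segments.
        j = bisect_right(knots_observed, t_star) - 1
        j = min(max(j, 0), len(knots_observed) - 2)
        x0, x1 = knots_observed[j], knots_observed[j + 1]
        y0, y1 = knots_internal[j], knots_internal[j + 1]
        return y0 + (y1 - y0) * (t_star - x0) / (x1 - x0)

    return h_inv


def template(t):
    """Hypothetical template on [0, 1] with its activity peak at t = 0.5."""
    return 1.0 - abs(2.0 * t - 1.0)


# A subject whose peak occurs late, at observed time 0.6 instead of 0.5.
h_inv = make_inverse_warp([0.0, 0.6, 1.0], [0.0, 0.5, 1.0])
peak_value = template(h_inv(0.6))  # subject's curve evaluated at its peak
```

Here `h_inv(0.6)` returns 0.5, so the subject's late peak lines up with the template's peak once the domain is warped.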
Algorithm and software optimized for computational efficiency • Step 1: Estimates template to which curves are registered • Uses fast, novel variational EM algorithm for binary functional data • Step 2: Estimates warping function for each subject • Uses constrained maximum likelihood estimation • Implemented in R package registr • Implemented in C++ • Wrobel, Goldsmith (2019). Registration for exponential family functional data. Biometrics. • Wrobel (2018). registr: Registration for exponential family functional data. Journal of Open Source Software, 3.
Future methods work in these areas • Fast GFPCA • Multilevel data (Monday-Sunday) • Xinkai Zhou • Sparse and irregular data • Fast generalized function-on-scalar regression • Dustin Rogers • Registration • Multilevel registration
Acknowledgements Colorado SPH Biostatistics • Andrew Leroux • Dustin Rogers Columbia Biostatistics Functional Data Analysis Working Group • Jeff Goldsmith Johns Hopkins School of Public Health WIT: Wearable and Implantable Technology • Vadim Zipunnikov • Jennifer Schrack • John Muschelli • Ciprian Crainiceanu • Xinkai Zhou
Step 1: bin the data • Choose $L$ bins, where $m_l$ is the midpoint of bin $l \in \{1, \ldots, L\}$ • Considerations • Bin width: for simplicity, equidistant and non-overlapping bins • Number of bins • Too many bins: bin width is too small, identifiability issues • Too few bins: bin width is too big, bins don’t capture the shape of the underlying function
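A minimal sketch of the binning step for a regular grid, such as the 1440 minutes of a day. The function name `make_bins` and the requirement that the grid length be divisible by the number of bins are simplifying assumptions, not details taken from the talk.

```python
def make_bins(n_points, n_bins):
    """Split grid indices 0..n_points-1 into equal-width, non-overlapping bins.

    Returns a list of (start, stop, midpoint) tuples, where stop is exclusive
    and midpoint is the center of the indices start..stop-1.
    """
    if n_bins < 1 or n_points % n_bins != 0:
        raise ValueError("n_points must be a positive multiple of n_bins")
    width = n_points // n_bins
    return [
        (l * width, (l + 1) * width, l * width + (width - 1) / 2)
        for l in range(n_bins)
    ]


# 1440 minutes split into L = 24 hourly bins.
bins = make_bins(1440, 24)
midpoints = [m for _, _, m in bins]

# Group one subject's minute-level binary observations by bin.
y = [1 if 480 <= minute < 1320 else 0 for minute in range(1440)]
y_by_bin = [y[start:stop] for start, stop, _ in bins]
```

Each entry of `y_by_bin` holds the observations that would feed one local model, and `midpoints` is the coarser domain on which the latent estimates live.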
Step 2: fit generalized linear mixed model in each bin • Fit a separate GLMM in each bin to get latent estimates • $g(E[Y_i(s_{m_l})]) = \beta_0(s_{m_l}) + b_i(s_{m_l}) = \eta_i(s_{m_l})$ • $s_{m_l}$: time $s$ at the midpoint of bin $l$ • $\beta_0(s_{m_l})$: fixed effect mean • $b_i(s_{m_l})$: subject-specific random effect • $\eta_i(s_{m_l})$: linear predictor, local latent estimates • Estimates are not on the original domain • On domain defined by bin midpoints • Model assumes constant effect for $\beta_0$, $b_i$ across each bin • Used for estimating covariance matrix and eigenfunctions
Step 3: estimate eigenfunctions using FPCA • Estimate FPCA using the linear predictor from Step 2 • $\hat{\eta}_i(s_{m_l}) = \hat{\beta}_0(s_{m_l}) + \sum_{k=1}^{K} \hat{\xi}_{ik}\hat{\phi}_k(s_{m_l})$ • Estimated using refund::fpca.face() • Eigenfunctions $\hat{\boldsymbol{\phi}}$ characterize the covariance • $K$: chosen by percent variance explained • Evaluated at bin midpoints rather than on the original domain • Project eigenfunctions onto the original domain
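Two of the bullets above can be sketched in a few lines: choosing $K$ by percent variance explained, and moving an eigenfunction from the bin midpoints back to the original domain. The 0.95 threshold, the use of linear interpolation with constant extension beyond the outer midpoints, and both function names are assumptions; the talk does not specify how the projection is carried out.

```python
from bisect import bisect_right


def choose_k(eigenvalues, pve=0.95):
    """Smallest K whose leading eigenvalues explain at least `pve` of variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= pve:
            return k
    return len(eigenvalues)


def interpolate(midpoints, values, grid):
    """Linearly interpolate values known at bin midpoints onto a finer grid.

    Outside the first and last midpoint the nearest value is held constant.
    """
    out = []
    for s in grid:
        if s <= midpoints[0]:
            out.append(values[0])
        elif s >= midpoints[-1]:
            out.append(values[-1])
        else:
            j = bisect_right(midpoints, s) - 1
            x0, x1 = midpoints[j], midpoints[j + 1]
            y0, y1 = values[j], values[j + 1]
            out.append(y0 + (y1 - y0) * (s - x0) / (x1 - x0))
    return out


# Hypothetical eigenvalues: cumulative shares are 0.60, 0.90, 0.98, 1.00.
k = choose_k([6.0, 3.0, 0.8, 0.2], pve=0.95)

# Hypothetical eigenfunction known at three bin midpoints.
phi_on_grid = interpolate([0.5, 1.5, 2.5], [0.0, 1.0, 0.0], [0, 1, 2, 3])
```

Interpolated functions are generally no longer exactly orthonormal on the finer grid, so a renormalization step may be needed before they are used as basis functions.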
Step 4: estimate GFPCA • Estimate GFPCA conditional on eigenfunctions from Step 3 • $g(E[Y_i(s) \mid \hat{\boldsymbol{\phi}}]) = \beta_0(s) + \sum_{k=1}^{K} \xi_{ik}\hat{\phi}_k(s)$ • Eigenfunctions are orthogonal basis functions • Reduces number of covariance parameters that need to be estimated for random effects • Simple implementation • mgcv::bam()