class: title-slide, center, middle, remark-slide-content, inverse, hljs-github

# GPU Computing with R

### Jared P. Lander
### Chief Data Scientist

<img src="data:image/png;base64,#/home/jared/consulting/talks/images/Lander_logo.png" width="40%" />

???

- GPUs and R
<!-- Start main content -->

---
class: middle, center
background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/bitcoin-logo.png")
background-size: contain

???

- Going to talk about Bitcoin!

---
class: middle, center

.large[Just Kidding]

???

- OK, we won't

---
class: middle, center
background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/doge-logo.png")
background-size: contain

???

- Really talking about Doge

---
class: middle, center

.large[Just Kidding]

???

- OK, not this either

---
class: middle, center
background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/nvidia-dgx1.jpg")
background-size: contain

???

- GPU for scientific computing
- Like stats and machine learning

---
class: middle, center

.large[Not (Just) Deep Learning]

???

- Talk about other uses than deep learning
- But deep learning too

---
class: part

# Why the GPU?

???

- Why do we want to use the GPU?
- This thing for video games
- It all comes down to cores...

---
class: middle, center, hide_logo
background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/cpu-gpu-cores.png")
background-size: contain

???

- Source: [Nvidia](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
- CPU: 4 cores
- GPU: thousands of cores
- Price per core is lower
- Massively parallelize
- Why can't CPUs have thousands of cores?
- Not general purpose
- Must all be doing the same kind of calculation

---
class: middle, center

# Why Do We Need So Many Cores?

???

- What do we do with them?

---
class: middle

- Programming
- Matrix Algebra
- Parallel Loops
- Machine Learning
- Deep Learning
- Boosted Trees
- Bayesian Methods

???

- Deep learning is just matrix algebra, repeatedly
- Yes, we can speed up Bayes

---
class: part

# CRAN Packages

???

- Let's see what CRAN has
- CRAN Task View for High Performance Computing
- And it lists...

---
class: middle

- [`{gpuR}`](https://cran.r-project.org/web/packages/gpuR/index.html)
- [`{gputools}`](https://cran.r-project.org/web/packages/gputools/index.html)
- [`{gmatrix}`](https://cran.r-project.org/web/packages/gmatrix/index.html)
- [`{magma}`](https://cran.r-project.org/web/packages/magma/index.html)
- [`{rpud}`](https://cran.r-project.org/web/packages/rpud/index.html)
- [`{cudaBayesreg}`](https://cran.r-project.org/web/packages/cudaBayesreg/index.html)
- [`{OpenCL}`](https://cran.r-project.org/web/packages/OpenCL/index.html)
- [`{permGPU}`](https://cran.r-project.org/web/packages/permGPU/index.html)
- [`{torch}`](https://cran.r-project.org/web/packages/torch/index.html)
- [`{keras}`](https://cran.r-project.org/web/packages/keras/index.html)

???

- Some GPU packages on CRAN
- gpuR
- gputools
- gmatrix
- magma

---

<div class="container">
<p>Package ‘gpuR’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/gpuR">archive</a>. </p>
<p> Archived on 2020-06-02 as check problems were not corrected despite reminders. </p>
<p>A summary of the most recent check results can be obtained from the <a href="https://cran-archive.R-project.org/web/checks/2020/2020-06-02_check_results_gpuR.html">check results archive</a>.</p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=gpuR"><samp>https://CRAN.R-project.org/package=gpuR</samp></a> to link to this page.</p>
</div>

???
- gpuR

---

<div class="container">
<p>Package ‘gputools’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/gputools">archive</a>. </p>
<p> Archived on 2017-12-19 as check problems were not corrected despite reminders. </p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=gputools"><samp>https://CRAN.R-project.org/package=gputools</samp></a> to link to this page.</p>
</div>

???

- gputools

---

<div class="container">
<p>Package ‘gmatrix’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/gmatrix">archive</a>. </p>
<p> Archived on 2017-12-19 as check problems were not corrected despite reminders. </p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=gmatrix"><samp>https://CRAN.R-project.org/package=gmatrix</samp></a> to link to this page.</p>
</div>

???

- gmatrix

---

<div class="container">
<p>Package ‘magma’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/magma">archive</a>. </p>
<p> Archived on 2015-06-03 as requested by the maintainer <brian-j-smith@uiowa.edu>. </p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=magma"><samp>https://CRAN.R-project.org/package=magma</samp></a> to link to this page.</p>
</div>

???

- magma

---

<div class="container">
<p>Package ‘rpud’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/rpud">archive</a>. </p>
<p> Archived on 2015-11-21 as check problems were not corrected despite reminders. </p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=rpud"><samp>https://CRAN.R-project.org/package=rpud</samp></a> to link to this page.</p>
</div>

???

- rpud

---

<div class="container">
<p>Package ‘cudaBayesreg’ was removed from the CRAN repository.</p>
<p> Formerly available versions can be obtained from the <a href="https://CRAN.R-project.org/src/contrib/Archive/cudaBayesreg">archive</a>. </p>
<p> Archived on 2018-11-29 at the request of the maintainer. </p>
<p>A summary of the most recent check results can be obtained from the <a href="https://cran-archive.R-project.org/web/checks/2018/2018-11-29_check_results_cudaBayesreg.html">check results archive</a>.</p>
<p>Please use the canonical form <a href="https://CRAN.R-project.org/package=cudaBayesreg"><samp>https://CRAN.R-project.org/package=cudaBayesreg</samp></a> to link to this page.</p>
</div>

???

- cudaBayesreg
- Haven't updated the task view

---
class: middle

- [`{OpenCL}`](https://cran.r-project.org/web/packages/OpenCL/index.html)
- [`{permGPU}`](https://cran.r-project.org/web/packages/permGPU/index.html)
- [`{torch}`](https://cran.r-project.org/web/packages/torch/index.html)
- [`{keras}`](https://cran.r-project.org/web/packages/keras/index.html)

???

- Still on CRAN
- Bioconductor has some
- Only OpenCL is general purpose
- Torch isn't even listed

---
class: middle, center

# More Not on CRAN

???

- More options off of CRAN
- We'll see some of those
- But first...

---
class: part

# Installing CUDA

???

- General-purpose programming toolkit
- Computation on the GPU
- Can be really difficult to install
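---
class: middle

Before (and after) running the install script on the next slide, it is worth confirming the system actually sees an Nvidia card. A quick sanity check, not from the original deck; output varies by machine:

```sh
# does the machine have an Nvidia GPU at all?
lspci | grep -i nvidia

# once the drivers are installed, this should report the GPU,
# the driver version, and the supported CUDA version
nvidia-smi
```

???

- Check the hardware before blaming R packages
- nvidia-smi only works after the drivers are installed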
---
class: middle

```sh
# make sure the Linux headers are installed
sudo apt-get install linux-headers-$(uname -r)

# get the name and version of the OS
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')

# download the pin file that prioritizes Nvidia's repository
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin

# put it in the right spot
sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600

# get the GPG key to ensure authenticity
sudo apt-key adv --fetch-keys \
    https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub

# point to Nvidia's repository
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo \
    tee /etc/apt/sources.list.d/cuda.list

# make sure apt sees the new installers in the Nvidia repository
sudo apt update

# install the drivers
sudo apt -y install cuda-drivers
```

???

- Install correct Linux headers
- Download a pin file
- Fetch a GPG key
- Add an apt source
- Install cuda-drivers
- Also installs Nvidia drivers
- Nvidia focused
- Sometimes works
- Might want to use Docker
- Focus on programming and machine learning
- Start with programming...

---
class: part

# Programming

???

- Different frameworks for writing custom code on the GPU

---
class: section

# [`{OpenCL}`](https://cran.r-project.org/web/packages/OpenCL/index.html)

???

- Generic code
- Can run on different devices
- CPU
- GPU
- ASICs

---
class: middle

```sh
sudo apt install ocl-icd-opencl-dev
```

```r
install.packages('OpenCL')
```

???

- Install OpenCL
- ocl-icd-opencl-dev

---
class: middle

```r
dist_cl <- c(
'
__kernel void distance(
    __global numeric* output,
    const unsigned int out_length,
    int in_length,
    __global double* x,
    __global double* y
)
{
    // iterator
    size_t i = get_global_id(0);

    output[i] = sqrt(pow(x[i/in_length] - x[i%in_length], 2) + pow(y[i/in_length] - y[i%in_length], 2));
}
'
)
```

???

- Output vector is an argument
- Star for pointer
- As is length of output
- Globals because they are generic memory locations
- Pass in vectors rather than a matrix
- Code is less generic
- Regular distance function
- No loops, automatically loops over the length of the output
- Need to get the indices right

---
class: middle

```r
library(OpenCL)

platform <- oclPlatforms()[[1]]
device <- oclDevices(platform)[[1]]
ctx <- oclContext(device)
```

???

- Platform: Nvidia CUDA
- Device: Nvidia GeForce GTX 1080 Ti (gaming)
- This is the context for computations

---
class: middle

```r
distance_kernel <- oclSimpleKernel(
    # context we defined
    ctx,
    # same as function name in code
    name='distance',
    # character vector of code we wrote
    code=dist_cl,
    output.mode='numeric'
)
```

???

- Turn the function into a kernel
- Kernel runs in the context
- Name same as function name

---
class: middle

```r
len_in <- 30000

x <- rnorm(len_in)
y <- rnorm(len_in)

x_buf <- as.clBuffer(x, ctx)
y_buf <- as.clBuffer(y, ctx)
```

???

- Buffers are generic vectors in memory
- In this case on the GPU
- Copy memory
- Not awesome
- Pass to context

---
class: middle

```r
cl_ans <- oclRun(
    kernel=distance_kernel,
    size=len_in^2,
    in_length=as.integer(len_in),
    x=x_buf,
    y=y_buf
)
```

???

- Call the kernel: oclRun()
- Tell it how long the output is
- Pass in buffers/vectors
- Could have written a smarter function
- Limited to GPU memory
- OpenCL a little clunky
- Limited to vectors
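---
class: middle

The result lives in a `clBuffer` on the GPU. A minimal sketch of checking it against `dist()` on a smaller problem, assuming `as.numeric()` copies a buffer back to the host (the object names here are made up for illustration):

```r
n_small <- 1000
x_small <- rnorm(n_small)
y_small <- rnorm(n_small)

small_ans <- oclRun(
    kernel=distance_kernel,
    size=n_small^2,
    in_length=as.integer(n_small),
    x=as.clBuffer(x_small, ctx),
    y=as.clBuffer(y_small, ctx)
)

# copy back to the CPU and compare with dist(); should agree up to floating-point error
D_gpu <- matrix(as.numeric(small_ans), nrow=n_small)
D_cpu <- as.matrix(dist(cbind(x_small, y_small)))
max(abs(D_gpu - D_cpu))
```

???

- Copying results back is its own cost
- Always spot-check a GPU kernel against the CPU answer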
---
class: section

# `{RcppArrayFire}`

???

- Tensor library for GPUs
- Tensors are just arrays
- Essentially matrix multiplication
- Like Armadillo for the GPU
- Vectorized like R

---
class: middle

```sh
wget https://arrayfire.s3.amazonaws.com/3.8.0/ArrayFire-v3.8.0_Linux_x86_64.sh

sudo sh ./ArrayFire-v3.8.0_Linux_x86_64.sh --include-subdir --prefix=/opt

echo /opt/arrayfire/lib64 | sudo tee /etc/ld.so.conf.d/arrayfire.conf

sudo ldconfig
```

???

- Install the ArrayFire binary
- Into /opt
- Tell ldconfig where ArrayFire lives
- Can also install with apt, but that didn't work

---
class: middle

```r
drat::addRepo("daqana")
install.packages("RcppArrayFire")
```

???

- Not on CRAN
- Install from the daqana drat repository
- Write C++ functions like with Rcpp

---
class: middle

```cpp
#include <RcppArrayFire.h>

// [[Rcpp::depends(RcppArrayFire)]]

// [[Rcpp::export]]
af::array distance(RcppArrayFire::typed_array<f32> x)
{
    int num_rows = x.dims(0);
    af::array res = af::constant(0.0, num_rows, num_rows);

    gfor(af::seq i, num_rows)
    {
        for(int j = 0; j < num_rows; j++)
        {
            res(i, j) = af::sqrt(af::sum(af::pow(x(i, af::span) - x(j, af::span), 2)));
        }
    }

    return res;
}
```

???

- gfor is parallel
- for is within gfor
- Can't depend on i
- Would be better to unroll the double loop

---
class: middle

```r
X <- matrix(rnorm(60000), ncol=2)

dist_fire <- distance(X)
```

???

- Create a matrix to test on
- Call the function on a matrix like dist()
- Still have to move data from CPU memory to GPU memory
- Hasn't been updated in 6 years
- Rather than writing external code...

---
class: section

# nvblas

???

- CUDA matrix algebra library
- Drop-in replacement for linear algebra routines
- MKL
- ATLAS
- OpenBLAS
- Use regular R functions on the GPU

---
class: middle

## /etc/nvblas.conf

```sh
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
```

???

- Comes installed with CUDA
- Tells R to use CUDA for matrix algebra

---
class: middle

```sh
env LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so NVBLAS_CONFIG_FILE=/etc/nvblas.conf /opt/R/4.1.0/bin/R
```

???

- Launch R with cuBLAS as the matrix library
- Ways to do this with RStudio
- LD_PRELOAD: path to nvblas
- NVBLAS_CONFIG_FILE: path to the config file

---
class: middle

```r
A <- matrix(rnorm(100000000, mean=4, sd=2), ncol=1000)
B <- matrix(rnorm(10000000, mean=4, sd=2), ncol=100)

C <- t(A) %*% B
```

???

- R uses GPU memory and computation
- Matrix algebra as usual

---
class: middle

```r
X <- matrix(rnorm(60000), ncol=2)

dist_nvblas <- dist(X)
```

???

- Call the normal dist() function
- No special code needed
- Limited by GPU memory

---
class: section

# [`{torch}`](https://cran.r-project.org/web/packages/torch/index.html)

???

- Not just for neural nets
- General purpose matrix algebra
- Highly optimized

---
class: middle

```r
install.packages('torch')
torch::install_torch(type='11.1')
```

```r
torch::cuda_is_available()
```

???

- First, install
- 11.1 means use the GPU
- Automatically installs when loading the package
- Might not work with renv

---
class: middle

```r
library(torch)

A_torch <- torch_normal(mean=4, std=2, size=c(100000, 1000))$cuda()
B_torch <- torch_normal(mean=4, std=2, size=c(100000, 100))$cuda()

C_torch <- A_torch$t()$matmul(B_torch)
```

???

- Transpose of A multiplied by B
- Should be able to generate normals with device='cuda'
- Some weird bug
- Same idea but faster

---
class: middle
exclude: true

```r
bench::mark(cpu=t(A) %*% B, iterations=50)
bench::mark(gpu=A_torch$t()$matmul(B_torch), iterations=50)
```

???

- Not necessarily a lot faster
- But faster still
- Important for repeated runs
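---
class: middle

The product stays in GPU memory until it is copied back. A small sketch, assuming `$cpu()` and `as_array()` behave as in current releases of `{torch}`:

```r
# move the result back to the CPU and convert to an ordinary R matrix
C_host <- C_torch$cpu() |> as_array()

dim(C_host)
C_host[1:3, 1:3]
```

???

- Keep data on the GPU as long as possible
- Copy back only what you need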
---
class: middle

```r
x_torch <- torch_normal(mean=35, std=10, size=c(30000, 2))$cuda()

dist_torch <- torch_pdist(x_torch)
```

???

- Let's compare distance
- Limited by GPU memory

---
class: middle
exclude: true

```r
bench::mark(cpu=dist(x, upper=FALSE), iterations=50)
bench::mark(gpu=torch_pdist(x_torch), iterations=50)
```

???

- torch on CPU is only slightly faster than dist()

---
class: section

# Moral of the Story

???

- What did we learn so far
- If you're doing matrix algebra...

---
class: middle, center

# Just use `{torch}`

???

- Easiest and fastest
- Both computer and human
- Turn our attention to...

---
class: part

# Machine Learning

???

- Look at a few methods
- Deep Learning
- Boosted Trees
- Bayesian Regression
- Pick up where we left off with...

---
class: section

# [`{torch}`](https://cran.r-project.org/web/packages/torch/index.html)

???

- torch
- Deep Learning

---
class: middle

```r
data(credit_data, package='modeldata')
credit <- credit_data |> tibble::as_tibble()
credit
```

```
## # A tibble: 4,454 × 14
##    Status Seniority Home    Time   Age Marital Records Job       Expenses Income
##    <fct>      <int> <fct>  <int> <int> <fct>   <fct>   <fct>        <int>  <int>
##  1 good           9 rent      60    30 married no      freelance       73    129
##  2 good          17 rent      60    58 widow   no      fixed           48    131
##  3 bad           10 owner     36    46 married yes     freelance       90    200
##  4 good           0 rent      60    24 single  no      fixed           63    182
##  5 good           0 rent      36    26 single  no      fixed           46    107
##  6 good           1 owner     60    36 married no      fixed           75    214
##  7 good          29 owner     60    44 married no      fixed           75    125
##  8 good           9 parents   12    27 single  no      fixed           35     80
##  9 good           0 owner     60    32 married no      freelance       90    107
## 10 bad            0 parents   48    41 married no      partime         90     80
## # … with 4,444 more rows, and 4 more variables: Assets <int>, Debt <int>,
## #   Amount <int>, Price <int>
```

???

- Credit data
- Want to predict status

---
class: middle

```r
library(rsample)

credit_split <- initial_split(credit, prop=0.8, strata='Status')
train <- training(credit_split)
test <- testing(credit_split)
```

???

- Some data for training
- Some for testing

---
class: middle

```r
library(recipes)

rec_torch <- train |>
    recipe(Status ~ .) |>
    step_impute_knn(all_predictors()) |>
    step_normalize(all_numeric_predictors()) |>
    step_integer(Status, zero_based=FALSE) |>
    step_dummy(all_nominal_predictors(), one_hot=TRUE) |>
    # why bother carrying the data in this object
    prep(retain=FALSE)
```

???

- Great way to preprocess data
- Better than a formula
- Impute missing data
- Standardize numerics
- Should embed categoricals, but dummy for now
- Output is 1/2
- Different for torch than other methods

---
class: middle

```r
library(torch)

credit_dataset <- dataset(
    name='credit_data',

    initialize=function(df, recipe){
        self$x <- recipe |>
            bake(all_predictors(), new_data=df, composition='matrix') |>
            torch_tensor(dtype=torch_float())

        self$y <- recipe |>
            bake(all_outcomes(), new_data=df, composition='matrix') |>
            torch_tensor(dtype=torch_float())
    },

    .getitem=function(i){
        list(x=self$x[i, ], y=self$y[i])
    },

    .length=function(){
        length(self$y)
    }
)
```

???

- Dataset used by the dataloader
- Bakes the recipe for x and y
- Leaves data on the CPU
- Later move portions to the GPU

---
class: middle

```r
train_ds <- credit_dataset(train, recipe=rec_torch)
test_ds <- credit_dataset(test, recipe=rec_torch)

train_dl <- dataloader(dataset=train_ds, batch_size=64, shuffle=TRUE)
test_dl <- dataloader(dataset=test_ds, batch_size=64, shuffle=TRUE)
```

???

- Dataloaders iterate through the data sets
- Leave data on the CPU for now
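---
class: middle

A quick way to see what a dataloader hands the model, using the iterator helpers from `{torch}` (a sketch; the exact shapes depend on the recipe):

```r
# pull a single batch off the training dataloader
batch <- train_dl |> dataloader_make_iter() |> dataloader_next()

batch$x$shape
batch$y$shape
```

???

- Each batch is a list of tensors
- This is what gets moved to the GPU during training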
---
class: middle

```r
net <- nn_module(
    classname='credit_net',

    initialize=function(
        n_col,
        fc1_dim, fc2_dim, fc3_dim,
        output_dim
    ){
        self$fc1 <- nn_linear(in_features=n_col, out_features=fc1_dim)
        self$fc2 <- nn_linear(in_features=fc1_dim, out_features=fc2_dim)
        self$fc3 <- nn_linear(in_features=fc2_dim, out_features=fc3_dim)
        self$output <- nn_linear(in_features=fc3_dim, out_features=output_dim)
    },

    forward=function(x){
        x |>
            nnf_dropout(p=0.2) |>

            self$fc1() |>
            nnf_relu() |>
            nnf_dropout(p=0.5) |>

            self$fc2() |>
            nnf_relu() |>
            nnf_dropout(p=0.5) |>

            self$fc3() |>
            nnf_relu() |>
            nnf_dropout(p=0.5) |>

            self$output()
    }
)
```

???

- Fairly simple
- Still complex
- Three hidden layers
- ReLU activation function
- All the power is in the activation
- Extreme non-linear modeling
- Dropout

---
class: middle

```r
torch_model <- net(
    n_col=summary(rec_torch) |> dplyr::filter(role == 'predictor') |> nrow(),
    fc1_dim=128, fc2_dim=64, fc3_dim=32,
    output_dim=1
)
```

???

- Build the network
- Before we designed it
- Now it's instantiated
- Provide dim sizes

---
class: middle

```r
*torch_model$cuda()
```

???

- Move the model to the GPU
- All the weights are on the GPU
- Where the matrix algebra takes place

---
class: middle

```r
optimizer <- optim_adam(torch_model$parameters, lr = 0.0001)

for (epoch in 1:10) {

    torch_model$train()
    train_loss <- c()

    coro::loop(for (b in train_dl) {
        optimizer$zero_grad()
*       output <- torch_model(b$x$cuda())
*       loss <- nnf_binary_cross_entropy_with_logits(output, b$y$cuda())
        loss$backward()
        optimizer$step()
        train_loss <- c(train_loss, loss$item())
    })

    torch_model$eval()
    val_loss <- c()

    coro::loop(for (b in test_dl) {
*       output <- torch_model(b$x$cuda())
*       loss <- nnf_binary_cross_entropy_with_logits(output, b$y$cuda())
        val_loss <- c(val_loss, loss$item())
    })

    cat(sprintf("Loss at epoch %d: training: %3.3f, validation: %3.3f\n",
                epoch, mean(train_loss), mean(val_loss)))
}
```

???

- Define the optimizer
- 10 epochs
- cuda() is all that's needed to run on the GPU
- Only want portions of data on the GPU at a time
- Same device as the model weights
- torch_model() does matrix algebra
- A lot of boilerplate
- Still a simpler way...

---
class: middle

```r
library(luz)

mod_set <- net |>
    setup(
        loss=nn_bce_with_logits_loss(pos_weight=.5),
        optimizer=optim_adam,
        metrics = list(
            luz_metric_binary_accuracy_with_logits()
        )
    ) |>
    set_hparams(
        n_col=summary(rec_torch) |> dplyr::filter(role == 'predictor') |> nrow(),
        fc1_dim=128, fc2_dim=64, fc3_dim=32,
        output_dim=1
    ) |>
    set_opt_hparams(lr=0.01)
```

???

- Handles epochs
- Iterating over data
- Validation
- Choose the optimizer
- Set parameters
- Hidden layer sizes

---
class: middle

```r
*fitted <- fit(object=mod_set, data=train_dl, epochs=10, valid_data=test_dl)
```

???

- luz handles GPU usage for you
- If a GPU is available, it's used
- No double for loops

---
class: section

# [`{xgboost}`](https://cran.r-project.org/web/packages/xgboost/index.html)

???

- Great algorithm for predictions

---
class: middle

```sh
# download the xgboost package
wget https://github.com/dmlc/xgboost/releases/download/v1.4.1/xgboost_r_gpu_linux_1.4.1.tar.gz

# install needed packages from the command line
/opt/R/4.1.0/bin/R -q -e "install.packages(c('data.table', 'jsonlite', 'DiagrammeR'))"

# install xgboost with GPU support
/opt/R/4.1.0/bin/R CMD INSTALL ./xgboost_r_gpu_linux_1.4.1.tar.gz
```

???
- Can't install from CRAN
- Prebuilt binaries
- Install from the command line
- Used to have to compile it yourself

---
class: middle, center
background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/xgboost-gpu-github-request.png")
background-size: contain

???

- I want to claim partial credit
- Open source: ask nicely and you receive

---
class: middle

```r
rec_xg <- recipe(Status ~ ., data=train) %>%
    # replace missing characters with unknown
    step_unknown(all_nominal_predictors(), new_level='missing') %>%
    # handle levels that only appear in new data
    step_novel(all_nominal_predictors(), new_level='Unseen') %>%
    # collapse infrequent levels
    step_other(all_nominal_predictors(), other='Misc') %>%
    step_dummy(all_nominal_predictors(), one_hot=TRUE) %>%
    step_integer(Status, zero_based=TRUE)

rec_xg_prep <- rec_xg %>%
    prep()
```

???

- Recipe for xgboost
- Outcome is now a zero-based integer
- No imputation needed

---
class: middle

```r
x_train <- rec_xg_prep %>%
    bake(all_predictors(), new_data=NULL, composition='dgCMatrix')
y_train <- rec_xg_prep %>%
    bake(all_outcomes(), new_data=NULL, composition='matrix')

x_test <- rec_xg_prep %>%
    bake(all_predictors(), new_data=test, composition='dgCMatrix')
y_test <- rec_xg_prep %>%
    bake(all_outcomes(), new_data=test, composition='matrix')
```

???

- Need to build matrices
- Can use sparse matrices this time

---
class: middle

```r
library(xgboost)

xg_train <- xgb.DMatrix(data=x_train, label=y_train)
xg_test <- xgb.DMatrix(data=x_test, label=y_test)
```

???

- Data need to be in a special xgboost format
- xgb.DMatrix

---
class: middle

```r
mod_xg_1 <- xgb.train(
    data=xg_train,
    objective='binary:logistic',
    eval_metric='auc',
    nrounds=60,
    subsample=0.5,
    colsample_bytree=0.8,
    watchlist=list(train=xg_train, validate=xg_test),
    print_every_n=20,
*   tree_method='gpu_hist',
*   gpu_id=0
)
```

???

- Train as normal
- tree_method and gpu_id dictate the GPU
- Trees are sequential
- But the search for splits within trees is parallel
- Speeding up an already fast algorithm
- Outcome steps and GPUs don't play nicely

---
class: middle
exclude: TRUE

```r
library(ggplot2)

mod_xg_1$evaluation_log |>
    tidyr::pivot_longer(cols=c(train_auc, validate_auc),
                        names_to='Set', values_to='AUC',
                        names_pattern='(.*)_auc') %>%
    ggplot(aes(x=iter, y=AUC, color=Set)) +
    geom_line() +
    geom_vline(xintercept=mod_xg_1$evaluation_log[validate_auc == max(validate_auc), ]$iter, linetype=2)
```

---
class: middle, center

# So Many Parameters

???

- Need to tune

---
class: middle, center

# [`{tidymodels}`](https://cran.r-project.org/web/packages/tidymodels/index.html)

???

- Use tidymodels tools

---
class: middle

```r
cv_split <- vfold_cv(data=train, v=5, repeats=2, strata='Status')
cv_split
```

```
## #  5-fold cross-validation repeated 2 times using stratification 
## # A tibble: 10 × 3
##    splits             id      id2  
##    <list>             <chr>   <chr>
##  1 <split [2850/713]> Repeat1 Fold1
##  2 <split [2850/713]> Repeat1 Fold2
##  3 <split [2850/713]> Repeat1 Fold3
##  4 <split [2851/712]> Repeat1 Fold4
##  5 <split [2851/712]> Repeat1 Fold5
##  6 <split [2850/713]> Repeat2 Fold1
##  7 <split [2850/713]> Repeat2 Fold2
##  8 <split [2850/713]> Repeat2 Fold3
##  9 <split [2851/712]> Repeat2 Fold4
## 10 <split [2851/712]> Repeat2 Fold5
```

???

- Should do three repeats on 10 folds
- Saving time here
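---
class: middle

Each row of `cv_split` holds one resample. A small sketch of peeking inside with `{rsample}` (not from the original deck; object names are made up):

```r
first_split <- cv_split$splits[[1]]

# the data each fold trains on and is assessed on
first_split |> analysis() |> nrow()
first_split |> assessment() |> nrow()
```

???

- analysis() is the modeling portion of a fold
- assessment() is the held-out portion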
---
class: middle

```r
library(parsnip)
library(tune)

scaler <- train |> count(Status) |> pull(n) |> purrr::reduce(`/`)

xg_spec <- boost_tree(
    mode='classification',
    trees=tune(),
    tree_depth=tune(),
    sample_size=tune()
) |>
    set_engine(
        'xgboost',
*       params=list(tree_method = 'gpu_hist'),
        scale_pos_weight=!!scaler
    )
```

???

- params is specific to xgboost
- Setting gpu_hist here
- scaler handles class imbalance

---
class: middle

```r
rec_xg_flow <- recipe(Status ~ ., data=train) |>
    # replace missing characters with unknown
    step_unknown(all_nominal_predictors(), new_level='missing') |>
    # handle levels that only appear in new data
    step_novel(all_nominal_predictors(), new_level='Unseen') |>
    # collapse infrequent levels
    step_other(all_nominal_predictors(), other='Misc') |>
    step_dummy(all_nominal_predictors(), one_hot=TRUE)
```

???

- tidymodels expects the outcome to be a factor
- It already is, so we leave it be

---
class: middle

```r
library(workflows)

xg_flow <- workflow(preprocessor=rec_xg_flow, spec=xg_spec)
```

???

- Combine recipe and model

---
class: middle

```r
library(dials)

xg_params <- xg_flow %>%
    parameters() %>%
    update(
        trees=trees(range=c(10, 300)),
        tree_depth=tree_depth(range=c(2, 8)),
        sample_size=sample_prop(range=c(0.3, 1.0))
    )
```

???

- Update parameters for a smaller search

---
class: middle

```r
library(doFuture)
registerDoFuture()
cl <- parallel::makeCluster(6)
plan(cluster, workers = cl)

library(yardstick)

xg_tune <- tune_grid(
    xg_flow,
    resamples=cv_split,
    param_info=xg_params,
    metrics=metric_set(roc_auc, accuracy),
    grid=30,
    control=control_grid(allow_par=TRUE, save_pred=TRUE)
)
```

???

- Two levels of parallelization
- CPU
- GPU
- Iterate over parameters on the CPU
- Fit the model on the GPU
- Set earlier with parsnip
- Texted Max Kuhn
- Said someone asked if it was possible
- I'm the one who asked

---
class: middle
exclude: TRUE

```r
xg_tune |> select_best()
```

???

- See which combination is best

---
class: section
exclude: true

# `{catboost}`

???

- Like xgboost
- But different

---
class: middle
exclude: true

```r
devtools::install_url(
    'https://github.com/catboost/catboost/releases/download/v0.26.1/catboost-R-Linux-0.26.1.tgz',
    args = c("--no-multiarch")
)
```

???

- Once again, can't install from CRAN
- Install from a GitHub release
- GPU functionality built in

---
class: middle
exclude: true

```r
rec_cat <- recipe(Status ~ ., data=train) |>
    themis::step_downsample(Status) |>
    # replace missing characters with unknown
    step_unknown(all_nominal_predictors(), new_level='missing') |>
    # handle levels that only appear in new data
    step_novel(all_nominal_predictors(), new_level='Unseen') |>
    # collapse infrequent levels
    step_other(all_nominal_predictors(), other='Misc') |>
    step_string2factor(all_nominal_predictors()) |>
    step_integer(Status, zero_based=TRUE)
```

???

- Need the outcome as integers
- Categorical inputs as factors

---
class: middle
exclude: true

```r
rec_cat_prepped <- rec_cat |>
    prep()

x_cat_train <- rec_cat_prepped |>
    bake(new_data=NULL, all_predictors(), composition='data.frame')
y_cat_train <- rec_cat_prepped |>
    bake(new_data=NULL, Status, composition='matrix')

x_cat_test <- rec_cat_prepped |>
    bake(new_data=test, all_predictors(), composition='data.frame')
y_cat_test <- rec_cat_prepped |>
    bake(new_data=test, Status, composition='matrix')
```

???

- Input is a data.frame
- Categoricals as factors
- Outcome is an integer vector or matrix
- treesnip package: not official
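---
class: middle

Back to the tuned `{xgboost}` workflow: one way to lock in the winning parameters and refit on the full training set (a sketch with made-up object names, not from the original talk):

```r
best_params <- xg_tune |> select_best(metric='roc_auc')

xg_final <- xg_flow |>
    finalize_workflow(best_params) |>
    fit(data=train)
```

???

- finalize_workflow() plugs the chosen values into the workflow
- The final fit still uses the GPU via the engine settings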
---
class: middle
exclude: true

```r
library(catboost)

train_pool <- catboost.load_pool(
    data=x_cat_train,
    label=y_cat_train
)

test_pool <- catboost.load_pool(
    data=x_cat_test,
    label=y_cat_test
)
```

???

- Special catboost object: catboost pool
- Like xgb.DMatrix
- Authors were inspired by xgboost

---
class: middle
exclude: true

```r
cat_mod <- catboost.train(
    learn_pool=train_pool,
    test_pool=test_pool,
    params = list(
        loss_function = 'Logloss',
        iterations = 100,
        metric_period=10,
*       task_type='GPU',
        devices='0'
    )
)
```

???

- Just set task_type='GPU'
- The hard part is setting up the data
- Now that we've done deep learning and boosted trees...

---
class: section

# Stan

???

- Bayes time
- Once again turn to OpenCL

---
class: middle, center

# OpenCL

???

- "Run anywhere" code
- Including the GPU
- All the work done for us

---
class: middle

```sh
sudo apt install nvidia-cuda-toolkit clinfo
```

```bash
# check device index
clinfo -l
```

```
## Platform #0: NVIDIA CUDA
##  `-- Device #0: NVIDIA GeForce GTX 1080 Ti
```

???

- Install the Nvidia toolkit and clinfo
- Check the platform and ID of the GPU
- Instead of rstan...

---
class: middle

```r
# we recommend running this in a fresh R session or restarting your current session
install.packages(
    'cmdstanr',
    repos = c('https://mc-stan.org/r-packages/', getOption('repos'))
)

# install Stan
library(cmdstanr)
install_cmdstan(cores=4)
```

???

- cmdstanr is not on CRAN yet
- After installing the package, install Stan

---
class: middle

```stan
data{
    int<lower=1> N;
    int<lower=1> n_col;
    array[N] int<lower=0,upper=1> y;
    matrix[N,n_col] X;
}
parameters{
    real alpha;
    vector[n_col] beta;
}
model{
    alpha ~ normal(0, 10);
    beta ~ normal(0, 10);

    y ~ bernoulli_logit_glm(X, alpha, beta);
}
```

???

- Define the Stan program
- Simple logistic regression
- bernoulli_logit_glm()

---
class: middle

```r
library(cmdstanr)

stan_mod <- cmdstan_model(
    'credit_mod.stan',
*   cpp_options = list(stan_opencl = TRUE)
)
```

???

- Set the compiler to use OpenCL

---
class: middle

```r
rec_stan <- train |>
    recipe(Status ~ .) |>
    step_impute_knn(all_predictors()) |>
    step_normalize(all_numeric_predictors()) |>
    step_integer(Status, zero_based=TRUE) |>
    step_dummy(all_nominal_predictors(), one_hot=TRUE) |>
    step_intercept() |>
    # why bother carrying the data in this object
    prep(retain=FALSE)

x_stan <- rec_stan |>
    bake(new_data=train, all_predictors(), composition='matrix')
y_stan <- rec_stan |>
    bake(new_data=train, all_outcomes(), composition='matrix')

stan_data <- list(N=nrow(x_stan), n_col=ncol(x_stan), y=as.numeric(y_stan), X=x_stan)
```

???

- Need a 0/1 outcome
- Put the data into a list for Stan
- When we do sampling...

---
class: middle

```r
stan_fit <- stan_mod$sample(
    data=stan_data,
    chains=4, parallel_chains=4,
*   opencl_ids = c(0, 0),
    iter_sampling=1000, iter_warmup=1000,
    refresh=1
)
```

???

- opencl_ids:
- Platform 0: CUDA
- Device 0: My GPU
- Rest is the same
- Chains still sequential
- GPU benefits matrix multiplication, matrix decomposition
- Uses a lot of GPU memory
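---
class: middle

Once sampling finishes, the fit behaves like any other `{cmdstanr}` fit; the GPU only changes how fast you get here. A brief sketch:

```r
# posterior summaries for the intercept and coefficients
stan_fit$summary(variables=c('alpha', 'beta'))

# how long each chain took
stan_fit$time()
```

???

- Same diagnostics and summaries as a CPU-only fit
- Compare $time() with and without OpenCL to see the payoff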
---
class: part

# Wrapping Up

???

- Lessons Learned

---
class: section

# CUDA

???

- CUDA is the lynchpin

---
class: middle

- CUDA is hard to work with
- Use Docker if possible
- Or Cloud VMs
- Not a lot left on CRAN

???

- CUDA is hard to work with
- Use Docker if possible
- Or Cloud VMs
- Not a lot left on CRAN

---
class: section

# Programming

???

- In terms of custom programming

---
class: middle

- Stick to matrix algebra
- Just use `{torch}`

???

- Stick to matrix algebra
- Just use torch

---
class: section

# Machine Learning

???

- When it comes to fitting models

---
class: middle

- Use prebuilt binaries
- Use Docker
- Fairly easy to switch code to GPU
- Speed gains may vary

???

- Use prebuilt binaries
- Use Docker
- Fairly easy to switch code to GPU
- Speed gains may vary
- Is it worth CUDA headaches?

---
class: part

# Thank You

???

- Thank you