class: center, middle, inverse, title-slide # Taking R Code from Hours to Seconds ### Jared P. Lander ---
<style type="text/css"> .largest { font-size: 200% } .large { font-size: 130% } .small { font-size: 70% } .smallest { font-size: 50% } .smallCode .remark-code { /* font-size: 50%; */ font-size: 12px; } .remark-slide-number { font-size: 10pt; margin-bottom: -11.6px; margin-right: 10px; color: #FFFFFF; /* white */ opacity: 0; /* default: 0.5 */ } .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .indent1 { display:block; text-indent: 5%; } .indent2 { display:block; text-indent: 10%; } .indent3 { display:block; text-indent: 15%; } </style> .center2[.largest[The Ask]] ??? - What are we trying to solve? --- class: middle, center .large[ Out of Millions of Points, Which are Similar to Each Other? ] ??? - For millions of points - Which points are near each other? --- class: middle, center .large[ Which Points are Within a Certain Distance of Each Other? ] ??? - Essentially - Range searching problem --- class: center, middle .largest[The Problem] ??? - What's the big deal? --- class: middle, center .large[This is Slow] ??? - So slow --- class: middle, center .large[How Slow?] ??? - Quantify this - How bad is it? - In terms of calculations --- class: center, middle `\(n(n-1)/2\)` Calculations ??? - Quadratic Relationship -- For 500,000 rows this is 124,999,750,000 ??? - 125 billion calculations - 125 billion slots in memory -- For 1,000,000 rows this is 499,999,500,000 ??? - A million rows: 500 billion calculations - Double rows, quadruple calculations --- class: middle, center <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/calculations-plot-1.png" width="90%" style="display: block; margin: auto;" /> ??? - Calculations are `\(O(n^{2})\)` - A million rows isn't that much data --- class: center, middle .largest[The Data] ??? - Our particular data --- class: center, middle
| Var1 | Var2 |
|-----------:|---------:|
| -1.829996 | 51.04901 |
| -1.038547 | 50.87115 |
| 31.130743 | 33.69047 |
| -0.733131 | 52.09273 |
| -4.916367 | 38.70388 |
| -75.365101 | 40.40710 |
| -4.015215 | 39.99676 |
| 2.951305 | 41.62434 |
| -6.039258 | 41.07505 |
| 13.596565 | 42.80572 |
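As a quick check of the pairwise-comparison counts quoted a few slides back, `\(n(n-1)/2\)` can be computed directly with `choose()` (a small sketch, not part of the original analysis):

```r
# n*(n-1)/2 pairwise calculations for n rows
choose(5e5, 2)   # 124,999,750,000 for 500,000 rows
choose(1e6, 2)   # 499,999,500,000 for 1,000,000 rows
```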
??? - Just a sampling of about 500 thousand points - Simulated data - Actual data had millions - Each row has two dimensions - Want to calculate how similar each row is to every other - Then find those that are most similar (by some definition) --- class: center, middle <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/points-plot-1.png" width="90%" style="display: block; margin: auto;" /> ??? - Like a Rorschach test - More density in certain areas --- class: center, middle .large[First Attempt] ??? - Start with obvious function --- class: middle ```r dist(all_points) ``` ??? - Built into Base R - Returns distance between every pair of rows --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/ComputerFire.jpg") background-size: cover class: hide_logo ??? - Did not go well - Who knows when this will finish --- class: center, middle <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/unnamed-chunk-1-1.png" width="90%" style="display: block; margin: auto;" /> ??? - Look at data again - Noticed something about the data --- class: middle ```r all_points %>% summarize(range(Var1), range(Var2)) ``` ``` ## # A tibble: 2 x 2 ## `range(Var1)` `range(Var2)` ## <dbl> <dbl> ## 1 -102. 4.40 ## 2 107. 68.1 ``` ??? - Look at ranges - First: -100 to 100 - Second: 5 to 68 --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/globe-lat-long.jpg") background-size: contain class: hide_logo ??? - Looks like lat/long --- class: center, middle .large[Let's use `{sf}`] ??? - Figure `{sf}` has functions for this --- class: middle ```r library(sf) sf_points <- all_points %>% # Var1: Longitude, Var2: latitude st_as_sf(coords=c('Var1', 'Var2')) %>% # Use lat/long projection st_set_crs(4326) sf_points ``` ``` ## Simple feature collection with 500000 features and 0 fields ## geometry type: POINT ## dimension: XY ## bbox: xmin: -102.1256 ymin: 4.397471 xmax: 106.855 ymax: 68.06752 ## CRS: EPSG:4326 ## # A tibble: 500,000 x 1 ## geometry ## * <POINT [°]> ## 1 (-1.829996 51.04901) ## 2 (-1.038547 50.87115) ## 3 (31.13074 33.69047) ## 4 (-0.733131 52.09273) ## 5 (-4.916367 38.70388) ## 6 (-75.3651 40.4071) ## 7 (-4.015215 39.99676) ## 8 (2.951305 41.62434) ## 9 (-6.039258 41.07505) ## 10 (13.59657 42.80572) ## # … with 499,990 more rows ``` ??? - Convert with `st_as_sf()` - Specify lat/long columns - Like points on map - Data weren't points on map, but no reason we couldn't use those methods --- class: middle ```r countries <- rnaturalearth::ne_countries(returnclass='sf') ggplot() + geom_sf(data=countries) + geom_sf(data=sf_points) ``` <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/sf-plot-1.png" width="85%" style="display: block; margin: auto;" /> ??? - Weird shape, not point - Simulated data - Lets us use `{sf}` functions --- class: middle ```r st_distance(sf_points) ``` ??? - needs `{lwgeom}` - Measures distance between every point and every other point --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/ComputerFire.jpg") background-size: cover class: hide_logo ??? - Dumpster fire - Or computer fire - Doesn't finish --- class: middle ```r st_is_within_distance(sf_points %>% st_transform(3857), dist=11000) ``` ??? - Exactly what we want! - Need to convert our numbers to meters - .1 degree roughly 11,000 meters --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/ComputerFire.jpg") background-size: cover class: hide_logo ???
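As a rough sanity check on the 11,000 m figure (a sketch only; it assumes the usual ~111 km per degree of latitude, which is not part of the original code):

```r
# 0.1 degrees of latitude expressed in meters, at roughly 111,320 m per degree
0.1 * 111320   # about 11,132 m, which the slides round to 11,000
```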
- But doesn't finish - Need a new idea --- class: center, middle <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/unnamed-chunk-2-1.png" width="90%" style="display: block; margin: auto;" /> ??? - No point comparing points that are far apart --- class: center, middle .large[Split the Data] ??? - Break into grid --- class: middle ```r sf_grid <- sf_points %>% # project to meters st_transform(3857) %>% # build the grid # use hexagons since they fit better st_make_grid(cellsize=1500000, square=FALSE) %>% # back to original scale st_transform(4326) ``` ??? - Convert to meters for calculations - Use hexagons for better spacing - Buckets should be small enough so we don't compare too many points - But not so small it is hard to create grid - 1.5 million meters --- class: middle ```r ggplot() + geom_sf(data=countries) + geom_sf(data=sf_grid, color='blue') + geom_sf(data=sf_points) ``` <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/sf-grid-plot-1.png" width="85%" style="display: block; margin: auto;" /> ??? - Only compare points within a grid cell - Points in different cells don't need to be compared - Let's see how this affects computation time --- class: middle, center .large[ `\(g*\frac{n}{g}(\frac{n}{g}-1)/2\)` Calculations ] ??? - Assuming the points are evenly distributed - Perform fewer calculations `\(g\)` times -- For 500,000 rows and 600 groups this is 208,083,333 calculations ??? - For 500,000 rows and 600 groups, 208 million calculations -- 124,791,666,667 fewer calculations ??? - 125 billion fewer - Assuming points evenly distributed - They're not --- class: middle, center <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/calculations-plot-groups-1.png" width="90%" style="display: block; margin: auto;" /> ??? - The difference is stark - Millions versus billions --- class: center, middle .large[What About Points near the Edges of Hexagons?] ??? - Points near edges could be near each other --- class: middle ```r sf_buffer <- sf_grid %>% st_transform(3857) %>% st_buffer(11000*2) %>% st_transform(4326) ``` ??? - Make each grid cell overlap - Points can pair to neighboring cell - Some points counted twice, deal with later --- class: middle ```r ggplot() + geom_sf(data=countries) + geom_sf(data=sf_buffer, color='blue') + geom_sf(data=sf_points) ``` <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/sf-buffer-plot-1.png" width="85%" style="display: block; margin: auto;" /> ??? - Can't see difference in plot since buffer is so small --- class: middle ```r sf_points_grid <- sf_points %>% st_join( sf_buffer %>% st_as_sf() %>% tibble::rowid_to_column('Cell'), join=st_intersects, left=FALSE) sf_points_grid ``` ``` ## Simple feature collection with 540525 features and 1 field ## geometry type: POINT ## dimension: XY ## bbox: xmin: -102.1256 ymin: 4.397471 xmax: 106.855 ymax: 68.06752 ## CRS: EPSG:4326 ## # A tibble: 540,525 x 2 ## geometry Cell ## * <POINT [°]> <int> ## 1 (-1.829996 51.04901) 27 ## 2 (-1.038547 50.87115) 27 ## 3 (31.13074 33.69047) 37 ## 4 (-0.733131 52.09273) 27 ## 5 (-4.916367 38.70388) 26 ## 6 (-75.3651 40.4071) 9 ## 7 (-4.015215 39.99676) 26 ## 8 (2.951305 41.62434) 28 ## 9 (-6.039258 41.07505) 25 ## 10 (13.59657 42.80572) 33 ## # … with 540,515 more rows ``` ??? - Which points belong to which grid cell? 
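Because the buffered cells overlap, some points land in more than one cell; a quick way to see how much double-counting that introduced (a sketch, assuming the `sf_points` and `sf_points_grid` objects built above):

```r
# extra point-to-cell assignments created by the overlapping buffer
nrow(sf_points_grid) - nrow(sf_points)   # 540,525 - 500,000 = 40,525
```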
--- ```r ggplot() + geom_sf(data=countries) + geom_sf(data=sf_buffer, color='blue') + geom_sf(data=sf_points_grid, aes(color=factor(Cell))) + theme(legend.position='none') ``` <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/sf-buffer-plot-color-1.png" width="85%" style="display: block; margin: auto;" /> ??? - Search within grid cells independently --- class: center, middle .large[ `st_make_grid()` and `st_join()` Can be Slow ] ??? - But `st_make_grid()` and `st_join()` are Slow --- class: middle, center .large[ `hexbin::hexbin()` ] ??? - `hexbin()` is much faster than `st_make_grid()` --- class: middle, center .large[ Does Not Deal with Border Issues ] ??? - Draw polygons and create buffer - Not much faster overall --- class: center, middle .large[ Back to our Grid ] ??? - False Start - Back to grid --- class: middle ```r group_counts <- sf_points_grid %>% st_drop_geometry() %>% count(Cell) %>% filter(n > 1) %>% mutate(Calculations=n*(n-1)/2) group_counts ``` ``` ## # A tibble: 56 x 3 ## Cell n Calculations ## <int> <int> <dbl> ## 1 1 7 21 ## 2 2 4 6 ## 3 3 335 55945 ## 4 4 18395 169178815 ## 5 5 3127 4887501 ## 6 6 196 19110 ## 7 7 27539 379184491 ## 8 8 35760 639370920 ## 9 9 37565 705545830 ## 10 10 4 6 ## # … with 46 more rows ``` ```r summary(group_counts$n) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2 10 201 9652 7450 100170 ``` ```r summary(group_counts$Calculations) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.000e+00 4.500e+01 2.011e+04 2.454e+08 2.858e+07 5.017e+09 ``` ??? - 56 groups - Could have made more groups, but that takes more time - 2 to 100,000 points in each - Average of 9600; Median 201 - Average calculations: 245 Million; Median 20,000 - Still 5 billion for just one group - `count()` is much faster after dropping geometry - because `count()` aggregates the geometries --- class: middle, center .large[ 13,742,167,792 vs 124,999,750,000 ] ??? - All groups combined: 14 billion - 14 billion vs 125 billion - Order of magnitude fewer --- class: middle, center .large[ `group_nest()` + `mutate()` + `map()` ] ??? - Split data into separate `data.frame`s in a `list`-column, essentially - Apply work to each individually - Easier to reason about this way --- class: middle ```r library(purrr) distance_nested <- sf_points_grid %>% st_transform(3857) %>% group_nest(Cell, keep=TRUE) %>% mutate(GroupDist=map( data, ~st_is_within_distance(.x, dist=11000)) ) ``` ??? - .1 is the metric we want, convert to meters - Computes if points fall within threshold - For each group --- class: middle, center .large[ Went from Impossible to Hours... ] ??? - Doable -- .large[ ...if Each Group was Small Enough for `st_is_within_distance()` ] ??? - Only if `st_is_within_distance()` could finish --- class: middle, center .large[ Two Issues ] ??? - Two issues -- .large[ `group_nest()` Can be Slow ] ??? - `group_nest()` slow - Especially for wider data -- .large[ `st_is_within_distance()` is Slow and Might Not Finish ] ??? - `st_is_within_distance()` is slow - and might not finish --- class: middle .large[ First: Deal with `group_nest()` Then: Deal with `st_is_within_distance()` ] ??? - One thing at a time - `group_nest()` - Then `st_is_within_distance()` <!-- --- --> <!-- class: middle, center --> <!-- .large[ --> <!-- Instead --> <!-- ] --> <!-- ??? --> <!-- - Something better --> --- class: center, middle .large[ `group_by() + group_map()` ] ??? 
- `group_by()` - Instead of `group_nest()` - `group_map()` gives a `list` of results - like `dlply()` from `{plyr}` --- class: middle ```r distance_grouped <- sf_points_grid %>% st_transform(3857) %>% group_by(Cell) %>% group_map(~st_is_within_distance(.x, dist=11000)) ``` ??? - Pass `.x` to `st_is_within_distance()` - So it will work with grouped `tibble` --- class: middle, center .large[ Went from Hours to Minutes... ] ??? - Dramatic speed up by avoiding nesting -- .large[ ...if Each Group was Small Enough for `st_is_within_distance()` ] ??? - Only if `st_is_within_distance()` could finish - Can't make smaller grid cells --- class: center, middle .large[ Better Still ] ??? - We can do even better --- class: middle, center .large[ `{data.table}` ] ??? - `{data.table}` excels at grouping --- class: middle ```r library(data.table) distance_dt <- sf_points_grid %>% as.data.table() %>% .[, .(Results=list(st_is_within_distance(.SD[['geometry']], dist=11000))), by=Cell ] ``` ??? - Pipe into square brackets - `.SD` allows grouping to work - then act on `geometry` column - Just calling on `geometry` column wouldn't respect grouping - yes `{data.table}` with pipes --- class: middle, center .large[ Now in Fewer Minutes... ] ??? - Cut the time by a third - Matters when this runs often -- .large[ ...if Each Group was Small Enough for `st_is_within_distance()` ] ??? - Only if `st_is_within_distance()` could finish --- class: middle, center .large[ Need an `st_is_within_distance()` Alternative ] ??? - `st_is_within_distance()` is not cutting it --- class: middle, center .large[`{Rfast}`] ??? - Faster functions than base R - Has a fast distance function --- class: middle, center .large[`Dist()`] ??? - Works pretty well --- class: middle, center .large[Fast...] ??? - Fast -- .large[...Until 46,342 Rows] ??? - Runs out of memory at 46,342 rows --- class: middle ```cpp static int proper_size(int nrw, int ncl){ return ncl*(ncl-1)*0.5; } ``` ??? - Had to dive into C++ - `int` is the problem --- class: middle, center .large[R Integers are 32-Bit] ??? - R integers are 32-bits --- class: middle, center .large[ `\(n(n-1)/2\)` Calculations ] ??? - Saw this before --- class: middle, center .large[ Have to Compute `\(n*(n-1)\)` ] ??? - Need to store `\(n*(n-1)\)` in memory -- .large[ `\(\text{46,342}*(\text{46,342} - 1) = \text{2,147,534,622}\)` ] ??? - 2 billion 147 million 534 thousand 622 --- class: middle, center .large[ But `\(2^{32} - 1 = \text{4,294,967,295} > \text{2,147,534,622}\)` ] ??? - 32-bit is bigger than our number -- .large[ True, But One Bit is Used for the Sign ] ??? - Need to account for positive or negative --- class: middle, center .large[ `\(2^{31} - 1 = \text{2,147,483,647} < \text{2,147,534,622}\)` ] ??? - So we fail at 46,342 rows --- class: middle, center .large[ Need Bigger Integer in C++ ] ??? - Need something bigger -- .large[ `R_xlen_t` ] ??? - Not 64-bit - Biggest `int` machine can handle --- class: middle <!-- How it Started --> ```cpp static int proper_size(int nrw, int ncl){ return ncl*(ncl-1)*0.5; } ``` <!-- How it's Going --> ```cpp static R_xlen_t proper_size(R_xlen_t nrw, R_xlen_t ncl){ return ncl*(ncl-1)*0.5; } ``` ??? - Change `int` to `R_xlen_t` --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/rfast-pull-request.png") background-size: contain class: hide_logo ???
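To double-check where the 32-bit limit bites (a quick sketch; the arithmetic is done in doubles so R itself does not overflow):

```r
# smallest n for which n*(n-1) no longer fits in a signed 32-bit integer
n <- as.numeric(1:50000)
min(which(n * (n - 1) > 2^31 - 1))   # 46342
```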
- I made a [pull request](https://github.com/RfastOfficial/Rfast/pull/32/commits/8704c4e5cdb5a72de5dc22b55147718dfb50defa#diff-be4eeb82997d524556d2d9f761b3189160026860e90336e75e60abc4c85e774f) - Not accepted yet --- class: middle, center .large[ `{tcor}` ] ??? - From Bryan Lewis and Michael Kane - Plus James Baglama and Alex Poliakov - Built on `{irlba}` - Meant for threshold correlation matrices - But can ID points within distance of each other --- class: middle ```r library(tcor) sf_points_grid %>% filter(Cell == 9) %>% st_coordinates() %>% tdist(t=.1, p=2) %>% magrittr::extract2('indices') %>% head() ``` ``` ## i j val ## [1,] 1232 13241 1.047478e-05 ## [2,] 5462 1979 6.720093e-05 ## [3,] 25128 1284 1.611042e-04 ## [4,] 17111 9107 1.790253e-04 ## [5,] 5539 19076 1.931850e-04 ## [6,] 29849 32388 2.033469e-04 ``` ??? - Returns indices for points within threshold of each other - `tdist()` works in parallel --- class: middle ```r sf_points_grid %>% bind_cols(st_coordinates(.) %>% as_tibble()) %>% st_drop_geometry() %>% as.data.table() %>% .[, .( Results=list(tdist(as.matrix(.SD), t=.1, p=2)$indices) ), by=Cell, .SDcols=c('X', 'Y')] ``` ??? - Convert to `data.table` for fast grouping - Compute threshold for each group - `tdist()` depends on sparsity - So can be inconsistently slow --- class: middle, center .large[ `{torch}` ] ??? - Matrix computations on the GPU - Pretty new - No Python, pure R --- class: middle ```r library(torch) torch9 <- sf_points_grid %>% filter(Cell == 9) %>% st_coordinates() %>% torch_tensor(device='cuda') torchDist <- nnf_pdist(torch9) ``` ??? - Unbelievably fast - 0.003 Seconds for 37,000 rows - But my GPU only has 11 GB memory - Easy to blow through it and crash - Past 60,000 rows - Not everyone has GPU - Takes time to load data onto GPU for each group - Then extract results - Then clear GPU memory --- class: middle, center .large[ Custom Code ] ??? - Need to build this from scratch - Help from Michael Beigelmacher, Kaz Sakamoto and Ben Lerner --- class: middle Sort data according to one dimension (`X`) For row i=1:N-1 .indent1[For row j=i+1:N] .indent2[Compute `X`-distance between points `i` and `j`] .indent3[If greater than threshold, proceed to next point `i`] .indent2[Compute squared distance between points `i` and `j`] .indent3[If greater than threshold-squared, proceed to next point `j`] .indent2[Compute distance and record indices] ??? - If `X`-dimension is too far, then by triangle inequality, distance must be too far - Data are sorted, so once one `X` is too far, the remaining are too far for that row - If that passes, check distance - Don't keep row pairs that are too far apart - Saves compute time by breaking the loop and skipping unneeded points --- class: middle ```r within_distance <- function(data, threshold) { keepers <- array(NA, dim=rep(nrow(data), 2)) keep_dist <- array(NA, dim=rep(nrow(data), 2)) data <- data[order(data[, 1]), ] for(i in seq(1, nrow(data) - 1, by=1)) { for(j in seq(i + 1, nrow(data), by=1)) { if(abs(data[i, 1] - data[j, 1]) > threshold) { break } sq_dist <- sum((data[i, ] - data[j, ])^2) if(sq_dist > threshold^2) { next } keepers[i, j] <- 1 keep_dist[i, j] <- sqrt(sq_dist) } } list(keepers=keepers, dist=keep_dist) } ``` ??? - Generalizes to multiple dimensions --- class: middle ```r sf_points_grid %>% filter(Cell == 9) %>% st_coordinates() %>% within_distance(threshold=.1) ``` ???
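`within_distance()` returns two n-by-n matrices (`keepers` and `dist`); one way to pull the matching pairs out of that result might be the following (a sketch; note the indices refer to rows after the function's internal sort on the first column):

```r
res <- sf_points_grid %>%
  filter(Cell == 9) %>%
  st_coordinates() %>%
  within_distance(threshold=.1)

# row/column positions (in X-sorted order) of pairs within the threshold
pairs <- which(!is.na(res$keepers), arr.ind=TRUE)
head(pairs)
```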
- 1 second for 3,000 rows (Cell 5) - 21 seconds for 18,000 rows (Cell 4) - 70 seconds for 37,000 rows if enough memory (Cell 9) --- class: middle ```r sf_points_grid %>% group_by(Cell) %>% group_map(~within_distance(st_coordinates(.x), threshold=.1)) ``` ??? - `group_map()` saves a `list` of results - The individual timings would add up - Allocating the `array` for large groups could be problematic - Inefficient memory use - Parallel with `{furrr}` might cause memory issues --- background-image: url("data:image/png;base64,#/home/jared/consulting/talks/images/must-go-faster.gif") background-size: cover class: hide_logo ??? - Need more speed --- class: middle, center `{Rcpp}` ??? - Integrate R and C++ - R-like C++ --- class: middle .smallCode[ ```cpp #include <RcppArmadillo.h> using namespace Rcpp; // [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(RcppArmadillo)]] // [[Rcpp::interfaces(cpp)]] inline double distance_squared(arma::mat& x){ return sum(pow(x.row(0) - x.row(1), 2)); } // [[Rcpp::export]] DataFrame threshold_distance(arma::mat obj, double threshold) { // precompute squared threshold so we don't compute it each time const double threshold_squared = pow(threshold, 2); // empty vectors to hold indices below threshold std::vector<int> i_keep, j_keep; std::vector<double> distances; int num_rows = obj.n_rows; for(int i = 0; i < num_rows; ++i){ for(int j = i + 1; j < num_rows; ++j){ // if the distance is too far even on one dimension, skip the rest of this row if(arma::as_scalar(abs(obj.col(0).row(i) - obj.col(0).row(j))) > threshold) break; // compute the distance arma::uvec indices; indices << i << j; arma::mat mat_ij = obj.rows(indices); double the_dist_squared = distance_squared(mat_ij); // if this distance is too big, skip ahead if(the_dist_squared > threshold_squared) continue; // push_back() doesn't hurt as much in C++ i_keep.push_back(i); j_keep.push_back(j); distances.push_back(sqrt(the_dist_squared)); } } // need to add 1 back to the indices since C++ starts at 0 while R starts at 1 transform(i_keep.begin(), i_keep.end(), i_keep.begin(), bind2nd(std::plus<int>(), 1)); transform(j_keep.begin(), j_keep.end(), j_keep.begin(), bind2nd(std::plus<int>(), 1)); DataFrame results = DataFrame::create(_["i"]=i_keep, _["j"]=j_keep, _["distance"]=distances); return results; } ``` ] ??? - R function translated to C++ - Expects a sorted matrix - Uses Armadillo for matrix computations - Can grow vectors without slowing down much - Double for loops are still fast --- class: middle ```r sf_points_grid %>% filter(Cell == 9) %>% bind_cols(st_coordinates(.) %>% as_tibble()) %>% st_drop_geometry() %>% arrange(X) %>% dplyr::select(X, Y) %>% as.matrix() %>% threshold_distance(threshold=.1) ``` ??? - 0.05 seconds for 3,000 rows (Cell 5) - 0.5 seconds for 18,000 rows (Cell 4) - 1.3 seconds for 37,000 rows if enough memory (Cell 9) - Insane speed up - Not even parallel --- class: middle, center <img src="data:image/png;base64,#HoursToSeconds_files/figure-html/threhold-func-speed-plot-1.png" width="504" style="display: block; margin: auto;" /> ??? - Things get worse and worse in pure R --- class: middle ```r sf_dt <- sf_points_grid %>% bind_cols(st_coordinates(.)
%>% as_tibble()) %>% st_drop_geometry() %>% as.data.table() # sort the data setkey(sf_dt, X) # use data.table for faster grouping similar_points <- sf_dt[, threshold_distance(as.matrix(.SD), threshold=.1), by=Cell, .SDcols=c('X', 'Y')] similar_points ``` ``` ## Cell i j distance ## 1: 3 27 29 0.06500678 ## 2: 3 41 45 0.07625256 ## 3: 3 42 44 0.09625963 ## 4: 3 50 53 0.05584888 ## 5: 3 61 63 0.08657027 ## --- ## 25002846: 48 267 274 0.08309678 ## 25002847: 48 306 307 0.01747799 ## 25002848: 49 9 17 0.08602352 ## 25002849: 49 10 14 0.06327113 ## 25002850: 49 29 30 0.05506324 ``` ??? - `.SD` respects the grouping - 30 seconds for the _WHOLE_ dataset - Millions of rows: still about 30 seconds --- class: middle, center .large[ We Went from Cannot Finish ] ??? - We Went from Cannot Finish -- .large[ to Hours, Maybe ] ??? - to Hours, Maybe - Memory permitting -- .large[ to Minutes ] ??? - to Minutes -- .large[ to Seconds ] ??? - to Seconds - Blazing fast - Matters when running this hourly --- class: middle - Splitting the Data into Smaller Pieces - Using a Smarter Algorithm - Writing Compiled Code ??? - We did this by - Splitting the Data into Smaller Pieces - Using a Smarter Algorithm - Writing Compiled Code - Made it solvable <!-- --- --> <!-- - gpu (show gpu) --> <!-- - my function time vs gpu time --> <!-- - time transferring data on and off gpu --> <!-- - if group small enough (makes grid slower) --> --- class: middle, center .largest[Thank You]