Applying R at Work

10 Years Ago

Statistician vs Data Scientist

Now

Data Scientist vs Machine Learning Engineer vs AI Specialist

Also 10 Years Ago

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{SAS} \\ & && \text{Stata} \\ & && \text{Matlab} \\ & && \text{SPSS} \end{alignat*} \]

Also Now

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{Python} \\ & && \text{Julia} \\ & && \text{SAS} \end{alignat*} \]

More 10 Years Ago

More Now

Stick with Now

Who is using R?

Banking
Insurance
Retail
Pharmaceutical
Sports
Law
Manufacturing
Genetics
Tech
Public Safety
Telecom

Disaster Response
Investments
Publishing
Food
Mining
Construction
Marketing
Human Resources
Defense
Politics
Healthcare

How are they using R?

Industry Use Cases

Drug Efficacy
Drug Discovery
Prescription Forecasting
Clinical Trials

But SAS is mandated by the FDA

MYTH

FDA does not require use of any specific software for statistical analyses, and statistical software is not explicitly discussed in Title 21 of the Code of Federal Regulations [e.g., in 21CFR part 11]. However, the software package(s) used for statistical analyses should be fully documented in the submission, including version and build identification.

As noted in the FDA guidance, E9 Statistical Principles for Clinical Trials (available at http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/default.htm), “The computer software used for data management and statistical analysis should be reliable, and documentation of appropriate software testing procedures should be available.” Sponsors are encouraged to consult with FDA review teams and especially with FDA statisticians regarding the choice and suitability of statistical software packages at an early stage in the product development process.

May 6, 2015

From: https://www.fda.gov/media/109552/download
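The statement above asks that the software used be documented, including version and build identification. A minimal sketch of how that could be captured from within R, using only base functions (the name env_record and the choice of fields are illustrative assumptions, not an FDA-prescribed format):

```r
# Record the software environment alongside the analysis results.
env_record <- list(
  r_version     = R.version.string,                      # R version and build date
  platform      = R.version$platform,                    # build platform
  stats_version = as.character(packageVersion("stats")), # version of one package
  run_date      = format(Sys.Date())
)
str(env_record)

# sessionInfo() reports the same kind of detail for every attached package.
```

Saving this record next to the outputs of an analysis is one simple way to make the software trail traceable.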

Reproducibility and Traceability are Essential

library(gsDesign)
library(magrittr)  # provides the %>% pipe used below

n <- 100         # number of events
hr <- .7         # hazard ratio (<1 means beneficial treatment)
alpha <- .025    # one-sided test level
r <- 1           # randomization ratio
beta <- 0.1      # Type II error rate (power = 1 - beta)

Schoenfeld <- gsDesign(
  k=2,           # two analyses: one interim, one final
  # number of events required to achieve desired power
  n.fix = nEvents(hr=hr, alpha=alpha, beta=beta, ratio=r), 
  delta1 = log(hr))

Schoenfeld %>% 
    gsBoundSummary(deltaname="HR", logdelta=TRUE) %>% 
    knitr::kable(row.names=FALSE)

Analysis	Value	Efficacy	Futility
IA 1: 50%	Z	2.7500	0.4122
N: 173	p (1-sided)	0.0030	0.3401
	~HR at bound	0.6577	0.9391
	P(Cross) if HR=1	0.0030	0.6599
	P(Cross) if HR=0.7	0.3412	0.0269
Final	Z	1.9811	1.9811
N: 345	p (1-sided)	0.0238	0.0238
	~HR at bound	0.8078	0.8078
	P(Cross) if HR=1	0.0239	0.9761
	P(Cross) if HR=0.7	0.9000	0.1000
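The fixed-design event count passed to gsDesign() can be checked by hand. A minimal sketch in base R, assuming nEvents() uses the Schoenfeld approximation (suggested by the object name in the code above, not stated on the slide); the variable names z_alpha, z_beta and d_fixed are introduced here for illustration:

```r
hr    <- 0.7     # hazard ratio
alpha <- 0.025   # one-sided test level
beta  <- 0.1     # Type II error rate
r     <- 1       # randomization ratio

z_alpha <- qnorm(1 - alpha)
z_beta  <- qnorm(1 - beta)

# Schoenfeld approximation: events needed for a fixed (single-look) design
d_fixed <- (z_alpha + z_beta)^2 * (1 + r)^2 / (r * log(hr)^2)
d_fixed   # roughly 330 events
```

The final-analysis N of 345 in the table is larger than this, presumably because a design with an interim look needs somewhat more events than a single-look design to keep the same power.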

Targeting offers
Software efficiency
Valuations
Fraud detection

Fraud detection

Supervised Learning (Regression or Tree-based)
Time Series Models
Structural Estimation

{glmnet}
{xgboost}
{fable}
{prophet}
{anomalize}
{rstan}

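library(dplyr)      # provides the %>% pipe
library(anomalize)
# fin is assumed to be a time-indexed tibble with a numeric Value column
fin %>% 
    time_decompose(Value, method="stl", frequency="auto", trend="auto") %>%  # split into season, trend, remainder
    anomalize(remainder, method="iqr", alpha=0.10, max_anoms=0.2) %>%        # flag outlying remainders
    time_recompose() %>% plot_anomalies(time_recomposed=TRUE)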
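The idea behind that pipeline (decompose the series, then flag remainders that fall outside an interquartile-range fence) can be sketched in base R. This is a simplified stand-in, not the {anomalize} implementation: the series is simulated, the two spikes are injected by hand, and the conventional 1.5 x IQR fence is an assumption.

```r
set.seed(42)
n <- 120
idx <- seq_len(n)

# Simulated monthly series: trend + yearly seasonality + noise
x <- 10 + 0.05 * idx + 2 * sin(2 * pi * idx / 12) + rnorm(n, sd = 0.3)

# Inject two obvious anomalies (hypothetical positions)
spike_at <- c(30, 85)
x[spike_at] <- x[spike_at] + 8

# Decompose with STL, then keep what trend and season do not explain
fit <- stl(ts(x, frequency = 12), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Flag remainders outside the IQR fence
q <- quantile(remainder, c(0.25, 0.75))
fence <- 1.5 * diff(q)
is_anomaly <- remainder < q[1] - fence | remainder > q[2] + fence

which(is_anomaly)   # includes the injected spikes
```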

Predictive maintenance
Defect detection
Temperature adjustments
Forecasting electricity usage
Production optimization

Production Optimization

Convex optimization
Simulated annealing

{ompr}
{optimization}
{CVXR}

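library(CVXR)
total_volume <- 2400                           # capacity available
revenue <- c(2300, 1700, 1800, 2200, 1900)     # revenue per unit of each product
volume <- c(500, 400, 400, 700, 500)           # volume per unit of each product

amount <- Variable(5, integer=TRUE)            # units of each product to make
constr <- list(
    sum_entries(volume*amount) <= total_volume # stay within capacity
    , amount >= 0                              # no negative production
)
prob <- Problem(objective=Maximize(sum_entries(revenue*amount)), constraints=constr)
sol <- solve(prob)
sol$value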

[1] 11000

t(round(sol$getValue(amount)))

     [,1] [,2] [,3] [,4] [,5]
[1,]    4    0    1    0    0
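Because this problem is tiny, the solver's answer can be cross-checked by enumerating every feasible plan. A sketch in base R, reusing the numbers from the slide; the helper names max_units, grid, best_value and best_plans are introduced here for illustration:

```r
total_volume <- 2400
revenue <- c(2300, 1700, 1800, 2200, 1900)
volume  <- c(500, 400, 400, 700, 500)

# No product can be made more than floor(total_volume / volume) times
max_units <- floor(total_volume / volume)

# Every combination of unit counts, one row per candidate plan
grid <- as.matrix(expand.grid(lapply(max_units, function(m) 0:m)))

used     <- as.vector(grid %*% volume)    # volume consumed by each plan
earned   <- as.vector(grid %*% revenue)   # revenue earned by each plan
feasible <- used <= total_volume

best_value <- max(earned[feasible])
best_plans <- grid[feasible & earned == best_value, , drop = FALSE]

best_value   # 11000, matching sol$value
best_plans
```

Enumeration only works at this scale (a few thousand plans); for realistic problem sizes a solver such as the one behind {CVXR} is the practical route.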

Actuarial sciences
Automating reports

Automating Reports

Connect to enterprise data warehouses
Generate contingency tables
Design workflows
Write reports
Publish and distribute work

{DBI}
{MortalityTables}/{lifecontingencies}
{rmarkdown}/{shiny}
{drake}
RStudio Connect

library(drake)
# con (a DBI connection), table_pred() and actuary_plot() are defined elsewhere
plan <- drake_plan(
    data=DBI::dbReadTable(ignore(con), 'accidents')  # read data from DB
    , table=table_pred(data)      # build actuarial table
    , plot=actuary_plot(table)    # generate plot of table
    , report=rmarkdown::render(   # render report with table and plot
      knitr_in('report.Rmd'), 
      output_file=file_out("report.html")
    )
)
make(plan)    # run whole process

AB testing
Targeted advertising
Causal analysis
Funnel analysis

Funnel analysis

Create a cohort of customers
Follow them as they progress through the buying process
See how many make it from one stage to the next

{dbplyr}
{funneljoin}

funneljoin::landed

user_id	timestamp
1	2018-07-01
2	2018-07-01
3	2018-07-02
4	2018-07-01
4	2018-07-04
5	2018-07-10
5	2018-07-12
6	2018-07-07
6	2018-07-08

funneljoin::registered

user_id	timestamp
1	2018-07-02
3	2018-07-02
4	2018-06-10
4	2018-07-02
5	2018-07-11
6	2018-07-10
6	2018-07-11
7	2018-07-07

library(funneljoin)
after_inner_join(landed, registered, 
                 by_user="user_id", by_time="timestamp",
                 type="first-firstafter", suffix=c("_landed", "_registered"))

user_id	timestamp_landed	timestamp_registered
1	2018-07-01	2018-07-02
4	2018-07-01	2018-07-02
3	2018-07-02	2018-07-02
6	2018-07-07	2018-07-10
5	2018-07-10	2018-07-11
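To make the "first-firstafter" logic concrete, here is a base-R sketch that reproduces the same five rows from the two tables shown above (in a different row order). It is an illustration of the idea, not how {funneljoin} is implemented; the intermediate names first_landed, pairs and funnel are introduced here.

```r
landed <- data.frame(
  user_id = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
  timestamp = as.Date(c("2018-07-01", "2018-07-01", "2018-07-02", "2018-07-01",
                        "2018-07-04", "2018-07-10", "2018-07-12", "2018-07-07",
                        "2018-07-08"))
)
registered <- data.frame(
  user_id = c(1, 3, 4, 4, 5, 6, 6, 7),
  timestamp = as.Date(c("2018-07-02", "2018-07-02", "2018-06-10", "2018-07-02",
                        "2018-07-11", "2018-07-10", "2018-07-11", "2018-07-07"))
)

# "first": keep each user's earliest landing
landed <- landed[order(landed$user_id, landed$timestamp), ]
first_landed <- landed[!duplicated(landed$user_id), ]

# Pair that landing with all of the user's registrations
pairs <- merge(first_landed, registered, by = "user_id",
               suffixes = c("_landed", "_registered"))

# "firstafter": keep the earliest registration on or after the landing
pairs <- pairs[pairs$timestamp_registered >= pairs$timestamp_landed, ]
pairs <- pairs[order(pairs$user_id, pairs$timestamp_registered), ]
funnel <- pairs[!duplicated(pairs$user_id), ]
funnel
```

User 2 drops out for never registering, user 7 for never landing, and user 4's registration on 2018-06-10 is ignored because it precedes the landing.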

Typical Stages of R at Work

Enthusiastic about R
Uses R on Windows desktop
Excel replacement
Data munging and models

A few enthusiastic users
Slack channel for R help
Slack channel for R memes
Collecting hex stickers
Really excited

Got buy-in from higher-ups
Bring in outside instructor
Speeds up adoption by analysts
Gets early adopters really excited and accelerates capabilities

Centralized, controlled server for everyone
Using git
Start with RStudio Server Open
Graduate to RStudio Server Pro and Connect
Use project package libraries

But you can’t put R in production

MYTH

Use a build tool like make or {drake}
Expose functionality as an API with {plumber}
Capture package versions with {renv}
Bundle steps into Docker containers
Incorporate into a pipeline like MLflow or Kubeflow
Utilize CI/CD to automate everything
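As one illustration of the {plumber} step in the list above, a plumber file is ordinary R code whose special comments declare endpoints. The sketch below is hypothetical throughout: the file name, the /score path, the score_customer() function and its coefficients are all made up for the example.

```r
# plumber.R  (hypothetical file name and model)

# A stand-in scoring function; in practice this would wrap a fitted model.
score_customer <- function(spend, visits) {
  linear <- -2 + 0.01 * spend + 0.3 * visits   # made-up coefficients
  1 / (1 + exp(-linear))                       # logistic transform to (0, 1)
}

#* Score a customer
#* @param spend Total spend
#* @param visits Number of visits
#* @get /score
function(spend, visits) {
  # Query parameters arrive as strings, so convert before scoring
  list(score = score_customer(as.numeric(spend), as.numeric(visits)))
}

# To serve it (requires the {plumber} package):
# plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```

Keeping the model logic in a plain function and the endpoint as a thin wrapper means the same code can be unit tested in CI without starting a server.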

Never Been Easier to Use R at Work

10 Years Ago

Using R for Work was Harder…

…and Easier

Institutional roadblocks
Ingrained culture of Excel (or SAS)
Resistance to open source in general
Many tools not yet invented
Could do everything on a single computer

Now

Much Easier

Companies open to using R
Open source is everywhere
Programming is the new Excel
Need bigger compute
Great ecosystem of packages to help

What is Your Use Case?

First Project?

End Game?

Don’t be too tempted to just sprinkle AI dust

Plan and Build

Use R Anywhere and Everywhere

Thank You