Ten Years Ago

Statistician vs Data Scientist

Now

Data Scientist vs Machine Learning Engineer

Data Scientist vs Machine Learning Engineer vs AI Specialist

Also 10 Years Ago

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{SAS} \end{alignat*} \]

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{SAS} \\ & && \text{Stata} \\ & && \text{Matlab} \\ & && \text{SPSS} \end{alignat*} \]

Also Now

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{Python} \end{alignat*} \]

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{Python} \\ & && \text{Julia} \end{alignat*} \]

\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{Python} \\ & && \text{Julia} \\ & && \text{SAS} \end{alignat*} \]

More 10 Years Ago

More Now

Stick with Now

Who is using R?

  • Banking
  • Insurance
  • Retail
  • Pharmaceutical
  • Sports
  • Law
  • Manufacturing
  • Genetics
  • Tech
  • Public Safety
  • Telecom
  • Disaster Response
  • Investments
  • Publishing
  • Food
  • Mining
  • Construction
  • Marketing
  • Human Resources
  • Defense
  • Politics
  • Healthcare

How are they using R?

Industry Use Cases

  • Drug Efficacy
  • Drug Discovery
  • Prescription Forecasting
  • Clinical Trials

But SAS is mandated by the FDA

MYTH

FDA does not require use of any specific software for statistical analyses, and statistical software is not explicitly discussed in Title 21 of the Code of Federal Regulations [e.g., in 21CFR part 11]. However, the software package(s) used for statistical analyses should be fully documented in the submission, including version and build identification.

As noted in the FDA guidance, E9 Statistical Principles for Clinical Trials (available at http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/default.htm), “The computer software used for data management and statistical analysis should be reliable, and documentation of appropriate software testing procedures should be available.” Sponsors are encouraged to consult with FDA review teams and especially with FDA statisticians regarding the choice and suitability of statistical software packages at an early stage in the product development process.

May 6, 2015

From: https://www.fda.gov/media/109552/download
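The statement above asks that the software used be documented, including version information. A minimal sketch of how that could be captured from within R follows; the choice of packages to record (`stats` and `utils`) is purely illustrative and not something the FDA statement prescribes.

```r
# Record the R version string for the analysis environment
r_version <- R.version.string

# Record versions of the packages used (illustrative list, adjust to the analysis)
pkgs <- c("stats", "utils")
pkg_versions <- vapply(
  pkgs,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

print(r_version)
print(pkg_versions)
```

In an R Markdown report, printing `sessionInfo()` at the end serves a similar purpose by listing the R version, platform, and attached packages in one place.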

Reproducibility and Traceability are Essential

library(gsDesign)
library(magrittr)   # supplies the %>% pipe

hr <- 0.7        # hazard ratio under the alternative (<1 means beneficial treatment)
alpha <- 0.025   # one-sided test level
r <- 1           # randomization ratio
beta <- 0.1      # type II error; power = 1 - beta

Schoenfeld <- gsDesign(
  k = 2,         # two analyses: one interim, one final
  # number of events required to achieve desired power
  n.fix = nEvents(hr = hr, alpha = alpha, beta = beta, r = r),
  delta1 = log(hr))

Schoenfeld %>%
    gsBoundSummary(deltaname = "HR", logdelta = TRUE) %>%
    knitr::kable(row.names = FALSE)

|Analysis  |Value              | Efficacy| Futility|
|:---------|:------------------|--------:|--------:|
|IA 1: 50% |Z                  |   2.7500|   0.4122|
|N: 173    |p (1-sided)        |   0.0030|   0.3401|
|          |~HR at bound       |   0.6577|   0.9391|
|          |P(Cross) if HR=1   |   0.0030|   0.6599|
|          |P(Cross) if HR=0.7 |   0.3412|   0.0269|
|Final     |Z                  |   1.9811|   1.9811|
|N: 345    |p (1-sided)        |   0.0238|   0.0238|
|          |~HR at bound       |   0.8078|   0.8078|
|          |P(Cross) if HR=1   |   0.0239|   0.9761|
|          |P(Cross) if HR=0.7 |   0.9000|   0.1000|
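To see where the fixed-design event count fed into `n.fix` comes from, here is a minimal base R sketch of Schoenfeld's approximation for the number of events in a log-rank comparison. It assumes a null hazard ratio of 1 and a one-sided test, and it is meant as an illustration of the formula rather than a reproduction of the internals of `nEvents()`.

```r
hr <- 0.7        # hazard ratio under the alternative
alpha <- 0.025   # one-sided test level
beta <- 0.1      # type II error; power = 1 - beta
r <- 1           # randomization ratio

z_alpha <- qnorm(1 - alpha)   # critical value for the test level
z_beta <- qnorm(1 - beta)     # quantile matching the desired power

# Schoenfeld's approximation: events = (z_alpha + z_beta)^2 * (1 + r)^2 / (r * log(hr)^2)
events <- (z_alpha + z_beta)^2 * (1 + r)^2 / (r * log(hr)^2)
events_needed <- ceiling(events)

events          # about 330.4
events_needed   # 331
```

The group sequential design shown above ends at N: 345, somewhat more than this fixed-design figure, because adding an interim analysis inflates the maximum number of events.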

  • Targeting offers
  • Software efficiency
  • Valuations
  • Fraud detection

Fraud Detection

  • Supervised Learning (Regression or Tree-based)
  • Time Series Models
  • Structural Estimation

  • {glmnet}
  • {xgboost}
  • {fable}
  • {prophet}
  • {anomalize}
  • {rstan}

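library(anomalize)
library(dplyr)   # supplies the %>% pipe

# fin is assumed to be a tibble with a date column and a numeric Value column
fin %>% 
    # split Value into seasonal, trend and remainder components
    time_decompose(Value, method="stl", frequency="auto", trend="auto") %>%
    # flag unusual remainders using the IQR method
    anomalize(remainder, method="iqr", alpha=0.10, max_anoms=0.2) %>%
    # rebuild bounds on the original scale, then plot
    time_recompose() %>% plot_anomalies(time_recomposed=TRUE)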
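The same idea (decompose the series, then look for outliers in the remainder) can be sketched in base R. The series below is synthetic and exists only for illustration, and the 3 x IQR fence is an assumption chosen for the sketch, not necessarily the rule {anomalize} applies for a given `alpha`.

```r
set.seed(42)

# Synthetic monthly series (illustrative only): trend + seasonality + noise
n <- 120
x <- 100 + 0.5 * seq_len(n) + 10 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 1)
spikes <- c(30, 75)
x[spikes] <- x[spikes] + 25   # inject two artificial anomalies
series <- ts(x, frequency = 12)

# STL decomposition into seasonal, trend and remainder components
fit <- stl(series, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# IQR rule on the remainder (the multiplier 3 is an assumption for this sketch)
q <- unname(quantile(remainder, c(0.25, 0.75)))
fence <- 3 * (q[2] - q[1])
is_anomaly <- remainder < q[1] - fence | remainder > q[2] + fence

which(is_anomaly)   # positions flagged as anomalous
```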

  • Predictive maintenance
  • Defect detection
  • Temperature adjustments
  • Forecasting electricity usage
  • Production optimization

Production Optimization

  • Convex optimization
  • Simulated annealing

  • {ompr}
  • {optimization}
  • {CVXR}

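library(CVXR)
total_volume <- 2400                          # total volume available
revenue <- c(2300, 1700, 1800, 2200, 1900)    # revenue per unit of each item
volume <- c(500, 400, 400, 700, 500)          # volume per unit of each item

amount <- Variable(5, integer=TRUE)           # whole-number quantity of each item
constr <- list(
    sum_entries(volume*amount) <= total_volume   # stay within the total volume
    , amount >= 0                                # no negative quantities
)
# maximize total revenue subject to the constraints
prob <- Problem(objective=Maximize(sum_entries(revenue*amount)), constraints=constr)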
sol <- solve(prob)
sol$value
[1] 11000
t(round(sol$getValue(amount)))
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    0    1    0    0
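Because this problem is tiny, the solver's answer can be cross-checked by brute force in base R. The sketch below enumerates every integer combination that could possibly fit and picks the feasible one with the highest revenue; this approach is only practical for very small problems and is not how CVXR solves it.

```r
total_volume <- 2400
revenue <- c(2300, 1700, 1800, 2200, 1900)
volume <- c(500, 400, 400, 700, 500)

# Each item can be chosen at most floor(total_volume / volume) times
max_units <- floor(total_volume / volume)
grid <- as.matrix(expand.grid(lapply(max_units, function(m) 0:m)))

used <- as.vector(grid %*% volume)     # volume consumed by each combination
value <- as.vector(grid %*% revenue)   # revenue earned by each combination
feasible <- used <= total_volume

# Best feasible combination
best <- which.max(ifelse(feasible, value, -Inf))
best_value <- value[best]
best_amount <- unname(grid[best, ])

best_value    # 11000
best_amount   # 4 0 1 0 0
```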

  • Actuarial sciences
  • Automating reports

Automating Reports

  • Connect to enterprise data warehouses
  • Generate contingency tables
  • Design workflows
  • Write reports
  • Publish and distribute work

  • {DBI}
  • {MortalityTables}/{lifecontingencies}
  • {rmarkdown}/{shiny}
  • {drake}
  • RStudio Connect

library(drake)
plan <- drake_plan(
    data=DBI::dbReadTable(ignore(con), 'accidents')  # read data from DB
    , table=table_pred(data)      # build actuarial table
    , plot=actuary_plot(table)    # generate plot of table
    , report=rmarkdown::render(   # render report with table and plot
      knitr_in('report.Rmd'), 
      output_file=file_out("report.html")
    )
)
make(plan)    # run whole process

  • AB testing
  • Targeted advertising
  • Causal analysis
  • Funnel analysis

Funnel Analysis

  • Create a cohort of customers
  • Follow them as they progress through the buying process
  • See how many make it from one stage to the next

  • {dbplyr}
  • {funneljoin}

funneljoin::landed
user_id timestamp
1 2018-07-01
2 2018-07-01
3 2018-07-02
4 2018-07-01
4 2018-07-04
5 2018-07-10
5 2018-07-12
6 2018-07-07
6 2018-07-08
funneljoin::registered
user_id timestamp
1 2018-07-02
3 2018-07-02
4 2018-06-10
4 2018-07-02
5 2018-07-11
6 2018-07-10
6 2018-07-11
7 2018-07-07

library(funneljoin)
after_inner_join(landed, registered, 
                 by_user="user_id", by_time="timestamp",
                 type="first-firstafter", suffix=c("_landed", "_registered"))
user_id timestamp_landed timestamp_registered
1 2018-07-01 2018-07-02
4 2018-07-01 2018-07-02
3 2018-07-02 2018-07-02
6 2018-07-07 2018-07-10
5 2018-07-10 2018-07-11
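For readers who want to see what a "first-firstafter" join does, here is a base R sketch of the logic using the same two tables: take each user's first landing, then the first registration on or after it. The helper `first_row_per_user()` is a hypothetical name introduced for this sketch and is not part of {funneljoin}.

```r
landed <- data.frame(
  user_id = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
  timestamp = as.Date(c("2018-07-01", "2018-07-01", "2018-07-02", "2018-07-01",
                        "2018-07-04", "2018-07-10", "2018-07-12", "2018-07-07",
                        "2018-07-08"))
)
registered <- data.frame(
  user_id = c(1, 3, 4, 4, 5, 6, 6, 7),
  timestamp = as.Date(c("2018-07-02", "2018-07-02", "2018-06-10", "2018-07-02",
                        "2018-07-11", "2018-07-10", "2018-07-11", "2018-07-07"))
)

# Hypothetical helper: keep the earliest row for each user
first_row_per_user <- function(df) {
  df <- df[order(df$user_id, df$timestamp), ]
  df[!duplicated(df$user_id), ]
}

# Step 1: first landing per user
first_landed <- first_row_per_user(landed)

# Step 2: pair it with every registration by the same user
pairs <- merge(first_landed, registered, by = "user_id",
               suffixes = c("_landed", "_registered"))

# Step 3: keep registrations on or after the landing, then the earliest one
pairs <- pairs[pairs$timestamp_registered >= pairs$timestamp_landed, ]
pairs <- pairs[order(pairs$user_id, pairs$timestamp_registered), ]
funnel <- pairs[!duplicated(pairs$user_id), ]
rownames(funnel) <- NULL

funnel
```

This yields the same five users and timestamp pairs as the `after_inner_join()` output above, though sorted by `user_id` rather than in the order {funneljoin} returns them.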

Typical Stages of R at Work