Ten Years Ago
Statistician vs Data Scientist
Now
Data Scientist vs Machine Learning Engineer vs AI Specialist
Also 10 Years Ago
\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{SAS} \\ & && \text{Stata} \\ & && \text{Matlab} \\ & && \text{SPSS} \end{alignat*} \]
Also Now
\[ \begin{alignat*}{4} \text{R} & \text{ vs } && \text{Python} \\ & && \text{Julia} \\ & && \text{SAS} \end{alignat*} \]
More 10 Years Ago
More Now
Stick with Now
Who is using R?
How are they using R?
Industry Use Cases
But SAS is mandated by the FDA
FDA does not require use of any specific software for statistical analyses, and statistical software is not explicitly discussed in Title 21 of the Code of Federal Regulations [e.g., in 21CFR part 11]. However, the software package(s) used for statistical analyses should be fully documented in the submission, including version and build identification.
As noted in the FDA guidance, E9 Statistical Principles for Clinical Trials (available at http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/default.htm), “The computer software used for data management and statistical analysis should be reliable, and documentation of appropriate software testing procedures should be available.” Sponsors are encouraged to consult with FDA review teams and especially with FDA statisticians regarding the choice and suitability of statistical software packages at an early stage in the product development process.
May 6, 2015
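The statement above asks that the software be documented "including version and build identification". A minimal base-R sketch of how that information might be captured alongside an analysis follows; the name `software_record` and the choice of fields are illustrative assumptions, not a prescribed format.

```r
# Collect the details a submission might cite for the analysis software.
# The field names below are illustrative, not an FDA-mandated layout.
software_record <- list(
  r_version = R.version.string,                   # full R version string
  svn_rev   = R.version[["svn rev"]],             # build (source revision) id
  platform  = R.version$platform,                 # OS / architecture
  stats_pkg = as.character(packageVersion("stats"))  # version of one package
)
str(software_record)

# sessionInfo() additionally lists every attached package with its version.
print(sessionInfo())
```

Saving this output next to the results gives a reviewer what is needed to rebuild the same environment.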
Reproducibility and Traceability are Essential
library(gsDesign)
library(magrittr) # provides the %>% pipe
n <- 100 # number of events
hr <- 0.7 # hazard ratio (<1 means beneficial treatment)
alpha <- 0.025 # one-sided test level
r <- 1 # randomization ratio (experimental:control)
beta <- 0.1 # type II error rate (power = 1 - beta)
Schoenfeld <- gsDesign(
  k = 2, # number of analyses (one interim, one final)
  # number of events required to achieve desired power
  n.fix = nEvents(hr = hr, alpha = alpha, beta = beta, ratio = r),
  delta1 = log(hr))
Schoenfeld %>%
  gsBoundSummary(deltaname = "HR", logdelta = TRUE) %>%
  knitr::kable(row.names = FALSE)
| Analysis | Value | Efficacy | Futility |
|---|---|---|---|
| IA 1: 50% | Z | 2.7500 | 0.4122 |
| N: 173 | p (1-sided) | 0.0030 | 0.3401 |
| | ~HR at bound | 0.6577 | 0.9391 |
| | P(Cross) if HR=1 | 0.0030 | 0.6599 |
| | P(Cross) if HR=0.7 | 0.3412 | 0.0269 |
| Final | Z | 1.9811 | 1.9811 |
| N: 345 | p (1-sided) | 0.0238 | 0.0238 |
| | ~HR at bound | 0.8078 | 0.8078 |
| | P(Cross) if HR=1 | 0.0239 | 0.9761 |
| | P(Cross) if HR=0.7 | 0.9000 | 0.1000 |
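The fixed-design event count fed into `n.fix` can be traced by hand. The sketch below assumes the Schoenfeld approximation, d = (1 + r)^2 / r * ((z_(1-alpha) + z_(1-beta)) / log(HR))^2, which is what the object name on the slide suggests; the helper `schoenfeld_events` is a hypothetical name written for this illustration and is not part of {gsDesign}.

```r
# Inputs taken from the slide above.
hr    <- 0.7    # hazard ratio under the alternative
alpha <- 0.025  # one-sided test level
beta  <- 0.1    # type II error rate
ratio <- 1      # randomization ratio

# Hypothetical helper: Schoenfeld approximation for the required events
# in a fixed (single-analysis) design.
schoenfeld_events <- function(hr, alpha, beta, ratio = 1) {
  z_sum <- qnorm(1 - alpha) + qnorm(1 - beta)
  (1 + ratio)^2 / ratio * (z_sum / log(hr))^2
}

fixed_events <- schoenfeld_events(hr, alpha, beta, ratio)
ceiling(fixed_events)
```

The final N of 345 in the table is larger than this fixed-design figure because adding an interim analysis inflates the required number of events.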
Fraud detection
{glmnet} {xgboost} {fable} {prophet} {anomalize} {rstan}
library(anomalize)
library(magrittr) # provides the %>% pipe
# fin: a time-indexed tibble with a numeric column named Value
fin %>%
  time_decompose(Value, method="stl", frequency="auto", trend="auto") %>%
  anomalize(remainder, method="iqr", alpha=0.10, max_anoms=0.2) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed=TRUE)
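The pipeline above decomposes the series and then flags unusual remainders. A rough base-R sketch of the same idea is shown here on a synthetic series; the data, the injected spike, and the multiplier `iqr_factor` are all assumptions made for illustration, and the fence rule is a generic IQR rule, not necessarily the exact one {anomalize} applies.

```r
set.seed(1)
# Hypothetical monthly series: level + yearly seasonality + small noise.
values <- 10 + sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 0.1)
values[60] <- values[60] + 5  # inject one obvious anomaly
series <- ts(values, frequency = 12)

# Split the series into seasonal, trend and remainder components.
fit <- stl(series, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Flag remainders that fall far outside the interquartile range.
iqr_factor <- 3  # assumed multiplier for the fences
q <- unname(quantile(remainder, c(0.25, 0.75)))
fence <- iqr_factor * (q[2] - q[1])
flagged <- which(remainder < q[1] - fence | remainder > q[2] + fence)
flagged
```

Working on the remainder rather than the raw values keeps ordinary seasonal peaks from being reported as anomalies.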
Production Optimization
{ompr} {optimization} {CVXR}
library(CVXR)
total_volume <- 2400
revenue <- c(2300, 1700, 1800, 2200, 1900)
volume <- c(500, 400, 400, 700, 500)
amount <- Variable(5, integer=TRUE)
constr <- list(
sum_entries(volume*amount) <= total_volume
, amount >= 0
)
prob <- Problem(objective=Maximize(sum_entries(revenue*amount)), constraints=constr)
sol <- solve(prob)
sol$value
[1] 11000
t(round(sol$getValue(amount)))
     [,1] [,2] [,3] [,4] [,5]
[1,]    4    0    1    0    0
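Because this example is small, the solver's answer can be cross-checked by enumerating every feasible integer combination. The following is a brute-force sketch in base R, offered only as a sanity check; it does not scale to realistic problem sizes, and the names `grid`, `feasible`, and `best_amount` are introduced here for illustration.

```r
total_volume <- 2400
revenue <- c(2300, 1700, 1800, 2200, 1900)
volume  <- c(500, 400, 400, 700, 500)

# Each product can be chosen at most floor(total_volume / volume) times.
upper <- floor(total_volume / volume)
grid <- as.matrix(expand.grid(lapply(upper, function(u) 0:u)))

# Keep only the combinations that respect the volume constraint.
feasible <- grid[as.vector(grid %*% volume) <= total_volume, , drop = FALSE]
totals <- as.vector(feasible %*% revenue)

best_value  <- max(totals)
best_amount <- unname(feasible[which.max(totals), ])
best_value
best_amount
```

Agreement between the enumeration and `sol$value` is a cheap way to build trust in the model formulation before moving to larger instances.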
Automating Reports
{DBI} {MortalityTables}/{lifecontingencies} {rmarkdown}/{shiny} {drake}
library(drake)
plan <- drake_plan(
data=DBI::dbReadTable(ignore(con), 'accidents') # read data from DB
, table=table_pred(data) # build actuarial table
, plot=actuary_plot(table) # generate plot of table
, report=rmarkdown::render( # render report with table and plot
knitr_in('report.Rmd'),
output_file=file_out("report.html")
)
)
make(plan) # run whole process
Funnel analysis
{dbplyr} {funneljoin}
funneljoin::landed
| user_id | timestamp |
|---|---|
| 1 | 2018-07-01 |
| 2 | 2018-07-01 |
| 3 | 2018-07-02 |
| 4 | 2018-07-01 |
| 4 | 2018-07-04 |
| 5 | 2018-07-10 |
| 5 | 2018-07-12 |
| 6 | 2018-07-07 |
| 6 | 2018-07-08 |
funneljoin::registered
| user_id | timestamp |
|---|---|
| 1 | 2018-07-02 |
| 3 | 2018-07-02 |
| 4 | 2018-06-10 |
| 4 | 2018-07-02 |
| 5 | 2018-07-11 |
| 6 | 2018-07-10 |
| 6 | 2018-07-11 |
| 7 | 2018-07-07 |
library(funneljoin)
after_inner_join(landed, registered,
by_user="user_id", by_time="timestamp",
type="first-firstafter", suffix=c("_landed", "_registered"))
| user_id | timestamp_landed | timestamp_registered |
|---|---|---|
| 1 | 2018-07-01 | 2018-07-02 |
| 4 | 2018-07-01 | 2018-07-02 |
| 3 | 2018-07-02 | 2018-07-02 |
| 6 | 2018-07-07 | 2018-07-10 |
| 5 | 2018-07-10 | 2018-07-11 |
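To make the "first-firstafter" pairing concrete, the sketch below rebuilds the result above in base R from the two tables shown on the slides: take each user's first landing, then the first registration on or after it. This illustrates the semantics implied by the output; it is not how {funneljoin} is implemented, and the intermediate names are assumptions.

```r
landed <- data.frame(
  user_id   = c(1, 2, 3, 4, 4, 5, 5, 6, 6),
  timestamp = as.Date(c("2018-07-01", "2018-07-01", "2018-07-02",
                        "2018-07-01", "2018-07-04", "2018-07-10",
                        "2018-07-12", "2018-07-07", "2018-07-08"))
)
registered <- data.frame(
  user_id   = c(1, 3, 4, 4, 5, 6, 6, 7),
  timestamp = as.Date(c("2018-07-02", "2018-07-02", "2018-06-10",
                        "2018-07-02", "2018-07-11", "2018-07-10",
                        "2018-07-11", "2018-07-07"))
)

# "first": keep each user's earliest landing.
landed <- landed[order(landed$user_id, landed$timestamp), ]
first_landed <- landed[!duplicated(landed$user_id), ]

# Inner join on user, then keep registrations on or after that landing.
pairs <- merge(first_landed, registered, by = "user_id",
               suffixes = c("_landed", "_registered"))
pairs <- pairs[pairs$timestamp_registered >= pairs$timestamp_landed, ]

# "firstafter": keep the earliest qualifying registration per user.
pairs <- pairs[order(pairs$user_id, pairs$timestamp_registered), ]
result <- pairs[!duplicated(pairs$user_id), ]
result
```

User 2 never registers, user 7 never lands, and user 4's June registration precedes the landing, so none of those rows appear in the result.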
Typical Stages of R at Work