This supplement presents an empirical investigation into the characteristics of a set of \(M=5\) fully synthetic samples generated from a single sample of \(n=10,000\) records from the California data using CART FCS synthesizers; see the main text for a detailed description. The set of synthetic samples corresponds to the first replication of the repeated sampling experiment (Section 3.1 of the main text), using the CART FCS synthesizer with a small complexity parameter (cp = 0.0001).

The complexity, or “contamination”, parameter in CART models determines the maximum level of impurity on the leaves, relative to the root node, that is acceptable to stop splitting (see Hastie, Tibshirani, and Friedman 2009, p. 305). In the case of discrete data, impurity is usually measured using the Gini index (Hastie, Tibshirani, and Friedman 2009; Drechsler and Reiter 2011). Smaller complexity parameters lead to better-fitted trees, at the risk of overfitting. Thus, as noted by Drechsler and Reiter (2011), CART FCS synthesizers with a small contamination parameter can lead to very high-utility data. However, as we will show here, in some cases they achieve this extreme performance in part by essentially reproducing a fair portion of the original confidential data. This behavior defeats the purpose of generating synthetic data for mitigating disclosure risk.

What follows can be seen as an informal investigation into the disclosure risk of the synthetic data generated by the aforementioned CART FCS synthesizer.

FCS CART Can Create Extremely High-utility Synthetic Samples

We start by loading the original population (into “popu_tab”), the sample of \(n=10,000\) records from that population (into “samp”), and the \(M = 5\) synthetic datasets generated from that data using the CART synthesizer tuned with the small contamination parameter 0.0001 (into list “col_cart”). We chose the sample so that it corresponds to the first of the 200 replications in the repeated sampling experiment presented in the main article.

As expected, the synthetic data produce extremely high-quality inferences about the population quantities:

The left plot (MI_CART vs. Population) shows that the estimates calculated from the CART-synthesized data are excellent. In fact, they are comparable to the ones obtained from the actual sample (right plot). Moreover, calculations with the synthetic data are almost indistinguishable from calculations with the original data themselves:

Therefore, from a purely analytical utility-based perspective, it seems that the FCS CART synthesizer with a small contamination parameter is the way to go. But let us now look at the actual synthetic datasets.

But it does so by directly disclosing large portions of the original data

FCS synthesizers work by, starting from the original data, varying one coordinate at a time: they regress that coordinate on the rest of the variables and use the fitted model, together with the Bayesian bootstrap, to generate replacement values for that coordinate, conditioning on the rest of the multivariate vector in a Gibbs sampler-like manner (J. P. Reiter 2005). The idea is that after at least a few full cycles, all entries in the dataset will have been replaced by predicted values, forming the synthesized dataset. However, there is always a possibility that the regression ends up working too well (overfitting), so that the generated values end up being too close, or even equal, to the original ones.

We start by checking how many of the \(n=10,000\) records have \(k \in \{0,1,...,17\}\) variables unchanged with respect to the original sample, in each of the \(M = 5\) synthetic data replications:

Here we can see that in each of the \(M = 5\) replications there are several records that are completely unchanged. For example, in the first replication (column labeled “Synth_Dat_1”) 386 records did not change at all after the application of the CART FCS synthesizer. This means that if the Agency were to release that supposedly safe synthetic dataset, they would also be releasing 386 of the original records. Furthermore, many of the records that have actually been altered differ from the original in just a few variables:

In this table we can see that in the first synthetic dataset more than half of the records (5535) are copies of the original records with at most 3 altered variables. Furthermore, every synthetic record keeps at least 5 variables unchanged.

Now let us look at how individual variables (the columns in the original dataset) are preserved after the application of the CART synthesizer. Here we calculate, for each of the 17 variables, the percentage of records in each synthetic dataset where that variable has been left untouched. We have sorted the variables in descending order of their minimum percentage across the \(M = 5\) datasets.

In this table we see that there are 9 variables (SCHOOL, OWNERSHP, SCHLTYPE, EMPSTAT, VETSTAT, GRADEATT, DISABWRK, MORTGAGE and LOOKING) that, after the application of the CART synthesizer, are preserved in at least 80% of the records. Moreover, one of them (SCHOOL) is completely preserved in all synthetic datasets. This means that the supposedly synthetic data contain several variables that are almost verbatim copies of the confidential data, and one that is a perfect copy.

What is happening here?

To better understand this issue we will look more closely at the synthesis of one particularly problematic record from the original sample. Record 1297 is an instance of a data point that was left completely unmodified in all \(M = 5\) synthetic datasets after the application of the CART synthesizer, and was therefore perfectly disclosed.

We first select the record from the original data,

and also fit the 17 full conditional CART regression models, corresponding to each of the variables, using the original sample.

Now let us look at the predictions that we can obtain for each of the variables when conditioning on the rest of the row, following the FCS approach. In the next output we detail, for each of the \(j=1,...,17\) variables, its current value (“curr. value”) and the prediction probabilities associated with each of the levels of said variable (“p(level1), p(level2), etc.”), obtained from the fitted CARTs keeping the rest of the vector at its original values.

OWNERSHP: (3 levels)
    curr. value: 2
    Predictions:    p(0)=0.017 p(1)=0 p(2)=0.983 
MORTGAGE: (4 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(3)=0 p(4)=0 
SEX: (2 levels)
    curr. value: 1
    Predictions:    p(1)=0.569 p(2)=0.431 
AGE: (7 levels)
    curr. value: <15
    Predictions:    p([15,17])=0 p([18,24])=0 p([25,35])=0 p([36,50])=0 p([51,70])=0 p(<15)=1 p(>70)=0 
MARST: (6 levels)
    curr. value: 6
    Predictions:    p(1)=0 p(2)=0 p(3)=0 p(4)=0 p(5)=0 p(6)=1 
CITIZEN: (4 levels)
    curr. value: 0
    Predictions:    p(0)=0.938 p(1)=0 p(2)=0 p(3)=0.0619 
SPEAKENG: (6 levels)
    curr. value: 0
    Predictions:    p(0)=0.9 p(1)=0.02 p(3)=0.04 p(4)=0 p(5)=0.02 p(6)=0.02 
RACESING: (5 levels)
    curr. value: 1
    Predictions:    p(1)=0.806 p(2)=0.0583 p(3)=0.0291 p(4)=0.107 p(5)=0 
SCHOOL: (3 levels)
    curr. value: 1
    Predictions:    p(0)=0 p(1)=1 p(2)=0 
EDUC: (11 levels)
    curr. value: 0
    Predictions:    p(0)=0.979 p(1)=0.0209 p(2)=0 p(3)=0 p(4)=0 p(5)=0 p(6)=0 p(7)=0 p(8)=0 p(10)=0 p(11)=0 
GRADEATT: (8 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(2)=0 p(3)=0 p(4)=0 p(5)=0 p(6)=0 p(7)=0 
SCHLTYPE: (4 levels)
    curr. value: 1
    Predictions:    p(0)=0 p(1)=1 p(2)=0 p(3)=0 
EMPSTAT: (4 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(2)=0 p(3)=0 
CLASSWKR: (3 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(2)=0 
LOOKING: (4 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(2)=0 p(3)=0 
DISABWRK: (3 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(4)=0 
VETSTAT: (3 levels)
    curr. value: 0
    Predictions:    p(0)=1 p(1)=0 p(2)=0 

This output explains why this record is problematic. For most variables (MORTGAGE, AGE, MARST, SCHOOL, GRADEATT, SCHLTYPE, EMPSTAT, CLASSWKR, LOOKING, DISABWRK and VETSTAT) the only possible prediction is their current value. For example, for AGE (current value = ‘<15’), we have that p(<15) = 1, while all the other levels have probability zero. This can be somewhat surprising until we realize that this particular variable is involved in several structural-zero definitions which drastically limit its acceptable values. For example, the current value of VETSTAT (veteran status) is ‘0’, which is the code for ‘N/A’; see Table 1 in Supplement #1 for variable codes. Looking at the definition of structural zeros (see Table 2 in Supplement #1) we note that such a value for VETSTAT is only allowed for records with AGE=‘<15’. In other words, conditional on VETSTAT=0, AGE can only be ‘<15’. This would not be too problematic if down the line we were able to change the value of VETSTAT. However, as we have seen, AGE=‘<15’ implies VETSTAT=0 and, conversely, VETSTAT=0 implies AGE=‘<15’. This makes it impossible for the CART synthesizer to ever change these values. Several other complex structural-zero conditions are at play here, further constraining the synthesis.

It is important to realize that this phenomenon is just an artifact of the strict FCS one-at-a-time conditional imputation strategy. In fact, the only reason why we cannot escape the value \((AGE,VETSTAT) = (<15, 0)\) to reach any other combination (say, \(([25,35], 1)\)) is that doing so using the FCS approach would require passing through intermediate values that are impossible combinations. This problem is further compounded by tuning the CARTs using a small contamination parameter. Such a choice forces the growth of large trees whose terminal nodes are more likely to be sparsely populated. This causes predictions to closely resemble the observed data, which further limits the possible moves.

This said, we note that in our implementation of the CART-based FCS approach we have not explicitly enforced the structural-zero restrictions. However, the fact that by definition a dataset cannot include structural-zero data points, coupled with the nature of CART regression, results in a sort of empirical enforcement of structural zeros. Indeed, whenever the splitting structure of the tree produces a terminal node whose ancestors constrain the predictor values in a way that matches a structural-zero restriction involving the response, that terminal node cannot contain any values that violate the restriction, simply because such values cannot exist in a sample. This phenomenon is more likely when the contamination parameter of the tree is small, as small contamination parameter values usually result in the growth of large trees (see Drechsler and Reiter 2011). This results in terminal nodes that are likely to descend from nodes involving a large number of variables, thus increasing the chance of matching a restriction.

References

Drechsler, Jörg, and Jerome P Reiter. 2011. “An Empirical Evaluation of Easily Implemented, Nonparametric Methods for Generating Synthetic Datasets.” Computational Statistics & Data Analysis 55 (12). Elsevier: 3232–43.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. 2nd edition. New York: Springer-Verlag.

Reiter, J. P. 2005. “Using CART to Generate Partially Synthetic, Public Use Microdata.” Journal of Official Statistics 21: 441–62.

---
title: 'Supplemental Materials #2 for Bayesian Non-parametric Generation of Fully Synthetic
  Multivariate Categorical Data in the Presence of Structural Zeros'
author: "Daniel Manrique-Vallier and Jingchen Hu"
date: '`r format(Sys.time(), "%B %d, %Y")`'
output:
  html_notebook:
    fig_caption: yes
  html_document: default
  pdf_document: default
subtitle: Investigation into CART FCS synthesizer with small impurity parameter
bibliography: ../mb.bib
---

```{r setup,include=FALSE}
require(dplyr)
require(tidyr)
require(magrittr)
require(ggplot2)
require(gridExtra)
require(rprojroot)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = find_root('LCM_Synth_Zeros.Rproj'))
```

```{r load and setup,include=FALSE}
require(tree, lib.loc = 'frozen_packages')
source('src/fn_trees.R')
fns <- new.env()
sys.source(file='src/functions_for_tests.R', envir = fns)
outdir <- 'tmp'
#population quantities
load('data/usa_09_1.RData')
popu_tab <- usa09_1$dat_proc %>%
  group_by_all %>%
  count() %>%
  rename(Freq = n) %>%
  as.data.frame

#synthetic data (First replication)
set.seed(123) #First replication corresponds to sample generated with this seed.
samp <- usa09_1$dat_proc[sample(1:NROW(usa09_1$dat_proc), size = 10000, replace = T),]
with(new.env(), {
  load('tmp/Replic_CART2/replic_CART_1', envir = environment())
  assign('col_cart', collection$imputations, envir = parent.env(environment()))
})
```


This supplement presents an empirical investigation into the characteristics of a set of $M=5$ fully synthetic samples generated from a single sample of $n=10,000$ records from the California data using CART FCS synthesizers; see the main text for a detailed description. The set of synthetic samples corresponds to the first replication of the repeated sampling experiment (Section 3.1 of the main text), using the CART FCS synthesizer with a small complexity parameter (cp = 0.0001).

The complexity, or "contamination", parameter in CART models determines the maximum level of impurity on the leaves, relative to the root node, that is acceptable to stop splitting [see @Hastie2009, p. 305]. In the case of discrete data, impurity is usually measured using the Gini index [@Hastie2009; @drechsler2011empirical]. Smaller complexity parameters lead to better-fitted trees, at the risk of overfitting. Thus, as noted by @drechsler2011empirical, CART FCS synthesizers with a small contamination parameter can lead to *very* high-utility data. However, as we will show here, in some cases they achieve this extreme performance in part by essentially *reproducing* a fair portion of the original confidential data. This behavior defeats the purpose of generating synthetic data for mitigating disclosure risk.
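To make the role of the impurity measure concrete, the following is a minimal base-R sketch of the Gini index and of the impurity decrease produced by a candidate split. The helper names `gini()` and `gini_decrease()` and the toy vectors are ours, introduced only for illustration; the exact stopping rule that compares such a decrease against the threshold depends on the CART implementation being used.

```r
# Gini impurity of a vector of class labels: 1 - sum_k p_k^2.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease obtained by splitting `y` according to the logical vector `left`.
gini_decrease <- function(y, left) {
  w <- mean(left)  # share of observations sent to the left child
  gini(y) - (w * gini(y[left]) + (1 - w) * gini(y[!left]))
}

# Toy data (hypothetical): one categorical response and one numeric predictor.
y <- c("a", "a", "a", "b", "b", "b")
x <- c(1, 2, 3, 4, 5, 6)

gini(y)                   # impurity at the root
gini_decrease(y, x <= 3)  # a split that separates the two classes perfectly
gini_decrease(y, x <= 1)  # a much weaker split
```

With a very small threshold, even splits with a tiny gain relative to the root, such as the second one here, would typically be retained, which is what drives the growth of large trees.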

What follows can be seen as an informal investigation into the disclosure risk of the synthetic data generated by the aforementioned CART FCS synthesizer.

## FCS CART Can Create Extremely High-utility Synthetic Samples

We start by loading the original population (into "popu_tab"), the sample of $n=10,000$  records from that population (into "samp"), and the $M = 5$ synthetic datasets  generated from that data using the CART synthesizer tuned with the small contamination parameter 0.0001 (into list "col_cart"). We chose the sample so that it corresponds to the first of the 200 replications in the repeated sampling experiment presented in the main article. 

As expected, the synthetic data produce extremely high-quality inferences about the population quantities:
```{r select test margins,include=FALSE}
# Create the list of margins with zeros
margins_w_zeros <-  apply(usa09_1$MCZ, MARGIN = 2, FUN= function(x)as.factor(!is.na(x))) %>% 
  as.data.frame() %>%
  fns$fn_partial_contingency2() %>% 
  select(-Freq) %>% 
  apply(., MARGIN = 1, FUN = function(x)(1:ncol(.))[x=='TRUE']) %>% 
  split(., col(.))
#modification
#margins_w_zeros <- list(c(1,2))
# Create list of margins to compute
test_margins <- margins_w_zeros %>%
  lapply(., FUN = function(x)fns$fn_mway_margins_including(1:ncol(usa09_1$MCZ), x, 3)) %>%
  Reduce(union, .)
```

```{r population estimation MI and sample, include = FALSE}
#pm <- fns$fn_all_mway_margins_tabulated(df_tabulated = popu_tab, m_way = 3)$p %>% unlist
pm <- sapply(test_margins, FUN = function(x)fns$fn_marg_prob_tabulated(popu_tab, margin = x)$point) %>% unlist()
#sm <- fns$fn_all_mway_margins_tabulated(df_tabulated = samp %>% fns$fn_partial_contingency2(), m_way = 3)$p %>% unlist
sm <- sapply(test_margins, FUN = function(x)fns$fn_marg_prob_tabulated(samp %>% fns$fn_partial_contingency2(), margin = x)$point) %>% unlist()
MIm <- lapply(col_cart,
  FUN = function(x) {
#    fns$fn_all_mway_margins_tabulated(
#      df_tabulated = x %>% fns$fn_partial_contingency2(),
#      m_way = 3
#    )$p %>% unlist
    sapply(
      test_margins, 
      FUN = function(y)fns$fn_marg_prob_tabulated(x %>% fns$fn_partial_contingency2(), margin = y)$point
    ) %>% unlist()
  }
) %>% as.data.frame %>% apply(., MARGIN = 1, FUN = mean)
```

```{r plot results,echo=FALSE,fig.align='center'}
set.seed(1)
tibble(Population = pm, Sample = sm, MI_CART = MIm) %>%
  filter(Sample != 0) %>%
  dplyr::sample_n(2000) %>%
    {
      ggplot(.) +
      ggplot2::coord_fixed(ratio = 1) +
      geom_abline(intercept = 0, slope = 1, col = 'grey')
    }  %>% {
      grid.arrange({
          . + geom_point(aes(y = MI_CART, x = Population), alpha = 0.6, col = 'black', size = 1) +
          labs(title = 'Synthetic data vs. Population', subtitle = 'Estimates of 3-way margin proportions')
        }, {
          . + geom_point(aes(y = Sample, x = Population), alpha = 0.6, col = 'black', size = 1) +
          labs(title = 'Original sample vs. Population', subtitle ='Estimates of 3-way margin proportions' )
        }, 
        ncol = 2 
      )
    }
```

The left plot (MI_CART vs. Population) shows that the estimates calculated from the CART-synthesized data are excellent. In fact, they are comparable to the ones obtained from the actual sample (right plot). Moreover, calculations with the synthetic data are almost indistinguishable from calculations with the original data themselves:

```{r plot synth vs sample,echo=FALSE,fig.align='center'}
set.seed(1)
tibble(Sample = sm, MI_CART = MIm) %>%
  filter(Sample != 0) %>%
  sample_n(2000) %>%
  ggplot(.) +
    ggplot2::coord_fixed(ratio = 1) +
    geom_abline(intercept = 0, slope = 1, col = 'grey') +
    geom_point(aes(y = MI_CART, x = Sample), alpha = 0.6, col = 'black', size = 1) + labs(title = 'Synthetic data vs. Original sample', subtitle = 'Estimates of 3-way margin proportions')
```
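The quantities compared in these plots are margin proportions, with the synthetic-data estimate obtained by averaging the proportions over the $M$ synthetic datasets. The following is a minimal base-R sketch of that computation on toy data. The helper `margin_prop()`, the variables `A` and `B`, and the two tiny datasets are made up for illustration; they are not the functions in `src/functions_for_tests.R`, and a two-way margin stands in for the three-way margins used above.

```r
# Fixed level sets, so that every dataset produces a table with the same cells.
lv <- list(A = c("0", "1"), B = c("0", "1"))
make <- function(a, b) data.frame(A = factor(a, lv$A), B = factor(b, lv$B))

# Two hypothetical synthetic datasets (M = 2).
synths <- list(
  make(c("0", "0", "1", "1"), c("0", "1", "1", "1")),
  make(c("0", "1", "1", "1"), c("0", "1", "1", "1"))
)

# Cell proportions of the margin defined by `vars`, as a plain vector.
margin_prop <- function(dat, vars) as.vector(prop.table(table(dat[vars])))

# One column of proportions per synthetic dataset, then average across datasets.
per_dataset <- sapply(synths, margin_prop, vars = c("A", "B"))
mi_est <- rowMeans(per_dataset)
mi_est
```

Fixing the factor levels up front matters: otherwise a cell that happens to be empty in one dataset would be dropped from its table and the vectors would no longer line up.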

Therefore, from a purely analytical utility-based perspective, it seems that the FCS CART synthesizer with a small contamination parameter is the way to go. But let us now look at the actual synthetic datasets.

## But it does so by directly disclosing large portions of the original data

FCS synthesizers work by, *starting from the original data*, varying one coordinate at a time: they regress that coordinate on the rest of the variables and use the fitted model, together with the Bayesian bootstrap, to generate replacement values for that coordinate, conditioning on the rest of the multivariate vector in a Gibbs sampler-like manner [@Reiter:2005:CART:synthetic]. The idea is that after at least a few full cycles, all entries in the dataset will have been replaced by predicted values, forming the synthesized dataset. However, there is always a possibility that the regression ends up working *too well* (overfitting), so that the generated values end up being too close, or even equal, to the original ones.
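The one-coordinate-at-a-time mechanism can be sketched in a few lines of base R. This is a deliberately simplified toy, not the synthesizer used in this supplement: the function `fcs_sweep()` is hypothetical, the Bayesian bootstrap is omitted, and each CART is replaced by a fully saturated conditional (donors are the original records that match on all other variables), which mimics what an extremely deep tree would do. The toy variables and their codes are also made up.

```r
# Toy FCS sweep. For each variable j and each record i, draw a replacement value
# from the original records that agree with record i on every other variable.
fcs_sweep <- function(orig, n_cycles = 3) {
  synth <- orig
  for (cycle in seq_len(n_cycles)) {
    for (j in seq_along(orig)) {
      others <- setdiff(seq_along(orig), j)
      key_orig  <- do.call(paste, c(orig[others],  sep = "|"))
      key_synth <- do.call(paste, c(synth[others], sep = "|"))
      for (i in seq_len(nrow(synth))) {
        donors <- orig[[j]][key_orig == key_synth[i]]
        if (length(donors) == 0) donors <- orig[[j]]  # unseen combination: use the marginal
        synth[[j]][i] <- donors[sample.int(length(donors), 1)]
      }
    }
  }
  synth
}

# Toy data: VETSTAT is "N/A" exactly when AGE is "<15"; SEX is unconstrained.
set.seed(1)
n <- 200
age <- sample(c("<15", "adult"), n, replace = TRUE)
vet <- ifelse(age == "<15", "N/A", sample(c("no", "yes"), n, replace = TRUE))
sex <- sample(c("F", "M"), n, replace = TRUE)
orig <- data.frame(AGE = age, VETSTAT = vet, SEX = sex, stringsAsFactors = FALSE)

synth <- fcs_sweep(orig)
colMeans(orig == synth)  # share of records in which each variable is unchanged
```

In this toy, AGE is fully determined by VETSTAT, so no number of cycles can ever change it, while SEX and the VETSTAT of adult records remain free to move.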

We start by checking how many of the $n=10,000$ records have $k \in \{0,1,...,17\}$ variables unchanged with respect to the original sample, in each of the $M = 5$ synthetic data replications:
```{r samples with unchanged variables,echo=FALSE}
#How many records in each synthetic dataset have exactly n variables unchanged:
(
  records_unchanged <- col_cart %>% sapply(
    FUN = function(x){
      (samp == x) %>% 
        apply(., MARGIN = 1, FUN = sum) %>% 
        factor(., levels = 0:17) %>% 
        table(dnn=list('unchanged_variables')) %>% 
        as.data.frame()
    },
    simplify = F
  ) %>% bind_rows(.id = 'Synth_Dat') %>%
    spread(Synth_Dat, value = Freq, sep = '_') %>%
    arrange(desc(unchanged_variables))
) %>% print

```

Here we can see that in each of the $M = 5$ replications there are several records that are completely unchanged. For example, in the first replication (column labeled "Synth_Dat_1") 386 records did not change *at all* after the application of the CART FCS synthesizer. This means that if the Agency were to release that supposedly safe synthetic dataset, they would also be releasing 386 of the *original* records. Furthermore, many of the records that have actually been altered differ from the original in just a few variables:

```{r at least n variables unchanged,collapse=TRUE,echo=FALSE}
#How many samples have at least n variables unchanged:
records_unchanged %>%
  mutate_at(vars(starts_with('Synth')), funs(cumsum)) %>%
  arrange(desc(unchanged_variables))
```

In this table we can see that in the first synthetic dataset more than half of the records (5535) are copies of the original records with at most 3 altered variables. Furthermore, every synthetic record keeps at least 5 variables unchanged.
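For readers who want to reproduce this kind of tabulation on their own data, the two tables above boil down to a row-wise comparison followed by a cumulative sum. The following base-R sketch uses a made-up four-record dataset; the object names are ours and the code is a simplified stand-in for the dplyr/tidyr pipeline used in this notebook.

```r
# Hypothetical original data and a synthetic copy with a few altered cells.
orig <- data.frame(
  A = c("x", "x", "y", "y"),
  B = c("1", "2", "1", "2"),
  C = c("u", "u", "v", "v"),
  stringsAsFactors = FALSE
)
synth <- orig
synth$B[2] <- "1"                   # record 2: one variable altered
synth[3, ] <- list("x", "2", "u")   # record 3: all variables altered

# Number of variables left unchanged in each record.
n_same <- rowSums(orig == synth)

# Records with exactly k unchanged variables, k = 0, ..., number of variables.
exact <- table(factor(n_same, levels = 0:ncol(orig)))

# Records with at least k unchanged variables (cumulate from the top down).
at_least <- rev(cumsum(rev(exact)))

exact
at_least
```

The last entry of `exact` is the number of records released verbatim, which is the quantity of most direct concern for disclosure.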

Now let us look at how individual variables (the columns in the original dataset) are preserved after the application of the CART synthesizer. Here we calculate, for each of the 17 variables, the percentage of records in each synthetic dataset where that variable has been left untouched. We have sorted the variables in descending order of their minimum percentage across the $M = 5$ datasets.

```{r untouched variables,collapse=TRUE,echo=FALSE}
#percentage of records where variable j is unchanged per synthetic dataset...
col_cart %>%
  lapply(
    FUN = function(a){
        (samp == a) %>%
        apply(., MARGIN = 2, FUN = sum) %>%
        {./10000 * 100} %>%
        data.frame(a=.) %>% tibble::rownames_to_column() %>% spread(rowname, a)
    }
  ) %>% bind_rows %>%
  mutate(Synth_Dat = as.character(row_number())) %>%
  .[, c('Synth_Dat',{select(., -Synth_Dat) %>% apply(., MARGIN = 2, FUN = min) %>% sort(.,decreasing =TRUE) %>% names})] %>%
  bind_rows(  summarize_at(., vars(-Synth_Dat), funs(min)) %>% 
              mutate(Synth_Dat ='---MINIMUM---')  %>%
              select(Synth_Dat, everything())
  ) %>%
  mutate_if(is.numeric, funs(round(., digits = 1))) %>%
  print(.)
```

In this table we see that there are 9 variables (SCHOOL, OWNERSHP, SCHLTYPE, EMPSTAT, VETSTAT, GRADEATT, DISABWRK, MORTGAGE and LOOKING) that, after the application of the CART synthesizer, are preserved in at least 80\% of the records. Moreover, one of them (SCHOOL) is *completely* preserved in all synthetic datasets. This means that the supposedly synthetic data contain several variables that are almost verbatim copies of the confidential data, and one that is a perfect copy.
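The per-variable summary can be computed with a column-wise comparison. Below is a minimal base-R sketch on toy data; the two-variable dataset, its codes, and the object names are invented for illustration and only the variable names echo the ones above.

```r
# Hypothetical original sample with two variables.
orig <- data.frame(
  SCHOOL = c("1", "1", "2", "2"),
  SEX    = c("1", "2", "1", "2"),
  stringsAsFactors = FALSE
)

# Two hypothetical synthetic datasets: SCHOOL is never altered, SEX sometimes is.
s1 <- orig; s1$SEX[1] <- "2"
s2 <- orig; s2$SEX[1:2] <- c("2", "1")
synths <- list(s1, s2)

# Percentage of records in which each variable is unchanged, one column per dataset.
pct_same <- sapply(synths, function(s) colMeans(orig == s) * 100)

# Worst case across datasets, sorted from most to least preserved.
worst <- apply(pct_same, 1, min)
sort(worst, decreasing = TRUE)
```

Sorting by the minimum across datasets, as in the table above, highlights the variables that are preserved in every one of the releases rather than in just one of them.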

## What is happening here?

To better understand this issue we will look more closely at the synthesis of one particularly problematic record from the original sample. Record 1297 is an instance of a data point that was left completely unmodified in all $M = 5$ synthetic datasets after the application of the CART synthesizer, and was therefore perfectly disclosed.

We first select the record from the original data,
```{r,warning=FALSE,echo=FALSE}
(
  test_entry <- samp %>%
  slice(1297) %>%
  select_if(is.factor)
)
```

and also fit the 17 full conditional CART regression models, corresponding to each of the variables, using the original sample.

```{r synth engine,echo=FALSE}
engine <- list()
for (i in 1:NCOL(samp)){
  y <- names(samp)[i]
  engine[[i]] <- tree(formula = paste0(y,'~.'), data = samp, mindev = 0.0001)
}
```

Now let us look at the predictions that we can obtain for each of the variables when conditioning on the rest of the row, following the FCS approach. In the next output we detail, for each of the $j=1,...,17$ variables, its current value ("curr. value") and the prediction probabilities associated with each of the levels of said variable ("p(level1), p(level2), etc."), obtained from the fitted CARTs keeping the rest of the vector at its original values.
```{r predictions,echo=FALSE}
# and try to predict
predictions <- sapply(engine, function(x) predict(x, test_entry), simplify = T)
names(predictions) <- names(samp)
for(i in seq_along(predictions)){
  v <- predictions[[i]]; nm <- names(predictions)[i]
  cat(
    nm, ': (', length(v),' levels)\n\tcurr. value: ', as.data.frame(test_entry)[,i] %>% as.character(),
    '\n\tPredictions:\t', sep = ''
  )
  for(j in seq_along(v)){
    cat('p(', colnames(v)[j], ')=', signif(v[j], 3), ' ', sep = '')
  }
  cat('\n')
}

```

This output explains why this record is problematic. For most variables (MORTGAGE, AGE, MARST, SCHOOL, GRADEATT, SCHLTYPE, EMPSTAT, CLASSWKR, LOOKING, DISABWRK and VETSTAT) the only possible prediction is their current value. For example, for AGE (current value = '<15'), we have that p(<15) = 1, while all the other levels have probability zero. This can be somewhat surprising until we realize that this particular variable is involved in several structural-zero definitions which drastically limit its acceptable values. For example, the current value of VETSTAT (veteran status) is '0', which is the code for 'N/A'; see Table 1 in Supplement \#1 for variable codes. Looking at the definition of structural zeros (see Table 2 in Supplement \#1) we note that such a value for VETSTAT is only allowed for records with AGE='<15'. In other words, conditional on VETSTAT=0, AGE can only be '<15'. This would not be too problematic if down the line we were able to change the value of VETSTAT. However, as we have seen, AGE='<15' implies VETSTAT=0 and, conversely, VETSTAT=0 implies AGE='<15'. This makes it impossible for the CART synthesizer to *ever* change these values. Several other complex structural-zero conditions are at play here, further constraining the synthesis.

It is important to realize that this phenomenon is just an artifact of the strict FCS one-at-a-time conditional imputation strategy. In fact, the only reason why we cannot escape the value $(AGE,VETSTAT) = (<15, 0)$ to reach any other combination (say, $([25,35], 1)$) is that doing so using the FCS approach would require passing through intermediate values that are impossible combinations. This problem is further compounded by tuning the CARTs using a small contamination parameter. Such a choice forces the growth of large trees whose terminal nodes are more likely to be sparsely populated. This causes predictions to closely resemble the observed data, which further limits the possible moves.
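The reachability argument can be checked by brute force on a toy version of the restriction. The sketch below assumes, as described in the text, that VETSTAT = 0 holds if and only if AGE = '<15'; it uses only three AGE levels and three VETSTAT codes, and ignores every other variable and every other structural zero, so it is an illustration of the logic rather than of the actual California data.

```r
# Toy encoding of the rule: VETSTAT = 0 if and only if AGE = "<15".
ages <- c("<15", "[18,24]", "[25,35]")
vets <- c("0", "1", "2")
grid <- expand.grid(AGE = ages, VETSTAT = vets, stringsAsFactors = FALSE)
grid$allowed <- (grid$AGE == "<15") == (grid$VETSTAT == "0")

# Starting cell of the problematic record.
start <- list(AGE = "<15", VETSTAT = "0")

# Number of coordinates in which each cell differs from the starting cell.
n_diff <- (grid$AGE != start$AGE) + (grid$VETSTAT != start$VETSTAT)

one_step <- grid[n_diff == 1, ]  # reachable by changing a single variable
two_step <- grid[n_diff == 2, ]  # require changing both variables at once

any(one_step$allowed)  # is any single-variable move allowed?
any(two_step$allowed)  # are there allowed cells that need a joint move?
```

Every single-variable move from the starting cell lands on a structural zero, whereas allowed cells do exist once both variables are changed together. A sampler that updates AGE and VETSTAT jointly would therefore not be trapped in this way.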

This said, we note that in our implementation of the CART-based FCS approach we have not *explicitly* enforced the structural-zero restrictions. However, the fact that by definition a dataset cannot include structural-zero data points, coupled with the nature of CART regression, results in a sort of empirical enforcement of structural zeros. Indeed, whenever the splitting structure of the tree produces a terminal node whose ancestors constrain the predictor values in a way that matches a structural-zero restriction involving the response, that terminal node cannot contain any values that violate the restriction, simply because such values cannot exist in a sample. This phenomenon is more likely when the contamination parameter of the tree is small, as small contamination parameter values usually result in the growth of large trees [see @drechsler2011empirical]. This results in terminal nodes that are likely to descend from nodes involving a large number of variables, thus increasing the chance of matching a restriction.
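This empirical enforcement can be illustrated without fitting any tree. A terminal node that isolates the records with VETSTAT = 0 predicts AGE from the empirical frequencies of those records, and those frequencies are necessarily degenerate. The following base-R sketch uses a made-up six-record sample (`samp_toy` is hypothetical) in which, as in the text, VETSTAT = 0 occurs only together with AGE = '<15'.

```r
# Hypothetical sample that respects the restriction by construction.
samp_toy <- data.frame(
  AGE     = c("<15", "<15", "<15", "[18,24]", "[25,35]", "[25,35]"),
  VETSTAT = c("0", "0", "0", "1", "1", "2"),
  stringsAsFactors = TRUE
)

# Empirical distribution of AGE within each VETSTAT group (rows sum to one).
cond <- prop.table(table(VETSTAT = samp_toy$VETSTAT, AGE = samp_toy$AGE), margin = 1)

cond["0", ]  # what a leaf containing only VETSTAT = 0 records would predict for AGE
cond["1", ]  # a group that is not pinned down by the restriction
```

No restriction was coded anywhere, yet the first row puts all of its mass on '<15', because the violating combinations simply never appear in the data used to fill the node.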





## References
