Finding front-door adjustment sets

In this article, we develop a CIfly algorithm that finds a front-door adjustment set in a DAG for treatment variables XX and outcome variables YY, or reports that no such set exists. Front-door adjustment is a classical strategy for identifying a causal effect. It uses a front-door adjustment set ZZ to express the causal effect of XX on YY through the formula P(ydo(x))=zP(zx)xP(yx,z)P(x)P(y \mid do(x)) = \sum_z P(z \mid x) \sum_{x'} P(y \mid x', z) P(x'). Pearl (1995) gives a sufficient criterion for a set of nodes ZZ to be a valid front-door adjustment set:

Theorem 1 [Pearl (1995)]
Let G=(V,E)G = (V, E) be a DAG and XX, YY, ZZ pairwise disjoint subsets of VV. Then, set ZZ is a front-door adjustment set relative to (X,Y)(X, Y) in GG if

  1. ZZ intercepts all directed paths from XX to YY,
  2. there is no unblocked proper back-door path from XX to ZZ and
  3. all proper back-door paths from ZZ and YY are blocked by XX.

To understand this criterion, some definitions are in order. A path from AA to BB is a back-door path if it starts with an edge aa \leftarrow for some aAa \in A. It is proper if it contains exactly one node from AA. It is unblocked if it is d-connected and blocked if it is d-separated.

Note that these are not necessary conditions for ZZ to be a front-door adjustment set. Still, when we refer to front-door adjustment sets in the following, we mean those that satisfy this criterion (even though there might be sets ZZ that can be used in the front-door formula but do not satisfy Pearl’s criterion). Our goal in this article is to develop an algorithm that finds a front-door set ZZ or decides that none exists.

General strategy for finding a front-door adjustment set

Even though we have a criterion that allows us to check whether a certain set ZZ is a front-door adjustment set, it doesn’t mean it is easy to find one. Naively, one could go through all possible sets ZZ and check for each whether it satisfies the criterion, but that would require time exponential in the input size. Hence, a more clever method is generally needed. Such a method was proposed by Jeong et al. (2022) and we follow it for the implementation of our CIfly algorithm.

The strategy for finding ZZ consists of the following steps:

  • first compute Z(i)Z_{(i)} to include all nodes that satisfy the second condition of the front-door criterion,
  • then compute Z(ii)Z_{(ii)} as the largest subset of Z(i)Z_{(i)} that also satisfies the third condition of the front-door criterion and
  • finally return the set Z(ii)Z_{(ii)} if it satisfies the first condition of the front-door criterion.

The reasoning behind this is that for satisfying the first condition of the front-door criterion, larger sets can only help in blocking the directed paths from XX to YY. Hence, we compute Z(ii)Z_{(ii)} as the largest possible set of nodes that satisfies the other two conditions. It will be a corollary of the algorithmic approach below that this set is unique. As we will see, the second step of this procedure is the tricky one. Thus, let us deal with the other two first.

Checking the first condition

We begin with the last step of the procedure that checks whether the first condition of the front-door criterion is satisfied. Here, our goal is to find out whether every directed path from XX to YY contains a node in ZZ, which corresponds to the question whether we reach YY from XX when stopping at nodes in ZZ. Hence, we can do this with the following rule table.

intercepted_paths_dag.txt
EDGES --> <--
SETS X, Z
START --> AT X
OUTPUT -->

--> | --> | current not in Z

If this search returns a node from YY, then the condition is violated.

Finding nodes satisfying the second condition

The second condition of the front-door criterion excludes nodes zz for which there exists a proper open back-door path from XX to zz. It easy to find those nodes with the following rule table (this rule table also allows to specify a conditioning set WW, which however is not needed here).

backdoor_connected_dag.txt
EDGES --> <--
SETS X, W
COLORS init, yield
START ... [init] AT X
OUTPUT ... [yield]

... [init]  | <-- [yield] | next not in X
--> [yield] | <-- [yield] | current in W
... [yield] | ... [yield] | next not in X and current not in W

The first transition from color init to yield is allowed only for back-door edges <--. Afterwards, only non-colliders are traversed.

Finding a set of nodes satisfying the third condition

Computing the nodes that can be part of a set satisfying the third condition of the front-door criterion is more complicated. The reason for this is that it needs to be ensured that there is no proper back-door path from ZZ to YY blocked by YY. This means that we cannot decide for individual nodes in isolation whether they should be part of Z(ii)Z_{(ii)} due to the fact that adding extra nodes to Z(ii)Z_{(ii)} may affect which proper paths exist. Instead, we will compute this set by identifying the nodes that cannot be in Z(ii)Z_{(ii)} with a single graph search following the approach by Wienöbst et al. (2024). This method guarantees that the remaining nodes form the unique maximum-size set of nodes satisfying the second and third condition of the front-door criterion. It relies on the following Lemma:

Lemma 1 [Wienöbst et al. (2024)]
Let GG be a DAG and XX, YY disjoint sets of vertices. Vertex vv is not in Z(ii)Z_{(ii)} if, and only if,

  • v∉Z(i)v \not\in Z_{(i)},
  • vYv \leftarrow Y, or
  • there exists an open back-door way (consisting of at least three variables) from vv to YY given XX, and all its nonterminal nodes are not in Z(ii)Z_{(ii)}.

Hence, we have a recursive definition of the nodes not in Z(ii)Z_{(ii)}. It already suggests a graph traversal visiting only nodes in Z(ii)Z_{(ii)} which connect to YY via a back-door path. The problematic bit is that if a node is visited, it may yet be unclear which parents of it are not in Z(ii)Z_{(ii)}. Thus, a parent might be identified later on as a node not in Z(ii)Z_{(ii)} (via some other path from YY) and then the search may need to be continued. It is possible to precisely characterize when this happens (for details, we refer to the proofs in the paper), yielding the following rule table which uses the sets YY, the ancestors of YY, Z(i)Z_{(i)} and XX.

frontdoor_forbidden_dag.txt
EDGES --> <--
SETS Y, A, Z, X
START <-- AT Y
OUTPUT ...

... | --> | current not in X
<-- | <-- | current not in X and next not in Z
--> | <-- | current in A and next not in Z

Putting everything together

We are now ready to give a full implemention of the algorithm for finding a front-door adjustment set. Before we look at the code, a few more remarks:

  • The code offers the user the option to constrain the front-door adjustment set to only include nodes from set RR and to always include nodes from set II. Hence, a set ZZ satisfying IZRI \subseteq Z \subseteq R is returned. In particular, this allows the modeling of unobserved variables by excluding them from RR. A different implementation option, that we have chosen in other articles, is to directly design the algorithm for ADMGs.
  • The algorithm will always return a front-door set of maximum size. Often this may not be the desired outcome. Wienöbst et al. (2024) also present algorithms for finding minimal front-door sets and for listing all front-door sets. Both can be implemented with CIfly, however, this goes beyond the scope of this article.

Adding the constraining set RR to the algorithm is fairly straightforward by restricting Z(i)Z_{(i)} and thus also Z(ii)Z_{(ii)} to these nodes. For set II we only need to check that it is part of Z(ii)Z_{(ii)} before returning, as Z(ii)Z_{(ii)} is the unique maximum size front-door set.

frontdoor.py
import ciflypy as cf
import ciflypy_examples.utils as utils

ruletables = utils.get_ruletable_path()
# re-use the ancestor table for ADMGs
ancestors_table = cf.Ruletable(ruletables / "ancestors_admg.txt")
backdoor_connected_table = cf.Ruletable(ruletables / "backdoor_connected_dag.txt")
frontdoor_forbidden_table = cf.Ruletable(ruletables / "frontdoor_forbidden_dag.txt")
intercepted_paths_table = cf.Ruletable(ruletables / "intercepted_paths_dag.txt")


def frontdoor(g, X, Y, I, R):
    Zi = set(R).difference(cf.reach(g, {"X": X}, backdoor_connected_table))
    A = cf.reach(g, {"X": Y}, ancestors_table)
    Zii = set(Zi).difference(
        cf.reach(g, {"Y": Y, "A": A, "Z": Zi, "X": X}, frontdoor_forbidden_table)
    )
    if set(I).issubset(Zii) and not set(Y).intersection(
        cf.reach(g, {"X": X, "Z": Zii}, intercepted_paths_table)
    ):
        return Zii
    else:
        return None

The Python and R code are also available in the CIfly Github repository. In your code, assign the value of the ruletables variable according to the location of the rule tables in your project. Alternatively, it is also possible to specify the rule tables as multi-line strings directly in the Python or R code by setting table_as_string=True, respectively, tableAsString=TRUE. For example, for the ancestors_admg table this looks as follows:

ancestors_table_string = """
EDGES --> <--, <->
SETS X
START <-- AT X
OUTPUT ...

... | <-- | true
"""
ancestors_table = cf.Ruletable(ancestors_table_string, table_as_string=True)

Finally, the function for finding front-door adjustment sets should be called as shown in the code snippet below:

# find front-door adjustment set for nodes 0 and 2
# in the DAG with the specified directed edges
g = {"-->": [(0, 1), (1, 2), (3, 0), (3, 2)]}
x = [0]
y = [2]
print(frontdoor(g, x, y, [], [1]))

References