Optimal adjustment sets

In this article, we show how to implement an algorithm for constructing the optimal adjustment set using CIfly. Covariate adjustment is a fundamental technique for identifying causal effects. Applying covariate adjustment to CPDAGs (a class of causal graphs that can be learned from data), allows us to learn causal effects from observational data by combining causal discovery and causal effect estimation. As there may be many different adjustment sets in a causal graph, it is an important question which sets are preferable over others. Henckel et al. (2022) provide a graphical criterion for an adjustment set guaranteed to be valid if there is a valid adjustment set for the target causal effect. Furthermore, the corresponding estimator is guaranteed to have the smallest asymptotic variance among all estimators corresponding to valid adjustment sets.

The main result in this regard is stated in the following theorem. Note that we focus here purely on the graphical aspects of this problem, for more statistical and causal background, we refer the reader to the original paper.

Theorem [Henckel et al. (2022)]
Consider disjoint node sets XX and YY in a CPDAG GG such that YpossdeG(X)Y \subseteq \mathrm{possde}_G(X) and the node set O=(paG(causalG(X,Y)))(forbG(X,Y)X).O=\left(\mathrm{pa}_G(\mathrm{causal}_G(X,Y))\right) \setminus (\mathrm{forb}_G(X,Y) \cup X). Then the following three statements hold:

  1. If there exists a valid adjustment set relative to (X,Y)(X,Y) in GG than so is OO.
  2. The causal effect estimator corresponding to OO has the smallest asymptotic variance among all estimators corresponding to valid adjustment sets for all causal models with CPDAG GG.
  3. Any other valid adjustment set with Property 2 is a superset of OO.

The criterion is formulated for CPDAGs, which may contain both directed and undirected edges. They do not contain directed cycles and satisfy additional properties, which we, however, do not discuss in this article. The graphical terminology in the criterion is introduced below.

Graphical terminology

Let us first revisit the notion of possible descendants, causal and forbidden sets. A path is possibly directed from XX to YY if every edge on it is either undirected or points towards the node in YY. For example, xuvyx \rightarrow u - v \rightarrow y is, for X={x}X = \{x\} and Y={y}Y = \{y\}, a possibly directed path from XX to YY. We call a node yy a possible descendant of XX if there exists a possibly directed path from XX to yy. The set possdeG(X)\mathrm{possde}_G(X) is the set of all possible descendants. In the criterion above we require that YpossdeG(X)Y \subseteq \mathrm{possde}_G(X). This may seem like a cumbersome requirement but if a node yy is not a possible descendant of XX then there is no causal effect of XX on yy, which is why we may disregard such nodes.

The causal set relative to (X,Y)(X,Y) in GG, denoted causalG(X,Y)\mathrm{causal}_G(X,Y), consists of all nodes lying on possibly directed paths from XX to YY excluding XX itself. The forbidden set relative to (X,Y)(X,Y) in GG, denoted forbG(X,Y)\mathrm{forb}_G(X,Y), is the set of all possible descendants of the causal set.

Finally, we introduce one further notion which we will use to verify whether valid adjustment set relative to (X,Y)(X,Y) in GG exists. A proper path between XX and YY is one which contains exactly one node from XX (the starting point of the path). A valid adjustment set relative to (X,Y)(X,Y) in GG does not exist if there exists a proper possibly directed path starting with an undirected edge from a node xXx \in X to YY. We call nodes vv to which there exists such a path from XX non-amenable. In our documentation articles for ciflypy and ciflyr, we already used the rule table below for finding all non-amenable nodes as a running example.

Checking whether valid adjustment sets exist

Since the optimal adjustment set OO is only valid if a valid adjustment set exists we first check whether this is the case. To do so we rely on Theorem 5 and Corollary 27 of Perković et al. (2018) which states that a valid adjustment set exists if no node in YY is non-amenable and no node in XX is forbidden. For finding all non-amenable nodes we use the rule table below that already served as a running example in our documentation. It finds all nodes vv for which there exists a proper possibly directed path starting with an undirected edge from a node xXx \in X to vv. Effectively, these are the nodes that cannot be in set YY.

not_amenable_cpdag.txt
EDGES --> <--, ---
SETS X
COLORS init, yield
START ... [init] AT X
OUTPUT ... [yield]

... [init]  | ---      [yield] | next not in X
... [yield] | ---, --> [yield] | next not in X

To check amenability, we use this ruletable for set XX and check whether the intersection of returned nodes with YY is empty.

For verifying the other condition of the adjustment criterion, we need to compute the forbidden set which we do in multiple steps. First, we find all ww on a proper possibly directed path from XX to YY. Afterwards, we find the possible descendants of these ww. The first step can be done by finding all nodes on a proper possibly directed path from XX that does not contain a node in YY and, vice versa, all nodes from which there exists a proper possibly directed path to YY not containing a node in XX. By definition, the intersection of these two node sets are precisely the nodes which lie on a proper possibly directed path from XX to YY, that is, the causal set.

Hence, we need a table for finding nodes on proper possibly directed paths from XX with the option to stop before nodes of a passed set WW are reached. Clearly, this table can also be used to find possible descendants.

possible_descendants_cpdag.txt
EDGES --> <--, ---
SETS X, W
START --> AT X
OUTPUT ...

... | -->, --- | next not in W

We also need to find the nodes from which there exists a proper possibly directed path to YY. For this, we can simply start the reachability algorithm at YY and traverse these paths the opposite way.

possible_ancestors_cpdag.txt
EDGES --> <--, ---
SETS X, W
START <-- AT X
OUTPUT ...

... | <--, --- | next not in W

Using these tables, we can compute the forbidden set (and the causal set) as follows.

anc = cf.reach(cpdag, {"X": Y, "W": X}, "./possible_ancestors_cpdag.txt")
des = cf.reach(cpdag, {"X": X, "W": []}, "./possible_descendants_cpdag.txt")
cn = set(anc).intersection(des)
forb = cf.reach(cpdag, {"X": cn}, "./possible_descendants_cpdag.txt")

To check that XX does not contain forbidden nodes, we use this code to compute the forbidden set and then check whether the intersection with XX is empty.

Constructing the optimal adjustment set

To construct the optimal set itself we need precisely the causal set and the forbidden set. We have already shown how to obtain those above. The only step that remains is computing the parents of the causal set, i.e., all nodes vv such that there exists an edge vcv \rightarrow c to some causal node cc. This is an easy task but it is nonetheless possible to inadvertantly implement a super-linear algorithm for it. In CIfly, however, it is rather easy to implement a linear-time algorithm by using the following rule table.

parents_cpdag.txt
EDGES --> <--, ---
SETS X
COLORS init, yield
START <-- [init] AT X
OUTPUT <-- [yield]

<-- [init] | <-- [yield] | next not in X

Full implementation

In the final implementation, we use the four tables introduced above to check whether YY consists of possible descendants of XX, to verify whether a valid adjustment set exists and finally to construct the optimal set simultanously. In your code, assign the ruletables variable according to the location of the rule tables in your project (in Python, the code uses a Path object from the pathlib library; in R, ruletables is a string holding the path to the folder).

optimal_adjustment.py
import ciflypy as cf
import ciflypy_examples.utils as utils

ruletables = utils.get_ruletable_path()
not_amenable_table = cf.Ruletable(ruletables / "not_amenable_cpdag.txt")
possible_ancestors_table = cf.Ruletable(ruletables / "possible_ancestors_cpdag.txt")
possible_descendants_table = cf.Ruletable(ruletables / "possible_descendants_cpdag.txt")
parents_table = cf.Ruletable(ruletables / "parents_cpdag.txt")


def optimal_adjustment(cpdag, X, Y):
    descendants = cf.reach(cpdag, {"X": X}, possible_descendants_table)
    if set(Y).difference(descendants):
        return None

    not_amenable = cf.reach(cpdag, {"X": X}, not_amenable_table)
    ancestors = cf.reach(cpdag, {"X": Y, "W": X}, possible_ancestors_table)
    cn = set(ancestors).intersection(descendants)
    forbidden = cf.reach(cpdag, {"X": cn}, possible_descendants_table)
    if set(forbidden).intersection(X) or set(Y).intersection(not_amenable):
        return None

    pre_opt = cf.reach(cpdag, {"X": cn}, parents_table)
    return set(pre_opt).difference(set(forbidden).union(X))

You can also find this code in the CIfly Github repository, see the Python and R version. Finally, the following short code snippet shows how this function can be called:

# find an optimal adjustment set for nodes x = 2 and y = 4
# in the CPDAG with the specified directed and undirected edges
cpdag = {"-->": [(1, 4), (3, 4), (2, 4)], "---": [(0, 1), (0, 3), (1, 3)]}

print(optimal_adjustment(cpdag, [2], [4]))

References