How to include unobserved confounders in the model? #616

Closed
AlxndrMlk opened this issue Aug 26, 2022 · 8 comments · Fixed by #1060

Comments

@AlxndrMlk
Contributor

Hi, thank you for a great package!

I was wondering if there's a possibility to include unobserved confounders in CausalModel.

Under certain conditions, the back-door and front-door criteria can give us correct causal estimands even in the face of unobserved confounding (e.g. Pearl, Glymour & Jewell, 2016).
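
For reference, the front-door adjustment formula from that literature, written here for a treatment X, a mediator Z, and an outcome Y (the variable names are an assumption, chosen to match the example later in this thread):

```latex
P(y \mid do(x)) = \sum_{z} P(z \mid x) \sum_{x'} P(y \mid x', z)\, P(x')
```

The right-hand side involves only distributions over X, Z, and Y, which is why an unobserved confounder of X and Y need not prevent identification when a suitable mediator exists.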

I tried including unobserved confounders by adding these variables to the graph but not including them in the data.

It seems that the model does not return correct estimates in such a case (although the model should theoretically be fully identified).

@amit-sharma
Member

amit-sharma commented Aug 29, 2022

@AlxndrMlk DoWhy does support unobserved confounders. You are adding them correctly, by not including those variables in the dataframe. The default behavior is that the identification algorithm considers the full graph and outputs estimands only in terms of the observed variables.

Can you share an example where you are obtaining an incorrect output?
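
A minimal sketch of the convention described above, not DoWhy's actual implementation: a graph variable counts as unobserved when it has no matching dataframe column. The helper name `split_observed`, the regex-based GML parsing, and the toy graph are assumptions made for illustration.

```python
import re

def split_observed(gml_graph, columns):
    """Partition graph node ids into (observed, unobserved) by column name."""
    # Node ids appear in GML as: id "NAME"
    node_ids = re.findall(r'id\s+"([^"]+)"', gml_graph)
    observed = [n for n in node_ids if n in columns]
    unobserved = [n for n in node_ids if n not in columns]
    return observed, unobserved

# Hypothetical graph: U is in the graph but has no column in the data
gml = (
    'graph [directed 1 '
    'node [id "X" label "X"] node [id "Z" label "Z"] '
    'node [id "Y" label "Y"] node [id "U" label "U"] '
    'edge [source "X" target "Z"] edge [source "Z" target "Y"] '
    'edge [source "U" target "X"] edge [source "U" target "Y"]]'
)
observed, unobserved = split_observed(gml, columns=["X", "Z", "Y"])
print(observed, unobserved)
```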

@AlxndrMlk
Contributor Author

AlxndrMlk commented Aug 29, 2022

Hi @amit-sharma,

thank you for the quick reply. I created a self-contained example that demonstrates the behavior I described:

import numpy as np
import pandas as pd
from dowhy.causal_model import CausalModel
from sklearn.linear_model import LinearRegression

# Create the graph describing the causal structure
graph = """
graph [
    directed 1
    
    node [
        id "X" 
        label "X"
    ]    
    node [
        id "Z"
        label "Z"
    ]
    node [
        id "Y"
        label "Y"
    ]
    node [
        id "U"
        label "U"
    ]
    
    edge [
        source "X"
        target "Z"
    ]
    
    edge [
        source "Z"
        target "Y"
    ]
    
    edge [
        source "U"
        target "Y"
    ]
    
    edge [
        source "U"
        target "X"
    ]
]
""".replace('\n', '')

N_SAMPLES = 10000

# Generate the data
U = np.random.randn(N_SAMPLES)
X = np.random.randn(N_SAMPLES) + 0.3*U
Z = 0.7*X + 0.3*np.random.randn(N_SAMPLES) 
Y = 0.65*Z + 0.2*U

# Data to df
df = pd.DataFrame(np.vstack([X, Z, Y]).T, columns=['X', 'Z', 'Y'])

# Create a model
model = CausalModel(
    data=df,
    treatment='X',
    outcome='Y',
    graph=graph
)

# Get the estimand
estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(estimand)

# Estimand type: nonparametric-ate

# ### Estimand : 1
# Estimand name: backdoor
# Estimand expression:
#  d        
# ────(E[Y])
# d[X]      
# Estimand assumption 1, Unconfoundedness: If U→{X} and U→Y then P(Y|X,,U) = P(Y|X,)

# ### Estimand : 2
# Estimand name: iv
# No such variable found!

# ### Estimand : 3
# Estimand name: frontdoor
# Estimand expression:
#  ⎡ d       d       ⎤
# E⎢────(Y)⋅────([Z])⎥
#  ⎣d[Z]    d[X]     ⎦
# Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
# Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
# Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

# Estimate the effect with front-door
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='frontdoor.linear_regression'
)

estimate.value
# Out[12] 0.511009176530488

# Compute expected output

# Model P(Z|X)
lr_zx = LinearRegression()
lr_zx.fit(
    X=df['X'].values.reshape(-1, 1),
    y=df['Z']
)

# Model P(Y|X, Z)P(X)
lr_yz = LinearRegression()
lr_yz.fit(
    X=df[['Z', 'X']],
    y=df['Y']
)

# Compute the expected causal effect
lr_zx.coef_ * lr_yz.coef_[0]

# Out[13] array([0.45161212])

# Sanity check -> compute naive estimate
lr_naive = LinearRegression()
lr_naive.fit(
    X=df['X'].values.reshape(-1, 1),
    y=df['Y']
)

lr_naive.coef_

# Out[14] array([0.51100918])

I might be missing something, but there are a couple of things that drew my attention:

  1. I don't fully understand why the back-door criterion is listed as valid for this graph; it seems to me that back-door adjustment does not give us identification in this case.
  2. It seems that model.estimate_effect(...) returns a value that is virtually identical to the naive estimate (see # Sanity check...) rather than the front-door adjusted estimate.

Am I using the correct model for this graph (frontdoor.linear_regression)?

What are your thoughts?


Environment details

Windows 11
Python 3.8
dowhy==0.6

@amit-sharma
Member

Ah, you are using an older version of dowhy. Can you update to the latest version, v0.8?

  1. When I ran the code, it did not show any valid backdoor variables.
  2. You need to use the two-stage regression estimator for frontdoor: method_name="frontdoor.two_stage_regression".


Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
No such variable(s) found!

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
Estimand expression:
Expectation(Derivative(Y, [Z])*Derivative([Z], [X]))
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)
### Estimand : 1
Estimand name: frontdoor
Estimand expression:
Expectation(Derivative(Y, [Z])*Derivative([Z], [X]))
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

## Realized estimand
(b: Z~X)*(b: Y~Z+X)
Target units: ate

## Estimate
Mean value: 0.4599358419804275
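
The realized estimand above, (b: Z~X)*(b: Y~Z+X), can be sketched without DoWhy. This is a standard-library illustration under the linear data-generating process from the example, not the library's estimator; the helper names `ols_slope` and `ols_first_coef`, the seed, and the sample size are assumptions made here.

```python
import random

def ols_slope(x, y):
    """Slope of y regressed on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def ols_first_coef(x1, x2, y):
    """Coefficient on x1 when y is regressed on x1 and x2 (with intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 normal equations for the first coefficient
    return (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

# Same structural equations as the example; U is simulated but never used in estimation
rng = random.Random(0)
n = 20000
U = [rng.gauss(0, 1) for _ in range(n)]
X = [rng.gauss(0, 1) + 0.3 * u for u in U]
Z = [0.7 * x + 0.3 * rng.gauss(0, 1) for x in X]
Y = [0.65 * z + 0.2 * u for z, u in zip(Z, U)]

stage1 = ols_slope(X, Z)           # b: Z~X
stage2 = ols_first_coef(Z, X, Y)   # coefficient on Z in b: Y~Z+X
frontdoor = stage1 * stage2
naive = ols_slope(X, Y)            # confounded by U

print(round(frontdoor, 3), round(naive, 3))
```

Conditioning on X in the second stage is what blocks the path from Z back through X and U to Y; the naive slope of Y on X has no such protection.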

@AlxndrMlk
Contributor Author

Thank you @amit-sharma, I updated to 0.8 and front-door works smoothly!

@AnselmJeong

@amit-sharma points out that in dowhy version 0.8.0 the unobserved confounder (a variable included in the graph but not in the dataset) is properly accounted for by frontdoor.two_stage_regression, and shows that it yields an estimate of 0.46, nearly identical to the true value of 0.455.
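
Both the true value and the naive value follow from the coefficients in the data-generating code above, assuming the standard-normal noise terms are independent, as they are in the simulation:

```latex
\begin{aligned}
\text{true effect} &= 0.7 \times 0.65 = 0.455 \\
\operatorname{Var}(X) &= 1 + 0.3^2 = 1.09, \qquad \operatorname{Cov}(X, U) = 0.3 \\
\text{naive slope of } Y \text{ on } X &= 0.455 + \frac{0.2 \cdot \operatorname{Cov}(X, U)}{\operatorname{Var}(X)} = 0.455 + \frac{0.06}{1.09} \approx 0.510
\end{aligned}
```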

However, when I ran the same example proposed by @AlxndrMlk with the corrected method frontdoor.two_stage_regression in dowhy version 0.8.0, estimate.value again came out to 0.51, identical to the (incorrect) naive regression coefficient, with the following warning:

WARNING:dowhy.causal_model:The graph defines 4 variables. 3 were found in the dataset and will be analyzed as observed variables. 1 were not found in the dataset and will be analyzed as unobserved variables. The observed variables are: '['X', 'Y', 'Z']'. The unobserved variables are: '['U']'. If this matches your expectations for observations, please continue. If you expected any of the unobserved variables to be in the dataframe, please check for typos.

Therefore, contrary to @amit-sharma's explanation, there seems to be no easy way to include an unobserved confounder in the model. The following message appears to support this as well:

Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

Don't the second and third assumptions (first-stage and second-stage unconfoundedness) imply that the influence of the unobserved confounder, U, has been ignored?

This result also reproduces in dowhy version 0.10.0.

@AnselmJeong

Also, the output of the following statement is not the same as @amit-sharma's:

estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name="frontdoor.two_stage_regression"
)
print(estimate)

Estimand : 1

Estimand name: frontdoor
Estimand expression:

 ⎡ d       d       ⎤
E⎢────(Y)⋅────([Z])⎥
 ⎣d[Z]    d[X]     ⎦
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

Realized estimand

(b: Z~X)*(b: Y~Z)
Target units: ate

Estimate

Mean value: 0.5004912411156469
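
A rough population-level check, assuming the same data-generating coefficients as the original example and independent noise terms: if the second stage regresses Y on Z alone, as the realized estimand (b: Z~X)*(b: Y~Z) suggests, the path from Z back through X and U to Y stays open, and the product lands near the value reported here.

```latex
\begin{aligned}
\operatorname{Var}(Z) &= 0.7^2 \cdot 1.09 + 0.3^2 = 0.6241, \qquad \operatorname{Cov}(Z, U) = 0.7 \cdot 0.3 = 0.21 \\
b_{Y \sim Z} &= 0.65 + \frac{0.2 \cdot 0.21}{0.6241} \approx 0.717 \\
b_{Z \sim X} \cdot b_{Y \sim Z} &\approx 0.7 \times 0.717 \approx 0.502
\end{aligned}
```

With X included in the second stage, (b: Y~Z+X), the coefficient on Z is 0.65 in the population, and the product returns to 0.455.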

@AlxndrMlk
Contributor Author

One of the readers of "Causal Inference & Discovery in Python" reported that they had a similar issue with front-door in DoWhy 0.10.0:

Hello Aleksander.
While running chapter 7 notebook, in the front-door case I got -0.33 (naive one) instead of -0.42 (real causal) using dowhy method as you suggest. This does not match with what is written in the book (that I believe is the correct figure -0.42)

But it worked for them correctly in 0.8:

Good morning Aleksander. With version 0.8 it works properly.

Have you tried to replicate the issue over multiple runs (and datasets), @AnselmJeong?

@amit-sharma
Member

Thank you @AlxndrMlk and @AnselmJeong for resurfacing this issue. Unfortunately, the error crept back in v0.10, while v0.8 works fine.

I have now added a fix through PR #1060. I have also included @AlxndrMlk's example as a test in the library so we never see this bug again in future versions of DoWhy.
