How to include unobserved confounders in the model? #616

AlxndrMlk · 2022-08-26T15:03:17Z

Hi, thank you for a great package!

I was wondering if there's a possibility to include unobserved confounders in CausalModel.

Under certain conditions back-door and front-door criteria can provide us with correct causal estimands even in face of unobserved confounding (e.g. Pearl, Glymour & Jewell, 2016).

I tried including unobserved confounders by adding these variables to the graph but not including them in the data.

It seems that the model does not return correct estimates in such a case (although the model should theoretically be fully identified).


amit-sharma · 2022-08-29T04:03:04Z

@AlxndrMlk DoWhy does support unobserved confounders. You are adding them correctly, by not including those variables in the dataframe. The default behavior is that the identification algorithm considers the full graph and outputs estimands only in terms of the observed variables.

Can you share an example where you are obtaining an incorrect output?
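As a minimal sketch of the convention described above (a node that appears in the graph but not in the dataframe is treated as unobserved), the split can be illustrated in plain Python. The helper name `split_observed` is hypothetical and is not part of DoWhy; it only mirrors the idea, not the library's internal code.

```python
def split_observed(graph_nodes, data_columns):
    """Partition graph nodes into observed / unobserved by dataframe membership.

    Hypothetical helper for illustration only; not a DoWhy function.
    """
    cols = set(data_columns)
    observed = [n for n in graph_nodes if n in cols]
    unobserved = [n for n in graph_nodes if n not in cols]
    return observed, unobserved


# Graph has X, Z, Y, U; the dataframe only carries X, Z, Y.
observed, unobserved = split_observed(["X", "Z", "Y", "U"], ["X", "Z", "Y"])
print("observed:", observed)
print("unobserved:", unobserved)
```

With a graph that contains U and a dataframe that does not, U ends up in the unobserved list, which is the situation the identification step is expected to handle.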

AlxndrMlk · 2022-08-29T19:14:43Z

Hi @amit-sharma,

thank you for a quick reply. I created a self-contained example that demonstrates the behavior I described:

import numpy as np
import pandas as pd
from dowhy.causal_model import CausalModel
from sklearn.linear_model import LinearRegression

# Create the graph describing the causal structure
graph = """
graph [
    directed 1
    
    node [
        id "X" 
        label "X"
    ]    
    node [
        id "Z"
        label "Z"
    ]
    node [
        id "Y"
        label "Y"
    ]
    node [
        id "U"
        label "U"
    ]
    
    edge [
        source "X"
        target "Z"
    ]
    
    edge [
        source "Z"
        target "Y"
    ]
    
    edge [
        source "U"
        target "Y"
    ]
    
    edge [
        source "U"
        target "X"
    ]
]
""".replace('\n', '')

N_SAMPLES = 10000

# Generate the data
U = np.random.randn(N_SAMPLES)
X = np.random.randn(N_SAMPLES) + 0.3*U
Z = 0.7*X + 0.3*np.random.randn(N_SAMPLES) 
Y = 0.65*Z + 0.2*U

# Data to df
df = pd.DataFrame(np.vstack([X, Z, Y]).T, columns=['X', 'Z', 'Y'])

# Create a model
model = CausalModel(
    data=df,
    treatment='X',
    outcome='Y',
    graph=graph
)

# Get the estimand
estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(estimand)

# Estimand type: nonparametric-ate

# ### Estimand : 1
# Estimand name: backdoor
# Estimand expression:
#  d        
# ────(E[Y])
# d[X]      
# Estimand assumption 1, Unconfoundedness: If U→{X} and U→Y then P(Y|X,,U) = P(Y|X,)

# ### Estimand : 2
# Estimand name: iv
# No such variable found!

# ### Estimand : 3
# Estimand name: frontdoor
# Estimand expression:
#  ⎡ d       d       ⎤
# E⎢────(Y)⋅────([Z])⎥
#  ⎣d[Z]    d[X]     ⎦
# Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
# Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
# Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

# Estimate the effect with front-door
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='frontdoor.linear_regression'
)

estimate.value
# Out[12] 0.511009176530488

# Compute expected output

# Model P(Z|X)
lr_zx = LinearRegression()
lr_zx.fit(
    X=df['X'].values.reshape(-1, 1),
    y=df['Z']
)

# Model P(Y|X, Z)P(X)
lr_yz = LinearRegression()
lr_yz.fit(
    X=df[['Z', 'X']],
    y=df['Y']
)

# Compute the expected causal effect
lr_zx.coef_ * lr_yz.coef_[0]

# Out[13] array([0.45161212])

# Sanity check -> compute naive estimate
lr_naive = LinearRegression()
lr_naive.fit(
    X=df['X'].values.reshape(-1, 1),
    y=df['Y']
)

lr_naive.coef_

# Out[14] array([0.51100918])

I might be missing something, but there are a couple of things that drew my attention:

I don't fully understand why the back-door criterion appears as a valid criterion for this graph; it seems to me that back-door adjustment does not give us identification in this case.
It seems that model.estimate_effect(...) returns a value that is virtually identical to the naive estimate (see # Sanity check...) rather than the front-door adjusted estimate.

Am I using the correct model for this graph (frontdoor.linear_regression)?

What are your thoughts?

Environment details

Windows 11
Python 3.8
dowhy==0.6
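For reference, the population values behind the numbers above can be worked out from the data-generating process in the example. This is a sketch that assumes the noise terms are independent standard normals, as in the simulation code, and writes them as \varepsilon_X and \varepsilon_Z (symbols introduced here for illustration).

```latex
\begin{aligned}
X &= \varepsilon_X + 0.3\,U, \qquad Z = 0.7\,X + 0.3\,\varepsilon_Z, \qquad Y = 0.65\,Z + 0.2\,U \\
\text{true effect of } X \text{ on } Y &= 0.7 \times 0.65 = 0.455 \\
\operatorname{Var}(X) &= 1 + 0.3^2 = 1.09, \qquad \operatorname{Cov}(X, U) = 0.3 \\
\beta_{\text{naive}} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  &= 0.455 + \frac{0.2 \times 0.3}{1.09} \approx 0.455 + 0.055 = 0.510
\end{aligned}
```

Under these assumptions a front-door estimate should land near 0.455, while a regression of Y on X alone should land near 0.51, which is consistent with the 0.4516 and 0.5110 values printed above.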

amit-sharma · 2022-08-31T05:18:33Z

Ah, you are using an older version of dowhy. Can you update to the latest version v0.8?

When I ran the code, it did not show any valid backdoor variables.
You also need to use the two-stage regression estimator for frontdoor: method_name="frontdoor.two_stage_regression".

Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
No such variable(s) found!

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
Estimand expression:
Expectation(Derivative(Y, [Z])*Derivative([Z], [X]))
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

### Estimand : 1
Estimand name: frontdoor
Estimand expression:
Expectation(Derivative(Y, [Z])*Derivative([Z], [X]))
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

## Realized estimand
(b: Z~X)*(b: Y~Z+X)
Target units: ate

## Estimate
Mean value: 0.4599358419804275
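The realized estimand (b: Z~X)*(b: Y~Z+X) can be reproduced by hand as a product of two regression coefficients. The following is a minimal standard-library sketch, not DoWhy's implementation: the helper names `ols_slope` and `ols_two` are hypothetical, the seed and sample size are arbitrary choices, and the data-generating process is copied from the example earlier in the thread.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares fit y ~ x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def ols_two(x1, x2, y):
    """Coefficients (b1, b2) of y ~ x1 + x2 (with intercept), via 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


# Same structure as the example: U confounds X and Y, Z mediates X -> Y.
rng = random.Random(0)  # arbitrary seed (assumption)
n = 20000
U = [rng.gauss(0, 1) for _ in range(n)]
X = [rng.gauss(0, 1) + 0.3 * u for u in U]
Z = [0.7 * x + 0.3 * rng.gauss(0, 1) for x in X]
Y = [0.65 * z + 0.2 * u for z, u in zip(Z, U)]

b_zx = ols_slope(X, Z)       # first stage:  b: Z~X
b_yz, _ = ols_two(Z, X, Y)   # second stage: coefficient on Z in b: Y~Z+X
frontdoor = b_zx * b_yz      # front-door estimate; U is never used
naive = ols_slope(X, Y)      # confounded estimate, for comparison

print("front-door:", round(frontdoor, 3))
print("naive:     ", round(naive, 3))
```

Controlling for X in the second stage is what blocks the path Z ← X ← U → Y, so the product should sit near the true 0.7 × 0.65 = 0.455 even though U is not in the data.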

AlxndrMlk · 2022-09-06T05:57:53Z

Thank you @amit-sharma, I updated to 0.8 and front-door works smoothly!

AnselmJeong · 2023-08-09T01:53:54Z

@amit-sharma points out that in dowhy version 0.8.0 the unobserved confounder (a variable included in the graph but not in the dataset) is properly considered when performing frontdoor.two_stage_regression, and he demonstrated an estimate of 0.46, nearly identical to the actual value of 0.455.

However, when I ran the same example proposed by @AlxndrMlk with the corrected method frontdoor.two_stage_regression in dowhy version 0.8.0, the estimate.value once again came out as 0.51, identical to the naive regression coefficient (which is incorrect), with the following warning.

WARNING:dowhy.causal_model:The graph defines 4 variables. 3 were found in the dataset and will be analyzed as observed variables. 1 were not found in the dataset and will be analyzed as unobserved variables. The observed variables are: '['X', 'Y', 'Z']'. The unobserved variables are: '['U']'. If this matches your expectations for observations, please continue. If you expected any of the unobserved variables to be in the dataframe, please check for typos.

Therefore, contrary to @amit-sharma's explanation, it seems there is no easy method to include the unobserved confounder in the model. The following message seems to point the same way.

Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

Don't the second and third unconfoundedness assumptions imply that the influence of the unobserved confounder, U, has been ignored?

This result is also replicated in dowhy version 0.10.0.

AnselmJeong · 2023-08-09T03:01:00Z

Also, the output of the following statement is not the same as in @amit-sharma's comment:

estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name="frontdoor.two_stage_regression"
)
print(estimate)

Estimand : 1

Estimand name: frontdoor
Estimand expression:

 ⎡ d       d       ⎤
E⎢────(Y)⋅────([Z])⎥
 ⎣d[Z]    d[X]     ⎦
Estimand assumption 1, Full-mediation: Z intercepts (blocks) all directed paths from X to Y.
Estimand assumption 2, First-stage-unconfoundedness: If U→{X} and U→{Z} then P(Z|X,U) = P(Z|X)
Estimand assumption 3, Second-stage-unconfoundedness: If U→{Z} and U→Y then P(Y|Z, X, U) = P(Y|Z, X)

Realized estimand

(b: Z~X)*(b: Y~Z)
Target units: ate

Estimate

Mean value: 0.5004912411156469
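One possible reading of this output, offered as a sketch and not as a confirmed diagnosis of the DoWhy code: the realized estimand here is (b: Z~X)*(b: Y~Z), whereas the earlier output in this thread shows (b: Z~X)*(b: Y~Z+X). If the second stage omits X, the path Z ← X ← U → Y stays open. Assuming independent standard-normal noise as in the example's data-generating process, the population value of that product works out as follows.

```latex
\begin{aligned}
\operatorname{Var}(Z) &= 0.7^2 \times 1.09 + 0.3^2 = 0.6241, \qquad
\operatorname{Cov}(Z, U) = 0.7 \times 0.3 = 0.21 \\
\beta_{Y \sim Z} = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Var}(Z)}
  &= 0.65 + \frac{0.2 \times 0.21}{0.6241} \approx 0.65 + 0.067 = 0.717 \\
\beta_{Z \sim X} \times \beta_{Y \sim Z} &\approx 0.7 \times 0.717 \approx 0.502
\end{aligned}
```

A value near 0.502 is close to the 0.5005 reported here and noticeably above the true effect of 0.455.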

AlxndrMlk · 2023-08-11T08:08:18Z

One of the readers of "Causal Inference & Discovery in Python" reported that they had a similar issue with front-door in DoWhy 0.10.0:

Hello Aleksander.
While running chapter 7 notebook, in the front-door case I got -0.33 (naive one) instead of -0.42 (real causal) using dowhy method as you suggest. This does not match with what is written in the book (that I believe is the correct figure -0.42)

But it worked for them correctly in 0.8:

Good morning Aleksander. With version 0.8 it works properly.

Have you tried to replicate the issue over multiple runs (and datasets), @AnselmJeong?

amit-sharma · 2023-11-04T17:24:19Z

Thank you @AlxndrMlk and @AnselmJeong for resurfacing this issue. Unfortunately the error crept back in with 0.10, while it works fine in v0.8.

I have now added a fix through PR #1060. I have also included @AlxndrMlk's example as a test in the library so we never see this bug again in future versions of DoWhy.

amit-sharma mentioned this issue Aug 31, 2022

check valid estimators for different identification strategies #618

Open

AlxndrMlk closed this as completed Sep 6, 2022

asha24choudhary mentioned this issue Oct 16, 2023

Identify effect not showing backdoor variable #1048

Closed

amit-sharma mentioned this issue Nov 4, 2023

Fix frontdoor estimation bug #1060

Merged
