My Notes on The Book of Why by Judea Pearl


I recently read “The Book of Why” by Judea Pearl.

Here are my notes:

Chapter 1

Tells the story of the science of how we distinguish facts from fiction

The new science of causality and causal inference

The fundamental gap between the vocabulary in which we cast causal questions and the traditional vocabulary in which we communicate scientific theories

Algebraic equations show equality and do not communicate causal relationships

Science has neglected to develop a language for causality for so many generations

Causal vocabulary was virtually prohibited for more than half a century

And when you prohibit speech you prohibit thought and stifle principles, methods and tools

In statistics 101 everyone learns to chant correlation is not causation

We live in an era that presumes big data to be the solution to all our problems

Data are profoundly dumb

Data can tell you people who took a medicine recovered faster than the people who did not take it, but they can’t tell you why - maybe the people who took the medicine did so because they could afford it and would have recovered just as fast without it

We see many instances where data are not enough

Scientists can address problems that would have once been considered unsolvable or beyond the pale of scientific inquiry

The causal revolution is a scientific shakeup that embraces rather than denies our innate understanding of cause and effect

The causal revolution did not happen in a vacuum

Its mathematical secret is the calculus of causation

I’m thrilled to unveil this calculus

The calculus of causation consists of two languages:

  • Causal diagrams to express what we know
  • A symbolic language resembling algebra to express what we want to know

Dots represent variables

Arrows represent causal relationships between the variables

If you can navigate one way streets then you can understand causal diagrams

The model should depict the system that generates the data

If we are interested in the effect of a drug D on lifespan L, then our query might be written P(L | do(D))

A method for predicting the effects of an intervention without enacting it
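
A minimal sketch of the difference between seeing and doing, using the medicine example from earlier in these notes with recovery R in place of lifespan L. Every number and name here (P_Z, P_D1_GIVEN_Z, P_R1_GIVEN_DZ, observational, interventional) is made up for illustration. It assumes wealth is the only confounder, so adjusting for wealth is enough to answer the do-query from observational quantities.

```python
# Hypothetical model: wealth Z affects both drug-taking D and recovery R.
# In this toy model the drug itself has no effect on recovery.
P_Z = {"wealthy": 0.5, "poor": 0.5}              # P(Z = z)
P_D1_GIVEN_Z = {"wealthy": 0.8, "poor": 0.2}     # P(D = 1 | Z = z)
P_R1_GIVEN_DZ = {                                # P(R = 1 | D = d, Z = z)
    (1, "wealthy"): 0.9, (0, "wealthy"): 0.9,
    (1, "poor"): 0.5, (0, "poor"): 0.5,
}


def p_d_given_z(d, z):
    """P(D = d | Z = z) for d in {0, 1}."""
    p1 = P_D1_GIVEN_Z[z]
    return p1 if d == 1 else 1.0 - p1


def observational(d):
    """P(R = 1 | D = d): what the raw data show."""
    p_d = sum(p_d_given_z(d, z) * P_Z[z] for z in P_Z)
    joint = sum(P_R1_GIVEN_DZ[(d, z)] * p_d_given_z(d, z) * P_Z[z] for z in P_Z)
    return joint / p_d


def interventional(d):
    """P(R = 1 | do(D = d)), adjusting for Z (assumed to be the only confounder)."""
    return sum(P_R1_GIVEN_DZ[(d, z)] * P_Z[z] for z in P_Z)


for d in (1, 0):
    print(f"D={d}: seeing {observational(d):.2f}, doing {interventional(d):.2f}")
```

With these made-up numbers, drug takers recover 82% of the time versus 58% for non-takers, yet the interventional probability is 70% either way, because the drug does nothing in this toy model.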

Classical statistics only summarizes data so it does not even provide a language for asking the question: would this person have died even if they didn’t take the drug

Our minds make reproducible judgments all the time about what might be or what might have been

Two people who share the same causal model will also share all counterfactual judgments

In the world of AI, you do not understand a topic unless you can teach it to a mechanical robot

Notation, language, vocabulary and grammar

My emphasis on language comes from a deep conviction that language shapes our thoughts

You cannot answer a question you cannot ask and you cannot ask a question you have no words for

As a student of computer science and philosophy, my attraction to causal inference is largely triggered by the excitement of seeing an orphaned scientific language making it from birth to maturity

Strong AI is an achievable goal and one not to be feared precisely because causality is part of the solution

A causal reasoning module will give machines the ability to reflect on their mistakes, to pinpoint weakness in their software, to function as moral entities, and to converse naturally with humans about their own choices and intentions

Inputs: assumptions, queries and data

  • Output 1: answerable?
  • Output 2: estimand
  • Output 3: answer to the query

We collect data only after we have the causal model, the query and the estimand

Contrasts with the traditional statistical approach that doesn’t even have a causal model

Awareness of causal models has grown by leaps and bounds among the sciences

Many researchers in AI would like to skip the step of constructing or acquiring a causal model and rely only on data for causal tasks

The hope is that data will guide us to the right answers whenever questions come up

I am an outspoken skeptic of this trend because I know how dumb data are about cause and effect

Information about the effects of actions and interventions is not available in raw data unless it is collected by controlled experimental manipulation

Counterfactuals: What would have happened had we acted differently

Why questions are counterfactuals in disguise

The estimand computed is good for any data collected

Chapter 1: three steps of observation, intervention and counterfactuals as the steps of the ladder of causation

Will be far ahead of generations of data scientists who attempted to interpret data through a model blind lens oblivious to the distinctions that the ladder of causation illuminates

Chapter 3: getting here from Bayesian networks; Pearl became convinced that he was wrong about Bayesian networks being the key to unlocking AI

Bayesian networks are an important tool for AI

Chapter 4: Major contribution of statistics to causal inference - the randomized controlled trial

Chapter 5: Statisticians struggled with the question of whether smoking causes lung cancer because scientists didn’t have an adequate language for answering causal questions

Chapter 6: Paradoxes including the Monty Hall paradox, Simpson’s paradox, and Berkson’s paradox

Chapter 7-9: Ascent of the ladder of causation - back door adjustment, front door adjustment, and instrumental variables

Two definitions of causality given by Hume: regularity and counterfactual (the latter is the right one)

Chapter 9: Mediation, the mechanism through which a variable acts

Chapter 10: Automating human level intelligence, strong AI

Summarizing the book in one pithy phrase: You are smarter than your data

There is no better way to understand ourselves than by emulating ourselves

In the age of computers, this also brings the prospect of amplifying our abilities so we can make sense of data, be it big or small

Chapter 2

Mathematization of Causality

Causality has been mathematized

Causal learner must master:

  • Seeing
  • Doing
  • Imagining

Personal Aside


Imagine a procedure is performed on 99% of the population (like a vaccine).

Now imagine 200 die from the procedure and 40 die from not having the procedure done out of every 10,000 people.

Is the procedure a good idea?
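
A quick sketch of the arithmetic behind this question. It assumes one reading of the numbers above: in a group of 10,000 people, 9,900 have the procedure and 100 do not, and the 200 and 40 deaths are counts within that same group. It also assumes, as a rough counterfactual, that the death rate seen among those without the procedure would apply to everyone if nobody had it. All variable names are made up for illustration.

```python
population = 10_000
treated = population * 99 // 100        # 99% have the procedure -> 9,900
untreated = population - treated        # the remaining 100

deaths_treated = 200                    # deaths attributed to the procedure
deaths_untreated = 40                   # deaths from not having it

# Raw counts (200 vs 40) compare groups of very different sizes,
# so compare rates instead.
rate_treated = deaths_treated / treated
rate_untreated = deaths_untreated / untreated

# Counterfactual question: what if nobody had the procedure?
# Assumes the untreated death rate would carry over to the whole population.
deaths_actual = deaths_treated + deaths_untreated
deaths_if_nobody_treated = rate_untreated * population

print(f"death rate with procedure:    {rate_treated:.2%}")
print(f"death rate without procedure: {rate_untreated:.2%}")
print(f"deaths now: {deaths_actual}, deaths if nobody treated: {deaths_if_nobody_treated:.0f}")
```

Under these assumptions the death rate is about 2% with the procedure and 40% without it, so the larger raw count among the treated reflects the size of that group rather than the danger of the procedure.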

Tall Dads and Tall Sons

Unusually tall sons usually have shorter dads

Unusually tall dads usually have shorter sons
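
A small simulation sketch of this regression to the mean. The height model is an assumption made up for illustration: dad and son share one inherited component and each gets independent noise on top, with heights in centimeters around a hypothetical mean of 175. Nothing in the model makes either one cause the other to be shorter, yet both statements above show up.

```python
import random
import statistics

MEAN_HEIGHT_CM = 175.0   # hypothetical population mean
CUTOFF_CM = 185.0        # hypothetical threshold for "unusually tall"


def simulate_pairs(n=50_000, seed=0):
    """Return (dad, son) heights that share one component plus independent noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        shared = rng.gauss(0, 5)                       # inherited part
        dad = MEAN_HEIGHT_CM + shared + rng.gauss(0, 5)
        son = MEAN_HEIGHT_CM + shared + rng.gauss(0, 5)
        pairs.append((dad, son))
    return pairs


pairs = simulate_pairs()
tall_dads = [(dad, son) for dad, son in pairs if dad > CUTOFF_CM]
tall_sons = [(dad, son) for dad, son in pairs if son > CUTOFF_CM]

avg_tall_dad = statistics.mean(dad for dad, _ in tall_dads)
avg_son_of_tall_dad = statistics.mean(son for _, son in tall_dads)
avg_tall_son = statistics.mean(son for _, son in tall_sons)
avg_dad_of_tall_son = statistics.mean(dad for dad, _ in tall_sons)

print(f"tall dads average {avg_tall_dad:.1f}, their sons {avg_son_of_tall_dad:.1f}")
print(f"tall sons average {avg_tall_son:.1f}, their dads {avg_dad_of_tall_son:.1f}")
```

The selected group is tall partly through luck (the independent noise), and the relative does not share that luck, so the relative's average falls back toward the population mean while still staying above it.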

Guinea Pig Variation

For guinea pigs

Among random populations, 42% of variation of fur is due to genetics, 58% is developmental

Among inbred populations, 3% of variation is due to heredity, 92% is developmental

Frequentists vs Bayesians vs Causalists

  • Frequentist objectivity
  • Bayesian subjectivity
  • Causal subjectivity

Causal inference is objective in one critically important sense

Once people agree on their assumptions, it provides a 100% objective way of interpreting any new evidence or data

Bayesian Probability

Bayesian probability

Bayesian networks

Chapter 3

Randomization is a way of eliminating confounders

The only cause of the input is the random card
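
A simulation sketch of why the random card matters. The setup is hypothetical: wealth raises the chance of recovery, the drug has no effect at all, and in the observational version wealthy people are more likely to take the drug. The function name trial and all probabilities are made up for illustration.

```python
import random


def trial(randomized, n=100_000, seed=1):
    """Return recovery rate of treated minus recovery rate of untreated."""
    rng = random.Random(seed)
    count = {0: 0, 1: 0}
    recovered = {0: 0, 1: 0}
    for _ in range(n):
        wealthy = rng.random() < 0.5
        if randomized:
            # The random card is the only cause of treatment.
            treated = rng.random() < 0.5
        else:
            # Observational setting: wealth also drives who gets treated.
            treated = rng.random() < (0.8 if wealthy else 0.2)
        # Recovery depends on wealth only; the drug does nothing here.
        recovers = rng.random() < (0.9 if wealthy else 0.5)
        count[int(treated)] += 1
        recovered[int(treated)] += int(recovers)
    return recovered[1] / count[1] - recovered[0] / count[0]


print(f"observational gap: {trial(randomized=False):+.3f}")
print(f"randomized gap:    {trial(randomized=True):+.3f}")
```

In the observational run the useless drug looks helpful because wealth sits behind both treatment and recovery. In the randomized run that link is cut, and the gap shrinks to roughly zero.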

The skillful interrogation of nature

Randomized controlled trials > controlled trials > observational studies > non-observational claims

Chapter 5

Hill’s criteria / viewpoints:

  1. Consistency
  2. Strength of association
  3. Specificity of association
  4. Temporal relationship
  5. Coherence / biological plausibility

Chapters 6+

Philosophy and Causality


  • Material cause
  • Formal cause
  • Efficient cause
  • Final cause


Counterfactuals at the heart of causality

Personal Aside

Causality at the most fundamental physical level only refers to what particles are colliding with what other particles

We can translate that to the level of human comprehension through common language, where we give labels to a collection of particles

Then we can know that one object’s actions have influenced another object and caused a change, and we do this by thinking through counterfactuals

Because many forces can be continuously acting on the caused object, we have to eliminate all forces until one is left, and can say that the effect wouldn’t have been produced if it hadn’t been for the particular force from the particular object

Partial Attribution

For heat waves 1.6 degrees above normal, 50% of the risk can be attributed to climate change

Necessity vs Sufficiency

P_n (necessity) vs P_s (sufficiency)

P_n should be used over a short period

P_s should be used over a long period
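
For reference, a sketch of how these two quantities are usually written as counterfactual probabilities in Pearl's framework. The symbols are introduced here for illustration and are not from these notes: X is the candidate cause with values x (present) and x' (absent), Y is the outcome with values y (occurred) and y' (did not occur), and Y_x means "the value Y would take had X been x".

```latex
P_n = P\left(Y_{x'} = y' \mid X = x,\ Y = y\right)
\qquad
P_s = P\left(Y_{x} = y \mid X = x',\ Y = y'\right)
```

P_n asks: given that the cause and the outcome both happened, would the outcome have been absent without the cause? P_s asks: given that neither happened, would adding the cause have produced the outcome?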

Complex climate models are seemingly much more reliable than the linear models in the social sciences

But could contain systemic errors

Effect of Treatment

ETT (effect of treatment on the treated)

ACE (average causal effect)

What if the ones subjected are the least likely to benefit from it?

Mediation analysis


Mediator is the one who transmits the effect of the treatment to the outcome

Not obvious that direct and indirect effects involve counterfactual statements

Fairness in Acceptance

44% of men were accepted

35% of women were accepted

Investigate reasons for disparity

Data showed the decisions were actually more favorable to women than men

Females applied to harder departments like sociology and other humanities

Females didn’t apply as often to departments that were easier to get into like mechanical engineering

Distinguish between bias and discrimination

Bias is a pattern of association between a particular decision and a particular sex of applicant

Discrimination is the exercise of decision influenced by the sex of the applicant when that is immaterial to the qualifications for entry

Data should be stratified by department because those are the decision making units

It appeared that admission officials were attempting to overcome long established shortages in their fields

This is what Berkeley concluded
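
A sketch of how an overall gap in favor of men can coexist with every department favoring women. The two departments and all counts below are made up for illustration and are not Berkeley's figures. The only pattern borrowed from the notes above is that women mostly apply to the department that is harder to get into.

```python
# Hypothetical data: department -> sex -> (applicants, admitted)
applications = {
    "easy": {"men": (800, 500), "women": (100, 70)},
    "hard": {"men": (200, 40), "women": (900, 225)},
}


def overall_rate(sex):
    """Admission rate for one sex, pooling all departments together."""
    applied = sum(dept[sex][0] for dept in applications.values())
    admitted = sum(dept[sex][1] for dept in applications.values())
    return admitted / applied


def department_rates(sex):
    """Admission rate for one sex within each department (the stratified view)."""
    return {name: dept[sex][1] / dept[sex][0] for name, dept in applications.items()}


print(f"overall: men {overall_rate('men'):.1%}, women {overall_rate('women'):.1%}")
men, women = department_rates("men"), department_rates("women")
for name in applications:
    print(f"{name}: men {men[name]:.1%}, women {women[name]:.1%}")
```

With these made-up counts men are admitted at 54% overall versus 29.5% for women, yet women do better in the easy department (70% vs 62.5%) and in the hard one (25% vs 20%). The pooled gap comes entirely from which departments each group applied to.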


Local fairness everywhere means global fairness

Outdated Baron-Kenny Method

Baron-Kenny method

The paper ranks 33rd in citations among all papers

2nd in psychology

Cited more than Einstein and Freud

Outdated methodology

Algebra Introduction

When algebra for all was instituted

Low performing students made their way to classrooms with high performing students

They might have gotten less attention

The class was taught to a higher median

They might have performed worse on tests and learned less as a result

The teachers of high performers had less experience teaching low performers

The teachers of low performers were not skilled enough to teach high performing math

Thus the algebra for all program affected the mediator of environment on the way to learning

Double dose algebra succeeded

Two classes of algebra per day for students who performed below the median