2 Probability Spaces and Random Variables

2.1 Probability Spaces

2.1.1 Definition

Let \(\Omega\) be a set and let \(\mathcal{P}(\Omega)\) be its power set, i.e. the family \(\left\{A\colon A\subset \Omega\right\}\).

Definition 2.1 A family of sets \(\mathscr{F}\subset \mathcal{P}\left(\Omega\right)\) is called a \(\sigma\)-algebra if it has the following properties:

  • \(\Omega\in\mathscr{F}\) and \(\emptyset\in\mathscr{F}\).

  • If \(A\in\mathscr{F}\) then \(A^c\in\mathscr{F}\).

  • If \((A_n)_{n\in\mathbb{N}}\subset\mathscr{F}\) is a countable collection of subsets in \(\mathscr{F}\), then \[\bigcup_{n=1}^\infty A_n\in\mathscr{F}.\] I.e. \(\mathscr{F}\) is closed under countable unions.

The pair \((\Omega,\mathscr{F})\) is called a measurable space.

Definition 2.2 Let \((\Omega,\mathscr{F})\) be a measurable space. A function \[\mathbb{P}\colon \mathscr{F}\to \mathbb{R}^+\] is called a probability measure if it satisfies the following properties:

  • \(\mathbb{P}(\Omega) = 1\) and \(\mathbb{P}(\emptyset)=0\).

  • \(\mathbb{P}(A)\geq 0\) for all \(A\in\mathscr{F}\).

  • If \((A_n)_n\subset \mathscr{F}\) is a countable collection of pairwise disjoint sets, i.e. \[ A_i \bigcap A_j = \emptyset \quad \forall i\neq j, \] then \[ \mathbb{P}\left(\bigcup_n A_n\right) = \sum_n \mathbb{P}(A_n). \]

The triplet \((\Omega,\mathscr{F},\mathbb{P})\) is called a probability space.
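To make Definitions 2.1 and 2.2 concrete, here is a minimal Python sketch (not part of the original notes) that builds the power set of a small finite \(\Omega\), equips it with the uniform measure, and checks the defining properties; all names are illustrative.

```
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4})  # e.g. a fair four-sided die

def power_set(s):
    # all subsets of s; for a finite Omega, P(Omega) is itself a sigma-algebra
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = power_set(omega)

def P(A):
    # uniform probability measure: P(A) = |A| / |Omega|
    return len(A) / len(omega)

# the properties of Definition 2.2
assert P(omega) == 1 and P(frozenset()) == 0
assert all(P(A) >= 0 for A in F)
A, B = frozenset({1}), frozenset({2, 4})  # disjoint events
assert P(A | B) == P(A) + P(B)            # additivity on disjoint sets
```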

2.1.2 First Examples

Example 2.1 Set

  • \(\Omega=[0,1]\),
  • \(\mathscr{F}=\mathscr{B}([0,1])\) (safely ignore), and
  • \(\mathbb{P}(A)=length(A)\)
    • E.g. If \(A=[a,b]\) then \(\mathbb{P}(A)=b-a\)

Example 2.2 Set

  • \(\Omega=[0,1]^n\),
  • \(\mathscr{F}=\mathscr{B}([0,1]^n)\) (safely ignore), and
  • \(\mathbb{P}(A)=volume(A)\),
    • E.g. If \(n=2\) and \(A=[a,b]\times[c,d]\) then \(\mathbb{P}(A)=(b-a)(d-c)\).
    • E.g. let \(A=\bigcup_{n=1}^\infty A_n\), where the \(A_n\) are pairwise disjoint squares with side length \(\frac{1}{2^n}\).

In this case \(\mathbb{P}\left(A\right)=\sum_{n=1}^\infty \left(\frac{1}{2^2}\right)^n = 1/3\).
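As a quick numerical sanity check of this series (a sketch, not from the original notes):

```
# partial sums of sum_{n>=1} (1/2^2)^n approach 1/3
s = sum((1 / 2**2) ** n for n in range(1, 60))
print(s)             # 0.3333333333333333
print(abs(s - 1/3))  # ~0, up to floating-point error
```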

2.1.3 Important Examples

2.1.3.1 Probability Measures with Density on \(\mathbb{R}\) (or \(\mathbb{R}^+\), etc)

Definition 2.3 (Probability Measure with Density) Let \(f\colon D\subset\mathbb{R}\to \mathbb{R}^+\) be such that \(\int_D f(x)\hspace{3pt}dx = 1\). Such an \(f\) is called a density function on \(D\). Set

  • \(\Omega=D\),
  • \(\mathscr{F}=\mathscr{B}(D)\) (safely ignore), and
  • \(\mathbb{P}(A)=\int_{A\cap D} f(x) \hspace{3pt}dx\)1
    • E.g. If \(D=\mathbb{R}\) and \(A=[a,b]\) then \(\mathbb{P}(A)=\int_a^b f(x)\hspace{3pt}dx\).
A probability measure with such representation is said to have density \(f\).

Example 2.3 (Exponential Measure) Let \(D=\mathbb{R}^+\). For any \(\lambda>0\) set \[f(x)= \lambda e^{-\lambda x}.\] This \(f\) satisfies \[\int_0^\infty f(x) \hspace{3pt}dx=\int_0^\infty \lambda e^{-\lambda x} \hspace{3pt}dx= 1.\]

Thus we may define a continuous probability measure \(\mathbb{P}\) on \(\mathbb{R}^+\) which satisfies, for example, \[\mathbb{P}([a,\infty))=\int_a^\infty \lambda e^{-\lambda x}\hspace{3pt}dx=e^{-\lambda a}.\]

For instance, with \(\lambda=0.5\) we get \(\mathbb{P}[(2,\infty)]=\int_2^\infty 0.5e^{-0.5x}\hspace{3pt}dx = e^{-1}\).
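The same tail probability can be checked numerically; a minimal sketch, assuming numpy and scipy are available:

```
import numpy as np
from scipy.integrate import quad

lam = 0.5
tail, _ = quad(lambda x: lam * np.exp(-lam * x), 2, np.inf)
print(tail, np.exp(-1))  # both approximately 0.36788
```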

2.1.3.2 Discrete Probability Measures on \(\mathbb{R}\) (or \(\mathbb{N}\), etc)

Definition 2.4 (Discrete Probability Measure) Let \((a_n)_n\subset\mathbb{R}\) be a countable collection of points in \(\mathbb{R}\) (resp. \(\mathbb{N}\), etc), and let \((p_n)_n\subset [0,1]\) be such that \[\sum_n p_n = 1.\] Set

  • \(\Omega=\mathbb{R}\),
  • \(\mathscr{F}=\mathscr{B}(\mathbb{R})\) (safely ignore),
  • \[\mathbb{P}(A)=\sum_{n\colon a_n\in A}p_n \]
    • E.g. If \(a_n=n\) for \(1\leq n \leq N\) and \(p_n=1/N\), then \(\mathbb{P}(\{1,2\}) = \frac{1}{N} + \frac{1}{N}=\frac{2}{N}\)
A probability measure that can be written in this way is called a discrete probability measure.

Example 2.4 (Geometric Measure) Let \(a_0=0, a_1=1, a_2=2, \dots\) and fix any \(p\in[0,1);\) define \(p_n=p^n(1-p)\) for \(n\in\{0,1,\dots\}\) and note that these satisfy2 \[ \begin{aligned} \sum_{n=0}^\infty p^n(1-p) &= (1-p)\sum_{n=0}^\infty p^n \\ &= \frac{1-p}{1-p} \\ &= 1. \end{aligned} \]

Thus we may define a probability measure \(\mathbb{P}\) on \(\mathbb{N}\) (or on \(\mathbb{R}\)). For any \(k\in \mathbb{N}\) this probability measure evaluated on the set \(\{0,\dots,k\}\) gives \[ \begin{aligned} \mathbb{P}(\{0,\dots,k\}) &= (1-p)\sum_{n=0}^k p^n \\ &= (1-p)\frac{1-p^{k+1}}{1-p}\\ &=1-p^{k+1}. \end{aligned} \]
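A quick numerical check of the identity \(\mathbb{P}(\{0,\dots,k\})=1-p^{k+1}\) (illustrative values of \(p\) and \(k\)):

```
p, k = 0.3, 5
partial = sum(p**n * (1 - p) for n in range(k + 1))  # P({0,...,k}) term by term
print(partial, 1 - p**(k + 1))                       # agree up to floating point
```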

Exercise 2.1 Invent one continuous probability measure and one discrete probability measure.
Exercise 2.2 If \(A,B\in\mathscr{F}\) prove that \[ \mathbb{P}(A) = \mathbb{P}(A\cap B)+ \mathbb{P}(A \cap B^c). \] If, furthermore, \(B\subset A\), prove that \[ \mathbb{P}(A) = \mathbb{P}(B) + \mathbb{P}(A\cap B^c) \] and conclude that \(\mathbb{P}(B)\leq\mathbb{P}(A).\)

2.1.4 Properties of Probability Measures

Theorem 2.1 Let \((B_n)_n\subset\mathscr{F}\) be a partition of \(\Omega\), i.e.

  • \(\bigcup_n B_n = \Omega\), and
  • \(B_n \bigcap B_m = \emptyset \quad \forall n\neq m\),
Then, for any \(A\in\mathscr{F}\) we have the following equality \[ \mathbb{P}\left(A\right) = \sum_{n} \mathbb{P}\left(A\bigcap B_n\right). \]

Theorem 2.2 Let \((\Omega,\mathscr{F}, \mathbb{P})\) be a probability space. The following equalities hold.

  • If \((A_n)_{n\in\mathbb{N}}\subset \mathscr{F}\) and \[ A_{n}\subset A_{n+1} \quad\forall n\in\mathbb{N}, \] then \[ \mathbb{P}\left(\bigcup_{n} A_n\right) = \lim_{n\to\infty} \mathbb{P}\left(A_n\right). \]

  • If \((A_n)_{n\in\mathbb{N}}\subset \mathscr{F}\) and \[ A_{n+1}\subset A_{n} \quad\forall n\in\mathbb{N}, \] then \[ \mathbb{P}\left(\bigcap_{n} A_n\right) = \lim_{n\to\infty} \mathbb{P}\left(A_n\right). \]


2.2 Random Elements

2.2.1 Single Random Variables

Definition 2.5 (Random Variable) Let \((\Omega, \mathscr{F}, \mathbb{P})\) be a probability space. A function \[ X\colon \Omega \to \mathbb{R} \] is called a random variable.
Remark. Observe that a function of a random variable, say \(h(X)\), is again a random variable.
Definition 2.6 (Distribution) Let \(X\) be a random variable defined on a probability space \((\Omega, \mathscr{F}, \mathbb{P})\). The probability measure on \(\mathbb{R}\) induced by \(X\), which is given by \[ \mathbb{P}(X\in A) :=\mathbb{P}\left(\left\{\omega\in\Omega\colon X(\omega)\in A \right\}\right), \quad A\in\mathscr{B}(\mathbb{R}), \] is called its distribution or its law.

Example 2.5 Recall Example 2.1 where \(\Omega = [0,1]\) and \(\mathbb{P}(A)=length(A)\). Let \(N\in\mathbb{N}\) and define the function \(X\colon [0,1]\to\mathbb{N}\) given by

\[ X(\omega):=n \quad \text{ if }\quad \frac{n-1}{N} \leq \omega < \frac{n}{N}, \quad n\in\{1,\dots,N\}, \] with \(X(1):=N\), so that \(X\) is well defined on all of \([0,1]\).

In this case \(X\) has the (discrete) uniform distribution on the set \(\{1,\dots, N\}\) since all the points have the same probability measure, that is \[ \forall i,j\in\{1, \dots, N\},\quad\mathbb{P}(X=i)=\frac{1}{N}=\mathbb{P}(X=j). \]
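The construction in Example 2.5 can be simulated directly: draw \(\omega\) uniformly from \([0,1]\) and map it through \(X\). A minimal sketch (numpy assumed):

```
import numpy as np

rng = np.random.default_rng(0)
N = 6
omega = rng.uniform(0, 1, size=100_000)  # draws from Omega = [0, 1] under P = length
X = np.ceil(N * omega).astype(int)       # X(omega) = n when omega is in ((n-1)/N, n/N]
# empirical frequencies of 1, ..., N; each should be close to 1/N
print(np.bincount(X, minlength=N + 1)[1:] / X.size)
```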
Exercise 2.3 Let \(\Omega=[0,1]\) and \(\mathbb{P}(A)=length(A)\). Construct a random variable \(X\colon [0,1]\to \mathbb{N}\) with geometric distribution.

In this course we will only deal with distributions of the forms presented in Subsection 2.1.3.

Definition 2.7 (Continuous and Discrete Random Variables.) Let \(X\) be a random variable.

  • \(X\) is a continuous random variable, or its distribution is said to be continuous, if \[ \mathbb{P}(X\in A) = \int_A f_X(x) \hspace{3pt}dx \] for some density function \(f_X\).

  • \(X\) is a discrete random variable, or its distribution is said to be discrete, if \[ \mathbb{P}(X\in A) = \sum_{a_n\in A} p_n \] where \((a_n)_n\) is a countable collection of points in \(\mathbb{R}\) and \((p_n)_n\) is a collection of probabilities that sum to 1.
Remark. Continuous distributions are in fact more general than defined above; however, in applications (in statistical analysis and even in probabilistic modelling) we almost exclusively deal with distributions having a density \(f\), so for all practical purposes there is no loss in considering only this type of continuous random variables.

2.2.2 Random Vectors

In general we often want to deal with multiple random variables at once. This is why, at this point, it is useful to think of an abstract probability space \((\Omega, \mathscr{F}, \mathbb{P})\) and simply assume that it is rich enough that we can define, or assure the existence of, random variables with any properties that might interest us. In fact, we often make no mention at all of the underlying probability space \((\Omega, \mathscr{F}, \mathbb{P})\) and simply talk about random variables, random vectors, etc., and their distributions.

Definition 2.8 (Random Vectors) Let \((\Omega,\mathscr{F},\mathbb{P})\) be a probability space. A function of the form \[ \begin{aligned} \bar{X}\colon \Omega &\to \mathbb{R}^n \\ \omega &\mapsto \bar{X}(\omega)=(X_1(\omega),\cdots,X_n(\omega)) \end{aligned} \] is called a random vector.

Random vectors induce a probability measure on \(\mathbb{R}^n\) by considering, for any \(A\in\mathscr{B}(\mathbb{R}^n)\) \[ A\mapsto \mathbb{P}(\bar{X}\in A) = \mathbb{P}\left(\left\{\omega\in\Omega\colon \bar{X}(\omega)\in A \right\}\right). \] The probability measure induced by a random vector is called its distribution.

Similarly as in the case of single random variables, we may consider continuous random vectors whose distribution is given by a density function \(f_{\bar{X}}\colon \mathbb{R}^n \to \mathbb{R}^+\), and discrete random vectors whose distribution is given by a countable collection of points and probabilities \((a_n)_n\) and \((p_n)_n\).

Definition 2.9 (n-dimensional density) A function \(f\colon \mathbb{R}^n\to\mathbb{R}^+\) is a density function on \(\mathbb{R}^n\) if \[ \int_{\mathbb{R}^n} f(x_1,\dots,x_n) \hspace{3pt}dx_1\cdots dx_n = 1. \]
Definition 2.10 (n-dimensional discrete distribution) A collection of points \((a_k)_k\subset \mathbb{R}^n\) and corresponding probabilities \((p_k)_k\) such that \[ \sum_{k=1}^\infty p_k = 1, \] determine a discrete distribution in \(\mathbb{R}^n\) in the same way as in the single random variable case.
Remark. A random vector \(\bar{X}\) may not fall in any of the categories above, for example it is possible that some of its coordinates \((X_1,\dots,X_n)\) are continuous while others are discrete random variables.

Example 2.6 (Continuous Random Vector Density) Let \(f_1,\dots,f_n\) be \(n\) density functions on \(\mathbb{R}\). Define a density function \(f\) on \(\mathbb{R}^n\) as \[ f(x_1,\dots,x_n) = \prod_{k=1}^n f_k(x_k), \] and consider a random vector \(\bar{X}=(X_1,\dots,X_n)\) with density function \(f\).

In particular note that for any \(A_1,\dots,A_n\in\mathscr{B}(\mathbb{R})\) and \(A=A_1\times\dots\times A_n\) we have

\[ \begin{aligned} \mathbb{P}(\bar{X}\in A) &= \mathbb{P}\left(X_1\in A_1,\dots, X_n\in A_n\right) \\ &=\int_{A_1}\dots \int_{A_n} f(x_1, \dots, x_n) \hspace{3pt}dx_n \dots \hspace{3pt}dx_1 \\ &=\int_{A_1}\dots \int_{A_n} \prod_{k=1}^n f_k(x_k) \hspace{3pt}dx_n \dots \hspace{3pt}dx_1 \\ &=\prod_{k=1}^n \int_{A_k} f_k(x) \hspace{3pt}dx \\ &=\prod_{k=1}^n \mathbb{P}(X_k\in A_k). \end{aligned} \]


Example 2.7 (Random Parameter - Exponential Case) Recall the exponential distribution introduced in Example 2.3. Let \(f\) be a density function on \(\mathbb{R}^+\). Define the density \(g\colon (\mathbb{R}^+)^2\to\mathbb{R}^+\) as \[ g(\lambda, x) = \lambda e^{-\lambda x}f(\lambda). \]

Let \((\Lambda, X)\) be a random vector with density \(g\). The distribution of \(X\) can be interpreted algorithmically: 1) first choose a random parameter \(\Lambda\) according to the density \(f\), 2) then generate a random variable \(X\) that is exponentially distributed of parameter \(\Lambda\).
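This two-step interpretation translates directly into a sampler. A sketch where, for concreteness, \(f\) is taken to be a Gamma(3, 1) density (my choice, not the notes'); it also anticipates the identity \(\mathbb{E}[X]=\mathbb{E}[1/\Lambda]\) of Example 2.18 below:

```
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# step 1: draw the random parameter Lambda from a density f on R+
# (here f is a Gamma(3, 1) density -- an illustrative choice)
Lam = rng.gamma(shape=3.0, scale=1.0, size=n)
# step 2: given Lambda = lam, draw X exponentially distributed of parameter lam
X = rng.exponential(scale=1.0 / Lam)
print(X.mean(), (1.0 / Lam).mean())  # both estimate E[1/Lambda] = 1/2 here
```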
Exercise 2.4 Argue why the function \(g\) defined in Example 2.7 above is a density function on \((\mathbb{R}^+)^2\).
Definition 2.11 (Bernoulli Distribution) The Bernoulli distribution of parameter \(p\in[0,1]\) (the probability of success) is the discrete distribution that takes the value \(0\) with probability \(1-p\) and \(1\) with probability \(p\). I.e. if \(X\) is Bernoulli distributed with parameter \(p\) then \[ \mathbb{P}(X = x) = \begin{cases} p &\text{ if } x = 1,\\ 1-p &\text{ if } x = 0,\\ 0 &\text{ otherwise}. \end{cases} \]
Example 2.8 (Random Parameter - Bernoulli Case) Let \(f\) be a density function on \([0,1]\) and let \(\mathbb{P}\) be a probability measure on \([0,1]\times \{0,1\}\) such that for any \(A\in\mathscr{B}([0,1])\) we have \[ \mathbb{P}(A\times \{1\}) = \int_A pf(p) \hspace{3pt}dp \] and \[ \mathbb{P}(A\times \{0\}) = \int_A (1-p)f(p) \hspace{3pt}dp. \] Let \((\rho, X)\) be a random vector with distribution \(\mathbb{P}\). The distribution of \(X\) can be interpreted algorithmically: 1) first choose a random parameter \(\rho\) according to the density \(f\), 2) then generate a random variable \(X\) that is Bernoulli distributed of parameter \(\rho\).
Exercise 2.5 Argue why the probability measure \(\mathbb{P}\) defined in Example 2.8 is a valid probability measure on \([0,1]\times\{0,1\}\).
Exercise 2.6 Recall the geometric distribution introduced in Example 2.4. Invent a probability measure \(\mathbb{P}\) on \([0,1]\times \mathbb{Z}^+\) such that if \((\rho, X)\) has distribution \(\mathbb{P}\), then \(\rho\) is a continuous random variable on \([0,1]\) with some density \(f\), and such that the distribution of \(X\) can be interpreted algorithmically as: 1) first choose a random parameter \(\rho\) according to the density \(f\), 2) then generate a random variable \(X\) that is geometrically distributed of parameter \(\rho\).

2.3 Conditional Probability and Independence

2.3.1 Intuition

Let \((\Omega,\mathscr{F},\mathbb{P})\) be a probability space and let \(A\in\mathscr{F}\) be a set with positive measure, i.e. \[ \mathbb{P}\left(A\right)>0. \] Let \(B\in\mathscr{F}\) be another set. We would like to know the probability that \(B\) occurs given that we already know that \(A\) occurs; that is, the updated probability of the event \(B\) after we weigh in the extra information that \(A\) occurs. For example, if \(B\cap A=\emptyset\) then knowing that \(A\) occurs tells us that \(B\) does not occur, so in this case \(\mathbb{P}(B\vert A) = 0\). In general we have the formula \[ \mathbb{P}\left(B\vert A\right) = \frac{\mathbb{P}(B \cap A)}{\mathbb{P}(A)}, \] which follows from the following reasoning: knowing that \(A\) occurs reduces our initial probability space \(\Omega\) to the set \(A\), which then becomes the new ‘total’ or ‘universe’ set, and in order to equip \(A\) with a probability measure we simply normalize our initial probability measure \(\mathbb{P}\) by the measure of the set \(A\), i.e. by \(\mathbb{P}(A)\).

Particularly, for random variables \(X\) and \(Y\) we find the conditional probability that \(X\in B\) given that we already know that \(Y\in A\) as \[ \mathbb{P}(X\in B \vert Y\in A) = \frac{\mathbb{P}(X\in B, Y\in A)}{\mathbb{P}(Y\in A)}. \]

2.3.2 Discrete case

Let \(Y\) be a discrete random variable, and suppose that \(\mathbb{P}(Y = a)>0\). Then, for any \(B\in \mathscr{B}(\mathbb{R})\) we have \[ \mathbb{P}(X\in B \vert Y=a) = \frac{\mathbb{P}(X\in B, Y = a)}{\mathbb{P}(Y=a)}. \] Compare this with the continuous case below. The following formula is often used when computing probabilities by first assuming that we know the value of the random variable \(Y\), thus computing \(\mathbb{P}(X\in B\vert Y = a)\), and then summing over the possible values of \(Y\). Indeed, let \((a_n)_{n}\) be the set of values of \(Y\); then

\[ \begin{aligned} \mathbb{P}\left(X \in B\right) &= \sum_{n}\mathbb{P}(X \in B, Y = a_n)\\ &= \sum_{n} \mathbb{P}(X\in B \vert Y = a_n) \mathbb{P}(Y = a_n). \end{aligned} \]

More generally we have

\[ \mathbb{P}\left(X \in B, Y\in A\right) = \sum_{n\colon a_n\in A} \mathbb{P}(X\in B \vert Y = a_n) \mathbb{P}(Y = a_n). \]
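This decomposition is easy to verify by simulation. A sketch with an illustrative model (a Bernoulli-type \(Y\) and a conditionally uniform \(X\); all choices mine):

```
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
# Y takes the values 0 and 1 with probabilities 0.3 and 0.7
Y = rng.choice([0, 1], p=[0.3, 0.7], size=n)
# given Y = a, let X be uniform on [0, a + 1]
X = rng.uniform(0, Y + 1)
# P(X <= 1/2) = P(X <= 1/2 | Y=0) P(Y=0) + P(X <= 1/2 | Y=1) P(Y=1)
#             = (1/2)(0.3) + (1/4)(0.7) = 0.325
print((X <= 0.5).mean())  # ~0.325
```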

2.3.3 Continuous Case

Let \(Y\) be a continuous random variable, in which case we always have \(\mathbb{P}(Y=a)=0\) for any \(a\in\mathbb{R}\). In this case we cannot define a conditional probability given \(Y=a\) as we did before (by simply normalizing with \(\mathbb{P}(Y=a)\)), but we can in fact define a conditional density function. So let \(X\) be another random variable and consider the random vector \((X,Y)\) with (joint) density function \(f_{X,Y}(x,y)\). For each fixed value of \(y\) define the conditional density function of \(X\) given \(Y=y\) as \[ f_{X\vert Y}(x\vert y) = \frac{f_{X,Y}(x,y)}{\int_{-\infty}^\infty f_{X,Y}(x,y) \hspace{3pt}dx } = \frac{f_{X,Y}(x,y)}{f_Y(y)}, \] and define the conditional probability of \(X\in B\) given \(Y=y\) through the integral of the conditional density function:

\[ \mathbb{P}(X\in B\vert Y=y) = \int_B f_{X\vert Y}(x\vert y) dx. \]

Observe that

\[ f_{X,Y}(x,y) = f_{X\vert Y}(x\vert y)f_Y(y), \]

so that, for any \(A,B\in\mathscr{B}(\mathbb{R})\),

\[ \begin{aligned} \mathbb{P}(X\in B, Y\in A) &= \int_A \int_B f_{X,Y}(x,y) dx dy \\ &=\int_A \int_B f_{X\vert Y}(x\vert y)f_Y(y) dx dy\\ &= \int_A \mathbb{P}(X\in B \vert Y = y)f_Y(y) dy. \end{aligned} \]
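The last identity can also be checked numerically. A sketch with an illustrative joint model, assuming numpy and scipy:

```
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(4)
n = 1_000_000
# illustrative joint density: Y ~ Exponential(1), and given Y = y, X ~ Uniform(0, y)
Y = rng.exponential(1.0, size=n)
X = rng.uniform(0, Y)
mc = ((X <= 1) & (Y <= 2)).mean()  # Monte Carlo estimate of P(X <= 1, Y <= 2)
# right-hand side: integral over A = [0, 2] of P(X <= 1 | Y = y) f_Y(y) dy,
# where P(X <= 1 | Y = y) = 1 if y <= 1 and 1/y otherwise
rhs, _ = quad(lambda y: (1.0 if y <= 1 else 1.0 / y) * np.exp(-y), 0, 2)
print(mc, rhs)  # agree up to Monte Carlo error
```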

2.3.4 Independence

Definition 2.12 (Independence - Events) Let \((\Omega,\mathscr{F},\mathbb{P})\) be a probability space. Two events (sets) \(A,B\in\mathscr{F}\) are said to be independent if \[ \mathbb{P}\left(A \bigcap B\right) = \mathbb{P}(A) \mathbb{P}(B). \]

Compare this with the general formula \[ \mathbb{P}\left(A \bigcap B\right) = \mathbb{P}(A\vert B) \mathbb{P}(B) \] which is always true; observe that the definition of independence can be seen as the property that \[ \mathbb{P}(A\vert B) = \mathbb{P}(A), \] which says that having information about the event \(B\) does not provide us with any additional information about \(A\), and thus our ‘knowledge’ that \(A\) occurs (or may occur) remains the same.

Definition 2.13 (Independence - Random Elements) Let \(X\) and \(Y\) be two random elements (i.e. random variables and/or random vectors) defined on a common probability space \((\Omega, \mathscr{F}, \mathbb{P})\). \(X\) and \(Y\) are said to be independent if for any3 sets \(A,B\), \[ \mathbb{P}\left(X\in A, Y\in B\right) = \mathbb{P}(X\in A) \mathbb{P}(Y\in B). \] The statement that \(X\) and \(Y\) are independent is often written as \[ X \perp Y. \]
Theorem 2.3 Let \(X\) and \(Y\) be random variables with densities \(f_X\) and \(f_Y\) on \(\mathbb{R}\). Consider also the random vector \((X,Y)\) and its density \(f_{X,Y}\) on \(\mathbb{R}^2\). Then \[ X \perp Y \quad\text{ iff }\quad f_{X,Y} = f_Xf_Y. \]

The property in the definition of independence, i.e. that \(\mathbb{P}\left(X\in A, Y\in B\right) = \mathbb{P}(X\in A) \mathbb{P}(Y\in B)\), can be strengthened to \(\mathbb{P}\left(h(X)\in A, g(Y)\in B\right) = \mathbb{P}(h(X)\in A) \mathbb{P}(g(Y)\in B)\) for any pair of functions \(h\) and \(g\), and, furthermore, can be extended to a stronger property about expectations (see Theorem ?? below).


2.4 Cumulative Distribution Functions (CDFs)

2.4.1 Single Random Variables

Definition 2.14 (Cumulative Distribution Function) Let \(X\) be a random variable defined on a probability space \((\Omega, \mathscr{F}, \mathbb{P})\). The function \(F_X\colon \mathbb{R}\to [0,1]\) given by \[ F_X(x) :=\mathbb{P}\left(X\leq x\right) \] is called the cumulative distribution function (CDF) of \(X\).
Example 2.9 Let \(X\) be a random variable with distribution function \[ \mathbb{P}(X\in A) = \frac{1}{b-a}\int_{A}1 \hspace{3pt}dx = length(A) / (b-a), \quad A\subset [a,b], \] i.e. the (continuous) uniform distribution on the interval \([a,b]\) where \(a<b\). Then the cumulative distribution function \(F_X\) of \(X\) is given by \[ F_X(x) = \begin{cases} 0 &\text{ if } x < a,\\ \frac{x-a}{b-a} &\text{ if }x \in [a,b],\\ 1 &\text{ if } x \geq b. \end{cases} \]

Exercise 2.7 Recall Example 2.4. Compute the CDF of the geometric distribution of parameter \(p\).

Exercise 2.8 Recall Example 2.3. Compute the CDF of the exponential distribution of parameter \(\lambda\).

Example 2.10 (Plot Geometric CDF)
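The code chunk for this example is not shown; the following is a minimal matplotlib sketch of such a plot, using the CDF \(1-p^{k+1}\) computed in Example 2.4 (\(p=0.5\) is an illustrative choice):

```
import numpy as np
import matplotlib.pyplot as plt

p = 0.5
k = np.arange(0, 15)
cdf = 1 - p ** (k + 1)          # P({0, ..., k}) from Example 2.4

plt.step(k, cdf, where="post")  # the CDF of a discrete distribution is a step function
plt.xlabel("k")
plt.ylabel("F(k)")
plt.title(f"Geometric CDF, p = {p}")
plt.show()
```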


Example 2.11 (Plot Exponential CDF)
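Again the original chunk is not shown; a minimal sketch of such a plot, using the CDF \(1-e^{-\lambda x}\) from Exercise 2.8 (\(\lambda=0.5\) is an illustrative choice):

```
import numpy as np
import matplotlib.pyplot as plt

lam = 0.5
x = np.linspace(0, 10, 200)
cdf = 1 - np.exp(-lam * x)  # CDF of the exponential distribution

plt.plot(x, cdf)
plt.xlabel("x")
plt.ylabel("F(x)")
plt.title(f"Exponential CDF, lambda = {lam}")
plt.show()
```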

Theorem 2.4 (Properties of CDFs) A function \(F\) is a cumulative distribution function if and only if

  • \(F(-\infty):=\lim_{x\to-\infty}F(x)=0\) and \(F(\infty):=\lim_{x\to\infty}F(x)=1\).
  • \(F\) is right continuous, i.e. for all \(y\in\mathbb{R}\), \(\lim_{x\downarrow y} F(x) = F(y)\).
  • \(F\) is non-decreasing, i.e. for all \(x<y\), \(F(x)\leq F(y)\).
Theorem 2.5 Let \(X\) and \(Y\) be two random variables. Then \(X\) and \(Y\) have the same distribution if and only if \(F_X=F_Y\).

Theorem 2.5 is often useful in determining the distribution of a function of a random variable, say \(h(X)\), by attempting to compute \[ \mathbb{P}(h(X) \leq x) \] for all \(x\).

Example 2.12 Let \(X\) be exponentially distributed of parameter \(\lambda = 1\). Then, for any \(\lambda>0\) the random variable \(X/\lambda\) is exponentially distributed of parameter \(\lambda\). Indeed, note that \[ \begin{aligned} \mathbb{P}\left(\frac{X}{\lambda} \leq x\right) & = \mathbb{P}\left(X\leq \lambda x\right)\\ & = \int_0^{\lambda x} e^{-y} \hspace{3pt}dy\\ & = 1 - e^{-\lambda x}, \end{aligned} \] it remains to note that the function \(x\mapsto 1 - e^{-\lambda x}\) is the CDF of an exponentially distributed random variable of parameter \(\lambda\).


Definition 2.15 (Continuous and Discrete Random Variables) Let \(X\) be random variable.

  • \(X\) is a continuous random variable if \(F_X\) is continuous.
  • \(X\) is a discrete random variable if \(F_X\) is piecewise constant.
Remark. It is possible that a random variable \(X\) is not continuous nor discrete, although in this course we will usually deal with random variables that fall in one of these categories.


In this course we will assume that continuous random variables have a probability distribution with a density (recall Definition 2.3 ), or, in other words, we will assume that for any continuous random variable \(X\) with CDF \(F_X\), we have \(F_X'=f_X\) for some density function \(f_X\) and \[ F_X(x) = \int_{-\infty}^x f_X(s) \hspace{3pt}ds. \]


2.5 Expectations and Moments

2.5.1 Single Random Variables

Definition 2.16 (Expectation - Single Random Variables) Let \(X\) be a random variable, \(h\colon \mathbb{R}\to\mathbb{R}\) be a function, and consider the random variable \(h(X)\). We define the expected value of h(X), \(\mathbb{E}[h(X)]\) as:

  • if \(X\) is continuous with density \(f_X\), \(\mathbb{E}[h(X)]\) is the integral \[ \mathbb{E}\left[h(X)\right] :=\int_{-\infty}^\infty h(x)f_X(x)\hspace{3pt}dx. \]

  • if \(X\) is discrete and takes values \((a_n)_n\) with probabilities \((p_n)_n\) \(\mathbb{E}[h(X)]\) is the sum \[ \mathbb{E}\left[h(X)\right] :=\sum_{n} p_n h(a_n). \]

Example 2.13 (Mean of Normal Distribution) In this example we will assume the known fact that \[ \int_{-\infty}^\infty e^{-x^2} \hspace{3pt}dx = \sqrt{\pi}. \]

Let \(X\) be a random variable with density function \[ f(x) = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}. \]

Exercise 2.9 Prove that \(f\) is a density function.

The distribution associated to this density function is called the normal distribution (with mean \(\mu\) and variance \(\sigma^2\)). Then the expected value (mean) of \(X\) is given by

\[ \begin{aligned} \mathbb{E}\left[X\right] &= \int_{-\infty}^\infty x \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}} \hspace{3pt}dx\\ &= \int_{-\infty}^\infty (\sigma\sqrt{2}y+\mu) \frac{e^{-y^2}}{\sigma\sqrt{2\pi}} \sigma\sqrt{2}\hspace{3pt}dy\\ &= \int_{-\infty}^\infty \sigma\sqrt2 y \frac{e^{-y^2}}{\sqrt{\pi}}\hspace{3pt}dy + \frac{\mu}{\sqrt\pi} \int_{-\infty}^\infty e^{-y^2}\hspace{3pt}dy. \end{aligned} \]

where we have used the change of variable \(y=\frac{x-\mu}{\sigma\sqrt{2}}\). Now observe that the function \(y\mapsto y e^{-y^2}\) is odd (antisymmetric around 0), therefore

\[ \int_{-\infty}^0 \sigma\sqrt2 y \frac{e^{-y^2}}{\sqrt{\pi}}\hspace{3pt}dy = - \int_0^\infty \sigma\sqrt2 y \frac{e^{-y^2}}{\sqrt{\pi}}\hspace{3pt}dy, \]

which implies

\[ \int_{-\infty}^\infty \sigma\sqrt2 y \frac{e^{-y^2}}{\sqrt{\pi}}\hspace{3pt}dy = 0. \]

Thus, since \(\int_{-\infty}^\infty e^{-y^2}\hspace{3pt}dy = \sqrt{\pi}\),

\[ \mathbb{E}\left[X\right] = \mu. \]

Example 2.14 (Mean of Poisson Distribution)

Let \(\lambda > 0\). Let \(X\) be the discrete random variable with distribution given by \[ \mathbb{P}(X = n) = e^{-\lambda} \frac{\lambda^n}{n!},\quad \forall n\in\mathbb{Z}^+=\{0,1,2,\dots\} \]

Exercise 2.10 Argue why this is a valid definition of a discrete random variable.

A random variable with the above distribution is called a Poisson random variable of parameter \(\lambda\). The expected value of \(X\) is given by \[ \begin{aligned} \mathbb{E}[X] &= e^{-\lambda}\sum_{n=0}^\infty n \frac{\lambda^n}{n!} \\ &= e^{-\lambda}\sum_{n=1}^\infty n \frac{\lambda^n}{n!} \\ &= \lambda e^{-\lambda}\sum_{n=1}^\infty \frac{\lambda^{n-1}}{(n-1)!}\\ &= \lambda e^{-\lambda} e^{\lambda}\\ &= \lambda. \end{aligned} \]
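A quick numerical check of \(\mathbb{E}[X]=\lambda\), truncating the series (illustrative value of \(\lambda\)):

```
import math

lam = 3.7
# truncate the series at n = 59; the remaining terms are negligible for this lambda
mean = sum(n * math.exp(-lam) * lam**n / math.factorial(n) for n in range(60))
print(mean)  # ~3.7
```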

Definition 2.17 (Moments) Let \(n\in\mathbb{N}\). The \(n\)th moment of a random variable \(X\) is the expected value \[\mathbb{E}\left[X^n\right].\]
Exercise 2.11 Compute the first and second moments of an exponentially distributed random variable of parameter \(\lambda\). Hint: use the integration by parts formula.
Exercise 2.12 Compute the first and second moments of a geometrically distributed random variable of parameter \(p\). Hint: You may assume that \[ \frac{d}{dp} \left(\frac{1}{1-p}\right) = \frac{d}{dp} \sum_{n=0}^\infty p^n = \sum_{n=0}^\infty \frac{d}{dp} p^n. \]

Theorem 2.6 (Linearity of Expected Values) Let \(X\) and \(Y\) be two random variables and let \(b\in\mathbb{R}\). Then

\[\mathbb{E}[X + bY] = \mathbb{E}[X] + b\mathbb{E}[Y].\]

Definition 2.18 (Variance) The variance \(\mathbb{Var}\) of a random variable \(X\) with mean \(\mu\) is defined as \[ \mathbb{Var}(X) :=\mathbb{E}\left[(X - \mu)^2\right]. \]
Theorem 2.7 Let \(X\) be a random variable, then the following equality holds \[ \mathbb{Var}(X) = \mathbb{E}\left[X^2\right] - \mathbb{E}[X]^2. \]
Exercise 2.13 Use the linearity of the expectation to prove Theorem 2.7.
Exercise 2.14 Compute the variance of a Poisson random variable of parameter \(\lambda\).
Example 2.15 (Mean Intuition)

Example 2.16 (Variance Intuition)
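The content of Examples 2.15 and 2.16 is not reproduced here; the following simulation sketch conveys the usual intuition, namely that the mean is the long-run average of independent draws and the variance is the typical squared spread around it (the normal model is an illustrative choice):

```
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)  # mean 2, standard deviation 1.5

print(x.mean())                      # ~2.0 : long-run average approaches E[X]
print(((x - x.mean()) ** 2).mean())  # ~2.25: average squared spread approaches Var(X)
```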

Exercise 2.15 (Variance of Exponential Distribution) Prove that if \(X\) is exponentially distributed with parameter \(\lambda>0\), then \[ \mathbb{Var}(X) = \frac{1}{\lambda^2}. \]

Example 2.17 (Mean and Variance of Exponential (plot))
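The plot itself is not included; a minimal sketch of what it presumably showed, using \(\mathbb{E}[X]=1/\lambda\) (Exercise 2.11) and \(\mathbb{Var}(X)=1/\lambda^2\) (Exercise 2.15):

```
import numpy as np
import matplotlib.pyplot as plt

lam = np.linspace(0.5, 5, 200)
plt.plot(lam, 1 / lam, label="mean 1/lambda")
plt.plot(lam, 1 / lam**2, label="variance 1/lambda^2")
plt.xlabel("lambda")
plt.legend()
plt.title("Mean and variance of the exponential distribution")
plt.show()
```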

2.5.2 Random Vectors

Definition 2.19 (Expected Value - Random Vectors) Let \(\bar{X}\) be a random vector in \(\mathbb{R}^n\), \(g\colon \mathbb{R}^n\to\mathbb{R}\) be a function, and consider the random variable \(g(\bar{X})\). We define the expected value of \(g(\bar{X})\), \(\mathbb{E}[g(\bar{X})]\), as:

  • if \(\bar{X}\) has density \(f_{\bar{X}}\) then \(\mathbb{E}[g(\bar{X})]\) is the multiple integral \[ \mathbb{E}\left[g(\bar{X})\right] :=\int_{-\infty}^\infty\dots \int_{-\infty}^\infty g(x_1,\dots,x_n)f_{\bar{X}}(x_1,\dots,x_n) \hspace{3pt}dx_1\cdots dx_n, \]

  • if \(\bar{X}\) is a discrete random vector taking values \((a_k)_k\subset\mathbb{R}^n\) with probabilities \((p_k)_k\), then, in the same way as in the single random variable case, \(\mathbb{E}[g(\bar{X})]\) is the sum \[ \mathbb{E}\left[g(\bar{X})\right] = \sum_{k} p_k g(a_k), \]

  • if \(\bar{X}=(X_1,\dots,X_n)\) is such that some coordinates are discrete random variables and some are continuous random variables, then to compute \(\mathbb{E}[g(\bar{X})]\) we integrate or sum correspondingly.
Example 2.18 (Randomized Exponential) Recall Example 2.7 and Exercise 2.11. Let \(g\colon (\mathbb{R}^+)^2\to\mathbb{R}^+\) be the density \[ g(\lambda, x) = \lambda e^{-\lambda x}f(\lambda), \] and let \((\Lambda,X)\) be a corresponding random vector. Then \[ \begin{aligned} \mathbb{E}[X] &= \int_0^\infty \int_0^\infty x\lambda e^{-\lambda x}f(\lambda) \hspace{3pt}dx \hspace{3pt}d\lambda\\ &= \int_0^\infty \frac{f(\lambda)}{\lambda} \hspace{3pt}d\lambda\\ &= \mathbb{E}\left[\frac{1}{\Lambda}\right]. \end{aligned} \] Similarly \[ \begin{aligned} \mathbb{E}[\Lambda X] &= \int_0^\infty \int_0^\infty \lambda x\lambda e^{-\lambda x}f(\lambda) \hspace{3pt}dx \hspace{3pt}d\lambda\\ &= \int_0^\infty \frac{\lambda f(\lambda)}{\lambda} \hspace{3pt}d\lambda\\ &= 1. \end{aligned} \]

Example 2.19 (Randomized Bernoulli) Recall the random vector \((\rho, X)\) of Example 2.8. Then \[ \mathbb{E}[X] = \int_0^1 (p\cdot 1 + (1-p)\cdot 0) f(p) \hspace{3pt}dp =\mathbb{E}[\rho]. \] Similarly \[ \mathbb{E}[X/\rho] = \int_0^1 \frac{(p\cdot 1 + (1-p)\cdot 0)}{p} f(p) \hspace{3pt}dp = 1. \]


  1. Let \(f\colon \mathbb{R}\to\mathbb{R}\) be a function and \(A\subset \mathbb{R}\). Let \(\mathbb{1}_A\) be the indicator function of \(A\), i.e. the function \[ \mathbb{1}_A(x) = \begin{cases} 1 & \text{ if } x\in A\\ 0 & \text{ otherwise}; \end{cases} \] then the integral of \(f\) on the set \(A\) is defined as \[ \int_A f(x)\hspace{3pt}dx = \int_{-\infty}^\infty \mathbb{1}_A(x)f(x) \hspace{3pt}dx. \] For example, if \(A=[a,b]\cup [c,d]\) with \(a<b<c<d\), then \[ \begin{aligned} \int_A f(x)\hspace{3pt}dx &= \int_{-\infty}^a 0 \hspace{3pt}dx + \int_a^b f(x) \hspace{3pt}dx + \int_b^c 0 \hspace{3pt}dx + \int_c^d f(x) \hspace{3pt}dx + \int_d^\infty 0\hspace{3pt}dx\\ &= \int_a^b f(x) \hspace{3pt}dx + \int_c^d f(x) \hspace{3pt}dx. \end{aligned} \]

  2. Let \(p\in (0,1)\). Define the partial sum \[ \begin{aligned} S_N &:=\sum_{k=0}^N p^k\\ &= p^0 + p^1 + \dots + p^{N}. \end{aligned} \] We want to compute \[ S_\infty :=\sum_{k=0}^\infty p^k = \lim_{N\to\infty} S_N. \] Observe that \[ \begin{aligned} pS_N &= p \sum_{k=0}^N p^k \\ &= \sum_{k=0}^N p^{k+1} \\ &= p^1 + \dots + p^{N+1} \\ &= S_{N} + p^{N+1} - 1, \end{aligned} \] where we have used that \(p^0=1\) in the last equality. Rearranging terms we obtain \[ S_N = \frac{1-p^{N+1}}{1-p}; \] thus, since \(0<p<1\) implies \(\lim_{N\to\infty} p^{N+1}=0\), we have \[ \sum_{k=0}^\infty p^k = \lim_{N\to\infty} S_N = \lim_{N\to\infty} \frac{1-p^{N+1}}{1-p} = \frac{1}{1-p}. \]

  3. More formally, if \(X\) takes values in \(E_1\) and \(Y\) takes values in \(E_2\) (e.g. \(E_1\) and \(E_2\) may be \(\mathbb{R}^n\), \(n\in\mathbb{N}\)), then we need \(A\in\mathscr{B}(E_1)\) and \(B\in\mathscr{B}(E_2)\).