Take a look at this string of numbers:
333 2 333 2 333 2 33 2 333 2 333 2 333 2 33 2 333 2 333 2 …
At first it looks like someone fell asleep on a keyboard. But there’s an inner logic to the sequence:
This sequence creates itself, number by number, as explained in the image above. Each digit refers to the number of consecutive 3s before a certain 2 appears. Specifically, the first digit refers to the number of consecutive 3s that appear before the first 2, the second digit refers to the number of 3s that appear consecutively before the second 2, and so on toward infinity.
The sequence never ends, but that won't stop us from asking questions about it. What is the ratio of 3s to 2s in the entire sequence?
Let's first implement the sequence in Python. We can use a queue to hold the sequence, and when we pop a new element off the queue, we can extend the queue (adding three 3s and a 2, or two 3s and a 2) based on the value of the popped element.
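Here's a minimal sketch of that queue-based generator (the function name generate_sequence is illustrative):

```python
from collections import deque

def generate_sequence(n_terms):
    """Self-generating sequence of 3s and 2s.

    Each digit d read off the queue says: the next run of the
    sequence is d consecutive 3s followed by a single 2.
    """
    seq = [3, 3, 3, 2]      # the first run, described by the first digit (3)
    queue = deque(seq[1:])  # digits still waiting to be read
    while len(seq) < n_terms:
        d = queue.popleft()
        run = [3] * d + [2]
        seq.extend(run)     # grow the sequence...
        queue.extend(run)   # ...and the queue of unread digits
    return seq[:n_terms]

print(generate_sequence(16))  # [3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3]
```

Counting the 2s in a long prefix gives a ratio of about 0.2679, which we'll use below.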
We could brute-force the solution by computing the ratio as we extend the sequence ever longer, but this will only ever give us an approximation, and no insight into why the solution is what it is.
I'm going to show how I played around with the problem, naively, and then started to formulate a solution. The first thing I computed was the ratio of 2s present in the sequence. This isn't the desired answer (the ratio of 3s over 2s), but the ratio of 2s, which I'll call \(R\), can be used to find the original solution too: \(\frac{1- R}{R}\). Also, \(R\) can be seen as a probability, something that didn't turn out to be useful but was a possible line of attack. Using the Python code, \(R\) was computed to be approximately 0.2679491. So I had some place to start.
Next I looked at where the 2s were placed in the sequence. For example, positions 4, 8, 12, 15, and so on were 2s. Extending, 2s were placed at positions:
4, 8, 12, 15, 19, 23, 27, 30, 34, 38, 42, 45, 49, 53, 56, 60, 64, 68, ...
Inspecting this, it's clear that 2s occur every fourth position, except every 4th time, when the gap is only three. Call this last exception a "skip". Playing around with this pattern, I constructed a crude function to estimate the position of the \(n\)th 2: \(4n - 4n/16\). (I didn't simplify this, to make it more clear: the first term is where I expect the position to be given there were no "skips"; the second term is the estimated number of skips among \(4n\) elements of the sequence.)
It gave pretty close results for values of \(n\) under 50. I could also estimate \(R\) by looking at \( \frac{n}{4n - 4n/16} \). For large \(n\), this gave a number pretty close to my Python approximation of \(R\) (above). But it wasn't close enough, and something was wrong. Simplifying the expression to \(\frac{1}{4 - \frac{1}{4}}\), I guessed that maybe this looks like a continued fraction! Trying a more accurate approximation, \(\frac{1}{4 - \frac{1}{4 - \frac{1}{4}}}\), gave an even better result! Newly motivated, I went back to my "crude approximation function". In the limit, the correct function would look like \( 4n - Rn\). Why? Again, given no "skips" we expect the \(n\)th 2 to be at \(4n\); then subtract the number of "skips", which is exactly the number of 2s, given by \(Rn\). Thus the position of the \(n\)th 2 is \((4-R)n\), and so \(R = \frac{n}{(4-R)n} = \frac{1}{ 4 - R} \). (This confirms the continued fraction.) Solving the quadratic, \(R = 2 - \sqrt{3} \approx 0.26794919 \).
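As a quick check (a sketch), iterating the map \(R \mapsto \frac{1}{4-R}\) from any starting guess converges to this fixed point:

```python
# iterate the continued fraction's map R -> 1/(4 - R);
# its fixed point solves R^2 - 4R + 1 = 0, i.e. R = 2 - sqrt(3)
R = 0.25  # any starting guess in (0, 1)
for _ in range(50):
    R = 1 / (4 - R)

print(R)            # ~0.26794919...
print((1 - R) / R)  # the ratio of 3s to 2s, ~2.7320508
```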
Now we can solve the original problem, given by \(\frac{1- R}{R} = \frac{\sqrt{3} - 1}{2 - \sqrt{3}} \approx 2.7320508\).
[Image credit to FiveThirtyEight]
Last week, I achieved a goal of hitting 200 problems solved in Project Euler. If you are unfamiliar with Project Euler, it's a website where users can solve tricky problems that require both mathematics and programming to solve. An example question:
(BBBW) | (B,BBW) | (B,B,BW) | (B,B,B,W) | (B,BB,W) | (BBB,W) | (BB,BW)
I started in Jan. 2016, and was able to collect my progress of completion over time:
There is also the concept of the "difficulty" of a problem, provided by the website itself as a measure of how often participants get the answer wrong. Here is the cumulative difficulty:
I'm not finished with Project Euler yet - there are way too many problems I spent time on but didn't solve, and they are just itching at me. However, I am taking a break for a while so I can focus on some other projects. After all, solving 200 problems was technically my lifetime goal.
If a person asked me "is Project Euler worth the time?" - I wouldn't immediately say yes. You do need lots of time, and it really is a personal thing (there is a leaderboard, but it's not nearly as public as, say, StackOverflow's leaderboards). You also need to think about the opportunity cost - will solving these problems further your career or research? Personally, I feel that Project Euler rounded out my understanding of computer science and mathematics, which had previously been weighted heavily toward statistics and Python applied to statistics.
My problem was not as easy as this:
Essentially, all the information I had was the following: my job threw a NullPointerException somewhere when it tried to process this new dataset:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 533 in stage 1.0 failed 4 times, most recent failure: Lost task 533.3 in stage 1.0. java.lang.NullPointerException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
But I still needed to find the offending row. The idea of a binary search over the dataset appealed to me, but the data was distributed: there was no sense of an “order” to it, that is, I had no concept of a middle value to split the data on. The next idea I thought of was using modulus. Suppose each row had an integer 1, 2, 3, … associated with it; then I could split the data modulo 2 and test either side, 0 or 1, for a failure. Suppose that the 1s-half contained the failure. Then I should look at splitting the dataset by 1 and 3 modulo 4. And so on. This idea can be applied recursively. In practice, it relies on the fact that \(x \equiv k \bmod 2^n\) implies either \(x \equiv k \bmod 2^{n+1}\) or \(x \equiv k + 2^n \bmod 2^{n+1}\). Using this, I can recursively halve my dataset until I have found a single row (or at least a much smaller set). I’ll explain how to associate each row to an integer in a moment, but first, here’s a sample algorithm in Python:
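A minimal sketch of the idea (the names find_offending_index and test are illustrative):

```python
def find_offending_index(n, test):
    """Locate the single bad row among indices 0..n-1.

    `test(subset)` returns True iff running the job on `subset`
    reproduces the failure.
    """
    modulus, residue = 1, 0
    while modulus < n:
        # split the class {i : i % modulus == residue} into two halves
        half = [i for i in range(n) if i % (2 * modulus) == residue]
        if not test(half):
            # the failure must be in the other half of the class
            residue += modulus
        modulus *= 2
    return residue

# toy usage: pretend row 37 of 100 is the corrupt one
bad_row = find_offending_index(100, lambda subset: 37 in subset)
print(bad_row)  # 37
```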
This is a toy example, but the gist of it is that each round I halve the dataset and test one of the halves for the desired behaviour (implemented in test). The running time of this algorithm is \(\mathcal{O}(T\log{n})\), where \(n\) is the size of the dataset and \(T\) is the running time of the job.
This assumes there is an easy way to assign an integer to each row. When implementing this idea in Spark on a distributed dataset, there is no way to index each row consecutively (Spark does have zipWithIndex in its RDD layer, though this does an internal collect, which isn’t very desirable). However, there is a monotonically_increasing_id function, which gives each row a monotonically increasing integer, though not necessarily consecutive. Fortunately, this algorithm still works (and typically this is where a normal binary search would break). The runtime, however, is no longer \(\mathcal{O}(T\log{n})\) but \(\mathcal{O}(T\log{M})\), where \(M\) is the maximum integer assigned. So in Spark, my implementation to find the offending row looked like:
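Here is a plain-Python sketch of the same search over non-consecutive ids (standing in for the actual Spark DataFrame code; find_offending_id and test are illustrative names). The key point is that the loop runs to the maximum id \(M\), not the row count:

```python
def find_offending_id(ids, test):
    """`ids` are integer ids with gaps (as monotonically_increasing_id
    would produce); `test(subset)` is True iff the failing row's id is
    in `subset`. Runs O(log M) rounds, M = max(ids)."""
    M = max(ids)
    modulus, residue = 1, 0
    while modulus <= M:
        half = [i for i in ids if i % (2 * modulus) == residue]
        if not test(half):
            residue += modulus
        modulus *= 2
    return residue

# non-consecutive ids; pretend id 1042 is the offending row
ids = [3, 17, 40, 41, 100, 257, 1042]
print(find_offending_id(ids, lambda s: 1042 in s))  # 1042
```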
So I let this run for an hour or so, and it worked! I was very happy to find the offending row. Though next time, I would change one thing: cache the dataset right before the point where your job fails, else you’ll be rerunning the same computations over and over. I’m not totally sure this is the fastest way to find the offending line (though, given the tools plus information provided, it might be), but it works and is reliable.
$$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$
where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved with calculus because both terms are easy to differentiate (when \(p\) is an integer greater than 1). Actually, this is backwards: we want to solve the MLE using calculus (because that's our hammer), and in order to add a penalizer term, we also need it to be differentiable. Hence why we historically used the 2-norm.
If we don't solve the optimization problem with calculus, well then we can be more flexible with our penalizer term. I took this liberty when I developed the optimizations in lifetimes, my Python library for recency/frequency analysis. The MLE in the model is too complicated to differentiate, so I use numerical methods to find the minimum (Nelder-Mead to be exact). Because of this, I am free to add any penalizer term I wish, deviating as I choose from the traditional 2-norm. I made the choice of \(\log\), specifically:
$$ \min_{\alpha, \beta} -\ell(\alpha, \beta) + \lambda \left( \log(\alpha) + \log(\beta) \right) $$
This was my logic when I first developed lifetimes. Things were going well, until I started noticing some datasets that would produce unstable convergence only when the penalizer coefficient \(\lambda\) was positive: it was driving some variables to nearly 0. How could this be? Shouldn't any positive penalizer coefficient help convergence? To see why, we'll take two perspectives on this problem.
I probably don't need to say it, but the log of a value less than 1 is negative. More extreme, the log of a very small value is very very negative (because the rate of change of log near zero gets larger as we approach the asymptote). Thus, during optimization, when a parameter starts to get small, the overall penalizer term starts to gain momentum. In fact, the optimizer starts to shrink a particular parameter to near 0 because that really really helps the overall optimization.
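To make this concrete, here's a hypothetical toy objective (a stand-in loss, not the actual lifetimes likelihood): with any \(\lambda > 0\), the log penalizer makes the objective unbounded below as \(\theta \to 0^+\), so shrinking \(\theta\) always eventually "wins".

```python
import math

lam = 1.0

def penalized_objective(theta):
    # a stand-in loss with its minimum at theta = 5, plus the log penalizer
    return (theta - 5.0) ** 2 + lam * math.log(theta)

# the optimizer can always do "better" by shrinking theta toward zero:
print(penalized_objective(5.0))    # ~1.609
print(penalized_objective(1e-30))  # ~-44: lower than at the loss's own minimum!
```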
This is obviously not what I wanted. Sure, I wanted to keep values from being too large, but I certainly did not want to reduce parameters to near zero! So it made sense that when I had \(\lambda\) equal to 0 I did not observe this behaviour.
On the other extreme, the \(\log\) penalizer is kinda terrible against large values too. An order of magnitude increase in a parameter barely makes a difference in the log of it! It's an extremely sub-linear function, so it doesn't really penalize large parameter sizes well.
As noted in a chapter in Bayesian Methods for Hackers, there is a really beautiful and useful relationship between MLE penalizer terms and Bayesian priors. Simply, the prior is equivalent to the exponential of the negative penalizer term. Thus, the 2-norm penalizer term is a Normal prior on the unknowns; the 1-norm penalizer term is a Laplace prior, and so on. What, then, is our \(\log\) prior? Well, it's a \(\exp(-\log(\theta)) = \frac{1}{\theta}\) prior on \(\theta\). This is strange, no? It's an improper prior, and it's in fact the Jeffreys prior for a scale parameter, so I'm basically saying "I have no idea what this scale parameter should be" - not what I want to be doing.
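Spelled out, with a generic penalizer \(\text{pen}(\theta)\):

$$ \min_{\theta}\left[ -\ell(\theta) + \text{pen}(\theta)\right] \iff \max_{\theta}\left[ \ell(\theta) + \log \pi(\theta) \right], \quad \pi(\theta) \propto e^{-\text{pen}(\theta)} $$

so the 2-norm gives \(\pi(\theta) \propto e^{-\lambda||\theta||_2^2}\) (Normal), the 1-norm gives \(\pi(\theta) \propto e^{-\lambda||\theta||_1}\) (Laplace), and \(\lambda\log\theta\) gives \(\pi(\theta) \propto \theta^{-\lambda}\).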
As much as I like the scale-free property of the \(\log\), it's time to say goodbye to it in favor of another. I think for my purposes, I'll try the Laplace prior/1-norm penalizer as it's a nice balance between disallowing extremely large values and not penalizing the largest parameter too strongly.
Update: I actually went with the square penalizer in the end. Why? The 1-norm was still sending too many values to zero, when I really felt strongly that no value should be zero.
Let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) Uniform random variables. Let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The density of \(U_N\) is \(e - e^{1-u}\). This result is related to the really astounding result: \(N\) (the same \(N\) defined above) has expected value equal to \(e\). Yes, that's right, the average number of uniform random variables needed to exceed 1 is exactly \(e\). I just love this surprising result, but have still not found a good, intuitive reason as to why this is true. Of course it is true mathematically, and the proof is not too hard, but I just can't see why. Anyways, below is the proof of the first result.
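Both facts are easy to check by simulation; here's a quick sketch:

```python
import math
import random

random.seed(2015)

def first_exceed():
    """Draw uniforms until the running sum exceeds 1; return (N, U_N)."""
    total, n = 0.0, 0
    while total < 1.0:
        u = random.random()
        total += u
        n += 1
    return n, u

trials = 200_000
samples = [first_exceed() for _ in range(trials)]

avg_N = sum(n for n, _ in samples) / trials
print(avg_N)  # ~2.718, i.e. e

# P(U_N <= 1/2) should equal the integral of e - e^(1-u) over [0, 1/2]
frac = sum(1 for _, u in samples if u <= 0.5) / trials
print(frac, math.sqrt(math.e) - math.e / 2)  # both ~0.290
```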
I've seen some really interesting numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have access to). So here it is: my notes and repository for Poissonization.
Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose that, given \(N=n\), \((X_1, X_2, ... X_k) \sim \text{Multi}(n, p_1, p_2, ..., p_k)\). Then, marginally, \(X_1, X_2, ..., X_k\) are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1]
The proof is as follows. By the law of total probability:
$$
\begin{align*}
P(X_1=x_1, ..., X_k = x_k) \\
& = \sum_{n=0}^{\infty} P(X_1=x_1, ..., X_k = x_k \;|\; N=n) \frac{e^{-\lambda}\lambda^n}{n!} \\
& = \sum_{n=0}^{\infty} \frac{(x_1 + ... + x_k)!}{x_1!x_2! \cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}\frac{e^{-\lambda}\lambda^n}{n!}I_{x_1 + ... + x_k = n} \\
& = e^{-\lambda}\lambda^{x_1} \cdots \lambda^{x_k}p_1^{x_1} \cdots p_k^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = e^{-\lambda}(\lambda p_1)^{x_1} \cdots (\lambda p_k)^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = \prod_{i=1}^k \frac{e^{-\lambda p_i}(\lambda p_i)^{x_i}}{x_i!}
\end{align*}
$$
Let's see how we can use this. Let \((Y_1, Y_2, ...Y_k)\) be \(\text{Multi}(n, p_1, p_2, ..., p_k)\), and we are interested in when the realized values fall into some set \(A\). Let \(X_i\) be defined as above, that is, \(X_i \sim \text{Poi}(p_i \lambda)\). Then, from above, we can write:
$$
\begin{align*}
P((X_1, ... X_k) \in A) & = \sum_{n=0}^\infty \frac{e^{-\lambda}\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
& = e^{-\lambda} \sum_{n=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
\Rightarrow e^{\lambda}P((X_1, ... X_k) \in A) & = \sum_{n=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A)
\end{align*}
$$
Both sides of the last line are power series in \(\lambda\) (note that \(P((Y_1, ... Y_k) \in A)\) depends on \(n\) through the multinomial), and thus their coefficients must be equal. Then, for a given \(N=n\), \(P((Y_1, ... Y_k) \in A)\) is \(n!\,c(n)\), where \(c(n)\) is the coefficient of \(\lambda^n\) in the expansion of \(e^{\lambda}P((X_1, ... X_k) \in A)\).
Let's try an example:
n defects are randomly distributed amongst k integrated-circuit chips produced by a factory (any number of defects may be found on a chip and each defect is independent of the other defects). If n = 4, k=7, find the probability that there is a chip with at least 3 defects.
This is a classic case where it's easier to compute the complementary probability: what is the probability that all chips have less than 3 defects (and then we subtract this from 1 to get the original probability).
$$
\begin{align*}
P(\text{Chips have 2 or less defects}) \\
& = \Big(P(\text{Chip has 0 defects}) + P(\text{Chip has 1 defect}) + P(\text{Chip has 2 defects})\Big)^7 \\
& = \Big(e^{-\lambda/7} + \frac{e^{-\lambda/7}(\lambda/7)^1}{1!} + \frac{e^{-\lambda/7}(\lambda/7)^2}{2!}\Big)^7\\
& = e^{-\lambda}\Big(1 + \frac{\lambda}{7} + \frac{\lambda^2}{2!7^2}\Big)^7\\
\end{align*}
$$
Here's the Poissonization step. We multiply the above by \(e^{\lambda}\) (oh nice, it cancels), expand to get the coefficient of \(\lambda^4\), and multiply it by \(4!\) per the rule above. For this, it's useful to know a bit about multinomial expansions. But basically you are looking for solutions to
$$
k_0 + k_1 + k_2 = 7\\
0 k_0 + 1 k_1 + 2 k_2 = 4
$$
The first equation comes from the constraint for the powers of a multinomial expansion to sum to the outside exponent. The second equation comes from our need to find powers of \(\lambda\) that sum up to 4 (the 0,1,2 come from the existing powers of \(\lambda\)). Two equations, three unknowns, solving for the values:
$$
\begin{align*}
k_2 &= k_0 - 3 \\
k_1 &= 10 - 2k_0\\
3 &\le k_0 \le 5
\end{align*}
$$
Computationally, we can write this in Python:
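A sketch of the computation, using the \((k_0, k_1, k_2)\) solutions above, with a brute-force cross-check of the answer:

```python
from itertools import product
from math import factorial

# coefficient of lambda^4 in (1 + lambda/7 + lambda^2/(2*7^2))^7, times 4!
total = 0.0
for k0 in range(3, 6):  # 3 <= k0 <= 5
    k1 = 10 - 2 * k0
    k2 = k0 - 3
    multi = factorial(7) / (factorial(k0) * factorial(k1) * factorial(k2))
    total += multi * (1 / 7) ** k1 * (1 / (2 * 7 ** 2)) ** k2
p_all_le_2 = factorial(4) * total
print(p_all_le_2, 1 - p_all_le_2)  # 0.92711..., 0.07288...

# sanity check: drop each of the 4 defects on one of 7 chips uniformly,
# and count outcomes where some chip gets 3 or more defects
count = sum(1 for chips in product(range(7), repeat=4)
            if max(chips.count(c) for c in range(7)) >= 3)
print(count / 7 ** 4)  # 0.07288..., agrees
```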
We get the answer 0.92711, so its complement, 0.07289, is the probability asked for in the original question.
Looking for more examples?
[1] DasGupta, A. Probability for Statistics and Machine Learning. http://link.springer.com/book/10.1007%2F978-1-4419-9634-3
Python developers are commonly using Python as a tool to explore datasets - but what if we reverse that analysis lens back on to the developer? In this talk, Cam will use Python as a data analysis tool to explore Python developers and code. With millions of data points, mostly scraped from Github and Stackoverflow, we'll reexamine who the Python developer is, and how Python developers write code.
You can purchase it on Amazon today!
On the other hand, a non-constructive proof does not detail how to build the object, it just states that it must exist. Often this is done by finding a proof by contradiction, but some interesting proofs use probability theory. For example, suppose I'm interested in showing that there exists an object with some property value less than 0. I can argue: "the expected value of this property is 0, and since I know there is an object with property value greater than 0, there must be an object with property value less than 0, too". I didn't find that specific object, but I determined it must exist.
There is an interesting trend occurring in data analysis that is analogous to a non-constructive proof. The author won't find the object desired, but will state, statistically, that it does exist. Sound odd?
Suppose it was early 2015, and I was interested in analyzing the Billboard's 2015 Song of the Summer. I might ask, "has the Song of the Summer been released yet?" By analyzing when the winning song was released in previous years, I can get a distribution of when this year's Song of the Summer might be released. If I'm past the majority of previous dates, then I can statistically conclude that 2015's Song of the Summer exists. I haven't shown what the song of the summer is, just that it exists.
In fact, this is what FiveThirtyEight did this year: they suggested that on May 5th, you have probably already heard the 2015 Song of the Summer. They didn't guess which song, just that it existed.
I find this to be a really powerful technique. By showing something exists, it can narrow down your search significantly. Suppose you and a friend were betting on what movie will win the Best Picture Oscar. You could probably do quite well by betting on all movies made after October, and nothing before. This is because the Oscar winner often is released (exists) after October.
Another extension, from a decision maker's perspective, is to understand when something should exist, and place yourself there. For example, good luck to the artist who hopes to achieve Song of the Summer with a late May release! Similar ideas apply to presidential elections (when should a candidate enter the nominee race to maximize chances of being president?) and theatrical releases hoping to win Oscars.
Any other applications? This feels like one of those tricks that is not generally applicable, but very powerful in some situations. Leave other ideas in the comments!
This is a very simple Bayesian problem, on paper! See Allen's book for the solution (link above). What if we wanted to use PyMC to solve it? Below is the solution (originally from CrossValidated).
It looks like there is a bug in 2.3.4 (the most recent version) that is causing the wrong inference. This took me a while to discover, but it was solved when I downgraded to PyMC 2.3.2.
import pymc as pm

p = [
    # brown, yellow, red, green, orange, tan, blue
    [.3, .2, .2, .1, .1, .1, .0],      # 1994 bag
    [.13, .14, .13, .2, .16, .0, .24]  # 1996 bag
]

bag = pm.Categorical('which_bag', [0.5, 0.5])  # prior on bag selected

@pm.deterministic
def first_bag_selection(p=p, bag=bag):
    return p[bag]

@pm.deterministic
def second_bag_selection(p=p, bag=bag):
    return p[1 - bag]

# observe yellow in bag 1
obs_1 = pm.Categorical('bag_1', first_bag_selection, value=1, observed=True)

# observe green in bag 2
obs_2 = pm.Categorical('bag_2', second_bag_selection, value=3, observed=True)

mcmc = pm.MCMC([bag, p, first_bag_selection, second_bag_selection, obs_1, obs_2])
mcmc.sample(15000, 3000)

bag_trace = mcmc.trace('which_bag')[:]

# what is the probability the bag chosen is from 1996?
print (bag_trace == 1).sum() * 1. / bag_trace.shape[0]
# should be about .26
We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical system and the underlying group and present three dynamical properties equivalent to the standard group properties.
Consider a sufficiently large population of \(N\) individuals. Sexual reproduction between individuals is a binary action, that is, sex involves two individuals. An interesting assumption is to suppose that the population consists of individuals from an algebraic group. Recall the definition of a group: a set, \(G\), and binary operator \(\ast\) is called a group iff
there exists an identity element, \(e\), s.t. \(\forall x \in G, x\ast e=e \ast x = x\).
\(\forall x \in G\) there exists an inverse, \(x^{-1}\), s.t. \(x\ast x^{-1}=x^{-1}\ast x = e\).
\(\forall x,y \in G, x\ast y \in G\) and \(y\ast x \in G\).
What would the behaviour of a sexual system look like if all individuals were members of a group? Can we find such systems in nature, or are our constructions purely artificial?
We call the order of group \(G\) the number of elements in the group, denoted \(|G|\). A group, \(G\), is called abelian if the operation is commutative, i.e. \(\forall x,y \in G, x\ast y = y\ast x\). A subset \(H \subset G\) is called a subgroup iff \(e\in H\), \(x \in H \Rightarrow x^{-1} \in H\), and \(H\) is closed under the binary operation. A Cayley table of a group is an \(n \times n\) matrix where the \((ij)\)th entry is the product of the \(i\)th and \(j\)th elements. For example, consider \(\mathbb{Z}_2\times \mathbb{Z}_2=\{(0,0),(1,0),(0,1),(1,1)\}\) under addition mod 2. The Cayley table is \[\left[ \begin{array}{cccc} (0,0) & (1,0) & (0,1) & (1,1)\\ (1,0) & (0,0) & (1,1) & (0,1)\\ (0,1) & (1,1) & (0,0) & (1,0)\\ (1,1) & (0,1) & (1,0) & (0,0) \end{array} \right]\]
This matrix completely defines the group.
We describe a stochastic process that will characterize our evolving population. Consider again a population of \(N\) individuals where \(N_i(t)\) denotes the number of \(i\) individuals in the population. We consider the interaction between two individuals to represent sexual reproduction. The birth-death process begins by two individuals being randomly chosen proportional to their frequency in the population. The new offspring (the product of the two individuals) is substituted for another randomly chosen individual, again proportional to frequency (thus the population size is constant). The process is repeated for each time step. For example, consider the group \(\mathbb{Z}_2=\{0,1\}\). Recall that in this group \(1\ast1=0\), so the probability that the population of 1s increases, i.e. \(N_1(t+1) = N_1(t) + 1\), is equal to the probability a 1 and 0 are chosen as parents, and a 0 is chosen to die:
\[\begin{aligned} P^{+} &=P(N_1(t+1)=N_1(t)+1| N_1(t))\\ &=\left( \frac{N_1(t)}{N} \frac{N_0(t)}{N-1}+\frac{N_0(t)}{N} \frac{N_1(t)}{N-1}\right) \frac{N_0(t)}{N}\\ &=\left( \frac{2N_1(t)}{N} \frac{N_0(t)}{N-1}\right) \frac{N_0(t)}{N}\end{aligned}\]
since for the population \(N_1(t)\) to increase by one in a time step, the new individual resulting from the interaction must be a 1 and the death must be a 0. Similarly,
\[\begin{aligned} P^{-} &=P(N_1(t+1)=N_1(t)-1| N_1(t))\\ &=\left( \frac{N_1(t)}{N} \frac{N_1(t)-1}{N-1}+\frac{N_0(t)}{N} \frac{N_0(t)-1}{N-1}\right) \frac{N_1(t)}{N}\\\end{aligned}\]
and \(P^{0}=P(N_1(t+1)=N_1(t)| N_1(t))=1-P^{+} -P^{-}\). Thus we have
\[\begin{aligned} E[N_1(t+1)|N_1(t)] &=P^+\cdot (N_1(t)+1)+P^-\cdot (N_1(t)-1)+P^0\cdot (N_1(t))\\ &=P^+-P^-+N_1(t)\Rightarrow\end{aligned}\]
\[E[N_1(t+1)-N_1(t)|N_1(t)]=P^+-P^-\] Working with probabilities can be unwieldy, so we take the limit as \(\Delta t\rightarrow 0\) and \(N \rightarrow \infty\). If we denote \(N_1(t)/N \approx x\) (then \(N_0(t)/N \approx 1-x\)), we have
\[\begin{aligned} &P^+=2x(1-x)^2\\ &P^-=x(x^2+(1-x)^2) \Rightarrow \\ &\frac{dx}{dt} \approx P^+-P^-= x(1-2x)\end{aligned}\]
This system has equilibrium points at \(x=0\) and \(x=1/2\), i.e. when the population of 1’s is extinct or equal to the population of 0’s; the interior point \(x=1/2\) is asymptotically stable, while \(x=0\) is unstable.
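As a numerical sanity check (a sketch in Python; the population size, seed, and step count are arbitrary), simulating the finite-\(N\) birth-death process for \(\mathbb{Z}_2\) shows the frequency of 1s settling near \(1/2\):

```python
import random

random.seed(1)

N = 1000
n1 = 800  # start with 80% of the population being the element 1

for _ in range(50_000):
    # two distinct parents chosen proportional to frequency:
    # treat indices 0..n1-1 as the 1s and the rest as 0s
    i, j = random.sample(range(N), 2)
    a, b = (1 if i < n1 else 0), (1 if j < n1 else 0)
    child = (a + b) % 2                          # the group operation in Z_2
    dies = 1 if random.randrange(N) < n1 else 0  # death proportional to frequency
    n1 += child - dies

print(n1 / N)  # hovers near 1/2
```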
Initially, to model such an evolving system, we consider only \(\Delta t\rightarrow 0\) and \(N \rightarrow \infty\). Thus we enter the well-studied realm of continuous dynamical systems. Given a group of order \(n\), we need some way to introduce the group structure into the dynamical system. The Reverse Cayley matrix, \(A\), is defined as an \(n\times n\) matrix where the \(i\)th column of \(A\) contains the products of frequencies such that the product of the corresponding elements is the \(i\)th element of the group. For example, consider \(\mathbb{Z}_2\times \mathbb{Z}_2=\{(0,0),(1,0),(0,1),(1,1)\}\), and let \(e,x,y,z\) denote the frequencies of the enumerated set. Then define \(A\) as \[\left[ \begin{array}{cccc} e^2 & ex & ey & ez\\ x^2 & xe & xz & xy\\ y^2 & yz & ye & yx\\ z^2 & zy & zx & ze \end{array} \right]\] The matrix can easily be constructed from the group Cayley table. In each row, for every element \(g \in G\), there exists a pair of elements such that the product is \(g\). Compare the Reverse Cayley matrix with the Cayley table for \(\mathbb{Z}_2\times \mathbb{Z}_2\) given in section 2.
The dynamics of the system are given by the following system of equations, analogous to the limiting case in the example with \(\mathbb{Z}_2\). It is straightforward to show the dynamics are the same. In what follows, \(\mathbf{1}=(1,1,\ldots,1)\), \(\mathbf{e_i}\) are the standard euclidean basis vectors, and \(x_i\) is the frequency of the \(i\)th element of the group.
\[\begin{aligned} \dot{x}_j &=P^{+}_{j}-P^{-}_{j}\\ &=\left(\sum^{n}_{i=1}\ A_{ij}\right)(1-x_j) - \left(1-\sum^{n}_{i=1}\ A_{ij}\right)x_j\\ &= \sum^{n}_{i=1}\ A_{ij} - x_j \\ &= \mathbf{1}\cdot A \mathbf{e_j} - x_j \text{ which can also be rewritten, if convenient, as}\\ &=\mathbf{1}\cdot A (-x_1,\ldots, -x_{j-1},1-x_j,-x_{j+1}, \ldots,-x_n)^T\end{aligned}\]
This greatly simplifies our dynamics. In fact, it is now trivial to show that the barycenter of the \(n\)-simplex, \(S_n\), is an interior stable point: suppose the frequency vector \(\vec{\mathbf{x}}=(1/n, \ldots, 1/n)\), then \(A=A(\vec{\mathbf{x}})=\frac{1}{n^2}J_n\), where \(J_n\) is the matrix full of ones, then
\[\begin{aligned} \dot{x}_i &=\frac{1}{n^2}\mathbf{1}\cdot J_n (-1/n,\ldots,1-1/n,\ldots, -1/n)^T\\ &=\frac{1}{n^2} n\mathbf{1}\cdot (-1/n,\ldots,1-1/n,\ldots, -1/n)^T\\ &=\frac{1}{n}(1-\frac{1}{n}-\ldots -\frac{1}{n})=0\end{aligned}\]
For the rest of the paper, let \(x_1\) represent the frequency of the identity element of the underlying group \(G\). Recall the heuristic definition of an invariant manifold is a manifold, \(N\), such that if the system starts somewhere in \(N\), it stays in \(N\) for all time . A useful concept in this study is the support of an invariant manifold, \(N\), defined as the following: \(i\in Supp(N)\) iff \(x_i>0\) in \(Int(N)\). Define the order of the manifold, denoted \(|N|\), as \(|Supp(N)|\).
\(\{ i_1, i_2, ...,i_k\}\) is a subgroup of \(G\) iff the sub-manifold of the dynamical system with support \(\{ i_1, i_2, ...,i_k\}\) is invariant.
----------------------
Suppose a population is described by an underlying group \(G\), and \(H\) is a subgroup of \(G\). Let \(\vec{\mathbf{x}}\) denote the frequencies of each member of the population. Suppose the population starts at a point where only members of \(H\) are present, that is, for all \(i\in G \setminus H, x_i=0\). Let \(i\) be an element not in \(H\). Its frequency evolves by \(\dot{x}_i=\mathbf{1}\cdot A\mathbf{e_i}\) (since \(x_i=0\)). We then let the system evolve. I claim that all elements in the \(i\)th column of \(A(\vec{\mathbf{x}})\) are zero. If there were a non-zero element in the column, say \(x_k x_j\), then \(k \ast j = i\) by definition of the reverse Cayley matrix \(A\). But as \(x_k, x_j\) are non-zero, elements \(k\) and \(j\) belong to \(H\). As \(H\) is a group, this implies \(k \ast j = i \in H\), a contradiction. Thus, as the \(i\)th column is all zero, \(\dot{x}_i=0\).
On the other hand, suppose that \(H\) is not a subgroup. Then there exists \(u,v \in H\) s.t. \(u\ast v =k \not \in H\) thus \(x_k=0\) while \(x_u,x_v>0\). Then \(x_u x_v\) is in the \(kth\) column of \(A\). Thus \(\dot{x}_k=\mathbf{1}\cdot A\mathbf{e_k}>0\) and the system is not invariant.
----------------------
The above proposition implies that not all sub-simplices of this system are invariant. If we consider the group \(\mathbb{Z}_2\times \mathbb{Z}_2\), then the only invariant manifolds of the system, described in terms of sub-simplices of \(S_4\), are: the interior of \(S_4\), the interior of the three edges projecting from \((1,0,0,0)\) and the point \((1,0,0,0)\). It is clear that, say, \(\mathbf{e}_2=(0,1,0,0)\) is not an invariant manifold of the system as \((1,0)\ast (1,0) = (0,0) \not= (1,0)\), thus the system will evolve away from that point. It is clear that all invariant sub-manifolds of the dynamical system are actually sub-simplices of \(S_n\). Both terms will be used interchangeably for the remainder of the paper.
\(1\in Supp(N)\) for all invariant manifolds, \(N\), of the dynamical system.
----------------------
If all the two-dimensional simplices extending from \(\mathbf{e}_1\) are invariant, then the underlying group is abelian.
----------------------
The assumption implies that \(\{e,i\}\) is a subgroup for all \(i \in G\). By closure, \(i \ast i\) must be either \(e\) or \(i\), and \(i \ast i = i\) would force \(i=e\); thus \(i\ast i =e\) for all \(i \in G\). It is a well-known (and interesting) theorem in group theory that this condition implies the group is abelian.
There is an alternative way to describe the dynamical system that will be used in the next proposition. Define \(B_k\) as the binary Cayley matrix s.t. \((B_k)_{ij}=1\) if \(i\ast j=k\) and 0 otherwise, for all \(1 \le k \le n\). Clearly \(\sum_k\ B_k = J_n\), where \(J_n\) is the \(n\times n\) matrix of all ones. A property of \(B_k\) is that its columns are standard euclidean basis vectors. It is simple to show the dynamical system can be rewritten as:
\[\dot{x}_i =\vec{\mathbf{x}}\cdot B_i \vec{\mathbf{x}} -x_i\]
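As a concrete check (a sketch; the enumeration follows section 2), we can construct the \(B_k\) for \(\mathbb{Z}_2\times\mathbb{Z}_2\) and verify numerically that the barycenter, and the identity vertex, are stationary points of this system:

```python
# elements of Z_2 x Z_2, enumerated with the identity first
elems = [(0, 0), (1, 0), (0, 1), (1, 1)]
n = len(elems)

def op(a, b):
    # the group operation: componentwise addition mod 2
    return ((a[0] + b[0]) % 2, (a[1] + b[1]) % 2)

# binary Cayley matrices: (B_k)_{ij} = 1 iff elems[i] * elems[j] = elems[k]
B = [[[1 if op(elems[i], elems[j]) == elems[k] else 0 for j in range(n)]
      for i in range(n)] for k in range(n)]

def xdot(x):
    # the system: xdot_k = x . B_k x - x_k
    return [sum(x[a] * B[k][a][b] * x[b] for a in range(n) for b in range(n)) - x[k]
            for k in range(n)]

print(xdot([0.25, 0.25, 0.25, 0.25]))  # barycenter: [0.0, 0.0, 0.0, 0.0]
print(xdot([1.0, 0.0, 0.0, 0.0]))      # identity vertex: [0.0, 0.0, 0.0, 0.0]
```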
The barycenter of an invariant sub-manifold is (asymptotically) stable.
----------------------
What follows is a partial proof. Taking the time derivative of the Lyapunov function \(V(\vec{\mathbf{x}}) = \sum\ x_{i}^2\), and using \(\dot{x}_i =\vec{\mathbf{x}}\cdot B_i \vec{\mathbf{x}} -x_i\), we get
\[\begin{aligned} V'/2 & = \sum_i\ \vec{\mathbf{x}} \cdot x_i B_i \vec{\mathbf{x}} - \sum_i x_{i}^2 \\ & = \vec{\mathbf{x}} \cdot \sum_i \ x_i B_i \vec{\mathbf{x}} - \vec{\mathbf{x}} \cdot I \vec{\mathbf{x}} \\ & = \vec{\mathbf{x}} \cdot ( D-I) \vec{\mathbf{x}}\end{aligned}\]
where \(D = \sum \ x_i B_i\). It remains an exercise to show that \(D-I\) is negative (semi-)definite, which is equivalent to showing its eigenvalues are non-positive (negative). I have not been able to generalize the idea to every system yet.
----------------------
Using this new notation, we can translate the properties of groups, outlined in the introduction, to the language of dynamical systems if the population follows our birth-death process.
Unique Identity Element: the only invariant sub-manifold with order 1 is \(\mathbf{e}_1=(1,0,..,0)\).
Unique Inverse Element: \(B_1\) is symmetric.
Closure: the interior of the simplex \(S_n\) is invariant.
----------------------
This is clear as the only subgroup of order 1 of any group is \(\{e\}\), thus the only invariant manifold with order 1 is a population of just identity elements. On the other hand, if the only invariant manifold with order 1 is \((1,0,..0)\), then \(\{x_1\}\) is the only subgroup of order 1 of the underlying group. Thus, as \(e\) belongs to every (sub)group, \(x_1=e\).
For any \(x \in G\) with enumeration \(i_x\), there exists a 1 in the \(i_x\)th column of \(B_1\). This means there exists a \(y \in G\) s.t. \(x\ast y=e\). As \(B_1\) is symmetric, there is also a 1 in row \(i_x\) of the \(i_y\)th column of \(B_1\), implying \(y\ast x=e\). A remark: this is a poor characterization of the unique inverse element property, as it relies too heavily on the \(B_i\); I would prefer a more dynamical definition.
Follows from the proof of proposition 3.1.
----------------------
Aside from the poor definition of a unique inverse element, it would be an interesting question to ask: if a dynamical system satisfies the three characteristics above, does there exist an underlying group? Let's tentatively say that if a given dynamical system satisfies the above three properties, we call it a Group-Equivalent system. Then, if there exists an underlying group, the system is Group-Equivalent, i.e. satisfies the above properties; if a system is Group-Equivalent, does there exist an underlying group "driving" the system? A system driven by a group \(G\) is not unique, as adding a multiplicative constant does not change the dynamics.
Suppose a dynamical system has an underlying group structure. Then by Lagrange’s Theorem (the order of every subgroup of \(G\) divides the order of \(G\)) we know that the order of every invariant sub-manifold divides the order of \(S_n\) (recall \(|Int(S_n)|=n\)). Suppose now that a system is Group-Equivalent, and it is true that this implies an underlying group structure. Can we prove Lagrange’s Theorem using a dynamical approach?
The last few remarks foreshadow the possibilities of studying a dynamical system with an underlying group, or a Group-Equivalent system. Two ideas can next occur: first, we use dynamical systems theory to prove theorems in group theory; second, we use group theory to solve problems involving dynamical systems. I believe the former to be of little value outside of itself; the literature and research in group theory is already so large that this approach will likely lead to nothing novel in the field as a whole. The latter idea is more fruitful, I believe. If it is possible to transform a dynamical system into a system satisfying the properties outlined above, then we can apply very strong ideas from group theory to the system.
Further extensions include adding fitness levels to the populations and finding real-world applications of group-driven systems, both of which have been addressed, to some degree, in a companion paper to this, see Extensions of Population Dynamics of Algebraic Groups.
J. Hofbauer and K. Sigmund, The Theory of Evolution and Dynamical Systems. Cambridge University Press, 1988.
M. Hirsch, S. Smale, and R.L. Devaney, Differential Equations, Dynamical Systems & An Introduction to Chaos. Elsevier Academic Press, 2004.
J.A. Gallian, Contemporary Abstract Algebra. Houghton Mifflin, 2006.
C. Davidson-Pilon, Extensions of Population Dynamics of Algebraic Groups. Submitted to Journal of Theoretical Biology, 2011.
Next, suppose you are interested in the sample average of a dataset that exists on many computers. No problem, you think: you create a function that returns the sum of the elements and the count of the elements, send this function to each computer, and divide the sum of all the sums by the sum of all the counts.
Next, suppose you are interested in the sample median of that same distributed dataset. No problem you think, as you create a function that sorts the array and takes the middle element, and send this function to each computer, and - wait. That won't work.
The problem is that the median of medians is not equal to the median. This is why many SQL engines do not support a MEDIAN function (which has left many analysts scratching their heads). What's needed is an algorithm that can approximate the median while still being space efficient.
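A tiny example makes the failure concrete (the partition values here are made up):

```python
import statistics

# The same dataset, split across three "computers"
partitions = [[1, 3, 5], [2, 4, 9], [6, 7, 8]]
flat = [x for part in partitions for x in part]

true_median = statistics.median(flat)                # 5
median_of_medians = statistics.median(
    [statistics.median(p) for p in partitions])      # median of 3, 4, 7

print(true_median, median_of_medians)  # 5 4
```

No rearrangement of the per-computer medians recovers the true median; the full sorted order matters.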
First published in 2013 by the uber-practical and uber-intelligent Ted Dunning, the t-Digest is a probabilistic data structure for estimating the median (and more generally any percentile) from either distributed data or streaming data. Internally, the data structure is a sparse representation of the cumulative distribution function. After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids. What do I mean by that? Consider the following very lopsided distribution:
(It's really a mixture of Normals). The data has empirical CDF:
The t-Digest will literally eat this dataset, and try to represent the "interesting" points of its cumulative distribution. We can plot the internal sparse representation of the CDF:
The vertical lines denote where the digest thinks "interesting and relevant" centroids are. This is really neat. We can see the digest has only kept the parts of the CDF where the CDF is changing fastest. This makes sense: these are where quantities like the percentile are changing quickest. Flat portions of the CDF, like near x=3 and x=8, only need to be summarized by a few points.
Running a small test locally, I streamed 8 MB of Pareto-distributed data into a t-Digest. The resulting size was 5 KB, and I could estimate any percentile or quantile desired. Accuracy was on the order of 0.002%.
So we can stream in data to the t-Digest, but how can we use this data structure for map-reduce? In the map stage, we can process data into the t-Digest sequentially, so we have N t-Digests in N computers. At the reduce stage, we can define an addition operation on two t-Digests as follows: create a new t-Digest and treat the internal centroids of the two operand digests as incoming data. The result is an only slightly larger, but more accurate, t-Digest. Amazing: the t-Digest literally eats itself.
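A toy sketch of that addition (my own simplification: a "digest" here is just a sorted list of (mean, weight) centroids, and compression greedily merges the closest neighbours rather than using the real algorithm's quantile-dependent size limits):

```python
def add_digests(d1, d2, compress_to=5):
    """Merge two toy digests by treating all centroids as incoming weighted data."""
    centroids = sorted(d1 + d2)
    # Re-compress: repeatedly merge the closest pair of adjacent centroids,
    # replacing them with their weighted average.
    while len(centroids) > compress_to:
        i = min(range(len(centroids) - 1),
                key=lambda j: centroids[j + 1][0] - centroids[j][0])
        (m1, w1), (m2, w2) = centroids[i], centroids[i + 1]
        centroids[i:i + 2] = [((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)]
    return centroids

d1 = [(1.0, 10), (2.0, 5), (8.0, 3)]
d2 = [(1.5, 4), (7.0, 6), (9.0, 2)]
merged = add_digests(d1, d2, compress_to=4)
# Total weight is preserved; the merged digest stays within its size budget.
print(merged, sum(w for _, w in merged))
```

The real t-Digest bounds each centroid's weight by its position in the CDF, so tails stay finely resolved; the greedy merge above ignores that refinement.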
Interested in the algorithm, but without any code to read (I can't yet read Ted's implementation in Java), I wrote a semi-efficient t-Digest in Python (with helpers from Cython). You can find it in the CamDavidsonPilon/tdigest repo. There are some further optimizations to do, but it's probably efficient enough for most offline work!
I highly recommend Dunning's original paper on the t-Digest below - it's an easy read. He and other authors are constantly improving the data structure as well; see the issues here.
Lifetimes is my latest Python project. Below is a summary, but you can also check out the source code on Github.
As emphasized by P. Fader and B. Hardie, understanding and acting on customer lifetime value (CLV) is the most important part of your business's sales efforts. And (apparently) everyone is doing it wrong. Lifetimes is a Python library to calculate CLV for you.
More generally, Lifetimes can be used to understand and predict future usage based on a few lax assumptions:
Lifetimes can be used to both estimate if these entities are still alive, and predict how much more they will interact based on their existing history. If this is too abstract, consider these situations:
Really, "customers" is a very general term here (similar to "birth" and "death" in survival analysis). Whenever we have individuals repeating occurrences, we can use Lifetimes to help understand behaviour.
pip install lifetimes
Requirements are only Numpy, Scipy, Pandas.
The examples below use the cdnow_customers.csv file located in the datasets/ directory.
from lifetimes.datasets import load_cdnow
data = load_cdnow(index_col=[0])
data.head()
"""
x t_x T
ID
1 2 30.43 38.86
2 1 1.71 38.86
3 0 0.00 38.86
4 0 0.00 38.86
5 0 0.00 38.86
"""
x represents the number of repeat purchases the customer has made (also called frequency). T represents the age of the customer. t_x represents the age of the customer when they made their most recent purchase (also called recency).
We'll use the BG/NBD model first. Interested in the model? See this nice paper here.
from lifetimes import BetaGeoFitter
# similar API to scikit-learn and lifelines.
bgf = BetaGeoFitter()
bgf.fit(data['x'], data['t_x'], data['T'])
print bgf
"""
<lifetimes.BetaGeoFitter: fitted with 2357 customers, a: 0.79, alpha: 4.41, r: 0.24, b: 2.43>
"""
After fitting, we have lots of nice methods and properties attached to the fitter object.
Consider: a customer bought from you every day for three weeks straight, and we haven't heard from them in months. What are the chances they are still "alive"? Pretty small. On the other hand, a customer who historically buys from you once a quarter, and bought last quarter, is likely still alive. We can visualize this relationship using the Frequency/Recency matrix, which computes the expected number of transactions an artificial customer is to make in the next time period, given his or her recency (age at last purchase) and frequency (the number of repeat transactions he or she has made).
from lifetimes.plotting import plot_frequency_recency_matrix
plot_frequency_recency_matrix(bgf)
We can see that if a customer has bought 25 times from you, and their latest purchase was when they were 35 weeks old (given the individual is 35 weeks old), then they are your best customer (bottom-right). Your coldest customers are those in the top-right corner: they bought a lot quickly, and we haven't seen them in weeks.
There's also that beautiful "tail" around (5,25). That represents the customer who buys infrequently, but we've seen him or her recently, so they might buy again - we're not sure if they are dead or just between purchases.
Let's return to our customers and rank them from "highest expected purchases in the next period" to lowest. Models expose a method that will predict a customer's expected purchases in the next period using their history.
t = 1
data['predicted_purchases'] = data.apply(lambda r: bgf.conditional_expected_number_of_purchases_up_to_time(t, r['x'], r['t_x'], r['T']), axis=1)
data.sort('predicted_purchases').tail(5)
"""
x t_x T predicted_purchases
ID
509 18 35.14 35.86 0.424877
841 19 34.00 34.14 0.474738
1981 17 28.43 28.86 0.486526
157 29 37.71 38.00 0.662396
1516 26 30.86 31.00 0.710623
"""
Great, we can see that the customer who has made 26 purchases, and bought very recently from us, is probably going to buy again in the next period.
Ok, we can predict and we can visualize our customers' behaviour, but is our model correct? There are a few ways to assess the model's correctness. The first is to compare your data versus artificial data simulated with your fitted model's parameters.
from lifetimes.plotting import plot_period_transactions
plot_period_transactions(bgf)
We can see that our actual data and our simulated data line up well. This proves that our model doesn't suck.
Most often, the dataset you have at hand will be at the transaction level. Lifetimes has some utility functions to transform that transactional data (one row per purchase) into summary data (a frequency, recency and age dataset).
from lifetimes.datasets import load_transaction_data
from lifetimes.utils import summary_data_from_transaction_data
transaction_data = load_transaction_data()
transaction_data.head()
"""
date id
0 2014-03-08 00:00:00 0
1 2014-05-21 00:00:00 1
2 2014-03-14 00:00:00 2
3 2014-04-09 00:00:00 2
4 2014-05-21 00:00:00 2
"""
summary = summary_data_from_transaction_data(transaction_data, 'id', 'date', observation_period_end='2014-12-31')
print summary.head()
"""
frequency recency T
id
0 0 0 298
1 0 0 224
2 6 142 292
3 0 0 147
4 2 9 183
"""
bgf.fit(summary['frequency'], summary['recency'], summary['T'])
# <lifetimes.BetaGeoFitter: fitted with 5000 customers, a: 1.85, alpha: 1.86, r: 0.16, b: 3.18>
With transactional data, we can partition the dataset into a calibration period dataset and a holdout dataset. This is important as we want to test how our model performs on data not yet seen (think cross-validation in standard machine learning literature). Lifetimes has a function to partition our dataset like this:
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(transaction_data, 'id', 'date',
calibration_period_end='2014-09-01',
observation_period_end='2014-12-31' )
print summary_cal_holdout.head()
"""
frequency_cal recency_cal T_cal frequency_holdout duration_holdout
id
0 0 0 177 0 121
1 0 0 103 0 121
2 6 142 171 0 121
3 0 0 26 0 121
4 2 9 62 0 121
"""
With this dataset, we can perform fitting on the _cal columns and test on the _holdout columns:
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
Based on customer history, we can predict what an individual's future purchases might look like:
t = 10 #predict purchases in 10 periods
individual = summary.iloc[20]
# The below function may be renamed to `predict` in a future version of lifetimes
bgf.conditional_expected_number_of_purchases_up_to_time(t, individual['frequency'], individual['recency'], individual['T'])
# 0.0576511
Drop me a line at @cmrn_dp!
The data analysis was done using demographica.
%magic functions.
Today, I've opened up my startup scripts in a GitHub repo, StartupFiles. The repo comes with some helper scripts too, to get you started:
./bin/build_symlink: for linking your startup directory to a local version of the repo.
./bin/ipython_analysis: for going through your IPython history and looking for simple recommendations to add (example below).
Here are some small videos about what you can do with the new startup scripts:
Just because I am always misspelling these!
Check out StartupFiles and let me know what you think below!
Richard Dawkins, in his early book The Extended Phenotype, describes what he means when he says "statistically, X occurs". His original motivation was addressing a comment about gender, but it applies more generally:
If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information about an individual's sex would be able to make a statistically more accurate prediction as to the person's musical ability than an observer ignorant of the person's sex. The emphasis is on the word “statistically”, and let us throw in an “other things being equal” for good measure. The observer might be provided with some additional information, say on the person's education or upbringing, which would lead him to revise, or even reverse, his prediction based on sex. If females are statistically more likely than males to enjoy knitting, this does not mean that all females enjoy knitting, nor even that a majority do. [emphasis added]
I really enjoy his precise description of what statistics is. Ignore distributions, modelling, p-values and other statistical ideas for a moment: what statistics is really interested in is
For some event \(A\), what characteristic \(X\) allows \(P( A | X ) > P(A)\).
For example, in Dawkins' case, \(A\) = person is female, and \(X\) = person enjoys knitting. Thus \(P(\text{person is female}\; |\; \text{person enjoys knitting}) > P(\text{person is female})\).
Of course, in reality I don't have this much perfect information: I may have the ideal X, but not have enough data points to determine that \(X\) is indeed significant. Conversely, I may have lots of data, but not the correct covariate \(X\).
In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join: typically we think of joining rows on equal keys, but more generally, you could join two rows on any function that returns a boolean: \( l, r \mapsto \theta(l,r) \). We call these generalized joins \(\theta\)-joins, or theta-joins. Because of the arbitrariness of the \(\theta\) function, theta-joins can be far more computationally challenging. Here are some examples of theta-joins:
R.key < L.key
|R.key - L.key| < 2
R.key1 = L.key1 and R.key2 = L.key2
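Naively, a theta-join is just a filtered cross-product over the whole join-space; this sketch makes the predicates above concrete:

```python
from itertools import product

def theta_join(left, right, theta):
    # Check every (l, r) pair in the join-space against the predicate.
    return [(l, r) for l, r in product(left, right) if theta(l, r)]

L = [(1,), (2,), (3,)]
R = [(2,), (4,)]

print(theta_join(L, R, lambda l, r: r[0] < l[0]))           # R.key < L.key
print(theta_join(L, R, lambda l, r: abs(r[0] - l[0]) < 2))  # |R.key - L.key| < 2
```

An equality join is the special case `theta = lambda l, r: l[0] == r[0]`.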
And implementing a theta-join in MapReduce is even more challenging! If you recall the previous join algorithm, where we implemented an equality-join (so \(\theta\) was the equality operator), we used an implicit trick: rows with the same key on the lefthand side and the righthand side would map to the same reducer in the groupByKey step. Really, no joining is actually done pre-reducer, just organizing. So how do we implement a theta-join in MapReduce?
Unfortunately, the problem is not solved in general. For very specific join types there might exist fragile tricks one can use, but they are not generalizable. In order to accommodate any join, you'll need to check all elements of the RHS (righthand side) against all elements of the LHS (lefthand side). This is inefficient, or just generally crappy, so how can we reduce the crappiness? The authors of a pretty accessible paper, Processing Theta-Joins using MapReduce, present an efficient way to partition this cross-product (what they call the join-space) across the reducers. What's unique about this algorithm is that, because of its randomness (yes, it's stochastic), it handles the skew problem common in MapReduce joins gracefully.
Below is a PySpark implementation of the 1-Bucket-Theta join:
And here is some example usage with datetimes; this might be called an interval-join, as I'm searching for rows that have a datetime falling into an interval.
Sorry, but the above implementation is just too inefficient: the join-space I mentioned above, which is the cross-product of the datasets, is simply too big. Most rows will not be included in the result set, yet we check them anyway. This is the burden of the theta-join. Though I don't follow suit, the paper does examine gains from using summary statistics of the LHS and RHS to improve the performance of certain types of theta-joins.
In traditional databases, the JOIN algorithm has been exhaustively optimized: it's likely the bottleneck for most queries. On the other hand, MapReduce, being so primitive, has a simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).
We start off with two datasets that might resemble the ones below:
# left # right
(1, 'A', 'B' ) | (1, 'Z', 'Y' )
(2, 'C', 'D' ) | (1, 'X', 'V' )
(2, 'E', 'F' ) | (2, 'W', 'U' )
(3, 'E', 'F' ) | (4, 'T', 'S' )
We need to specify a key for each record, something that we will use to join on. In PySpark, there is a keyBy function that allows you to set a key. Below is the resulting output.
left = left.keyBy(lambda r: r[0])
right = right.keyBy(lambda r: r[0])
# left # right
(1, (1, 'A', 'B') ) | (1, (1, 'Z', 'Y' ))
(2, (2, 'C', 'D') ) | (1, (1, 'X', 'V' ))
(2, (2, 'E', 'F') ) | (2, (2, 'W', 'U' ))
(3, (3, 'E', 'F') ) | (4, (4, 'T', 'S' ))
Cool. Next we will add some values to denote the origin of the record: 1 if it is from left, and 2 if it is from right.
left = left.map(lambda (k, v): (k, (1, v)))
right = right.map(lambda (k, v): (k, (2, v)))
# left # right
(1, (1, (1, 'A', 'B')) ) | (1, (2, (1, 'Z', 'Y')) )
(2, (1, (2, 'C', 'D')) ) | (1, (2, (1, 'X', 'V')) )
(2, (1, (2, 'E', 'F')) ) | (2, (2, (2, 'W', 'U')) )
(3, (1, (3, 'E', 'F')) ) | (4, (2, (4, 'T', 'S')) )
# We now have (key, (origin, values)) tuples
Next we add the datasets together, using the union function.
unioned = left.union(right)
# unioned
(1, (1, (1, 'A', 'B')) ) # key = 1
(2, (1, (2, 'C', 'D')) ) # key = 2
(2, (1, (2, 'E', 'F')) ) # key = 2
(3, (1, (3, 'E', 'F')) ) # key = 3
(1, (2, (1, 'Z', 'Y')) ) # key = 1
(1, (2, (1, 'X', 'V')) ) # key = 1
(2, (2, (2, 'W', 'U')) ) # key = 2
(4, (2, (4, 'T', 'S')) ) # key = 4
Up to now, we've really only been mapping. Next we want to start reducing. We start by performing a groupByKey on the dataset, which will send items with the same key to the same reducer. The groupByKey will also collect all records with the same key into a list, as we see below:
grouped = unioned.groupByKey()

# grouped (each key goes to a different reducer)
(1, [(1, (1, 'A', 'B')), (2, (1, 'Z', 'Y')), (2, (1, 'X', 'V'))])
(2, [(1, (2, 'C', 'D')), (1, (2, 'E', 'F')), (2, (2, 'W', 'U'))])
(3, [(1, (3, 'E', 'F'))])
(4, [(2, (4, 'T', 'S'))])
We now have (key, [list of all elements with that key]) pairs.
Oh my god we're almost done! Finally, we take each array and map a function that will perform a mini-cross product:
result = grouped.flatMapValues(lambda x : dispatch(x))
# Take each value (the list in the second position), and perform the dispatch function (below) on it.
# For example, let's look at the first element of grouped, i.e. key=1:
def dispatch(seq):
left_origin = []
right_origin = []
for (n, v) in seq:
if n == 1:
left_origin.append(v)
elif n == 2:
right_origin.append(v)
return [(v, w) for v in left_origin for w in right_origin]
For example:
> d = [(1, (1, 'A', 'B')), (2, (1, 'Z', 'Y')),(2, (1, 'X', 'V'))]
> dispatch(d)
[
((1, 'A', 'B'), (1, 'Z', 'Y')),
((1, 'A', 'B'), (1, 'X', 'V'))
]
Which is the correct result for joining left's key-1 records against right's key-1 records.
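For a self-contained check, the whole pipeline can be mimicked in plain Python, with a dictionary standing in for the groupByKey shuffle (a sketch of the logic, not of Spark's machinery):

```python
from collections import defaultdict

left  = [(1, 'A', 'B'), (2, 'C', 'D'), (2, 'E', 'F'), (3, 'E', 'F')]
right = [(1, 'Z', 'Y'), (1, 'X', 'V'), (2, 'W', 'U'), (4, 'T', 'S')]

# keyBy + tag the origin (1 = left, 2 = right) + union
unioned = [(r[0], (1, r)) for r in left] + [(r[0], (2, r)) for r in right]

# groupByKey: all records with the same key land in one bucket ("reducer")
grouped = defaultdict(list)
for k, v in unioned:
    grouped[k].append(v)

# dispatch: mini cross-product of left-origin vs right-origin records
result = {k: [(l, r) for (o1, l) in vs if o1 == 1
                     for (o2, r) in vs if o2 == 2]
          for k, vs in grouped.items()}

print(result[1])
```

Keys 3 and 4 produce empty lists, which is exactly the inner-join behaviour: unmatched records disappear.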
This is a neat way to perform a distributed map-reduce join. But there are limitations:
This latter problem is called a \(\theta\)-join, where in the case of an equality join \(\theta\) is the equality operator. We'll explore how to solve both these problems at once in the next blog post.
I've enumerated only four, but there are many others. Below is a video explaining one reason why we see powerlaw distributions.
I really like this video because it gives a very intuitive reason why powerlaws can exist. It also cements that the winners become winners scheme will create powerlaws and extreme inequalities.
It's tempting to go in reverse: given you see a powerlaw, the distribution must have been generated by a winners become winners scheme. Unfortunately, this is not true: there are other schemes that can create powerlaws. But I would go as far as to say that most empirical powerlaws we see are generated by winners become winners. The trick is to find out why winners become winners.
The above gives a good generating scheme of how powerlaws arise for quantities (books sales, wealth, followers), but what about durations? There exist durations that are powerlawed: response times in server requests are powerlawed, ages of technologies are powerlawed. The generating scheme winners become winners doesn't carry over well to the time domain. Is there an analogy we can use?
Update: Françoise P., in the comments, offers an interesting solution. Packet arrival times in networks often follow a powerlaw. There is another, similar case. Consider a random walk over the integers, starting at 0, with a 0.5 chance of moving up one and a 0.5 chance of moving down one. The distribution of the duration between returns to 0 is powerlawed: the probability of a duration of length \(n\) decays like a power of \(n\) (asymptotically proportional to \(n^{-3/2}\)).
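A quick simulation of these return times shows the heavy tail (a sketch; the step cap and sample size are arbitrary choices of mine):

```python
import random

def first_return_time(max_steps=10_000):
    """Steps until a +/-1 random walk first returns to 0 (capped at max_steps)."""
    pos, t = 0, 0
    while True:
        pos += random.choice((-1, 1))
        t += 1
        if pos == 0 or t >= max_steps:
            return t

random.seed(0)
times = [first_return_time() for _ in range(10_000)]
# Heavy tail: many walks return in 2 steps, but some take orders of
# magnitude longer than the typical return.
print(min(times), max(times))
```

Standard random-walk theory puts the tail of this distribution at roughly \(n^{-3/2}\), which is a power law, unlike the exponentially-thin tails of, say, a geometric distribution.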
If you have been following this blog, you'll know that I employ Bayesian A/B testing for conversion tests (and see this screencast to see how it works). One of the strongest reasons for this is the interpretability of the analogous "p-value", which I call Confidence, defined as the probability the conversion rate of A is greater than B,
$$\text{confidence} = P( C_A > C_B ) $$
Really, this is what the experimenter wants - an answer to: "what are the chances I am wrong?". So if I saw a confidence of 1.0 (or as I report it, 100%), I would have to conclude that I know for certain which group is superior.
In practice, I've encountered a 100% confidence a few times, most often in two situations: when my sample size is very low, and when my sample size is very high. The latter situation makes sense: I have lots of data, enough that my testing algorithm can confidently conclude a winner. The former, when I have a small sample size, makes less sense - at first. If the difference between the groups is small, it doesn't make sense for my algorithm to be so certain of a winner. What is actually happening is a breakdown between theory and practice.
Let's go back to the beginning. What are the assumptions of A/B testing? Luckily, they are quite short:
1. Before any treatment, your groups are the same.
That's it! The A/B testing algorithm proceeds by applying a treatment to one of the groups and you measure the results in both groups. In practice however, our groups are not the same. The best we can do is have them "statistically" equal, that is to say, equal on average. In practice, we need to rely on the Law of Large Numbers to make our groups statistically equal. Unfortunately, the LLN works for large sample sizes only - you can't assume equal, or even similar, populations with small sample sizes.
With small sample sizes, and even with medium sample sizes, your groups will be different and there is nothing you can do about it. Perhaps just by chance, more people who were going to convert were put in Bucket A, and thus the treatment's effect is dwarfed by this chance event. This is the class imbalance problem. This is why I was seeing 100% confidence with small sample sizes: by chance a bunch of converters were put into a single bucket and my A/B test algorithm was assuming equal populations.
Let's make this more formal. Define delta, denoted \(\delta\) as the observed difference between the two groups. \(\delta\) can be broken down into two parts:
$$ \delta = \delta_{ci} + \delta_{T}$$
where \(\delta_{T}\) is the difference in conversion rates due to the treatment, and \(\delta_{ci}\) is the difference due to the class imbalance problem. I'm curious about the variance in the observed \(\delta\), that is, how extreme might my observed \(\delta\) be. Taking the variance of both sides:
$$ \text{Var}(\delta) = \text{Var}(\delta_{ci}) + \text{Var}(\delta_{T}) $$
(As the two terms on the right hand side are independent, there will be no correlation term).
Let's do an A/A test: that is, let's not apply a treatment to either group. Then \(\delta_{T}=0\). So any variance we see in the observed delta is due purely to variance from the class imbalance problem. Hence in an A/A test:
$$ \text{Var}(\delta) = \text{Var}(\delta_{ci})$$
To find the variance, we'll perform some Monte Carlo simulations of a typical A/A test. Here are the steps:
Below is the Python code for this simulation:
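A minimal sketch of such an A/A simulation (my reconstruction from the description above: \(N\) individuals, each converting with probability \(p\), split uniformly at random into two buckets, no treatment applied):

```python
import numpy as np

rng = np.random.default_rng(0)

def observed_delta(N, p):
    # One A/A experiment: no treatment, so any observed difference in
    # conversion rates is due purely to class imbalance.
    converted = rng.random(N) < p     # who was going to convert anyway
    bucket_a = rng.random(N) < 0.5    # random assignment into bucket A
    return converted[bucket_a].mean() - converted[~bucket_a].mean()

def var_delta(N, p, trials=2000):
    return np.var([observed_delta(N, p) for _ in range(trials)])

# Variance of the class-imbalance delta shrinks as N grows (the LLN at work).
print(var_delta(100, 0.05), var_delta(10_000, 0.05))
```

Sweeping this over a grid of \(N\) and \(p\) values produces the variance surface discussed next.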
Below is a grid of the observed variances, varying \(p\) and \(N\):
White is good: it implies the LLN has taken over and there is little variance in the observed \(\delta\). This is where an A/B experiment works: observed difference in the groups are highly likely to be the result of the treatment. If this is hard to read due to the lack of contrast, below is the log-plot of the above:
It means you have to distrust A/B results with small sample sizes, at least if the variance of the class imbalance problem is of the same order of magnitude as the variance of the treatment effect.
Furthermore, this is another reason why we see these huge "unicorn" experiments (ugh, a term coined by a recent article that claimed a 1000% improvement from an experiment). Suppose there is a positive increase from the treatment, and that group was assigned significantly more converters just by chance (and the sample size is small enough to allow this); then your estimate of "lift" is going to be much larger than it actually is. Compound this with mistaking the binary problem for the continuous problem, and you've got the recipe for the 1000% improvement fallacy.
Hopefully, you've learned from my mistake: don't believe anything with small sample sizes, especially if your treatment effect is very small. You can use the above figures to determine the variance at different population sizes and conversion probabilities (how do you know the conversion probability? Well, you don't - it's best to estimate it with the total conversion rate: total converts / total population). If the variance at that point is too high, then likely your observed delta is being influenced strongly by the class imbalance problem and not the treatment effect. You can find a csv of the data in the gist here.
Thanks to Chris Harland and Putra Manggala for their time discussing these ideas.
It's clear that humans are irrational, but how irrational are they? After some research into behavioural economics, I became very interested in Prospect Theory, a modern theory about human behaviour. Part of Prospect Theory is determining how our brains value different outcomes under uncertainty, that is, how expected values are mentally calculated. A very interesting part of Prospect Theory is that it is not probabilities that are used in the calculation of expected value:
Here, the q's are not the probabilities of outcome z; they come from another probability measure, called decision weights, that humans actually use to weigh outcomes. Using a change of measure, we can observe the relationship between the actual probabilities and the decision weights:
My interest is in this change of measure.
Suppose you have two choices:
Which would you prefer?
Well, under the real-world probability measure, these two choices are essentially equal: 0.99 * 101 ≈ 0.01 * 10000. Thus a rational human would be indifferent to either option. But an actual human would have a preference: they would see one as more valuable than the other. Thus:
rewritten:
and dividing:
What's left to do is determine the direction of the first inequality.
So I created combinations of probabilities and prizes, all with equal real-world expected value, and asked Mechanical Turkers to pick which one they preferred:
The original HIT data and the Python scripts that generate it are below, plus the data that I just now received back from MTurk. Each pair of lotteries received 10 Turkers' votes.
Note: The Turking cost me $88.40, if you'd like to give back over Gittip, that would be great =)
Note: I called the first choice Lottery A and the second choice Lottery B.
Below is a slightly inappropriate heatmap of the choices people made. If everyone was rational, and hence indifferent to the two choices, the probabilities should hover around 0.5. This is clearly not the case.
What else do we see here?
I recorded my steps to take the raw Turker data, and create the visualization above:
Why did I ask the Turkers to deeply imagine winning $50 before answering the question? This was to offset a potential anchoring effect: if a Turker's first choice had a prize of $10 000, then any other prize would have looked pitiful, as the anchor had been set at $10 000. By having them imagine winning $50 (lower than any prize), any prize they later saw would appear better than this anchor.
Next steps? I'd like to try this again, with more control over the Turkers (have a more diverse set of Turkers on it).
Can I see the code and raw data? Sure, here it is on Github!
One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to use data to discover novel details about society. The one chapter that I still deeply remember is their chapter on using first names from the US Census data to understand how first names change over time.
For example, the first name "Sandra" has been a very popular name over the past century, while "Brittany" had a brief period of intense popularity during the 80s and early 90s before fading out. What are the results of this difference? If we extrapolate the year of birth until now, then most "Brittany"s are now between the ages of 15-29. In fact, you can be almost certain they are between these ages. On the other hand, "Sandra" has had more sustained popularity, but the name peaked in 1947 and has declined since. Below we plot the current ages of everyone in the US with the first name "Brittany" or "Sandra".
Freakonomics was the first book to bring this really cool idea, using first names to determine ages, to light. I was so inspired that I rushed out and gathered all the census data so I could play around with this myself. This inspiration culminated in the Demographica library (Greek for "population-drawing"), an easy way to query this data and answer questions about first names and age. Ex:
To do this, we can just look at the most common names in the 85 years and over bucket:
And what are the most common boy names these days? Similar to above: Jacob, Mason, Noah, Ethan, William, Michael and Jayden. What about your age? What are the most common names of people in your age bucket?
First of all, I was surprised this worked. I was curious about finding all the names that were really hot right now, or about to be really hot. For example, "Channing" is really hot right now:
I tried to think of some statistical ways to do this: compute the rate of change of the graph and determine some statistical thresholds, or create some complicated if-statements, etc. I didn't feel these were inspired ideas. The idea of k-means crossed my mind, but I don't like using k-means on time series data - it just feels wrong. But I did it anyways. Using some Pandas magic, I normalized each male name to a probability distribution over the age buckets, and removed the bottom 90% of all names: this resulted in a list of 1300 popular names and their age distributions. (Why remove the bottom names? Name popularity is power-lawed - most names don't have enough data to form an accurate probability distribution.)
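Sketched in code, the normalization and clustering steps might look roughly like this, with randomly generated counts standing in for the census data (the bucket names and numbers below are placeholders, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical raw counts: rows are names, columns are age buckets.
rng = np.random.default_rng(0)
names = pd.DataFrame(
    rng.integers(1, 1000, size=(50, 5)),
    columns=["0-14", "15-29", "30-44", "45-64", "65+"],
)

# Drop the least popular names (name popularity is power-lawed,
# so rare names don't have enough data for a stable distribution).
totals = names.sum(axis=1)
popular = names[totals > totals.quantile(0.10)]

# Normalize each name's counts into a probability distribution over buckets.
dists = popular.div(popular.sum(axis=1), axis=0)

# Cluster the distributions into 12 clusters, as in the post.
km = KMeans(n_clusters=12, n_init=10, random_state=0).fit(dists)
centers = km.cluster_centers_  # each center is itself roughly a distribution
```

With the real census data, each cluster center is an "archetypal" age distribution, and the names assigned to a cluster share that shape.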
Using scikit-learn and its K-Means implementation, I clustered the distributions into 12 clusters (arbitrarily picked 12). Here are the resulting centers of the clusters:
There's some pretty neat stuff happening here:
You may laugh at that last list, until your own daughter brings home a Bently to meet the family
Here's my raw IPython code if you want to reproduce the above analysis (I used %hist in my terminal to generate this)
Me too, so I did the same analysis for Female names, below:
My friend who I was hacking on this with asked: "Find some names that were massively popular for a very short period of time." To do this, I increased k to 35 and plotted the cluster centers. (Why? I knew a name like that would have a very spiked distribution, and if I plotted it, it would stand out among the others - but I needed more granularity in the clusters to find it.)
Bingo. That peaked distribution contains the names we want. Only two names were in that list: Catina and Katina. Both were massively popular about 40 years ago. A quick Google search on Katina (the more popular of the two) determined the name was used for a newborn baby on the soap opera 'Where the Heart Is' in 1972.
Demographica is good for looking at age distributions of certain names, but it has another cool feature: population age inference. Let's say you run a small ecommerce website, and each order has an associated name with it. You are probably curious about the age distribution of your customers. The next steps of Demographica will enable the user to infer the age distribution of their users. How to do this?
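Here's a rough sketch of that inference with toy numbers (the bucket distributions below are made up for illustration, and this is not Demographica's actual API): look up each customer name's age distribution, then average them to estimate the customer population's age distribution.

```python
import numpy as np

# Hypothetical per-name age distributions over buckets (toy numbers).
age_buckets = ["0-14", "15-29", "30-44", "45-64", "65+"]
name_dists = {
    "brittany": np.array([0.02, 0.80, 0.15, 0.02, 0.01]),
    "sandra":   np.array([0.01, 0.05, 0.20, 0.45, 0.29]),
}

def infer_population_ages(customer_names):
    """Average the age distributions of each recognized customer name."""
    known = [name_dists[n] for n in customer_names if n in name_dists]
    return np.mean(known, axis=0)

# Names attached to the site's orders.
orders = ["brittany", "sandra", "brittany"]
population_dist = infer_population_ages(orders)
```

With two Brittanys and one Sandra, the inferred distribution skews toward the 15-29 bucket, which is exactly the behavior you'd want from this kind of inference.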
That's the idea, at least. I'll report back with how it goes!
Gelman is probably the leader in modern Bayesian inference - he's the author of Bayesian Data Analysis, a book so popular that it can be referred to by its initials, BDA, and everyone knows what you are talking about. His blog is very active (and has an associated Twitter account too) and he has great discussions on modelling, exposing bad analysis, MCMC, and statistical inference. One time he mentioned Bayesian Methods for Hackers and my heart melted.
Jeff Leek and co. are doing a great job with Simply Statistics. The blog is less technical than Gelman's, and focuses more on where statistics fits in with science, big data and data science. Recently, they have embarked on an amazing project: replicating the data analysis in Piketty's "Capital in the Twenty-First Century".
Miller's blog articles, no matter how old, keep appearing on popular news sites, and it's well deserved. I still recall how excited I got reading "How not to run an A/B test" for the first time. Recently, he's been playing around with Bayesian A/B testing and survival analysis too, so clearly he is awesome.
Rasmus blew my mind with his terrific articles on Bayesian testing (below). His writing style is very clean with lots of custom graphics - you can tell he takes his time writing his articles. Rasmus is also the author of Bayesian First Aid, a bayesian testing framework for R.
Vanderplas, who is likely a robot and doesn't sleep, has made great contributions to the Python ecosystem: he's the author of mpld3, a translation of matplotlib figures to D3 for ipython notebooks, the amazing xkcd matplotlib styles, and he's been a leader in teaching python data analysis through conferences and lectures. His blog is an extension of his work: amazing tutorials, projects, and all very readable. It's really really difficult to only pick a sample to present:
Downey, probably the most prolific writer on this list, is the author of the "Think [Statistics, Python, Bayes, Complexity]" series. His blog is often his sketch pad before the book, and is full of fun articles. When learning survival analysis myself, I kept going back to his article (below) just to reinforce the application.
Without Flaxman's blog, I would probably not have understood Bayesian computational ideas. During my 2-day seclusion to grok Bayesian methods, and later while I was developing my own tools, I constantly read and reread his articles on PyMC. His blog is still very active, and the research he produces on it (and yes, it is research) is terrific.
The team at Yhat have a really good blog, mostly of guest bloggers doing really cool things with data. Yhat is also the author of the python port of ggplot (which is pretty remarkable that it was done at all).
After pulling a few graphs locally, sampling colors, and crowd-sourcing the fonts used, I was able to come pretty close to replicating the style in Matplotlib styles. Here's an example (my figure dropped into an article on FiveThirtyEight.com)
Another example using the replicated styles:
[Edit: these steps are old, you can still use them, but there is a better solution in the Update below]. I've prepacked the styles in a json file, located in this gist. Saving that file locally, you can use the following code to load it temporarily into your matplotlib styles: code here.
I'm not sure what software the team at FiveThirtyEight uses (Excel, Stata?), but there is definitely some post-processing of whatever the software spits out. The same can be done here, in which case you could achieve complete replication.
Here's an R version that @AustinClemens2 worked out.
New way: Matplotlib 1.4 now handles changing styles really well, using the plt.style.use('ggplot') syntax. (See full instructions here.) I submitted a PR to have the FiveThirtyEight style included in Matplotlib, so now you can type plt.style.use('fivethirtyeight')! I also include the popular Bayesian Methods for Hackers color scheme, plt.style.use('bmh'). Give it a try!
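A minimal sketch of the new API (the style names below ship with Matplotlib 1.4+; the Agg backend is set so this runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("fivethirtyeight")  # or "ggplot", or "bmh"

# Any figure drawn after this point picks up the style automatically.
x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.title("A FiveThirtyEight-styled figure")
plt.legend()
plt.savefig("fivethirtyeight_demo.png")
```

Note that plt.style.use applies globally; plt.style.context("fivethirtyeight") can be used as a context manager if you only want the style for one figure.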
I feel like there is a misconception in performing A/B tests. I've seen blogs, articles, etc. that show off the result of an A/B test, something like "converted X% better". But this is not what the A/B test was actually measuring: an A/B test is measuring "which group is better" (the binary problem), not "how much better" (the continuous problem).
In practice, here's what happens: the tester waits until the A/B test is over (hence solving the binary problem), but then tries to solve the continuous problem, typically by using the estimate: $$ \frac{ \hat{p_B} - \hat{p_A} }{ \hat{p_A} } $$ This is a bad idea. It takes orders of magnitude more data to solve the continuous problem - the data you have is simply not enough. Tools like Optimizely are susceptible to this problem of trying to solve the continuous problem using too little data, and haven't really addressed it either. (On the other hand, I was happy to see RichRelevance understand this problem; I'll quote them later in this article.) Likely most companies' internal experiment dashboards have the same problem. This leads to hilarious headlines like:
We'd like to use the binary solution attempt to solve the continuous problem. The first problem is that the estimate of relative increase, from above, has incredibly high variance: the estimate jumps around widely. For example, let's consider data from the following A/B test:
| | Total | Conversions |
|---|---|---|
| Group A | 1000 | 9 |
| Group B | 9000 | 150 |
With this data we can solve the binary problem: there is 97% chance that group B's conversion rate is greater than group A's. The Bayesian A/B testing screencast explains how to calculate this. So using the naive estimate above, the "lift" would be 85%. Here's the distribution of possible relative increases, computed using the posteriors of A and B:
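As a sketch of both calculations in NumPy, using the table's numbers and uniform Beta(1, 1) priors (the priors are my choice here, not something specified above):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Posterior samples under Beta(1, 1) priors:
# Beta(1 + conversions, 1 + non-conversions).
p_A = rng.beta(1 + 9, 1 + 1000 - 9, size=N)
p_B = rng.beta(1 + 150, 1 + 9000 - 150, size=N)

# The binary problem: which group is better?
prob_B_better = (p_B > p_A).mean()  # roughly 0.97

# The continuous problem: the posterior distribution of relative increase.
relative_increase = (p_B - p_A) / p_A

# The naive point estimate -- the "85% lift" quoted above.
naive_lift = (150 / 9000 - 9 / 1000) / (9 / 1000)

# A more conservative estimate: a low percentile of the posterior.
conservative_lift = np.percentile(relative_increase, 20)
```

The spread of `relative_increase` is what the plotted distribution shows: the naive estimate is just one optimistic point in a very wide distribution.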
This graph makes me lol. Experiment dashboards will confidently proclaim an 85% lift and be done with it, but just look at the variability! We need to pick a better estimate from this distribution.
What we need to do is pick a conservative estimate from this distribution:
Here are the different estimates on the same distribution:
To quote a recent whitepaper from RichRelevance:
Traditionally, we have examined statistical significance in testing using the null hypothesis method of test measurement—which can be misleading and cause erroneous decisions. For example, it can be tempting to interpret results as having strong confidence in lift for key metrics, when in reality, the null hypothesis method only measures the likelihood of a difference between the two treatments, not the degree of difference.
❤️ this - they continue:
The Bayesian method actually calculates the confidence of a specific lift, so the minimum lift from a test treatment is known. For this particular A/B test, we were able to conclude that RichRecs+ICS delivered 5% or greater lift at a 86% confidence using Bayesian analysis.
What we can determine from this is that they chose the 14th percentile (100% - 86%), which showed a 5% increase (a solution to the continuous problem). I've never met anyone from RichRelevance, nor have I used their software, but they are doing something right!
Source. The logic, from what they have said, is actually wrong too: a user who got this far into the funnel (they clearly tried to add email forwarding) is already more likely to convert than the average user. Combined with the likely small sample size, we get some ridiculous 350% lift.
Source. This is a terrible offender. They were kind enough to give us all the data too:
Their arrows point to what's wrong: an incredibly small sample size, and a total mixup of the binary problem vs. the continuous problem. The "probability of outperforming original" can only be determined using Bayesian A/B testing. I've tried a bunch of priors and could not get 95.2%, so they are not using Bayesian methodology, and the label is wrong.
What these technologies have in common is that they are all deterministic engineering solutions. By that, I mean they have been created by techniques from mathematics, physics and engineering: often modeled in a mathematical language, guided by physics' calculus, and constrained and brought to life by engineering. I argue that these types of problems, of modeling deterministically, are problems that our fathers had the luxury of solving.
Consider the universe of all problems. It is a large universe, no doubt. For simplicity, think of it as a three-dimensional space, though in reality it is infinite-dimensional. The sub-space of all problems that can be solved by modeling deterministically, like building a bridge or modeling an airplane, constitutes a line (or a higher-dimensional equivalent of a line, if you wish) in our three-dimensional space. In mathematics, a line in 3-D space has measure 0: essentially, its contribution to filling the space is negligible.
The problems we solved in the 20th Century, like flight, radio and digital computers, lie on this line. Using our current mathematics and engineering knowledge, we are reaching the limits of exploring that line: after all, the easy problems have already been solved (this brings to mind the tautology "Science is hard because all the easy problems are solved"). Much on that line remains undiscovered, but major progress has slowed considerably; improvements are marginal.
What I am arguing is that our previous problem-solving recipe of 1. model, 2. apply mathematics, 3. ???, 4. profit does not have the same power it used to have when applied to current-day, and future, problems. We need to, and are starting to, explore problems off of this measure-0 line, like the red dot in the figure above. So, if this line characterizes all deterministic modeling problems, what problems might lie off this line? Statistical problems.
Statistical problems describe the space we haven't explored yet. Statistical problems are not new: they are likely as old as deterministic problems. What is new is our ability to solve them. Spearheaded by the (constantly increasing) tidal wave of data, practitioners are able to solve new problems otherwise thought impossible. Consider the development of a spellchecker: a deterministic approach would have needed to incorporate context and complicated ideas from the language's grammar (I shudder at the nested if-statements), and would work for only that one language; whereas a statistical approach can be written in under 20 lines. The difference between the two approaches is that the latter takes advantage of the presence of a large corpus of text -- a very lenient assumption.
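As a sketch of the statistical approach (in the spirit of Peter Norvig's famous short spell corrector, with a toy corpus standing in for a real one):

```python
import re
from collections import Counter

# Tiny stand-in corpus; a real spell checker would count words
# from a large body of text in the target language.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known word among the candidates."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```

No grammar rules, no language-specific if-statements: the corpus frequencies do all the work, and swapping in a corpus of another language gives a spell checker for that language for free.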
This isn't another big data article, but it's hard to overestimate, let alone imagine, what we will be doing with these casual data sets. Fields like medicine, that previously relied on small sample sizes to make important one-size-fits-all decisions, will evolve into a very personal affair. By investigating traffic data, dynamic solutions can be built that mimic past successes. Aided by machine learning, specifically recommendation engines, companies can invoke desires never previously thought about in our minds. Ideas like multi-armed bandits will motivate UI and AI development.
Of course, there is a tradeoff. To speak of solving a statistical problem is silly. Whereas in our 20th Century past, we either solved the deterministic problem or did not, i.e. find or did not find a solution given the physical and engineering constraints, in statistical problems we are subject to some fraction of failure (hence why we don't build probabilistic bridges). This can be described by another visualization. The space of deterministic engineering problems can be found lying on the ends of the unit interval $[0,1]$. A 1 represents a problem that can be, or is, solved completely, eg: we successfully designed a method of flight. The space in between 0 and 1 is represented by statistical problems. For example, spell checking cannot be solved in the traditional sense, but it can be accurate 95% of the time. Harder statistical problems are those that involve reflexivity, that is your actions will affect the outcome (problems like the stock market and ad-targeting, where consumers can become desensitized to your optimized ad). Finally, problems that are unsolvable using current science and technology are assigned to 0.
The 20th Century breakthroughs were not breakthroughs at the time of their discoveries, minus a few exceptions. The technologies took years to percolate through society, and only after they became cheap enough for public consumption. Therefore, if we were to ask if we have already invented any breakthrough technologies of the 21st Century, we should search liberally through what we have done. I would suggest that yes, we have made a breakthrough: quality information search. Tech giants like Google and Microsoft are working on these technologies, but they are still in their infancy. Furthermore, this technology is the most naive technology given our data supply. Imagine you were the world's first librarian, and you have just received thousands of books. The first, and most naive, thing you would do is organize the books, i.e. make queries of the books easier. Returning to the present, we are at this stage where we have overcome our own data-indexing problem. In fact, we have gone a step further, and we can not only return all results, but damn good results too. Imagine an alternative internet where you browsed by selecting more and more specific topics from accordion menus until you reached a desired webpage -- this is a possible realization of the internet's organization that luckily did not occur.
I am sticking my neck out, but I should point out an error in my overall argument. Previous authors may have been saying that we are at the end of so-and-so technology for years, right before a big new breakthrough. I, of course, cannot imagine these breakthroughs (else they would already exist), hence I underestimate future deterministic solutions. There are still great advances possible, teleportation and quantum computers come to mind, that will be classified as breakthrough deterministic tech.
Simply, I claim we will start to see novel technological uses of data to statistical problems that we cannot even fathom right now (who could have imagined nuclear power in a pre-nuclear society). These technologies will be as revolutionary as radar was to man. So, why haven't you picked up Bayesian Methods for Hackers yet?
EDIT: I was probably too hasty in discounting possible ventures of new technologies. I think there is a second venture that complements data-driven technological advances: the biotech industry. The futuristic biotechs I can think of now include nanobiotechnology, brain-machine interfaces and genetic engineering. Each of these techs, and probably all biotech advances, will produce massive amounts of data as a byproduct. This is why I say data-driven tech and biotech will complement each other.
The feature space is \(\mathbf{R}^3\), or more accurately, the positive octant of \(\mathbf{R}^3\), as all the \(X\) variables can only be positive quantities. Domain knowledge about tires might suggest that the *speed* the vehicle was moving at is important, hence we generate another variable, \(X_4\) (this is the feature extraction part):
This extends our old feature space into a new one, the positive part of \(\mathbf{R}^4\).
Furthermore, a mapping in our example is a function, \(\phi\), from \(\mathbf{R}^3\) to \(\mathbf{R}^4\):
$$\phi(x_1,x_2,x_3) = (x_1, x_2, x_3, \frac{x_1}{x_2} )$$
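As a tiny sketch, the mapping \(\phi\) is literally just a function that appends the derived feature (assuming here that \(x_1\) and \(x_2\) are the distance-like and time-like measurements whose ratio gives the speed; that pairing is my assumption, not stated above):

```python
def phi(x1, x2, x3):
    """Map a point in R^3 to R^4 by appending the derived speed feature x1/x2.

    Assumes x2 > 0, which holds since all X variables are positive quantities.
    """
    return (x1, x2, x3, x1 / x2)

# A point in the original feature space maps to a point in R^4.
point = phi(10.0, 2.0, 3.0)
```

Any downstream model then trains on these four-dimensional points instead of the original three-dimensional ones.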