Autofocus Creole: a game
Jan. 12th, 2007 11:45 pm
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd that two words would return only a single Google hit if each of them by itself is on only a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.
So I'm defining a number I call the Google Correlation of two words:
GC(x,y) = #hits for x y / ( (#hits for x) * (#hits for y) )
Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?
Any googlewhack will have a numerator of 1, but won't necessarily be that tiny because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.
But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
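If you'd rather not do the division by hand, here is a minimal Python sketch of the GC formula above. The function name `google_correlation` is just something made up for illustration, and the hit counts are the ones quoted in this post, so live Google counts will almost certainly give different numbers.

```python
def google_correlation(hits_xy, hits_x, hits_y):
    """GC(x, y) = (#hits for x y) / ((#hits for x) * (#hits for y))."""
    if hits_x <= 0 or hits_y <= 0:
        raise ValueError("individual hit counts must be positive")
    return hits_xy / (hits_x * hits_y)


# Hit counts as quoted in the post (a snapshot, not live data).
jotted_cruddiness = google_correlation(1, 795_000, 2_740)
autofocus_creole = google_correlation(685, 14_700_000, 5_570_000)

print(f"Jotted cruddiness: {jotted_cruddiness:.4e}")
print(f"Autofocus Creole:  {autofocus_creole:.4e}")
```

A smaller value means the pair shows up together less often than the individual popularity of the two words would suggest.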
So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)
no subject
Date: 2007-01-13 06:06 am (UTC)
If GC = ln( # hits of x,y / sqrt( (# hits x) * (# hits y)) ) then:
GC(jotted cruddiness) = -10.6
GC(autofocus creole) = -9.4887
GC(of a) = -0.1509
(Assuming I didn't screw up the math again.)
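For anyone who wants to check the math, here is a small Python sketch of the log formula above, using the hit counts quoted in the original post. The name `log_gc` is made up for illustration. With those counts, autofocus creole comes out at about -9.4887, but jotted cruddiness comes out at about -10.75 rather than -10.6; the difference may simply be down to different hit counts at the time, which can't be determined from what's written here.

```python
import math


def log_gc(hits_xy, hits_x, hits_y):
    """ln( hits(x y) / sqrt( hits(x) * hits(y) ) )."""
    return math.log(hits_xy / math.sqrt(hits_x * hits_y))


# Hit counts taken from the original post, not from a live search.
autofocus_creole = log_gc(685, 14_700_000, 5_570_000)
jotted_cruddiness = log_gc(1, 795_000, 2_740)

print(f"autofocus creole:  {autofocus_creole:.4f}")
print(f"jotted cruddiness: {jotted_cruddiness:.4f}")
```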
It makes sense to interpret the counts as unnormalized probabilities; if you don't take the square root, then one factor of the normalization is left over, and hence the GC measure is a function of the size of Google's index.
The GC formula is also similar to mutual information. Information-theoretic measures might be the most natural way of expressing how "independent" two words are.
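A rough Python sketch of both points, under stated assumptions: it pretends the index doubles uniformly (every count doubles) and checks which score changes. The pointwise mutual information function treats count / n_pages as a probability, and the value of `n_pages` is purely hypothetical, since the size of Google's index isn't given anywhere here. All function names are made up for illustration.

```python
import math


def raw_gc(c_xy, c_x, c_y):
    """Original GC: joint count over the product of the individual counts."""
    return c_xy / (c_x * c_y)


def sqrt_gc(c_xy, c_x, c_y):
    """Square-rooted variant (before taking the log)."""
    return c_xy / math.sqrt(c_x * c_y)


def pmi(c_xy, c_x, c_y, n_pages):
    """Pointwise mutual information in nats, with count / n_pages as a probability."""
    p_xy = c_xy / n_pages
    p_x = c_x / n_pages
    p_y = c_y / n_pages
    return math.log(p_xy / (p_x * p_y))


counts = (685, 14_700_000, 5_570_000)    # autofocus creole, from the post
doubled = tuple(2 * c for c in counts)   # pretend the whole index doubled

raw_ratio = raw_gc(*doubled) / raw_gc(*counts)      # halves: depends on index size
sqrt_ratio = sqrt_gc(*doubled) / sqrt_gc(*counts)   # unchanged

n_pages = 8e9  # hypothetical index size, for illustration only
pmi_before = pmi(*counts, n_pages)
pmi_after = pmi(*doubled, 2 * n_pages)  # unchanged once N scales too

print(raw_ratio, sqrt_ratio, pmi_before, pmi_after)
```

The raw GC halves when the index doubles, while the square-rooted score and the PMI stay put; the PMI needs the index size as an explicit input to manage that.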
no subject
Date: 2007-01-13 08:35 am (UTC)
if anything, if each webpage is the result of n independent draws from the universal word distribution, we should expect the absence of a high-frequency word to increase the probability of finding any other word, for the simple reason that we know the absent word isn't hogging any slots.
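a quick Python sketch of that argument, taking the n-independent-draws model at face value. the word probabilities and page length below are made-up illustrative numbers, not measurements, and the function names are hypothetical. conditioning on a word being absent renormalizes each draw over the remaining vocabulary, so every other word's per-draw probability goes up by a factor of 1 / (1 - p_absent).

```python
def p_present(p_word, n_draws):
    """Probability that a word appears at least once in n independent draws."""
    return 1.0 - (1.0 - p_word) ** n_draws


def p_present_given_absent(p_word, p_absent, n_draws):
    """Same probability, given that another word (per-draw probability
    p_absent) was never drawn on this page."""
    if p_word + p_absent > 1.0:
        raise ValueError("per-draw probabilities cannot sum to more than 1")
    # Each draw is now over the vocabulary minus the absent word.
    p_conditional = p_word / (1.0 - p_absent)
    return 1.0 - (1.0 - p_conditional) ** n_draws


# Illustrative values only: a common word, a rarer word, a 500-word page.
p_common, p_rare, n = 0.05, 0.001, 500

baseline = p_present(p_rare, n)
given_absent = p_present_given_absent(p_rare, p_common, n)

print(f"P(rare word on page)                 = {baseline:.4f}")
print(f"P(rare word | common word is absent) = {given_absent:.4f}")
```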
no subject
Date: 2007-01-13 08:25 pm (UTC)
fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.
no subject
Date: 2007-01-13 06:46 am (UTC)
a little back-of-the-envelope work suggests that, if we restrict attention to the English web, this puts the baseline someplace between 22.1 and 23.
this means that the example googlewhack is probably at significantly higher correlation than chance, which when you look at the words in question isn't that surprising.
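the derivation of that baseline isn't spelled out here, so the following Python sketch rests on an assumption: that the baseline is ln N for an index of N pages. the reasoning would be that if two words were independent, hits(x y) would be about hits(x) * hits(y) / N, so the original GC would be about 1/N and its log about -ln N. under that reading, 22.1 to 23 corresponds to roughly 4 to 10 billion pages. the function names are made up, and the hit counts are the ones from the original post.

```python
import math


def implied_index_size(baseline):
    """Index size N for which ln N equals the given baseline (assumed reading)."""
    return math.exp(baseline)


def log_raw_gc(hits_xy, hits_x, hits_y):
    """Natural log of the original GC."""
    return math.log(hits_xy / (hits_x * hits_y))


low, high = 22.1, 23.0
n_low = implied_index_size(low)
n_high = implied_index_size(high)
print(f"implied index size: {n_low:.2e} to {n_high:.2e} pages")

jotted = log_raw_gc(1, 795_000, 2_740)
autofocus = log_raw_gc(685, 14_700_000, 5_570_000)

# Above -ln N means more co-occurrence than chance; below means less.
print(f"jotted cruddiness: {jotted:.2f}, above chance: {jotted > -low}")
print(f"autofocus creole:  {autofocus:.2f}, below chance: {autofocus < -high}")
```

on these numbers the googlewhack sits a little above the assumed chance level and autofocus creole sits well below it.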