Autofocus Creole: a game
Jan. 12th, 2007 11:45 pm
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd that two words would return only a single Google hit if each of them by itself is on only a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.
So I'm defining a number I call the Google Correlation of two words:
GC(x,y) = #hits for x y / ( (#hits for x) * (#hits for y) )
Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?
Any googlewhack will have a numerator of 1, but its GC won't necessarily be that tiny, because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.
But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
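For anyone who wants to check their own finds, the arithmetic above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `google_correlation` is made up for illustration, and the hit counts are assumed to be typed in by hand from the search results page (nothing here queries Google).

```python
def google_correlation(hits_both: int, hits_x: int, hits_y: int) -> float:
    """GC(x, y) = (#hits for x y) / ((#hits for x) * (#hits for y)).

    All three counts are assumed to be copied by hand from search results.
    """
    if hits_x <= 0 or hits_y <= 0:
        raise ValueError("each word needs at least one hit on its own")
    return hits_both / (hits_x * hits_y)


# The two examples from the post, using the hit counts quoted there.
jotted_cruddiness = google_correlation(1, 795_000, 2_740)
autofocus_creole = google_correlation(685, 14_700_000, 5_570_000)

print(f"Jotted cruddiness: {jotted_cruddiness:.4e}")  # 4.5907e-10
print(f"Autofocus Creole:  {autofocus_creole:.4e}")   # 8.3660e-12
```

Note that Autofocus Creole wins despite having 685 joint hits, because its denominator is so much larger.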
So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)
Date: 2007-01-13 08:35 am (UTC)
if anything, if each webpage is the result of n independent draws from the universal word distribution, we should expect the absence of a high-frequency word to increase the probability of finding any other word, for the simple reason that we know the absent word isn't hogging any slots.
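The "independent draws" argument in this comment can be checked numerically. The sketch below assumes the simplest version of that model: a page is n independent draws from a fixed word distribution. If word A (probability p_a per draw) is known to be absent, every draw effectively comes from the distribution with A removed, so word B's per-draw probability rises from p_b to p_b / (1 - p_a). The function names and the numbers p_a, p_b, and n are purely illustrative, not measurements.

```python
def p_present(p_word: float, n_draws: int) -> float:
    """Probability that a word appears at least once in n independent draws."""
    return 1.0 - (1.0 - p_word) ** n_draws


def p_present_given_absent(p_word: float, p_absent: float, n_draws: int) -> float:
    """Same probability, conditioned on another word never being drawn.

    With the absent word excluded, each draw comes from the renormalized
    distribution, so p_word is scaled up by 1 / (1 - p_absent).
    """
    return 1.0 - (1.0 - p_word / (1.0 - p_absent)) ** n_draws


# Illustrative (made-up) values: a common word A, a rarer word B, a 500-word page.
p_a, p_b, n = 0.01, 0.001, 500

unconditional = p_present(p_b, n)
conditional = p_present_given_absent(p_b, p_a, n)

print(f"P(B on page)            = {unconditional:.4f}")
print(f"P(B on page | A absent) = {conditional:.4f}")
```

Under these assumptions the conditional probability always comes out larger, which is the "not hogging any slots" effect the comment describes.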
Date: 2007-01-13 08:25 pm (UTC)
fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.