[personal profile] eclectic_boy
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words. It's not really that odd that two words would return only a single Google hit if each of them by itself appears on only a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common but nevertheless rarely appear together.

So I'm defining a number I call the Google Correlation of two words:

GC(x,y) = #hits for x y / ( (#hits for x) * (#hits for y) )

Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?

Any googlewhack will have a numerator of 1, but its GC won't necessarily be that tiny, because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.

But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
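For anyone who wants to check their own finds, here is a minimal Python sketch of the GC formula above. The function name `google_correlation` and its argument names are made up for illustration, and the hit counts are simply the ones quoted in this post, typed in by hand; nothing here queries Google.

```python
def google_correlation(hits_together, hits_x, hits_y):
    """GC(x, y) = (#hits for "x y") / ((#hits for x) * (#hits for y)).

    All three counts are assumed to be read off the search results page
    by hand; this function only does the arithmetic.
    """
    if hits_x <= 0 or hits_y <= 0:
        raise ValueError("each word needs at least one hit on its own")
    return hits_together / (hits_x * hits_y)


# Hit counts as quoted in the post.
jotted_cruddiness = google_correlation(1, 795_000, 2_740)
autofocus_creole = google_correlation(685, 14_700_000, 5_570_000)

print(f"Jotted cruddiness: {jotted_cruddiness:.4e}")  # about 4.59e-10
print(f"Autofocus Creole:  {autofocus_creole:.4e}")   # about 8.37e-12
```

Smaller is better here, so Autofocus Creole beats the googlewhack by a factor of roughly 55 even though its numerator is 685 rather than 1.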

So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)

Date: 2007-01-13 07:38 am (UTC)
From: [identity profile] q10.livejournal.com
are we allowed to exploit typos? if so, i've got:
strunk yotube 1
yotube 223,000
strunk 4,670,000

ln(1/(4670000* 223000)) ~= -27.67

and of course, there needs to be a rule that using a pair with no cooccurrences isn't allowed.

Date: 2007-01-13 10:35 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
Youtube + escherichia

youtube 140,000,000
escherichia 23,500,000
youtube + escherichia 803

-ln(803 / (140000000 * 23500000)) ~= 29.04

Date: 2007-01-13 10:54 pm (UTC)

Date: 2007-01-13 11:47 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
caenorhabditis + blu-ray = 28.6, caenorhabditis + tekken also around there. those are the only other ones i've found over 28.

it's really hard to generate very large numbers.
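The numbers being traded in these comments look like -ln(GC), so a bigger score means a less correlated pair. A rough Python sketch for ranking finds on that scale follows; `log_score` and `finds` are hypothetical names, the hit counts are copied from the post and comments above, and the check on co-occurrences follows the rule suggested earlier in the thread.

```python
import math


def log_score(hits_together, hits_x, hits_y):
    """Return -ln(GC); larger values mean the two words co-occur less often.

    Summing logs avoids forming the very large product hits_x * hits_y.
    """
    if hits_together <= 0:
        raise ValueError("pairs with no co-occurrences are not allowed")
    return math.log(hits_x) + math.log(hits_y) - math.log(hits_together)


# (hits together, hits for first word, hits for second word), as quoted above.
finds = {
    "jotted cruddiness": (1, 795_000, 2_740),
    "autofocus creole": (685, 14_700_000, 5_570_000),
    "strunk yotube": (1, 4_670_000, 223_000),
    "youtube escherichia": (803, 140_000_000, 23_500_000),
}

# Highest score (least correlated pair) first.
ranked = sorted(finds, key=lambda pair: log_score(*finds[pair]), reverse=True)
for pair in ranked:
    print(f"{pair}: {log_score(*finds[pair]):.2f}")
```

On these counts youtube escherichia comes out on top at about 29.04, followed by strunk yotube at about 27.67.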

Date: 2007-01-16 08:49 am (UTC)
From: [identity profile] q10.livejournal.com
well done, sir.
