[personal profile] eclectic_boy
You already know about googlewhacking, finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd for two words to return only a single Google hit if each of them by itself appears on only a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.

So I'm defining a number I call the Google Correlation of two words:

GC(x,y) = #hits for x y / ( (#hits for x) * (#hits for y) )

Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?

Any googlewhack will have a numerator of 1, but won't necessarily be that tiny because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.

But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
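
If you'd rather not do the division by hand, here's a minimal sketch of the GC formula in Python. The function name `google_correlation` and its parameter names are just my own labels for illustration, and the hit counts are the ones quoted above, typed in by hand rather than fetched from Google.

```python
def google_correlation(hits_xy: int, hits_x: int, hits_y: int) -> float:
    """GC(x, y) = #hits for "x y" / ((#hits for x) * (#hits for y))."""
    # A word with zero hits would make the denominator zero, so reject it.
    if hits_x <= 0 or hits_y <= 0:
        raise ValueError("individual hit counts must be positive")
    return hits_xy / (hits_x * hits_y)


# Hit counts copied from the two examples above.
jotted_cruddiness = google_correlation(1, 795_000, 2_740)
autofocus_creole = google_correlation(685, 14_700_000, 5_570_000)

print(f"Jotted cruddiness: {jotted_cruddiness:.8e}")
print(f"Autofocus Creole:  {autofocus_creole:.8e}")

# A smaller GC means the pair is less correlated, so lower wins.
print("Autofocus Creole wins:", autofocus_creole < jotted_cruddiness)
```

Since the counts are passed in as plain numbers, checking a new pair is just a matter of plugging in the three hit counts you read off the search results.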

So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)

Date: 2007-01-13 08:25 pm (UTC)
From: [identity profile] q10.livejournal.com
but yielding the same result whenever two words only occur together probably isn't desirable. if you have two words that occur with P ~= 1 (‘sex’ and ‘the’, for example), then they can't help always occurring together, so that's not shocking, but if you have two one-in-a-million words that always occur together, that suggests they have some interesting connection.

fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.
