[personal profile] eclectic_boy
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd that two words would return only a single Google hit if each of them by itself is on only a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.

So I'm defining a number I call the Google Correlation of two words:

GC(x,y) = #hits for x y / ( (#hits for x) * (#hits for y) )

Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?

Any googlewhack will have a numerator of 1, but won't necessarily be that tiny because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.

But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
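
To make the arithmetic easy to repeat, here is a minimal Python sketch of the GC calculation. The function name `google_correlation` is just an illustrative choice, and the hit counts are the ones quoted above, typed in by hand rather than fetched from any search API.

```python
def google_correlation(hits_xy, hits_x, hits_y):
    """GC(x, y) = hits(x y) / (hits(x) * hits(y)), as defined in the post."""
    if hits_x <= 0 or hits_y <= 0:
        raise ValueError("individual hit counts must be positive")
    return hits_xy / (hits_x * hits_y)

# Hit counts as quoted in the post (not re-queried).
jotted_cruddiness = google_correlation(1, 795_000, 2_740)
autofocus_creole = google_correlation(685, 14_700_000, 5_570_000)

print(f"Jotted cruddiness: {jotted_cruddiness:.4e}")
print(f"Autofocus Creole:  {autofocus_creole:.4e}")
```

Even with 685 joint hits, the second pair scores lower because both of its words are so common on their own.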

So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)

Date: 2007-01-13 06:30 am (UTC)
From: [identity profile] q10.livejournal.com
but it looks like [personal profile] eclectic_boy's original formula has the nice property that all uncorrelated pairs will yield the same value, while your new formula doesn't.

Date: 2007-01-13 06:46 am (UTC)
From: [identity profile] superbacana.livejournal.com
What do you mean by uncorrelated? If you mean (# hits of x,y)=0, then the GC is always the same (ln(0)) for any x,y.

Date: 2007-01-13 06:53 am (UTC)
From: [identity profile] q10.livejournal.com
i should have said ‘independent’, where two events X and Y are independent iff P(X|Y) = P(X), or, equivalently, iff P(X,Y) = P(X)P(Y). my intuition is that two different words that have no particular influence on each other's probability of occurrence should give you the same score whether they're both very common, both very rare, or of very different frequency.
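
Spelled out as a short derivation, under two assumptions that are not in the original formula: N stands for the total number of indexed pages, and P(X) is estimated as hits(x)/N.

```latex
\[
P(X,Y) = P(X)\,P(Y)
\;\Longrightarrow\;
\frac{\mathrm{hits}(x\,y)}{N}
  = \frac{\mathrm{hits}(x)}{N}\cdot\frac{\mathrm{hits}(y)}{N}
\;\Longrightarrow\;
GC(x,y) = \frac{\mathrm{hits}(x\,y)}{\mathrm{hits}(x)\,\mathrm{hits}(y)} = \frac{1}{N}.
\]
```

So under these assumptions every independent pair lands on the same value, 1/N, however common or rare the individual words are.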

Date: 2007-01-13 08:21 am (UTC)
From: [identity profile] superbacana.livejournal.com
I think it's a matter of taste. I would think that independent pairs with higher frequencies are more interesting, because one might think that, the more a word occurs, the more likely it is to co-occur with any other given word. If one views every web page as just having a bunch of random draws from a shared distribution of words, then this is definitely true; the metric finds the most egregious violations of this simple model.

Date: 2007-01-13 08:35 am (UTC)
From: [identity profile] q10.livejournal.com
i'm not sure i follow you. yes, of course, we should expect P(X,Y) to increase as P(X) does, but why should more common words be more likely to have nontrivial dependencies? in particular, in the hypothetical extreme case where a word is obligatory in every document (that is, P(X) = 1) then it'll be trivially independent of everything else, since in that case P(X,Y) = P(Y) = P(X)P(Y).

if anything, if each webpage is the result of n independent draws from the universal word distribution, we should expect the absence of a high-frequency word to increase the probability of finding any other word, for the simple reason that we know the absent word isn't hogging any slots.
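
A small exact calculation can illustrate this under the toy model just described, where a page is n independent draws from a fixed word distribution. The per-draw probabilities `p_x` and `p_y` and the page length `n` below are made-up illustrative values, not measurements, and the function names are hypothetical.

```python
def p_present(p_word, n_draws):
    """Probability that a word with per-draw probability p_word
    shows up at least once in n_draws independent draws."""
    return 1.0 - (1.0 - p_word) ** n_draws

def p_present_given_absent(p_y, p_x, n_draws):
    """P(y appears at least once | x never appears).

    Given that no draw is x, each draw comes from the distribution
    renormalized without x, so y's per-draw probability is p_y / (1 - p_x).
    """
    return 1.0 - (1.0 - p_y / (1.0 - p_x)) ** n_draws

p_x = 0.05   # assumed per-draw probability of a high-frequency word
p_y = 0.001  # assumed per-draw probability of some other word
n = 20       # assumed number of draws (words) per page

unconditional = p_present(p_y, n)
conditional = p_present_given_absent(p_y, p_x, n)

print(f"P(y on page)            = {unconditional:.6f}")
print(f"P(y on page | x absent) = {conditional:.6f}")
```

The conditional probability comes out slightly higher than the unconditional one, which is the "not hogging any slots" effect in this simplified model.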

Date: 2007-01-13 10:26 am (UTC)
From: [identity profile] tirerim.livejournal.com
On the other hand, his formula is always the same when two words occur only together—[livejournal.com profile] eclectic_boy's formula will yield different results depending on whether the words are rare or common.

Date: 2007-01-13 08:25 pm (UTC)
From: [identity profile] q10.livejournal.com
but yielding the same result whenever two words only occur together probably isn't desirable. if you have two words that occur with P ~= 1 (‘sex’ and ‘the’, for example), then they can't help always occurring together, so that's not shocking, but if you have two one-in-a-million words that always occur together, that suggests they have some interesting connection.

fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.
