[personal profile] eclectic_boy
You already know about googlewhacking: finding pairs of words that bring up only a single hit when searched for together on Google. I decided that's not as good a measure as I'd like of how uncorrelated two words really are, because it doesn't take into account the base rarity of the individual words -- it's not really that odd that two words return only a single Google hit if each of them by itself appears on just a small number of pages. What's more intriguing, to me anyway, are words that by themselves are quite common, but which nevertheless rarely appear together.

So I'm defining a number I call the Google Correlation of two words:

GC(x,y) = (# hits for "x y") / ( (# hits for x) * (# hits for y) )

Most pairs of words will have quite small GCs (except for stormy petrels, I suppose!), but the question is, how small can you get? What's the smallest nonzero GC you can find?

Any googlewhack will have a numerator of 1, but won't necessarily be that tiny, because its denominator may also be small. For example, one of the recent finds at googlewhack.com, Jotted Cruddiness, has a GC of 1 / (795000 * 2740) = 4.59073589 * 10^-10.

But my best so far, Autofocus Creole, is down at 685 / (14700000 * 5570000) = 8.36600349 * 10^-12.
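
(For anyone who wants to play along programmatically, here's a minimal Python sketch of the computation. The hit counts are the point-in-time Google counts quoted above, and the function name is just illustrative; fetching the counts is up to you.)

    def gc(hits_xy, hits_x, hits_y):
        # Google Correlation: joint hit count over the product of the
        # individual hit counts.
        return hits_xy / (hits_x * hits_y)

    print(gc(1, 795000, 2740))           # Jotted Cruddiness: ~4.59e-10
    print(gc(685, 14700000, 5570000))    # Autofocus Creole:  ~8.37e-12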

So have at it, and share your finds! Anyone who wants to help me make a webpage for people to report these, I'll share all the credit with you once it becomes a fad :^)

Date: 2007-01-13 04:49 am (UTC)
From: [identity profile] eclectic-boy.livejournal.com
[livejournal.com profile] superbacana suggests using logarithms to make the numbers less cumbersome. In that case,

GC(Jotted Cruddiness) = -21.5
GC(Autofocus Creole) = -25.5

How low can you go?
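
(A sketch of that log form in Python, using the natural log, which is what the numbers above work out to:)

    import math

    def log_gc(hits_xy, hits_x, hits_y):
        # natural log of the GC defined in the original post
        return math.log(hits_xy) - math.log(hits_x) - math.log(hits_y)

    print(log_gc(1, 795000, 2740))          # ~ -21.5
    print(log_gc(685, 14700000, 5570000))   # ~ -25.5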

Date: 2007-01-13 05:04 am (UTC)
From: [identity profile] carnap.livejournal.com
Now that "Autofocus Creole" seems to be the name of the game, you need to start tracking the google correlation of those two terms to see how popular the game is. . .

Date: 2007-01-13 05:38 am (UTC)
From: [identity profile] tirerim.livejournal.com
Hmm. Are you sure that multiplying the frequencies of the individual words is the way to go, rather than adding them? I ask because GC(of a) = ln(5.89 * 10^9 / (5.77 * 10^9 * 8.13 * 10^9)) = -22.8. This despite the fact that "of" and "a" occur more frequently together than "of" does by itself!

Date: 2007-01-13 05:59 am (UTC)
From: [identity profile] tirerim.livejournal.com
Alternatively, [livejournal.com profile] carpenter suggests that squaring the numerator might help.

Date: 2007-01-13 06:06 am (UTC)
From: [identity profile] superbacana.livejournal.com
Good point. Correlation is normally defined in terms of real-valued variables, so it's not technically correct to call GC a correlation. Nonetheless, modifying the GC to be more like a correlation would require taking the square root of the denominator, or squaring the numerator. This seems to make sense: the GC then becomes dimensionless, since it is a ratio of counts (or of squared counts).

If GC = ln( # hits of x,y / sqrt( (# hits x) * (# hits y)) ) then:

GC(jotted cruddiness) = -10.6
GC(autofocus creole) = -9.4887
GC(of a) = -0.1509

(Assuming I didn't screw up the math again.)

It makes sense to interpret the counts as unnormalized probabilities; if you don't take the square root, then you have a leftover normalization, and hence the GC measure is a function of the size of Google's index.

The GC formula is also similar to mutual information. Information-theoretic measures might be the most natural way of expressing how "independent" two words are.
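
(A sketch of that variant in Python, with the counts quoted earlier in the thread:)

    import math

    def gc_sqrt(hits_xy, hits_x, hits_y):
        # square-root the denominator so the ratio is dimensionless
        return math.log(hits_xy / math.sqrt(hits_x * hits_y))

    print(gc_sqrt(685, 14700000, 5570000))               # autofocus creole: ~ -9.49
    print(gc_sqrt(5890000000, 5770000000, 8130000000))   # "of a": ~ -0.15

Incidentally, if you divide every count by the index size N to turn it into a probability, the original log-GC is exactly the pointwise mutual information ln( P(x,y) / (P(x)P(y)) ) minus ln(N), which is the leftover normalization just mentioned.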

Date: 2007-01-13 06:12 am (UTC)
From: [identity profile] superbacana.livejournal.com
Another nice property of this formula is that the GC is 0 when two words _only_ occur together ("of a" is close): if hits(x) = hits(y) = hits(x,y) = n, then GC = ln( n / sqrt(n * n) ) = ln(1) = 0.

Date: 2007-01-13 06:30 am (UTC)
From: [identity profile] sildra.livejournal.com
Well, at least through this whole process I learned that quark is a type of cheese. It does not, however, seem to go well with turmeric (-24.6). Not as good as yours, but still, I thought I'd share.

Date: 2007-01-13 06:30 am (UTC)
From: [identity profile] q10.livejournal.com
but it looks like [personal profile] eclectic_boy's original formula has the nice property that all uncorrelated pairs will yield the same value, while your new formula doesn't.

Date: 2007-01-13 06:46 am (UTC)
From: [identity profile] q10.livejournal.com
well, keep in mind that, where A and B are independent, freq(A) = P(A)*sizeof(Web), freq(B) = P(B)*sizeof(Web), and freq(AB) = P(A)P(B)*sizeof(Web), so GC(A,B) = ln(1/sizeof(Web)). when that's the baseline, these numbers start to make sense.

a little back-of-the-envelope work suggests that, if we restrict attention to the English web, this puts the baseline someplace between -23 and -22.1.

this means that the example googlewhack is probably at significantly higher correlation than chance, which, when you look at the words in question, isn't that surprising.
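
(A quick sanity check on that range, as a sketch: inverting baseline = ln(1/N) gives the implied index size.)

    import math

    for b in (-23.0, -22.1):
        # a baseline of ln(1/N) = b implies N = e^(-b) indexed pages
        print(b, "->", "%.2e" % math.exp(-b))
    # -23.0 -> ~9.74e9 pages, -22.1 -> ~3.96e9 pages: a few billion,
    # which is plausible for the 2007 English-language Google index.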

Date: 2007-01-13 06:46 am (UTC)
From: [identity profile] superbacana.livejournal.com
What do you mean by uncorrelated? If you mean (# hits of x,y) = 0, then the GC is always the same (ln(0), i.e. negative infinity) for any x,y.

Date: 2007-01-13 06:48 am (UTC)
From: [identity profile] sildra.livejournal.com
dragon superfluid -26.2

Date: 2007-01-13 06:53 am (UTC)
From: [identity profile] q10.livejournal.com
i should have said ‘independent’, where two events X and Y are independent iff P(X|Y) = P(X), or, equivalently, iff P(X,Y) = P(X)P(Y). my intuition is that two different words that have no particular influence on each other's probability of occurrence should give you the same score whether they're both very common, both very rare, or of very different frequencies.

Date: 2007-01-13 07:08 am (UTC)
From: [identity profile] sonatanator.livejournal.com
My best so far:
Kodaly: 1,710,000
YouTube: 137,000,000
Together: 990

I don't have a fancy calculator on me, but the original would have been 4.22589 * 10^-12.

If anyone cares to do the other thing, go for it.

Date: 2007-01-13 07:35 am (UTC)
From: [identity profile] sildra.livejournal.com
-26.2 (i.e. ln(4.22589 * 10^-12))

Google's built-in calculator does ln and various other useful functions; it also does unit conversions and knows a number of useful constants.

Date: 2007-01-13 07:38 am (UTC)
From: [identity profile] q10.livejournal.com
are we allowed to exploit typos? if so, i've got:
strunk yotube 1
yotube 223,000
strunk 4,670,000

ln(1/(4670000 * 223000)) ~= -27.67

and of course, there needs to be a rule that using a pair with no cooccurrences isn't allowed.

Date: 2007-01-13 08:21 am (UTC)
From: [identity profile] superbacana.livejournal.com
I think it's a matter of taste. I would think that independent pairs with higher frequencies are more interesting, because one might think that the more a word occurs, the more likely it is to co-occur with any other given word. If one views every web page as just a bunch of random draws from a shared distribution of words, then this is definitely true; the metric finds the most egregious violations of this simple model.
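
(To make that model concrete, here's a toy simulation -- the page count and word probabilities are made up -- showing that independent per-page draws put the log-GC right at the ln(1/N) baseline mentioned upthread:)

    import math
    import random

    random.seed(0)
    N = 1000000              # toy web: a million pages (made-up size)
    p_x, p_y = 0.01, 0.003   # made-up per-page probabilities of each word

    hits_x = hits_y = hits_xy = 0
    for _ in range(N):
        has_x = random.random() < p_x
        has_y = random.random() < p_y   # drawn independently of has_x
        hits_x += has_x
        hits_y += has_y
        if has_x and has_y:
            hits_xy += 1

    print(math.log(hits_xy / (hits_x * hits_y)))   # close to ...
    print(math.log(1 / N))                         # ... ln(1/N) ~= -13.8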

Date: 2007-01-13 08:35 am (UTC)
From: [identity profile] q10.livejournal.com
i'm not sure i follow you. yes, of course, we should expect P(X,Y) to increase as P(X) does, but why should more common words be more likely to have nontrivial dependencies? in particular, in the hypothetical extreme case where a word is obligatory in every document (that is, P(X) = 1), it'll be trivially independent of everything else, since in that case P(X,Y) = P(Y) = P(X)P(Y).

if anything, if each webpage is the result of n independent draws from the universal word distribution, we should expect the absence of a high-frequency word to increase the probability of finding any other word, for the simple reason that we know the absent word isn't hogging any slots.

Date: 2007-01-13 10:26 am (UTC)
From: [identity profile] tirerim.livejournal.com
On the other hand, the new formula is always the same when two words occur only together, while [livejournal.com profile] eclectic_boy's formula will yield different results depending on whether the words are rare or common.

Date: 2007-01-13 08:25 pm (UTC)
From: [identity profile] q10.livejournal.com
but yielding the same result whenever two words only occur together probably isn't desirable. if you have two words that occur with P ~= 1 (‘sex’ and ‘the’, for example), then they can't help always occurring together, so that's not shocking, but if you have two one-in-a-million words that always occur together, that suggests they have some interesting connection.

fundamentally, i don't see what problem we're trying to fix - if you don't like that the baseline number is so high, just add a normalizing factor of 22 to all scores before reporting them.

Date: 2007-01-13 10:30 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
I object, because youtube is not a word!

Date: 2007-01-13 10:35 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
Youtube + escherichia

youtube 140,000,000
escherichia 23,500,000
youtube + escherichia 803

~29

Date: 2007-01-13 11:47 pm (UTC)
From: [identity profile] creed-of-hubris.livejournal.com
caenorhabditis + blu-ray = 28.6, caenorhabditis + tekken also around there. those are the only other ones i've found over 28.

it's really hard to generate very large numbers.

Date: 2007-01-16 08:49 am (UTC)
From: [identity profile] q10.livejournal.com
well done, sir.