5. Let’s play a little with Google

By Juan Chamero



Google’s strengths and weaknesses
    Searching is a game. The answers to our daily uncertainties are “up” there, hidden somewhere in the Web space as a sort of “established truth” that we are challenged to unveil by smart querying.
    In the figure above we depict five factors that intervene in a search. First of all, you need basic knowledge or awareness of the topic you are trying to investigate, or simply a curiosity to satisfy. Then you need a search strategy adequate to the complexity of the topic, depicted in red. Now you are ready to interact with a Knowledge Pointer Database, to some extent a Virtual Oracle, where a significant sample of the established truth is filed and indexed. Along your search you need your creativity to discover appropriate shortcuts. And finally you need some luck!
    Let’s suppose that data are filed and indexed the best way: by meaningful concepts. If a document, for instance, deals with no more than 40 concepts, it will be indexed by these 40 concepts on average. By contrast, as we have seen in our past reflections, if indexed by words this document will be indexed by 500 words on average. The big difference for the game, however, is that with concept indexing retrieval is almost immediate, whereas with word indexing retrieval can be an adventure, a smart interception game. We have to admit that searching by concepts is not a game and may look a little boring, because whatever we are looking for could be found in no more than three clicks, and in fewer than one and a half on average!
    If you are a Web search expert or your general culture level is high, it is highly probable that you will soon define the boundary of your “uncertainty space” and guess a query, or a sequence of queries, that answers your uncertainty well enough. A smart search expert can find almost everything, provided it exists, in no more than half an hour by applying special text-context strategies. The problem is that most people are not experts.

Let’s have a little fun with Google
    Now let’s have a little fun with bizarre queries in Google. See how, through one of its weaknesses, we may become aware of one of its strengths. Let’s start with:

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh,

a series of 31 h’s. This unimaginable sequence of characters exists in Google with 18,800 references! The shock does not end there, because if you add five more h’s you get 12,400 instead! Is this a criticism of Google or of human inventiveness and stupidity? It seems to me that it is both. It’s fun trying to discover these types of long character chains.

Abracadabraabracadabraabracadabra 9
Abracadabra abracadabra abracadabra 2,560,000
Uffffffffffffffffffffffffffffffffffffffffffffffffff 186
Warwarwarwarwarwarwarwar 6

What does it mean?
    It means that Google indexes absolutely everything within the text realm! Not bad! It is a serious indexer of the textual chains that humans create. Correct and incorrect expressions, written intentionally or unintentionally, are tracks of human expression and behavior and, why not, codes. We need these types of search engines to proceed to unveil the Web semantics. In order to build SSSE (Semantic Super Search Engines) we need something like Google that provides us with the raw cognitive material. Google is powerful, smart, ingenuous, and dumb at the same time.
Note. Could you explain why the second query above differs so much from the first?

How Google handles multiple-word keywords
    Let’s play with the following queries:

Common expressions
1. sine qua non  313,000
1.1 sine 28,300,000
1.2 qua 14,300,000
1.3 non 1,600,000,000
2. "sine qua non" 1,180,000
3. as_it_is 3,210
4. "as it is" 19,000,000

Google does not discriminate common expressions well. A common expression is a semantic entity used in the literary part of any text (we will see later the two parts of a text: literary and conceptual). Sine qua non, as a unit, is a Latin term. It would rarely be used as a concept, but rather as a common word. Darwin Technology, even ignoring its language of origin, classifies it as a common word. Could you explain why query 1 differs so much from query 2? Something wrong may have happened, because from the answers we may infer that there are 1,180,000 documents that contain the expression “sine qua non”, and these same documents necessarily satisfy the condition that sine AND qua AND non all occur in them! Then why does query 1 render only 313,000?
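    The inference above can be checked on a toy index. The sketch below is not Google's algorithm, which is unknown to us; it assumes a simple positional index, and the names build_positional_index, and_query and phrase_query, as well as the three sample documents, are hypothetical. It only illustrates why the documents matching the quoted phrase should be a subset of the documents matching the three words joined by AND.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def and_query(index, words):
    """Documents containing every word, in any order and at any position."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()


def phrase_query(index, words):
    """Documents containing the words consecutively, in the given order."""
    hits = set()
    for doc_id in and_query(index, words):
        starts = index[words[0]][doc_id]
        if any(
            all(p + i in index[w][doc_id] for i, w in enumerate(words))
            for p in starts
        ):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "trust is the sine qua non of commerce",
    "d2": "non sine qua",                 # all three words, but not the phrase
    "d3": "the sine wave is non linear",  # "qua" is missing
}
index = build_positional_index(docs)
all_words = and_query(index, ["sine", "qua", "non"])
exact = phrase_query(index, ["sine", "qua", "non"])
print(sorted(all_words), sorted(exact))
```

    In such an index the phrase count can never exceed the AND count, which is why the figures for queries 1 and 2 look inverted.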

Multiple-word keywords
the 9,660,000,000
the son 25,000,000
the son of 1,620,000
the son of god 1,880,000
the son of god in 79,100
the son of god in the 22,400
the son of god in the biblical 5
the son of god in the biblical sense 5
the son of god in the biblical sense of 1
And even the full sentence
"Yet the Bible affirms Jesus to be the Son of God - in the Biblical sense of this term!" accounts for 1 reference!

    Here we find another strength of Google as a primary platform from which to go further toward Web semantics. It takes care of any number of keywords. We do not know its retrieval algorithms, but one thing is true: it appears to process an unlimited number of words within quotation marks and perhaps, in the extreme, the whole indexed text as a keyword! In the example above we checked with a whole paragraph. This feature could be used to audit similarity and plagiarism.
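    As an illustration of how exact word sequences could serve to audit similarity, here is a minimal sketch. It does not query Google; it assumes we already hold both texts locally, and it compares them by their shared runs of n consecutive words (often called shingles). The function names shingles and overlap, the choice n=5 and the sample texts other than the quoted sentence are assumptions made for the example.

```python
import re


def shingles(text, n=5):
    """Return the set of all runs of n consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=5):
    """Share of the smaller text's shingles that also occur in the other."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


source = ("Yet the Bible affirms Jesus to be the Son of God "
          "in the Biblical sense of this term")
# Hypothetical texts, for illustration only.
suspect = ("Some claim that the Bible affirms Jesus to be the Son of God, "
           "as many have written.")
unrelated = "Searching the Web is a game of smart querying"

print(overlap(source, suspect), overlap(source, unrelated))
```

    A high overlap only flags a passage for human review; short common expressions will also match, so n should be large enough to make coincidences unlikely.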

Some inferences about Google inner architecture
    I will try to explain as simply as I can how Google probably searches. Google sends its robots (crawlers) to navigate through all routers of the Web, entering whatever environment is permitted. These agents are by and large diligent, fast and efficient. I do not know where the indexing is performed: on the spot, at the moment the agents locate documents, or “at home”, once their content has been retrieved. Documents are suitably “undressed” so that only their text content and hyperlinks remain.
    All character chains are considered “words”, specifically “single words”: real common words in a given language (but, how, funny, uh, ah, aha, well, C+, …), single letters (a, b, …, z), numbers (1, 333, 2,002, …), acronyms/abbreviations (CAD, gym, ibm, …), names (Smith, opa locka, Kapalaroose, …), special characters (%, &, (, @, …) and any chain of accepted characters like ahhhhhhhhhhhhhhhhhhhh, although we do not know whether a limit on size exists.
    As a result of their continuous task, Google’s crawlers keep up to date a huge Index (in tables? as open lists? as matrices?) with as many columns as the total number of documents indexed and as many rows as the different semantic objects that the text content of the universe of documents may hold. This index is not public. Its size must be considerably large, given the facts mentioned. In this list, for a given language, we may find real “Common Words” of that language, names, fancy names, and a huge number of pseudo-words, including badly structured and misspelled words. We do not know how common expressions like “sine die”, “ex-post” and “front-end” are indexed, perhaps as a logical composition once you issue the query. We also do not know how Google handles names of persons, things and places. Google has many other complementary indexes, like the ones used to implement its Rank algorithm, in which we may imagine giant 10**10 x 10**10 matrices that match against themselves.
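    The idea that every character chain counts as a “word” can be sketched as follows. This is only an assumption about how such a tokenizer might look, not a description of Google's; the function names classify and tokenize and the four category labels are hypothetical.

```python
import re


def classify(token):
    """Give a rough label to one whitespace-delimited character chain."""
    if re.fullmatch(r"[A-Za-z]", token):
        return "single letter"
    if re.fullmatch(r"\d[\d,.]*", token):
        return "number"
    if re.fullmatch(r"[^\w\s]+", token):
        return "special characters"
    return "character chain"  # common words, names, acronyms, ahhhhh, ...


def tokenize(text):
    """Split on whitespace and keep every chain, whatever it looks like."""
    return [(tok, classify(tok)) for tok in text.split()]


for token, label in tokenize("a 2,002 % ahhhhhhhh C+ Smith"):
    print(token, "->", label)
```

    Nothing is discarded: a chain that is not a letter, a number or a run of special characters is still kept and indexed as a word.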
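    The index imagined above, with one row per distinct word and one column per document, can be sketched in a few lines. This is a toy model under our own assumptions, since the real index is not public; the names build_term_document_matrix and lookup and the sample documents are hypothetical.

```python
def build_term_document_matrix(docs):
    """Build rows = distinct words, columns = documents, cell = 1 if present."""
    doc_ids = sorted(docs)
    terms = sorted({w for text in docs.values() for w in text.lower().split()})
    matrix = {term: [0] * len(doc_ids) for term in terms}
    for col, doc_id in enumerate(doc_ids):
        for word in docs[doc_id].lower().split():
            matrix[word][col] = 1
    return terms, doc_ids, matrix


def lookup(matrix, doc_ids, term):
    """Read one row of the matrix and return the documents it points to."""
    row = matrix.get(term.lower(), [])
    return [doc_id for doc_id, bit in zip(doc_ids, row) if bit]


# Hypothetical sample documents, for illustration only.
docs = {
    "doc1": "the son of god",
    "doc2": "the sine qua non",
    "doc3": "hhhhhhh the",
}
terms, doc_ids, matrix = build_term_document_matrix(docs)
print(lookup(matrix, doc_ids, "the"))
print(lookup(matrix, doc_ids, "hhhhhhh"))
```

    Almost every cell of such a matrix is zero, so a real engine would more plausibly store only the non-empty entries of each row; the dense form is used here because it mirrors the rows-by-columns picture in the text.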

What Cyberspace looks like
    In the figure below we depict a search engine cosmovision. The Web is in black, with nearly 10,000 million documents indexed by their corresponding URLs. Within this ocean we may distinguish some Websites, like Google. The cognitive offer of this type of site (Search Engines) is open to users located in the Cyberspace defined by all the people interconnected via ISPs, about 1,000+ million as of today.