1. What’s in a document? – Part I



By Juan Chamero

“What’s in a document?” is an existential question in the line of Shakespeare’s famous quotation from “Romeo and Juliet”:

"What's in a name? That which we call a rose
By any other word would smell as sweet."

How are documents registered?: We are trying to define how our Darwin methodology sees the search process compared with conventional search engines. Darwin supersedes most conventional IR technologies because, by navigating under its guidance, you may see more of the logical structure of human knowledge, and see it more clearly. However, to be honest, some search engines preserve the Human Memory perfectly well, only without classifying it. Some conventional search engines like Google index documents in a single common level under every word used, so nothing is actually lost, and at least the Human Memory stored in the Web is safe thanks to them!

Have fun searching in Conventional Search Engines: Human information and knowledge inherently has from five to thirteen levels of hierarchical complexity, depending on the discipline. Most Conventional Search Engines are like huge Catalogs that register almost everything in only one level. American Constitution Acts are at the same level as a traffic infraction committed by a citizen named John Smith from Louisiana on November 20th, 1992; a neonatology overview is at the same level as those two documents, and at the same level as an experimental test of a new drug to combat some strange form of infantile malnutrition (marasmus) in Ghana. All of them are important pieces of information that should be detected, classified and saved forever. In this sense Conventional Search Engines should be considered semantically “flat”: Google, for instance, has actually registered in a giant one-floor virtual depot more than 10,000,000,000 references to an equal number of documents! Let’s see how the retrieval service provided by these Search Engines manages to be good enough in spite of this severe structural limitation. However, the next generation of Super Semantic Search Engines (SSSE), a thousand times more efficient in terms of time-memory tradeoff, will soon be commercially available to end users.

Brute force registration process: We say that the best Conventional Search Engines, Google among them, are semantically crude and primitive, but they are endowed with an extraordinary “brute force” registration facility. Brute force is not used pejoratively here; it denotes an exhaustive method that leaves no possibility within a domain out of consideration (no possible existence, no permutation, no combination, no alternative, and no space). In this sense brute force means that the Search Engines’ procedures, algorithms and robots scan and take into account absolutely all possible “terms” within documents, from beginning to end, indexing the documents in as many term tables or table vectors as there are distinct terms. If a given term like “parallel” is found 325 times in a document, most of these exhaustive SE’s register the document in a table or vector assigned to the term “parallel”, not always weighting its presence within the document.
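To make the registration idea concrete, here is a minimal sketch in Python of a “flat” term table of the kind described above. It is only an illustration under stated assumptions: the names register and index, the sample documents and the use of a plain whitespace split are all hypothetical, and nothing here is claimed to reflect how Google or any other real SE is implemented.

```python
from collections import defaultdict


def register(index, doc_id, text):
    """Register doc_id under every term found in text (hypothetical helper).

    Each term gets its own table, here a set of document ids. The number of
    occurrences is deliberately not stored, mirroring the unweighted case.
    """
    for term in text.split():
        index[term].add(doc_id)


# One flat level: every term maps straight to the documents containing it.
index = defaultdict(set)
register(index, "doc1", "parallel lines and parallel ports")
register(index, "doc2", "a parallel universe")

print(sorted(index["parallel"]))  # documents registered under "parallel"
```

Because each table is a set, a document in which “parallel” appears 325 times is registered exactly once under that term, the same as a document in which it appears a single time.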

What’s in a term?: For these SE’s a term is any string of characters (ASCII characters, for example) basically delimited by the “BLANK” space character or by the character that marks the end of a typed “word” on a keyboard; some SE’s also treat many other special characters (.,:;+-=…) as separators. “Fortunately”, for the sake of later semantic behavior pattern analysis, some exhaustive SE’s like Google accept chains of any length as terms! Have a little fun by querying Google somewhat atypically: ffffffffffffffffffffffffffffffffffffffffffffff returns 568 references, 111111111111111111111111111111111111111 returns 20,700, and abracadabra repeated three times, abracadabraabracadabraabracadabra, returns 61 (Google as of 19 July 2006)! At the other semantic extreme, try “a” (24,570,000,000) and “the” (23,930,000,000) and you may guess the size of its virtual semantic depot!
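The definition of a term given above can be sketched as a small splitting routine. This is a hedged illustration only: the function name split_terms, the constant EXTRA_SEPARATORS and the choice of a regular expression are assumptions made for the example, and the separator set is simply the one listed in the text, not the rule set of any actual SE.

```python
import re

# Separator characters taken from the example list in the text (an assumption;
# real engines may use different sets).
EXTRA_SEPARATORS = ".,:;+-="


def split_terms(text, extra_separators=""):
    """Split text into terms on whitespace, plus any extra separator characters.

    No limit is placed on term length, so arbitrarily long chains survive
    as single terms.
    """
    if extra_separators:
        pattern = r"[\s" + re.escape(extra_separators) + r"]+"
    else:
        pattern = r"\s+"
    # Drop the empty strings produced by leading or trailing separators.
    return [term for term in re.split(pattern, text) if term]


print(split_terms("a+b c.d"))                    # blank-only splitting
print(split_terms("a+b c.d", EXTRA_SEPARATORS))  # with special characters
print(split_terms("abracadabra" * 3))            # one long chain, one term
```

With blank-only splitting, “a+b” stays a single term, while the extended separator set breaks it into “a” and “b”. In both cases the repeated abracadabra chain is kept whole, which is the behavior the queries above rely on.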