8. The Web as a semantic hypercube

By Juan Chamero

About a rational semantic indexing
    In this figure we depict how our Darwin Technology may transform a conventional search engine like Google into an SSSE, a Semantic Super Search Engine, and how it could work in the future by classifying documents at the moment they are discovered by search engine robots. As we will see in a coming post, this SSSE could be implemented on all types of personal computers via our Darwin e-membrane. See below how Darwin structures these hypercubes.

Google, like most conventional search engines, is semantically “flat”, unstructured: documents are only indexed by words (see more details about this “flatness” in our past posts). Starting from its basement (as the data intelligence is hidden but retrievable by Darwin agents), through a Darwin interface (the Darwin e-membrane), all Google content could be seen as structured in an ordered building of up to thirteen levels. For each subject, now perfectly identifiable, Darwin may hold a considerable number of authorities, in the figure up to 100,000+. For each of them the Darwin fingerprint algorithm creates its semantic fingerprint, and another Darwin algorithm synthesizes this tank of authorities to build the subject’s fingerprint.
Darwin hypercube: The Semantic Web could be seen as an enormous data store organized into discrete virtual semantic cubicles within a semantic “hypercube”. In each cubicle we may imagine a set of “documents” that, sharing a common “subject”, use a “keywords” set strongly bound to that common subject (in our Darwin Ontology, see the keyword specificity conjecture). Another approach would be to locate a set of documents that, sharing a common subject, use with high probability a keyword set associated to the subject. This architecture enables Internet users to retrieve valuable documents directly, in only one click. To implement these features we need the Web space organized in at least three dimension-domains, namely:

The subjects’ domain;
The keywords’ domain; and
The documents’ domain.
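The three domains above can be sketched as a toy index. This is only a minimal illustration under our own assumptions (class and method names, subject identifiers and URLs are all hypothetical, not part of Darwin itself): each subject acts as a cubicle holding its keyword set (Y axis) and its document set (X axis), so retrieval by subject is a single lookup.

```python
# Minimal sketch (all names hypothetical) of the three-domain index:
# subjects (Z), keywords (Y), and documents (X).

class SemanticHypercube:
    """Toy three-domain index: subject -> keywords, subject -> documents."""

    def __init__(self):
        self.keywords_of = {}   # subject pointer -> set of keywords (Y axis)
        self.documents_of = {}  # subject pointer -> set of document URLs (X axis)

    def register(self, subject, keywords, url):
        """File a document into the cubicle of its subject at discovery time."""
        self.keywords_of.setdefault(subject, set()).update(keywords)
        self.documents_of.setdefault(subject, set()).add(url)

    def one_click(self, subject):
        """Retrieve, in 'one click', all documents filed under a subject."""
        return sorted(self.documents_of.get(subject, set()))

cube = SemanticHypercube()
cube.register(153000, {"lung cancer", "oncology"}, "http://example.org/a")
cube.register(153000, {"carcinoma"}, "http://example.org/b")
print(cube.one_click(153000))  # both documents filed under subject 153,000
```

The point of the sketch is that the cost of one-click retrieval is paid at registration time, when the robot files each discovered document into its semantic cubicle.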

 In the figure above we see the Z subjects’ axis coming out of the page. Along this axis subjects are embedded either along paths or levels of their discipline’s logical tree, so the metric along this axis is discrete and associated to the linearization of a tree. The second axis is the Y keywords’ axis, classified alphabetically within subjects’ segments. It means that if a given subject is indexed by “pointer” 153,000 (in the figure we estimate 300,000 pointers to allocate an equal number of subjects), its corresponding keyword segment should be located in “interval” 153,000 and classified alphabetically within it. Remember the example from our last post: an index locating “lung cancer” (35) within “oncology” (17) within “allopathic medicine” (01), all within medicine as discipline number 3, which we could make correspond to pointer 153,000, pointing to the subject “lung cancer diseases”.
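The linearization of a discipline tree into discrete Z-axis pointers can be sketched in a few lines. This is a hypothetical illustration (the nesting and the traversal order are our assumptions, not Darwin's published algorithm): a depth-first walk of the tree assigns each subject one sequential pointer, exactly the kind of discrete metric described above.

```python
# Hypothetical sketch: linearizing a discipline tree into sequential subject
# pointers by depth-first traversal, one Z-axis index per subject.

def linearize(tree, path=(), out=None):
    """Flatten a nested dict of subjects into an ordered pointer table."""
    if out is None:
        out = []
    for name, children in tree.items():
        out.append("/".join(path + (name,)))          # assign next pointer
        linearize(children, path + (name,), out)      # then descend
    return out

# The lung-cancer example from the text, as a tiny nested tree.
medicine = {"medicine": {"allopathic medicine": {"oncology": {"lung cancer": {}}}}}
pointers = linearize(medicine)
for i, subject in enumerate(pointers):
    print(i, subject)
```

In a full-scale index the resulting list position would play the role of the pointer (e.g. 153,000), and keywords would be filed alphabetically inside that interval.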

Finally we have the X documents’ axis, corresponding to documents or pointers to them. These documents could be imagined either with their URLs classified alphabetically along the whole URL universe, or allocated in dense intervals corresponding to subjects. The second alternative would be preferable; it implies classifying documents at the time of their registration into their corresponding semantic “boxes” (we are talking about next-generation search engines).

The virtual slab shown in this figure along the X axis at Z = subject 153,000 depicts the portion of the space dealing with the fingerprint set of subject 153,000 interacting with all documents. The “spectral lines” in red show the interaction of documents dealing with subject 153,000 in terms of keyword matching. These slabs could be reduced, for instance by sorting, to cubicles of elementary dimensions:

[1, h(s), n(s)]

Where h(s) and n(s) are the number of keywords and the number of documents associated to subject (s). An alternative data architecture would be to assign cubicles [1, 1, 1] to each “node”, with enough memory to allocate the variable h(s) and n(s). In any case we may appreciate that this hyperspace is sparsely filled:

only 300,000 cubicles of dimensions 1×<h>×<n> ≈ 1×50×10,000 = 5×10^5 on average, out of a 300,000×10^7×10^10 Cartesian space: that is, one cell in 2×10^11 if using the above-mentioned [1, 1, 1] architecture.
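The occupancy figure can be checked directly with the numbers given above (300,000 subjects, on average 50 keywords and 10,000 documents per subject, against Y and X axes of 10^7 keywords and 10^10 documents):

```python
# Checking the sparsity arithmetic with the figures quoted in the text.
subjects = 300_000
avg_keywords = 50          # <h>, average keywords per subject
avg_documents = 10_000     # <n>, average documents per subject
keyword_axis = 10**7       # size of the Y keywords' axis
document_axis = 10**10     # size of the X documents' axis

filled = subjects * avg_keywords * avg_documents   # occupied Boolean cells
total = subjects * keyword_axis * document_axis    # full Cartesian space
print(filled)           # 1.5 x 10^11 occupied cells
print(total // filled)  # one occupied cell in 2 x 10^11
```

So the hypercube is filled at a density of roughly 5×10^-12, which is what makes a sparse representation mandatory.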

As we suspected, concepts are like celestial bodies in the universe: extremely abundant, tending to cluster around specific subjects but very distant from one another within the semantic Web. This feature facilitates their detection.

Of course one of the Darwin Conjectures says that keywords are extremely rare and strongly associated to subjects (the specificity rule), and documents, mainly authorities, are also very specific in their use of keywords. These characteristics were experimentally checked: within the Web ocean, a specific set of keywords tends to appear in documents that belong to the same subject, being almost nonexistent in documents dealing with other subjects. The conjecture states that if we find the same keyword associated to distant subjects, its meanings will be substantially different. Let’s remember that a concept in the Darwin Ontology is a semantic chain where the keyword is only its tail: the word or chain of words that appears in documents dealing with a given subject. We also said that concepts are pairs [s, k], standing for keyword (k) belonging to subject (s). Then why do we talk about chains, tails and heads? Because subject s inherits from s+, and this from s++, until arriving at the discipline root. Accordingly, a better description would be [root, … s++, s+, s, k], with k the tail and the root (mathematics, art, games, etc.) the head of the semantic chain.
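The chain notation can be made concrete with a small sketch. The hierarchy and the keyword below are hypothetical sample values (reusing the medicine example from earlier in the post); the representation as a plain tuple, head first and keyword last, is our own simplification.

```python
# Hedged sketch of a Darwin "concept" as a semantic chain [root, ..., s, k]:
# the discipline root is the head, the keyword is the tail.

def semantic_chain(subject_hierarchy, keyword):
    """Build the chain [root, ..., s++, s+, s, k] for a keyword."""
    return tuple(subject_hierarchy) + (keyword,)

chain = semantic_chain(
    ["medicine", "allopathic medicine", "oncology", "lung cancer"],
    "non-small-cell carcinoma",   # hypothetical keyword for illustration
)
print(chain[0])   # head: the discipline root
print(chain[-1])  # tail: the keyword itself
```

The same keyword attached to a distant hierarchy would form a different chain, which is how the conjecture distinguishes its different meanings.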

In theory all this hypercube space could be filled with Boolean ones and zeroes, no matter whether our conjectures apply or not. Effectively, any document, even an authority, may use any keyword belonging to any subject, but rarely, and probably as examples and analogies. For instance we use in our posts some bio keywords like “membrane”, but as an analogy. We introduce here “e-membrane” as a new concept that will probably become a new keyword as authors dealing with AI applications in Knowledge Management begin to use it. A Boolean one corresponds to a document (d) dealing with a subject (s) that uses a keyword (k). In this simplified approach we are working with predefined (inferred) subjects and unveiled keywords. Objectively we may imagine a larger space where all possible subjects and keywords are listed discipline by discipline.
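Since, as computed above, only about one cell in 2×10^11 holds a one, the natural implementation stores only the ones. Here is a minimal sparse sketch (document identifiers and helper names are hypothetical): each Boolean one is kept as a (d, s, k) triple in a set.

```python
# Sparse Boolean hypercube sketch: store only the ones as (d, s, k) triples,
# since the full Cartesian space is almost entirely zeroes.

ones = set()

def mark(document, subject, keyword):
    """Record the Boolean one: document d, on subject s, uses keyword k."""
    ones.add((document, subject, keyword))

def uses(document, subject, keyword):
    """True iff the corresponding cell of the hypercube holds a one."""
    return (document, subject, keyword) in ones

mark("doc-42", 153000, "lung cancer")
print(uses("doc-42", 153000, "lung cancer"))  # True
print(uses("doc-42", 153000, "membrane"))     # False
```

Membership tests on a set are constant-time on average, so lookups stay cheap no matter how large the surrounding empty space is.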

How to unveil the slippery subject: To make use of this architecture, which leads us to what is known as the “Web Thesaurus”, we need to know first, for each document, its subject and its associated keywords. No easy task! Let’s suppose that we ignore the subject of a document. How do we classify its keywords then? We face here a chicken-and-egg problem, both objects being fuzzy besides. Effectively, subjects are not always well defined, and concepts may have many literal acceptations sharing more or less the same meaning. However the Law of Large Numbers (weak and strong) comes to help us: for each “concept” the frequencies of its acceptations have a probability distribution with the following property: the most frequent acceptations almost always correspond to their “established” expressions. We will explain these coincidences in coming discourses. But the problem remains complex: how do we detect the “strings of words and symbols” that authors intentionally use to title their writings? And sometimes authors decide to deal with a subject without even naming it! We will go back to this crucial topic soon, once we discuss how to build bookmarks from “ground zero”, knowing only the name of the discipline we have to unveil. Subjects are special keywords; even authorities play literarily with them, for example to impact readers. Simple keywords, on the contrary, point to specific contents and statistically have dominant modal acceptations. In some disciplines, like Computing and Medicine, top authorities like the ACM and the JAMA establish precise curricula with precise names for their subjects. In other disciplines, like Art and Politics, the same subject could be titled very differently by different authorities.
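The frequency argument above can be sketched in a few lines: across a large corpus, the modal (most frequent) surface form of a concept is taken as its established expression. The corpus counts below are invented for illustration only.

```python
# Sketch of the Law-of-Large-Numbers heuristic: the modal acceptation of a
# concept almost always matches its "established" expression.

from collections import Counter

def modal_acceptation(occurrences):
    """Return the most frequent literal form among observed occurrences."""
    form, _count = Counter(occurrences).most_common(1)[0]
    return form

# Hypothetical observations of one concept across many documents.
corpus_forms = (["lung cancer"] * 80
                + ["pulmonary carcinoma"] * 15
                + ["lung ca."] * 5)
print(modal_acceptation(corpus_forms))  # "lung cancer"
```

The heuristic only works because the sample is large; on a handful of documents the mode could easily be a rare or playful variant, which is exactly the fuzziness described above.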

Again: What’s in a document?
Well Written Documents

A document is a string of characters (leaving aside images, figures, and tables, once editing commands are stripped off), and at large a chain of words. In our last posts we mentioned that authoritative texts are in fact WWDs, Well Written Documents (similar to the concept of WFFs, Well Formed Formulae, in Formal Logic): literarily well written documents where “common words and expressions” and “keywords” alternate harmoniously to build a meaningful content. In a document of 5,000 words we may detect from 5 to 50 different keywords dispersed here and there in the context. Keywords could be used many times along the document, but in WWDs the same string of words and/or symbols expresses the same meaning. If we imagine a collection of 10,000 documents dealing with the same subject, with a size of 5,000 words each on average, we are going to have a single sample of 50,000,000 words dealing with no more than 40 different keywords on average! On the contrary, the number of different common words has a bound that depends on the subject, the language and the literary sophistication of the authors, but ranges from 1,000 to 3,000.
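The statistics above suggest a simple profiling step: given a keyword list for a subject, split a text into words and count distinct keywords against distinct common words. The keyword set and sample text below are invented for illustration; real texts would of course need proper tokenization and multi-word keyword matching.

```python
# Toy profile of a text against a (hypothetical) subject keyword list:
# distinct keywords stay few, while common words dominate the vocabulary.

def profile(text, keyword_set):
    """Return (total words, distinct keywords found, distinct common words)."""
    words = text.lower().split()
    keywords_found = {w for w in words if w in keyword_set}
    common_words = {w for w in words if w not in keyword_set}
    return len(words), len(keywords_found), len(common_words)

keywords = {"oncology", "carcinoma", "metastasis"}   # hypothetical glossary
sample = ("the study of oncology shows that carcinoma and metastasis "
          "are central while the common words repeat again and again")
total, n_kw, n_common = profile(sample, keywords)
print(total, n_kw, n_common)
```

Scaled up to a 50,000,000-word sample on one subject, the same counting would show the contrast described above: a few dozen keywords against a common vocabulary in the low thousands.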

WWD’s for humans: Now the problem has been clarified a little. We as humans are able to distinguish keywords from common words, especially if we are experts in the discipline we are dealing with. We know that thesauruses and glossaries deal with keywords, their normal abundance in WWDs ranging from as few as 5 to as many as 50. By the way, a WWD is a document edited by an “authority” that not only knows the subject but the whole discipline to which it belongs, and masters pretty well how to document ideas, theories, essays, messages, reports and opinions as well. As knowledge is formalized like a logical tree of “from general to particular” topics, following a pyramidal hierarchy model from root to leaves, a WWD dealing with a subject of level 3 is supposed to use “established” keywords that belong to that level. Sometimes authors create new keywords (not yet accepted as such) that from that moment onwards belong “specifically” to the level where they were created. Of course authors may use keywords from upper or lower levels, but prudently: going upwards the minimum necessary to offer a frame of introduction that makes the document more comprehensible, and going downwards to clarify some topics with details. Authors may also use “collateral” terms belonging to subjects of the same level, for example for analogies, also in order to make the document more comprehensible. Darwin has built algorithms that, once adjusted for each discipline, generate for each document a number between 0 and 1 that acts as its WWD factor.
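The actual Darwin WWD algorithms are not spelled out in this post, but the shape of such a score can be sketched. The rule below is entirely our own assumption for illustration: it gives 1.0 when a document's distinct-keyword count falls in the expected 5-to-50 band and decays linearly toward 0 outside it.

```python
# Hypothetical sketch of a WWD factor in [0, 1]: full score inside the
# expected keyword band, linear decay outside. This is an analogy only;
# the real Darwin algorithm is adjusted per discipline and not published.

def wwd_factor(distinct_keywords, low=5, high=50):
    """Score a document by how plausible its distinct-keyword count is."""
    if low <= distinct_keywords <= high:
        return 1.0
    if distinct_keywords < low:
        return distinct_keywords / low          # too few keywords
    return max(0.0, 1.0 - (distinct_keywords - high) / high)  # too many

print(wwd_factor(20))  # inside the band: 1.0
print(wwd_factor(2))   # keyword-poor text: 0.4
print(round(wwd_factor(80), 2))  # keyword-stuffed text, score decays
```

A real version would combine several such signals (consistent keyword meanings, level-appropriate vocabulary, and so on), each adjusted to its discipline.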