18. A little more about Darwin Ontology I

By Juan Chamero

Darwin Root
Darwin Prototypes

In the beginning…

    The Darwin Ontology Conjectures propose a new vision of Knowledge Management. These Conjectures deal with the documents of Human Knowledge and with how they are written. Knowledge is an essential substance of the human being, so far beyond our reach except through its “documentation”. Knowledge is like the air we breathe: it is out there and everywhere, a sort of subtle energy that we “understand” and share with others for mutual understanding. In our attempt to make it attainable, enduring, and usable, we document it; that is, we spend physical energy to make meaningful marks, initially on rocks, trees, and soil, then on soft pieces of clay, papyri, and paper, and recently on much more sophisticated media.
    We may say that this treasure of marks is what we humans believe knowledge to be. This representation constitutes our wisdom, which can be preserved and transmitted as long as we keep investing more and more energy. Of course, these sequences of marks have meanings that depend on something we call “languages”.
    Over time documents have also evolved a great deal, from simple messages of presentation, salutation, and warning to codes and treatises. In Systems Theory we used to say that volume brings, by itself, complexity, which in some cases tends to grow exponentially! A single-paragraph message is not the same as a document of 100,000 paragraphs. We may dare to say that knowledge perhaps tends to aggregate itself harmoniously from experience, following an ordered architecture, let’s say its natural architecture, independently of how we “see” it. And why ordered? We were tempted to say: because the “universe we see” always looks ordered! But that is a trick, isn’t it?
    It seems that human wisdom evolved hierarchically, initially from facts and particularities to general “findings”, until rules and principles were discovered, and then from general subjects down to very specific ones. In our Darwin Ontology we went further, venturing some other strong Conjectures, such as how humans create new concepts specifically by discipline and by using common words and expressions, and how these concepts define the semantic spectrum of subjects within disciplines, resembling “semantic fingerprints”.

Darwin Prototypes
    These Conjectures were tested and found to hold in two prototypes: Computing and Art. The sample universe was the Web, which currently hosts about 12,000 million documents representing Human Knowledge as_it_is or, better, as_it_is_seen, mainly by authorities. In numbers, this sample contains approximately:

12,000,000,000 Web documents;
10,000,000 concepts per language;
400,000 subjects.

These belong to about 200 Major Subjects of Human Knowledge, or “Disciplines”, such as Medicine, Arts, Computing, Mathematics, Economy, and Games.
    Out of this semantic “Web Ocean” we have unveiled the inner structure of two disciplines, Computing and Art, which expressed in numbers are:

Computing Mapping:
~50,000,000 Web documents;
53,148 concepts (English);
1,200 subjects along a Semantic Tree of six levels (English);
Backed up by ~9,000 authorities;

    Here we see the upper and lower parts of the Computing Thesaurus skeleton (keywords), exported from its database hosted at www.intag.org. The map was created in 2003, so some concepts may have become obsolete. At that time the semantic “seed” was a combination of the ACM Curricula and the IFIP-UNESCO Curricula, as per the figure below.

    In this figure ACM stands for ACM Curricula and RW for Rest of the World. At that time, in 2003, we were just beginning to unveil the Web as_it_is. Now Darwin algorithms can test “seeds” like this one, that is, hypotheses about how a given body of knowledge is present in the Web. As you can see, the logical tree for this prototype had four levels. Now Darwin algorithms can go deeper, to thirteen levels in the case of the Art Map.
The IT Thesaurus skeleton above is read as follows:

Column A: the “keyword” number;
Column B: the keyword name;
Column C: the name of the “basket” to which the keyword belongs.

    In our first prototype our programming and computing resources were rather scarce, so we sorted the whole set of 53,148 keywords into 34 large baskets: 14 belonging to the main branches of the ACM Curricula, 12 to the main core branches of the IFIP-UNESCO Curricula (those emerging from root LT), 4 to contextual subjects (G), and 4 to strongly cross-related disciplines (U). Our first intuitive approach was that concepts would be well identified by pairs [keyword, subject to which it belongs]. We checked that this criterion worked well for this discipline, IT; that is, querying by [keyword, root], for example [algorithmic techniques, computing], gave sufficient discrimination. In general, however, it is not enough, because the same keyword name is likely to be found under several subjects within the same discipline tree. In our second prototype we overcame this difficulty by defining concepts properly, as the semantic chains that reach a given keyword along a path from the root!
    In this skeleton the tree was traversed top-down and from right to left within levels. Columns D, E, and F are auxiliary Boolean variables used to navigate the tree.
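
    The difference between identifying a concept by a [keyword, root] pair and by its semantic chain can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the tiny subject tree, the keyword “design”, and the function names are hypothetical and are not taken from the Darwin database.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy subject tree: subject -> parent subject (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "computing": None,
    "software": "computing",
    "hardware": "computing",
}


def pair_key(keyword: str, subject: str) -> Tuple[str, str]:
    """First-prototype style key: [keyword, root of the tree]."""
    root = subject
    while PARENT[root] is not None:
        root = PARENT[root]
    return (keyword, root)


def semantic_chain(keyword: str, subject: str) -> Tuple[str, ...]:
    """Second-prototype style key: the path from the root down to the keyword."""
    path = [keyword]
    current: Optional[str] = subject
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return tuple(reversed(path))


# The same keyword name filed under two different subjects of one discipline.
a = ("design", "software")
b = ("design", "hardware")

print(pair_key(*a) == pair_key(*b))  # the pair key cannot tell them apart
print(semantic_chain(*a))
print(semantic_chain(*b))
```

    In this toy case both occurrences of “design” collapse to the same pair, while their chains differ at the second level, which is the kind of ambiguity the semantic chain is meant to remove.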

Art Mapping:
~100,000,000 documents;
~140,000 to 300,000 concepts (English);
7,570 subjects along a Semantic Tree of thirteen levels (English);
Backed up by ~20,000 authorities;

    As this map is so huge compared with the first, we show here only samples of the Art Logical Tree skeleton. Take into account that the first mapping involved fewer than 1,200 nodes, with its keywords sorted into 34 baskets, whereas here we are talking about a Logical Tree of 7,570 nodes and a Thesaurus of 140,000 to 300,000 concepts. To compare these two prototypes, consider that we have to take into account 7,570 baskets versus 34!
    Here we see a far more complex and higher-resolution mapping. The figure above depicts a micro sample of it concerning “theatre”, a sub-tree at the sixth level: Art, The Arts, Performing Arts, Music, Genres, Theater. Within this sub-tree we marked “Lyric”, a very frequent word within Art but unique if we consider its “semantic chain”: we have solved the inherent ambiguity of keyword names! An Intelligent Wizard takes these semantic chains and transforms them into efficient queries by eliminating redundancies and virtual nodes (such as genres, nodes that do not yet exist but are needed internally to harmonize Thesauruses). As you can see, the Wizard then uses the optimized query to query a pool of Search Engines. The whole interface system between the user and the Thesaurus behaves like “semantic glasses”, enabling users to see the Web as perfectly ordered.
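
    How a semantic chain might be reduced to a compact query can be sketched as follows. The rules used here are our own assumptions for illustration, not the Wizard's actual algorithm: a node is dropped if it is listed as virtual, and a term counts as redundant when all of its words are already covered by a more specific term further down the chain. The stop-word list, the naive plural handling, and the function names are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical stop words ignored when comparing terms.
STOP_WORDS = {"the", "a", "an", "of"}


def normalize(term: str) -> Set[str]:
    """Lowercase a term, drop stop words, and strip a naive plural 's'."""
    words: Set[str] = set()
    for word in term.lower().split():
        if word in STOP_WORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.add(word)
    return words


def build_query(chain: List[str], virtual_nodes: Iterable[str]) -> str:
    """Turn a root-to-keyword chain into a query string.

    Walks from the most specific term back to the root, skipping virtual
    nodes and any term whose words are already covered by kept terms.
    """
    virtual = {name.lower() for name in virtual_nodes}
    kept: List[str] = []
    covered: Set[str] = set()
    for term in reversed(chain):
        if term.lower() in virtual:
            continue
        words = normalize(term)
        if words <= covered:
            continue  # redundant: adds no new word to the query
        kept.append(term)
        covered |= words
    kept.reverse()
    # Multi-word terms are quoted so a search engine treats them as phrases.
    return " ".join(f'"{t}"' if " " in t else t for t in kept)


chain = ["Art", "The Arts", "Performing Arts", "Music", "Genres", "Theater", "Lyric"]
print(build_query(chain, virtual_nodes={"Genres"}))
```

    With the chain from the text and “Genres” marked as virtual, this sketch keeps only “Performing Arts”, Music, Theater, and Lyric; a real Wizard would presumably apply richer rules than word overlap.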

     Now the Darwin algorithm takes into consideration 14 variables instead of five, as in the first prototype. If you have worked with trees before, the explanation will be straightforward to follow.

    Here we show the upper levels of this Art Map. In this case the seed was a basic tree of fewer than 40 branches. We tested the mapping with different seeds in order to test the procedure’s ability to test seeds automatically. Seeds that do not fit the discipline well, as it is represented in the Web, quickly give rise to strong incoherencies. Let’s imagine that we are Art experts with a bias toward the “too classic”, dismissing new expressions of art. Soon, when diving deep past the first levels, strong incoherencies begin to appear, such as myriads of themes, authorities, and documents that, as per the seed, have no way to reach the root!
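
    One simple way to picture such an incoherence is to count how many of the themes found in the Web cannot be connected to the root through the seed tree. The sketch below is only an assumption about how this could be measured, not the Darwin procedure itself; the seed, the list of found themes, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def reaches_root(subject: str, parent_of: Dict[str, str], root: str) -> bool:
    """Follow parent links from a subject and report whether the root is reached."""
    seen: Set[str] = set()
    current = subject
    while current not in seen:
        if current == root:
            return True
        if current not in parent_of:
            return False  # the seed has no place for this theme
        seen.add(current)
        current = parent_of[current]
    return False  # a cycle in the seed: the root is never reached


def incoherence_ratio(found: List[str], parent_of: Dict[str, str], root: str) -> float:
    """Fraction of found themes that have no path to the root in the seed."""
    if not found:
        return 0.0
    orphans = [s for s in found if not reaches_root(s, parent_of, root)]
    return len(orphans) / len(found)


# Hypothetical "too classic" seed: child subject -> parent subject.
seed = {
    "painting": "art",
    "sculpture": "art",
    "oil painting": "painting",
}
# Hypothetical themes found while mapping Web documents.
found_themes = ["oil painting", "sculpture", "street art", "video art"]

print(incoherence_ratio(found_themes, seed, root="art"))
```

    A ratio that grows quickly as deeper levels are explored would be one sign that the seed does not fit the discipline as it is represented in the Web.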
    The resulting structures take the form of Knowledge Maps, in which a particular body of knowledge is fully mapped down to its deepest level of de-aggregation, as far as the Web allows. What does this mean? Concerning Information Management, we may improve both sides of it, namely the indexing side and the retrieving side: the first by enabling us humans to index semantically any Web document in a Web Library by disciplines, and the second by enabling us to search the Web “directly”, ideally with only one click of the mouse. This mapping, extended to the whole of knowledge, may enable us to build SSSEs, Semantic Super Search Engines of the YGWYN (You Get What You Need) type, in only one click!
    These SSSEs could be implemented in any conventional search engine, in any personal computer, and even in mobile telephones, opening a new era of Information Management.