4. What a concept is

By Juan Chamero





The figure above depicts five paths of the ART TREE within our Darwin Ontology. For now, please accept that the established disciplines of Established Human Knowledge, as it is represented in the Web space, are logically ordered over “topological trees” that behave as their semantic skeletons.

In the Darwin Ontology, concepts are related one-to-one (bi-univocally) to tree paths. Each path has two singular nodes: the Head, or root, which corresponds to the top level of the discipline (in this case Art), and the Tail, which points to the concept as it is usually called or referenced by art people, and more specifically by the art people who belong to the tail’s neighborhood.

For example, Rigoletto is not a well-defined concept by itself, because it can be used in many senses, through different paths, within Art and within other disciplines. In this chain, Rigoletto refers to the opera Rigoletto, which is encased like a Russian doll in up to 10 semantic levels, namely:

Bel canto movement
Italian opera
History of opera
Opera
Genres
Theater
Main Performing arts
Performing arts
The Arts
Art

So the Rigoletto concept, as used in art documents belonging to this level (the 11th), is very specific. If documents dealing with the opera Rigoletto conform to this semantic profile along its full path, we usually say that they are very “professional”.
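As a minimal sketch of the idea that a concept is identified by its whole path rather than by its tail alone, the Rigoletto chain can be written as a head-to-tail tuple. The level names come from the list above; the helper names (`head`, `tail`, `depth`) and the second, placeholder path are assumptions made only for illustration and are not part of the Darwin Ontology itself.

```python
from typing import Tuple

Path = Tuple[str, ...]

# Head-to-tail path for the opera Rigoletto: the ten levels listed in the
# text, reversed so that the root "Art" comes first, plus the tail itself.
RIGOLETTO_OPERA: Path = (
    "Art",
    "The Arts",
    "Performing arts",
    "Main Performing arts",
    "Theater",
    "Genres",
    "Opera",
    "History of opera",
    "Italian opera",
    "Bel canto movement",
    "Rigoletto",
)

# Placeholder path (NOT from the source): same tail, different chain.
OTHER_USE: Path = ("Some other discipline", "Rigoletto")


def head(path: Path) -> str:
    """Root of the discipline tree."""
    return path[0]


def tail(path: Path) -> str:
    """Name by which the concept is usually referenced."""
    return path[-1]


def depth(path: Path) -> int:
    """Number of levels in the path, counting head and tail."""
    return len(path)


if __name__ == "__main__":
    print(" > ".join(RIGOLETTO_OPERA))
    print(depth(RIGOLETTO_OPERA))          # the tail sits on the 11th level
    print(RIGOLETTO_OPERA == OTHER_USE)    # same tail, different concept
```

Comparing whole tuples rather than tails is the design choice that matters here: two paths that end in the same word are still distinct concepts.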

Concepts transport meaning, a meaning that should be pretty much the same in all languages. This is very important for any computing ontology that aims to be universal. As we will see later, these trees cannot be translated literally because of this characteristic. Knowledge Trees may differ between languages and must be built independently for each language. For example, documents dealing with the Rigoletto opera concept with similar professionalism in English and in Spanish may be located through different semantic chains, and topic names cannot be trivially translated either. On this point, see the noogony of Immanuel Kant and his work Critique of Pure Reason.


This is an introduction to the essence of Darwin Technology, which stands for Distributed Agents to Retrieve the Web Intelligence and which enables semantic search at any level of depth. SSSEs, Super Semantic Search Engines, are those driven by concepts instead of keywords, the latter understood as words (or chains of words). An SSSE assumes that you, as a user, are querying by concepts and guides you accordingly. If not warned, a user querying by common words will soon learn the futility of their use. By querying by concepts, users will soon learn the best strategy to get what they are looking for in a few queries, two on average, while quickly learning how to create their own knowledge base. No more idle time, no more ambiguity, no more cognitive “noise”.

What a subject is

A concept is a meaningful word or string of words (words in a precise order, from left to right in our Western languages) always associated with a given subject. Keywords, by contrast, are only meaningful words or strings of words that could be associated with many subjects. A subject is something that has to be known and mastered within a “discipline”, and is sometimes one of its main activities. So if we denote the keyword by k and the associated subject by s, a concept is a pair [k, s]. If a discipline like Computing is a Major Subject (a branch of Human Knowledge), then Software Engineering, Information Management and Programming Fundamentals are three of its 14 main subjects according to the 2001 curriculum of the ACM (Association for Computing Machinery). Well-known, “established” disciplines have their respective curricula, broadly a hierarchical “tree” structure of up to eight levels of subjects.

Information Management keywords like “cellular cloning” and “category systems” have different meanings in other disciplines, for instance the chemical industry, engineering, physics, and biology. Some very common keywords like “parallel processing” may form part of concepts in as many as a hundred disciplines.
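The pair notation [k, s] can be sketched as a small data structure. This is only an illustration: the `Concept` type and the helper names are hypothetical, and the pairings marked "illustrative" are assumptions chosen to show one keyword attached to several subjects, not data taken from any real curriculum.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Concept(NamedTuple):
    keyword: str  # k
    subject: str  # s


CONCEPTS = [
    Concept("parallel processing", "Information Management"),
    Concept("category systems", "Information Management"),
    Concept("category systems", "Biology"),         # illustrative pairing
    Concept("parallel processing", "Engineering"),  # illustrative pairing
]


def subjects_by_keyword(concepts: Iterable[Concept]) -> Dict[str, Set[str]]:
    """Index every keyword by the set of subjects it is paired with."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for concept in concepts:
        index[concept.keyword].add(concept.subject)
    return dict(index)


INDEX = subjects_by_keyword(CONCEPTS)


def is_ambiguous(keyword: str) -> bool:
    """A bare keyword is ambiguous when it is paired with more than one subject."""
    return len(INDEX.get(keyword, set())) > 1


if __name__ == "__main__":
    for keyword, subjects in sorted(INDEX.items()):
        print(keyword, "->", sorted(subjects), "ambiguous:", is_ambiguous(keyword))
```

The keyword alone maps to a set of subjects, while each (keyword, subject) pair is a distinct value, which is the distinction between keywords and concepts drawn above.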

The miracle of literary specificity

Subjects – nodes: To avoid confusion, concepts can be treated as keywords when they refer “by default” to a given discipline; as a notational simplification, [k, s] is rendered as [k] within s. These concepts behave somewhat like precise semantic pointers to documents, even within a given discipline. Let’s see why. If we call s0 a “Major Subject” root like Medicine, then s01, s02, s03, …, s0n could be its n subsidiary subjects (clinical medicine, pediatrics, oncology, …), into which level 0, the root, branches to form level 1 immediately below it. Following the third branch (Oncology), we reach a second level of depth, of specialization: s031, s032, s033, …, s03b, where s03 (Oncology) in turn opens from level 1 into b subsidiary branches (prostate cancer, gastrointestinal cancer, breast cancer, bone cancer, …) to form level 2, and so on.
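The node notation above can be read mechanically. The sketch below assumes that each character after the leading "s" encodes one level (so s0 is level 0, s03 is level 1, s033 is level 2), which matches the examples in the text but would need a different encoding for branches numbered beyond a single character. The function names and the small `LABELS` table are hypothetical; the label for s033 assumes the codes follow the order in which the branches are listed.

```python
from typing import List, Optional

# Labels taken from the Medicine example; codes assumed to follow listing order.
LABELS = {"s0": "Medicine", "s03": "Oncology", "s033": "Breast cancer"}


def _check(code: str) -> None:
    if len(code) < 2 or not code.startswith("s"):
        raise ValueError(f"not a subject code: {code!r}")


def level(code: str) -> int:
    """Level of a node: 0 for the root s0, 1 for s03, 2 for s033, ..."""
    _check(code)
    return len(code) - 2


def parent(code: str) -> Optional[str]:
    """Code of the node one level up, or None for the root."""
    _check(code)
    return code[:-1] if len(code) > 2 else None


def ancestors(code: str) -> List[str]:
    """All codes on the path from the root down to the node itself."""
    _check(code)
    return [code[:i] for i in range(2, len(code) + 1)]


def describe(code: str) -> str:
    """Readable path, falling back to the raw code when no label is known."""
    return " > ".join(LABELS.get(c, c) for c in ancestors(code))


if __name__ == "__main__":
    print(level("s033"), ancestors("s033"))
    print(describe("s033"))
```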

The Specificity Rule: Keywords that appear associated with a subsidiary fourth-level subject s0xyz, as the concept pair [k, s0xyz], can be considered as “pertaining” to this subject “node” s0xyz. One important, statistically tested fact is that the probability of finding a keyword belonging to one node in another node is inversely correlated with the semantic distance between the two nodes, measured along the tree. This means that keywords are very “specific”. The probability that a keyword belongs to its specific node is very high, usually above 85% on average. Keywords can, however, be observed sparsely distributed (15% on average) in documents belonging to their specific “subject’s neighborhood”, with distance measured in levels over the tree. In summary, each node has a set of associated keywords. Some of these keywords are 100% specific, meaning that it is almost impossible to find them in “authoritative” documents dealing with subjects one level up or down, and hard to find them in collateral nodes at the same level. Others (a minority) may be found in documents dealing with neighboring subjects, up, down and collateral.
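One possible reading of "semantic distance measured along the tree" is the number of edges between two nodes, which for prefix-style codes such as s031 and s032 can be computed from their shared prefix. The sketch below works under that assumption (one character per level, both codes under the same root) and measures specificity as the share of a keyword's occurrences found in its home node. The occurrence counts are invented to mirror the 85% / 15% figures quoted above; they are not measured data, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading part of two node codes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tree_distance(a: str, b: str) -> int:
    """Edges between two nodes: up to the deepest shared ancestor, then down."""
    shared = common_prefix_len(a, b)
    return (len(a) - shared) + (len(b) - shared)


def specificity(counts: Dict[str, int], home: str) -> float:
    """Fraction of a keyword's occurrences that fall in its home node."""
    total = sum(counts.values())
    return counts.get(home, 0) / total if total else 0.0


def spread_by_distance(counts: Dict[str, int], home: str) -> Dict[int, int]:
    """Occurrences grouped by tree distance from the home node."""
    spread: Dict[int, int] = defaultdict(int)
    for node, n in counts.items():
        spread[tree_distance(home, node)] += n
    return dict(spread)


if __name__ == "__main__":
    # Invented counts for one keyword whose home node is s031.
    counts = {"s031": 85, "s03": 6, "s032": 5, "s0": 4}
    print(specificity(counts, "s031"))
    print(spread_by_distance(counts, "s031"))
```

Grouping the counts by distance makes the rule checkable: under the claim above, the totals should fall off as the distance from the home node grows.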

Concepts are like ideograms: We have to take into account that s in each pair [k, s] stands for the subject of a given node of the discipline tree, anywhere from its root to its leaves. For instance, [k, s01325] would point to those keywords that belong to subject s01325, four levels below the root. Keeping these differences between keywords and concepts in mind, we may continue. Any concept could be represented by a symbol, like a Chinese ideogram. In the figure below we depict three arbitrary ideograms for

[firewall, networking]
[clustering, discrete structures]
[parallel processing, information management].



The same keyword, “clustering”, paired with another subject would identify a different concept and, accordingly, a different ideogram.
 
From words to keywords, then to good keywords, then to concepts: Initially users look for words; then they learn how to look for keywords, until they discover “good keywords”. That is not enough in the search for excellence: they have to take the next step and use concepts instead of good keywords.

Documents are complex objects made visible to humans by special markup formats, such as HTML. The textual content “behind” these edited texts consists of either common words and expressions or concepts. Any document presumably deals with one or more subjects, sometimes explicit, sometimes misleading, sometimes hidden. Agents must be set up and tuned to discriminate between these two types of semantic particles and, in addition, to unveil subjects. Any subject is a concept, but the reverse is not true: only some concepts may become subjects along their evolution. In conventional search, users handle just words or chains of words as pointers to get information and knowledge out of data reservoirs via queries.
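To make the distinction between the two kinds of semantic particles concrete, here is a deliberately naive sketch of what such an agent might do at its simplest: given the keywords already associated with a subject, it splits a text into matched keywords and leftover common words using a greedy longest-match over word sequences. This is an assumed stand-in technique, not the Darwin agents' actual method; the function name, the `max_len` parameter and the sample sentence are all hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def tag_particles(
    text: str, keywords: Iterable[str], max_len: int = 3
) -> Tuple[List[str], List[str]]:
    """Split text into known subject keywords and remaining common words.

    At each position the longest known keyword (up to max_len words) wins;
    anything that matches no keyword is treated as a common word.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    known = {tuple(k.lower().split()) for k in keywords}
    concepts: List[str] = []
    common: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if gram in known:
                concepts.append(" ".join(gram))
                i += n
                break
        else:
            # No keyword starts at this position.
            common.append(tokens[i])
            i += 1
    return concepts, common


if __name__ == "__main__":
    found, rest = tag_particles(
        "The firewall uses parallel processing.",
        ["firewall", "parallel processing"],
    )
    print(found)
    print(rest)
```

A real agent would also have to decide which subject each matched keyword belongs to before it could be called a concept; this sketch only performs the first separation step.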