19. A little more about Darwin Ontology II

By Juan Chamero

Darwin’s Differences with Data Mining
The Darwin Raw-Data Distillation Analogy
Towards Understanding People’s Behavior


Darwin versus Data Mining
    Data mining is a sorting process over huge amounts of data (the mine) that tries to extract significant “data patterns”. It presupposes no previous knowledge about the data. It belongs to the field of “Knowledge Discovery” and its procedures are rooted deep in statistics. It also presupposes that it is being used as an analytical tool for data analysis in man-machine environments. Patterns may then lead to the discovery of “behaviors”, in this case users’ behaviors. Its main field of application rests on real-world data with unknown interrelations. This characteristic carries an unavoidable weakness: the critical data that allegedly built the detected patterns may never be observed. Because of this weakness data mining always requires subsequent and costly “data dredging” studies.
    However, documents are human-generated data and as such they are somehow intelligently structured, so they are suited to a better approach: one that presupposes we know something about their inner structure, that is to say, one where we may choose a model of that inner structure. Choice Modeling is therefore an adequate tool for making probabilistic predictions about how humans decide and document. In this sense Darwin technology may be considered to fall within the Choice Modeling realm.
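As an illustration only (the source does not specify Darwin’s actual model), here is a minimal sketch of the canonical choice model, the multinomial logit, which turns assumed “utilities” of the available alternatives into the probabilities with which a human would choose each of them; all names and numbers are hypothetical:

import math

def choice_probabilities(utilities):
    """Multinomial logit: P(i) = exp(u_i) / sum_j exp(u_j)."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical utilities of three ways an author might document a topic;
# the model predicts how often each alternative would be chosen.
print(choice_probabilities([2.0, 1.0, 0.5]))  # ~[0.63, 0.23, 0.14]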

The Darwin distilling analogy


[Figure: the Art thematic tree with its distillation paths; the product “Rigoletto” reaches its leaf after 11 distillation steps]
    From the Art raw data we may imagine a distillation process that proceeds along distillation paths (instead of distillation steps and towers). In this figure the product “Rigoletto” undergoes 11 distillation steps. In this case Rigoletto is a “leaf” of the tree and at the same time a subject and a keyword. Some others, like “Fantasy novel” and “Paella”, are subjects that are not leaves but roots of sub-trees: types of fantasy novels within literature, and types of paella by cooking style and by region of Spain within gastronomy (these sub-trees are not represented here). Node keywords are not shown here, only nodes that are subjects; remember that all subjects are keywords but not all keywords are or become subjects. For the Art map we may find, on average, about 50 keywords per node.
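A minimal sketch, with hypothetical names, of how such a subject tree might be represented: every subject is also a keyword, each node carries its own keyword set (about 50 per node in the Art map), and leaves like “Rigoletto” have no sub-tree:

class SubjectNode:
    """A node of the thematic tree: a subject (which is also a keyword)
    plus its associated keywords and, possibly, a sub-tree of subjects."""
    def __init__(self, subject, keywords=(), children=()):
        self.subject = subject          # e.g. "Rigoletto", "Paella"
        self.keywords = list(keywords)  # about 50 keywords per node on average
        self.children = list(children)  # empty for leaves like "Rigoletto"

    def is_leaf(self):
        return not self.children

    def depth_of(self, subject, depth=0):
        """Number of distillation steps (tree depth) down to a subject."""
        if self.subject == subject:
            return depth
        for child in self.children:
            found = child.depth_of(subject, depth + 1)
            if found is not None:
                return found
        return None

# Toy fragment of the Art tree; the real path to "Rigoletto" has 11 steps.
art = SubjectNode("Art", children=[
    SubjectNode("Opera", children=[SubjectNode("Rigoletto")])])
print(art.depth_of("Rigoletto"))  # 2 steps in this toy tree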
    

    The Darwin Ontology Conjectures intend to model: a) how new concepts are generated; b) how the knowledge hierarchy evolves; and c) the documentation process. With this model “in mind” the Darwin anthropic algorithm proceeds, along a sort of industrial distilling process, to unveil the hidden semantic structure and authoritativeness behind data. As in the oil industry, the raw data is, like crude oil, thousands of trillions of words accumulated here and there in billions of semantic containers (documents) and generated by humans following established and well-known literary routines (the chemical processes).
    First we have to eliminate wastes and heavy by-products, leaving only distillable oil (Web pages are transformed into text strings). Then the Darwin algorithm has to process those text strings in order to separate two groups of semantic particles: Common Words and Expressions on one side and Specific Concepts on the other (and of course, as in any distillation process, waste appears). Then all concepts are “purified” and reclassified within their specific clusters, namely the core of the industrial semantic distillation process.
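A minimal sketch of that separation step, assuming a pre-built list of common words and expressions (the list, the names and the tokenization here are hypothetical stand-ins for Darwin’s actual procedure):

import re

# Assumed, tiny stand-in for the real list of common words and expressions.
COMMON_WORDS = {"the", "a", "an", "of", "and", "in", "is", "by", "to", "it"}

def distill(page_text):
    """Split a document's text string into common words and candidate
    specific concepts; everything else (numbers, punctuation) is waste."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    common, concepts = [], []
    for token in tokens:
        (common if token in COMMON_WORDS else concepts).append(token)
    return common, concepts

common, concepts = distill("Rigoletto is an opera in three acts by Verdi")
print(concepts)  # ['rigoletto', 'opera', 'three', 'acts', 'verdi']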
    To complete the analogy it is important to emphasize the role of chemical analysis. We may distill crude oil because we know “in advance” its chemical analysis at its deepest detail. What we know in advance as TRUE, apart from our Darwin Conjectures, are the conventional Search Engine indexes. Some search engines, like Google, index all Web documents by word, no matter whether the words are well or wrongly written! So we may know “in advance” crucial information about their semantic composition, something equivalent to the crude oil chemical analysis.
    Summarizing: Darwin may upgrade search engine indexes so that documents become directly retrievable in only one click. We may also apply a version of this algorithm, working within the same ontology, to properly index all documents at registration time, something equivalent to book classification. For each Web document we may automatically and autonomously compute its “semantic fingerprint”, something equivalent to its semantic spectrum, described in terms of the concepts it deals with; a sketch follows below. Next-generation search engines will incorporate this feature as standard. From this moment onwards we may sustain that “established knowledge is known” and that it may be properly used by all people.
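A minimal sketch of such a fingerprint, assuming the ontology’s concept list is already known (the concepts and the matching-by-single-word simplification are hypothetical; multi-word concepts would need phrase matching):

from collections import Counter

# Assumed, tiny stand-in for the ontology's full concept list.
ONTOLOGY_CONCEPTS = {"opera", "rigoletto", "verdi", "paella"}

def semantic_fingerprint(text):
    """The document's semantic spectrum: the ontology concepts it
    deals with, weighted by their relative frequency."""
    words = text.lower().split()
    counts = Counter(w for w in words if w in ONTOLOGY_CONCEPTS)
    total = sum(counts.values()) or 1
    return {concept: n / total for concept, n in counts.items()}

print(semantic_fingerprint("Rigoletto is an opera by Verdi"))
# {'rigoletto': 0.33..., 'opera': 0.33..., 'verdi': 0.33...}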
We may also say that from now onwards the Web content has been tamed, and we may speak of it as the Semantic Web.
    Paraphrasing Tim Berners-Lee, the Web’s creator, this is equivalent to knowing the Web Thesaurus, a whole meaningful mapping of the Web; and, as in a huge Virtual Library, documents will be perfectly classified and retrieved directly at will. We may also argue that the Web version of the Established Human Knowledge is then perfectly identified. However this identification strongly depends on “authorities”, those entities that rule what the established knowledge is at a given moment. Why? Because via their authoritativeness, authorities rule both meanings and names.

Towards the second Semantic Big Step

[Figure: the K-K’ equilibrium, with the mapped Web (K) in the lower part and an intelligent interface (yellow) between K and the users’ knowledge (K’)]
    We have worked with this image before: it depicts the K-K’ equilibrium. Initially the Web could be considered semantically “flat”, a huge reservoir of almost 12 billion documents not thematically indexed. The first step would be to build the Semantic Web, that is, the whole Web mapped as depicted in the lower part of the figure. Initially the intelligent interface between these two realms, where existent (in yellow), would work at only half of its possibilities, broadcasting information and intelligence one way from K to K’ (yellow arrow) while only receiving, without analyzing them, users’ queries (white arrow). Once K is mapped, both sides of an intelligent interface would work, enabling, at large, the mapping of K’.

    In the Darwin Ontology the Web Thesaurus is only the first big semantic step in the pursuit of truth. To obtain the best approximation to it we need to perform the second big step: to know, as much as possible, the People’s Truth and the People’s Thesaurus.
    We may speak of a sort of collective brewing truth on one side trying to change, to some extent, the Established Truth existing on the other side, in a process that resembles thermodynamic equilibrium.
    We may also imagine this equilibrium as a man-machine game where:

The established side tries to broadcast its truth to the users’ side and, at the same time, tries to make inferences (to “learn”) about the users’ next moves; and

The users’ side tries to “learn” as much as possible about the established knowledge and, at the same time, tries to broadcast, individually, its pieces of truth.