9. How do e-membranes work to build Thesauruses?

By Juan Chamero

Darwin as a network of e-membranes

This figure depicts how Darwin works, out-flowing intelligence from the Web to the users' space while at the same time enabling an inflow of intelligence from people to the Web, in a sort of "intelli-in/intelli-out" pumping process.

The black region represents the Web space where pages are hosted. The green crown represents the Users’ Cyberspace, the platform from which you and I, as users, interact with the Web. Any Website, for example a Portal (in white), interacts with a region of the Cyberspace depicted as a conic segment. In yellow we represent the universe of e-membranes of our ideal Darwin network, a global interface between Websites and their users distributed “all over the world” along different conic regions (markets). The interaction between a Portal and its users is magnified at right. The Website “owner” provides a cognitive offer that allegedly satisfies the users’ cognitive demand. MM stands for the Portal’s Matchmaking Membrane.

At both sides of the interface different statistics are computed, represented by sigma yellow and sigma blue respectively. Users are far less structured than Websites in how they try to satisfy their information needs. As we will see later, as collective entities they also have People’s Thesauruses, similar in topological structure to the Web Thesaurus but with a different meaning. What users are looking for is most of the time hidden from the “agents’ eyes”; however, one thing is sure: in order to obtain valuable information from the “Established Side”, users have to express themselves as much as possible in terms of the “Established Concepts” of the Web Thesaurus.

If this common-sense conjecture holds, it becomes easier to infer from inside (the Web) what is happening outside, in the Realm where users interact. We have to instruct agents to detect users’ keywords that are allegedly bound to subjects useful to them. Let’s see, briefly and conceptually, how Darwin Technology first builds the Web Thesaurus, the essential tool for making these types of inferences.

Beginning from “ground zero”: let’s suppose that we are to map a whole discipline, for example Medicine. If Medicine had 2,000 subjects (as defined by the AMA and JAMA associations, for instance) we might retrieve from the Web and keep 2,000 samples of 200 million words each. For the moment we are going to simplify things by considering Medicine as semantically “flat”, that is, those 2,000 subjects are indexed at a single level, all mixed up. As an industrial analogy, this huge tank of data could be considered crude oil to be distilled through a sort of semantic industrial process. Our Darwin Technology could be imagined as a semantic industrial process of nearly 80 steps. These steps are structured as an anthropic algorithm where agents and humans work in a cooperative scenario. Actually, 60 out of these 80 steps are performed automatically and autonomously by agents. The goal is to transfer to agents almost all steps, keeping for humans only a few concerning the agents’ strategy, settings and tune-ups. Agents and algorithms operate through an e-membrane from which agents GO to the Web to GET a meaningful set of Medicine authorities (see the meaning of authorities in our last posts).

Note: GO and GET are rather complex Operators. They resemble Programming Instructive documents providing agents access to all the resources they may need: tasks, functions, interfaces, (id, password) pairs, parameters, templates, symbols, bounds, etc.
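As a minimal sketch, such an instructive document could be modeled as a structured record. All field names and values below are illustrative assumptions, not Darwin's actual specification:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a GO/GET Operator as a "Programming Instructive"
# document bundling everything an agent may need for a mission.
@dataclass
class Operator:
    name: str                                        # "GO" or "GET"
    tasks: list = field(default_factory=list)        # what the agent must do
    interfaces: list = field(default_factory=list)   # access channels to use
    credentials: dict = field(default_factory=dict)  # (id, password) pairs
    parameters: dict = field(default_factory=dict)   # bounds, templates, symbols

# A toy GO mission: fetch authorities for one Medicine subject.
go = Operator(
    name="GO",
    tasks=["retrieve authority documents for subject"],
    interfaces=["http-fetch"],
    credentials={"id": "agent-01", "password": "<secret>"},
    parameters={"max_pages": 1000, "subject": "Cardiology"},
)

print(go.name, go.parameters["subject"])
```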

Once the crude data is obtained, a second family of agents and algorithms proceeds to process it internally to build a first version of the Medicine Semantic Skeleton and of its associated set of “potential keywords”. On a second run, a third family of agents and algorithms proceeds to test Darwin Conjectures against the Web: for example, all semantic chains are checked for their alleged validity and, where necessary, against specificity rules. Darwin Technology then sends the first family of agents on a second mission. As Darwin Thesauruses may evolve by themselves, some minor distortions may be auto-adjusted along the interactions.
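The three agent families can be sketched as a toy pipeline; the function bodies are placeholder assumptions that only illustrate the hand-offs between families, not the real agents:

```python
# Sketch (assumed structure) of the three agent families: family 1 fetches
# crude data, family 2 builds a first semantic skeleton with candidate
# keywords, family 3 tests the result and flags subjects for a second mission.

def family_one_fetch(subjects):
    # Stand-in for agents that GO to the Web and GET authority documents.
    return {s: f"crude authority text for {s}" for s in subjects}

def family_two_build(crude):
    # First skeleton version: here, just the candidate-keyword set per subject.
    return {s: set(text.split()) for s, text in crude.items()}

def family_three_test(skeleton, min_keywords=3):
    # Toy conjecture test: flag subjects whose semantic chain looks too thin.
    return [s for s, kws in skeleton.items() if len(kws) < min_keywords]

crude = family_one_fetch(["Cardiology", "Oncology"])
skeleton = family_two_build(crude)
second_mission = family_three_test(skeleton)
print(len(skeleton), second_mission)
```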

Important: In the next posts we are going to see what happens when we have to map disciplines that do not have a meaningful and consensual curriculum. In those cases Darwin Technology performs its task with one more step: to start the mapping, Darwin needs a “semantic seed” provided by humans.

These e-membranes resemble bio-membranes, endowed with an endoderm, a mesoderm and an ectoderm, behaving as almost-alive man-machine interfaces between Websites and Users. If plugged into a network, once the pertinent knowledge is mapped, they may detect and classify users’ main behavior patterns! That’s the meaning of the D, which stands for Distributed, in the Darwin acronym. The ectoderm manages Users-to-Web traffic, the endoderm manages the updating and maintenance of the internal databases, and finally the mesoderm manages and processes all the semantic combinatorial analysis of man-machine interrelations. Over time Darwin learns as much as possible about owners’ and users’ main behavior patterns. These e-membranes are invisible to users, but their architecture is oriented to provide “massive and global” pattern learning, in the sense that they are not built to “spy” on people as individuals but to provide “win-win” collective scenarios instead: users learn as much as possible from established knowledge and, at the same time, the “established side” learns as much as possible about the efficiency and reach of its global cognitive offer.

How do we detect potential keywords?

Let’s suppose that we have a long TXT string of words, most of them used as common words or expressions, with some chains of them being potential keywords. We may program a sort of parsing algorithm that, starting at any word, proceeds to account for it as déjà vu “monads”, “dyads”, “triads” and so on and so forth, up to large enough “n-ads”. In a way similar to how humans proceed to differentiate keywords, this “brute force” parsing algorithm proceeds to generate probabilistically potential keyword chains for a given subject. Along this distilling process we separate Common Words and Expressions from potential keywords and from ambiguous words and expressions, namely into three baskets. Each basket will have something like a “style sheet” defining its literary use and a concordance study.
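A minimal sketch of this brute-force "n-ads" parsing follows. The common-word list and the basket rule (all-common, mixed, or no-common words) are toy assumptions standing in for the real style sheets and concordance studies:

```python
from collections import Counter

def n_ads(words, n_max=3):
    # Count "monads", "dyads", "triads" ... up to n_max-ads over the text.
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

COMMON = {"the", "of", "a", "is", "in"}  # toy list of common words

def basketize(counts):
    # Sort each chain into one of the three baskets described above.
    baskets = {"common": [], "potential_keyword": [], "ambiguous": []}
    for chain in counts:
        if all(w in COMMON for w in chain):
            baskets["common"].append(chain)
        elif any(w in COMMON for w in chain):
            baskets["ambiguous"].append(chain)
        else:
            baskets["potential_keyword"].append(chain)
    return baskets

text = "the myocardial infarction is a myocardial infarction".split()
baskets = basketize(n_ads(text))
print(("myocardial", "infarction") in baskets["potential_keyword"])
```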

In a second run, the sets of keywords associated with their corresponding subjects are matched within their semantic neighborhood, considering subjects organized by levels along a topological tree. Once this run is finished we have unveiled the potential keywords, and for each set we may compute a quaternion of the following type: [83, 5, 6, 6], meaning that 83% of the keywords belong specifically to their “bound subject”, 5% are shared with documents of the next upper level, 6% with documents of the next lower level and 6% with collateral subjects. We may then argue that those 83% of the keywords (provided that the values satisfy a predefined masking condition) strongly belong, as a semantic core, to their bound subject, being in fact something like its semantic skeleton, which, adequately weighted, will define in a subsequent process the subject’s semantic fingerprint.
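The quaternion itself is just a percentage breakdown over four categories. A minimal sketch, assuming each keyword has already been labeled by the neighborhood-matching run:

```python
def quaternion(labels):
    # [specific, upper, lower, collateral] as rounded percentages of the set.
    cats = ["specific", "upper", "lower", "collateral"]
    n = len(labels)
    return [round(100 * labels.count(c) / n) for c in cats]

# Toy example reproducing the [83, 5, 6, 6] quaternion from the text:
# 100 keywords, 83 bound to the subject, 5 shared upward, 6 downward,
# 6 with collateral subjects.
labels = (["specific"] * 83 + ["upper"] * 5
          + ["lower"] * 6 + ["collateral"] * 6)
print(quaternion(labels))
```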

How to detect keywords from documents
We have seen how to detect potential keywords from a representative sample of subjects. It is presupposed that we know all the subjects belonging to a given discipline and their hierarchy within its curriculum. As a byproduct of this process we may build the “Jargon” each discipline uses, as it is in the Web, that is, the statistical distribution of its common words and expressions. Now, if we know the jargon, we may filter all the words and expressions in each document, leaving keywords unveiled and spurious terms ready for deletion. These sequences of keywords are like a sort of “Tarzan” text that in many cases is meaningful by itself, above all for experts. We may also compress these skeletons as sequences of pairs {[k, w]}: keywords and their respective weights within documents. These vectors, complemented with their corresponding quaternions and their up-down-collateral keywords, define semantic skeletons and fingerprints.
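A sketch of this jargon filtering, using a toy jargon set and relative frequency as the keyword weight (the real weighting scheme is not specified in the text):

```python
from collections import Counter

# Toy "Jargon": the discipline's common words, assumed already extracted.
JARGON = {"the", "of", "a", "in", "is", "and", "to"}

def tarzan(document):
    # Filter jargon out, leaving the "Tarzan" keyword sequence.
    return [w for w in document.lower().split() if w not in JARGON]

def fingerprint_pairs(keywords):
    # Compress the skeleton into {[k, w]} pairs; here w is the keyword's
    # relative frequency within the document (an illustrative choice).
    counts = Counter(keywords)
    total = sum(counts.values())
    return [(k, round(c / total, 2)) for k, c in counts.most_common()]

doc = "the infarction of the myocardium is a myocardial infarction"
kws = tarzan(doc)
print(kws)
print(fingerprint_pairs(kws))
```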

Next Generation Search Engines will index documents by fingerprints instead of by words. “Similar” documents will now be better defined as those that match their fingerprints “as much as possible”. If the similarity match is near zero or “poor”, a special degrading Markovian-type algorithm searching for similars will be activated, eliminating the “least important” keyword as per a predetermined criterion. If the match continues to be poor, the Markovian algorithm will be activated again, and so on and so forth, until keyword exhaustion or until a significant match condition is reached, whichever occurs first.
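A sketch of this degrading match under assumed definitions: fingerprints are keyword-to-weight maps, the overlap measure and threshold are illustrative, and "least important" is taken deterministically as the lowest-weighted keyword:

```python
def overlap(fp_a, fp_b):
    # Shared weight between two fingerprints (illustrative similarity).
    keys = set(fp_a) & set(fp_b)
    return sum(min(fp_a[k], fp_b[k]) for k in keys)

def degrading_match(query, candidate, threshold=0.9):
    # Degrading loop: if the match is poor, drop the least-weighted keyword
    # and retry, until a good match or keyword exhaustion, whichever first.
    fp = dict(query)
    while fp:
        score = overlap(fp, candidate) / sum(fp.values())
        if score >= threshold:
            return True, score
        least = min(fp, key=fp.get)  # "least important" per weight criterion
        del fp[least]
    return False, 0.0

query = {"infarction": 0.5, "myocardium": 0.3, "aspirin": 0.2}
candidate = {"infarction": 0.6, "myocardium": 0.4}
ok, score = degrading_match(query, candidate)
print(ok)
```

Here the first pass scores below the threshold, so "aspirin" (the lightest keyword) is dropped and the second pass succeeds.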

Note: We leave for the next posts how Darwin Technology faces the mapping of disciplines without a consensual curriculum.