6. How users search and how they “discover” their own keywords

By Juan Chamero

The figure attempts to depict the search process as it is today. Grey dots are failed trials, while black ones are successes. If search efficiency were measured by area, we might say that it is highly inefficient. We have tested similar graphs, in only one dimension, with university students in two disciplines, Computing and Art. They were challenged to find the answer to a question, chosen at random, related to these two disciplines. On the testing side we knew exactly the right answers, namely the specific subject nodes to which the questions belonged. For each trial we measured its “semantic distance” from the right target. As we will see here and in the next posts, third generation search engines will enable users to find what they need in only one guided query.

Users query Conventional Search Engines by words. That’s very important: current search engines do not offer searching by “keywords” because they ignore them. From time to time they offer “tracking” services that list the most frequent keywords, whether single words or combinations of words. In contrast, the Darwin strategy is focused on unveiling from the Web all possible concepts and assigning them to specific subjects. In the future, Web pages and even Web sites will be indexed by subject and by the concepts that define their semantic profile, as a sort of semantic fingerprint. Defined this way, every document will have its individual fingerprint, as will subjects and disciplines.

We may then imagine the Web as a semantic Ocean inhabited, semantically, by more than 10,000,000,000 creatures (documents) dealing with subjects that in turn are expressed by concepts. Our estimate is that there currently exist about 10,000,000 concepts hierarchically ordered through disciplines. Ideally each document tends to deal with one and only one subject, which is semantically and probabilistically defined by its corresponding fingerprint. Document fingerprints tend to match strongly with one and only one subject, and in turn subjects tend to match most strongly with one and only one ancestor subject, and so on until arriving at the “root” of a discipline. So far we have tested these conjectures in two disciplines: Computing and Art. For Computing our Darwin agents have retrieved 1,200 subjects and 54,000 concepts, while for Art they have retrieved 7,570 subjects and 300,000+ concepts.
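
To make the idea of fingerprint matching concrete, here is a minimal sketch. It assumes that a fingerprint can be represented as a mapping from concepts to weights and that the match is measured by cosine similarity; both are illustrative assumptions, not the Darwin procedure itself, and the names `cosine`, `best_subject` and the sample fingerprints are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight fingerprints."""
    dot = sum(w * b.get(concept, 0.0) for concept, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_subject(doc_fp, subject_fps):
    """Return the subject whose fingerprint matches the document most strongly."""
    return max(subject_fps, key=lambda s: cosine(doc_fp, subject_fps[s]))

# Made-up fingerprints (concept -> weight), for illustration only.
subject_fps = {
    "cryptanalysis": {"a5/1 attacks": 0.7, "stream cipher": 0.3},
    "neonatology": {"neonatal jaundice": 0.6, "preterm infant": 0.4},
}
doc_fp = {"stream cipher": 0.5, "a5/1 attacks": 0.5}

print(best_subject(doc_fp, subject_fps))
```

The same comparison could be repeated one level up, matching a subject fingerprint against its candidate ancestor subjects, until the root of the discipline is reached.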

Humans and agents may unveil keywords: When people read editorials and articles in newspapers, or study from books, they easily detect words and sequences of words that by their own criteria are valuable concepts. Of course not everyone who reads a given document marks or highlights the same words or sequences of words, because a) the main topic of the document is a subjective matter and b) the concepts highlighted depend strongly on each reader’s knowledge. Agents could do the same as long as we humans can give them a mechanical procedure to discriminate between words and concepts, provided this duality exists. As we will see, one of the strong conjectures of the Darwin Ontology is that WWDs, “Well Written Documents”, can be split into two complementary parts: a literary part of Common Words and Expressions and a conceptual part of Concepts. Authoritative documents within each discipline tend to be WW, Well Written, and because of this characteristic concepts are easily “distilled” from the document tanks by a sort of industrial process.

Let’s suppose that we have to unveil the set of keywords (remember that keywords are the “tails” of semantic paths) belonging to neonatology, a subdiscipline of medicine. Out of the basic semantic pair [neonatology, medicine], which has 750,000 references in Google, we select 10,000 authoritative documents (in another industrial process a special family of agents discriminates authoritative documents from non-authoritative ones). This authoritative set constitutes what we call the documents tank, in this case corresponding to the subject [neonatology, medicine]. If our agent were a medical doctor, it would surely distinguish perfectly well the common words and expressions used in the literary part of these documents.

Note: Fortunately the glossary of this literary part is almost invariant across the different medical subjects, and because of this feature it can be precisely unveiled and this knowledge transferred to agents.
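
As an illustration of how such a glossary could be transferred to an agent, the sketch below strips the literary part from a sentence and keeps the remaining runs of words as candidate concepts. The tiny `COMMON_WORDS` set, the tokenizing rule and the function name `candidate_concepts` are assumptions made for the example; a real glossary would be far larger.

```python
import re

# Hypothetical fragment of the "literary" glossary shared by medical subjects.
COMMON_WORDS = {"the", "a", "of", "in", "is", "and", "with", "was",
                "infant", "treated"}

def candidate_concepts(text, common_words):
    """Return maximal runs of words that are not in the common-word glossary."""
    tokens = re.findall(r"[a-z0-9/-]+", text.lower())
    candidates, run = [], []
    for token in tokens:
        if token in common_words:
            if run:  # a common word closes the current run
                candidates.append(" ".join(run))
                run = []
        else:
            run.append(token)
    if run:
        candidates.append(" ".join(run))
    return candidates

text = "The infant with neonatal jaundice was treated with phototherapy."
print(candidate_concepts(text, COMMON_WORDS))
```

The candidates produced this way are only suspects; whether they are true concepts still has to be decided over the whole tank.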

Now let’s go back to humans. If you now assign medical doctors, not specialized in neonatology, to the task of reviewing the tank content, what do you think: will they be capable of unveiling neonatology keywords? Of course the answer is YES. There are many ways. Let’s envisage a progressive learning approach: select the first document, mark what you suspect are keywords, and keep them in a database. Now go to the second and do the same. Have you perhaps noticed that some keywords were wrong? Put a special mark on them and proceed. Finally you arrive at the end of this boring and exhausting task. You have completed a first run. Humans have learnt a lot, haven’t they? Of course a second run will be enough to have the concepts unveiled.

Humans versus agents: humans learn far faster and better than agents. For example, humans may easily detect potential word chains as keywords because they know much more than agents about meaning. In a sequence of words such as [w1 w2 w3 …………. w3209], in a document of 3209 words, a human may be aware that the chain w757 w758 w759 is a keyword because, even though he/she is not a neonatology expert, he/she is smart enough to realize that this chain has a specific meaning and that it fits perfectly with the whole text. This subtlety is extremely difficult to transform into a procedure that can be transferred to agents. However, as agents are far faster and more precise than humans, we may instead implement a sort of “brute force” algorithm that generates all possible n-word chains within a text and draws conclusions once a first run is finished.
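
A minimal sketch of such a brute force run follows. It generates every contiguous chain of up to `max_n` words and counts in how many documents of the tank each chain appears. The limit of three words, the whitespace tokenizing and the three-line sample tank are assumptions chosen to keep the example small.

```python
from collections import Counter

def word_chains(words, max_n=3):
    """Yield every contiguous chain of 1..max_n words as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def chain_document_frequency(documents, max_n=3):
    """Count in how many documents each word chain occurs."""
    counts = Counter()
    for doc in documents:
        # set() so that a chain is counted once per document
        counts.update(set(word_chains(doc.lower().split(), max_n)))
    return counts

# Made-up miniature "documents tank".
tank = [
    "neonatal jaundice is common",
    "treatment of neonatal jaundice",
    "jaundice in adults",
]
counts = chain_document_frequency(tank)
print(counts[("neonatal", "jaundice")], counts[("jaundice",)])
```

Chains that recur across many documents of the tank, and that are not part of the common-word glossary, are the natural suspects to be reviewed once the first run is finished.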

Semantic Glasses: As a byproduct of mapping the Web we may unveil all concepts, and accordingly we may then query by them even though documents are not yet indexed either by subject or by concepts. For that reason we say that we may “see” the inherent Web order via a given conventional Search Engine “as it is”, provided we have a sort of “Semantic Glasses” plugged in either in our computer or in the chosen search engine. From here two marketing strategies open up for the near future:

a) Conventional Search Engines becoming SSSE, Semantic Super Search Engines, empowered with these glasses; or
b) Personal Computers empowered with them.

Important: If we were challenged to build our own search engine, we would instead do it very efficiently, indexing documents directly by keywords from the very beginning.

More examples of searching ambiguity and noise: If a user queries by the two-word keyword “A5-1 analysis”, conventional Search Engines understand that he/she may be looking for

•    The famous “area 51” in Southern Nevada;
•    A business series room at Yale;
•    The A5/1 algorithm to encrypt mobile transmissions;
•    UFO-related places;
•    A reference of the series about “easy cracks”;
•    An assembler kit for a microprocessor;
•    A skateboard brand;
•    Auto routes;
•    Map regions;
•    The Lenovo ThinkCentre A51;
•    A mobile phone manufactured by Siemens;
•    Disk Jockeys, Websites;
•    ………………………….
•    And many other possible references.

Next user awareness step: from keywords to concepts. Most search engines consider A51, A.51, A-51, A5.1 and A5/1 synonyms. Let’s suppose that our user was really looking for the A5/1 algorithm because he/she was interested in cryptanalysis. He/she will soon learn that the right query to hit the core of the topic is “A5/1 attacks”. In this case we say that the user has discovered a keyword related to the specific subject

Cryptanalysis => Cryptanalysis in GSM => GSM Security and Encryption

If our user had queried by the pair [A5/1 attacks, GSM Security and Encryption], the Search Engine would, without any ambiguity, have rendered what he/she needed, probably in only one click. Do you agree? OK, but surely you are going to argue: how could he/she know in advance what the appropriate bait is to catch the fish, and how could he/she know, besides, that this fish “lives” in the “GSM Security and Encryption” waters? The answer is that you need a different type of search engine, a Third Generation Search Engine: the SSSE, Super Semantic Search Engines. As we will see, these engines have a Web Thesaurus where all documents are properly indexed, in up to thirteen levels of knowledge. A smart Wizard will guide users to discover the appropriate pairs [keyword, subject].
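
The contrast between word queries and pair queries can be sketched in a few lines. The first part assumes, only for illustration, that a conventional engine treats the variants as synonyms by dropping punctuation; how real engines normalize terms is not specified here. The second part assumes a hypothetical thesaurus keyed by [keyword, subject] pairs; the entries and descriptions are invented.

```python
import re

def normalize(term):
    """Assumed word-level normalization: lowercase, drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

variants = ["A51", "A.51", "A-51", "A5.1", "A5/1"]
collapsed = {normalize(v) for v in variants}
print(collapsed)  # every variant falls onto the same key: ambiguity

# Hypothetical Web Thesaurus fragment: (keyword, subject) -> target description.
THESAURUS = {
    ("a5/1 attacks", "GSM Security and Encryption"): "attacks on the A5/1 cipher",
    ("area 51", "Southern Nevada"): "the site in Southern Nevada",
}

def lookup(keyword, subject):
    """Resolve a [keyword, subject] pair; None when the pair is unknown."""
    return THESAURUS.get((keyword.lower(), subject))

print(lookup("A5/1 attacks", "GSM Security and Encryption"))
```

With the pair as the key there is nothing left to disambiguate; the work moves to the Wizard, which has to help the user arrive at the right pair.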

How to extract user behavior patterns from apparent chaos: In the figure below we depict three search sessions, represented in blue, where each “square wave” is a query within a session. Red, green and yellow dots are failures, successes, and not well defined outcomes respectively. If we only have access to their session logins, we ignore what happens on the other side of the virtual membrane that separates us from them; we do not have any type of information feedback. Of course we may devise a sort of double blind experiment with volunteers searching and reporting from the other side, but this experience is not equivalent to spontaneous and free real searches. However, let’s see what we can do.

We may register the content of all queries as they were issued. In these strings, provided we have a Web Thesaurus that maps all concepts as pairs [k, s], we may detect the appearance of concepts over time. We may then focus our attention (agents do this) on how users refine their searches, presumably going deeper down the corresponding thematic tree. Without identifying users as individuals, we may still study the users’ learning process by region, language, time and subject. If subjects tend to form clusters, we may also infer what users are looking for. And even more than that: each time we detect that a user is interested in clustered subjects, we may, without invading his/her privacy, invite him/her to join the corresponding cluster group. Not bad!
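
A rough sketch of this kind of anonymous bookkeeping is given below. It assumes a query log made of (region, query string) records with no user identity, and a hypothetical thesaurus fragment mapping each keyword k to its subject s; the matching rule (plain substring search) is a simplification chosen for brevity.

```python
from collections import Counter

# Hypothetical Web Thesaurus fragment: keyword k -> subject s.
PAIRS = {
    "a5/1 attacks": "GSM Security and Encryption",
    "neonatal jaundice": "Neonatology",
}

def concepts_in(query, pairs):
    """Return the [k, s] pairs whose keyword appears in the query string."""
    q = query.lower()
    return [(k, s) for k, s in pairs.items() if k in q]

def subject_counts_by_region(log, pairs):
    """Count concept appearances per (region, subject), without user identity."""
    counts = Counter()
    for region, query in log:
        for _keyword, subject in concepts_in(query, pairs):
            counts[(region, subject)] += 1
    return counts

# Made-up anonymous log: (region, query) in the order the queries were issued.
log = [
    ("AR", "a5/1"),
    ("AR", "a5/1 attacks"),
    ("US", "neonatal jaundice treatment"),
    ("AR", "A5/1 attacks on GSM"),
]
counts = subject_counts_by_region(log, PAIRS)
print(counts.most_common())
```

The first query of the log contains no known concept and is simply not counted; only once a keyword of the thesaurus appears does the query contribute to a subject.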

Note: This phenomenon has been extensively studied by the American R&D firm Intelligent Agents Internet Corp and by CAECE University of Argentina in a joint research effort. The conclusions were that when a concept is “discovered” by a user, sessions tend to finish (or change to another session) after the user has navigated on his/her own through the concept neighborhood. At the beginning of sessions there are long chains of words and combinations of words, resembling users wandering through the Web space looking for the right subject, until the suspected target and/or potential keyword is discovered. Users’ tracks have the form

[w1, w1+w2, w3, w1+w3, ….., [w6+w8], (w6+w81), (w6+w82), (w6+w82+w9), (w6+w82+w10), (w6+w82+w11)]

Of course this sequence by itself means nothing scientifically. We may only argue that “perhaps” the user found a suspected keyword [w6 + w8] by chance, and we may continue arguing that he/she apparently insisted on deepening this concept by querying with the similar (w6+w81) and (w6+w82), and then tuned his/her search a little by navigating through the (w6+w82) neighborhood. This is really too much imagination. However, if this pattern of reasoning appeared with a significant statistical presence, things would be different. Big numbers come to help us.
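
One hedged way to let big numbers speak is sketched below. A track is taken as a list of queries, each query a tuple of words, and a step counts as a “refinement” when it shares at least one word with the previous query. Both this rule and the threshold `min_run` are assumptions made for the example, not the criteria used in the study mentioned above. The sample track copies the one shown, leaving out its elided middle part.

```python
def trailing_refinements(track):
    """Count the final consecutive steps that share a word with the prior query."""
    run = 0
    # Walk the consecutive (previous, current) pairs from the end of the track.
    for prev, cur in zip(reversed(track[:-1]), reversed(track[1:])):
        if set(prev) & set(cur):
            run += 1
        else:
            break
    return run

def share_with_pattern(tracks, min_run=3):
    """Fraction of tracks that end with at least min_run refinement steps."""
    if not tracks:
        return 0.0
    hits = sum(1 for t in tracks if trailing_refinements(t) >= min_run)
    return hits / len(tracks)

track = [
    ("w1",), ("w1", "w2"), ("w3",), ("w1", "w3"),
    ("w6", "w8"), ("w6", "w81"), ("w6", "w82"),
    ("w6", "w82", "w9"), ("w6", "w82", "w10"), ("w6", "w82", "w11"),
]
wandering = [("w1",), ("w2",), ("w3",)]

print(trailing_refinements(track))
print(share_with_pattern([track, wandering]))
```

A single track proves nothing, as stated above; the fraction computed over a large number of anonymous tracks is what could be tested for statistical significance.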