3. What’s in a keyword?



By Juan Chamero

A keyword is something like a magic word that opens information boxes. The figure above depicts a Wizard of Oz icon intended to represent this concept. In order to obtain information it would be ideal to have a sort of Oracle box enabling a man-machine dialog in which users are guided to retrieve what they need. Conventional search engines intend to satisfy these needs without dialog: users issue queries and search engines provide lists of references that point to documents that presumably "deal with" topics related to them. We say presumably because search engines can only assure that the documents pointed to contain, somewhere in their text strings, one or all of the words queried.

Paradoxically, it is important to keep in mind that neither users nor search engines know precisely the "right" subject name of the piece of knowledge to be retrieved, the subject to which the query presumably belongs semantically. Users have an idea of it, though often a fuzzy one, and they frequently adjust their queries as they learn from the search engine's outcomes. After long journeys through the Web ocean, users come to know the "best names", the ones that point more accurately to the topics they want.

Search engines, on the contrary, are static partners as far as dialog is concerned: they do not guide users through their navigation learning, and users' retrieval efficiency depends only on their knowledge, navigation talent and luck. Most conventional search engines ignore the topics actually dealt with in the documents they index. Web pages are hosted in cyberspace by authors and titled at will, many times to mislead readers, and many times, unwillingly, with content that has nothing to do with their titles or major subtitles either. Dialog is an intelligent interchange of ideas (Plato's dialogues) or an effective means of ongoing communication (Martin Buber). What most conventional search engines provide is a question-answer (Q-A) communication channel between users and a database, through queries.

What we understand by "keyword" could be a chain of words, generally from one to no more than five. When someone looks for something by querying a database via keywords, he or she has to know in advance how the search engine will interpret the chain. Let's suppose the keyword is "parallel process" (a chain of two common words), which could be interpreted in several logical forms, namely:

1. [parallel AND process], which should be equivalent to [process AND parallel];
2. [parallel OR process], which should be equivalent to [process OR parallel];
3. [parallel*process], which is not equivalent to [process*parallel].

We say that the first two expressions should be equivalent because the AND/OR logical operations are commutative. The first expression means that the person who queries is looking for documents in whose text string the word parallel exists somewhere AND the word process exists somewhere. The second expression is read the same way, changing the operator AND to OR.

Many search engines omit the AND operator, taking it for granted that any sequence of n words W1 W2 … Wn will be interpreted with AND replacing the blank or empty space between them. There are many other logical formats and interpretations of queries, but let's focus our attention on the third type of expression. When we make our query explicit as a rigid semantic chain W1*W2*…*Wn of n words, where the character (*) stands for the blank or empty space, we usually mean that we are looking for documents whose text strings contain at least one exact match of the chain. For example, when we look for "el hijo del hombre" (the son of man, in Spanish), perhaps looking for something related to biblical literature, it means that we are looking for documents that somewhere use this expression exactly as it is. This chain could also be written as el*hijo*del*hombre, a 4-word keyword.

Some search engines, like Google, will interpret this n-word chain as a single keyword if and only if it is enclosed between quotation marks: "el hijo del hombre". What happens if the query is a little different concerning the blank spaces between words? Try, for example, "**el****hijo del***hombre**" and you will see that the result is "practically" the same. We say "practically" because Google's outcomes depend on the time of the query, on the IP from which the query is issued, and only slightly, almost imperceptibly, on the blanks.
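As a rough illustration of these two behaviors, here is a minimal sketch of a query parser that treats bare words as implicit AND terms and quoted chains as rigid phrases with their internal blanks normalized; the grammar is an assumption made for illustration, not the actual parsing logic of any particular engine.

```python
import re

def parse_query(query):
    """Split a raw query into quoted chains (phrases) and bare AND terms.

    A minimal sketch of the behavior discussed above: bare words are
    joined by an implicit AND, while a quoted chain is kept as a rigid
    semantic chain W1*W2*...*Wn with runs of blanks collapsed.
    """
    phrases = []

    def grab_phrase(match):
        chain = re.sub(r"\s+", " ", match.group(1)).strip()
        if chain:
            phrases.append(chain)
        return " "

    remainder = re.sub(r'"([^"]*)"', grab_phrase, query)
    terms = remainder.split()          # whatever is left outside quotes
    return terms, phrases

print(parse_query('parallel process'))        # (['parallel', 'process'], [])
print(parse_query('"el  hijo del   hombre"')) # ([], ['el hijo del hombre'])
```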

We may imagine the search process as a sort of interception game of two where in one side we have humans trying to find something and on the other side is the stock of “all” existing things (semantic things); intercepting is matchmaking all existing things with keywords. To accomplish this purpose the “Oracle” side must be adequately indexed by words and eventually by all possible “semantic chains”. An elementary approach would be to index all existing Web documents only by single words. An alternative approach would be indexing also by chain of words. In an extreme any text could be seen as a very specific keyword!.
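The elementary approach amounts to an inverted index from single words to the documents that contain them; the following sketch shows the idea on a few invented documents.

```python
from collections import defaultdict

# A minimal sketch of the elementary approach: index documents by single
# words only. Document IDs and texts are invented for illustration.
docs = {
    "doc1": "parallel process scheduling on multicore machines",
    "doc2": "a process running in parallel with another process",
    "doc3": "series and parallel circuits",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Each word points to every document containing it somewhere in its text
# string, regardless of topic, word order or adjacency.
print(sorted(index["parallel"]))   # ['doc1', 'doc2', 'doc3']
print(sorted(index["process"]))    # ['doc1', 'doc2']
```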

Conventional search engines then have two basic options of service architecture: a) to index semantic things (objects, documents) by single words, plus an algorithm that searches for the n-word keyword within the set of documents pre-selected by the words that make up the multi-word keyword, or b) to index semantic things by words and by chains of words, even if limited to a given number of links. Let's go back to our first example of parallel processing:

Parallel points to 90,500,000 document references;
Process points to 604,000,000 document references;
Parallel process points to 9,910,000 document references
“Parallel process” as a semantic chain points to 134,000 document references.

Method a) involves an "ex-post" search for the semantic chain "parallel process" within 9,910,000 documents, while method b) involves indexing documents "ex-ante" by words and at least by every potential two-word keyword. We face here a classic tradeoff between memory and processing time.
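The following toy comparison, on invented documents, sketches the two options under these assumptions: option a) intersects the single-word posting lists and then verifies the chain ex-post by scanning the candidates, while option b) looks the chain up directly in an ex-ante index of two-word chains.

```python
from collections import defaultdict

# Invented documents; real collections are of course many orders of
# magnitude larger.
docs = {
    "d1": "parallel process control in real time",
    "d2": "a process that runs parallel tasks",
    "d3": "parallel process scheduling",
}
words = {d: t.split() for d, t in docs.items()}

# Option a): single-word index; the chain is verified ex-post over the
# documents pre-selected with AND.
word_index = defaultdict(set)
for d, ws in words.items():
    for w in ws:
        word_index[w].add(d)

candidates = word_index["parallel"] & word_index["process"]   # the AND space
chain_hits = {d for d in candidates
              if any(words[d][i:i + 2] == ["parallel", "process"]
                     for i in range(len(words[d]) - 1))}       # ex-post scan

# Option b): two-word chains are indexed ex-ante, trading memory for an
# immediate lookup at query time.
chain_index = defaultdict(set)
for d, ws in words.items():
    for w1, w2 in zip(ws, ws[1:]):
        chain_index[(w1, w2)].add(d)

print(sorted(chain_hits))                              # ['d1', 'd3']
print(sorted(chain_index[("parallel", "process")]))    # ['d1', 'd3']
```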

We do not know how actual search engines manage this problem, even though we can make some inferences (see below). As users we may consider search engines as black boxes that respond to our queries with document references (we have checked that some of them work well with keywords of up to 7 words). Let's see the following example, as per Google:

el hijo del hombre 743,000 Ref.
“el hijo del hombre” 47,400 Ref.
“el hijo del hombre muerto” 1,060 Ref.
“el hijo del hombre muerto en” 4 Ref.
“el hijo del hombre muerto en Canero” 2 Ref.

In this example the search engine may compute the chain retrieval fast because the chains are relatively long, from 4-word to 7-word keywords, matching from 47,400 documents down to 2, respectively. Let's see what happens with short chains of high "popularity":

World 2,500,000,000 Ref., 60 msec;
War 847,000,000 Ref., 170 msec;
“world war” 73,800,000 Ref., 140 msec;
“world war II” 39,700,000 Ref., 120 msec.

Note: we add the time of response.

User keywords

Users only issue words. These words are not keywords; at most they are "subjective keywords", the semantic bullets users fire to obtain what they are looking for. On the other side of the game, the search engine is only able to receive queries, chains of words and eventually logical operators, but it knows nothing about the results on the user's side, namely whether the sequence of queries was satisfactory or not, or whether the user found something valuable. The only data search engines have to infer how valuable certain queries are are statistical figures, probability estimations. From time to time, and sometimes on a regular basis, search engines deliver lists of Most Used Keywords, through which we as users may infer useful keywords. Let's suppose that parallel process -pp- is considered a keyword with a higher probability of occurrence than similar ones such as concurrent process, process in parallel or parallel processing. The problem is that those pretended keywords are valuable as such for several disciplines.
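Such a Most Used Keywords list is essentially a frequency count over the query log; a toy sketch, with an invented log, could look like this.

```python
from collections import Counter

# An invented query log; in practice these statistics are the only signal
# the engine has about how valuable a given chain of words is to users.
query_log = [
    "parallel process", "parallel processing", "parallel process",
    "concurrent process", "parallel process", "process in parallel",
]

most_used = Counter(query_log).most_common()
print(most_used[0])   # ('parallel process', 3) tops the list
```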

Conventional search engines' outcomes are global, without semantic discrimination. For instance, parallel process as a keyword-query brings us (in Google) 134,000 references, but they belong to more than a hundred Disciplines or Main Themes of interest like the ones depicted below. In fact, a "user keyword" is only a word or chain of words that presumably will point to the information we are looking for, or to its neighborhood, but the outcomes may belong to many disciplines. A better approximation to "concepts" (well-defined symbols, figures, ideograms, or chains of words/keywords bi-univocally associated with a piece of knowledge) would be pairs [subject, keyword], which in the example below would be

[programming, parallel process],
[computing, parallel process],
…,
[circuitry, parallel process].

As we will see, this would be a great improvement in making our searches more efficient, but it is only a better approximation, not the best one (a small sketch of the idea follows the list below). Why? Because within any discipline, for instance programming, the parallel process -pp- keyword could be present, with different meanings, in more than one sub-discipline.

pp programming 83,300
pp computing 178,000
pp networking 18,500
pp engineering 31,000
pp chemical 16,700
pp war 47,000
pp biology 26,700
pp ecology 17,800
pp economy 19,600
pp marketing 14,500
pp learning 32,000
pp process control 10,300
pp medicine 15,200
pp agriculture 10,600
pp games 44,500
pp sports 20,900
pp education 35,200
pp physiology 10,300
pp energy 83,400
pp fashion 21,300
pp terrorism 11,700
pp oil 11,700
pp art 103,000
pp strategy 38,500
pp management 51,800
pp artificial intelligence 15,500
pp automata 10,500
pp poetry 12,100
pp philosophy 14,600
pp circuitry 89,100
………..
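As a sketch of the pair idea, with invented document sets standing in for reference lists like the ones counted above, qualifying the keyword with a subject narrows the outcome to the intended discipline.

```python
# Invented [subject, keyword] pairs and document sets, for illustration
# only; a bare keyword query would return the union of all these sets.
pair_index = {
    ("programming", "parallel process"): {"doc_17", "doc_52", "doc_90"},
    ("computing", "parallel process"): {"doc_52", "doc_61"},
    ("circuitry", "parallel process"): {"doc_33"},
}

def lookup(subject, keyword):
    # The pair acts as a closer approximation to a concept than the bare
    # keyword alone.
    return pair_index.get((subject, keyword), set())

print(lookup("circuitry", "parallel process"))   # {'doc_33'}
```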

Some elementary tradeoff computations:

1. d: Documents: 10**10
2. #w: Different words in each document: 500 on average
3. p: Pairs (URL, word ID): 5x10**12
4. pS: Pair memory size: 50-60 Bytes
5. pMS: Pairs memory size: 2.5x10**14 Bytes = 250,000 GB!
6. ds: Document size: 2,000 words on average
7. #T: Word Tables: 10**7 ?
8. L: Languages: 100 ?
9. c: n-ads in a document: 2,000 on average for every n
10. m: Matchmakings to check keyword existence in a document: 1,000 on average
11. dAND: Words AND space: 10**5 documents (from 1,000 to 100,000,000)
12. E: Amount of elementary computations to determine all references that use the n-ad: 10**5 x 10**3 = 10**8

Where:
1. The number of documents actually registered by search engines;
2. Common word and expression sets for each language have from 2,000 to 3,500 terms, depending on the language and strongly on the literary style; however, Web documents use on average far fewer;
3. Any exhaustive procedure must index all the words used in each document;
4. The field length assigned to both variables, URL and word code;
5. The memory needed to store all pairs;
6. Average document size. It refers to the text content once the editing commands are stripped off. We include here our own estimation;
7. Word tables. It is not easy to estimate this value: apart from common words and expressions we have "names" of people, geographical places and juridical persons, abbreviations, acronyms, and 1-word keywords;
8. Our own estimation;
9. and 10. Each word within a text string could be considered the head of an n-word semantic chain. The matchmaking algorithm could be programmed to stop once the first positive match is obtained;
11. dAND means the space of the logical intersection of the query words, for example world AND war AND II, equal to 39,700,000 references! The average is hard to estimate; for our tradeoff exercise we estimate 100,000;
12. The number of elementary computations (a loop with a compare along the text string). If a search engine intends to issue its outcome in 50 milliseconds or less (Google, for instance), it needs a processing power of 2,000 MIPS, where the I, which stands for Instruction, represents one iteration of the loop.
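A back-of-the-envelope check of these figures, using the same rough estimates as above (order-of-magnitude values, not measurements):

```python
# Reproduces the tradeoff arithmetic from the list above.
d = 10**10          # documents registered
w_per_doc = 500     # different words per document, on average
pair_size = 50      # bytes per (URL, word ID) pair (lower bound of 50-60)

pairs = d * w_per_doc                    # 5 x 10**12 pairs
pair_memory_gb = pairs * pair_size / 10**9
print(f"{pairs:.1e} pairs, {pair_memory_gb:,.0f} GB")    # 5.0e+12 pairs, 250,000 GB

d_and = 10**5       # documents in the AND space of the query words
m = 10**3           # matchmaking comparisons per document, on average
e = d_and * m       # elementary computations per query: 10**8

latency_s = 0.05    # target response time, e.g. 50 ms
mips = e / latency_s / 10**6
print(f"{e:.0e} computations -> {mips:,.0f} MIPS")       # 1e+08 computations -> 2,000 MIPS
```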

Note: a word is a sequence or chain of characters belonging to a given code (for instance, the ASCII code), excluding the one that acts as the "word separator".
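In code, this definition reduces to splitting on the separator; a short sketch, assuming the blank is the only separator character:

```python
# The note's definition in code form: a "word" is any maximal run of
# characters other than the separator (the blank is assumed here).
def words(text, separator=" "):
    return [chunk for chunk in text.split(separator) if chunk]

print(words("el  hijo del hombre"))   # ['el', 'hijo', 'del', 'hombre']
```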