14. Keyword detection

By Juan Chamero

Semantic space

    The Web semantic space is vast. Today about 10,000,000,000 documents are indexed, against an estimated total population of 500,000 subjects and 10,000,000 concepts per language: a combined space of 5x10**22. Prima facie it looks enormous. However, this space is extremely rarefied: it could be synthesized into 500,000 cubicles, or it may be imagined as a huge search fork of 10,000,000 concepts pointing into an ocean of 10,000,000,000 documents. Once the right concept is chosen, next-generation search engines will point to and bring us the “best” authoritative documents directly, meaning in only one click! The problem is how to find the right keyword. A smart Darwin Wizard helps users easily find, in a matter of seconds, the keyword that best fits their expectations. On average a user may find 1,000 references (Web pages) per concept; among the Top Ten he/she will find the best ones, probably authorities. And if he/she now wants to go deeper within this specific sample, no problem: the Darwin Wizard will guide him/her through a sort of convergent Markovian process.

Potential keywords detection
    As we have seen, a keyword k is a word or chain of words that has a precise meaning within a certain subject context to which it is correlated. If this subject belongs to a given discipline tree, we may define a more precise entity named “concept” as the “semantic chain” that goes from the root of that discipline to the subject, with k added as its terminal link. If the subject under discussion were at the fifth level of the discipline tree, the concept that uses k within this context would have the following form:

[s0, s1.3, s2.2, s3.17, s4.1, s5.6, k]

    Here, arbitrarily, s1.3 is the third subject of level 1, with the root s0 as its ancestor; s2.2 is the second subject of level 2, with s1.3 as its ancestor; and so on up to our subject of discussion, s5.6. Within this subject, k is used as a specific keyword. A more rustic and ambiguous identification of this concept would be [s0, k], with only the head of the semantic chain that goes from the root to the subject, and the specific keyword k as its tail. Let’s now discuss some hints for detecting these keywords within a collection of authoritative documents belonging to a given subject s (in the example above, s5.6).
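The two forms of a concept described above can be sketched as follows. This is a minimal illustration; the function names and the subject labels are ours, chosen to match the example path:

```python
def make_concept(path, k):
    """Full concept: the root-to-subject semantic chain plus the keyword
    k as its terminal link."""
    return path + [k]

def rustic_concept(path, k):
    """Ambiguous short form: only the head (root) of the chain and the
    keyword k as its tail."""
    return [path[0], k]

path = ["s0", "s1.3", "s2.2", "s3.17", "s4.1", "s5.6"]
print(make_concept(path, "k"))    # full chain ending in 'k'
print(rustic_concept(path, "k"))  # ['s0', 'k']
```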

    Once stripped of all programming and/or editing commands, images, and table structures, documents could ideally be considered text chains of characters, one of which is recognized as a word separator; in turn, this chain could be treated as a one-dimensional array. Let’s suppose that each document has 2,000 words on average and that for this subject we have a collection of 20,000 authoritative documents. The resulting word vector will then have about 40 million words. We may also know that the language is English and that the authorities have written their documents adhering to a jargon of Common Words and Expressions with a well-known frequency-of-use distribution, generally with no more than 3,000 entries.
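The stripping-and-flattening step might look like the sketch below. The markup-removal regex is a simplification we assume for illustration; a real pipeline would strip scripts, styles, and tables more carefully:

```python
import re

def to_word_vector(documents):
    """Strip tag-like markup and punctuation, then flatten a document
    collection into a single one-dimensional array of lowercase words."""
    words = []
    for doc in documents:
        cleaned = re.sub(r"<[^>]+>", " ", doc)            # drop markup tags
        cleaned = re.sub(r"[^A-Za-z0-9'\- ]", " ", cleaned)  # keep word chars
        words.extend(cleaned.lower().split())
    return words

docs = ["<p>From each word we keep</p>", "five potential keywords"]
print(to_word_vector(docs))
```

With 20,000 documents of 2,000 words each, the returned list would hold the roughly 40-million-word vector described above.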

    We are not going to discuss here the procedure for discriminating keywords from Common Words and Expressions (it will be covered in future posts) but only some critical hints to take into account. For now, the strong conjecture assumed is that a keyword tends to appear in one and only one subject within the subjects’ semantic space. If it eventually appears in others, it will have a different meaning in each of them. If we find the same or similar meanings, it means that the corresponding subjects cluster in well-defined neighborhoods or topologically connected sub-regions of a discipline tree: probably the neighborhood defined by up, down, and collateral nodes.
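The one-keyword-one-subject conjecture can be checked empirically once keyword occurrences have been attributed to subjects. A small sketch, with illustrative keyword/subject data of our own:

```python
from collections import defaultdict

def subject_exclusivity(occurrences):
    """Given (keyword, subject) occurrence pairs, report the subjects each
    keyword appears in; under the conjecture, most keywords map to exactly
    one subject, and multi-subject keywords deserve a closer look."""
    by_keyword = defaultdict(set)
    for keyword, subject in occurrences:
        by_keyword[keyword].add(subject)
    return {k: sorted(s) for k, s in by_keyword.items()}

occ = [("gene splicing", "s5.6"), ("gene splicing", "s5.6"),
       ("pond", "s2.2"), ("pond", "s4.1")]
print(subject_exclusivity(occ))
```

A keyword listed under several subjects either carries a different meaning in each, or signals that those subjects cluster in a connected sub-region of the tree.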

Reflections about potential keywords detection
    Let w(i) be the word at position i within the mentioned array. From it onwards we may inspect its 5-onward vicinity: [w(i) w(i+1) w(i+2) w(i+3) w(i+4)]. From this we may save five potential keywords, namely

(w(i)) : 1-k 
(w(i), w(i+1)) : 2-k
(w(i), w(i+1), w(i+2)) : 3-k
(w(i), w(i+1), w(i+2), w(i+3)) : 4-k
(w(i), w(i+1), w(i+2), w(i+3), w(i+4)) : 5-k
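Generating the 1-k through 5-k candidates at each position is a straightforward sliding window; a minimal sketch:

```python
def onward_vicinity(words, i, max_n=5):
    """From position i, return the candidate chains 1-k .. 5-k:
    [w(i)], [w(i), w(i+1)], ... up to max_n words, truncated near
    the end of the array."""
    return [tuple(words[i:i + n]) for n in range(1, max_n + 1)
            if i + n <= len(words)]

words = ["the", "albert", "einstein", "procedure", "z-31", "was"]
for cand in onward_vicinity(words, 0):
    print(cand)
```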

And we may also consider all 32 possible 5-onward vicinity outcomes at each i, coded as the 5-bit sequences 00000 to 11111, where 0 stands for a word existing in the jargon and 1 for one that does not. From each word we keep 5 potential keywords, from 1-k to 5-k, with up to 32 semantically different structures from 00000 to 11111. However, not all of these structures apply:

00000  00001  00010  00011  00100  00101  00110  00111
01000  01001  01010  01011  01100  01101  01110  01111
10000  10001  10010  10011  10100  10101  10110  10111
11000  11001  11010  11011  11100  11101  11110  11111
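Computing the 5-bit code for a window is a direct membership test against the jargon. A sketch, with a toy jargon set of our own choosing:

```python
def vicinity_code(words, i, jargon, n=5):
    """5-bit code for the window starting at i: '0' if the word belongs
    to the common-jargon set, '1' if it does not (as in the table of
    00000..11111 structures)."""
    window = words[i:i + n]
    return "".join("0" if w.lower() in jargon else "1" for w in window)

jargon = {"the", "from", "each", "word", "we", "keep", "procedure"}
words = ["The", "Albert", "Einstein", "procedure", "Z-31"]
print(vicinity_code(words, 0, jargon))  # -> 01101
```

As a byproduct, tallying these codes over the whole 40-million-word vector yields the structure-probability estimates mentioned below.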

Some examples
    00000 points to a highly improbable keyword, for instance: [from each word we keep].

Note: As a byproduct of the computation we will obtain estimates of each structure’s probability of appearance. Prima facie we may bet that 11111 has a very low probability of appearance, as does 00000.

    01101, as in [The Albert Einstein procedure Z-31], could yield candidates: [The Albert Einstein], [The Albert Einstein procedure], and [The Albert Einstein procedure Z-31]. Other variants, such as Albert Einstein without “the”, will appear when processing the 5-onward vicinity corresponding to (i+1). So this chain opens into five potential keywords as follows:

0 : 1-k : The
01 : 2-k : The Albert
011 : 3-k : The Albert Einstein
0110 : 4-k : The Albert Einstein procedure
01101 : 5-k : The Albert Einstein procedure Z-31

    01110, as in [New Zealand John Kennedy Airport], is another example where geographical places and names appear: potential keywords by their intrinsic specificity.

Special case of 1-k
    Even when the pointed word belongs to the class of Common Words & Expressions (CW&E’s), it could be considered a keyword as long as its frequency deviates significantly from its expected appearance (see figure below).

In the figure above we have depicted the jargon frequency distribution, with a zoomed-up spectrum section from words w1571 to w1578. We note that w1577 does not appear while w1575 stands out, warning us about its potential as a keyword despite being a single word, like for instance “pond”, which has up to 60 specific meanings (in 60 different semantic fields!).
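The deviation test for a single common word might be sketched like this. The deviation factor of 3 is an assumption of ours, not a value given in the text:

```python
def deviates(word, observed_count, total_words, expected_freq, factor=3.0):
    """Flag a common word as a potential 1-k keyword when its observed
    relative frequency in the subject sample departs from its expected
    jargon frequency by more than `factor` (threshold is illustrative)."""
    observed_freq = observed_count / total_words
    if expected_freq == 0:
        return observed_count > 0
    ratio = observed_freq / expected_freq
    return ratio > factor or ratio < 1.0 / factor

# "pond" expected at 0.001% of ordinary text but seen at 0.02%
# of a 40-million-word subject sample: it stands out.
print(deviates("pond", 8000, 40_000_000, 0.00001))  # -> True
```

A word that disappears entirely (like w1577 in the figure) or overshoots its expected band (like w1575) would both trip this test.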

    Nesting content: cases such as [the Albert], [the Albert Einstein], and [the Albert Einstein Z-31] are syntactically analyzed, rejecting the 2-k, 3-k, and 4-k and keeping only the 5-k as a potential keyword: “The Albert Einstein procedure Z-31”. However, the criteria strongly depend on the discipline we are dealing with. Remember that Darwin has to settle, train, and tune up agents. In Biblical literature, for example, some CW’s like earth, heaven, and man have different meanings depending on the context.
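A minimal sketch of this nesting rule, keeping only maximal chains and discarding candidates that are prefixes of a longer retained one:

```python
def prune_nested(candidates):
    """Keep only maximal candidate chains: drop any n-k that is a prefix
    of a longer retained candidate (e.g. keep only the 5-k
    'The Albert Einstein procedure Z-31')."""
    kept = []
    for c in sorted(candidates, key=len, reverse=True):  # longest first
        if not any(k[:len(c)] == c for k in kept):
            kept.append(c)
    return kept

cands = [("The", "Albert"),
         ("The", "Albert", "Einstein"),
         ("The", "Albert", "Einstein", "procedure", "Z-31")]
print(prune_nested(cands))  # only the 5-k survives
```

As the text notes, this is only a default; discipline-specific agents may override it.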

What’s in !CW&E’s (! stands for non-)
    We have seen that textual chains have only two types of semantic particles, namely CW&E’s and concepts. Apart from Expressions, CW’s are single words of a common jargon. Keywords are chains of CW’s and !CW’s such as names, places, events, acronyms, neologisms, special characters, crypto-chains, abbreviations, numbers, dates, measures, fantasy chains, misspelled words, and many others! Some examples are:

Names: John, Mosley
Places: Georgetown, Alabama
Events: WWII, Grammy Annual Contest
Acronyms: SITCOM, CAT
Neologisms: Systemologist, neurasia
Special characters: &, %
Fantasy chains: Bafapufa433, godisgod
Abbreviations: Kgm, cm
Numbers: 34557, 31416
Dates: 01-05-1900, 20th October 1945
Measures: 2” 3/8, 03:52:41
Interjections: Uffffffffffffff, grrrr
Misspelled words: Concertinon, chritique

    Initially we may have atlases, dictionaries, and glossaries that will enrich these semantic classes over time. However, one thing is true: if a potential keyword is highly frequent within the subject sample and is well written and spelled besides, it is probably a keyword. Most keywords will be formed by chains of CW’s.
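A rough classifier for some of the !CW classes listed above could start from simple patterns. The patterns below are illustrative only; as the text says, real agents would lean on atlases, dictionaries, and glossaries that grow over time:

```python
import re

# Illustrative patterns for a few !CW classes (by no means exhaustive).
PATTERNS = [
    ("number",  re.compile(r"^\d+$")),
    ("date",    re.compile(r"^\d{2}-\d{2}-\d{4}$")),
    ("acronym", re.compile(r"^[A-Z]{2,}$")),
    ("special", re.compile(r"^[&%$#@]$")),
]

def classify_token(token):
    """Rough class of a non-common-word token (fallback: 'other',
    covering names, neologisms, fantasy chains, misspellings, etc.)."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "other"

print([classify_token(t) for t in ["34557", "01-05-1900", "WWII", "&", "Mosley"]])
```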

How the neighborhood analysis is performed
    The neighborhood analysis presupposes knowledge of the keyword sets along all paths. We have to imagine how this process would proceed from “ground zero”.

    In the figure above we depict the sequence of calculations. In the small sub-tree of 11 nodes we have to compute in “path mode” first, as indicated in the table at the upper left: once we have finished the first three paths we are in a position to know whether the inheritance and descent rules apply. Then we pass to the other two sequences and test node 9 against its descendants and, upwards, against its ancestor. Finally we pass to the last two sequences to test node 10 accordingly.

    As the next step we have to compute in “level mode”, as depicted in the upper table, by studying collateral overlapping and the degree of reciprocal collateral influence.

    Once finished, we are in a position to explore the nodes’ neighborhoods and test the validity of the conjectures over the whole tree.
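The two computation modes can be sketched on a toy tree. The tree shape and the per-node keyword sets below are illustrative, not the 11-node sub-tree from the figure:

```python
def root_to_leaf_paths(tree, node="s0", path=None):
    """'Path mode': enumerate every root-to-leaf path, the order in which
    inheritance/descent rules for keyword sets are tested."""
    path = (path or []) + [node]
    children = tree.get(node, [])
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(tree, child, path))
    return paths

def collateral_overlap(keyword_sets, siblings):
    """'Level mode': pairwise keyword overlap among sibling nodes of one
    level, measuring collateral reciprocal influence."""
    return {(a, b): keyword_sets[a] & keyword_sets[b]
            for i, a in enumerate(siblings) for b in siblings[i + 1:]}

tree = {"s0": ["s1.1", "s1.2"], "s1.1": ["s2.1", "s2.2"], "s1.2": []}
print(root_to_leaf_paths(tree))
kw = {"s2.1": {"pond", "lake"}, "s2.2": {"pond", "gene"}}
print(collateral_overlap(kw, ["s2.1", "s2.2"]))
```

Running path mode first, then level mode, mirrors the sequence of calculations the figure describes before the whole-tree validity test.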