16. K-side Dissection

By Juan Chamero


    The figure below depicts an as-is micro sample of the K-side: a global overview of its semantic structure that lets us appreciate the different entities of the Darwin Ontology, namely authorities, concepts, semantic chains, names, words, keywords, popularity, acceptations, black and white hat stratagems, etc.

Note 1: This analysis was performed entirely for the English language. As “raw data” we use the auxiliary information retrieved by Darwin agents when mapping Art in the World, the second Darwin prototype, so you may find it a little art-biased.

1. Colors

    We chose colors as our first item of analysis because we all use them in our conversations, in our messages and when documenting. They will serve as a friendly introduction to some search talent themes.
    We chose a palette of the most frequently used colors. Querying Google with them, we obtained the following list of triads: [color, hexadecimal code, popularity].

Note 2: By popularity we mean the number of references reported by search engines. It differs from Google's own popularity measure, the main variable it uses to rank Websites. For Google, we take as “query popularity” its number of results, that is, how many documents logically match the query. We explain some of this logic in the examples below. Perhaps a better name would be “match-making frequency”.

Black, 000000, 1,300,000,000
Red, ff0000, 1,040,000,000
White, ffffff, 1,020,000,000
Blue, 0000ff, 866,000,000
Green, 008000, 876,000,000
Yellow, ffff00, 491,000,000
Brown, a52a2a, 412,000,000
Silver, c0c0c0, 405,000,000
Orange, ffa500, 371,000,000  [the fruit, The Royal House of Orange]
Orange color, 16,100,000
“Orange color”, 4,720,000
Orange colour, 1,990,000
Purple, 800080, 193,000,000
Grey, 808080, 148,000,000
Navy, 000080, 116,000,000
Navy color, 2,480,000
Olive, 808000, 70,000,000
Olive color, 1,470,000
Beige, f5f5dc, 61,200,000
Lime, 00ff00, 45,400,000
Indigo, 4b0082, 44,900,000
Violet, ee82ee, 44,400,000
Teal, 008080, 33,000,000
Teal color, 894,000
Maroon, 800000, 32,700,000 [Caribbean islands, runaway slaves, maroon people...]
Maroon color, 747,000
Aqua, 00ffff, 29,800,000
Aqua color, 844,000
Cyan, 00ffff, 26,500,000
Fuchsia, ff00ff, 7,600,000

    The hexadecimal code defines the color precisely on the RGB (red, green, blue) scale. Red is full red with zero green and blue: ff0000, that is ff, 00, 00. Each channel has 256 levels, from 00 to ff in hexadecimal, which is the same as 0 to 255 in decimal (red: 255, 0, 0). The purest green goes by the “name” Lime, coded as 00ff00 (0, 255, 0), and the purest blue is 0000ff (0, 0, 255). Black is the absence of color, 000000 (0, 0, 0), and white is the mixture of all colors, which in the RGB system is ffffff (255, 255, 255).
    What does the popularity number mean here? It tells us that the word black is present in 1,300,000,000 documents as detected and counted by Google's robots.
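The hexadecimal-to-decimal reading described here can be sketched in a few lines of Python. This is only an illustration; the function name `hex_to_rgb` is our own and is not part of any Darwin tool.

```python
def hex_to_rgb(code: str) -> tuple[int, int, int]:
    """Split a six-digit hexadecimal color code into decimal (red, green, blue)."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError(f"expected six hex digits, got {code!r}")
    # Each pair of hex digits encodes one channel, from 00 (0) to ff (255).
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


# Color names and codes taken from the list above.
for name, code in [("Red", "ff0000"), ("Lime", "00ff00"), ("Blue", "0000ff"),
                   ("Black", "000000"), ("White", "ffffff")]:
    print(name, code, hex_to_rgb(code))
```

Each channel is read independently, which is why Aqua and Cyan, both coded 00ffff, are the same color under two names.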

Note 3: We are not sure how Google computes this value. As we warned in previous posts, it seems to depend on things such as the moment of the query, the IP address you connect from, and sometimes the sequence of queries.

Unexpected findings
    Rare and unexpected occurrences may carry important information. Let’s inspect some of those highlighted in turquoise.
    Any color name could point to a color or to many other things, depending on its thematic context. For instance “orange color”, written with quotation marks, matches only documents that contain exactly the phrase “orange color”. If you query orange color without quotation marks, Google understands that you are looking for documents that have the words orange AND color somewhere in their text. This example shows another remarkable fact: small and sometimes subtle differences between words, such as theatre versus theater and color versus colour. Color is the USA spelling and colour the UK one, and statistically the American variants are more popular simply because the USA presence on the Web is larger than the UK presence.

The beginning of the “semantic chain” concept
    The same happens with the navy, teal, and maroon colors. Note in all these examples how abruptly popularity drops as we become more specific by adding a second word, in this case “color”: in this sample by a factor of 30 to 50! The more properly selected “links” you add to the query chain, the more focused and specific the search outcome! In these examples we have defined two-link “semantic chains”, for instance [navy color]. A three-link semantic chain would be [The Human Genome].
    Another thing you notice is that each color name may have many meanings; that is, names can have different meanings depending on their context. For instance, Teal could be a color, a type of duck, or TEAL, which stands for Technology-Enabled Active Learning; Maroon could be a color or, in the past, runaway Caribbean slaves; and blue could be a color or a folk spiritual song.
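The narrowing effect of the second link can be checked with simple arithmetic. The sketch below uses the result counts quoted in the color list; the function name `narrowing_factor` is ours, introduced only for illustration.

```python
# (one-link count, two-link count) as reported in the color list above,
# e.g. [navy] versus [navy color].
counts = {
    "navy": (116_000_000, 2_480_000),
    "teal": (33_000_000, 894_000),
    "maroon": (32_700_000, 747_000),
}


def narrowing_factor(one_link: int, two_links: int) -> float:
    """How many times fewer results the two-link chain returns."""
    return one_link / two_links


factors = {name: narrowing_factor(*pair) for name, pair in counts.items()}
for name, factor in factors.items():
    print(f"[{name}] -> [{name} color]: about {factor:.0f} times fewer results")
```

For these three colors the factors come out at roughly 37 to 47, inside the 30 to 50 range mentioned above.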

2. Gender
    Diving into the English-speaking Web you might bet that men and women should have more or less the same weight. Shouldn't they? As a sample we select the pairs he/she, man/woman, and boy/girl. It seems that, at least in documents, the masculine is still dominant. Concerning he/she, take into account that in English documents the form “he/she” is used more and more in place of the implicitly assumed “he”. Of course, if you dive into other languages these relative weights may differ substantially.

3. Sacred documents

    Have you ever imagined the Kabbalah as second after The Bible? And the Gnosis fourth? Just by testing chains of words you may find different listing hierarchies. For instance the symbol Aum (or Ohm) points to Yoga and to other Oriental disciplines. If we want to know how many references correspond to the Yoga concept, we first have to discover the right “keyword” that matches what we have in mind as Yoga: Aum, Ohm, Yoga sutras, Yoga asanas perhaps?

Inherent subjectivity: What is listed in this sample are keywords, subjective keywords, in fact mine! An expert or an agent trying to map the Web (me or “he”) should unveil the best keyword for each subject, for you and for the majority of users, English-speaking in this case. One criterion is to choose as keyword the candidate with the greatest popularity/weight, once we have checked that the top relevant documents retrieved deal with the sought subject, and reasonably well besides. Let’s see this criterion in action:

Note 4: Take into account that this mastery must be transferred to Darwin agents when mapping the Web.

“Yoga”: 72,500,000: a high number, but it does not point exclusively to sacred writings;
“Yoga books”: 218,000: not significant and ambiguous;
“Yoga sutra”: 116,000: a possibility, but agents always test plurals as well;
“Yoga sutras”: 228,000: reasonably high, and it points to an important keyword: Patanjali;
“Patanjali”: 580,000: high, but a human would reject it: talking about Patanjali is like talking about a Bible version (we are studying how to instruct an agent to do the same, with similar proficiency);
“Yoga sacred writings”: 271,000: but it deals with a broader subject: Hinduism…
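The criterion applied to the candidates above can be sketched as follows. This is a minimal illustration, not the Darwin agents' actual procedure. The `on_topic` flag is an assumption of ours: it simply records the human verdict given for each candidate in the list, which is exactly the judgment that still has to be taught to an agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    keyword: str
    popularity: int   # result count quoted in the list above
    on_topic: bool    # assumption: the human verdict stated in the list


candidates = [
    Candidate("Yoga", 72_500_000, False),               # not exclusively sacred writings
    Candidate("Yoga books", 218_000, False),            # not significant and ambiguous
    Candidate("Yoga sutra", 116_000, True),
    Candidate("Yoga sutras", 228_000, True),
    Candidate("Patanjali", 580_000, False),             # rejected by a human reader
    Candidate("Yoga sacred writings", 271_000, False),  # broader subject: Hinduism
]


def best_keyword(cands: list[Candidate]) -> Optional[Candidate]:
    """Pick the most popular candidate among those judged to be on topic."""
    accepted = [c for c in cands if c.on_topic]
    return max(accepted, key=lambda c: c.popularity) if accepted else None


choice = best_keyword(candidates)
print(choice.keyword if choice else "no acceptable keyword")
```

Popularity alone would select “Yoga”; the relevance check is what moves the choice to a more specific keyword.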

So we choose “Yoga sutras”. Let’s continue with other unexpected outcomes. Isn’t it remarkable to discover the popularity of the Hammurabi code? The Tao Te Ching ranking, by contrast, we could explain by the strong Far East influence over the last sixty years. Perhaps because of Iran?

4. The Power

    The preponderance of the WTO is reasonable but not trivial. It seems that commerce is actually ruling globalization. A geostrategic “troika” takes second place: the UN, the EU and NATO. Within this sample of the upper level of the establishment, “The Group of 8” follows in importance. From a geopolitical point of view, and for many outstanding authors, the reaction to globalization is a still amorphous multitude that expresses itself in erratic reactions such as the Kyoto Protocol and the Anti-Globalization Movements which, as you may easily appreciate, are relatively weak in terms of Web presence.

5. The Great Works

    I would never have imagined this ranking 50 years ago, not even in dreams! The Great Wall is understandable, because of the rise of China, space travel and the film industry. In this item most keywords are multiple words, and the inclusion or not of the particle “the” deserves special consideration.
With “Apollo Program” our intention was to point to the first moon landing, performed in 1969. For example, the keyword “moon landing” has 707,000 references but covers much more than the specific Apollo Program, a tremendous effort spanning almost 15 years and costing about 150,000,000,000 dollars in 2007 dollars.
    The tremendous difference in popularity between The Great Wall and the Great Pyramids should also call our attention. We tried some equivalent keywords pointing to the ancient Egyptian works, namely “Egyptian pyramids”, but the results could not be improved. “The human genome” was another singular case: the keyword “human genome” has many more references (4,330,000), but human genome alone, without “the”, tells us about something more generic, for instance Human Genome Sciences and Human Genome living strategy.

6. Something about Internet power
    The Internet is for many a sort of democratic-anarchist utopia. However, there are societies such as the Internet Society (ISOC) that dictate rules: rules you are only advised to follow, but rules after all. ISOC was founded in 1992 to provide leadership in Internet-related standards, education, and policies. It has 1,900,000,000 references (at the time of the query, March 14th 2008; as I write this page on March 17th it has gone down to 1,650,000,000!). Either way the figure looks high and reasonable. What’s not reasonable at all is that Google also has 1,900,000,000 and Yahoo almost the same, 1,740,000,000. Why? Because they are “Search Engines”, an extraordinary and necessary service, but a service is one thing and a dictatorship another. If we speak today of 9,000,000,000 indexed Web documents, those figures suggest that one out of every 5 documents intentionally mentions either Google or Yahoo within its text. In my humble opinion that’s false. Let’s see the outcomes for some other important search engines:

Technorati, 232,000,000
Lycos, 75,300,000
Live.com, 66,400,000
Altavista, 22,000,000
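The “one out of every 5 documents” estimate is plain division of the indexed total by each reference count. Here is a small sketch using only the figures quoted in this section; the helper name `one_in_n` is ours.

```python
INDEXED_DOCUMENTS = 9_000_000_000  # total indexed Web documents assumed in the text

# Reference counts quoted in this section.
references = {
    "Google": 1_900_000_000,
    "Yahoo": 1_740_000_000,
    "Technorati": 232_000_000,
    "Lycos": 75_300_000,
    "Live.com": 66_400_000,
    "Altavista": 22_000_000,
}


def one_in_n(refs: int, total: int = INDEXED_DOCUMENTS) -> float:
    """Return n such that one document in every n mentions the name."""
    return total / refs


for name, refs in references.items():
    print(f"{name}: about one document in {one_in_n(refs):.0f}")
```

Google and Yahoo both land at about one document in five, while the other engines are mentioned far less often.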

7. Scientific Theories
    Here we have the opportunity to learn a little more about building keywords and retrieving information from the Web. “Information Theory” is a keyword that belongs to Information Science and corresponds to a specific mathematical approach to communications. “Economic Theory”, by contrast, would not be a keyword in this context because of its ambiguity: it means something like looking at economic facts from theoretical points of view. “The Theory of Evolution” points to Darwin's findings, whereas “Evolution Theory” is more generic and covers theoretical subjects about evolution, for instance “Web evolution theory”. One remark: why is “Chaos Theory” so widely referenced? To make inferences, you are invited to browse a little through the references and/or build semantic chains with it, for example querying for [“chaos theory”, games] and [“chaos theory”, life], uncertain fields of science where this theory could add meaning.
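A semantic chain such as [“chaos theory”, games] can be turned into a query string mechanically. The sketch below assumes, as described earlier for Google, that quotation marks request an exact phrase and that space-separated terms are combined with AND; the function `build_query` is hypothetical and is not part of any search engine API.

```python
def build_query(chain: list[str]) -> str:
    """Join the links of a semantic chain into one search query.

    Multi-word links are wrapped in quotation marks so they match as exact
    phrases; single-word links are left bare.
    """
    parts = []
    for link in chain:
        link = link.strip()
        parts.append(f'"{link}"' if " " in link else link)
    return " ".join(parts)


print(build_query(["chaos theory", "games"]))  # "chaos theory" games
print(build_query(["chaos theory", "life"]))   # "chaos theory" life
print(build_query(["navy", "color"]))          # navy color
```

Passing ["navy color"] as a single link instead of two would produce the exact-phrase query, which is the difference between “orange color” and orange color discussed in the Colors section.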

8. Countries
    You may check that this sample ranking has nothing to do with World statistics. The unexpected findings are France, Mexico, Argentina, Russia, Iraq and Iran.

English: 1,880,000, versus French: 515,000,000;
English language: 49,200,000 versus French language: 4,690,000;

Then why does France have 1,030,000,000 references? Perhaps because the word “france” points to France in both English and French? Perhaps this also explains why Google sometimes gives 900,000,000 instead of 1,030,000,000.

    Mexico has an explanation: it shares a border with the US, and Mexicans are the largest Spanish-speaking minority living and/or working in the US: almost 70% of the nearly 50 million “Latinos” living in the US are of Mexican origin or descent.

    For Argentina, perhaps because of its size, the variety and richness of its natural resources, its historically welcoming spirit, its low demographic density and its geopolitical situation: relatively distant from all actual and expected catastrophic scenarios.

Note 5: When looking for names we have to take into account “native” names versus the names used in English, for instance España versus Spain, Argentina versus Argentine. Argentina is another case where Google gives two different values for “Argentina”: 620,000 versus 350,000. Argentine has a low outcome: 28,000,000. By contrast, España and Spain have pretty much the same outcome.

    Russia shows an unexpectedly low outcome, perhaps because of the language barrier. It is hard for Russian people to speak English and even harder to write it! For instance, Putin has 144,000,000 references against 215,000,000 for Bush, quite comparable. To be fair, perhaps we should query in Cyrillic.
    Concerning Iraq and Iran, I invite you to draw your own conclusions. They differ from the Russian case in that in these countries the Internet is not a widespread communication tool, for economic and cultural reasons, so their relatively low Internet popularity corresponds to what the K-side says about them, in English.

9. Authorities – Physical Persons

    In my humble opinion this sample looks reasonable, as expected, with only one exception: Bill Gates. We use the word “authority” to designate a person, either physical or juridical, that has influence over a given community. Evidently Bill Gates has an outstanding “semantic” influence on the Web. His creation, Microsoft, has 650,000,000 references in Google!
    Alan Turing, for me the father of Computing Science, has relatively few, but this outcome has a certain logic: Computing Science is very specific and low-profile from a marketing point of view. Albert Einstein has a high outcome, and he should have more. The Great Leonardo was almost unknown to common people until recently, when other institutional “authorities”, the film and book publishing industries, promoted him. John Von Neumann is a curious case. He should be at Einstein's level of popularity, not only because they shared the same epoch, scenarios and a comparable talent, but because of his public image as well.
Another question: why is Machiavelli so high? Perhaps because “Machiavellianism” is an art always present in politics and war affairs. A similar revival occurs with “Art of War” and “Sun Tzu”, which have 2,100,000 and 2,330,000 mentions respectively!

10. Art Works
    This list also looks reasonable, with only one exception: the “emoticons”. Any of you may come up with a credible analysis to justify the ranking of this sample. Emoticons and smilies (134,000,000) have become a popular art, as is the case with “Graffiti”, “street dance”, and “street theater”.

11. Literary Works
    Here we have another example where something popular displaces long-established concepts. Cinderella, the popular fairy tale for children of “all times”, overcomes Hamlet! You didn’t expect that, did you? Another example is the current revival of Hans Christian Andersen's (1,910,000,000) fairy tales, with "the ugly duckling" having 630,000 references.
Concerning “The Lord of the Rings”, we expected a dominant rank because it is in fact a multimedia work inspired by the book.

12. Diseases
Just something to think over, and over again. I was tempted to include stress in this list, but this word is in fact a keyword that belongs to many subjects such as engineering, gym, sports, political sciences, etc. A shortcut would be querying the word “stressed”, which points fundamentally to our notion of body-mind stress. From this “seed” it would be interesting to dive a little further with semantic chains of the type [disease, gender, country, socioeconomic level, age].