13. Art Map Prototype II

By Juan Chamero

Fingerprints and Semantic Neighborhood

Let us now explore a node and its neighborhood in order to build its semantic fingerprint. Take the node “comedy of situation”.

On sheet 3 of the Art sample (which you may download from our Darwin Website, located at http://www.intag.org ) you can see the content of a node, the meaningful semantic unit of a Web Thesaurus.

Within a node of the Art Tree we could host all the semantic objects associated with its name, namely:

a) A BVL (Basic Virtual Library) of its selected Authorities (6,700 URLs);
b) Its semantic fingerprint, a sort of semantic spectrum of concepts;
c) Its related statistics.

This subject has the "path code" 0.1.2.2.2.2.5, that is, the semantic chain [art(0), the arts(1), performing arts(2), main performing arts(2), theatre(2), genres(2), comedy of situation(5)], where the link “art” is its head and the link “comedy of situation” is its tail. Each semantic chain is associated with a unique concept. Note that the tail is a special keyword that, because of its semantic importance, has become a subject, in this case “Comedy of Situation”.
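The pairing between a path code and its semantic chain can be sketched in a few lines of Python. This is only an illustration: the function names (`parse_path_code`, `label_chain`, `head_and_tail`, `format_chain`) are hypothetical and are not part of any Darwin tool; only the path code and the link names come from the text above.

```python
# Hypothetical sketch: pair each link of a semantic chain with its path-code component.
# Only the path code and the link names are taken from the article.

CHAIN = [
    "art", "the arts", "performing arts", "main performing arts",
    "theatre", "genres", "comedy of situation",
]


def parse_path_code(code):
    """Split a dotted path code such as '0.1.2.2.2.2.5' into integers."""
    return [int(part) for part in code.split(".")]


def label_chain(code, names):
    """Return (link name, path-code component) pairs, one per link."""
    indices = parse_path_code(code)
    if len(indices) != len(names):
        raise ValueError("path code and chain must have the same length")
    return list(zip(names, indices))


def head_and_tail(names):
    """The head is the first link of the chain; the tail is the last."""
    return names[0], names[-1]


def format_chain(pairs):
    """Render the chain in the article's notation: [name(index), ...]."""
    return "[" + ", ".join(f"{name}({index})" for name, index in pairs) + "]"
```

Calling `format_chain(label_chain("0.1.2.2.2.2.5", CHAIN))` reproduces the bracketed chain quoted above, and `head_and_tail(CHAIN)` returns the head “art” and the tail “comedy of situation”.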

From each BVL we extract all the keywords that belong to its subject; these are, to some extent, the basic components of its semantic spectrum. As we have seen, in any document there exist two and only two types of semantic molecules: Common Words and Expressions, which are the literary molecules, and concepts, which are represented by consensual keywords. Through a process that resembles crude-oil distillation (not yet explained), keywords are separated from Common Words and Expressions.

We may distinguish several types of keyword sets, namely:

1. Upper level keywords,
2. Core keywords,
3. Collateral keywords,
4. Down level keywords,

5. Outer-distant keywords,

6. Authors-Artists-Critics-Management keywords,
7. Names of places,
8. Events,
9. Entities,
10. Numbers and symbols,
11. Works

We do not dare to classify them yet. Perhaps classes 1 to 4 are concepts used to make the subject locally meaningful. Class 5 covers concepts that are relatively distant within the same discipline, as well as outer concepts belonging to other disciplines. Class 6 concerns persons, either juridical or physical, closely related to the subject: in a film, for example, actors, directors, choreographers, writers, and critics belong to this class. Class 7 refers to places, meaning any type of space: geographical places, geographical features, and all types of domains such as “Web space”, “ideas realm”, etc. Class 8 refers to all types of events, from dates such as October 12th 1492 to events like WWI, the Olympic Games, and the Oscar Awards. Class 9 refers to all types of entities, most often juridical persons such as professional associations, NGOs, universities, brands, etc. Class 10 refers to all types of strings of numbers and characters: many of them may be literary noise or errors, but some may have very specific meanings. Finally, class 11 refers to the objects produced or generated within the subject realm or related to it. For Comedy of Situation, these could be the names of famous comedies of this type.

Note: any keyword could be the tail of many other semantic chains pertaining to the same or to other disciplines. An event or a work, for example, could itself be a subject: the Olympic Games or WWII.

For each set we determine the probability (the percentage in red) that 1, 2, 3, ... or all of its members are used in the documents of the related BVLs. The numbers in blue are the sets' cardinalities and the numbers in fuchsia are the keywords' popularities.

Core keywords are those that semantically define the meaning of the screened documents, so their "red" percentage should be as close as possible to 100%. For each discipline we may define a neighborhood mask vector of five components [m1, m2, m3, m4, m5], one for each keyword type from 1 to 5 respectively. In the Darwin Ontology it is assumed that subjects that "pass" this test have a BVL of WWDs (Well Written Documents).
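The text does not say how a subject is compared against its mask vector. As a minimal sketch, assume that each component m1..m5 is a minimum acceptable "red" percentage for the corresponding keyword type, and that a subject passes when every one of its five percentages reaches its threshold. The mask values below are invented for illustration; the sample percentages are the ones listed for classes 1 to 5 of “comedy of situation”.

```python
# Sketch under a stated assumption: each mask component is a minimum percentage.
# The article does not define the actual test; HYPOTHETICAL_MASK is invented.

SAMPLE_PERCENTAGES = [15, 90, 25, 20, 12]  # classes 1..5, from the article's listing
HYPOTHETICAL_MASK = [10, 85, 20, 15, 10]   # m1..m5, illustrative values only


def passes_mask(percentages, mask):
    """True when every class percentage reaches its mask threshold."""
    if len(percentages) != 5 or len(mask) != 5:
        raise ValueError("expected five components, one per keyword type 1..5")
    return all(p >= m for p, m in zip(percentages, mask))


def failing_classes(percentages, mask):
    """Return the 1-based class numbers whose percentage falls below the mask."""
    return [
        class_id
        for class_id, (p, m) in enumerate(zip(percentages, mask), start=1)
        if p < m
    ]
```

Under this assumption the sample subject passes, while a subject whose core percentage dropped to 60% would fail on class 2 only. A real mask might just as well use ranges or weights per class; the comparison rule is the part most open to revision.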

Concepts by type
We omit types 8 to 10

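Fuchsia: Google popularity (the figure after each keyword)
Red: probability of appearance in the node's BVLs (the percentage in each heading)
Blue: number of different keywords used (the count after the percentage)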

 

1. Upper neighborhood -15%- -12-
Narrative genre: 20,500
Sitcom: sitcom AND art: 493,000
Comedy of errors: 1,170,000
Real life situations: 930,000
................................

2. Core -90%- -27-
Science fiction sitcom: 1,100
Comedy of intrigue: 12,800
Stratagems: stratagems AND comedy: 21,200
Ridiculous situations: 11,500
Farcical humor: 2,210
Honeymooners: honeymooners AND comedy: 412,000
Contrived situations: 15,600
..........................................

3. Collateral -25%- -11-
Parodying: 386,000
Sketch comedy: 1,720,000
Stand-up comedy: 3,460,000
Comedy of manners: 270,000
............................................

4. Down (AND Comedy) -20%- -15-
Laugh tracks: 2,200
Canned applause: 2,700
Verbal sparring: 17,600
Sight gags: 162,000
Running gags: 115,000
..................................

5. Outer-distant -12%- -9-
Stock character: 64,700
Gag character: 2,350
Vaudeville: 3,290,000
Minstrel Show: 238,000
Burlesque: 4,870,000
................................

6. Authors-Artists-Critics-Management -18%- -16-
William Shakespeare: 9,500,000
Henry Bergson: 11,300
Patrick Marber: 122,000
Garry Marshall: 772,000
…………………………

11. Works -65%- -372-
I Love Lucy: 3,600,000
Amos n' Andy: 55,500
Fibber McGee and Molly: 182,000
The Burns and Allen Show: 22,100
The Adventures of Ozzie and Harriet: 58,700
The Adams Family: 119,000
El Chavo del Ocho: 323,000
All in the Family: 1,840,000
Batman: 76,800,000
Adam West: 1,300,000
Murphy Brown: 630,000
…………………………

Fingerprint

The information above for classes 1 to 11 can be considered the subject's fingerprint because it identifies its semantic profile. One of the Darwin Conjectures states that classes 1 to 5 alone suffice to differentiate, statistically, any subject of Human Knowledge. So far this has only been tested, successfully, in our first two prototypes: Computing and Art.
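One way to picture a subject fingerprint is as a small record per keyword class, holding the probability, the set cardinality, and the keyword popularities. The sketch below is an assumed layout, not the Darwin data model: the names `KeywordClass`, `FINGERPRINT` and `discriminating_vector` are hypothetical. The numbers are copied from the listing above, with only a few keywords kept per class for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordClass:
    """One class of a subject fingerprint (field names are illustrative)."""
    class_id: int          # 1..11, as numbered in the article
    name: str
    probability_pct: int   # the "red" percentage
    cardinal: int          # the "blue" number of different keywords
    popularity: dict = field(default_factory=dict)  # keyword -> "fuchsia" figure


# Figures copied from the article's listing; keyword samples are truncated.
FINGERPRINT = [
    KeywordClass(1, "Upper neighborhood", 15, 12, {"Comedy of errors": 1_170_000}),
    KeywordClass(2, "Core", 90, 27, {"Comedy of intrigue": 12_800, "Farcical humor": 2_210}),
    KeywordClass(3, "Collateral", 25, 11, {"Sketch comedy": 1_720_000}),
    KeywordClass(4, "Down", 20, 15, {"Sight gags": 162_000}),
    KeywordClass(5, "Outer-distant", 12, 9, {"Vaudeville": 3_290_000}),
    KeywordClass(6, "Authors-Artists-Critics-Management", 18, 16, {"Garry Marshall": 772_000}),
    KeywordClass(11, "Works", 65, 372, {"All in the Family": 1_840_000}),
]


def discriminating_vector(fingerprint):
    """Keep only classes 1 to 5, the ones the conjecture relies on."""
    by_id = {cls.class_id: cls for cls in fingerprint}
    return [by_id[class_id].probability_pct for class_id in range(1, 6)]
```

For this node, `discriminating_vector(FINGERPRINT)` yields the five percentages of classes 1 to 5. Whether the percentages alone, or the keyword sets themselves, carry the discriminating power is not stated in the text; the vector form is chosen here only because it is the simplest to compare.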

Towards a Darwin Utopia: let us suppose that all existing Web documents were irreversibly deleted and only their fingerprints remained. We may also suppose that the BVLs' fingerprints were deleted as well. It means that all literary expression has disappeared: no glossaries at hand, only a Jargon of Common Words and Expressions. Would it be possible to reconstruct our knowledge? We claim that, if the experts are still alive, the whole of knowledge could be reconstructed and perhaps improved! In the experts' collective mind there exists the equivalent of encyclopedias, glossaries, basic manuals, and tutorials. This collective brain would have to rebuild a reasonably good literary filling for all subjects.

Currently we are designing an algorithm to create different types of fingerprints resembling wave spectral lines. The analogy is waves versus concepts. Potential concepts are born within a specific subject at a given date. Many of these concepts evolve along two lines: in their meaning and in their names. Over time they arrive at rather stable meanings and at names that have consensus. As time passes, these new concepts continue evolving, probably at a slower rate than before, gaining or losing consensus until obsolescence appears and they finally die. This happens with classes 1 to 5.

Document fingerprints

Once the subject fingerprint has been determined, we may proceed to compute the fingerprints of the BVL documents. In well-defined disciplines it is perfectly possible to infer the content of a whole document by reading its fingerprint. As a matter of fact, from now on we may determine the fingerprint of any document! Document fingerprints are essential for retrieving documents of similar meaning via a kind of Markovian process, as we will see later when dealing with the Darwin Search Wizard.
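As a rough illustration, a document fingerprint could be obtained by checking which keywords of each class of the subject fingerprint occur in the document. The sketch below assumes plain case-insensitive phrase matching, and compares two fingerprints with a Jaccard overlap. Both choices are stand-ins: the Markovian retrieval process mentioned above is not described here, and the names `document_fingerprint` and `jaccard` are hypothetical. The keywords are a subset of those listed for “comedy of situation”.

```python
import re

# Subset of the article's keyword listing, grouped by class (lowercased).
SUBJECT_KEYWORDS = {
    "core": ["comedy of intrigue", "ridiculous situations",
             "farcical humor", "contrived situations"],
    "collateral": ["sketch comedy", "stand-up comedy", "comedy of manners"],
    "down": ["laugh tracks", "canned applause", "sight gags", "running gags"],
}


def document_fingerprint(text, subject_keywords):
    """For each class, list the subject keywords found in the text.

    Assumption: a keyword counts as present if its lowercased phrase
    appears anywhere in the whitespace-normalized, lowercased text.
    """
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        cls: sorted(kw for kw in keywords if kw in normalized)
        for cls, keywords in subject_keywords.items()
    }


def jaccard(fp_a, fp_b):
    """Overlap of two fingerprints as |A & B| / |A | B| over (class, keyword) pairs."""
    a = {(cls, kw) for cls, kws in fp_a.items() for kw in kws}
    b = {(cls, kw) for cls, kws in fp_b.items() for kw in kws}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A document mentioning “running gags”, “canned applause” and “ridiculous situations” would thus get one core keyword and two down-level keywords, and its overlap with another document grows with the number of keywords they share within the same classes.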

Document fingerprinting is also a highly desirable feature for detecting and preventing all types of plagiarism, and a fundamental tool for enhancing the efficiency of SDI (Selective Dissemination of Information) systems.