2. What’s in a document? – Part II




By Juan Chamero

Trying to see some light at the end of the tunnel: under this scheme, documents, once stripped of all their editing and presentation commands, can be assimilated to a long TXT corpus chain, ideally separated into sections, chapters, and paragraphs, and within them into terms (and figures). Also, ideally and in simplified form, each document deals with a given “subject”. Let’s look at a sample single paragraph from the Wikipedia.org page obtained when querying “Westphalia” (a quick sketch of the stripping step follows it):

Westphalia is roughly the region between the rivers Rhine and Weser, located north of the Ruhr River. No exact definition of borders can be given, because the name "Westphalia" was applied to several different entities in history. For this reason specifications of area and population are greatly differing. They range between 16,000 and 22,000 km² resp. between 4.3 million and 8 million inhabitants.
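Incidentally, the stripping step mentioned above could be sketched, for instance, like this: a minimal Python example using the standard html.parser module, assuming HTML input (the sample markup is invented for illustration, not taken from the actual Wikipedia page):

    from html.parser import HTMLParser

    class TextStripper(HTMLParser):
        """Collects only text nodes, dropping every tag (presentation command)."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    def strip_to_txt(html: str) -> str:
        parser = TextStripper()
        parser.feed(html)
        return " ".join(parser.chunks)

    # Invented sample markup, for illustration only:
    page = ("<html><body><h1>Westphalia</h1><p>Westphalia is roughly the region "
            "between the rivers <b>Rhine</b> and <b>Weser</b></p></body></html>")
    print(strip_to_txt(page))
    # -> Westphalia Westphalia is roughly the region between the rivers Rhine and Weser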

In the sample above we have highlighted those “terms”, formed by a single word or a sequence of words, that “sound” meaningful to a human, or at least are not considered “common”. Let’s suppose that our human reader uses a procedure that could be expressed as a screening algorithm. Suppose he/she encountered the expression definition of political borders instead of definition of borders. He/she is aware that definition and borders are very probably common words, but the sequence political borders could perhaps be considered a keyword instead -and effectively it is!-. Now let’s separate our initial string in two: one with the sequence of “suspected” non-common terms, and the other with the string of common words, as follows (a sketch of such a screening pass appears right after the two strings):

{Westphalia, Rhine, Weser, Ruhr River, Westphalia, history, 16,000, 22,000 km², resp., 4.3 million, 8 million inhabitants}

{* is roughly the region between the rivers * and * located north of the *. No exact definition of borders can be given because the name * was applied to several different entities in *. For this reason specifications of area and population are greatly differing. They range between * and * * between * and *.}
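Such a screening pass could be sketched, under the assumption of a small hand-made stopword list (a real screener would use a full common-words dictionary), roughly like this:

    # Toy stopword list; an assumption for the example only.
    COMMON = {
        "is", "roughly", "the", "region", "between", "rivers", "and", "located",
        "north", "of", "no", "exact", "definition", "borders", "can", "be",
        "given", "because", "name", "was", "applied", "to", "several",
        "different", "entities", "in", "for", "this", "reason",
        "specifications", "area", "population", "are", "greatly",
        "differing", "they", "range",
    }

    def screen(text: str):
        tokens = text.split()                    # crude whitespace tokenizer
        suspected, skeleton, run = [], [], []
        for tok in tokens:
            word = tok.strip(".,").lower()       # normalize for the lookup
            if word in COMMON:
                if run:                          # close the current suspected run
                    suspected.append(" ".join(run))
                    skeleton.append("*")
                    run = []
                skeleton.append(tok)
            else:
                run.append(tok.strip(".,"))      # extend the suspected run
        if run:
            suspected.append(" ".join(run))
            skeleton.append("*")
        return suspected, " ".join(skeleton)

    terms, common_string = screen("Westphalia is roughly the region between "
                                  "the rivers Rhine and Weser, located north "
                                  "of the Ruhr River.")
    print(terms)          # ['Westphalia', 'Rhine', 'Weser', 'Ruhr River']
    print(common_string)  # * is roughly the region between the rivers * and * located north of the *

Note how consecutive non-common tokens such as Ruhr River are kept together as one suspected term, exactly the behavior our human reader would apply to political borders.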

Now let’s play a little at counting “semantic particles” (a counting sketch follows the three tallies below):

36 different common terms as seen by Conventional Search Engines:
[4-and, 1-applied, 1-are, 1-area, 1-be, 1-because, 3-between, 1-borders, 1-can, 1-definition, 2-different, 1-entities, 1-exact, 1-for, 1-given, 1-greatly, 1-in, 1-is, 1-located, 1-name, 1-no, 1-north, 3-of, 1-population, 1-range, 1-reason, 1-region, 1-rivers, 1-roughly, 1-several, 1-specifications, 4-the, 1-they, 1-this, 1-to, 1-was]

15 possible suspected non-common terms:
[1-4.3, 1-4.3 million, 1-8, 1-8 million, 1-8 million inhabitants, 1-16,000, 1-22,000, 1-22,000 km², 1-history, 1-resp., 1-rhine, 1-ruhr, 1-ruhr river, 1-weser, 2-westphalia]

12 different terms as seen by Conventional Search Engines:
[4.3, 8, 16,000, 22,000, history, inhabitants, km², river, resp., rhine, ruhr, weser, westphalia]
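The tallies above are easy to reproduce; a minimal sketch with Python’s collections.Counter, assuming the common-words string has already been produced by a screening pass like the one shown earlier:

    from collections import Counter

    # The common-words string from above, stars and punctuation removed.
    common_string = ("is roughly the region between the rivers and located "
                     "north of the no exact definition of borders can be given "
                     "because the name was applied to several different "
                     "entities in for this reason specifications of area and "
                     "population are greatly differing they range between and "
                     "between and")

    counts = Counter(common_string.split())
    # Note: "different" and "differing" stay separate here; the tally above
    # folds them together as 2-different, which light stemming would reproduce.
    for word, n in sorted(counts.items()):
        print(f"{n}-{word}")          # e.g. 4-and ... 3-between ... 4-the ...
    print(len(counts), "different common terms")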

In summary, CSEs (Conventional Search Engines) see this paragraph as a chain of 36+12=48 different “words” and, considering only this piece of information, they will index the document (restricted to this paragraph) under as many as 48 tables. A better approach would be to index this one-paragraph document under only the 12 suspected non-common words, which we provisionally dare to name “keywords”: real concepts that will help us retrieve meaningful things.
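To make the contrast concrete, here is a toy sketch (names and sample data invented) of the two indexing strategies: one posting table per word versus one per suspected keyword.

    from collections import defaultdict

    def build_index(doc_id, terms):
        # Toy inverted index: one posting list ("table") per distinct term.
        index = defaultdict(set)
        for term in terms:
            index[term.lower()].add(doc_id)
        return index

    # Sample data from the first sentence; the keyword list is the
    # screening output shown earlier.
    words = "westphalia is roughly the region between the rivers rhine and weser".split()
    keywords = ["Westphalia", "Rhine", "Weser", "Ruhr River"]

    cse_index = build_index("doc-1", words)      # one table per word
    kw_index = build_index("doc-1", keywords)    # one table per keyword
    print(len(cse_index), "tables vs", len(kw_index), "tables")   # 10 vs 4

On the full paragraph the same contrast would be 48 word tables versus 12 keyword tables.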

Note: Take into account that documents, on average, may use from 500 to 1,200 common terms but only from 20 to 40 keywords!

What’s in a document, then? Only two types of semantic particles: Common Words and Expressions, and Concepts, the latter many times wrongly assimilated to keywords and misused in the cyber literature. As we will see later, concepts are semantic paths within Human Knowledge Disciplines, while keywords are chains of words that eventually point to the right/expected semantic target when using search engine facilities to retrieve information and knowledge.
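One possible representation, sketched here with an invented discipline path purely for illustration: a concept as the full path through a Human Knowledge discipline tree, and a keyword as the chain of words a user types hoping to land on that target.

    # Hypothetical example: a concept is a semantic path within a
    # discipline tree; a keyword is just a chain of query words.
    concept = ("Geography", "Europe", "Germany", "Historical regions", "Westphalia")
    keyword = "westphalia borders"

    def points_to(concept_path, query):
        # Crude check: does some query word name the concept's leaf?
        leaf = concept_path[-1].lower()
        return any(word == leaf for word in query.lower().split())

    print(points_to(concept, keyword))   # True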