7. Towards e-Libraries

By Juan Chamero




What is a bookmark?
Literally, “to bookmark” is to mark a document, or a place within it, for later review. It also refers to a browser feature that saves URLs for similar purposes. In this sense a bookmark could be seen as a list of authority addresses. We are now going to dig a little deeper into these types of “lists”, which enable us to map Human Knowledge much as the Human Genome maps the immanent knowledge that leads to life.

An overview of the bookmark skeleton: In the figure above we depict a bookmark section for a given discipline, for instance Geography, which opens into 65 subjects (branches). Subject S56 is detailed at its third-level branch S563, which in turn opens into 9 branches, from S5631 to S5639. Keywords that collide with documents referenced in the “neighborhood” of subject S5635 are shown at bottom left. Within bookmarks of this type, disciplines’ semantic skeletons are embedded as logical trees, and within each subject node of any tree we may imagine stored its semantic profile (its “fingerprint”) and a specific Directory of Authorities (see the meaning of Authorities in this context below) behaving as a BVL, a Basic Virtual Library, for that subject.
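As a rough sketch of that skeleton (class and field names are ours, purely illustrative, not any published Darwin code), each subject node might be modeled as a record holding its code, its semantic profile, and its own small Directory of Authorities:

    # Hypothetical sketch of a bookmark subject node; names are illustrative only.
    class SubjectNode:
        def __init__(self, code, name):
            self.code = code           # e.g. "S5635"
            self.name = name           # human-readable subject name
            self.children = []         # subordinate subjects (the tree branches)
            self.profile = set()       # semantic profile: keywords that "fingerprint" the subject
            self.authorities = []      # Directory of Authorities: the node's Basic Virtual Library

        def add_child(self, node):
            self.children.append(node)
            return node

    # Building the Geography fragment described above:
    s56 = SubjectNode("S56", "subject 56")
    s563 = s56.add_child(SubjectNode("S563", "third-level branch"))
    for i in range(1, 10):             # S5631 .. S5639
        s563.add_child(SubjectNode(f"S563{i}", f"branch {i}"))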

We may now imagine a Human Knowledge Bookmark as a virtual encyclopedia encompassing all disciplines, from classic ones like Medicine, Mathematics, Engineering, Physics, and Computing, to new ones like Security, Games, Geopolitics, and Entertainment. This bookmark, which may evolve to become the skeleton of the book of books, could be hierarchically organized by discipline, and within each discipline by subject, sub-subject, sub-sub-subject, and so on until covering the most specific topics. Some disciplines have well-defined curricula displayed in up to thirteen levels of hierarchy, correlated with “logical trees” of the same number of levels and the same number of subjects. If this bookmark is retrievable and safe, all we need to access any piece of it is a precise index of pointers.

Towards e-Libraries: It would be advisable that all meaningful pieces of knowledge be located in the right “room” (in fact, e-rooms), for example a room for Medicine, with each room subdivided into sections, sub-sections, and so on until exhausting the hierarchy: let’s say a big section for allopathic medicine, an aisle for oncology, several shelves for the different types of cancer, etc. Following a decimal classification system, a code like 003.01.17.35 would be the index to locate “lung cancer” (35) within “oncology” (17) within “allopathic medicine” (01), all within Medicine as discipline number 3. And this shelf should be large enough, and structured, to accommodate thousands of topics classified in two to three more levels. These virtual buildings of e-rooms have a significant advantage: their size and layout may adapt perfectly to the evolution of knowledge.
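A minimal sketch of how such a decimal code could be resolved into its room/section/aisle/shelf path (the small catalog below is invented for the example, mirroring the lung-cancer path just described):

    # Illustrative only: a toy decimal classification resolver for codes like "003.01.17.35".
    CATALOG = {
        "003": "Medicine",
        "003.01": "Allopathic medicine",
        "003.01.17": "Oncology",
        "003.01.17.35": "Lung cancer",
    }

    def resolve(code):
        """Return the room > section > aisle > shelf path for a decimal code."""
        parts = code.split(".")
        path = []
        for i in range(1, len(parts) + 1):
            prefix = ".".join(parts[:i])
            path.append(CATALOG.get(prefix, f"<unknown {prefix}>"))
        return " > ".join(path)

    print(resolve("003.01.17.35"))
    # Medicine > Allopathic medicine > Oncology > Lung cancer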
The figure above depicts the process of Thesaurus creation once a bookmark, as a discipline skeleton, has been unveiled from the Web. From the discipline’s Logical Tree we pass to the Bookmark generation step, as shown at the top, and then to the Thesaurus generation, filling in keywords and authorities node by node, as shown at the bottom.
This is the way most libraries order their books, but in this case we are talking about a not-yet-existent library several times the size of the Library of Congress of the United States. How do we ideally retrieve books in these “conventional libraries”? Usually by title, author, publisher, and date, and eventually by the main subjects/topics covered, for books and documents that were previously reviewed and classified by experts, that is, human authorities.

Going back to the Web as a substantial reference: its size grows at a terrific pace, and the themes dealt with, and their interrelations, are continuously changing. Our Darwin technology may adapt to these changes. The US Library of Congress, for instance, cannot easily change either its “Authorities” code or its building layout, even when substantial changes in discipline curricula are recognized!


The figure above depicts the Logical Tree Evolution process. As all Thesaurus objects, specifically keywords and URLs, are demand sensitive and continuously updated, agents can suggest curricular and authority changes to humans, such as deleting a branch, as shown on the left. Darwin URLs are in fact i-URLs, because they carry not only the URL but also a summary of the content pointed to, its subject, its fingerprint, and a set of counters that register its demand along its life. Agents may also suggest new branches and new nodes.
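An i-URL, as just described, could be sketched as a record like the following (field names are our assumption for illustration, not the actual Darwin schema):

    # Hypothetical i-URL record: a URL enriched with content metadata and demand counters.
    from dataclasses import dataclass, field

    @dataclass
    class IUrl:
        url: str                     # the plain URL
        summary: str                 # short summary of the pointed-to content
        subject: str                 # subject node code, e.g. "003.01.17.35"
        fingerprint: set = field(default_factory=set)  # keywords forming the semantic profile
        hits: int = 0                # demand counter, updated along the i-URL's life

        def register_hit(self):
            """Record one user access; agents read these counters to suggest tree changes."""
            self.hits += 1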
Let’s take a look at the Web reservoir: If our concern is documents hosted on the Web, things are different. Each document could be considered a piece of knowledge located somewhere within the huge Web space, yet perfectly retrievable via its URL. However, documents are not arranged in the “conventional library mode” but rather in an apparent chaos, meaning that their URLs have nothing to do with their “inherent” thematic order. Some Search Engines classify documents by “words” and some others attempt to do it thematically. Search Engine robots detect all the different words, acronyms, and symbols within each document and classify them accordingly, in as many different word, acronym, and symbol tables as are detected. This is not easy to perform, because they have to unveil the textual information first, normally embedded and dispersed in an ocean of editing frames and commands!
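In essence those word tables form what is usually called an inverted index. A minimal sketch, assuming the plain text has already been unveiled from its editing frames:

    # Minimal inverted index: one "table" (posting set) per distinct word found.
    import re
    from collections import defaultdict

    def index_documents(docs):
        """docs: {url: plain_text}. Returns {word: set of URLs containing it}."""
        tables = defaultdict(set)
        for url, text in docs.items():
            for word in re.findall(r"[A-Za-z0-9']+", text.lower()):
                tables[word].add(url)
        return tables

    tables = index_documents({
        "http://example.org/a": "Lung cancer treatment overview",
        "http://example.org/b": "Oncology research on lung tumors",
    })
    print(sorted(tables["lung"]))   # both documents appear in the "lung" table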

Some tasks current robots do not perform: It would be desirable to add some elementary thematic talent to these robots, such as unveiling the subjects and main topics dealt with in each document. To compensate for this current weakness, some thematic Search Engines enable Website authors to report this crucial information themselves. Unfortunately, many authors try to deceive Search Engines, and thereby mislead users, by choosing highly demanded themes and topics as bait. Many others fail to attract the right readers through ignorance of “established subject names”. Indeed, in any well-known discipline, like computing or medicine, all subjects have precise names that point to them. If authors inadvertently use slightly different names, even ones with the same human meaning, users may not locate them via Search Engines.

Note: Current Search Engines neither classify documents by their titles nor have the capability to unveil their subjects. The next generation of Search Engines will do this perfectly well once the whole Web is mapped, as we will see in the next posts.

Current “flat” indexing, all along a single-line depot: In a system basically driven by word indexing, like Google, we have billions of documents indexed by the many different words detected within their text content. That is not bad, and many Search Engines index almost everything generated “at the second and all over the world”; some of them, like Google, enhance searching efficiency with smart popularity algorithms. These algorithms guide users to find the “best” references among similar ones. Searching with these Search Engines is like fishing: smart users build specific baits that “catch” the best sets of references in a few trials.

Our thesis is that the best fishing baits correlate with established concepts belonging to established curricula, or to the modal curricula that may instead lie hidden in the Web. Satisfactory results correspond to specific pairs [keyword, subject], that is, a specific keyword belonging to a specific subject of a given discipline: what we have already defined as a concept. Our thesis goes further, stating that subjects are in turn associated with specific sets of keywords that only have sense (meaningfulness) when associated with them! Let’s see what these subtle differences mean for retrieval efficiency.
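The difference is easy to state in code: a plain word index answers “which documents contain this string?”, while a concept index answers “which documents use this keyword within this subject?”. A toy version, with invented data:

    # A concept is a [keyword, subject] pair: the same word maps to different concepts
    # depending on the subject it belongs to (data below is invented for illustration).
    concepts = {
        ("virus", "003 Medicine"):  ["http://example.org/flu"],
        ("virus", "005 Computing"): ["http://example.org/malware"],
    }

    def retrieve(keyword, subject):
        return concepts.get((keyword, subject), [])

    print(retrieve("virus", "005 Computing"))   # only the computing sense is returned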

Note: by modal curricula we mean the curricula that lie statistically behind the Web and that must be unveiled. They are the same as the curricula as_they_are in the Web.

The actual game of searching: At present, documents are located at their respective URLs. These documents are indexed in as many different word tables as are found within their textual body. Users try to locate documents via a sort of interception game played with words. Search Engines render their best offer, namely the documents that best match the string of words, on average, under certain predetermined criteria. Finally, users decide on their own to click on some of the offered references. We do not know whether users were satisfied or not. The only feedback Search Engines are aware of is traffic measures: number of sessions per IP, query content, queries performed per unit of time and per user, and sometimes user segmentation by language, region, and IP. Search Engines suppose that if their traffic is good their service is good, like the relation between newspapers and their readership: if sales grow, owners infer that readers are, on average, satisfied.
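That interception game can be hedge-sketched as intersecting the word tables and ranking by popularity (all data below is invented; real engines use far more elaborate criteria):

    # Word-interception search: the documents offered are those whose word tables intersect.
    def search(query, tables, popularity):
        """Intersect the posting sets of the query words, then rank by popularity."""
        sets = [tables.get(w, set()) for w in query.lower().split()]
        hits = set.intersection(*sets) if sets else set()
        return sorted(hits, key=lambda u: popularity.get(u, 0.0), reverse=True)

    tables = {"lung":   {"http://example.org/a", "http://example.org/b"},
              "cancer": {"http://example.org/a"}}
    popularity = {"http://example.org/a": 0.9, "http://example.org/b": 0.4}
    print(search("lung cancer", tables, popularity))   # ['http://example.org/a']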

From our experience and experiments, the number of clicks needed to obtain satisfactory results strongly depends on the subject, the knowledge level of users, their navigation and retrieval experience and talent, and the depth and/or specificity of the information to be retrieved. Notwithstanding, it ranges from a few clicks (two to four) to a practical infinity, with users giving up their mission after investing hours, days, and sometimes weeks searching unsuccessfully. Under this panorama we may ask ourselves: is it possible to find something reasonably good in only a few clicks, from one to three, no matter how wise and smart users are? The answer is yes! All we need is to have the knowledge properly mapped!

Next Search Engine architecture, IR oriented: Going back to our imaginary gym, what if we assigned to each document its “right place” within a huge Virtual Library? The right place would be the “shelf” that corresponds exactly to its “main subject”. We may also implement such a library indirectly, just accommodating all documents along a huge one-dimensional list reservoir. In effect, the whole Web could be considered such a “list”, managed by the net of routers. This list could be imagined as all URLs listed alphabetically as triads

 [URL,
  (discipline, subject_level_1.subject_level_2. … .subject_level_n),
  keywords]


for a document dealing with a level-n subject. This list could be open at both ends, enabling new URLs to be inserted, and old ones deleted, at any time. This list could be assimilated to a knowledge bookmark. Another, better-organized way to see it would be as “logical trees”, one for each discipline of knowledge, with, within each node, its semantic profile constituted by all the keywords that belong to it, a head that defines its architecture and parameters (a class), and a specific Virtual Library with references to Authorities, that is, documents that deal properly with the subject. This is also a peculiar knowledge bookmark, one that may map the whole of human knowledge registered in the Web at any moment! Is that mapping possible? Yes, it is. Not trivial, of course. We will try to explain it stepwise along the next posts.
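A minimal sketch of that one-dimensional reservoir of triads, kept sorted by URL and open at both ends (the structure and field layout are our assumptions for illustration):

    # The Web as a sorted, open-ended list of [URL, subject path, keywords] triads.
    import bisect

    class TriadList:
        def __init__(self):
            self._triads = []   # kept sorted by URL

        def insert(self, url, subject_path, keywords):
            """subject_path: e.g. ('Medicine', '01.17.35'); keywords: iterable of strings."""
            bisect.insort(self._triads, (url, subject_path, tuple(keywords)))

        def delete(self, url):
            self._triads = [t for t in self._triads if t[0] != url]

        def lookup(self, url):
            i = bisect.bisect_left(self._triads, (url,))
            if i < len(self._triads) and self._triads[i][0] == url:
                return self._triads[i]
            return None

    web = TriadList()
    web.insert("http://example.org/a", ("Medicine", "01.17.35"), ["lung", "cancer"])
    print(web.lookup("http://example.org/a"))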

A common place, data navigators and explorers: This knowledge architecture resembles the way computer data is organized in folders and files: the “root” folder sits at “level 0” and opens into subfolders, or folders of “level 1”, each of these subfolders opens into folders of “level 2”, and so on. Each folder is associated with a “node” of the data logical tree. Now, what is inside each node? Only subordinate folders? No: nodes hold both folders and files. Following the tree downstream we finally find nodes that only hold files. Then we have to ask ourselves: what is the criterion for saving files and nodes within nodes? The answer is logical but not so trivial.

Something about WWDs, Well Written Documents: Concerning knowledge, in any of these folders we keep the folders that deal with subordinate subjects, and we only save in it files that significantly deal with this “mother” subject in a broad sense. As each node has a “neighborhood” (upward folders, except for the root; downward folders; and collateral folders), we as users are almost always challenged by file-allocation uncertainties. As an example, let our folder be identified by the code 1.2.5.3, pointing to a 4th-level subject, 1.2.5 being its mother and nodes 1.2.5.3.1 to 1.2.5.3.7 corresponding to its seven 5th-level subordinate subjects. Within this realm we will continuously face the challenge of saving files in the right folder, guided by our personal criteria and taking into account that authors are not robots but humans. A document that according to our judgment belongs in node 1.2.5.3 was saved there because our mind applied a sort of optimum allocation algorithm, one that somehow weighs the thematic wandering along the whole document while focusing on subject 1.2.5.3.

And we as humans will justify our decision with reasoning of the following type: mentions of and citations to the mother node 1.2.5 were an obliged contextual reference to orient readers (beginners, for example), while citations belonging to subordinate nodes 1.2.5.3.4 and 1.2.5.3.7 were considered necessary to convince specialists. This is the normal way scientists and technicians all over the world write their papers: the core of the thesis uses established concepts as much as possible, daring to introduce a few new ones; a little wandering upwards, as if looking for contextual backup and justification; a little wandering downwards to give examples; and a little collateral wandering to cite similar and related efforts.
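A hedged sketch of that “optimum allocation” intuition (the weighting is invented for the example): count how a document’s citations distribute over the neighborhood nodes, lightly boost specificity, and file the document under the dominant node.

    # Toy allocation: file a document under the neighborhood node it cites most,
    # with a small boost for more specific (deeper) codes.
    from collections import Counter

    def allocate(citations):
        """citations: node codes cited by a document, e.g. '1.2.5.3'."""
        counts = Counter(citations)
        return max(counts, key=lambda node: counts[node] * (1.0 + 0.1 * node.count(".")))

    # The example from the text: context cites to mother 1.2.5, core cites to 1.2.5.3,
    # and a few specialist cites to subordinates 1.2.5.3.4 and 1.2.5.3.7.
    cites = ["1.2.5"] * 2 + ["1.2.5.3"] * 6 + ["1.2.5.3.4"] * 2 + ["1.2.5.3.7"]
    print(allocate(cites))   # -> '1.2.5.3'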

Authorities: We have made intensive use of the word “Authorities”. In this context, authority refers to legitimacy, justification, and recognized merit to spread theories, ideas, and concepts. Authorities are websites that emanate “authoritativeness”. We are talking about a cognitive authoritativeness that is guaranteed by some agreed mechanism. One example of agreed certification is the Library of Congress Authorities used in conventional libraries. Certifying that a given document hosted on the Web is an Authority would be a controversial matter even when performed by content experts. In our Darwin technology, subject authoritativeness is a function of how Well Written a document is.

Most doctoral and master’s theses are WWDs; news articles from important newspapers are also WWDs; e-books and tutorials are also generally WWDs; and legislation documents, testimonies, and online courses are also generally WWDs. When WW (Well Written) tests on a document are hard to perform, or their arguments appear confused or ambiguous, authoritativeness could be guaranteed if the document was issued backed by a world-recognized “alma mater” in its widest sense. Agents may list documents of high popularity that, despite having a poor WW factor, may be considered authorities because they hang hierarchically from prestigious domains.

As authorities of low popularity may perfectly well exist, Darwin agents may be instructed to select documents that satisfy all the concepts of the subject to which they belong! All these criteria (WW and WWD) are approximations based on large numbers. Take into consideration that we are talking about subjects pointing to corresponding document tanks that may hold from 100,000 to 100,000,000 pieces, and that from these tanks we have to extract 1% or less as authorities. It is highly possible that within these sets of automatically unveiled authorities there are many errors; let’s say 10% of a sample of 10,000 authorities. As all Darwin maps are enabled to evolve by themselves as a function of traffic, the “wrong” i-URLs will at large be virtually killed by users.
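That “virtual killing” by users can be hedge-sketched as periodic pruning by demand (the threshold and data are invented for illustration):

    # Illustrative Darwinian pruning: i-URLs whose demand stays below a threshold
    # during a review period are dropped from the authority set.
    def prune(authorities, min_hits=5):
        """authorities: list of (url, hits) pairs observed during one review period."""
        survivors = [(url, 0) for url, hits in authorities if hits >= min_hits]
        return survivors   # counters reset for the next period

    period = [("http://example.org/good", 120), ("http://example.org/wrong", 1)]
    print(prune(period))   # the barely-demanded "wrong" authority is virtually killed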