Metadata Extension

From Gnutella Developers

By Sumeet Thadani, LimeWire LLC

Copyright 2001 LimeWire LLC

Abstract

LimeWire and other clients on the Gnutella network currently respond to searches by doing string matching between the query and the names of files on the hosts’ library. Consequently searches are restricted to strings that can be contained in a filename.

We propose a technique for allowing richer querying. Every file in a host’s library may have some meta-data associated with it. Query Requests will encode the richer queries, and responses will contain results based on the rich query searches in addition to the regular results. The proposed scheme ensures that the protocol continues to work with older clients, which do not understand the embedded rich queries.

Introduction

LimeWire and other clients on the Gnutella network currently respond to searches by doing string matching between the query and the names of files on the hosts’ library. Consequently searches are restricted to strings that can be contained in a filename and directory path.

An example of a rich query would be a user looking for a book titled “The Big Bang – Origin of the universe”, which was written by Mr. John Doe and published by the ABC Publishing Co. in March 1997. We will use this sample query throughout this paper to illustrate various points.

If anyone actually had the book, the file would probably be called “The Big-Bang.txt”. If a user searched for “Big Bang” on the current system she would get the correct response, along with numerous others, which are not really of interest to her.

The idea is to allow users to give specific information in queries, and thus be able to search more efficiently for the information that really concerns them.

Each file in a user’s library can be associated with multiple sets of meta-data tags. To use the running example, the file “Big-Bang.txt” could be associated with multiple sets of meta-data tags – each of which is an XML document (a document is thus a set of related information about the file).

A file may be associated with multiple documents, but only one document corresponding to each schema. A schema defines the format of the information a document may contain; it is like a blueprint for the documents that correspond to it. In the example, the schema we would use to annotate the file Big-Bang.txt would be the book schema (which contains fields Title, Author, Publisher, Language, Cover Image, etc.). See Appendix A to review some of the schemas that will be in the first release of XML-enabled LimeWire.

A document about Big-Bang.txt pertaining to the book schema could contain this information:

  • Publisher = “ABC Publishing Co.”
  • Title = “The Big Bang – Origin of the universe.”
  • Author = “John Doe”

Once this information is associated with the file, some user may be able to search for it by doing a search using the book schema, and specifying that she is looking for the Author = “John Doe”, or Publisher = “ABC Publishing Co”, or by specifying a combination of these attributes.

Implementation

Encoding

The meta-information will be encoded in XML. Note that it’s possible to use other encoding systems (like binary encoding) to represent the grammar we wish other clients to be able to understand (and respond to).

Please see Appendix B for a brief discussion on encoding the meta-information in binary. It also talks about the advantages of using XML over binary encoding.

Every schema is associated with a URI. The URI serves two purposes. First, it uniquely identifies a schema. A rich query contains the URI of the schema it corresponds to, so when replying to the query each servant needs to look only at documents corresponding to that schema. The servant then sends back a query reply containing the files whose documents match the criteria in the rich query.

The second reason for having the URI is that if a client is not aware of the schema corresponding to the URI of a rich search, the client should try to fetch the schema from the location the URI specifies. Since the client has not encountered this schema before, it cannot reply to the query at hand, but downloading the schema prepares it for future queries. This feature is not yet implemented, but we would like to see it become a reality in a future version of LimeWire.

Here is a sample rich query.


<books>
<book schema="http://www.limewire.com/schemas/book.xsd"
      title="Big Bang – Origin of the universe"
      publisher="ABC Publishing Co"
      author="John Doe"
/>
</books>

This query omits fields that are in the schema but were not specified by the user. For example, the schema contains the field genre, but since the user did not specify a genre in her search, the query does not mention one. This saves a lot of bandwidth, particularly since there are many more queries than responses on the network.
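As a rough illustration of how a client might assemble such a query, the sketch below builds the XML from a dictionary of user-entered fields and drops every field left empty. The function name build_rich_query and its parameters are hypothetical, and Python's standard xml.etree module stands in for whatever XML library a real client would use.

```python
import xml.etree.ElementTree as ET

BOOK_SCHEMA_URI = "http://www.limewire.com/schemas/book.xsd"


def build_rich_query(plural, singular, schema_uri, fields):
    """Build a rich query string, leaving out fields the user did not fill in.

    Hypothetical helper: the element names and the schema attribute follow
    the sample query in the text, nothing more.
    """
    root = ET.Element(plural)
    attrs = {"schema": schema_uri}
    for name, value in fields.items():
        if value:  # unspecified fields are omitted to save bandwidth
            attrs[name] = value
    ET.SubElement(root, singular, attrs)
    return ET.tostring(root, encoding="unicode")


query = build_rich_query("books", "book", BOOK_SCHEMA_URI, {
    "title": "Big Bang – Origin of the universe",
    "publisher": "ABC Publishing Co",
    "author": "John Doe",
    "genre": "",  # present in the schema, but not entered by the user
})
print(query)
```

The empty genre field never reaches the wire, which is the bandwidth saving described above.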

One response to the above query may look like this:


<?xml version="1.0"?>
<books xsi:noNamespaceSchemaLocation="http://www.limewire.com/schemas/book.xsd">
<book  identifier="C:\shared\Big-Bang.txt"
       title="Origin of the Universe"
       author="John B Doe"
       chapters="11"
       genre="Non-fiction"
       publisher="ABC Publishing Co"
       price="$12.50"
       comments="Extremely well written"
/>
<book identifier="abc.txt"
      title="Big Bang Universe"
      publisher="ABC Publishing Co LLC"
      author="John Doeky"
      chapters="14"
      comments="Not about the universe at all"
/>
</books>

This response says that there are two files, called “Big-Bang.txt” and “abc.txt”, both of which have annotations that make them candidates for matching the sample query.

Notice that the XML is all in one big chunk. The original proposal was to put smaller chunks of XML between the double nulls of responses. The chunk of XML found between the double nulls following a response would then be an XML document pertaining to that response.

We decided to collate all the XML and put it in one chunk for two reasons. First, this way of encoding the XML reduces redundancy by preventing several tags and the URI from being repeated in the rich part of each response within the Query Reply. Second, if all the XML is in one chunk, we can effectively use a compression algorithm to make it even smaller; compression would be difficult to achieve with several smaller chunks of XML. Compression is discussed in more detail later in the paper.

One last point about aggregate XML in Query Replies: if a particular Query Reply contains aggregate XML strings pertaining to more than one schema, they must be separated from each other using a delimiter. The chosen delimiter string is <?xml version="x"?>, where x is the version of the XML being used.
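A receiving client could separate the per-schema strings again by splitting on that delimiter. The following is only a sketch: the helper name split_aggregate_xml is hypothetical, the audios/audio element names in the sample are made up for illustration, and the regular expression assumes the delimiter appears exactly in the form given above.

```python
import re

# Matches the delimiter <?xml version="x"?> for any version string x.
XML_DELIMITER = re.compile(r'<\?xml version="[^"]*"\?>')


def split_aggregate_xml(aggregate):
    """Return one XML string per schema found in an aggregate XML area."""
    parts = XML_DELIMITER.split(aggregate)
    # Splitting leaves an empty piece before the first delimiter; drop it.
    return [part.strip() for part in parts if part.strip()]


aggregate = (
    '<?xml version="1.0"?><books><book title="Big Bang"/></books>'
    '<?xml version="1.0"?><audios><audio title="Some Song"/></audios>'
)
chunks = split_aggregate_xml(aggregate)
for chunk in chunks:
    print(chunk)
```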

Query Requests

Let’s use the running example while taking a closer look at Query Requests. When the user wants to do a rich search, she should choose the schema of the rich search from the drop-down menu as shown in the figure. The schemas the user can choose from have “...” following the name of the schema, to indicate that the user will have to enter more information.

[image removed]

The user chooses the book schema, and populates the fields to create the rich query.

[image removed]

Then the user clicks on search.

The Query Request is created with this format:

  • Bytes 0 and 1: Minimum speed (Not changed)
  • Bytes 2 to null termination (say position x): normal query string (Not changed)
  • Bytes x+1 to next null termination: rich Query (Added)

This is illustrated in the figure below:

Original Query:

[image removed]

The proposed new Query Request will look like this:

[image removed]

Older clients that do not understand rich queries will simply ignore everything after the first null. Newer clients that can respond to the rich query will understand that the first null is not the end of the packet and that a rich query follows it.
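The byte layout above can be sketched as a pair of helper functions. Two details are assumptions rather than part of this proposal: the minimum speed is packed as a little-endian 16-bit integer, and both strings are encoded as UTF-8. The function names are hypothetical.

```python
import struct


def build_query_payload(min_speed, query, rich_query=""):
    """Pack minimum speed, the normal query, and an optional rich query."""
    payload = struct.pack("<H", min_speed)       # bytes 0 and 1
    payload += query.encode("utf-8") + b"\x00"   # normal query, null terminated
    if rich_query:
        payload += rich_query.encode("utf-8") + b"\x00"  # rich query (added)
    return payload


def parse_query_payload(payload):
    """Return (min_speed, query, rich_query); rich_query is '' if absent."""
    (min_speed,) = struct.unpack_from("<H", payload, 0)
    first_null = payload.index(b"\x00", 2)  # skip the two speed bytes
    query = payload[2:first_null].decode("utf-8")
    rest = payload[first_null + 1:]
    end = rest.find(b"\x00")
    if end == -1:
        end = len(rest)
    return min_speed, query, rest[:end].decode("utf-8")


rich = '<books><book author="John Doe"/></books>'
payload = build_query_payload(0, "Big Bang", rich)

# An older client stops at the first null and never sees the rich part.
legacy_view = payload[2:payload.index(b"\x00", 2)].decode("utf-8")
print(legacy_view)
print(parse_query_payload(payload))
```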

Query Reply

If a rich query reaches an older client that does not recognize rich queries, it will do a normal search and send out a “normal reply”, based on a string match between the normal part of the Query and the file names in the user’s library.

If the rich query reaches a newer client that understands rich queries, the search proceeds on the basis of the rich query (and also the normal query) and the client generates a Query Reply and sends it out just like a normal reply. The only difference is that the client that responds to the rich part of the query will also pack some XML into the QHD of the Query Reply.

An original query reply with two matches for files is illustrated below.

[image removed]

The only difference between a Query Reply that contains XML and one that does not exists within the QHD of the Query Reply. The figure below explains the new format of the QHD.

The layout is described below (all byte numbers are inclusive):


Byte                                       Description
0 ... 3                                    QHD Vendor code (4 ASCII characters).
4                                          Public area size 's'.
5 ... 6                                    Public area. Two first bytes (left as it is now).
7 ... 8                                    The size m of the XML area (the end of the private area).
9 ... 4 + s                                (reserved for optional public area extensions).
5 + s ... sizeOfMessage - 16 - m - 1       Private area. See GEM, HUGE, and GGEP.
sizeOfMessage - 16 - m ... z - 1           XML area. XML payload.
z                                          Required null byte (to indicate end of XML payload).
z+1 ... sizeOfMessage - 17                 (reserved for future use).
sizeOfMessage - 16 ... sizeOfMessage - 1   Client GUID (last 16 bytes of the message).


Note 1: The size of the XML is specified in little-endian format.

Note 2: Since we have 2 bytes to specify the size of the XML, the XML string could theoretically be as large as 64K (a single byte could express a size of at most 255 bytes, which is too small). We encourage keeping the XML string as small as possible. Remember that most clients will drop any message whose size is more than 64K.

Since the XML string can be potentially huge, we may compress the XML String. This is explained in the section talking about compression.
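To make the byte offsets concrete, here is a minimal sketch that splits such a block into its areas. It assumes the input starts at the vendor code and ends with the Client GUID, so that its length plays the role of sizeOfMessage. The function name parse_qhd is hypothetical and the vendor code and private-area bytes are sample values. Note that the sketch locates the end of the XML payload by scanning for the null byte, which only works for uncompressed XML; a compressed payload may itself contain zero bytes, in which case the size field would have to be used instead.

```python
import struct

GUID_SIZE = 16


def parse_qhd(block):
    """Split a QHD-plus-GUID block into vendor, private area, XML and GUID."""
    size = len(block)
    vendor = block[0:4].decode("ascii")        # bytes 0..3
    s = block[4]                               # byte 4: public area size
    if s < 4:
        raise ValueError("public area too small to hold the XML size field")
    (m,) = struct.unpack_from("<H", block, 7)  # bytes 7..8, little-endian
    private_start = 5 + s
    xml_start = size - GUID_SIZE - m
    if xml_start < private_start:
        raise ValueError("XML size does not fit inside the block")
    xml_area = block[xml_start:size - GUID_SIZE]
    z = xml_area.find(b"\x00")                 # required terminating null
    if z == -1:
        raise ValueError("XML area is missing its terminating null")
    return {
        "vendor": vendor,
        "private": block[private_start:xml_start],
        "xml": xml_area[:z],
        "guid": block[size - GUID_SIZE:],
    }


xml = b'<books><book title="Big Bang"/></books>'
m = len(xml) + 1  # XML payload plus the null byte
block = (b"LIME" + bytes([4]) + b"\x00\x00" + struct.pack("<H", m)
         + b"PRIV" + xml + b"\x00" + bytes(range(16)))
parsed = parse_qhd(block)
print(parsed["vendor"], parsed["xml"])
```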

Meta-Data creation

The following steps illustrate how a file can be annotated according to a schema.

The file to be annotated is selected, and an “annotate” action is chosen from the file menu. This is illustrated in the figure below:

[image removed]

The schema with which we are annotating this file is chosen and fields are filled according to the selected schema:

[image removed]

Compression

This last section of the paper describes XML compression in LimeWire.

The compressed XML will be prefaced with a string of the following format:

{%C_TYPE}

This prefix is white-space trimmed and is optional.

At this time, the value of %C_TYPE may be any of the following:

  • “deflate” - The “zlib” format defined in RFC 1950 in combination with the "deflate" compression mechanism described in RFC 1951.
  • “plaintext” - No compression performed.

In future we may want to support more compression algorithms like xmill; and we may choose the prefix string "xmill" for it. But for the moment the prefix string may have only the two values mentioned above. If the prefix is not present, clients should interpret that as there being no compression.

In the future, when vendors support more than one compression standard, it is possible that a client sees a string in the prefix for which it does not know the de-compression algorithm. In this case the XML should be discarded as undecipherable.

Note that the size field (described above as the size of the XML payload + 1 for the null) is calculated as follows: (size of XML after compression + length of the compression header) + 1.
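The prefix handling and the size calculation might look like the sketch below. It reads the prefix format literally, so a compressed payload starts with the bytes {deflate}; that reading, the UTF-8 encoding, and the function names are assumptions. Python's zlib.compress produces the RFC 1950 zlib format around a deflate stream, which matches the “deflate” option described above.

```python
import zlib


def encode_xml_area(xml, compress=True):
    """Return prefix + (optionally compressed) XML + terminating null."""
    raw = xml.encode("utf-8")
    if compress:
        body = b"{deflate}" + zlib.compress(raw)
    else:
        body = b"{plaintext}" + raw
    return body + b"\x00"


def decode_xml_area(area):
    """Inverse of encode_xml_area; a missing prefix means no compression."""
    if not area.endswith(b"\x00"):
        raise ValueError("XML area is missing its terminating null")
    body = area[:-1].lstrip()
    if body.startswith(b"{"):
        end = body.index(b"}")
        c_type = body[1:end].decode("ascii").strip()
        data = body[end + 1:]
    else:
        c_type, data = "plaintext", body
    if c_type == "deflate":
        return zlib.decompress(data).decode("utf-8")
    if c_type == "plaintext":
        return data.decode("utf-8")
    # Unknown algorithm: the XML is discarded as undecipherable.
    raise ValueError("unknown compression type: " + c_type)


xml = '<books><book title="Big Bang" author="John Doe"/></books>'
area = encode_xml_area(xml)
size_field = len(area)  # compressed XML + compression header + 1 for the null
print(size_field, decode_xml_area(area) == xml)
```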

Appendix A

The audio schema


<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/audio.xsd">
  <element name="audio" type="audioType"/>
  <complexType name="audioType">
    <all>
      <element name="title" type="string"/>
      <element name="artist" type="string"/>
      <element name="album" type="string"/>
      <element name="track" type="short"/>
      <element name="genre">
        <simpleType base="string">
          <enumeration value="Blues"/>
          <enumeration value="Classic Rock"/>
          <enumeration value="Country"/>
          <enumeration value="Dance"/>
          <enumeration value="Disco"/>
          <enumeration value="Funk"/>
          <enumeration value="Grunge"/>
          <enumeration value="Hip-Hop"/>
          <enumeration value="Jazz"/>
          <enumeration value="Metal"/>
          <enumeration value="New Age"/>
          <enumeration value="Oldies"/>
          <enumeration value="Other"/>
          <enumeration value="Pop"/>
          <enumeration value="R &amp; B"/>
          <enumeration value="Rap"/>
          <enumeration value="Reggae"/>
          <enumeration value="Rock"/>
          <enumeration value="Techno"/>
          <enumeration value="Industrial"/>
          <enumeration value="Alternative"/>
          <enumeration value="Ska"/>
          <enumeration value="Death Metal"/>
          <enumeration value="Pranks"/>
          <enumeration value="Soundtrack"/>
          <enumeration value="Euro-Techno"/>
          <enumeration value="Ambient"/>
          <enumeration value="Trip-Hop"/>
          <enumeration value="Vocal"/>
          <enumeration value="Jazz+Funk"/>
          <enumeration value="Fusion"/>
          <enumeration value="Trance"/>
          <enumeration value="Classical"/>
          <enumeration value="Instrumental"/>
          <enumeration value="Acid"/>
          <enumeration value="House"/>
          <enumeration value="Game"/>
          <enumeration value="Sound Clip"/>
          <enumeration value="Gospel"/>
          <enumeration value="Noise"/>
          <enumeration value="AlternRock"/>
          <enumeration value="Bass"/>
          <enumeration value="Soul"/>
          <enumeration value="Punk"/>
          <enumeration value="Space"/>
          <enumeration value="Meditative"/>
          <enumeration value="Instrumental Pop"/>
          <enumeration value="Instrumental Rock"/>
          <enumeration value="Ethnic"/>
          <enumeration value="Gothic"/>
          <enumeration value="Darkwave"/>
          <enumeration value="Techno-Industrial"/>
          <enumeration value="Electronic"/>
          <enumeration value="Pop-Folk"/>
          <enumeration value="Eurodance"/>
          <enumeration value="Dream"/>
          <enumeration value="Southern Rock"/>
          <enumeration value="Comedy"/>
          <enumeration value="Cult"/>
          <enumeration value="Gangsta"/>
          <enumeration value="Top 40"/>
          <enumeration value="Christian Rap"/>
          <enumeration value="Pop/Funk"/>
          <enumeration value="Jungle"/>
          <enumeration value="Native American"/>
          <enumeration value="Cabaret"/>
          <enumeration value="New Wave"/>
          <enumeration value="Psychadelic"/>
          <enumeration value="Rave"/>
          <enumeration value="Showtunes"/>
          <enumeration value="Trailer"/>
          <enumeration value="Lo-Fi"/>
          <enumeration value="Tribal"/>
          <enumeration value="Acid Punk"/>
          <enumeration value="Acid Jazz"/>
          <enumeration value="Polka"/>
          <enumeration value="Retro"/>
          <enumeration value="Musical"/>
          <enumeration value="Rock &amp; Roll"/>
          <enumeration value="Hard Rock"/>
          <enumeration value="Folk"/>
          <enumeration value="Folk-Rock"/>
          <enumeration value="National Folk"/>
          <enumeration value="Swing"/>
          <enumeration value="Fast Fusion"/>
          <enumeration value="Bebob"/>
          <enumeration value="Latin"/>
          <enumeration value="Revival"/>
          <enumeration value="Celtic"/>
          <enumeration value="Bluegrass"/>
          <enumeration value="Avantgarde"/>
          <enumeration value="Gothic Rock"/>
          <enumeration value="Progressive Rock"/>
          <enumeration value="Psychedelic Rock"/>
          <enumeration value="Symphonic Rock"/>
          <enumeration value="Slow Rock"/>
          <enumeration value="Big Band"/>
          <enumeration value="Chorus"/>
          <enumeration value="Easy Listening"/>
          <enumeration value="Acoustic"/>
          <enumeration value="Humour"/>
          <enumeration value="Speech"/>
          <enumeration value="Chanson"/>
          <enumeration value="Opera"/>
          <enumeration value="Chamber Music"/>
          <enumeration value="Sonata"/>
          <enumeration value="Symphony"/>
          <enumeration value="Booty Bass"/>
          <enumeration value="Primus"/>
          <enumeration value="Porn Groove"/>
          <enumeration value="Satire"/>
          <enumeration value="Slow Jam"/>
          <enumeration value="Club"/>
          <enumeration value="Tango"/>
          <enumeration value="Samba"/>
          <enumeration value="Folklore"/>
          <enumeration value="Ballad"/>
          <enumeration value="Power Ballad"/>
          <enumeration value="Rhythmic Soul"/>
          <enumeration value="Freestyle"/>
          <enumeration value="Duet"/>
          <enumeration value="Punk Rock"/>
          <enumeration value="Drum Solo"/>
          <enumeration value="A capella"/>
          <enumeration value="Euro-House"/>
          <enumeration value="Dance Hall"/>
        </simpleType>
      </element>
      <element name="type">
        <simpleType base="string">
          <enumeration value="Song"/>
          <enumeration value="Speech"/>
          <enumeration value="Audiobook"/>
          <enumeration value="Other"/>
        </simpleType>
      </element>
      <element name="seconds" type="int"/>
      <element name="year" type="year"/>
      <element name="language" type="language"/>
      <element name="SHA1" type="int"/>
      <element name="bitrate" type="short"/>
      <element name="price" type="string"/>
      <element name="link" type="uriReference"/>
      <element name="comments">
        <simpleType base="string">
          <maxInclusive value="100"/>
        </simpleType>
      </element>
    </all>
  </complexType>
</schema>

The video schema


<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/xerces/video.xsd">
  <element name="video" type="videoType"/>
  <complexType name="videoType">
    <element name="Title" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Director" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Producer" type="string" minOccurs="0" maxOccurs="2"/>
    <element name="Studio" type="short" minOccurs="0" maxOccurs="1"/>
    <element name="Stars" type="string" minOccurs="0" maxOccurs="3"/>
    <element name="Type" minOccurs="0" maxOccurs="1">
      <simpleType base="string">
        <enumeration value="Music Video"/>
        <enumeration value="Commercial"/>
        <enumeration value="Trailer"/>
        <enumeration value="Movie Clip"/>
        <enumeration value="Video Clip"/>
        <enumeration value="VHS Movie"/>
        <enumeration value="DVD Movie"/>
        <enumeration value="Other"/>
      </simpleType>
    </element>
    <element name="Minutes" type="decimal" minOccurs="0" maxOccurs="1"/>
    <element name="Size" minOccurs="0" maxOccurs="1">
      <complexType>
         <element name="Pixel Height" type="int"/>
         <element name="Pixel Width" type="int"/>
      </complexType>
    </element>
    <element name="Year" type="year" minOccurs="0" maxOccurs="1"/>
    <element name="Langauge" type="language" minOccurs="0" maxOccurs="1"/>
    <element name="Subtitles" type="language" minOccurs="0" maxOccurs="3"/>
    <element name="SHA1" type="int" minOccurs="0" maxOccurs="1"/>
    <element name="Rating" minOccurs="0" maxOccurs="1">
      <simpleType base="string">
        <enumeration value="G"/>
        <enumeration value="PG"/>
        <enumeration value="PG-13"/>
        <enumeration value="R"/>
        <enumeration value="NC-17"/>
        <enumeration value="NR"/>
      </simpleType>
    </element>
    <element name="Availability" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Price" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Shipping" type="string" minOccurs="0" maxOccurs="1"/>
    <element name="Link" type="uriReference" minOccurs="0" maxOccurs="1"/>
    <element name="Comments" minOccurs="0" maxOccurs="1">
      <simpleType base="string">
        <maxInclusive value="100"/>
      </simpleType>
    </element>
  </complexType>
</schema>

The book schema


<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/xerces/book.xsd">
  <element name="book" type="bookType"/>
  <complexType name="bookType">
    <all>
      <element name="Title" type="string"/>
      <element name="Edition" type="short"/>
      <element name="Author" type="string"/>
      <element name="Publisher" type="string"/>
      <element name="Genre">
        <simpleType base="string">
          <enumeration value="Computers, Information, and Reference"/>
          <enumeration value="Philosophy and Psychology"/>
          <enumeration value="Religion"/>
          <enumeration value="Social Sciences"/>
          <enumeration value="Language"/>
          <enumeration value="Science"/>
          <enumeration value="Technology"/>
          <enumeration value="Arts and Recreation"/>
          <enumeration value="History and Geography"/>
          <enumeration value="General Fiction"/>
          <enumeration value="Classics"/>
          <enumeration value="Crime"/>
          <enumeration value="Science Fiction"/>
          <enumeration value="Poetry"/>
          <enumeration value="Children and Teens"/>
          <enumeration value="Other"/>
        </simpleType>
      </element>
      <element name="Subject" type="string"/>
      <element name="Pages" type="int"/>
      <element name="Year" type="year"/>
      <element name="Language" type="language"/>
      <element name="SHA1" type="int"/>
      <element name="ISBN">
        <simpleType base="int">
          <pattern value="\d{10}"/>
        </simpleType>
      </element>
      <element name="Dimensions">
        <complexType>
          <element name="Length" type="decimal"/>
          <element name="Width" type="decimal"/>
          <element name="Height" type="decimal"/>
        </complexType>
      </element>
      <element name="Back">
        <simpleType base="string">
          <enumeration value="Hardback"/>
          <enumeration value="Paperback"/>
        </simpleType>
      </element>
      <element name="Availability" type="string"/>
      <element name="Price" type="string"/>
      <element name="Shipping" type="string"/>
      <element name="Link" type="uriReference"/>
      <element name="Comments">
        <simpleType base="string">
          <maxInclusive value="100"/>
        </simpleType>
      </element>
    </all>
  </complexType>
</schema>

Appendix B - Binary format for encoding meta-data

It’s possible to encode the meta-information in binary format.

First, let’s take a look at how this scheme would work.

In Query Replies the metadata could be placed between the nulls or in the QHD as proposed in this document; and in Queries, the metadata will be after the first null. However the structure of the metadata itself will be different.

Query Requests

Let’s use the running example. A Query Request in our example would normally have three fields (maybe fewer) in the Rich Query String. When using XML tags, the values of those fields would be encoded within the tags, to indicate what data they represent.

When using a binary encoding scheme, the schema becomes very important. The first part of the rich query must contain the URI of the schema, followed by a delimiter. This is followed by the individual values for each of the fields in the schema, each separated by a delimiter. If the user did not populate a particular field while doing the search, that field must still appear in the Rich Query String with a null (or zero) value. Why? Because only the schema now defines which field to expect in which position, order becomes essential and no field can be skipped.

Query Replies

The concept is the same as Query Requests. The various fields in the Rich Query Reply must be delimiter separated and they must all be in the same order as specified in the schema. Again, if a certain field does not have a value, it should still be included in the Rich Query Reply to ensure that the ordering remains consistent with the schema.
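The positional scheme can be sketched in a few lines. Nothing here is specified by this appendix beyond "URI first, then every field in schema order": the delimiter byte (the ASCII unit separator), the shortened field list, and the function names are all assumptions chosen for illustration.

```python
DELIMITER = "\x1f"  # assumed delimiter; the appendix does not name one

# Shortened, illustrative field order taken from the start of the book schema.
BOOK_FIELDS = ["Title", "Edition", "Author", "Publisher", "Genre"]
BOOK_SCHEMA_URI = "http://www.limewire.com/schemas/book.xsd"


def encode_positional(schema_uri, field_order, values):
    """Emit the URI, then one slot per schema field, empty if unspecified."""
    slots = [values.get(name, "") for name in field_order]
    return DELIMITER.join([schema_uri] + slots)


def decode_positional(encoded, field_order):
    """Recover the URI and the populated fields from their positions."""
    uri, *slots = encoded.split(DELIMITER)
    if len(slots) != len(field_order):
        raise ValueError("field count does not match the schema")
    populated = {name: value for name, value in zip(field_order, slots) if value}
    return uri, populated


user_fields = {
    "Title": "Big Bang",
    "Author": "John Doe",
    "Publisher": "ABC Publishing Co",
}
encoded = encode_positional(BOOK_SCHEMA_URI, BOOK_FIELDS, user_fields)
print(decode_positional(encoded, BOOK_FIELDS))
```

Even though only three fields were filled in, all five slots are transmitted, which is the ordering cost described above.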

The trade-offs

Many people have alleged that the use of XML is overkill, and that XML uses too much bandwidth.

We tested both schemes to compare their bandwidth requirements, and found that XML does require more bandwidth (as predicted), but the difference was tolerable. Further, if the XML were compressed before it was sent out on the wire, it would take up less bandwidth than the uncompressed binary encoding scheme. Compression works well on XML because of the high number of redundant repeated characters.

The advantages of using XML far outweigh the (small) cost in extra bandwidth. These advantages include:

  • XML allows for nesting: a tag name can have sub-elements like “First” and “Last”.
  • No need to make code changes for accommodating new types of searches.
  • Parsers are easily available.
  • XML is a well-understood standard.
  • Schemas are a standard, with tools constantly being developed for them.
  • Schemas allow for typing of fields and in future integrity constraints can also be checked.

