Querying the Network
Source - [Latest draft (http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html)]
Note that queries should now be managed by Ultrapeers according to the Dynamic Querying specification.
This chapter describes the use of Query and QueryHits messages.
Forwarding and routing of Query and Query Hit messages
A servent SHOULD forward incoming Query messages to all of its directly connected servents, except the one that delivered the incoming Query. Servents using Flow control or Ultrapeers (sections 3.1 and 3.2) will not always forward every Query over every connection.
A servent MUST decrement a message header's TTL field, and increment its Hops field, before it forwards the message to any directly connected servent. If, after decrementing the header's TTL field, the TTL field is found to be zero, the message MUST NOT be forwarded along any connection.
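The TTL and Hops rule above can be sketched as follows. This is only an illustration: the `Header` class and the `prepare_for_forwarding` function are hypothetical names, and a real header carries more fields than the three shown.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Header:
    """Hypothetical, reduced view of a message header."""
    message_id: bytes
    ttl: int
    hops: int


def prepare_for_forwarding(header: Header) -> Optional[Header]:
    """Return the header to forward, or None if the message must not be forwarded."""
    if header.ttl <= 0:
        # Defensive assumption: a message that arrives already expired is not forwarded.
        return None
    # Decrement TTL and increment Hops before forwarding.
    forwarded = replace(header, ttl=header.ttl - 1, hops=header.hops + 1)
    if forwarded.ttl == 0:
        # TTL reached zero after the decrement: do not forward on any connection.
        return None
    return forwarded
```

A message received with TTL=1 can still be processed locally; it simply is not passed on.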
A servent that receives a message with the same Payload Message and Message ID as one it has received before MUST discard the message, because it has already been seen.
QueryHit messages MUST only be sent along the same path that carried the incoming Query message. This ensures that only those servents that routed the Query message will see the QueryHit message in response. A servent that receives a QueryHit message with Message ID = n, but has not seen a Query message with Message ID = n SHOULD remove the QueryHit message from the network.
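One possible way to implement duplicate detection and reverse-path routing is a table that remembers, for each Query Message ID, the connection it first arrived on. The sketch below is a minimal illustration under that assumption; the `QueryRouter` class and its method names are hypothetical, and a real servent would also expire old entries to bound memory use.

```python
from typing import Dict, Hashable, Optional


class QueryRouter:
    """Hypothetical routing table keyed by Query Message ID."""

    def __init__(self) -> None:
        # Message ID of each Query seen -> connection it first arrived on.
        self._origin: Dict[bytes, Hashable] = {}

    def accept_query(self, message_id: bytes, connection: Hashable) -> bool:
        """Record a Query; return False if it is a duplicate to be discarded."""
        if message_id in self._origin:
            return False
        self._origin[message_id] = connection
        return True

    def route_query_hit(self, message_id: bytes) -> Optional[Hashable]:
        """Return the connection to send a QueryHit back on, or None to drop it."""
        return self._origin.get(message_id)
```

A QueryHit whose Message ID is unknown yields `None`, which corresponds to removing it from the network.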
When and how to send new Query messages
Query messages are usually sent when the user initiates a search. A servent MAY also create Queries automatically, for example to find more locations of a resource. If it does so, the servent MUST be very careful not to overload the network. A servent SHOULD NOT send more than one automatic query per hour.
Servents SHOULD NOT allow the user to create a large number of queries by repeatedly clicking on a button.
Servents SHOULD watch queries originating from their neighbours (Hops=0). If those queries are too frequent, are duplicates, or indicate bad servent behavior in any other way, the servent SHOULD drop those queries or even close the connection.
The TTL value of a new query created by a servent SHOULD NOT be higher than 7, and MUST NOT be higher than 10. The hops value MUST be set to 0.
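The limits in this section can be sketched as follows. The function and class names are hypothetical, the header is reduced to a plain dictionary, and generating the 16-byte Message ID from random bytes is an assumption made for the example rather than a rule stated here.

```python
import os
from typing import Optional

RECOMMENDED_MAX_TTL = 7   # SHOULD NOT be exceeded for new queries
ABSOLUTE_MAX_TTL = 10     # MUST NOT be exceeded


def new_query_header(ttl: int = RECOMMENDED_MAX_TTL) -> dict:
    """Build header fields for a new Query, clamping TTL to the recommended limit."""
    if ttl < 1:
        raise ValueError("TTL of a new query must be at least 1")
    ttl = min(ttl, RECOMMENDED_MAX_TTL)  # always within ABSOLUTE_MAX_TTL as well
    # Assumption: a random 16-byte Message ID; Hops always starts at 0.
    return {"message_id": os.urandom(16), "ttl": ttl, "hops": 0}


class AutoQueryLimiter:
    """Allow at most one automatic query per interval (one hour by default)."""

    def __init__(self, min_interval: float = 3600.0) -> None:
        self._min_interval = min_interval
        self._last: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if an automatic query may be sent at time `now` (seconds)."""
        if self._last is not None and now - self._last < self._min_interval:
            return False
        self._last = now
        return True
```

Passing the current time in explicitly (for instance from `time.monotonic()`) keeps the limiter easy to test.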
When and how to respond with Query Hit messages
When a servent receives an incoming Query message it SHOULD match the Search Criteria of the query against its local shared files.
How to encode text
The Search Criteria is text, and historically the charset used to encode that text was not specified. Old servents therefore assumed pure ASCII. However, many deployed servents use extended charsets, in which the 8th bit of a byte may be set. This has created interoperability problems, because different platforms use different native charsets; Windows alone has several native "ANSI" codepages.
However, a very large number of deployed servents use the internal Java encoding, which hides platform differences, and most others run on Windows with a local ANSI charset, most often on Western European versions of Windows. This has largely promoted the ISO8859-1 character set as the initial standard for sending queries and for returning query hits, including meta-data.
With the internationalization of Gnutella, it is clear that this initial ISO8859-1 (which is still widely deployed) is inappropriate for users of languages written in scripts other than the restricted Western European Latin alphabet. The more universal Unicode encoding is now widely used on the Internet because it can support all languages of the world. The UTF-8 encoding form of Unicode is preferred because it does not modify US-ASCII sequences (each ASCII character is still represented by one byte). The transition from ISO8859-1 to UTF-8 is simple, because the conversion can be computed algorithmically. So:
- For new developments on Gnutella, UTF-8 is the HIGHLY RECOMMENDED charset instead of ISO8859-1, and servents should now encode and decode Query strings and results according to the STRICT UTF-8 encoding (extended Latin, Greek, or Basic Cyrillic characters are coded on two bytes, and all other characters of modern scripts are coded on three bytes). Applications that run on other platforms MUST include support for encoding/decoding the local native charset to/from UTF-8 (this just requires a mapping table for the supported native charsets, or the use of an API provided by almost all operating systems).
- When decoding UTF-8 text, if any encoding error is detected, applications should fall back to decoding with the ISO8859-1 charset, or possibly the Windows-1252 charset (a widely used extension of ISO8859-1 that replaces the unused C1 controls with other characters commonly used in Latin texts). This preserves excellent interoperability with almost all deployed servents that still only use ISO8859-1. Note that actual text encoded with ISO8859-1 will almost never decode as valid UTF-8, notably in search strings: ambiguity can occur only when a byte in the range 0xC2 to 0xF4 (mostly accented Latin-1 letters) is immediately followed by the right number of bytes in the range 0x80 to 0xBF, which in ISO8859-1 are C1 controls, punctuation, or symbols rather than letters.
- When encoding UTF-8 text, no leading "Byte Order Mark" (or BOM, U+FEFF) should be used. If a leading BOM is present (coded 0xEF,0xBB,0xBF in UTF-8), it should be ignored. Applications should filter all occurrences of U+FEFF in strings, even where it could mean a Zero-Width Non-Breaking Space, because the use of U+FEFF to mean ZWNBSP is deprecated by the Unicode standard.
- In any case, the internal processing of query strings SHOULD be done as if the strings were all originally coded with UTF-16, with no leading BOM. The UTF-16 encoding SHOULD NOT be used on the network, because its use of null bytes and its endianness ambiguity are incompatible with many messages and extensions (unless COBS is used), and also because it will interoperate badly with servents that still assume US-ASCII only, ISO8859-1, Windows-1252, or any other legacy 8-bit charset.
- The precomposed "Normalization Form C" (NFC) of Unicode strings is also STRONGLY RECOMMENDED because it encodes most characters in one Unicode code point, or in predictable sequences, and eases the transition with legacy 8-bit charsets used by native platforms. Some platforms (like the HFS+ filesystem on MacOSX) encode filenames using decomposed form which won't interoperate (including with MacOS Classic systems using HFS filesystem, but also most Windows 95/98/ME platforms that have a limited support for Unicode); on such systems, the servent SHOULD provide the necessary support to recompose those filenames returned by the filesystem into NFC form (note that HFS+ filesystem on MacOSX will locate successfully all filenames given with the NFC form).
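The decoding and encoding recommendations in this list can be sketched as follows. The function names are hypothetical. The sketch tries strict UTF-8 first, then Windows-1252, then ISO8859-1 (which accepts every byte value), removes the BOM code point U+FEFF (the one coded 0xEF,0xBB,0xBF in UTF-8), and normalizes to NFC.

```python
import unicodedata

BOM = "\ufeff"  # the code point that encodes to 0xEF,0xBB,0xBF in UTF-8


def decode_search_text(raw: bytes) -> str:
    """Decode text received from the network into an NFC-normalized string."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        try:
            # Fallback for legacy servents; cp1252 leaves a few bytes undefined.
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            # ISO8859-1 maps every byte to a code point, so this cannot fail.
            text = raw.decode("iso-8859-1")
    # Filter every occurrence of the BOM, not only a leading one.
    text = text.replace(BOM, "")
    return unicodedata.normalize("NFC", text)


def encode_search_text(text: str) -> bytes:
    """Encode text for the network as UTF-8, in NFC form and without a BOM."""
    return unicodedata.normalize("NFC", text.replace(BOM, "")).encode("utf-8")
```

For example, a decomposed "e" followed by U+0301 (as a filesystem using decomposed names might return) is sent as the two bytes 0xC3,0xA9 of the precomposed character.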
Meaning of the Search Criteria
Exactly how to interpret the Search Criteria is not specified either, but here are some guidelines for interoperability between servents:
- The Search Criteria is a string of keywords. A servent SHOULD only respond with files that have all the keywords. It is RECOMMENDED to break up the words on any non-alphanumeric characters (anything but letters and numbers). The regular US-ASCII space (U+0020) is the standard separator between keywords.
- Servents MAY also require that all matching terms be present in the same number and order as in the query.
- The matching SHOULD be case-insensitive.
- Empty queries or queries containing only 1-letter words SHOULD be ignored.
- Servents MAY ignore queries whose Search Criteria is shorter than a chosen length. The reason is to ignore too broad searches.
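A minimal sketch of these guidelines is shown below. The function names are hypothetical, and matching on whole words (rather than on substrings or prefixes) is an assumption of this example; the guidelines leave that choice to the servent.

```python
import re
from typing import List

# Split on anything that is not a letter or a digit (underscore included).
_SEPARATORS = re.compile(r"[\W_]+")


def keywords(text: str) -> List[str]:
    """Break text into lowercase keywords on non-alphanumeric characters."""
    return [word for word in _SEPARATORS.split(text.lower()) if word]


def matches(criteria: str, filename: str) -> bool:
    """Return True if the file name contains every keyword of the criteria."""
    wanted = keywords(criteria)
    # Ignore empty queries and queries made only of 1-letter words.
    if not any(len(word) > 1 for word in wanted):
        return False
    available = set(keywords(filename))
    return all(word in available for word in wanted)
```

With this sketch, `matches("Linux ISO", "debian-linux_12.iso")` is true, while a query for "a b" is ignored.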
Special Search Criteria
- Regular expressions are not supported and common regexp "meta-characters" such as "*" or "." will either stand for themselves or be ignored.
- GGEP extensions MAY be used to provide details on how to parse the Search Criteria (such as specifying that regular expressions matching should be used), but a servent can never be sure other servents will understand the GGEP extension.
- Query messages with TTL=1, hops=0 and Search Criteria="    " (four spaces) are used to index all files a host is sharing. Servents SHOULD reply to such queries with all their shared files. Multiple Query Hit messages SHOULD be used if sharing many files. Allowed reasons not to respond to index queries include privacy and bandwidth.
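Recognizing an index query and splitting the reply across several Query Hit messages could look like the sketch below. The names are hypothetical, and the batch size of 64 results per message is an arbitrary value chosen for illustration.

```python
from typing import Iterator, List, Sequence

INDEX_CRITERIA = " " * 4  # exactly four spaces


def is_index_query(ttl: int, hops: int, criteria: str) -> bool:
    """Return True for a query that asks for all shared files."""
    return ttl == 1 and hops == 0 and criteria == INDEX_CRITERIA


def batches(files: Sequence[str], per_hit: int = 64) -> Iterator[List[str]]:
    """Yield groups of shared files, one group per Query Hit message."""
    for start in range(0, len(files), per_hit):
        yield list(files[start:start + per_hit])
```

A servent that declines index queries for privacy or bandwidth reasons would simply not call `batches` when `is_index_query` returns true.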
Determining the value of other fields in Query Hits
- A Query Hit message MUST have the same Message ID as the Query message it is sent in reply to.
- The TTL value MUST be at least the hops value of the corresponding query, and the initial hops value of the Query Hit message MUST (as usual) be set to 0.
- The TTL value MAY be set to at least the hops value of the corresponding Query plus 2, to allow the Query Hit to take a longer route back, if necessary. Some servents use a TTL of (2 * Query_TTL + 2) in their replies to be sure that the reply will reach its destination. Replies with high TTL level SHOULD be allowed to pass through.
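The header rules in this list can be sketched as follows. The function name and the dictionary representation are hypothetical; `extra_ttl` defaults to the margin of 2 mentioned above and can be set to 0 for the minimum allowed TTL.

```python
def query_hit_header(query_message_id: bytes, query_hops: int, extra_ttl: int = 2) -> dict:
    """Build header fields for a Query Hit that answers the given Query."""
    if extra_ttl < 0:
        raise ValueError("TTL must be at least the hops value of the query")
    return {
        "message_id": query_message_id,   # same Message ID as the Query
        "ttl": query_hops + extra_ttl,    # at least the Query's hops value
        "hops": 0,                        # a new message always starts at 0
    }
```

For a Query received with hops=3, the default gives a Query Hit with TTL=5, which leaves room for a slightly longer return route.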