Spider options

Spider downloading options

These options control the way in which Zoom downloads files when it is in spider mode. Note that they do not apply to offline mode indexing.

Single-threaded downloading

This option uses only one dedicated thread for downloading files. This is typically the slower option, but it provides reasonable speed when indexing a site over a fast connection. It is also recommended when you are trying to follow the spider's crawling path, to determine whether it is scanning the pages you expect.

Multiple threads

This option allows you to specify more than one dedicated thread for downloading files in spider mode. This is recommended to increase the speed of indexing, as it allows Zoom to download multiple files in the background while indexing at the same time.

Reload all files (do not use cache)

Check this option to ensure that all files are downloaded from the site and that the cached copies of pages are not used.

Spider throttling

This option allows you to add or increase a delay between the requests made to a web server when indexing in Spider Mode. This can be useful when you are crawling a web server which is under heavy load and you wish to minimize the additional load placed on it by the spidering process. Note that in most cases the spider should not put much strain on a web server, as it is limited by the bandwidth and processing capability of a single desktop computer. This option should only be necessary when indexing a server which is overloaded with an unreasonable number of websites or tasks, or is running on underpowered hardware. Using this option will significantly slow down the spider indexing process. For most other situations, we recommend setting this to "No delay between pages".
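
As a rough illustration of what a fixed delay between page requests means in practice, the following Python sketch pauses between downloads. This is not Zoom's internal code; the URLs and the 2 second delay are made-up example values.

import time
import urllib.request

# Hypothetical pages to index and an example "delay between pages" setting.
urls = ["http://www.mysite.com/page1.html", "http://www.mysite.com/page2.html"]
delay_seconds = 2

for url in urls:
    with urllib.request.urlopen(url) as response:
        html = response.read()   # download the page
    # ... index the downloaded page here ...
    time.sleep(delay_seconds)    # wait before requesting the next page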

Enable "robots.txt" support

When this option is enabled, Zoom will look for "robots.txt" files when indexing a website in Spider Mode. The "robots.txt" file can give spiders and other user-agents instructions on which pages should be excluded from indexing (similar in effect to the "Skip pages list") and whether a crawl delay is required (similar in effect to the "Spider throttling" option).

Zoom will download a "robots.txt" file (if available) for each start point, so this method allows you to have per-start-point skip pages and throttling settings. It is also a good idea when indexing third-party websites, so that you can make sure your spider is obeying the webmaster's rules.

Note that Zoom will only locate a "robots.txt" file for each start point, at the root level of the domain being indexed. It will not parse "robots.txt" files which are located in sub-folders. This means you should have all your sub-folder robots settings located within the one "robots.txt" file, specifying your rules relative to the base URL.

For example, the following "robots.txt" file will block Zoom from indexing any files in a folder named "secret" and any files named "private.html". It will also force a delay of 5 seconds between requests to this start point.

# this is my robots.txt for http://www.mysite.com/ (this comment is ignored)
User-agent: ZoomSpider
Disallow: /secret/
Disallow: private.html
Crawl-delay: 5

For more information on the "robots.txt" file format, please refer to online resources such as http://www.robotstxt.org/
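
As an illustration of how a crawler interprets these rules, the following Python sketch applies the example "robots.txt" above using the standard urllib.robotparser module. This is a demonstration of the rules only, not Zoom's own implementation, and it assumes the example file is served at the made-up URL http://www.mysite.com/robots.txt.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.mysite.com/robots.txt")
rp.read()  # download and parse the robots.txt at the root of the domain

# Expected results, assuming the example robots.txt above is served there:
print(rp.can_fetch("ZoomSpider", "http://www.mysite.com/secret/page.html"))  # False (blocked)
print(rp.can_fetch("ZoomSpider", "http://www.mysite.com/index.html"))        # True (allowed)
print(rp.crawl_delay("ZoomSpider"))  # 5 (seconds to wait between requests)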

When this option is enabled, Zoom will also support the "noindex" and "nofollow" values of the "robots" meta tag. This allows you to exclude certain pages from indexing (or prevent them from being crawled for links) by simply adding a tag such as the following within the page head:

<meta name="robots" content="noindex,nofollow">

Note that specifying "index" or "follow" values in the robots meta tag will have no effect as this is the default behaviour for all pages scanned.
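
For illustration only, the following Python sketch shows one way a crawler could detect the "robots" meta tag in a downloaded page, using the standard html.parser module. It is not Zoom's own parsing code, and the sample HTML is a made-up example.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False   # page should be excluded from the index
        self.nofollow = False  # links on the page should not be followed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "").lower()
            self.noindex = "noindex" in content
            self.nofollow = "nofollow" in content

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex,nofollow"></head></html>')
print(parser.noindex, parser.nofollow)  # True True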

Parse for links in JavaScript code

Some links on your web page may be embedded within JavaScript code (e.g. pop-up navigation menus). These links are generally considered to be search engine unfriendly, because a web spider is unable to execute the script and crawl the resulting links.

While Zoom will not execute JavaScript, this option instructs Zoom to look for URLs within the JavaScript code and crawl/follow any valid links that it finds. This will typically find some of the links in your JavaScript, but not necessarily all of them. It also increases the time it takes to index a web page. However, it can be a decent solution if you have many links within JavaScript code and do not have the time to make your web page more search engine friendly.

You can find more information on indexing JavaScript links and better long-term solutions explained in our online FAQ here: http://www.wrensoft.com/zoom/support/faq_problems.html#javascriptmenus
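
As a simple illustration of the pattern-matching approach described above (finding URL-like strings rather than executing the script), the following Python sketch scans JavaScript source with a regular expression. It is not Zoom's actual matching logic and, like any such approach, it will find some but not all links; the script and file names are made-up examples.

import re

js_code = '''
function openPage() { window.location = "products/list.html"; }
var menu = ["http://www.mysite.com/about.html", "contact.html"];
'''

# Look for quoted strings ending in a typical web page extension.
candidate_links = re.findall(r'["\']([^"\']+\.(?:html?|php|aspx?))["\']', js_code)
print(candidate_links)
# ['products/list.html', 'http://www.mysite.com/about.html', 'contact.html']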

Scan files linked via "file://" URLs in spider mode

This allows spider mode to follow "file://" style hypertext links. This can be useful for indexing an Intranet where you have accessible files on the web server as well as on shared drives on the network.

Check thumbnails exist on website before using URL

This option applies for search result thumbnails configured as described in "Icons and thumbnails".

With this option enabled, Zoom will check each thumbnail URL to see if the image file exists on the web server (at the time of indexing) before using that URL for the thumbnail. This prevents "broken image" thumbnails.
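
As an illustration of this kind of check (not Zoom's own implementation; the thumbnail URL is a made-up example), the following Python sketch sends an HTTP HEAD request to see whether an image exists before using its URL:

import urllib.request
import urllib.error

def thumbnail_exists(url):
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status == 200
    except urllib.error.URLError:
        return False  # 404, connection failure, etc. - treat as missing

print(thumbnail_exists("http://www.mysite.com/thumbs/mydocument.png"))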

Use offline folder for .desc files

This option allows you to specify custom description (.desc) files for your plugin supported files, by hosting them locally in an offline folder.

It allows you to override incorrect meta data on remotely hosted files (possibly on sites that you cannot change, or where you do not wish to host the .desc files). For more information on .desc files, see "Using custom descriptions (.desc) files".

With this setup, you can index external sites using Spider Mode, and the Indexer will look for the .desc files for any plugin-supported file formats (such as .pdf, .doc, etc.) in the local directory. This allows you to specify custom .desc files without having to host them on the remote web site.

The offline .desc files need to include the full domain name and URL path in their filenames. This is usually everything after the "http://" or "https://" prefix. The filename must also end in ".desc" (see examples below).

However, since a number of characters that can appear in a URL are not valid in filenames, you must encode these characters in their hexadecimal form, preceded by a "%" sign. This is similar to the percent-encoding used for URLs. The following is a list of the characters in a URL which must be encoded.

Character    Encoded
\            %5C
/            %2F
:            %3A
*            %2A
?            %3F
"            %22
<            %3C
>            %3E
|            %7C

For each of the above characters in a URL, substitute the encoded form of the character when naming the .desc file for that URL.

Here are some examples of URLs and their corresponding .desc filenames:

Example 1,

URL:

http://www.mysite.com/files/mydocument.pdf

.desc filename:

www.mysite.com%2Ffiles%2Fmydocument.pdf.desc

Example 2,

URL:

http://www.mysite.com/download.php?fileid=123

.desc filename:

www.mysite.com%2Fdownload.php%3Ffileid=123.desc
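
The encoding rules above can also be expressed as a small script. The following Python sketch is an illustration only (it is not part of Zoom); given a URL, it strips the protocol prefix, substitutes the characters listed in the table, and appends ".desc", reproducing the two examples above.

def desc_filename(url):
    # Strip the "http://" or "https://" prefix.
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    # Substitute characters which are not valid in filenames with their encoded form.
    encoded = {'\\': "%5C", '/': "%2F", ':': "%3A", '*': "%2A", '?': "%3F",
               '"': "%22", '<': "%3C", '>': "%3E", '|': "%7C"}
    return "".join(encoded.get(ch, ch) for ch in url) + ".desc"

print(desc_filename("http://www.mysite.com/files/mydocument.pdf"))
# www.mysite.com%2Ffiles%2Fmydocument.pdf.desc
print(desc_filename("http://www.mysite.com/download.php?fileid=123"))
# www.mysite.com%2Fdownload.php%3Ffileid=123.desc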

Of course, the preferred solution would be to create documents with correct meta data in the first place. But when this hasn't been done, local .desc files can provide more accurate searches and better-looking results.