Content filtering


Content filtering allows you to include or exclude an entire page based on words found within the page's content. You enter a list of filter words, each prefixed with a "+" or a "-". Positive filters (keywords beginning with a "+" character) mean that only pages containing these words will be indexed. Negative filters (keywords beginning with a "-" character) mean that pages containing these words will NOT be indexed.

This can be useful for two reasons:

1. It helps if you want to create a specialised 'vertical' search engine. For example, you could create a search engine about pets. In this case, the word filter list might look like:
+dog
+cat
+bird
+mouse
+hamster
+pet
2. You might want to avoid indexing some types of content. For example, if you were building a religious search engine or a search engine for children, you might want to use negative filters:
-adult
-casino
-sex
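
The way such a filter list could be applied to a page can be sketched in Python. This is only an illustration, not the indexer's actual implementation: the function names `parse_filters` and `should_index` are hypothetical, and the sketch assumes case-insensitive substring matching and that a page passes the positive filters if it contains any one of the "+" words.

```python
def parse_filters(lines):
    """Split a filter list into (positive, negative) keyword lists."""
    positive, negative = [], []
    for raw in lines:
        entry = raw.strip()
        if len(entry) < 2:
            continue  # ignore blank lines and a bare "+" or "-"
        if entry.startswith("+"):
            positive.append(entry[1:].lower())
        elif entry.startswith("-"):
            negative.append(entry[1:].lower())
    return positive, negative


def should_index(page_source, positive, negative):
    """Decide whether a page would be indexed under the given filters.

    Assumptions of this sketch: matching is a case-insensitive substring
    test, and one matching positive word is enough.
    """
    text = page_source.lower()
    # Any negative word excludes the page.
    if any(word in text for word in negative):
        return False
    # If positive filters exist, at least one must be present.
    if positive and not any(word in text for word in positive):
        return False
    return True


positive, negative = parse_filters(["+dog", "+cat", "-casino"])
print(should_index("<p>How to train your dog</p>", positive, negative))
print(should_index("<p>Dog racing at the casino</p>", positive, negative))
```

Because the sketch uses plain substring matching, a filter such as "+cat" would also match "category"; a real filter may well treat word boundaries differently.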

You can also filter out pages based on their URL by using the "Skip pages" list. This method, however, requires that you know the pages' URLs in advance and manually determine which pages have wanted or unwanted content. This is not always possible when indexing external sites, and content filtering solves this problem.

Tip: Content-based filtering is less efficient than URL-based filtering because each page must be downloaded before it can be filtered. With URL-based filtering (using the "Skip pages" list), the page can be discarded before it is downloaded, which speeds up the indexing process. URL filtering should therefore still be used when possible.
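
The cost difference described in the tip can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the indexer's real logic: `process_url` is a hypothetical function, `fetch` is a hypothetical callable that downloads a URL and returns its HTML source, and skip patterns are assumed to be simple substrings of the URL.

```python
def process_url(url, skip_patterns, negative_words, fetch):
    """Return (decision, downloaded) for one URL.

    `fetch` is a hypothetical callable that downloads a URL and returns
    its HTML source as a string.
    """
    # URL-based filtering: the decision is made without any download.
    if any(pattern in url for pattern in skip_patterns):
        return "skipped by URL", False

    # Content-based filtering: the page has to be downloaded first.
    source = fetch(url).lower()
    if any(word.lower() in source for word in negative_words):
        return "filtered by content", True

    return "indexed", True
```

In this model, a URL that matches a skip pattern never reaches `fetch`, whereas a page rejected by a content filter has already been downloaded in full.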

Content filtering is applied to the HTML source code of the page being scanned. This means that you can filter by HTML tags, for example:
-<meta name="robots" content="noindex">
 
However, this also means that support for filtering by exact phrases is limited, because the words of a phrase may be separated in the source by HTML tags such as line breaks, comments, etc. For example, a filter for the phrase "adults only" will not match a page whose HTML actually contains "adults<br>only".
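
Both effects of matching against the raw HTML source can be reproduced with a small Python sketch. The helper `source_contains` is hypothetical, and the sketch assumes a simple case-insensitive substring test, which may differ from the matching the indexer really performs.

```python
def source_contains(page_source, text):
    """Case-insensitive substring test against the raw HTML source."""
    return text.lower() in page_source.lower()


# Filtering on an HTML tag works because the tag is part of the source.
page_with_meta = '<head><meta name="robots" content="noindex"></head>'
tag_match = source_contains(page_with_meta, '<meta name="robots" content="noindex">')

# A phrase matches only when its words are adjacent in the source.
plain = "<p>This area is for adults only.</p>"
broken = "<p>This area is for adults<br>only.</p>"
plain_match = source_contains(plain, "adults only")
broken_match = source_contains(broken, "adults only")

print(tag_match, plain_match, broken_match)
```

The third check fails because the source holds "adults<br>only" rather than the two words separated by a space, even though a browser would display them as a phrase split over two lines.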