Filter modules for Squid

Version 0.8, February 2003
  1. Purpose
  2. Prerequisites
  3. Installation
  4. Configuration
  5. Available modules
  6. Using
  7. Internals
  8. Related projects
  9. Getting this package
  10. Bugs

This is a project to build filtering capabilities comparable to those of Muffin into Squid.

It also contains a new, module-based scheme for proxy authentication.

Purpose

A filtering proxy allows users to remove unwanted stuff from Web pages as they browse them. What counts as "unwanted stuff" obviously depends on the individual user. Some of the things commonly regarded as annoyances can be avoided by filtering URIs, which Squid can already do via an external redirect program. Others require a content filter.

Usually, a filtering proxy runs standalone and does nothing but filtering. Users have to configure this proxy in their browsers, and if they use a caching proxy too, chain them after the filter. In situations where the user runs Squid anyway (mostly because of caching for different browsers or a small LAN), it is convenient to build this capability into Squid.

Prerequisites

This patch is for Squid 2.4STABLE7. It requires an operating system with a libdl or libdld dynamic-loader library and a compiler which can produce the needed shared objects. Tested in the following environments:

You need the Squid sources, everything for compiling them, GNU "patch" and GNU "autoconf".

Installation

  1. Apply the patch (in the Squid source directory):
    gzip -cd squid-filter-0.8.patch.gz | patch -p1
    
  2. Run configure:
    autoconf
    sh configure (options...) --enable-filters
    
  3. Compile and install Squid as usual. The filter modules will be installed in the same directory as the binary.

Configuration

Loading modules

There is a new squid.conf directive:
load_module <modulefile> <arguments>
It tells Squid to load a filter module from the given file. The file should be specified with a full path name. The filter modules can take arguments as documented for the individual modules. Most filter modules take the name of a pattern file as an optional (last) argument. Arguments are separated by whitespace; no quoting mechanism is available (so far). A module can be specified more than once. In that case several filter instances are built from the already loaded module code.

When Squid is reconfigured, all active module instances are deleted and all modules are unloaded, then all modules in the (new) config file are loaded. (This does not unload the actual code segments if they are used in the new configuration. To update a module with a replaced binary a restart of Squid is needed.)

There is another new squid.conf directive:

nofilter_port <portnum>
The port number given must be one of the numbers specified in http_port. Requests arriving on this port will not be filtered. Effectively, this runs a filtering and a non-filtering proxy at once, on different ports.

Pattern files

Pattern files contain lists of regular expressions (POSIX extended, or grep -E syntax), one pattern per line, against which the URI is matched. Blank lines and lines starting with a number sign (#) are ignored in the usual fashion. Whenever a pattern file is changed, it is reloaded automatically at the next request; no reconfigure is needed. A pattern is marked as case-insensitive by prepending a dash. (To place a real dash at the start of a pattern, use a class, like [-].) Patterns may not contain literal TABs; use \t instead.
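The line rules above can be sketched roughly as follows. This is only an illustration in Python, not the code from src/patfile.c: the function name parse_pattern_lines is made up, and Python's re syntax merely agrees with POSIX extended syntax for simple patterns like these.

```python
import re

def parse_pattern_lines(lines):
    """Compile pattern-file lines following the rules described above."""
    patterns = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Blank lines and lines starting with "#" are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        flags = 0
        # A leading dash marks the pattern as case-insensitive.
        if line.startswith("-"):
            flags = re.IGNORECASE
            line = line[1:]
        patterns.append(re.compile(line, flags))
    return patterns

# Hypothetical file content: a comment, a blank line, a case-insensitive
# pattern, and a pattern that starts with a real dash (written as a class).
pats = parse_pattern_lines([
    "# ad servers\n",
    "\n",
    "-^http://ads\\.example\\.com/\n",
    "[-]dash-first\n",
])
print(len(pats))
print(bool(pats[0].search("HTTP://ADS.EXAMPLE.COM/banner.gif")))
```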

There are two types of pattern files: allow lists and replacement lists.

Allow lists

Unless otherwise stated, every filter can optionally take an allow list. This is a list of URI patterns to which the filter should not apply. Any pattern can be prefixed with an exclamation mark meaning "do not match this". In that case the order matters: the first matching pattern in the file counts. Example:
!^http://example.com/foo/bar
^http://example.com
means: Apply the filter to URIs starting with http://example.com/foo/bar, but don't apply the filter to anything else in http://example.com.
If this sounds confusing, just stick to the word "allow list": What matches is allowed through. Or look at it like this: "bang means filter".
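The first-match rule can be written down as a small decision function. This is a sketch under assumptions, not the actual patfile code: the name filter_applies is made up, the dash prefix for case-insensitive patterns is left out, and a URI that matches no pattern at all is assumed to be filtered (it is not on the allow list).

```python
import re

def filter_applies(uri, allow_lines):
    """Return True if the filter should process this URI."""
    for line in allow_lines:
        negated = line.startswith("!")
        pattern = line[1:] if negated else line
        if re.search(pattern, uri):
            # The first matching pattern decides: "bang means filter".
            return negated
    # Assumption: no pattern matched, so the URI is not allowed through.
    return True

allow = ["!^http://example.com/foo/bar", "^http://example.com"]
print(filter_applies("http://example.com/foo/bar/x.html", allow))
print(filter_applies("http://example.com/index.html", allow))
```

With the example list from above, the first call reports that the filter applies, the second that it does not.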

Replacement lists

A replacement list allows URIs to be replaced by other URIs, in a sed s///-like fashion. This type of pattern file is used by the redirection filter. Each line in the file consists of two elements separated by (at least) one TAB character. The first is a pattern, the second a replacement. The replacement may contain \1, \2... \9 references to parenthesized subpatterns (\0 means the whole match). A special replacement can be given as a shortcut for patterns which have no explicit replacement. This default is specified as replacement for the pattern consisting of a single exclamation mark, which should be the first line in the file. Negative match does not work in a replacement list.
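To make the file format more concrete, here is a rough Python sketch of how such a list could be read and applied. All names (load_replacements, expand, rewrite) and the example URIs are made up. That only the matched part of the URI is replaced is an assumption taken from the "sed s///-like" wording, and the search stops at the first matching pattern, as described for redirect.so.

```python
import re

def load_replacements(lines):
    """Parse 'pattern<TAB>replacement' lines; the pattern '!' sets the default."""
    default, rules = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        pattern, _, replacement = line.partition("\t")
        replacement = replacement.lstrip("\t")  # more than one TAB is allowed
        if pattern == "!":
            default = replacement
        else:
            rules.append((re.compile(pattern), replacement or None))
    return default, rules

def expand(match, replacement):
    """Substitute \\0..\\9 with the whole match or a parenthesized subpattern."""
    return re.sub(r"\\([0-9])",
                  lambda ref: match.group(int(ref.group(1))) or "",
                  replacement)

def rewrite(uri, default, rules):
    for regex, replacement in rules:
        match = regex.search(uri)
        if match:
            text = replacement if replacement is not None else default
            if text is None:
                return uri  # no explicit replacement and no default given
            # Assumption: replace only the matched part, as sed's s/// would.
            return uri[:match.start()] + expand(match, text) + uri[match.end():]
    return uri

default, rules = load_replacements([
    "!\thttp://localhost/blocked.html\n",
    "^http://ads\\.example\\.com/.*\n",
    "^http://www\\.example\\.com/old/(.*)\thttp://www.example.com/new/\\1\n",
])
print(rewrite("http://www.example.com/old/page.html", default, rules))
print(rewrite("http://ads.example.com/banner.gif", default, rules))
```

The second pattern has no replacement of its own, so it falls back to the default given on the "!" line.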

Other configuration dependencies

When content filters (see next section) are in use, an anonymize_headers clause must be set up to filter out the Accept-Encoding request header. See below for the exact reason.

Available modules

Currently there are the following modules:

Filters

Filters fall into several categories. Filters of the same category either operate independently or can be chained. Chaining is described below where appropriate (for request and content filters). In any case, all applicable filters are called in exactly the order in which they are specified in the config file.

redirect.so

Replaces Squid's external redirect program. Takes one argument, the name of a replacement list file. Performs pattern substitution on the requested URI. As soon as a pattern matches, the search stops, i.e. redirections are not chained within one redirection filter. However, if the module is specified several times (probably with different replacement list files), all instances are called in order, with a later filter operating on the results of an earlier one. If an external redirector is in use, it is called first, before the filters. NOFILTER does not apply to external redirectors.

rejecttype.so

Rejects Web objects based on their MIME content type. Takes the type to be rejected as first argument and an allow list file name as optional second argument. The type must be given in lowercase letters. Either of its components can be a * (such as audio/*), meaning "all". To reject several types, specify this module more than once. Bug: the client gets an "empty" reply (without even a header) instead of an error message.
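The wildcard rule for type arguments amounts to a component-wise comparison. A minimal sketch, with a made-up function name; stripping parameters such as "; charset=..." and lowercasing the reply's type are my assumptions, not something documented for the module.

```python
def type_matches(spec, content_type):
    """Match a content type against a spec such as 'audio/*' or 'image/gif'."""
    # Assumption: ignore parameters and compare in lowercase.
    actual = content_type.split(";", 1)[0].strip().lower()
    spec_major, _, spec_minor = spec.partition("/")
    major, _, minor = actual.partition("/")
    # A "*" component in the spec stands for "all".
    return spec_major in ("*", major) and spec_minor in ("*", minor)

print(type_matches("audio/*", "audio/mpeg"))
print(type_matches("image/gif", "image/png"))
```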

allowtype.so

The opposite of rejecttype: rejects all MIME content types except those specified. Takes a list of types to be allowed through (must be given in lowercase letters, wildcards possible) as arguments. An optional last argument specifies an allow list file, given as an absolute file name (this makes it possible to distinguish file name and content type, as content types never start with a slash). It makes little sense to chain this module with either itself or rejecttype.so (unless one module should affect the other one's allow list files).

cookies.so

Kills cookies in both request and reply. Takes an allow list file name as optional argument.

htmlfilter.so

A library module which provides a generic HTML filtering service. It does no filtering by itself, but must be loaded before script.so, activex.so or any other future HTML filter.

script.so

Removes JavaScript (SCRIPT tags, on... handlers and browser-specific ways of inserting JavaScript into tag attributes) from HTML pages. (To also block JavaScript files, use rejecttype.so on "application/x-javascript".) Takes an allow list file name as optional argument.

activex.so

Removes ActiveX OBJECT tags from HTML pages. The tags are preserved, only the classid parameter is replaced by a dummy, so the page will still be processed correctly (as if by a non-ActiveX browser). Takes an allow list file name as optional argument.

gifanim.so

Breaks animated GIF pictures to remove the annoying blinking. Takes as first argument the allowed number of cycles. If zero, there is no animation (only the first picture is shown). If negative, animations are not loaded at all (the client shows a broken picture). The default is one, meaning show the whole content but don't blink. An allow list file name is the optional second argument.

Each content filter specifies the MIME content type to which it applies (like image/gif for the gifanim module) and ignores all other types.

Content filters can be chained. When more than one filter applies to a given MIME content type, every filter operates on the results of its predecessor. (This will probably become important in later releases.)
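Conceptually, chaining content filters looks like the following sketch. The names and the two toy filters are invented for illustration (they keep the data length constant, see the section on Content-Length below); the real filters are C modules working on Squid buffers.

```python
def run_content_filters(filters, content_type, data):
    """Apply all filters registered for content_type, in configuration order."""
    for applies_to, func in filters:
        if applies_to == content_type:
            # Each filter operates on the result of its predecessor.
            data = func(data)
    return data

# Hypothetical filter list in config-file order.
filters = [
    ("text/html", lambda d: d.replace(b"cat", b"dog")),
    ("image/gif", lambda d: d),
    ("text/html", lambda d: d.replace(b"dog", b"cow")),
]
print(run_content_filters(filters, "text/html", b"a cat"))
```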

bugfinder.so

Identifies GIF and PNG images smaller than 3x3 pixels. Since these are often used as "Web bugs", it may be desirable to block them with a redirector. The filter can only log them to cache.log; to actually block bugs it is necessary to filter the requests for these URIs, i.e. manual processing of the log file is needed. Like the others, this filter takes an allow list file name as optional argument.
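The size check only needs the first few bytes of an image. The following Python sketch shows one way to do it and is not taken from the module: the function names are made up, "smaller than 3x3" is read as "both width and height below 3", and for GIF the logical screen size from the file header is used.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def image_size(data):
    """Return (width, height) for GIF or PNG data, or None if not recognized."""
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        # GIF: logical screen size as two little-endian 16-bit values.
        return struct.unpack("<HH", data[6:10])
    if data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR" and len(data) >= 24:
        # PNG: width and height as big-endian 32-bit values in the IHDR chunk.
        return struct.unpack(">II", data[16:24])
    return None

def looks_like_web_bug(data, limit=3):
    """Assumption: a 'bug' has both dimensions below the limit."""
    size = image_size(data)
    return size is not None and size[0] < limit and size[1] < limit

tiny_gif = b"GIF89a" + struct.pack("<HH", 1, 1)
big_png = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 100, 50)
print(looks_like_web_bug(tiny_gif), looks_like_web_bug(big_png))
```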

Authenticators

This has nothing to do with filtering, but it was added as it fits nicely into the module framework. Authentication modules implement various methods for proxy authentication without external programs.

Authenticators can be chained. They are tried in the order in which they are configured, until one either succeeds or presents a challenge (see below). An external authenticator, if available, is tried last.

auth_passwd.so

Authenticates using a password file, like the old NCSA authenticator. The name of the file is given as a parameter; it defaults to /etc/passwd. Only the first two fields are used. Entries which specify empty, locked or invalid passwords are ignored. The file is kept in memory in a hash table, ensuring fast lookup even for big password files. It is reloaded whenever it changes. An optional second argument gives the size of the hash table relative to the number of entries, defaulting to 0.6.
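How such a file ends up in a lookup table can be sketched as below. This is an illustration only: load_passwd is a made-up name, a Python dict stands in for the hash table, and what exactly counts as "locked" or "invalid" is my guess (a "*" or "!" prefix, or the shadow placeholder "x"), not the module's documented rule.

```python
def load_passwd(lines):
    """Map user name to password hash, using only the first two fields."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split(":")
        if len(fields) < 2:
            continue
        user, password = fields[0], fields[1]
        # Assumption: skip empty passwords, locked entries ("*" or "!" prefix)
        # and the shadow placeholder "x".
        if not user or not password or password[0] in "*!" or password == "x":
            continue
        table[user] = password
    return table

# Hypothetical entries in /etc/passwd layout.
users = load_passwd([
    "alice:abJnggxhB/yJk:1000:100::/home/alice:/bin/sh\n",
    "bob::1001:100::/home/bob:/bin/sh\n",
    "carol:*:1002:100::/home/carol:/bin/sh\n",
    "dave:x:1003:100::/home/dave:/bin/sh\n",
])
print(sorted(users))
```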

This does not use NIS or any other external sources. Because those lookups can block, they have to be programmed differently (in particular, getpwnam() is not safe, and this is the reason why the original authenticators are external programs). The easiest way to get authentication from a NIS password map is to use auth_passwd.so on a regularly updated (via ypcat) copy of the master password file.

auth_authsrv.so

Authenticates using the authentication server from the TIS Firewall Toolkit. The host name and port number of the server are given as (mandatory) arguments to the module.

Authsrv uses challenge/response schemes, which are not supported directly by HTTP Basic Authentication. As a workaround, the challenge is displayed as the authentication realm. To authenticate using authsrv, the user first has to give the user name with an empty password, get a failure, and retry. The user then gets the challenge as the authentication realm and can answer with the response as password (for the same user name). This works with S/Key, but would fail with an authentication server which gives different challenges on each request.

This implementation does not use actual code from the FWTK or require it to be installed. It speaks the protocol used in FWTK 2.1.

Using

On the client side, no additional configuration is necessary. Simply set the patched Squid as your proxy.

The NOFILTER feature

Note: This feature has changed since version 0.5.
Users can request that all filters (including the redirection filter, but not the external redirector) be bypassed for a single request. This is done by appending .X.nofilter to the host name in the URL, where X is replaced by Squid's visible host name. Example: to get http://www.example.com/foo/bar unfiltered from a Squid called squid.cache, use the URI http://www.example.com.squid.cache.nofilter/foo/bar.
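The host name handling can be illustrated with a few lines of Python. This is a sketch of the naming scheme only, not of the code in the patch: strip_nofilter is a made-up name, and user info in the URL is simply dropped here.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_nofilter(url, visible_hostname):
    """Return (url_without_tag, bypass) for a requested URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + visible_hostname.lower() + ".nofilter"
    # The tag must follow a non-empty real host name.
    if not host.endswith(suffix) or len(host) == len(suffix):
        return url, False
    real_host = host[:-len(suffix)]
    netloc = real_host if parts.port is None else f"{real_host}:{parts.port}"
    clean = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return clean, True

print(strip_nofilter("http://www.example.com.squid.cache.nofilter/foo/bar", "squid.cache"))
print(strip_nofilter("http://www.example.com/foo/bar", "squid.cache"))
```

A tag carrying some other proxy's name is not recognized, which is the point of including the host name.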

The NOFILTER tag as part of the hostname in the URL implies that correctly written relative links, including images, linked scripts etc. on the same server, will also be unfiltered. Apply the necessary caution.

Squid's host name is included to prevent web servers from adding the NOFILTER tag to their junk banner links themselves. This works best when visible_hostname, unique_hostname and the canonical (DNS) host name of the proxy are all different and not too closely related, because the origin server sees the latter two but not the first.

Since ".nofilter" is not a valid top level domain, it can't clash with real host names.

Another possible way to bypass filters is to use a nofilter_port, as described above. Requests arriving on that port will always bypass all filters.

Internals

Object structure

The filter routines use a common object-oriented framework. An object in this context is a structure which contains at least a reference counter and a (pointer to a) destructor and is managed via the REF and UNREF macros from src/module.h. These structures are generated via the filters/classdef preprocessor. A module object is an object which is created by a module when it is loaded, and contains data relevant to an instance of the module, which is created by a load_module line. A filter object is an object which is created for a single request. Documentation of this stuff is spread across src/module.h and the individual filters.

Library modules

The patfile library provides the pattern file facility described above. So far every filter uses it, mostly for an optional allow list. The patfile library is included in the Squid core and described in src/patfile.h.

Content filters for HTML pages use htmlfilter.c as their module framework and HTML parser. This is documented in script.c. In theory the operating system's dynamic linker should take care of the inter-module dependencies this creates, but many dynamic linkers are too stupid, so htmlfilter.so has to be loaded manually before any HTML filtering module.

Debugging options

The following debugging sections and levels (see the debug_options directive) are used:
Section 92  Module loader (src/module.c)
Section 93  Filter modules
Section 94  Library modules (src/patfile.c, filters/htmlfilter.c)
Section 95  Authenticator modules
Level 1     Error messages
Level 3     "Filter caught something" messages
Level 4     Initialization/finalization messages
Level 5     Initialization/finalization trace
Level 8     Minor trace
Level 9     Full trace (big!)

Content-Length and Range

Content filters which alter the content of the data returned have to keep the data length constant if the HTTP reply contains a Content-Length header. Most other filtering proxies simply remove that header instead, which breaks persistent connections and clients' progress meters. Here the filters have to pad out the data at the end if they remove anything in the middle, and are not allowed to insert stuff which would lengthen it. It is the filter modules' own responsibility to ensure this.
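The padding rule can be illustrated like this. It is a sketch, not the filters' actual code: the function name is made up, and the space character as padding byte is an assumption that suits HTML, while other content types would need a different way to pad.

```python
def pad_to_original_length(original, filtered, pad=b" "):
    """Pad filtered data at the end so that Content-Length stays valid."""
    if len(filtered) > len(original):
        raise ValueError("a content filter must not lengthen the data")
    return filtered + pad * (len(original) - len(filtered))

original = b"<p>a</p><script>x()</script>"
filtered = original.replace(b"<script>x()</script>", b"")
padded = pad_to_original_length(original, filtered)
print(len(original) == len(padded))
```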

If a content filter gets applied, the patched Squid will ignore Range requests and always send the whole object, since in general filters cannot properly determine ranges. Without Range requests the origin server should refrain from sending Transfer-Encodings which would confuse the filters. See also the next two paragraphs.

Content-Encoding

Content filters get the data as delivered by the server. With a non-identity Content-Encoding the filter would operate on the encoded data, which it generally can not process correctly. (It has been confirmed by experience that HTML filters like script.so applied to a file with compression encoding will silently deliver corrupted files.)

For this reason, the Accept-Encoding headers should always be filtered out with an appropriate anonymize_headers clause. This causes the origin server to always send unencoded data.

Filters in the data path

The cache always stores unfiltered objects. Content filtering happens in the data path from cache or memory to the client. The filter object is expected to copy the data into a new buffer, so it can do anything with it including insertions and deletions (but see the paragraph above on Content-Length).

The only exceptions to the rule that filtering happens only in the path to the client are the filters which alter the request. This applies to the redirect and the cookies modules.

In a cache hierarchy, a filtering cache should only be placed at the bottom, i.e. where only clients directly access it. If another cache sits between the filter and client, that one will cache filtered pages and break the NOFILTER feature.

Blocking and callbacks

Authentication modules use a callback scheme which is explained in filters/auth_passwd.c. Thus they may use arbitrary I/O as long as they arrange for the proper callbacks. Filter modules currently are simple functions; they cannot use callbacks and are expected to avoid blocking I/O (aside from reading config files, which therefore should not be mounted over a network).

Related projects

This project was mostly inspired by Muffin, a modular filtering proxy written in Java and distributed under the GPL. So far that is the most powerful filter I know of.

The Junkbusters web page has one of the oldest and best known web filters as well as a very comprehensive resource list covering most issues from "What is this all about?" to a list of filtering software (so far most of them are either for Windows or for pay or both, which indicates there is a real demand for filtering).

Getting this package

This package can be found at http://sites.inka.de/bigred/devel/squid.filter-0.8.patch.gz.
For use and distribution of this package, the same terms and conditions as for the Squid package itself (i.e. the GNU General Public License) apply. Note, however, that using a version or installation setup which has the NOFILTER feature removed or restricted in any way is in gross contradiction to the author's intentions, and people who do so should feel guilty of abuse.

An up-to-date version of this page is always found at http://sites.inka.de/bigred/devel/squid-filter.html.

Bugs

As with any pre-release, this surely contains bugs. In particular I'm not sure if I really avoided memory leaks (although the module/object stuff carries leakfinder instrumentation and during testing no remaining leaks were detected). If someone finds problems, please tell me.

Known issues