This is a project to build filtering capabilities comparable to those of Muffin into Squid. It consists of
Usually, a filtering proxy runs standalone and does nothing but filtering. Users have to configure this proxy in their browsers and, if they also use a caching proxy, chain it after the filter. In situations where the user runs Squid anyway (mostly because of caching for different browsers or a small LAN), it is convenient to build this capability into Squid.
Requires the libdl or libdld dynamic-loader library and a compiler which can produce the needed shared objects. Tested in the following environments:
You need the Squid sources, everything for compiling them, GNU "patch" and GNU "autoconf".
gzip -cd squid-filter-0.8.patch.gz | patch -p1
autoconf
sh configure (options...) --enable-filters
load_module <modulefile> <arguments>
This directive tells Squid to load a filter module from the given file. The file should be specified with a full path name. The filter modules can take arguments as documented for the individual modules. Most filter modules take the name of a pattern file as an optional (last) argument. Arguments are separated by whitespace; no quoting mechanism is available (yet). A module can be specified more than once; in that case several filter instances are built from the already loaded module code.
When Squid is reconfigured, all active module instances are deleted and all modules are unloaded, then all modules in the (new) config file are loaded. (This does not unload the actual code segments if they are used in the new configuration. To update a module with a replaced binary a restart of Squid is needed.)
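The argument rule for load_module (whitespace-separated fields, no quoting) can be illustrated with a small Python sketch. This is not Squid's parser; the function name and the module paths are made up for illustration.

```python
def parse_load_module(line):
    """Split a load_module line into (module_path, [arguments]).

    Illustration only: fields are separated by runs of whitespace and
    there is no quoting, so an argument can never contain a space.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != "load_module":
        raise ValueError("not a load_module line: %r" % line)
    return fields[1], fields[2:]


# Hypothetical configuration: the same module file listed twice
# yields two filter instances with different arguments.
config = [
    "load_module /usr/local/squid/filters/rejecttype.so audio/*",
    "load_module /usr/local/squid/filters/rejecttype.so video/*",
]
instances = [parse_load_module(line) for line in config]
```

Each entry in instances stands for one filter instance; both refer to the same module file, mirroring the statement that the module code is loaded once and instantiated several times.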
There is another new squid.conf directive:
nofilter_port <portnum>
The port number given must be one of the numbers specified in http_port. Requests arriving on this port will not be filtered. Effectively, this runs a filtering and a non-filtering proxy at once, on different ports.
Pattern files contain regular expressions (grep -E syntax), one pattern per line, against which the URI is matched. Blank lines and lines starting with a number sign (#) are ignored in the usual fashion. Whenever a pattern file is changed, it is reloaded automatically at the next request; no reconfigure is needed. A pattern is marked as case-insensitive by prepending a dash. (To place a real dash at the start of a pattern, use a class, like [-].) Patterns may not contain literal TABs; use \t instead.
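The line rules above (comments, blank lines, the leading dash) can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, Python's re syntax is close to but not identical with grep -E syntax, and treating whitespace-only lines as blank is an assumption.

```python
import re


def compile_patterns(text):
    """Compile the patterns of a pattern file given as one string."""
    patterns = []
    for line in text.splitlines():
        # Skip blank lines (assumed to include whitespace-only lines)
        # and comment lines starting with a number sign.
        if not line.strip() or line.startswith("#"):
            continue
        flags = 0
        if line.startswith("-"):
            # A leading dash marks the pattern as case-insensitive.
            flags = re.IGNORECASE
            line = line[1:]
        patterns.append(re.compile(line, flags))
    return patterns
```

A pattern written as [-]foo starts with a bracket, not a dash, so it stays case-sensitive and matches a literal dash followed by foo.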
There are two types of pattern files: allow lists and replacement lists.
!^http://example.com/foo/bar
^http://example.com
means: Apply the filter to URIs starting with http://example.com/foo/bar, but don't apply the filter to anything else in http://example.com.
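One way to read the allow list example is sketched below in Python. The evaluation order is an assumption, not something this page states: the sketch lets the first matching line decide, which is consistent with the example only because the negative pattern is listed first. The function names are hypothetical, and the case-insensitivity dash is left out for brevity.

```python
import re


def parse_allow_list(lines):
    """Return a list of (compiled_pattern, is_negative) rules."""
    rules = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        negative = line.startswith("!")
        if negative:
            line = line[1:]
        rules.append((re.compile(line), negative))
    return rules


def filter_applies(rules, uri):
    """Decide whether the filter should process this URI.

    Assumption: the first matching line wins. A plain match exempts
    the URI from filtering; a negative ("!") match filters it anyway.
    """
    for pattern, negative in rules:
        if pattern.search(uri):
            return negative
    return True  # no line matched: filter as usual
```

With the two example lines, http://example.com/foo/bar/page.html is filtered, http://example.com/index.html is not, and any URI outside http://example.com is filtered as usual.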
A replacement list rewrites matching URIs in a sed s///-like fashion. This type of pattern file is used by the redirection filter. Each line in the file consists of two elements separated by (at least) one TAB character. The first is a pattern, the second a replacement. The replacement may contain \1, \2... \9 references to parenthesized subpatterns (\0 means the whole match).
A special replacement can be given as a shortcut for
patterns which have no explicit replacement. This default is
specified as replacement for the pattern consisting of a single
exclamation mark, which should be the first line in the file.
Negative match does not work in a replacement list.
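A minimal Python sketch of the replacement list format follows. It is an illustration under assumptions: the function names and URLs are hypothetical, the first matching line is assumed to win, and only the matched part of the URI is replaced (as sed would do).

```python
import re


def parse_replacement_list(text):
    """Return (default_replacement, [(compiled_pattern, replacement_or_None)])."""
    default, rules = None, []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        pattern, _, replacement = line.partition("\t")
        replacement = replacement.lstrip("\t")  # "(at least) one TAB"
        if pattern == "!":
            default = replacement  # shortcut for lines without a replacement
        else:
            rules.append((re.compile(pattern), replacement or None))
    return default, rules


def expand(match, template):
    """Substitute \\0..\\9 in template with the groups of match."""
    def ref(r):
        n = int(r.group(1))
        return (match.group(n) or "") if n <= match.re.groups else ""
    return re.sub(r"\\([0-9])", ref, template)


def rewrite(default, rules, uri):
    """Apply the first matching rule (assumed order) to uri."""
    for pattern, replacement in rules:
        m = pattern.search(uri)
        if m:
            template = default if replacement is None else replacement
            if template is None:
                return uri
            return uri[:m.start()] + expand(m, template) + uri[m.end():]
    return uri
```

For example, a file whose first line is an exclamation mark, a TAB and http://localhost/blocked.gif sends every pattern without its own replacement to that address.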
An anonymize_headers clause must be set up to filter out the Accept-Encoding request header. See below for the exact reason.
Currently there are the following modules:
* as either of its components (such as audio/*) meaning "all". To reject several types, specify this module more than once. Bug: returns "empty" (even without header) to the client instead of an error message.
rejecttype.so (unless one module should affect the other one's allow list files).
script.so, activex.so or any other future HTML filter.
SCRIPT tags, on... handlers and browser-specific ways of inserting JavaScript into tag attributes) from HTML pages. (To also block JavaScript files, use rejecttype.so on "application/x-javascript".) Takes an allow list file name as optional argument.
OBJECT tags from HTML pages. The tags are preserved; only the classid parameter is replaced by a dummy, so the page will still be processed correctly (as if by a non-ActiveX browser). Takes an allow list file name as optional argument.
Each content filter specifies the MIME content type to which it applies (like image/gif for the gifanim module) and ignores all other types.
Content filters can be chained. When more than one filter applies to a given MIME content type, every filter operates on the results of its predecessor. (This will probably become important in later releases.)
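The chaining rule can be pictured with a short Python sketch. It does not reflect the actual module interface; the function names and the two toy filters are hypothetical and only show that each filter sees the output of its predecessor.

```python
def run_chain(filters, content_type, body):
    """Pass body through every filter registered for content_type, in order."""
    for mime_type, func in filters:
        if mime_type == content_type:
            body = func(body)  # each filter works on its predecessor's result
    return body


# Toy filters, for illustration only.
def strip_marker(body):
    return body.replace(b"[ad]", b"")


def collapse_spaces(body):
    return b" ".join(body.split())


filters = [
    ("text/html", strip_marker),
    ("image/gif", lambda body: body),  # ignored for text/html content
    ("text/html", collapse_spaces),
]
result = run_chain(filters, "text/html", b"hello [ad] world")
```

Here collapse_spaces removes the double space left behind by strip_marker, which only works because it runs on the already filtered data.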
Authenticators can be chained. They are tried in the order in which they are configured, until one either succeeds or presents a challenge (see below). An external authenticator, if available, is tried last.
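The order of evaluation described above can be sketched in Python. The result codes, function names and the two toy authenticators are hypothetical; the sketch only shows that the chain stops at the first success or challenge and that an external authenticator comes last.

```python
SUCCESS, CHALLENGE, FAIL = "success", "challenge", "fail"


def authenticate(authenticators, user, password, external=None):
    """Try authenticators in configured order; an external one goes last."""
    chain = list(authenticators)
    if external is not None:
        chain.append(external)
    for auth in chain:
        status, detail = auth(user, password)
        if status in (SUCCESS, CHALLENGE):
            return status, detail  # stop at first success or challenge
    return FAIL, None


# Toy authenticators, for illustration only.
TABLE = {"alice": "secret"}


def table_auth(user, password):
    return (SUCCESS, None) if TABLE.get(user) == password else (FAIL, None)


def challenge_auth(user, password):
    if user == "carol" and password == "":
        return CHALLENGE, "challenge-for-carol"
    return FAIL, None
```

With the chain [table_auth, challenge_auth], a request for carol with an empty password falls through the first authenticator and ends with a challenge from the second.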
/etc/passwd. Only the first two fields are used. Entries which specify empty, locked or invalid passwords are ignored. The file is kept in memory in a hash table, ensuring fast lookup even for big password files. It is reloaded whenever it changes. An optional second argument gives the size of the hash table relative to the number of entries, defaulting to 0.6.
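Loading such a file into an in-memory table might look like the Python sketch below. The function name is hypothetical, and the test for locked or invalid entries is an assumption (a password field starting with * or !, or the shadow placeholder x); the module's actual criteria are not spelled out here.

```python
def load_passwd(text):
    """Map user name to password hash, using only the first two fields."""
    table = {}
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 2:
            continue
        user, pw_hash = fields[0], fields[1]
        # Assumption: these forms count as empty, locked or invalid.
        if not pw_hash or pw_hash[0] in "*!" or pw_hash == "x":
            continue
        table[user] = pw_hash
    return table
```

A Python dict is itself a hash table, so lookups stay fast for large files; the relative table size argument of the module has no counterpart in this sketch.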
This does not use NIS or any other external sources. Because those lookups can block, they have to be programmed differently (in particular, getpwnam() is not safe, and this is the reason why the original authenticators are external programs). The easiest way to get authentication from a NIS password map is to use auth_passwd.so on a regularly updated (via ypcat) copy of the master password file.
Authsrv uses challenge/response schemes, which are not supported directly by HTTP Basic Authentication. As a workaround, the challenge is displayed as the authentication realm. To authenticate using authsrv, the user first has to give the user name with an empty password, get a failure, and retry. The user then gets the challenge as the authentication realm and can answer with the response as password (for the same user name). This works with S/Key, but would fail with an authentication server which gives a different challenge on each request.
This implementation does not use actual code from the FWTK or require it to be installed. It speaks the protocol used in FWTK 2.1.
To bypass the filters for a request, append .X.nofilter to the host name in the URL, where the X is replaced by the Squid's visible host name. Example: to get http://www.example.com/foo/bar unfiltered from a Squid called squid.cache, use the URI http://www.example.com.squid.cache.nofilter/foo/bar.
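The mapping between a NOFILTER URL and the real URL can be sketched in Python. This is not the patch's code; the function name is hypothetical, and the sketch drops any user:password part of the URL for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_nofilter(url, visible_hostname):
    """Return (real_url, bypass) for a URL that may carry the NOFILTER tag."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lower-cased
    suffix = "." + visible_hostname.lower() + ".nofilter"
    if not host.endswith(suffix) or len(host) == len(suffix):
        return url, False  # no tag, or tag for a different proxy name
    real_host = host[:-len(suffix)]
    netloc = real_host if parts.port is None else "%s:%d" % (real_host, parts.port)
    real_url = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return real_url, True
```

Called with the example URI and the name squid.cache, the function yields http://www.example.com/foo/bar together with a flag saying that filtering should be bypassed.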
The NOFILTER tag as part of the hostname in the URL implies that correctly written relative links, including images, linked scripts etc. on the same server, will also be unfiltered. Apply the necessary caution.
The reason for including Squid's host name is to prevent web servers from adding the NOFILTER tag to their junk banner links themselves. This works best when visible_hostname, unique_hostname and the canonical (DNS) host name of the proxy are all different and not too related, because the origin server sees the latter two but not the former.
Since ".nofilter" is not a valid top level domain, it can't clash with real host names.
Another possible way to bypass filters is to use a nofilter_port, as described above. Requests arriving on that port will always bypass all filters.
REF and UNREF macros from src/module.h. These structures are generated via the filters/classdef preprocessor. A module object is an object which is created by a module when it is loaded, and contains data relevant to an instance of the module, which is created by a load_module line. A filter object is an object which is created for a single request. Documentation of this stuff is spread across src/module.h and the individual filters.
src/patfile.h.
Content filters for HTML pages use htmlfilter.c for module framework and HTML parser. This is documented in script.c. In theory the operating system's dynamic linker should take care of the inter-module dependencies this creates, but many dynamic linkers are too stupid, so this has to be loaded manually before any HTML filtering module.
debug_options directive) are used:
Section 92 | Module loader (src/module.c) |
Section 93 | Filter modules |
Section 94 | Library modules (src/patfile.c, filters/htmlfilter.c) |
Section 95 | Authenticator modules |
Level 1   | Error messages |
Level 3   | "Filter caught something" messages |
Level 4   | Initialization/finalization messages |
Level 5   | Initialization/finalization trace |
Level 8   | Minor trace |
Level 9   | Full trace (big!) |
If a content filter gets applied, the patched Squid will ignore Range requests and always send the complete object, since in general filters cannot properly determine ranges. Without Range requests the origin server should refrain from sending Transfer-Encodings which would confuse the filters. See also the next two paragraphs.
script.so applied to a file with compression encoding will silently deliver corrupted files.)
For this reason, the Accept-Encoding headers should always be filtered out with an appropriate anonymize_headers clause. This causes the origin server to always send unencoded data.
The only exceptions to the rule that filtering happens only in the path to the client are those filters which alter the request. This applies to the redirect and the cookies modules.
In a cache hierarchy, a filtering cache should only be placed at the bottom, i.e. where only clients directly access it. If another cache sits between the filter and client, that one will cache filtered pages and break the NOFILTER feature.
filters/auth_passwd.c. Thus they may use arbitrary I/O as long as they arrange for the proper callbacks. Filter modules currently are simple functions; they cannot use callbacks and are expected to avoid blocking I/O (aside from reading config files, which therefore should not be mounted over a network).
The Junkbusters web page has one of the oldest and best known web filters as well as a very comprehensive resources list covering most issues from "What is this all about?" to a list of filtering software (by now most of them are either for Windows or for pay or both, which indicates there is a real demand for filtering).
An up-to-date version of this page is always found at http://sites.inka.de/bigred/devel/squid-filter.html.
patfile.c and module.c).