From: Bill Moseley (moseley@hank.org)
Date: Thu Jan 24 2002 - 19:29:58 CST
At 02:22 PM 01/24/02 -0800, Diwakar Kannan wrote:
>Hi
>
>I need some ideas on how to hook up a search engine to hypermail archives.
I can offer two suggestions. One would be to use swish-e
(http://swish-e.org) and use the CGI script and the included script for
parsing hypermail archives.
The other option is a script I use to manage a number of hypermail archives
which also adds a search box to the archive lists, and manages reindexing
and so on. It needs to be updated for archives that are split into
directories.
If you want reasonably untested code, and unedited docs I can offer up this
script:
> pod2text mail_archive.pl
NAME
mail_archive.pl - Creates and indexes a hypermail archive
SYNOPSIS
mail_archive.pl [options]
Options:
--mode=[create|update|index]
--chdir=dir cd to "dir"
--config=file lists config (default: lists.conf)
--help brief docs
--man full documentation
--test show what would happen
--verbose
DESCRIPTION
mail_archive.pl creates and updates hypermail archives, and is designed
to manage a number of different lists (all configured within a single
configuration file). It also is designed to assist in indexing the
archives for searching with swish-e. Swish-e, and associated files must
be installed.
This script is not designed for use with a very large number of lists,
or where there's a high volume of email traffic, due to the startup
costs of perl.
If you are reading this with the --man option, you might find the
formatting better if you run
perldoc mail_archive.pl
The program can be run in one of three modes:
create mode
In create mode the program scans the list configuration file
(lists.conf by default) and creates an archive directory for each
defined list that doesn't already exist. A hypermail configuration
file is written to this directory, and a symlink is created to the
search CGI script.
If a mailbox directory is defined in the config file (lists.conf) all
mailbox files are imported into the newly created hypermail archive.
The mailbox directory is defined by the mbox_dir setting:
mbox_dir = /path/to/my/mailboxes
No recursion is done when reading the mailbox directory. If a mailbox
file ends in .gz the file will be passed through `gzip(1)' with the
`-dc' flags.
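The .gz handling above can be sketched as follows. This is an illustration in Python, not the script's own Perl, and it uses Python's gzip module rather than piping through `gzip -dc'. The function name open_mailbox and the latin-1 encoding are assumptions made for the example.

```python
import gzip


def open_mailbox(path):
    """Open a mailbox file for reading, decompressing it if the name ends in .gz."""
    if path.endswith(".gz"):
        # Same effect as reading the output of `gzip -dc path`.
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")
```

Both 200112 and 200111.gz can then be read through the same call.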
By default, the program looks for mailbox files that match the
regular expression:
^(\d{6,6})(?:\.gz)?$
That is, it's expecting mbox files to look like:
200112
200111.gz
The pattern used to match files can be defined by the mbox_match
configuration setting. If every file in `mbox_dir' is a mailbox, you
can use a pattern to match all files:
mbox_match = .
Capturing parentheses can be used to capture a *numeric* substring.
This string is used for sorting the mailbox files when reading in
messages (to help put the messages in numeric order by date). The
default pattern of:
^(\d{6,6})(?:\.gz)?$
will extract the six digits (year plus month) and use that for
sorting.
If the captured pattern is not numeric (or not used), then the file
name will be assigned the number zero with regard to sorting. A
warning will be issued if the captured pattern is not numeric. When
two files have the same numeric sort value they will be sorted by
file name.
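The sorting rules above can be sketched like this. It is a Python illustration of the described behaviour, not the script's actual Perl code; the names sort_mailboxes and DEFAULT_MBOX_MATCH are hypothetical, and the warning for non-numeric captures is left out.

```python
import re

# Default pattern quoted in the documentation above.
DEFAULT_MBOX_MATCH = r"^(\d{6,6})(?:\.gz)?$"


def sort_mailboxes(filenames, pattern=DEFAULT_MBOX_MATCH):
    """Return the matching file names ordered by captured number, then by name."""
    regex = re.compile(pattern)
    keyed = []
    for name in filenames:
        match = regex.search(name)
        if not match:
            continue  # the name does not match mbox_match, so it is skipped
        captured = match.group(1) if regex.groups else None
        # A missing or non-numeric capture sorts as zero.
        is_numeric = captured is not None and re.fullmatch(r"[0-9]+", captured)
        keyed.append((int(captured) if is_numeric else 0, name))
    # Tuples compare by number first, then by file name for ties.
    return [name for _, name in sorted(keyed)]
```

With the default pattern, sort_mailboxes(["200112", "200111.gz"]) puts 200111.gz first. With mbox_match set to "." every file gets the sort value zero, so the order falls back to file name.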
Once created, the archive will be indexed by swish-e.
Example:
cd ~/archives
./mail_archive.pl --mode=create --verbose
update mode
Update mode is used to read a *single* message from stdin, and route
it to the correct archive. This makes configuring with procmail
simple:
MAILDIR=$ARCHIVE_DIR
: 0w
| ./mail_archive.pl --mode=update
The mail_archive.pl program will return a non-zero exit status on
messages that are not delivered to a defined mailing list. A non-zero
return will cause procmail to continue processing for the message.
This allows non-defined mail to be delivered normally.
You can avoid this setup and take the more standard approach of
directing mail to the archive via an aliases file, but this setup
allows one command to manage all lists.
This setup is not designed for a very large number of very high
volume lists.
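The routing and exit status behaviour of update mode can be sketched roughly as follows, using the matching rules given under match_string in the list configuration options (longest match string first, case-insensitive, headers checked in order). This is a Python illustration rather than the script's Perl. The function names, the shape of the `lists' mapping, and the choice to loop over headers before match strings are assumptions, and a plain substring test stands in for the regular expression match.

```python
from email import message_from_string

# Default header order quoted in the list configuration options.
DEFAULT_HEADER_ORDER = ["List-Post", "To", "Cc"]


def route_message(raw_message, lists, header_order=DEFAULT_HEADER_ORDER):
    """Return the archive directory of the first matching list, or None.

    `lists` maps a match string (for example the list_email) to an
    archive directory; this mapping is a hypothetical stand-in for lists.conf.
    """
    msg = message_from_string(raw_message)
    # Longest match strings are tried first so the most specific one wins.
    candidates = sorted(lists, key=len, reverse=True)
    for header in header_order:
        values = " ".join(msg.get_all(header, [])).lower()
        for match_string in candidates:
            if match_string.lower() in values:
                return lists[match_string]
    return None


def exit_status(raw_message, lists):
    """0 when a list matched, 1 otherwise (procmail then keeps processing)."""
    return 0 if route_message(raw_message, lists) is not None else 1
```

A wrapper reading one message from stdin would pass sys.stdin.read() to exit_status and hand the result to sys.exit.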
index mode
Index mode is used to reindex the archive with swish-e. This allows
the use of cron to better control how often the archives are checked
for re-indexing. Only archives that have been added to since the last
indexing will be indexed again.
For example, to check every ten minutes:
0,10,20,30,40,50 * * * * ./mail_archive.pl --mode=index --chdir=$HOME/archives
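The documentation does not say how the script decides that an archive has been added to since the last indexing. One plausible approach, shown here purely as an assumption, is to compare file modification times. The function name needs_reindex and the idea of one index file per archive are hypothetical.

```python
import os


def needs_reindex(archive_dir, index_file):
    """True if index_file is missing or older than any file in archive_dir."""
    if not os.path.exists(index_file):
        return True
    index_mtime = os.path.getmtime(index_file)
    for entry in os.scandir(archive_dir):
        # Any message file newer than the index means new mail has arrived.
        if entry.is_file() and entry.stat().st_mtime > index_mtime:
            return True
    return False
```

A cron-driven run would call this for each archive and invoke swish-e only where it returns True.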
INSTALLATION
Create a top-level directory. All the individual list archives will be
created below this directory. The idea is that all paths can then be
relative which makes relocating the archives easy.
For the sake of discussion, we will call the top-level directory:
~/archives
You must also have a reasonably current version of hypermail installed.
You will need a 2.1-dev or later version of swish-e. http://swish-e.org.
It's recommended to build swish-e with both zlib and libxml2 support,
but neither is required. For example:
cd
lwp-download http://swish-e.org/
gzip -dc <name of swish-e tarball>.tar.gz | tar xof -
cd swish-e-<version>
./configure --with-zlib --with-libxml2
make
make test
In the top-level directory place the following files and directories:
swish-e
Copy the swish-e binary from the swish-e/src directory. This needs to
be executable by you and by the web server process. 0755 perms should
work.
cp ~/swish-e-<version>/src/swish-e .
chmod 0755 swish-e
swish.cgi
This is the CGI script included with the swish-e distribution,
located in the swish-e/example directory. Again, it must be executable
by the web server process.
cp ~/swish-e-<version>/example/swish.cgi .
chmod 0755 swish.cgi
Open up swish.cgi in your editor and make sure the first line of the
program points to the location of perl.
modules directory
The swish.cgi script needs a few modules to operate. Copy the modules
directory from the swish-e distribution to the ~/archives directory.
For example:
cd ~/archives
cp -rp ~/swish-e-<version>/example/modules .
These files need read access by the web server.
index_hypermail.pl
Copy the index_hypermail.pl program from the swish-e distribution.
This program parses the hypermail formatted messages.
cd ~/archives
cp ~/swish-e/prog-bin/index_hypermail.pl .
mail_archive.pl
Place this program (mail_archive.pl) also in your top-level directory
(e.g. ~/archives).
Run this program with the --mode=create option:
chmod 755 mail_archive.pl
./mail_archive.pl --mode=create
It will create a few support files if they do not already exist:
lists.conf - configuration file for your lists
swish-e.conf - configuration file used by swish-e
indexheader.html - hypermail template file
msgheader.html - hypermail template file
By default, it is expected that swish-e is compiled with libxml2. If
this is not the case, then you MUST edit swish-e.conf:
Change these lines:
IndexContents HTML2 .html
StoreDescription HTML2 <body> 100000
to:
IndexContents HTML .html
StoreDescription HTML <body> 100000
It is also HIGHLY recommended that you build swish-e with zlib
support for compression of the stored descriptions in the swish-e
index.
Now you are ready to use the mail_archive.pl program.
AUTOMATIC UPDATES
Adding new messages to the archive
Before defining your lists in the lists.conf file, you may want to
enable automatic updates.
When the mail_archive.pl program is run with the `update' mode, it reads
a single message from stdin, and tries to match it up with one of the
active lists in the archive. If no match is found, the program returns a
non-zero exit status.
For example, if all your mail is processed by procmail, you can add this
to your .procmailrc file:
: 0w
| $HOME/archives/mail_archive.pl --mode=update --chdir=$HOME/archives
Each incoming message will be passed through the mail_archive.pl
program, and passed on to hypermail if a list is matched. If no list is
matched and active the program exits with a non-zero exit status, and
procmail will continue processing.
After this is set up you can define lists in your lists.conf file. The
list will be activated when you run the program in create mode after
defining a new list or lists.
Reindexing the archive
You will want to reindex the archive when new messages are added to keep
the swish-e index up to date. Add the following to your crontab:
0,10,20,30,40,50 * * * * cd $HOME/archives && ./mail_archive.pl --mode=index
or the same:
0,10,20,30,40,50 * * * * $HOME/archives/mail_archive.pl --mode=index --chdir=$HOME/archives
Then every ten minutes the program will be run and it will look for any
swish-e indexes that need to be updated.
LIST CONFIGURATION
The lists.conf configuration file defines all your lists. You may define
as many lists as you like. After defining a new list (or lists) run with
the `--mode=create' option to create the new list. Only new lists are
operated on when running in create mode.
Note: You will probably want to have list messages delivered to the
mail_archive.pl program before actually creating a new list with the
`--mode=create' option to avoid missing any messages. This is discussed
above.
The format of the configuration file is described in the configuration
file itself.
A configuration file template should have been created automatically in
the INSTALLATION section above, but if the configuration file does not
exist, simply create a new config by running this program:
mail_archive.pl --mode=create
This creates the default configuration file lists.conf.
Or, to specify a configuration file:
mail_archive.pl --mode=create --config=mylists.conf
Open the configuration file with your favorite editor, and define your
lists.
Blank lines and lines that begin with a "#" are ignored.
The configuration file contains a section for every defined list.
Sections are defined by placing the description of the list in brackets,
followed by configuration settings. Leading white space may be used.
#------------- pigs -------------------------------
[ Pig Lovers List Archive ]
list_email = pig-lovers@piggiesweare.com
archive_dir = bacon
strip_subject = [Pigs Discussion]
mbox_dir = /path/to/mbox/pigs
mbox_match = ^pigs(\d{6,6})$
hypermail_opts = gmtime=On, showhtml=1
header_order = List-Post To Cc
Not all config options are shown above, and not all are required. You
can have as many sections as you like. Other than `hypermail_opts', you
may not repeat a config option in a section.
You can disable a list simply by placing a ! at the start of the list
name:
# Disable for now
[! Pig Lovers List Archive ]
List Configuration Options
list_email (required)
`list_email' is the email address of the specific list. It's used as
the name of the archive directory unless `archive_dir' is defined,
and is used for matching new messages up with the correct list (for
routing a new mail message to the correct list) unless `match_string'
is defined. See `match_string' for how the matching works.
archive_dir (optional)
Defines the hypermail archive directory. This should be a relative
directory (e.g. relative to ~/archives in the examples above).
match_string (optional)
String used to match an incoming message to a list. If this is not set
`list_email' is used.
All the match strings for all the lists are sorted from longest to
shortest strings, then the string is matched with a case-insensitive
regular expression against the mail headers.
By default the headers are searched in this order:
List-Post:
To:
Cc:
This list can be changed by the `header_order' setting.
Currently, Received headers are ignored.
header_order (optional)
Defines the order headers are checked for the match string. Case is
not important. Do not end the headers with ':'.
header_order = List-Name To Cc
mbox_dir (optional)
If specified, files listed in this directory will be used to
initialize the list's archive. See above for more information.
mbox_match (optional)
Defines the perl regular expression to use to match against file
names in the `mbox_dir' directory. See above for more information.
strip_subject (optional)
This simply passes on the setting to hypermail.
hypermail_opts (optional)
Defines parameters that are passed on directly to the hypermail
configuration file for this list. The settings must be separated by a
comma. This setting may be repeated on more than one line.
By default, the settings used are:
deleted = X-blabla
gmtime = On
showhtml = 0
warn_surpressions = On
Any setting you specify will override these settings.
Example:
hypermail_opts = gmtime=On, showhtml=1
hypermail_opts = spamprotect = On
hmrc (optional)
Hypermail config file to use. No need to change this. The default is
to use .hmrc in the list's directory.
To test your new configuration additions:
./mail_archive.pl --mode=create --verbose --test
which will display what will happen. To actually create the list(s) run:
./mail_archive.pl --mode=create --verbose
WEB SETUP
Ok, it's not completely automatic.
It's up to you how to link the archives to your web site.
One suggestion:
cd /usr/local/apache/htdocs
mkdir archives
cd archives
ln -s $HOME/archives/somelist
ln -s $HOME/archives/otherlist
...
AUTHOR
Bill Moseley - moseley@hank.org
--
Bill Moseley
mailto:moseley@hank.org