ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for
license information.
See the sample htdig.conf file for some examples of usage.
<SELECT NAME="search_algorithm"> <OPTION VALUE="exact:1 prefix:0.6 synonyms:0.5 endings:0.1" SELECTED>fuzzy <OPTION VALUE="exact:1">exact </SELECT> |
allow_in_form: search_algorithm search_results_header |
bad_querystr: forum=private section=topsecret&passwd=required |
bad_word_list: ${common_dir}/badwords.txt |
The default value of this attribute is determined at compile time.
build_select_lists: |
MATCH_LIST matchesperpage matches_per_page_list \ 1 1 1 matches_per_page "Previous Amount" \ RESTRICT_LIST,multiple restrict restrict_names 2 1 2 restrict "" \ FORMAT_LIST,radio format template_map 3 2 1 template_name "" |
common_url_parts: |
http://www.htdig.org/ml/ \ .html \ http://www.htdig.org/ |
The default value of this attribute is determined at compile time.
The default value of this attribute is determined at compile time.
description_meta_tag_names: htdig-description description |
doc_db: ${database_base}documents.db |
endings_affix_file: /var/htdig/affix_rules |
endings_dictionary: /var/htdig/dictionary |
endings_root2word_db: /var/htdig/r2w.db |
endings_word2root_db: /var/htdig/w2r.bm |
The two main internal parsers are for text/html and text/plain. There is also a simple parser for application/pdf, described under pdf_parser, which is quite limited and is typically overridden with an external one.
The parser program takes four command-line
parameters, not counting any parameters already
given in the command string:
infile content-type URL configuration-file
Parameter | Description | Example |
---|---|---|
infile | A temporary file with the contents to be parsed. | /var/tmp/htdext.14242 |
content-type | The MIME-type of the contents. | text/html |
URL | The URL of the contents. | http://www.htdig.org/attrs.html |
configuration-file | The configuration-file in effect. | /etc/htdig/htdig.conf |
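As a rough illustration of the calling convention above, the following Python sketch picks the four parameters out of an argument list. It assumes that htdig appends them after any parameters already present in the command string, so it reads them from the end. The names `ParserArgs` and `parse_parser_args` are hypothetical and not part of ht://Dig.

```python
from typing import List, NamedTuple


class ParserArgs(NamedTuple):
    """The four parameters listed in the table above (hypothetical container)."""
    infile: str
    content_type: str
    url: str
    config_file: str


def parse_parser_args(argv: List[str]) -> ParserArgs:
    """Return the last four arguments as ParserArgs.

    Assumption: parameters from the command string come first, so the
    four values supplied by htdig are the final four arguments.
    """
    if len(argv) < 4:
        raise ValueError("expected: infile content-type URL configuration-file")
    return ParserArgs(*argv[-4:])
```

A real parser script would call `parse_parser_args(sys.argv[1:])` and then read the file named by `infile`.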
The external parser must write information for
htdig to its standard output. Unless it is an
external converter, which outputs a document
of a different content-type, its output must
follow the format described here.
The output consists of records, each terminated
by a newline. Each record is a series of tab-separated
fields, which must be non-empty unless expressly
allowed to be empty. The first field is a single character
that specifies the record type. The remaining fields
are determined by the record type.
Record type | Fields | Description |
---|---|---|
w | word | A word that was found in the document. |
 | location | A number indicating the normalized location of the word within the document. The number has to fall in the range 0-1000, where 0 means the top of the document. |
 | heading level | A heading level that is used to compute the weight of the word depending on its context in the document itself. The level is in the range 0-10. |
u | document URL | A hyperlink to another document that is referenced by the current document. It must be complete and non-relative, using the URL parameter to resolve any relative references found in the document. |
 | hyperlink description | For HTML documents, this would be the text between the <a href...> and </a> tags. |
t | title | The title of the document. |
h | head | The top of the document itself. This is used to build the excerpt. It should contain only normal ASCII text. |
a | anchor | The label that identifies an anchor that can be used as a target in a URL. This really only makes sense for HTML documents. |
i | image URL | A URL that points at an image that is part of the document. |
m | http-equiv | The HTTP-EQUIV attribute of a META tag. May be empty. |
 | name | The NAME attribute of this META tag. May be empty. |
 | contents | The CONTENT attribute of this META tag. May be empty. |
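To make the record format more concrete, here is a minimal Python sketch that turns plain text into `t`, `w` and `h` records. The function names, the `head_length` limit, the order in which records are emitted, and the use of heading level 0 for every word are all assumptions made for illustration; the table above does not say what each heading level means.

```python
import re
from typing import List


def format_record(record_type: str, *fields: object) -> str:
    """Build one record: type character, tab-separated fields, trailing newline."""
    # Tabs and newlines inside a field would break the record, so replace them.
    cleaned = [str(f).replace("\t", " ").replace("\n", " ") for f in fields]
    return "\t".join([record_type, *cleaned]) + "\n"


def text_to_records(text: str, title: str, head_length: int = 100) -> List[str]:
    """Return a title record, one word record per word, and a head record."""
    records = [format_record("t", title)]
    total = max(len(text), 1)
    for match in re.finditer(r"\w+", text):
        # Normalize the character offset into the 0-1000 range.
        location = min(1000, match.start() * 1000 // total)
        # Heading level 0 is a placeholder (assumption, see lead-in).
        records.append(format_record("w", match.group(), location, 0))
    # Collapse whitespace and truncate to form the excerpt source.
    records.append(format_record("h", " ".join(text.split())[:head_length]))
    return records
```

An external parser would write these strings to its standard output, for example with `sys.stdout.writelines(text_to_records(body, title))`.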
external_parsers: |
text/html /usr/local/bin/htmlparser \ application/pdf /usr/local/bin/parse_doc.pl \ application/msword->text/plain "/usr/local/bin/mswordtotxt -w" \ application/x-gunzip->user-defined /usr/local/bin/ungzipper |
htnotify_prefix_file: ${common_dir}/notify_prefix.txt |
htnotify_replyto: [email protected] |
htnotify_sender: [email protected] |
htnotify_suffix_file: ${common_dir}/notify_suffix.txt |
htnotify_webmaster: Notification Service |
http_proxy: http://proxy.bigbucks.com:3128 |
http_proxy_exclude: http://intranet.foo.com/ |
The default value of this attribute is determined at compile time.
keywords_meta_tag_names: keywords description |
limit_normalized: http://www.mydomain.com |
local_default_doc: default.html default.htm index.html index.htm |
local_urls: http://www.foo.com/=/usr/www/htdocs/ |
local_user_urls: http://www.my.org/=/home/,/www/ |
metaphone_db: ${database_base}.mp.db |
next_page_text: <img src="/htdig/buttonr.gif"> |
no_page_list_header: <hr noshade size=2>All results on this page.<br> |
no_page_number_text: |
<strong>1</strong> <strong>2</strong> \ <strong>3</strong> <strong>4</strong> \ <strong>5</strong> <strong>6</strong> \ <strong>7</strong> <strong>8</strong> \ <strong>9</strong> <strong>10</strong> |
nothing_found_file: /www/searching/nothing.html |
page_number_text: |
<em>1</em> <em>2</em> \ <em>3</em> <em>4</em> \ <em>5</em> <em>6</em> \ <em>7</em> <em>8</em> \ <em>9</em> <em>10</em> |
The program is supposed to convert to a variant of PostScript, which is then parsed internally. Currently, only Adobe's acroread program has been tested as a pdf_parser. The default value of path is determined at compile time, to include the path to the acroread executable. This defaults to /usr/local/bin if the configuration program can't find acroread.
To successfully index PDF files, be sure to set the max_doc_size attribute to a value larger than the size of your largest PDF file. PDF documents cannot be parsed if they are truncated.
Note: There is a bug in Acrobat 4's acroread command, which causes it to fail when -pairs is used. Ht://Dig version 3.1.3 and later include a work-around for this bug such that when acroread is the parser, and the -pairs option is not given, the second parameter will be the output directory rather than the output file name.
The pdftops program that is part of the xpdf package is not suitable as a pdf_parser, because its variant of PostScript is slightly different. However, an alternative is to use xpdf's pdftotext program as a component of an external parser with the xpdf 0.90 package installed on your system, as described in FAQ question 4.9.
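One way to follow the max_doc_size advice above is to measure the largest PDF before choosing a value. The sketch below assumes the PDF files are also reachable on a local filesystem and that max_doc_size is expressed in bytes; `largest_pdf_size` is a hypothetical helper, not part of ht://Dig.

```python
import os


def largest_pdf_size(root: str) -> int:
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest
```

The result is a lower bound: max_doc_size should be set to something larger than the returned number.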
prev_page_text: <img src="/htdig/buttonl.gif"> |
remove_default_doc: default.html default.htm index.html index.htm
or remove_default_doc: |
script_name: /search/results.shtml |
search_algorithm: exact:1 soundex:0.3 |
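Judging from the examples in this document, a search_algorithm value is a whitespace-separated list of `name:weight` pairs. The following Python sketch splits such a value into a dictionary; it illustrates the apparent syntax only and is not how htsearch itself parses the attribute. The function name is hypothetical.

```python
from typing import Dict


def parse_search_algorithm(value: str) -> Dict[str, float]:
    """Split 'name:weight name:weight ...' into a {name: weight} mapping."""
    weights: Dict[str, float] = {}
    for item in value.split():
        name, sep, weight = item.partition(":")
        if not sep or not name:
            raise ValueError(f"expected name:weight, got {item!r}")
        weights[name] = float(weight)
    return weights
```

For the example above, `parse_search_algorithm("exact:1 soundex:0.3")` yields a mapping with `exact` weighted 1.0 and `soundex` weighted 0.3.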
search_results_contenttype: text/xml |
search_results_footer: /usr/local/etc/ht/end-stuff.html |
search_results_header: /usr/local/etc/ht/start-stuff.html |
search_results_wrapper: ${common_dir}/wrapper.html |
search_rewrite_rules: |
http://(.*)\\.mydomain\\.org/([^/]*) http://\\2.\\1.com \ http://www\\.myschool\\.edu/myorgs/([^/]*) http://\\1.org |
server_aliases: |
foo.mydomain.com:80=www.mydomain.com:80 \ bar.mydomain.com:80=www.mydomain.com:80 |
sort_names: |
score 'Best Match' time Newest title A-Z \ revscore 'Worst Match' revtime Oldest revtitle Z-A |
star_blank: http://www.somewhere.org/icons/elephant.gif |
star_image: http://www.somewhere.org/icons/elephant.gif |
star_patterns: |
http://www.sdsu.edu /sdsu.gif \ http://www.ucsd.edu /ucsd.gif |
start_url: http://www.somewhere.org/alldata/index.html |
synonym_dictionary: /usr/dict/synonyms |
syntax_error_file: ${common_dir}/synerror.html |
template_map: |
Short short ${common_dir}/short.html \ Normal normal builtin-long \ Detailed detail ${common_dir}/detail.html |
template_patterns: |
http://www.sdsu.edu ${common_dir}/sdsu.html \ http://www.ucsd.edu ${common_dir}/ucsd.html |
url_part_aliases: |
http://search.example.com/~htdig/ *site \ http://www.htdig.org/this/ *1 \ .html *2 |
url_part_aliases: |
http://www.htdig.org/ *site \ http://www.htdig.org/that/ *1 \ .htm *2 |
url_rewrite_rules: |
(.*)\\?JServSessionIdroot=.* \\1 \ (.*)\\&JServSessionIdroot=.* \\1 \ (.*)&context=.* \\1 |
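The effect of the rules in this example can be tried out with Python's `re` module. This is a sketch under several assumptions: that the doubled backslashes are configuration-file escaping for single ones, that every rule is applied in order, and that Python's regular expression syntax is close enough for these patterns. The example.com URLs and the name `rewrite_url` are hypothetical.

```python
import re
from typing import List, Tuple

# (pattern, replacement) pairs mirroring the example above, written with
# single backslashes. The "&" is left unescaped, which matches the same text.
RULES: List[Tuple[str, str]] = [
    (r"(.*)\?JServSessionIdroot=.*", r"\1"),
    (r"(.*)&JServSessionIdroot=.*", r"\1"),
    (r"(.*)&context=.*", r"\1"),
]


def rewrite_url(url: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Apply each rule in order; URLs that match no rule are returned unchanged."""
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url, count=1)
    return url
```

With these rules, a URL such as `http://www.example.com/page?JServSessionIdroot=abc123` is reduced to `http://www.example.com/page`, so the same page is not treated as a new document for each session.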
word_list: ${database_base}.allwords.text |