Log 2017

December 2016

[log merged here afterwards — May 2017]

Extracted and built the source tree in ~marc/tmp/htdig-3.1.6
I made minor but systematic changes such as:


htdig-3.1.6> diff htsearch/Display.cc~ htsearch/Display.cc
19c19
< #include <fstream.h>
---
> #include <fstream>
27c27,28
< 
---
> #include <iostream>
> using namespace std;

Installed to /opt/www/htdig (config file: /opt/www/htdig/conf/htdig.conf)

Run with (problem not investigated, and obviously not critical!? Possibly related to my changes for gcc 4.6.3):


tmp> sudo /opt/www/htdig/bin/rundig
DB2 problem...: PANIC: Invalid argument
Segmentation fault
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
Result (example):

tmp> ll /opt/www/htdig/db
total 36036
drwxr-xr-x 2 root root     4096 Mar  4 17:22 .
drwxr-xr-x 6 root root     4096 Dec 10 19:07 ..
-rw-r--r-- 1 root root  7359488 Mar  4 17:22 db.docdb
-rw-r--r-- 1 root root   207872 Mar  4 17:22 db.docs.index
-rw-r--r-- 1 root root 13242638 Mar  4 17:22 db.wordlist
-rw-r--r-- 1 root root 16074752 Mar  4 17:22 db.words.db

5.3

Created a git repo for htdig.
Saved the extraction of the plain 3.1.6 tar there, as the origin of the master branch.
Tagged it as 3.1.6.
Created from this state a dev branch, and committed my own changes, as well as Tanya's fix to rewrite the rules.
Now... the changes already mix changes I could publish (adapt to contemporary C++) and local configuration ones...

Run configure (note: debug not enabled!) with:


htdig> ./configure --with-image-dir=/var/www/htdig --with-search-dir=/var/www/htdig
in order to preserve the CONFIG file produced (and which I checked in).
Note: configure took some time to determine whether to use:

checking if we should use the included regex?... yes

Attempted to build.
Got errors in aclocal.m4, because of an upgrade of autoconf from 1.13 to 2.69.
Reran aclocal, configure, make.
This worked. Committed the new aclocal.m4 and configure...

Took a backup, and ran install:


htdig> ll /opt/www/htdig/db
total 36036
drwxr-xr-x 2 root root     4096 Mar  4 17:22 .
drwxr-xr-x 6 root root     4096 Dec 10 19:07 ..
-rw-r--r-- 1 root root  7359488 Mar  4 17:22 db.docdb
-rw-r--r-- 1 root root   207872 Mar  4 17:22 db.docs.index
-rw-r--r-- 1 root root 13242638 Mar  4 17:22 db.wordlist
-rw-r--r-- 1 root root 16074752 Mar  4 17:22 db.words.db
htdig> sudo /opt/www/htdig/bin/rundig 
DB2 problem...: Unable to allocate 1936618136 bytes from mpool shared region: Cannot allocate memory

DB2 problem...: Unable to allocate 1936618136 bytes from mpool shared region: Cannot allocate memory

...
DB2 problem...: Unable to allocate 1936618136 bytes from mpool shared region: Cannot allocate memory

DB2 problem...: Unable to allocate 1936618136 bytes from mpool shared region: Cannot allocate memory
[5000 lines, interrupted with Ctrl-C]
htdig> ll /opt/www/htdig/db
total 37316
drwxr-xr-x 2 root root     4096 Mar  6 21:25 .
drwxr-xr-x 6 root root     4096 Dec 10 19:07 ..
-rw-r--r-- 1 root root  7361536 Mar  6 21:24 db.docdb
-rw-r--r-- 1 root root   207872 Mar  4 17:22 db.docs.index
-rw-r--r-- 1 root root    11984 Mar  6 21:25 db.log
-rw-r--r-- 1 root root 14537589 Mar  6 21:25 db.wordlist
-rw-r--r-- 1 root root 16074752 Mar  4 17:22 db.words.db

12.3

Created a dbg branch in ~/git/htdig, and switched to it.
Configure to deploy under ~/tst, with debug enabled (in configure), and the start_url as http://berry314/test

htdig> ./configure --prefix=$HOME/tst --with-image-dir=/var/www/htdig --with-search-dir=/var/www/htdig
...
htdig> make
...
defaults.cc:195:1: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
...
htdig> make install
Maybe fixed this warning for this file (in htlib/Confgure.h), but there are still many occurrences elsewhere.
It looks like the installation of htdig.conf went to two locations -- i.e. also overwrote the default copy... Restored. Protected with chown.

htdig> ~/tst/bin/rundig 
DB2 problem...: /opt/www/htdig/db/db.docdb: Permission denied
htdig: Unable to open/create document database '/opt/www/htdig/db/db.docdb'

htmerge: Unable to open word list file '/home/marc/tst/db/db.wordlist'.
  Did you index anything?
  Check your config file and try running htdig again.

DB2 problem...: /home/marc/tst/db/db.docdb: No such file or directory
  C-c C-c
This is a bug... Trying:

htdig> ./configure --prefix=$HOME/tst --with-image-dir=/var/www/htdig --with-search-dir=$HOME/tst/htdig
The next warning of the same kind is for htfuzzy.cc:84 and String.h.
Started to fix the warnings... but got interrupted. Committed the changes to the dbg branch.

13.3

Continued...
Found ./contrib/htparsedoc/catdoc.c with Cyrillic KOI-8 encodings!

17-18.3

Downloaded htdig-3.2.0b6.tar.gz in /tmp [but lost later]
Built the tst branch without const warnings
Installed

...
make[1]: Entering directory '/home/marc/git/htdig/htsearch'
transform=s,x,x,
/usr/bin/install -c htsearch /opt/www/cgi-bin/`echo htsearch | sed ''`
/usr/bin/install: cannot remove `/opt/www/cgi-bin/htsearch': Permission denied
Makefile:24: recipe for target 'install' failed
make[1]: *** [install] Error 1
make[1]: Leaving directory '/home/marc/git/htdig/htsearch'
...
ran:

htdig> /home/marc/tst/bin/rundig
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
htdig> ll ~/tst/db
total 24
drwxr-xr-x 2 marc marc 4096 Mar 18 18:32 .
drwxr-xr-x 8 marc marc 4096 Mar 18 18:27 ..
-rw-r--r-- 1 marc marc 2048 Mar 18 18:30 db.docdb
-rw-r--r-- 1 marc marc 2048 Mar 18 18:30 db.docs.index
-rw-r--r-- 1 marc marc  297 Mar 18 18:30 db.wordlist
-rw-r--r-- 1 marc marc 2048 Mar 18 18:30 db.words.db
htdig> sudo mkdir /opt/www/cgi-bin/tst
htdig> sudo chown marc /opt/www/cgi-bin/tst
htdig> cp htsearch/htsearch /opt/www/cgi-bin/tst/
Edited the test page so that it uses this search, but this doesn't work. The server replies that it doesn't find the script. And from the command line, the script finds no match for simple words.
It looks like the command aborting is:

htdig> ~/tst/bin/htnotify
...
htdig> gdb ~/tst/bin/htnotify
...
(gdb) r
Starting program: /home/marc/tst/bin/htnotify 
Traceback (most recent call last):
  File "/usr/lib/debug/usr/lib/arm-linux-gnueabihf/libstdc++.so.6.0.19-gdb.py", line 63, in <module>
    from libstdcxx.v6.printers import register_libstdcxx_printers
ImportError: No module named libstdcxx.v6.printers
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.
0xb6d2f8dc in raise () from /lib/arm-linux-gnueabihf/libc.so.6
(gdb) bt
#0  0xb6d2f8dc in raise () from /lib/arm-linux-gnueabihf/libc.so.6
#1  0xb6d3365c in abort () from /lib/arm-linux-gnueabihf/libc.so.6
#2  0xb6eebc0c in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#3  0xb6eebc0c in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It dies within readPreAndPostamble, and never reaches line 209
The failing allocation is: Data = new char[Allocated]; (line 584 in String.cc)

(gdb) p Allocated
$1 = 4
(gdb) bt
#0  0x00013b28 in allocate_space (this=0x5c0ec, len=<optimized out>)
    at String.cc:584
#1  String::allocate_space (this=0x5c0ec, len=2) at String.cc:570
#2  0x00013d04 in String::append (this=0x5c0ec, ch=<optimized out>)
    at String.cc:166
#3  0x000131f4 in operator<< (ch=<optimized out>, this=0x5c0ec)
    at htString.h:208
#4  ParsedString::get (this=0x5c0d8, dict=...) at ParsedString.cc:187
#5  0x00010da4 in Configuration::AddParsed (this=0x56c28, 
    name=0x49330 "locale", value=<optimized out>) at Configuration.cc:189
#6  0x0001160c in Configuration::Defaults (this=0x56c28, 
    array=<optimized out>) at Configuration.cc:398
#7  0x0000aafc in main (ac=1, av=0xbefffc84) at htnotify.cc:103
Only not this (first) time... Rather:

(gdb) c
Continuing.
Catchpoint 7 (exception caught), __cxa_begin_catch ()
    at ../../../../src/libstdc++-v3/libsupc++/eh_catch.cc:41
41	../../../../src/libstdc++-v3/libsupc++/eh_catch.cc: No such file or directory.
(gdb) bt
#0  __cxa_begin_catch ()
    at ../../../../src/libstdc++-v3/libsupc++/eh_catch.cc:41
#1  0xb6f1f324 in __cxa_throw ()
    at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:86
#2  0xb6f1f96c in operator new(unsigned int) ()
    at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:56
#3  0xb6f1fa24 in operator new[](unsigned int) ()
    at ../../../../src/libstdc++-v3/libsupc++/new_opv.cc:32
#4  0x00013b28 in allocate_space (this=0x56b94, len=<optimized out>)
    at String.cc:584
#5  String::allocate_space (this=0x56b94, len=268435457) at String.cc:570
#6  0x00013cac in String::reallocate_space (this=0x56b94, len=<optimized out>)
    at String.cc:614
#7  0x00013d04 in String::append (this=0x56b94, ch=<optimized out>)
    at String.cc:166
#8  0x0000b708 in operator<< (ch=10 '\n', this=0x56b94)
    at ../htlib/htString.h:208
#9  readPreAndPostamble () at htnotify.cc:202
#10 0x0000abe4 in main (ac=1, av=<optimized out>) at htnotify.cc:139
...
(gdb) up
#4  0x00013b28 in allocate_space (this=0x56b94, len=<optimized out>)
    at String.cc:584
584	    Data = new char[Allocated];
(gdb) p Allocated
$6 = 536870912

2.4

Found that the cause of the memory allocation failure was the test that htnotify_prefix_file was 'NULL', when it was in fact equal to "".
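The distinction behind this bug can be sketched as follows. This is only an illustration, not the htnotify code: has_text and should_read_prefix_file are hypothetical helpers, and the sketch assumes the configuration lookup hands back a const char* that may be either null or an empty string.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of htdig): a C string carries text only if
// the pointer is non-null AND its first character is not the terminator.
// Testing the pointer alone treats "" as a usable value.
inline bool has_text(const char* s)
{
    return s != NULL && *s != '\0';
}

// Hypothetical caller: decide whether a configured prefix file should be
// opened at all. An empty name means "no file configured".
inline bool should_read_prefix_file(const char* configured_name)
{
    return has_text(configured_name);
}
```

With such a test, both a missing attribute and an attribute set to "" skip the file, instead of attempting to read from an empty file name.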
Added links to the Release notes and the design documentation to the local htdig page.
Now I could run ~/tst/bin/rundig.
I tested that this doesn't disturb the normal search, but the one from the test page fails to find /cgi-bin/tst/htsearch
Found from /var/log/apache2/error.log successively:

[Sun Apr 02 18:14:44 2017] [error] [client 192.168.1.9] script not found or unable to stat: /usr/lib/cgi-bin/tstsearch, referer: http://berry314.dyndns-pics.com/test/
[Sun Apr 02 18:24:37 2017] [error] [client 192.168.1.9] Symbolic link not allowed or link target not accessible: /usr/lib/cgi-bin/tst, referer: http://berry314.dyndns-pics.com/test/
So, fixed, with a copy of the ~marc/git/htsearch/htsearch file.
Now though, the search engine fails to find any strings from the test page.
Note that one can test from the command line, which will be easier for debugging:

htdig> ./htsearch/htsearch -c ~/tst/conf/htdig.conf 
Enter value for words: default
Content-type: text/html

Enter value for format: short
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>No match for '(defaulted or defaulting or defaulter or defaults)'</title></head>
...
I also checked that the queried word was in the db:

htdig> grep default ~/tst/db/db.wordlist 
default	i:1	l:42	w:958
Late, so abandoning, but logging what I attempted:

htdig> gdb ./htsearch/htsearch
...
(gdb) b 318
...
(gdb) run -c ~/tst/conf/htdig.conf
...
Enter value for words: default

Breakpoint 1, main (ac=<optimized out>, av=<optimized out>) at htsearch.cc:318
318	    ResultList	*results = htsearch(word_db, searchWords, parser);

May 14

Could reproduce the previous status, but failed to print the value of variables under gdb. Found a hit in Google. Switched from wheezy to jessie in /etc/apt/sources.list and under sources.list.d
Looks like collabora.list is not found with jessie.

htdig> sudo apt-get update
...
Reading package lists... Done
N: Ignoring file 'collabora.list.jessie' in directory '/etc/apt/sources.list.d/' as it has an invalid filename extension
N: Ignoring file 'raspi.list.wheezy' in directory '/etc/apt/sources.list.d/' as it has an invalid filename extension
W: Ignoring Provides line with DepCompareOp for package pypy-cffi
W: Ignoring Provides line with DepCompareOp for package pypy-cffi-backend-api-max
W: Ignoring Provides line with DepCompareOp for package pypy-cffi-backend-api-min
W: You may want to run apt-get update to correct these problems
htdig> sudo apt-get upgrade
...
Ahum... Started to read a bit late https://www.raspberrypi.org/forums/viewtopic.php?f=66&t=121880

~> rm ph.tgz 
~> du -sh .
1.4G	.
~> tar zfc /tmp/marc.tgz .
~> cd /var/www
www> sudo rm -rf tmfish.bak
www> sudo du -sh .
3.8M	.
www> sudo tar zfc /tmp/www.tgz .
www> cd ~tanya
tanya> sudo tar zfc /tmp/tanya.tgz .
Uploaded to Google drive.
Followed the instructions... 1-4 (created the pi account back)

apt> sudo adduser --disabled-password  --disabled-login pi
And then ...5

apt> sudo apt-get dist-upgrade
...
Configuring wicd-daemon
-----------------------

Users who should be able to run wicd clients need to be added to the group 
"netdev".

  1. marc  2. pi  3. Sergey  4. tanya

(Enter the items you want to select, separated by spaces.)

Users to add to the netdev group: marc pi tanya
...
Installing new version of config file /etc/init.d/procps ...

Configuration file '/etc/sysctl.conf'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** sysctl.conf (Y/I/N/O/D/Z) [default=N] ? y
...
Configuration file '/etc/login.defs'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** login.defs (Y/I/N/O/D/Z) [default=N] ? y
...
Configuration file '/etc/dphys-swapfile'
 ==> File on system created by you or by a script.
 ==> File also in package provided by package maintainer.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** dphys-swapfile (Y/I/N/O/D/Z) [default=N] ? y
...
apt> sudo shutdown -r now
One fsck and a reconfig of default DocumentRoot as /var/www/html later...

318	    ResultList	*results = htsearch(word_db, searchWords, parser);
(gdb) x searchWords
Value can't be converted to integer.
(gdb) x/1s searchWords
Value can't be converted to integer.
(gdb) x/1s *searchWords
No symbol "operator*" in current context.
(gdb) whatis searchWords
type = List
(gdb) x/1s searchWords.current
0xa71b8:	""
(gdb) x/1s searchWords.head
0xa6f80:	"Xo\n"
(gdb) x/1s searchWords.tail
0xa71b8:	""

May 20


(gdb) whatis word_db
type = String
(gdb) x/1s word_db.Data
0xa6248:	"/home/marc/tst/db/db.words.db"

htdig> strings /home/marc/tst/db/db.words.db | wc -l
20
htdig> strings /home/marc/tst/db/db.words.db | grep -C 2 default
long
format
default
content
boolean

htdig> gdb ./htsearch/htsearch
(gdb) b 275
(gdb) run -c ~/tst/conf/htdig.conf
Starting program: /home/marc/git/htdig/htsearch/htsearch -c ~/tst/conf/htdig.conf
Enter value for words: default

Breakpoint 1, main (ac=<optimized out>, av=<optimized out>) at htsearch.cc:276
276		       strcmp(config["match_method"], "boolean") == 0,
(gdb) x /1s originalWords.Data
0x79220:	"default"
...
(gdb) n
288	    origPattern += logicalPattern;
(gdb) x /1s logicalPattern.Data
0xa5e78:	"defaulted|defaulting|defaulter|defaults"
(gdb) x /1s logicalWords.Data
0xa6478:	"(defaulted or defaulting or defaulter or defaults)"
I set debug to 2 in htsearch.cc, and rebuilt.

htdig> ./htsearch/htsearch -c ~/tst/conf/htdig.conf
Enter value for words: default
tempWords: 'default:0 '
Boolean: 'default:0 '
initial: ''
Fuzzy on: default
   exact
   synonyms
   endings defaulted defaulting defaulter defaults
searchWords: '(:0 defaulted:0 |:0 defaulting:0 |:0 defaulter:0 |:0 defaults:0 ):0 '
LogicalWords: (defaulted or defaulting or defaulter or defaults)
Pattern: defaulted|defaulting|defaulter|defaults
Content-type: text/html

Enter value for format: short
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>No match for '(defaulted or defaulting or defaulter or defaults)'</title></head>
<body bgcolor="#eef7ff">
...
BTW: I update IMAGE_DIR in CONFIG to /var/www/html/htdig
Should do the same in the main branch... before building next time.
Found that the code has been built with -O2
htdig> make CXXFLAGS="-g -O0"
This seems to recompile only the top binaries...
No real progress...
(gdb) x/1s searchWords.head.object
0xb9050:	"x\256\006"
(gdb) x/1s searchWords.head.next.object
0xb8eb0:	"x\256\006"
(gdb) x/1s searchWords.head.next.next.object
0xb9088:	"x\256\006"
(gdb) x/1s searchWords.tail.object
0xb9170:	"x\256\006"
(gdb) x/1s searchWords.current.object
0xb9170:	"x\256\006"
I reset both the debug value and CXXFLAGS...

In file included from regex.c:215:0:
../htlib/gregex.h:530:0: warning: "__restrict_arr" redefined
 #define __restrict_arr
 ^
In file included from /usr/include/features.h:374:0,
                 from /usr/include/arm-linux-gnueabihf/sys/types.h:25,
                 from regex.c:46:
/usr/include/arm-linux-gnueabihf/sys/cdefs.h:363:0: note: this is the location of the previous definition
 # define __restrict_arr __restrict
 ^
...
SGMLEntities.cc: In member function ‘void SGMLEntities::init()’:
SGMLEntities.cc:178:56: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  trans->Add(entities[i].entity, (Object *) entities[i].equiv);
                                                        ^
...
words.cc: In function ‘void mergeWords(const char*, const char*)’:
words.cc:112:10: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
      sid = "-";
          ^
...

May 21

I cloned my repo in order to have a reference tree in the master branch at hand, to compare the value of the list —not so easy to examine under gdb.

git> git clone htdig hdmst
git> cd hdmst/
hdmst> git checkout master
hdmst> git cherry-pick dbg
error: could not apply 2f9112d... RootDirectory changed from apache 2.2 to 2.4
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: and commit the result with 'git commit'
hdmst> git status
On branch master
Your branch is up-to-date with 'origin/master'.
You are currently cherry-picking commit 2f9112d.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)

	both modified:   CONFIG

no changes added to commit (use "git add" and/or "git commit -a")
hdmst> git reset --merge dbg
hdmst> git status
On branch master
Your branch is ahead of 'origin/master' by 7 commits.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
hdmst> git cherry-pick --abort
error: no cherry-pick or revert in progress
fatal: cherry-pick failed
hdmst> git checkout master
Already on 'master'
Your branch is ahead of 'origin/master' by 7 commits.
  (use "git push" to publish your local commits)
hdmst> git status
On branch master
Your branch is ahead of 'origin/master' by 7 commits.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
hdmst> make
make: *** No targets specified and no makefile found.  Stop.
hdmst> git checkout dev
Branch dev set up to track remote branch dev from origin.
Switched to a new branch 'dev'
hdmst> git branch
  dbg
* dev
  master
hdmst> git status
On branch dev
Your branch is up-to-date with 'origin/dev'.
nothing to commit, working directory clean
hdmst> ll CONFIG
-rw-r--r-- 1 marc marc 1963 May 21 10:04 CONFIG
hdmst> grep IMAGE_DIR CONFIG
# IMAGE_DIR
IMAGE_DIR=              /var/www/htdig
# This is the URL to prefix the images placed in IMAGE_DIR.
hdmst> git cherry-pick dbg
[dev fda0985] RootDirectory changed from apache 2.2 to 2.4
 Date: Sun May 21 09:44:13 2017 +0000
 1 file changed, 1 insertion(+), 1 deletion(-)
hdmst> cd ..
git> mv hdmst hddev
git> cd hddev
hddev> ./configure
hddev> make CXXFLAGS="-g -O0"
hddev> ./htsearch/htsearch -c ~/tst/conf/htdig.conf
Enter value for words: default
Content-type: text/html

Enter value for format: short
...
<strong>Documents 1 - 1 of 1 matches.
...
OK, so... I can use this version of htsearch with my test db, and it works...
However, under gdb, searchWords is still as opaque:

(gdb) x/1s searchWords.current.object
0xc7230:	"x\212\a"
(gdb) x/1sw searchWords.current.object
0xc7230:	U"\x78a78\x7b4b0\001\004\xc7268"
(gdb) x/1sw searchWords.head.object
0xc70a8:	U"\x78a78\x7b4b0\001\004\xc6fe8"
(gdb) x/1sw searchWords.head.next.object
0xc6ab8:	U"\x78a78\x7b4b0\a\b\xc6af0"
(gdb) x/1s ((String)searchWords.head.object).Data
0xc6fb0:	"xo\f"
(gdb) x/1sw ((String)searchWords.head.object).Data
0xc6fb0:	U"\xc6f78\xc6fd8\xc6ab8\031\xc6f80\b\t\xc6fd8"
Updated the search database from the installed version, with new errors (due to the new data, I hope):

hddev> sudo /opt/www/htdig/bin/rundig
DB2 problem...: PANIC: Invalid argument
Segmentation fault
BAD TAG IN SERIALIZED DATA: 108
BAD TAG IN SERIALIZED DATA: 111
DB2 problem...: missing or empty key value specified
DB2 problem...: missing or empty key value specified
DB2 problem...: missing or empty key value specified
DB2 problem...: missing or empty key value specified
BAD TAG IN SERIALIZED DATA: 108
BAD TAG IN SERIALIZED DATA: 111
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
In order to compare the trace of the two versions, made clean and rebuilt with -g -O0 in htdig as well. Otherwise, I keep getting:

value has been optimized out
I stop here, but I believe I have 0 elements on line 462 in parser.cc:

462	    for (int i = 0; i < elements->Count(); i++)
(gdb) x elements->number
0x0:	Cannot access memory at address 0x0
...whereas in hddev I have 1:

0x1:	Cannot access memory at address 0x1

May 27


(gdb) handle SIGALRM ignore
...
564	        name = strtok((char*)algs[i], ":");
(gdb) 
565		weight = strtok(0, ":");
(gdb) p name
$24 = {<Object> = {_vptr.Object = 0x7b518 <vtable for String+8>}, Length = 5, 
  Allocated = 6, Data = 0xc5db8 "exact"}
What I fail to do with tempWords[0] or words[0]... Printing the value of the Data member (offset 8 in the String structure).

(gdb) tb 581
Temporary breakpoint 5 at 0x1eaa8: file htsearch.cc, line 581.
(gdb) c
...
283	    createLogicalWords(searchWords, logicalWords, logicalPattern);
(gdb) p searchWords
$30 = {<Object> = {_vptr.Object = 0x7b3b0 <vtable for List+8>}, 
  head = 0xc6f80, tail = 0xc71b8, current = 0x0, current_index = -1, 
  number = 9}
This was in dbg; in the dev branch, the number of searchWords is 11...

(gdb) p originalWords
$32 = {<Object> = {_vptr.Object = 0x7b518 <vtable for String+8>}, Length = 7, 
  Allocated = 8, Data = 0x99220 "default"}
The difference between the number of searchWords is a consequence of a difference in the number of weightWords in doFuzzy: 4 vs 5. One then adds one parenthesis before and after, as well as one '|' between each pair of words, i.e. 4 + 2 + 3 = 9 vs 5 + 2 + 4 = 11.
There are however 3 algorithms in each case (exact, synonyms, endings).
The problem is within fuzzyWords.Get_Next(), invoked on line 636, or even on the previous line, in fuzzyWords.Start_Get(). Indeed... current is not initialized!
Earlier yet: in the constructor... head is not initialized!
The constructor is OK. fuzzy->getWords(ww->word, fuzzyWords);
It is not the same getWords which gets invoked... There are some const differences... Obviously one overloading failed to match the intended signature!
I didn't fix the const in Exact.h... and it derives from Fuzzy.h
Fixed now. Rebuilding.
It works... At least it now finds default, and even work (finding works, i.e. using endings)
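The failure mode can be reproduced in a few lines. This is a sketch under assumptions: FuzzyBase, StaleExact and FixedExact are hypothetical stand-ins, and the real Fuzzy/Exact signatures in htdig differ. It only shows how a const change in the base class turns a former override into an unrelated, hiding function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical base class after the const clean-up: the word is const char*.
// The default implementation produces no words.
struct FuzzyBase {
    virtual ~FuzzyBase() {}
    virtual void getWords(const char* /*word*/,
                          std::vector<std::string>& /*out*/) {}
};

// Forgotten derived class: it still takes char*, so this declares a NEW
// function that hides the base one instead of overriding it.
struct StaleExact : FuzzyBase {
    void getWords(char* word, std::vector<std::string>& out)
    {
        out.push_back(word);
    }
};

// Fixed derived class: same signature as the base, so it really overrides.
struct FixedExact : FuzzyBase {
    virtual void getWords(const char* word, std::vector<std::string>& out)
    {
        out.push_back(word);
    }
};

// The caller only knows the base class, as with fuzzy->getWords(...).
inline std::size_t count_words(FuzzyBase& algo, const char* word)
{
    std::vector<std::string> out;
    algo.getWords(word, out);   // virtual dispatch on the base signature
    return out.size();
}
```

Through the base reference, StaleExact silently falls back to the empty base implementation, while FixedExact returns the word. When building with -std=c++11, marking the derived function override would turn this mismatch into a compile error.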

May 28

I built yesterday with just make, i.e. it used -g -O2 (not -g -O0, meaning that debugging will be inconsistent).
Committed my changes.
Edited ignore at two levels: ~/.config/git/ignore and ~/git/htdig/.git/info/exclude, maybe not 100% correct.

htdig> git tag -a -m 'Hopefully working and const correct' const
Note: the 3.1.6 tag was not annotated...
Added some unicode to the test page, and ran:

htdig> /home/marc/tst/bin/rundig
without error or warning... producing a db.wordlist from which the unicode characters are stripped away.

July 29-30

Trying to figure out where I am. Updated CONFIG in the hddev branch to match the current state of things (the file was not checked in, and I did not commit it).

The status was not the one recorded: looking for default, one does now find the root page (It works). Updated.
Started to play with replacing String with string, first for configFile in htdig.cc and htsearch.cc.
One annoying issue is the String class supports a family of operator<< members, which are extensively used to append stuff to strings... Although this works: (foo += '/') += bar;
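A sketch of the two options with std::string, assuming nothing about the htdig String class beyond its append-style operator<<. The names join_path and compat are hypothetical; the compat operators are an illustration of how the old idiom could be kept, not something present in the tree.

```cpp
#include <cassert>
#include <string>

// std::string::operator+= returns a reference to the string itself,
// so appends can be chained with explicit parentheses.
inline std::string join_path(const std::string& dir, const std::string& file)
{
    std::string path(dir);
    (path += '/') += file;
    return path;
}

// Alternative: small append operators that mimic the String idiom.
// They live in their own namespace so that nothing is added to std.
namespace compat {
    inline std::string& operator<<(std::string& s, const std::string& t)
    {
        return s += t;
    }
    inline std::string& operator<<(std::string& s, char c)
    {
        return s += c;
    }
}
```

After a using namespace compat; declaration, existing code of the form foo << '/' << bar keeps compiling when foo becomes a std::string, which might limit the size of the first conversion pass.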
Only scratched the surface... Building with make CXXFLAGS="-g -O0"
Added a join2s member in StringList, used only from htsearch.cc.
I'm afraid I didn't properly test the changes so far. Although:

(gdb) run restrict=foo+bar;words=default;format=builtin-short
...
225	        urllist.Release();  // release the temporary list of URLs
(gdb) p urlpat
$15 = "foo|bar"
Committed, installed, ran rundig, and tested that I didn't break it yet.
Found UTF-8 with C++ in a Portable Way. I had already found it last December, but either forgot it or never really checked it. It is quite impressive, as it is very simple indeed: only inline code, using std::vector. Works fine on ubuntu.
Checked that (some) support for Unicode was introduced in C++11, and more in C++14. The version of g++ was: 4.9.2 (5.4.0 in ubuntu, which does refer to C++14).
Ran apt-get update/upgrade on berry. This did not upgrade g++. Checked that the utfcpp package works fine on g++ 4.9.2 (it was time stamped in 2013), and groks my Иностранка page.

August 12-13

Wrote a strlist class, for use as a replacement for StringList.
Trying to use it from htsearch.cc, for urlList.
Compiled. Built. Debugging.
The create function doesn't work correctly. It yields:

(gdb) p word
$19 = "âme\242memee"
The reason is that the original input:

strlist::create (this=0xbefffa34, str=0x9a918 "âme", 
    sep=0x79f68 "| \t\r\n\001") at strlist.cc:25
is appended several times, every time removing the initial character.
What fails is thus the test (?). The bug doesn't depend on the utf-8 char: I get the same with plain ascii:

(gdb) p word
$2 = "defaultefaultfaultaultultltt"

No: the issue is not the test -- it is append(str) which appends the full str (word) instead of one char, as in the original code: fixed.
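A minimal sketch of the corrected loop, assuming a create function that walks the input one char at a time. The free function split is hypothetical (the strlist source is not shown in this log). Appending the whole remaining tail at each step gives exactly the garbage seen above: "default" + "efault" + "fault" + ... = "defaultefaultfaultaultultltt".

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for strlist::create: split str on any char of sep.
inline std::vector<std::string> split(const char* str, const char* sep)
{
    std::vector<std::string> words;
    std::string word;
    for (; *str != '\0'; ++str) {
        if (std::strchr(sep, *str) != NULL) {
            // Separator: close the current word, skipping empty ones.
            if (!word.empty()) {
                words.push_back(word);
                word.clear();
            }
        } else {
            // Append ONE char. word.append(str) would append the whole
            // remaining tail of the input at every step.
            word.append(1, *str);
        }
    }
    if (!word.empty())
        words.push_back(word);
    return words;
}
```

Since all separators are ASCII, the bytes of a multi-byte UTF-8 character are copied through untouched, one byte per iteration.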
Now join doesn't work: there is a copy of my aux function object which doesn't preserve its contents: fixed.
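One common way to lose the contents of a function object is std::for_each, which takes its functor by value and works on a copy. Whether my aux object was used that way is not recorded here, so the following is a sketch under that assumption. Joiner and join are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical accumulator: concatenates the items it is called with,
// inserting sep between consecutive items.
struct Joiner {
    std::string sep;
    std::string result;
    bool first;
    explicit Joiner(const std::string& s) : sep(s), result(), first(true) {}
    void operator()(const std::string& item)
    {
        if (!first)
            result += sep;
        result += item;
        first = false;
    }
};

inline std::string join(const std::vector<std::string>& items,
                        const std::string& sep)
{
    // for_each copies the functor; the object passed in is never updated.
    // The filled-in copy comes back as the return value, so keep that.
    Joiner filled = std::for_each(items.begin(), items.end(), Joiner(sep));
    return filled.result;
}
```

Joining the two items foo and bar with "|" yields foo|bar, the same pattern as urlpat in the July gdb session.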
Also, I can see that my using char will not support utf-8 characters...
So, I'll have to revisit this shortcoming.
string is typedef'ed to basic_string<char> in stringfwd.h. There is as well typedef basic_string<wchar_t> wstring.
The ICU project clearly recommends against using wchar_t for unicode. But then, I cannot use wstring either. They favour (March 2000) utf-16... although utf-8 is probably closer to my needs (?)
Trying wstrlist... Not so easy: there is no implicit conversion from wchar_t to char... c_str() returns then wchar_t...

September 9-10

Goal for now: use strlist in all the binaries (even if only for char), first to read the configuration. I.e. after htsearch: htdig, htmerge, htnotify, htfuzzy (er... htdump, htload?).
First extend in htsearch (only for urllist so far...).
form_vars: need to provide non-default constructors, and to make the default ones explicit... (not done yet for the copy ctor)
Count -> size and needs operator[]? No: iterator.
Modified:

    StringList form_vars(config["allow_in_form"], " \t\r\n");
    for (i= 0; i < form_vars.Count(); i++)
    {
      if (input.exists(form_vars[i]))
	config.Add(form_vars[i], input[form_vars[i]]);
    }
into:

    strlist form_vars(config["allow_in_form"], " \t\r\n");
    for (strlist::const_iterator it = form_vars.begin(); it != form_vars.end(); it++) {
      if (input.exists(it->c_str()))
	config.Add(it->c_str(), input[it->c_str()]);
    }

Debugged by adding to htdig.conf:
  allow_in_form: search_algorithm search_results_header
Problem: this is only intermediate, as the other classes, List, WeightWord, etc... still require char*, resulting in superfluous calls to c_str().
OK: removed StringList from htsearch...
However:

htdig> /home/marc/tst/bin/rundig
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
Of the main binaries, only htdig and htsearch were relinked today — the other ones on Aug 23 (no log?). htsearch doesn't crash under gdb... The 4 db files were recreated at 16:17 utc — the wordlist is as usual.
It is htnotify which crashes...
Debugging. My shell over tramp dies/times out!?

Program received signal SIGABRT, Aborted.
0xb6ce5f70 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0xb6ce5f70 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0xb6ce7324 in __GI_abort () at abort.c:89
#2  0xb6eedb5c in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#3  0xb6eeb9a0 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
OK... Maybe the problem is not in htnotify... Update/upgrade/dist-upgrade... Only: failed.
Not clear what the cause was... There was a similar issue on sartre, which got resolved with:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 1397BC53640DB551
...but on berry (Raspbian) the issue persisted. I tried to reset /etc/apt/sources.list, but this failed, then touched /etc/resolv.conf, and found I was competing against an fsync process!? Then I tried to reboot, and failed: disk corruption! I fixed it eventually by adding fsck.repair=yes to /boot/cmdline.txt (as e2fsck must be run on an unmounted fs, yet e2fsck itself sits on the root disk /dev/mmcblk0p6, with all the shared libraries it uses...). Another option would have been to boot from a usb disk, but this has to be enabled in advance and is still experimental (Jessie).
Upgraded now to apache 2.4, python 2.7.9, and emacs 24, among others...
Back to debugging htnotify... which still seems to crash.

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.
0xb6ce7f70 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) Not stopped at any breakpoint; argument ignored.
I do a make clean, and rebuild... At least three warnings:

make[1]: Entering directory '/home/marc/git/htdig/htlib'
...
In file included from regex.c:215:0:
../htlib/gregex.h:530:0: warning: "__restrict_arr" redefined
 #define __restrict_arr
 ^
In file included from /usr/include/features.h:374:0,
                 from /usr/include/arm-linux-gnueabihf/sys/types.h:25,
                 from regex.c:46:
/usr/include/arm-linux-gnueabihf/sys/cdefs.h:363:0: note: this is the location of the previous definition
 # define __restrict_arr __restrict
 ^
...
make[1]: Entering directory '/home/marc/git/htdig/htdig'
...
SGMLEntities.cc: In member function ‘void SGMLEntities::init()’:
SGMLEntities.cc:178:56: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  trans->Add(entities[i].entity, (Object *) entities[i].equiv);
                                                        ^
...
make[1]: Entering directory '/home/marc/git/htdig/htmerge'
...
words.cc: In function ‘void mergeWords(const char*, const char*)’:
words.cc:112:10: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
      sid = "-";
          ^
The problem is in readPreAndPostamble(), at htnotify.cc:139.
OK... This is an old friend... I.e. a bad fix. OK: second fix, now involving a string.
The next target might be QuotedStringList... I'll keep its name.
Oops... Difficulties start... lowercase as a member of String? Especially when the C++ tolower depends on the locale.
Also, ExternalParser.cc which consumes QuotedStringLists, wants to put their items into Dictionary...
Stopping there, with uncommitted changes.

September 17

Trying to code lowercase with string and wstring...

tests> ./lc AAÉ
aaÉ
tests> ./wlc A
Segmentation fault
With the default (char-based) string, the characters in the string have different sizes:

(gdb) p s
$1 = "AAÉ"
(gdb) p s.substr(0,2)
$2 = "AA"
(gdb) p s.substr(0,3)
$3 = "A", <incomplete sequence \303>
(gdb) p s.substr(0,4)
$4 = "AAÉ"
I don't understand how to construct my wide strings for the input. I get bad cast errors. Tried also:

tests> c++ -std=c++11 -g -O0 -o u16lc u16lc.cc

September 23-24

Going on with my bottom-up test to implement lowercase for utf-8.
Making lc a template.

tests> c++ -o wlc wlc.cc -g -O0
tests> ./wlc fooÉ
fooÉ: fooÉ
tests> c++ -o u16lc u16lc.cc -g -O0 -std=c++11
tests> ./u16lc fooÉ
terminate called after throwing an instance of 'std::bad_cast'
  what():  std::bad_cast
Aborted
tests> locale -a
C
C.UTF-8
en_US.utf8
POSIX
This lowercase function is a simple and interesting starting point. Now that I have failed to implement it with the plain std C++ library, I'll see what icu has to offer.
Downloaded icu4c-59_1-src.tgz, and extracted it under ~/git/icu
Created a git repo there, and committed.
configure, built...
There is a case example.
It works.
Back to htdig: created an icu branch...
Added the flags required for icu (including -g -O0) to Makefile.config
Modified strlist to use UnicodeString instead of string
Skipping htdig/ExternalParser for now.
Building first in htnotify...
Cloned the current state of the repo for backup purposes. I assume I may now delete hddev...?

git> git clone -l --no-hardlinks htdig bkp
Deleted htlib/(htString.h,String.cc,StringList.cc,StringList.h), as well as wstrlist.{h,cc} (added to dbg).
Inside htnotify.cc... Many changes to header files for strings and lists, leaving the source files inconsistent.

October 30

The question is: which strings, if any, should I leave as (const)? char*?
URLs and dates could stay so; but even the subject and content of emails could be unicode strings!
Checking what support there might be in icu for email... I find none, but maybe none is needed? Anyway, maybe not critical for my purposes
To build the examples, I need to add using directives, such as: using icu::UnicodeString;
I'll have to replace the Dictionary class with a c++ collection template (map?) and get rid of Object...
Completed htnotify... and committed this intermediate state.

November 2

Updated and reindexed the search database:

public_html> sudo /opt/www/htdig/bin/rundig
DB2 problem...: PANIC: Invalid argument
Segmentation fault
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
Hopefully only because of the unicode data in the pages...

November 5

Next: htsearch... Maybe Display.cc first? Lots of header files to 'fix'... Lots of guesses...
Hitting things like HtRegex... Maybe that's the nut at the heart?
I'll try to replace HtRegex with icu::RegexPattern, and HtRegexReplace with icu::RegexMatcher. Not removing them from git yet (some changes there that I'd have to revert...):

htsearch> mkdir away
htsearch> mv ../htlib/HtRegex* away/
htsearch> ll away
total 32
drwxr-xr-x 2 marc marc 4096 Nov  5 15:30 .
drwxr-xr-x 3 marc marc 4096 Nov  5 15:30 ..
-rw-r--r-- 1 marc marc 2341 Jan 31  2002 HtRegex.cc
-rw-r--r-- 1 marc marc 1093 Nov  5 10:57 HtRegex.h
-rw-r--r-- 1 marc marc 3671 Jan 31  2002 HtRegexReplace.cc
-rw-r--r-- 1 marc marc 1294 Jan 31  2002 HtRegexReplace.h
-rw-r--r-- 1 marc marc 1776 Mar  5  2017 HtRegexReplaceList.cc
-rw-r--r-- 1 marc marc  563 Nov  5 11:01 HtRegexReplaceList.h

In htnotify, I replaced Dictionary with a multimap, but this doesn't mean I need to do the same everywhere... There, there could be several notifications for the same key.

November 26

In HtURLRewriter, replacing the HtRegexReplaceList data member (renamed to repl) with a map<UnicodeString, RegexMatcher>.
In fact, not quite sure... Maybe this class is useless, if it contains the list of matches of one specific pattern, because RegexMatcher already does that, as seems to be the case with its replaceAll member function... Apart from the fact that this is a singleton... What I miss are the arguments I would expect: match, and replacement.
HtURLRewriter is used from htsearch.cc and htdig.cc, but without arguments!
The construction of the singleton uses the url_rewrite_rules configuration, defaulted to empty in htcommon/defaults.cc, where both defaults and config are defined. Now config is initialized in htsearch.cc, by calling Configuration::Defaults.
What I didn't find yet, is where the urls are rewritten, i.e. where the rules are applied to the urls.
What has been used is search_rewrite_rules (in /opt/www/htdig/conf/htdig.conf), but this is the same thing!
I stop here, in the middle...

December 17

In bkp, found in htsearch.cc:253:

    config.AddParsed("url_rewrite_rules", "${search_rewrite_rules}");
Since I had in htdig.conf:180:

search_rewrite_rules:	http://(berry314)/(.*) http://\\1.dyndns-pics.com/\\2
I see that something happens in Configuration::AddParsed (184) applying to the dict member of config.
So... one invokes Parser::parse which returns a ResultList (specialized from Dictionary).
Maybe what I was looking for was in URL::rewrite:

      HtURLRewriter::instance()->Replace(_url);
So... the urls get replaced one by one; only the 'match' is passed as argument, when the 'replace' value is already recorded in the HtURLRewriter singleton.
So, back to the icu branch, and to HtURLRewriter. In fact, icu::RegexMatcher doesn't quite fit the use from URL. In addition, the urls will not contain unicode characters (?)
Trying to understand the use of icu::RegexMatcher with the ugrep example.
Marc Girod