2018 HtDig Log
- 2017 log
- January 13; 20; 27
- March 17
- April 15; 22; 28
- May 19; 26, 27
- June 10; 16, 17
- August 11, 12; 18, 19, 20; 25, 26
- September 2; 9; 15, 16
- October 27, 28, 29
- November 25
- December 8, 9
apt-get update/upgrade/dist-upgrade
My example was a minimum edit of ugrep.cpp saved as
~/git/icu/samples/ugrep/usr.cpp
The structures in use are:
const char *pattern = NULL; // The regular expression
UErrorCode status = U_ZERO_ERROR; // All ICU operations report success or failure
UParseError parseErr; // In the event of a syntax error in the regex pattern,
RegexPattern *rePat = RegexPattern::compile((const UnicodeString)pattern, parseErr, status);
UnicodeString empty;
RegexMatcher *matcher = rePat->matcher(empty, status);
UnicodeString s(FALSE, ucharBuf+lineStart, lineEnd-lineStart);
matcher->reset(s);
if (matcher->find()) {
UErrorCode st = U_ZERO_ERROR; // must start as success: ICU calls do nothing if it already holds a failure
UnicodeString r("http://$1.dyndns-pics.com/$2");
cout << "Replacement: " << matcher->replaceAll(r, st) << endl;
matchFound = TRUE;
printMatch();
}
In htdig.conf, we have both the pattern and the replacement; we have to read them
in Configuration::AddParsed.
This is what I already analysed.
The value contains both the pattern and the replacement.
Working on ParsedString::get...
Looking at icu/source/samples/citer/citer.cpp...
Not sure I didn't confuse incrementing the iterator and getting to
the next word (but was there a next word? no...!?).
Well... not quite sure: it is a recursive function...
I changed the return value of the function from pointer to reference,
hoping that the value may be initialized to an empty string,
and that nothing breaks...
Done: ParsedString...
Next: Configuration... Changing the interface: from char* to UnicodeString&
Compiled.
I have still not solved the issue of setting up HtURLRewriter...
January 20
So... There is a singleton HtURLRewriter, and this one is constructed
from config["url_rewrite_rules"],
which contains strings with space-separated pattern and replacement.
A TAB separates these strings from the search_rewrite_rules: prefix.
I moved the HtRegexReplaceList files into the away directory.
In fact, it contained a list of pairs, which it used as from and to.
My strlist replacement for StringList is not adequate, as it
uses char instead of UChar,
although it might work as long as the URLs do not contain multibyte chars.
But ICU most probably offers better parsing tools.
Extracted a test case in ~/git/tests/parse, with a local copy of
strlist.
Fails abominably:
parse> make CXXFLAGS="-std=c++11 -g -O0"
g++ -std=c++11 -g -O0 -I/usr/local/include -c -o parse.o parse.cc
g++ -std=c++11 -g -O0 -I/usr/local/include parse.o strlist.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata -o parse
parse> ./parse 'il y de la joie Ваня'
original text: il y de la joie Ваня
l.join('+'): il y de la joie ÐанÑ
First, the default constructor sets '\t' as the separator,
so the string is not parsed/split.
Found BreakIterator, which begets:
parse> ./parse 'il y a de la joie Ваня'
original text: il y a de la joie Ваня
l.join('+'): il+ +y+ +a+ +de+ +la+ +joie+ +Ваня
So, now, getting rid of the whitespace—done, although not perfect:
this will split (and record) also on punctuation, including tabs.
parse> ./parse ' il y a de la joie Ваня'
original text: il y a de la joie Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
parse> ./parse ' il y a de la joie, Ваня'
original text: il y a de la joie, Ваня
l.join('+'): il+y+a+de+la+joie+,+Ваня
parse> ./parse ' il y a-aussi-de la joie Ваня'
original text: il y a-aussi-de la joie Ваня
l.join('+'): il+y+a+-+aussi+-+de+la+joie+ +Ваня
Adapted for HtURLRewriter.
Still working in the icu branch.
htdig> find . -type f -name Stack\*
./htlib/Stack.cc
./htlib/Stack.h
htdig> grep -rl Stack --include=*.h .
./htdig/Server.h
./db/include/btree.h
./htsearch/parser.h
./htlib/Stack.h
In btree.h, the match is in a comment.
The actual type is:
typedef struct __epg EPG;
struct __epg {
PAGE *page; /* The page. */
db_indx_t indx; /* The index on the page. */
DB_LOCK lock; /* The page's lock. */
};
PAGE is defined as a large struct in db_page.h
The hope for a start is that I do not need to modify the db...
This is version 2.6.4 (12/16/98) of Sleepycat Software's Berkeley DB product.
Before looking at replacing Stack,
I might look at ResultList, as e.g. in parser.cc,
one pushes such a list into (and pops from) the stack member.
ResultList specialized Dictionary,
which I removed, replacing it with an included map.
Fixed the issue with strlist of recording punctuation and tabs.
parse> ./parse ' il y a de la joie, Ваня'
original text: il y a de la joie, Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
parse> ./parse ' il y a de la joie Ваня'
original text: il y a de la joie Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
Added const iterators.
New test:
find> ./indexof mydefault:0 default:
text: mydefault:0, pattern: default:
position of pattern in text: 2
text after removing pattern: my0
find> ./indexof default:0 foo
text: default:0, pattern: foo
position of pattern in text: -1
Reindex, with new errors:
public_html> sudo /opt/www/htdig/bin/rundig
DB2 problem...: Unable to allocate 1852256170 bytes from mpool shared region: Cannot allocate memory
...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
Where was I? If I build under htsearch, the first error
is for Display.cc, which includes QuotedStringList.h,
now renamed to qstrings.h.
If I build htsearch.o, the first error is
for configFile, which is still treated as
a string instead of a UnicodeString.
The notes tell me to look into WordList,
which is not opened yet.
OK: we are under parser, and result is a member there.
Building parser.o, the first error is
for tokens, a vector<UnicodeString> member.
The point is that the first thing one does in lexan
is to cast the token into a WeightWord...
So, that's what it should be: done.
But a problem is the way the tokens get iterated.
There was a Start_Get to reset the list in fullexpr,
following which current was set to Get_Next
at the beginning of lexan, itself called from expr, etc.
But that's not how it works with vector...
OK: converted, although with a doubt:
e.g. now perform_push may return early
if the end of tokens was reached.
Hopefully I do not skip any token (I go to the next in lexan).
Committed all the changes to date.
Downloaded Berkeley DB:
db-6.2.32.tar.gz, and extracted under git.
I noticed a few days back that my last htdig db update
had in fact aborted before indexing the last pages
(no new hits for climat, although expected).
I ran apt-get update/upgrade/dist-upgrade and retried:
same result, and in fact there are no hits for distinction either,
which should already have come in with the previous update.
Ran in verbose mode, found multiple errors (invalid links),
fixed some, and ran again:
public_html> sudo /opt/www/htdig/bin/rundig -vvv 2>&1 | egrep -B5 ^DB2
*href: http://berry314/bdb/docs/gsg/C/CoreCursorUsage.html (example_database_read)
resolving 'http://berry314/bdb/docs/gsg/C/CoreCursorUsage.html'
*href: http://berry314/bdb/docs/gsg/C/preface.html (Next)
resolving 'http://berry314/bdb/docs/gsg/C/preface.html'
* size = 17666
DB2 problem...: PANIC: Bad address
Note that this is not the exact same page as previously:
+href: http://berry314/bdb/docs/gsg/C/CoreEnvUsage.html (Managing Databases in Environments)
resolving 'http://berry314/bdb/docs/gsg/C/CoreEnvUsage.html'
DB2 problem...: PANIC: Bad address
It seems in fact to be later, but this log itself gets indexed,
so that it may contribute to pushing the error (should be backwards?).
What I have practiced is e.g:
winfo> perl -0777 -pi -e 's%<!--.*?>%%gsm' $(find . -type f -name \*.html)
More...
public_html> sudo /opt/www/htdig/bin/rundig -vvv 2>&1 | egrep -B5 ^DB2
resolving 'http://berry314/bdb/docs/gsg/C/databaseLimits.html'
pushing http://berry314/bdb/docs/gsg/C/databaseLimits.html
+href: http://berry314/bdb/docs/gsg/C/environments.html (Environments)
resolving 'http://berry314/bdb/docs/gsg/C/environments.html'
DB2 problem...: PANIC: Invalid argument
The problem may be that I added a large base with bdb,
and overran a limit...
I'll try removing it.
public_html> sudo ls -ld /var/www/html/bdb
lrwxrwxrwx 1 root root 18 Mar 18 14:22 /var/www/html/bdb -> /home/marc/git/bdb
public_html> sudo rm /var/www/html/bdb
OK... It looks like this is the explanation...
I restore the link and exclude it:
public_html> sudo ln -s /home/marc/git/bdb /var/www/html/bdb
public_html> sudo perl -pi.bak -e 's%^(exclude_urls:.*)%$1 /bdb/%' /opt/www/htdig/conf/htdig.conf
public_html> diff /opt/www/htdig/conf/htdig.conf.bak /opt/www/htdig/conf/htdig.conf
52c52
< exclude_urls: /cgi-bin/ .cgi
---
> exclude_urls: /cgi-bin/ .cgi /bdb/
That's enough to get the climat indexed,
but I still get a DB2 error.
I'll also exclude icu:
public_html> sudo perl -pi.bak -e 's%^(exclude_urls:.*)%$1 /icu/%' /opt/www/htdig/conf/htdig.conf
public_html> diff /opt/www/htdig/conf/htdig.conf.bak /opt/www/htdig/conf/htdig.conf
52c52
< exclude_urls: /cgi-bin/ .cgi /bdb/
---
> exclude_urls: /cgi-bin/ .cgi /bdb/ /icu/
Still a DB2 problem. Let's hope moving to bdb will solve it.
Progress logged in objects
Built and extended the udata example of icu.
Explored std::fstream, in an fstr test,
trying to identify what to implement in terms of write.cpp,
and unewdata.cpp,
and as explicit instantiations of the std templates,
for UChar or UnicodeString.
Working as such with wchar_t, but not with char16_t
(empty string—maybe just a locale issue?):
fstr> make
g++ -std=c++11 -I/usr/local/include -c -o fstr.o fstr.cc
g++ -std=c++11 -I/usr/local/include fstr.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata -o fstr
fstr> ./fstr
fstr> cat test.txt
Il y a de la joie
fstr> nm -C ./fstr | grep wchar_t
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::close()@@GLIBCXX_3.4
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::basic_ofstream(char const*, std::_Ios_Openmode)@@GLIBCXX_3.4
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::~basic_ofstream()@@GLIBCXX_3.4
U std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& std::operator<< <wchar_t, std::char_traits<wchar_t> >(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >&, wchar_t const*)@@GLIBCXX_3.4
OK... Maybe framework ready for the actual implementation
(for now: dummy/empty bodies)...
fstr> make
g++ -std=c++11 -I/usr/local/include -c -o fstr.o fstr.cc
g++ -std=c++11 -I/usr/local/include -c -o uofstream.o uofstream.cc
g++ -std=c++11 -I/usr/local/include fstr.o uofstream.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata -o fstr
Minor progress in integration, although:
- using udata_create from unewdata,
instead of stealing the body into the constructors
- added the local source dir in the makefile
- kept both buffers, even if only one will be used
(the icu one wrapped into a specialization of the std one?)
Committed: builds, but doesn't work (with char16_t).
I kept both code variants,
writing to a wchar_t stream (file: Ltest.txt),
and to a char16_t stream (file: utest.txt).
So far only with plain ascii (only the second uses icu).
[Slightly edited the transcript:
the magic string of the utest.txt file is not text]
fstr> make
g++ -std=c++11 -I/usr/local/include -I/home/marc/git/icu/source/tools/toolutil -c -o fstr.o fstr.cc
g++ -std=c++11 -I/usr/local/include -I/home/marc/git/icu/source/tools/toolutil fstr.o uofstream.o -L/usr/local/lib -licui18n -licuio -licutu -licuuc -licudata -o fstr
fstr> ./fstr
fstr> cat Ltest.txt; echo; cat utest.txt; echo
Il y a de la joie
^@[...]MyDt[...]^@ Copyright (C) 2016 and later: Unicode, Inc. and others. License & terms of use: http://www.unicode.org/copyright.html
fstr> nm -C ./fstr | grep wchar_t
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::close()@@GLIBCXX_3.4
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::basic_ofstream(char const*, std::_Ios_Openmode)@@GLIBCXX_3.4
U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::~basic_ofstream()@@GLIBCXX_3.4
U std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& std::operator<< <wchar_t, std::char_traits<wchar_t> >(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >&, wchar_t const*)@@GLIBCXX_3.4
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'close()'
00012538 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::close()
00012dd8 W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()::__close_sentry::__close_sentry(std::basic_filebuf<char16_t, std::char_traits<char16_t> >*)
00012e0c W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()::__close_sentry::~__close_sentry()
00012ed0 W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'basic_ofstream(char const*'
00012168 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream(char const*, std::_Ios_Openmode)
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'basic_ofstream()'
00012328 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
000123c8 W virtual thunk to std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
0001240c W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
0001243c W virtual thunk to std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'operator<<'
00017934 T std::basic_ostream<char16_t, std::char_traits<char16_t> >& std::operator<< <char16_t, std::char_traits<char16_t> >(std::basic_ostream<char16_t, std::char_traits<char16_t> >&, char16_t const*)
W in the nm output denotes a weak symbol.
Anyway, I have not yet achieved what I got in the udata sample:
udata> ./reader
Read value 2000 from data file
Read string EXAMPLE from data file
Read ustring from data file: архипелаг
Read ustring from data file: Ça va... Il y a toujours à boire et à manger
Although... I have no reader... Only the writer...
But well, there is nothing past the copyright in the test file.
Ah... I only wrote stubs so far!
And only part of the stubs (for the template specializations).
For instance, I did not specialize the basic_ostream
base class, and it looks like operator<< is a member function of it...
Inserted now a specialization of it, so far with no change from the default.
But the point is indeed to add to it an operator<< taking UChar...
Or is it needed? UChar is one type with which to
instantiate the template...
Only needed if it requires some change in the declaration,
such as passing a buffer.
Otherwise, all the specialization was useless.
Maybe the only thing I need to specialize is ostream::_M_insert
Or maybe __ostream_insert (a template function)
The problem is the basic_filebuf or the basic_streambuf
(base class of the previous)
Taken away (into ostr.h) the specialization attempt of basic_ostream
Next: basic_streambuf... specialized in bstr.h, just to see it...
But UNewDataMemory is maybe more of a basic_filebuf
specialization? Here we go...
Maybe only some member functions might be enough and legal...
First open (init comes from basic_ios).
In the template code, I have:
_M_file.open(s, mode);
if (this->is_open()) {
_M_allocate_internal_buffer();
to replace with code
from icu/source/tools/toolutil/unewdata.cpp.
_M_file is a basic_file<char>, defined in
/usr/include/arm-linux-gnueabihf/c++/4.9/bits/basic_file.h
The file in UNewDataMemory is a FileStream
defined in source/tools/toolutil/filestrm.h
It is opened in unewdata.cpp on line 102,
i.e. in filestrm.cpp on line 36
So far: this is equivalent: no need to change anything.
Next (in my fstr.h):
_M_allocate_internal_buffer();
At this point, something must have set the value of _M_buf_size.
The _M_buf is not the equivalent of the UNewDataMemory.
One needs to specialize the diverse operator<<
at least in order to add the padding...
Added to ostr.h, but with no change (yet),
e.g. from unewdata.cpp
Also, the two magic values and the copyright header
need to be inserted into the new file;
maybe no need for a data structure to hold them...
Trying to understand the file header in utest.txt.
Its size is 90H, i.e. 9 lines of 16 bytes (144).
- 2 bytes of size: 9000 (little-endian 0x0090), i.e. 144
- 2 bytes of magic: da27
- 20 bytes of dataInfo (from uofstream.h):
8 bytes (1400 0000 0000 0200),
then 12 ('MyDt' 0100 0000 0100 0000)
- 119 bytes (20 43...6c 20): the copyright string
(from uvernum.h),
including one space before and after
- 1 byte: 00
This is in fact written to the file by T_FileStream_write in filestrm.cpp,
invoked 3 times, plus 1 for the 0 padding,
from udata_create in unewdata.cpp,
with lengths 4, pInfo->size, and commentLength.
It uses plain fwrite.
Now: where should I write this header?
In the open function of basic_filebuf,
when I'm creating the file?
Followed this plan:
only specialized basic_filebuf<UChar, uctraits>::open
Same result as previously: header in the file,
but no text inserted.
open is a member function, not the constructor:
the object is ready, one can use the operator
instead of the implementation-dependent details of _M_file.
Except that open is a member of basic_filebuf<>,
and operator<< of ostream<>...
What is probably missing is the conversion between the streams,
as endl and the ascii text return an ostream<char>
No: endl is a template as well:
it should return an ostream<UChar>
In the mypkg_example.dat file produced by the udata example,
the header is similar to the one in utest.txt,
and even EXAMPLE is in contiguous chars,
but Il y a toujours... are in 16 bit UChars.
I need to specialize the instances of _M_insert
for the different types: uint8_t, char, const char*,
following the example of udata_write8,
udata_write16, udata_writeString
in unewdata.cpp, based on T_FileStream_write
in filestrm.cpp.
I'm surprised by udata_writePadding
padding with 0xaa, not with 0,
but I cannot find it used; keep this in mind.
Also, the int16_t and UChar
automatic specializations of _M_insert
should be OK...?
Debugged. Stepping on line 17 in fstr.cc,
I get into ostream_insert.h:88,
in __ostream_insert, and there,
__out.width() equals 0, and __n is 18
(the length of the string in chars),
but even then, the stream is in badbit state,
and nothing gets inserted.
Added a default specialization
of __ostream_write, which never reaches line 85:
SIGSEGV, not even always reaching line 84:
Breakpoint 1, std::__ostream_write<char16_t, std::char_traits<char16_t> > (
out=..., s=0x180bc u"Il y a de la joie\n", n=18) at fstr.cc:84
84 const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) n
0x000132d4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (
__out=..., __s=0x180bc u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/bits/ostream_insert.h:109
109 __catch(...)
(gdb) bt
#0 0x000132d4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> >
(__out=..., __s=0x180bc u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/bits/ostream_insert.h:109
#1 0x00012050 in std::operator<< <char16_t, std::char_traits<char16_t> > (
__out=..., __s=0x180bc u"Il y a de la joie\n")
at /usr/include/c++/4.9/ostream:518
#2 0x00011424 in main () at fstr.cc:17
Debugging, the basic_streambuf this pointer is 0
in streambuf:451,
when invoked from fstr.cc:84.
However, the basic_streambuf base of basic_filebuf,
itself a member (_M_filebuf) of basic_ofstream, was initialized
in streambuf:466, invoked from fstream.tcc:85:
std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (
this=0xbefff8c0) at /usr/include/c++/4.9/streambuf:466
466 _M_buf_locale(locale())
(gdb) bt
#0 std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff8c0) at /usr/include/c++/4.9/streambuf:466
#1 0x00012d18 in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff8c0) at /usr/include/c++/4.9/bits/fstream.tcc:85
#2 0x00011e00 in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff8bc, __s=0x180b0 "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>)
at /usr/include/c++/4.9/fstream:645
#3 0x00011414 in main () at fstr.cc:16
(gdb) p this
$15 = (std::basic_streambuf<char16_t, std::char_traits<char16_t> > * const) 0xbefff8c0
There is a second basic_streambuf reference,
a pointer held as a member of the basic_ios base: _M_streambuf.
This is the one not initialized...
Progressed, although not absolutely clear why.
Maybe just forced a better order of definition
by introducing a specialization of rdbuf.
Now failing because of the _M_codecvt facet
being found 0 in /usr/include/c++/4.9/bits/fstream.tcc:640
Still in tests/fstr
Looking for the missing facet.
Doesn't crash!? Ah, no. It shouldn't crash: just do nothing.
fstr> gdb fstr
(gdb) b 84
(gdb) r
(gdb) s
std::operator<< <char16_t, std::char_traits<char16_t> > (__out=...,
__s=0x18128 u"瑵獥\x2e74硴t") at /usr/include/c++/4.9/ostream:513
513 operator<<(basic_ostream<_CharT, _Traits>& __out, const _CharT* __s)
(gdb) n
515 if (!__s)
(gdb) p __s
$1 = 0x18134 u"Il y a de la joie\n"
(gdb) n
519 static_cast<streamsize>(_Traits::length(__s)));
(gdb) s
std::char_traits<char16_t>::length (__s=0x18134 u"Il y a de la joie\n")
at /usr/include/c++/4.9/bits/char_traits.h:421
421 size_t __i = 0;
(gdb) finish
Run till exit from #0 std::char_traits<char16_t>::length (
__s=0x18134 u"Il y a de la joie\n")
at /usr/include/c++/4.9/bits/char_traits.h:421
0x000129ec in std::operator<< <char16_t, std::char_traits<char16_t> > (
__out=..., __s=0x18134 u"Il y a de la joie\n")
at /usr/include/c++/4.9/ostream:519
519 static_cast<streamsize>(_Traits::length(__s)));
Value returned is $2 = 18
(gdb) s
518 __ostream_insert(__out, __s,
(gdb) s
std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=...,
__s=0x18134 u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/bits/ostream_insert.h:82
82 typename __ostream_type::sentry __cerb(__out);
(gdb) n
83 if (__cerb)
(gdb)
87 const streamsize __w = __out.width();
(gdb) p __out
$3 = (std::basic_ostream<char16_t, std::char_traits<char16_t> > &) @0xbefff8bc: {<std::basic_ios<char16_t, std::char_traits<char16_t> >> = {<std::ios_base> = {<No data fields>}, _M_tie = 0x85e2f8, _M_fill = 7808 u'Ẁ',
_M_fill_init = 140, _M_streambuf = 0x0, _M_ctype = 0x31,
_M_num_put = 0x7273752f, _M_num_get = 0x636e692f},
_vptr.basic_ostream = 0x182d4 <vtable for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+12>}
(gdb) s
std::ios_base::width (this=0xbefff94c)
at /usr/include/c++/4.9/bits/ios_base.h:645
645 { return _M_width; }
(gdb) finish
Run till exit from #0 std::ios_base::width (this=0xbefff94c)
at /usr/include/c++/4.9/bits/ios_base.h:645
0x00013dd4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (
__out=..., __s=0x18134 u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/bits/ostream_insert.h:87
87 const streamsize __w = __out.width();
Value returned is $4 = 0
(gdb) n
88 if (__w > __n)
(gdb) n
101 __ostream_write(__out, __s, __n);
(gdb) s
std::__ostream_write<char16_t, std::char_traits<char16_t> > (out=...,
s=0x18134 u"Il y a de la joie\n", n=18) at fstr.cc:74
74 const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::rdbuf (this=0xbefff94c)
at /usr/include/c++/4.9/bits/basic_ios.h:316
316 { return _M_streambuf; }
(gdb) s
std::basic_streambuf<char16_t, std::char_traits<char16_t> >::sputn (
this=0xbefff8c0, __s=0x18134 u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/streambuf:451
451 { return this->xsputn(__s, __n); }
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (
this=0xbefff8c0, __s=0x18134 u"Il y a de la joie\n", __n=18)
at /usr/include/c++/4.9/bits/fstream.tcc:640
640 streamsize __ret = 0;
(gdb) n
644 const bool __testout = (_M_mode & ios_base::out
(gdb) n
645 || _M_mode & ios_base::app);
(gdb) n
646 if (__check_facet(_M_codecvt).always_noconv()
(gdb) p __testout
$8 = true
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0)
at /usr/include/c++/4.9/bits/basic_ios.h:48
48 if (!__f)
(gdb) p __f
$9 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
And here is our unset facet, which results in a bad_cast exception.
In /usr/include/c++/4.9/fstream:84, there is:
typedef codecvt<char_type, char, __state_type> __codecvt_type;
The codecvt template class is defined in:
/usr/include/c++/4.9/bits/codecvt.h
There is also a specialization for char:
template<>
class codecvt<char, char, mbstate_t>
: public __codecvt_abstract_base<char, char, mbstate_t>
...
and one for wchar_t, plus a few extern definitions in
order to...
// Inhibit implicit instantiations for required instantiations,
// which are defined via explicit instantiations elsewhere.
namely in /usr/include/c++/4.9/ext/codecvt_specializations.h
Added a specialization (from the one for wchar_t) for
codecvt<UChar, char, mbstate_t>
but... now, I need to specialize the member functions,
and maybe the base class (hopefully not).
Inserted now a specialization for the first member: do_out
from the partial specialization in codecvt_specializations.h...
Note: the partial specialization uses iconv,
and obviously, that's what needs to be changed.
Asked for help/guidance in the ICU support mailing list.
The specialization for wchar_t uses
the encoding_state class,
defined in /usr/include/c++/4.9/ext/codecvt_specializations.h,
and which uses iconv.
I'll define and use a new uencodstate instead.
Shortened the variable names, including those of protected members
(stripped underscores).
Stripped comments (see original).
Removing references to iconv (from /usr/include/iconv.h)?
Note: boost has boost_1_62_0/libs/locale/src/util/iconv.hpp
Maybe something here (Boost 1.67.0)
Although, that's only for regex?!
ICU has uconv as an iconv replacement.
That's for the standalone tool, but there is its source code.
The main header file is /usr/local/include/unicode/ucnv.h
I'll replace:
iconv_t iconv_open(const char *tocode, const char *fromcode);
with
UConverter* ucnv_open(const char *converterName, UErrorCode *err);
assuming tocode in the first is implicit in the second.
I commit my intermediate code before doing this,
just in case.
I get rid of:
typedef iconv_t descriptor_type;
But this descriptor_type will now become,
depending on the context, either UConverter* or UErrorCode*...
I have only one desc left,
since the other is implicit...
Maybe not so: iconv works both ways,
whereas in ICU, there are two functions:
ucnv_toUnicode and ucnv_fromUnicode...
do_out is from UChar to char,
so it should use ucnv_fromUnicode
(ucnv_toUnicode goes the other way, from char to UChar).
I didn't know what to put for the flush argument.
Used true to compile.
I specialized the codecvt template class
and its do_out member.
But this is derived from a member in __codecvt_abstract_base.
Given a compilation error, maybe I need to specialize this one?
fstr.cc:162:5: error: template-id ‘do_out<>’ for ‘std::codecvt_base::result std::codecvt<char16_t, char, uencodstate>::do_out(uencodstate&, const UChar*, const UChar*, const UChar*&, char*, char*, char*&) const’ does not match any template declaration
codecvt<UChar, char, uencodstate>::
^
No: just a syntax error: template<> is not used when defining a member of an explicit specialization.
Successful build. Checking where we are in terms of the error:
fstr> make CXXFLAGS="-std=c++11 -g -O0"
fstr> gdb fstr
(gdb) b 205
(gdb) r
Breakpoint 1, std::__ostream_write<char16_t, std::char_traits<char16_t> > (
out=..., s=0x755ac u"Il y a de la joie\n", n=18) at fstr.cc:205
205 const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) s
(gdb) s
(gdb) s
(gdb) n
(gdb) n
(gdb) n
646 if (__check_facet(_M_codecvt).always_noconv()
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0)
at /usr/include/c++/4.9/bits/basic_ios.h:48
48 if (!__f)
(gdb) p __f
$1 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
char16_t, OK;
why __mbstate_t and not uencodstate?
Besides: no progress at all!
Looks like there is a locale class,
in /usr/include/c++/4.9/bits/locale_classes.h
with template friends,
and has_facet<UChar> returns false
Again, in /usr/include/c++/4.9/bits/locale_classes.tcc,
there are extern declarations
for explicit instantiations elsewhere of
has_facet<collate<char> > and has_facet<collate<wchar_t> >.
Added collate<UChar>
(probably only needing the member implementations).
The comment actually said:
These virtual functions are hooks for developers
to implement the behavior they require from the collate facet
I cannot see how I can work with a derived class.
To work with an explicit specialization,
I need to insert between
#include <bits/localefwd.h>
and
#include <bits/locale_classes.h>
extern declarations (and more!?) such as the ones above for
char and wchar_t.
No, this doesn't work:
explicit instantiation ... before definition of template.
Hopefully not needed.
Apparently not: provided stubs of the member functions,
to be filled in with ICU code...
Builds.
has_facet still returns false
in /usr/include/c++/4.9/bits/basic_ios.tcc:159
but __i was now 28 in
/usr/include/c++/4.9/bits/locale_classes.tcc:106
Downloaded and extracted under ~/git boost 1.67.0; Committed
Long pause. Where was I?
under work says htsearch...
But wasn't it rather in some tests? fstr...
I had just installed the boost libraries,
and this related to the task at hand...
Tried and failed to understand how to set up boost.
Created a /usr/local/boost
with write access for group girod, and did:
boost> ./bootstrap.sh --show-libraries --prefix=/usr/local/boost --with-icu
boost> ./b2 install
Setting the prefix did not apply as I expected.
Need to replay all the cp, and then to /usr/local
(forget /usr/local/boost)
Also, the install command ended up in a core dump!?
Looking at the adaptations of ICU made for C++:
/scp:berry:/home/marc/git/boost/boost/:
find . \( -type f -exec grep -q -e U_NAMESPACE_QUALIFIER \{\} \; \) -ls
regex/v4/u32regex_token_iterator.hpp
regex/v4/u32regex_iterator.hpp
regex/icu.hpp
/scp:berry:/usr/local/include/unicode/:
find . \( -type f -exec grep -q -e U_NAMESPACE_QUALIFIER \{\} \; \) -ls
394946 8 -rw-r--r-- 1 root staff 6597 Apr 16 01:23
Some interesting traits...
So... what I was interested in was facets, collate?
Maybe boost/detail/utf8_codecvt_facet.hpp and .ipp,
as well as boost/regex/icu.hpp,
may be sources of inspiration...
Back to fstr.
The debug session is as previously,
except that the first breakpoint moved to line 239.
Tracing the construction of basic_ofstream<UChar>
on line 248 in fstr.cc, I get in sequence to:
std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (
this=0xbefff86c, __s=0x759bc "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>)
at /usr/include/c++/4.9/fstream:645
645 : __ostream_type(), _M_filebuf()
std::basic_ios<char16_t, std::char_traits<char16_t> >::basic_ios (
this=0xbefff8fc) at /usr/include/c++/4.9/bits/basic_ios.h:456
456 _M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (
this=0xbefff86c,
__vtt_parm=0x75bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
385 { this->init(0); }
std::basic_ios<char16_t, std::char_traits<char16_t> >::init (this=0xbefff8fc,
__sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:129
129 ios_base::_M_init();
132 _M_cache_locale(_M_ios_locale);
std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
159 if (__builtin_expect(has_facet<__ctype_type>(__loc), true))
std::has_facet<std::ctype<char16_t> > (__loc=...)
at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
107 const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __facets[__i]
$3 = (const std::locale::facet *) 0xb6f9d4d8 <vtable for std::__timepunct<wchar_t>+8>
(gdb) p __i
$4 = 28
So... the issue is that __facets[28] points to
a wchar_t (automatic?) specialization of std::__timepunct,
instead of to one for UChar aka char16_t?
fstr> grep -rl __timepunct /usr/include/c++/4.9
/usr/include/c++/4.9/bits/locale_facets_nonio.tcc
/usr/include/c++/4.9/bits/locale_facets_nonio.h
There are indeed extern declarations for explicit specializations of
the templates for char and wchar_t, but the
code itself is not there.
Thought of defining the id for char16_t as 39,
compile and see.
But then found that boost has a collator template.
Tried to set the id, but this is not the way it goes.
fstr.cc:106:19: error: ‘id’ in ‘class std::collate<char16_t>’ does not name a type
collate<UChar>::id = locale::id(39);
I can also cheat and define:
template<> bool has_facet<collate<UChar>>(const locale& l) throw() {
return true;
}
...which compiles, but I don't think this is right.
The template code should work fine.
It should test the id which should have been installed
into the proper list data member in locale::_Impl by _M_init_facet...
I can try to force a specialization of this one, from the template code.
Except that I fail:
fstr.cc:106:48: error: variable or field ‘_M_init_facet’ declared void
template<> void locale::_Impl::_M_init_facet(collate<UChar>* f) {
Not sure this would be more right.
In fact, I feel I'll have big problems with basic_string<UChar>,
whereas what I really have is icu::UnicodeString, which has a different interface.
I would rather implement my cheat, only to notice that...
collate is not the right facet!
What we are checking now is ctype...
Of course, I can cheat this one as well,
but I end up in a bad cast anyway,
with __facets[28] still pointing to
vtable for std::__timepunct<wchar_t>+8
which fails the dynamic cast to yet another facet...
One interest of this cheating is to check which facets are
involved...
Trying now the ones I find in locale_facets.h
Under the debugger, hitting ctype, and dying on bad cast.
Unfortunately, I cannot see from where this was thrown.
And I cannot as easily cheat use_facet,
because I'd need to return one, and the destructor is protected.
So, I'd need to find where _M_init_facet
ought to be called, and why it is not.
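A minimal sketch of the has_facet / use_facet mechanics discussed above, using only the public std::locale interface. The facet tag_facet and the three helper functions are hypothetical names introduced for illustration; this is not the fstr.cc code, and it does not touch the libstdc++ internals (_M_init_facet, _M_install_facet). It assumes that a facet type is found in a locale only once an instance has been installed through a locale constructor, which is what the standard describes.

```cpp
#include <cstddef>
#include <locale>
#include <typeinfo>

// Hypothetical user-defined facet, standing in for something like ctype<UChar>.
class tag_facet : public std::locale::facet {
public:
    static std::locale::id id;  // every facet type carries its own static id
    explicit tag_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
};
std::locale::id tag_facet::id;

// The classic "C" locale only holds the standard facets.
bool in_classic_locale() {
    return std::has_facet<tag_facet>(std::locale::classic());
}

// A locale copy constructed with the facet does hold it.
bool in_extended_locale() {
    std::locale loc(std::locale::classic(), new tag_facet);
    return std::has_facet<tag_facet>(loc);
}

// use_facet on a locale lacking the facet throws std::bad_cast.
bool use_facet_throws_bad_cast() {
    try {
        std::use_facet<tag_facet>(std::locale::classic());
    } catch (const std::bad_cast&) {
        return true;
    }
    return false;
}
```

If this reading is right, the bad cast seen under the debugger is the documented behaviour of use_facet for a facet which was never installed, rather than a corruption of the facet table.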
Cloning now the new unicode-org git repo (v 62.1, release candidate)
I don't build it yet, waiting for the release.
I guess I'll have to fork it?
I try to specialize ctype for UChar,
starting from the wchar_t specialization in locale_facets.h.
Now the linker complains about missing the code:
fstr> make CXXFLAGS="-std=c++11 -g -O0"
g++ -std=c++11 -g -O0 -I/usr/local/include -c -o fstr.o fstr.cc
g++ -std=c++11 -g -O0 -I/usr/local/include fstr.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata -o fstr
fstr.o: In function `bool std::has_facet<std::ctype<char16_t> >(std::locale const&)':
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `typeinfo for std::ctype<char16_t>'
fstr.o: In function `std::ctype<char16_t> const& std::use_facet<std::ctype<char16_t> >(std::locale const&)':
/usr/include/c++/4.9/bits/locale_classes.tcc:143: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:143: undefined reference to `typeinfo for std::ctype<char16_t>'
collect2: error: ld returned 1 exit status
../parse/rules.mk:13: recipe for target 'fstr' failed
make: *** [fstr] Error 1
Though, I don't declare or invoke these functions explicitly.
Cloning gcc git repo now in ~/git/gnu/gcc...
Except that it failed. Twice.
gnu> git clone https://github.com/gcc-mirror/gcc.git
Cloning into 'gcc'...
remote: Counting objects: 2356004, done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 2356004 (delta 12), reused 14 (delta 7), pack-reused 2355945
Receiving objects: 100% (2356004/2356004), 2.57 GiB | 978.00 KiB/s, done.
Resolving deltas: 95% (1838195/1932842)
error: index-pack died of signal 11
fatal: index-pack failed
Trying now to fork, and clone the fork...
This worked. Even if it won't work for pull requests,
as this is a mirror.
Tried to specialize ctype::_M_initialize_ctype
from gcc/libstdc++-v3/config/locale/gnu/ctype_members.cc
but failed to compile:
fstr> make CXXFLAGS="-std=c++11 -g -O0"
g++ -std=c++11 -g -O0 -I/usr/local/include -c -o fstr.o fstr.cc
fstr.cc: In member function ‘void std::ctype<char16_t>::_M_initialize_ctype()’:
fstr.cc:140:51: error: ‘__uselocale’ was not declared in this scope
__c_locale old = __uselocale(_M_c_locale_ctype);
^
Found that mandb was running forever.
Tried a reboot, which failed, because of disk corruption.
This time, the boot string in /boot/cmdline.txt was OK.
Ran:
# umount /dev/mmcblk0p6
# e2fsck -n /dev/mmcblk0p6
# e2fsck -p /dev/mmcblk0p6
# e2fsck /dev/mmcblk0p6
and accepted everything.
There was:
- one old issue with:
Multiply-claimed block(s) 5345739
(There are two inodes containing multiply-claimed blocks.)
file /home/marc/public_html/externsw/www/csn/dev-env/doc/share.html (inode #1327195, mod time Sat Jun 9 13:17:11 2018)
/home/marc/public_html/externsw/www/csn/dev-env/doc/hlink.html (inode #1327229, mod time Sat Jun 9 15:51:02 2018)
Cloned the block, and later restored the corrupted file from sartre.
- an old (Mar 21 2017) issue with Sergey's files
- a recent issue with the failure to clone the gcc git repo
The errors were:
Entry 'sort' in /home/marc/git/mgirod/gcc/libgo/go (294060) has deleted/unused inode 687723. Clear<y>? yes
...
Entry 'fortran' in /home/marc/git/mgirod/gcc/libgo/misc/go (572908) has deleted/unused inode 572933. Clear<y>? yes
...
Pass 3: Checking directory connectivity
Unconnected directory inode 573083 (...)
Connect to lost+found<y>? yes
...
[note: these are entries 'cleared' earlier...]
Pass 4: Checking reference counts
Inode 294060 ref count is 45, should be 36. Fix<y>? yes
Inode 572908 ref count is 20, should be 13. Fix<y>? yes
[note: these had deleted/unused entries, see above]
Unattached zero-length inode 573041, Clear<y>? yes
...
[note: contiguous numbers up to 573082]
Inode 573083 ref count is 3, should be 2. Fix<y>? yes
[note this was connected to lost+found]
...
Unattached inode 815073
Connect to lost+found<y>? yes
Inode 815073 ref count is 2, should be 1. Fix<y>? yes
...
Pass 5: Checking group summary information
Block bitmap differences: -1666987 -(2108371--2108379) -...
Fix<y>? yes
Free blocks count wrong for group #0 (21148, counted 21147)
Fix<y>? yes
...
Free blocks count wrong for group #69 (226, counted 326)
Fix<y>? yes
Directories count wrong for group #69 (317, counted 308)
Fix<y>? yes
...
Afterwards, cleaned up lost+found
Back to fstr...
extern "C" __typeof(uselocale) __uselocale;
And now defining the specializations in order to link:
/home/marc/git/tests/fstr/fstr.cc:153: undefined reference to `std::ctype<char16_t>::_M_convert_to_wmask(unsigned short) const'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `typeinfo for std::ctype<char16_t>'
But there comes ICU... The logic is different...
Anyway... not clear what the intention is. Let's debug when running.
I'm afraid ctype is far too simple for ICU.
May also try to debug printUnicodeString e.g. in ilc.
Puzzled by typeinfo. Cannot see where it comes from.
No symbol with that name explicit in the source.
Noting the use of quotes in the error, typeinfo is not a symbol from the source:
it is generated by the compiler,
and its generation is probably prevented by the fact the class is not completely defined.
Defining the virtual destructor... This did it.
Only now, the linker complains about the other virtual member
functions.
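A sketch of why defining the virtual destructor could make the typeinfo appear. It assumes the Itanium C++ ABI convention followed by GCC: the vtable and typeinfo of a polymorphic class are emitted in the translation unit which defines its first non-inline, non-pure virtual function (the "key function"). The classes facet_like and derived_facet_like are hypothetical stand-ins for the ctype specialization.

```cpp
// Hypothetical class with only out-of-line virtual members, as in a
// hand-written explicit specialization.
struct facet_like {
    virtual ~facet_like();          // first virtual member: the key function
    virtual int do_widen() const;
};

// Defining the key function is what makes the compiler emit the vtable and
// the typeinfo here. With this definition missing, uses of typeid or
// dynamic_cast on facet_like would link against an undefined
// "typeinfo for facet_like".
facet_like::~facet_like() {}

// Every other declared virtual member must be defined as well, since the
// vtable refers to it: this matches the next round of linker complaints.
int facet_like::do_widen() const { return 1; }

struct derived_facet_like : facet_like {
    int do_widen() const override { return 2; }
};

// dynamic_cast needs the typeinfo objects of both classes.
bool is_derived(const facet_like& f) {
    return dynamic_cast<const derived_facet_like*>(&f) != nullptr;
}
```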
Of course... at least some of these functions involve Unicode!
Obviously do_toupper and do_tolower,
but also do_widen and do_narrow!
In fact, I leave for now _M_widen, which may use UConverter.
Same with _M_narrow, and wctob.
Compiled and linked...
Doesn't crash, but still produces nothing.
In fact, there is still no change whatsoever:
fstr> gdb fstr
(gdb) b 446
(gdb) r
446 basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) s
...
std::has_facet<std::ctype<char16_t> > (__loc=...)
at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
(gdb) s
107 const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __i
$1 = 28
(gdb) s
110 && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) p __facets[__i]
$2 = (const std::locale::facet *) 0xb6f9d4d8 <vtable for std::__timepunct<wchar_t>+8>
New run, stopping earlier:
446 basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) s
std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (
this=0xbefff86c, __s=0x8095c "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>)
at /usr/include/c++/4.9/fstream:645
645 : __ostream_type(), _M_filebuf()
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::basic_ios (
this=0xbefff8fc) at /usr/include/c++/4.9/bits/basic_ios.h:456
456 _M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
(gdb) s
457 { }
(gdb) s
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (
this=0xbefff86c,
__vtt_parm=0x80bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
385 { this->init(0); }
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::init (this=0xbefff8fc,
__sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:129
129 ios_base::_M_init();
(gdb) s
132 _M_cache_locale(_M_ios_locale);
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
159 if (__builtin_expect(has_facet<__ctype_type>(__loc), true))
(gdb) bt
#0 std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
#1 0x00015c50 in std::basic_ios<char16_t, std::char_traits<char16_t> >::init
(this=0xbefff8fc, __sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:132
#2 0x000158e8 in std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (this=0xbefff86c,
__vtt_parm=0x80bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
#3 0x0001467c in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff86c, __s=0x8095c "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>)
at /usr/include/c++/4.9/fstream:645
#4 0x00013434 in main () at fstr.cc:446
And has_facet<char_traits<char16_t>>(_M_ios_locale)
returns false, and if we step in, we get our old:
(gdb) s
std::has_facet<std::ctype<char16_t> > (__loc=...)
at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
(gdb) bt
#0 std::has_facet<std::ctype<char16_t> > (__loc=...)
at /usr/include/c++/4.9/bits/locale_classes.tcc:106
...
with __i equal to 28 (i.e. wchar_t).
libstdc++-v3> egrep -rl 'has_facet<(std::)?ctype<wchar_t> ?>' .
./include/bits/locale_facets.tcc
Only extern declarations...
libstdc++-v3> egrep -rl 'has_facet<(std::)?ctype<wchar_t> ?>' ..
I believe that the problem is ctype<UChar>::_M_initialize_ctype
It should add a new entry (e.g. 29) into __facets
Breakpoint in the function: not caught...
fstr> nm -C fstr | grep ctype | wc -l
35
fstr> nm -C fstr | egrep 'ctype(_abstract_base)?<char16_t>' | wc -l
31
fstr> nm -C fstr | grep ctype | egrep -v 'ctype(_abstract_base)?<char16_t>'
000135c0 t _GLOBAL__sub_I__ZNSt5ctypeIDsE2idE
U __wctype_l@@GLIBC_2.4
00080f60 V typeinfo for std::ctype_base
00080f50 V typeinfo name for std::ctype_base
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>' | wc -l
50
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>::_M'
U std::ctype<wchar_t>::_M_initialize_ctype()
00000000 T std::ctype<wchar_t>::_M_convert_to_wmask(unsigned short) const
00000000 T std::ctype<wchar_t>::_M_initialize_ctype()
fstr> nm -C fstr | grep 'ctype<char16_t>::_M'
000129e8 T std::ctype<char16_t>::_M_convert_to_wmask(unsigned short) const
00012c30 T std::ctype<char16_t>::_M_initialize_ctype()
Some hope: char16_t and wchar_t are distinct.
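The distinctness can also be checked at compile time, independently of nm. A small sketch, in which the template id_holder is a hypothetical stand-in for a facet template with a static id member:

```cpp
#include <type_traits>

// char16_t is a distinct built-in type in C++11, not an alias of wchar_t
// or of an unsigned 16-bit integer type.
static_assert(!std::is_same<char16_t, wchar_t>::value,
              "char16_t and wchar_t are distinct types");
static_assert(!std::is_same<char16_t, unsigned short>::value,
              "char16_t is not a typedef of unsigned short");

// Hypothetical stand-in for a facet template: one static member per
// specialization, hence one separate id object per character type.
template <typename CharT>
struct id_holder {
    static int id;
};
template <typename CharT>
int id_holder<CharT>::id = 0;

bool ids_are_separate_objects() {
    const void* a = &id_holder<char16_t>::id;
    const void* b = &id_holder<wchar_t>::id;
    return a != b;
}
```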
My function, however incomplete, is just not invoked yet.
Added the constructors, but they are not called:
fstr> nm -C fstr | grep 'ctype<char16_t>::ctype'
00012ed8 T std::ctype<char16_t>::ctype(unsigned int)
00012f54 T std::ctype<char16_t>::ctype(__locale_struct*, unsigned int)
00012ed8 T std::ctype<char16_t>::ctype(unsigned int)
00012f54 T std::ctype<char16_t>::ctype(__locale_struct*, unsigned int)
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>::ctype'
00000000 T std::ctype<wchar_t>::ctype(unsigned int)
00000000 T std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
00000000 T std::ctype<wchar_t>::ctype(unsigned int)
00000000 T std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
U std::ctype<wchar_t>::ctype(unsigned int)
U std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
In gcc/libstdc++-v3/src/c++98/localename.cc,
there is a locale::_Impl::_Impl(...)
which constructs all the facets for char and wchar_t.
It does it with its private template member function
_M_init_facet, defined inline in
gcc/libstdc++-v3/include/bits/locale_classes.h:
_M_install_facet(&_Facet::id, __facet);
The use would be:
_M_init_facet<UChar>(new ctype<UChar>());
Looking for classes with protected members,
that one might extend in derived classes:
- in /usr/include/c++/4.9/bits/locale_classes.h
- locale::facet:
it is the collate template classes which specialize it.
- locale::_Impl:
locale, locale::facet,
and the has_facet and use_facet template functions
are friends
The class locale (in locale_classes.h), is commented as:
an extensible container for user-defined localization.
Inside, there are 3 private _Impl pointer members:
_M_impl (shared),
and 2 static:
_S_classic ("C" reference) and
_S_global (Current).
libstdc++-v3> pwd
/home/marc/git/mgirod/gcc/libstdc++-v3
libstdc++-v3> find src -type f -name localename.cc
src/c++98/localename.cc
libstdc++-v3> cksum include/bits/locale_classes.h
1219623670 24897 include/bits/locale_classes.h
libstdc++-v3> cksum /usr/include/c++/4.9/bits/locale_classes.h
2000905944 22985 /usr/include/c++/4.9/bits/locale_classes.h
libstdc++-v3> diff include/bits/locale_classes.h /usr/include/c++/4.9/bits/locale_classes.h
3c3
< // Copyright (C) 1997-2018 Free Software Foundation, Inc.
---
> // Copyright (C) 1997-2014 Free Software Foundation, Inc.
...
Significant additions (~9%)
Trying an update/upgrade cycle...
This did not affect libstdc++
Maybe Apache has an example...
int main () {
std::locale loc; // Default locale
std::locale my_loc (loc, new ex_codecvt);
Found the source code, and attempted to build it
in ~/git/tests/imbue
Not trivial: uses RogueWave...
Committed the files as such, and switched to a dev branch.
Cleaned up the RW macros, and built!
This shows ISO 8859-1 converted to US ASCII.
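A self-contained sketch of the same technique, without the RogueWave macros. It is not the Apache example: ascii_only_codecvt and write_ascii_only are hypothetical names, and the conversion is reduced to replacing every byte above 0x7f with '?'. It assumes that the file stream calls the facet's do_out whenever do_always_noconv returns false, and that the locale is imbued before the file is opened.

```cpp
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical codecvt facet: passes US-ASCII through, replaces the rest.
class ascii_only_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    // Tell the file buffer that a conversion is required on output.
    bool do_always_noconv() const noexcept override { return false; }

    result do_out(state_type&,
                  const intern_type* from, const intern_type* from_end,
                  const intern_type*& from_next,
                  extern_type* to, extern_type* to_end,
                  extern_type*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end) {
            const unsigned char c = static_cast<unsigned char>(*from_next++);
            *to_next++ = c < 0x80 ? static_cast<char>(c) : '?';
        }
        return from_next == from_end ? ok : partial;
    }
};

// Writes text through the facet, then reads the file back unconverted.
std::string write_ascii_only(const std::string& path, const std::string& text) {
    {
        std::ofstream out;
        // The locale copy takes ownership of the facet (refs defaults to 0).
        out.imbue(std::locale(std::locale::classic(), new ascii_only_codecvt));
        out.open(path.c_str(),
                 std::ios::out | std::ios::binary | std::ios::trunc);
        out << text;
    }  // closing the stream flushes the converted bytes
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Called with an ISO 8859-1 string such as "caf\xe9!", the function should return the file contents with the accented byte replaced.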
Trying to apply, by constructing a locale copy
with the ctype<UChar> facet.
Builds, but no effect.
More and more symbols using char16_t,
mostly weak:
fstr> nm -C ./fstr | grep char16_t | wc -l
280
fstr> nm -C ./fstr | grep char16_t | perl -nle '$h{$1}++ if /^\w+ (\w) /;END{print"$_: $h{$_}" for sort keys %h}'
B: 1
R: 6
T: 54
V: 27
W: 166
r: 1
t: 18
u: 7
I confused traits (e.g. char_traits)
and facets (e.g. ctype).
It looks like ICU doesn't build upon the C locale
Explored constructing the ctype<UChar>
with a reference count of 1:
locale loc(defloc, new ctype<UChar>(1));
Which leads to:
std::locale::locale<std::ctype<char16_t> > (this=0xbefffa98, __other=...,
__f=0xc5008) at /usr/include/c++/4.9/bits/locale_classes.tcc:47
47 _M_impl = new _Impl(*__other._M_impl, 1);
(gdb) bt
#0 std::locale::locale<std::ctype<char16_t> > (this=0xbefffa98, __other=...,
__f=0xc5008) at /usr/include/c++/4.9/bits/locale_classes.tcc:47
#1 0x0001387c in main () at fstr.cc:451
(gdb) p __other
$5 = (const std::locale &) @0xbefffa9c: {static none = 0, static ctype = 1,
static numeric = 2, static collate = 4, static time = 8,
static monetary = 16, static messages = 32, static all = 63,
_M_impl = 0xb6fa3d14, static _S_classic = <optimized out>,
static _S_global = <optimized out>, static _S_categories = <optimized out>,
static _S_once = <optimized out>}
(gdb) s
50 { _M_impl->_M_install_facet(&_Facet::id, __f); }
(gdb) p _Facet::id
$6 = {_M_index = 0, static _S_refcount = <optimized out>}
(gdb) s
56 delete [] _M_impl->_M_names[0];
(gdb)
57 _M_impl->_M_names[0] = 0; // Unnamed.
(gdb)
58 }
again with no other visible effect
(as the new _Impl gets deleted?).
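A sketch of what the refs argument of the facet constructor controls, as far as the standard describes it: with refs equal to 0 the last locale holding the facet deletes it, with refs equal to 1 the caller keeps ownership. The facet counted_facet and the helper function are hypothetical names; nothing here depends on fstr.cc.

```cpp
#include <cstddef>
#include <locale>

// Hypothetical facet which counts its own destructions.
class counted_facet : public std::locale::facet {
public:
    static std::locale::id id;
    static int destroyed;
    explicit counted_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
    ~counted_facet() override { ++destroyed; }
};
std::locale::id counted_facet::id;
int counted_facet::destroyed = 0;

// Returns how many facets were destroyed once the only locale holding
// the facet has gone out of scope.
int destroyed_after_locale_dies(std::size_t refs) {
    counted_facet::destroyed = 0;
    counted_facet* f = new counted_facet(refs);
    {
        std::locale loc(std::locale::classic(), f);
    }  // loc, and its private implementation object, die here
    const int seen = counted_facet::destroyed;
    if (refs != 0) {
        delete f;  // with refs == 1, the caller remains responsible
    }
    return seen;
}
```

So the locale copy going away would indeed take a refs=0 facet with it, whereas a facet constructed with 1 survives the locale.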
Tried to refresh my fork of gcc, but failed to pull:
gcc> git remote -v
origin https://github.com/mgirod/gcc.git (fetch)
origin https://github.com/mgirod/gcc.git (push)
upstream https://github.com/gcc-mirror/gcc.git (fetch)
upstream https://github.com/gcc-mirror/gcc.git (push)
gcc> git pull upstream master
...
Updating 9dec9a1..698c03a
error: Your local changes to the following files would be overwritten by merge:
libgo/go/runtime/mapspeed_test.go
...
libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
Please, commit your changes or stash them before you can merge.
Aborting
gcc> git commit -m 'changes between mirrors' -a
[master b82f073] changes between mirrors
2616 files changed, 484545 deletions(-)
...
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
(use "git push" to publish your local commits)
nothing to commit, working directory clean
gcc> git push
...
C-c C-c
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
(use "git push" to publish your local commits)
nothing to commit, working directory clean
gcc> cd ..
mgirod> rm -rf gcc
mgirod> git clone [email protected]:mgirod/gcc.git
...
Receiving objects: 100% (2356011/2356011), 2.49 GiB | 1.17 MiB/s, done.
Connection to github.com closed by remote host.
...
Resolving deltas: 100% (1940070/1940070), done.
Checking connectivity... done.
Checking out files: 100% (81432/81432), done.
mgirod> cd gcc
gcc> git remote add upstream [email protected]:gcc-mirror/gcc.git
gcc> git remote -v
origin [email protected]:mgirod/gcc.git (fetch)
origin [email protected]:mgirod/gcc.git (push)
upstream [email protected]:gcc-mirror/gcc.git (fetch)
upstream [email protected]:gcc-mirror/gcc.git (push)
gcc> git pull upstream master
...
Receiving objects: 100% (14662/14662), 29.93 MiB | 1.44 MiB/s, done.
Resolving deltas: 100% (11859/11859), completed with 3474 local objects.
From github.com:gcc-mirror/gcc
* branch master -> FETCH_HEAD
* [new branch] master -> upstream/master
Updating 9dec9a1..698c03a
Checking out files: 100% (4453/4453), done.
Fast-forward
ChangeLog | 45 +-
...
create mode 100644 libstdc++-v3/testsuite/ext/new_allocator/eq.cc
gcc> git branch
* master
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1215 commits.
(use "git push" to publish your local commits)
It took 2.65 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working directory clean
gcc> git config --global push.default simple
gcc> git push
Trying again to debug...
I don't touch the source for now,
i.e. I leave on line 451 (1 will keep the ctype object):
locale loc(defloc, new ctype<UChar>(1));
But during the basic_ofstream construction,
one initializes a new ctype object, this time with 0
(line 456 in basic_ios.h; _M_ctype is a member object of type
ctype<char16_t>):
basic_ios()
: ios_base(), _M_tie(0), _M_fill(char_type()), _M_fill_init(false),
_M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
{ }
and the constructor is empty!?
My breakpoints get caught from the construction on line 451,
but not from this on line 452:
fstr> gdb fstr
(gdb) b 290
(gdb) b 275
(gdb) b 452
(gdb) run
Breakpoint 1, std::ctype<char16_t>::ctype (this=0xc5008, refs=1) at fstr.cc:290
290 _M_c_locale_ctype(_S_get_c_locale()), _M_narrow_ok(false) { _M_initialize_ctype(); }
(gdb) info stack
#0 std::ctype<char16_t>::ctype (this=0xc5008, refs=1) at fstr.cc:290
#1 0x00013864 in main () at fstr.cc:451
(gdb) c
Breakpoint 2, std::ctype<char16_t>::_M_initialize_ctype (this=0xc5008) at fstr.cc:275
275 for (i = 0; i < 128; ++i) {
(gdb) c
Breakpoint 3, main () at fstr.cc:452
452 basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) c
Continuing.
[Inferior 1 (process 2067) exited normally]
Tried to solve this with forward declarations...
Didn't help.
Commented away the loc construction,
in case it was this which prevented the second construction, but no.
I am building with:
fstr> make CXXFLAGS="-std=c++11 -g -O0"
And yet I get:
(gdb) s
std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff84c, __s=0x80bfc "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.9/fstream:645
645 : __ostream_type(), _M_filebuf()
Trying:
fstr> make CXXFLAGS="-std=c++11 -ggdb -Og"
Now, no breakpoint is caught!
Removed -Og, and got back to the original situation.
More forward declarations may help, esp:
template<typename C, typename T> class basic_ios;
template<> class basic_ios<UChar, char_traits<UChar> >;
But then, one must provide the definition before including ostream.
So I did.
Now caught in the linker, for undefined symbols
for the non inlined members...
fstr.o: In function `std::basic_ios<char16_t, std::char_traits<char16_t> >::setstate(std::_Ios_Iostate)':
/home/marc/git/tests/fstr/fstr.cc:138: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::clear(std::_Ios_Iostate)'
fstr.o: In function `std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream(char const*, std::_Ios_Openmode)':
/usr/include/c++/4.9/fstream:647: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::init(std::basic_streambuf<char16_t, std::char_traits<char16_t> >*)'
fstr.o: In function `std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream()':
/usr/include/c++/4.9/ostream:385: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::init(std::basic_streambuf<char16_t, std::char_traits<char16_t> >*)'
fstr.o: In function `std::basic_ofstream<char16_t, std::char_traits<char16_t> >::open(char const*, std::_Ios_Openmode)':
/usr/include/c++/4.9/fstream:724: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::clear(std::_Ios_Iostate)'
OK: I managed to provide inline implementations of these functions.
Now, maybe reimplementing them in terms of ICU?
(gdb) s
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (this=0xbefff84c,
__vtt_parm=0x80e14 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>)
at /usr/include/c++/4.9/ostream:385
385 { this->init(0); }
...
std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...) at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
(gdb) n
107 const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __i
$1 = 31
(gdb) s
110 && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) p __facets[__i]
$2 = (const std::locale::facet *) 0xb6e1c784 <_nl_C_locobj>
(gdb) s
114 }
(gdb)
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:89
89 }
(gdb) p _M_codecvt
$3 = (const std::basic_filebuf<char16_t, std::char_traits<char16_t> >::__codecvt_type *) 0x0
Still nothing added to the utest.txt file.
It looks like the point where it fails is:
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850, __s=0x80bcc u"Il y a de la joie\n",
__n=18) at /usr/include/c++/4.9/bits/fstream.tcc:640
640 streamsize __ret = 0;
(gdb) n
644 const bool __testout = (_M_mode & ios_base::out
(gdb)
645 || _M_mode & ios_base::app);
(gdb)
646 if (__check_facet(_M_codecvt).always_noconv()
(gdb) p __testout
$15 = true
(gdb) n
0x000168cc in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=...,
__s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:109
109 __catch(...)
(gdb) info stack
#0 0x000168cc in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=...,
__s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:109
#1 0x000154e8 in std::operator<< <char16_t, std::char_traits<char16_t> > (__out=..., __s=0x80bcc u"Il y a de la joie\n")
at /usr/include/c++/4.9/ostream:518
#2 0x000135cc in main () at fstr.cc:632
Back, deeper inside:
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850, __s=0x80bcc u"Il y a de la joie\n",
__n=18) at /usr/include/c++/4.9/bits/fstream.tcc:640
640 streamsize __ret = 0;
(gdb) n
644 const bool __testout = (_M_mode & ios_base::out
(gdb)
645 || _M_mode & ios_base::app);
(gdb)
646 if (__check_facet(_M_codecvt).always_noconv()
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0) at /usr/include/c++/4.9/bits/basic_ios.h:48
48 if (!__f)
(gdb) p __f
$2 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
(gdb) info stack
#0 std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0) at /usr/include/c++/4.9/bits/basic_ios.h:48
#1 0x00019ba0 in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850,
__s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/fstream.tcc:646
#2 0x0001521c in std::basic_streambuf<char16_t, std::char_traits<char16_t> >::sputn (this=0xbefff850,
__s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/streambuf:451
#3 0x00014404 in std::__ostream_write<char16_t, std::char_traits<char16_t> > (out=..., s=0x80bcc u"Il y a de la joie\n",
n=18) at fstr.cc:620
#4 0x0001687c in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=...,
__s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:101
#5 0x000154e8 in std::operator<< <char16_t, std::char_traits<char16_t> > (__out=..., __s=0x80bcc u"Il y a de la joie\n")
at /usr/include/c++/4.9/ostream:518
#6 0x000135cc in main () at fstr.cc:632
This used __check_facet instead of my inline chkfac,
and it returned 0, even if of the expected type.
It is not that it returns 0, it has a 0 _M_codecvt:
646 if (__check_facet(_M_codecvt).always_noconv()
(gdb) p _M_codecvt
$1 = (const std::basic_filebuf<char16_t, std::char_traits<char16_t> >::__codecvt_type *) 0x0
This is a member of basic_filebuf
It is initialized in the constructor (fstream.tcc:88):
_M_codecvt = &use_facet<__codecvt_type>(this->_M_buf_locale);
Except that for this to happen,
has_facet must have returned true, and...
(gdb) s
std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...) at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
(gdb) s
107 const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) s
110 && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) s
114 }
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850)
at /usr/include/c++/4.9/bits/fstream.tcc:89
89 }
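The silent failure (no crash, nothing written) is consistent with the standard behaviour of formatted output: an exception thrown by the stream buffer is caught by the inserter, which only sets badbit unless exceptions were enabled on the stream. A minimal sketch with plain char streams, where throwing_buf is a hypothetical buffer standing in for a basic_filebuf whose codecvt pointer is null:

```cpp
#include <ios>
#include <ostream>
#include <streambuf>
#include <typeinfo>

// Hypothetical buffer which fails the way __check_facet does on a null facet.
class throwing_buf : public std::streambuf {
protected:
    std::streamsize xsputn(const char*, std::streamsize) override {
        throw std::bad_cast();
    }
    int_type overflow(int_type) override {
        throw std::bad_cast();
    }
};

// The insertion does not propagate the exception: it only marks the stream.
bool write_fails_silently() {
    throwing_buf buf;
    std::ostream os(&buf);
    os << "Il y a de la joie\n";
    return os.bad();
}
```

Checking the stream state after the first write (or enabling exceptions with os.exceptions(std::ios::badbit)) may therefore show the failure earlier than stepping through the library code.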
Next: the generic behaviour is not OK, in fact, it is not inline, not virtual,
195 ios_base::_M_init();
codecvt
This example compiles with GCC 8.1 (C++2a)...
One considered deprecating codecvt...
Looking for ios_base::_M_init implementation:
libstdc++-v3> git checkout gcc-4_9_4-release
...
HEAD is now at d319148... Mark as release
Found the code I was looking for in ios_locale.cc:
// Called only by basic_ios<>::init.
void
ios_base::_M_init() throw()
{
// NB: May be called more than once
_M_precision = 6;
_M_width = 0;
_M_flags = skipws | dec;
_M_ios_locale = locale();
}
which means that it is not what I was interested in.
I leave the gcc fork in the 4.9 release state for now.
But now,
I missed the point where _Facet::id._M_id() was created:
std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...)
at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106 const size_t __i = _Facet::id._M_id();
which must be in my own code, since it is 31, instead of previously 28.
_Facet here is codecvt<UChar, char, uencodstate>
It is assigned to in locale::_Impl::_M_install_facet
but this one just installs what it gets as argument,
and that's when constructing the locale object,
in ~/git/mgirod/gcc/libstdc++-v3/src/c++98/codecvt.cc.
Added a definition for the id object.
_M_buf_locale is a member of basic_streambuf,
which is a base of basic_filebuf.
Added the explicit specialization,
if only to be able to put a breakpoint.
Decided to clone my own fork into a 4.9 tree,
in order to free gcc for updates:
mgirod> git clone gcc gcc-4.9
mgirod> cd gcc
gcc> git checkout master
fstr> make CXXFLAGS="-std=c++11 -g -O0"
make: Nothing to be done for 'all'.
fstr> gdb fstr
(gdb) b 294
(gdb) run
Breakpoint 1, std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff850) at fstr.cc:296
296 _M_out_cur(0), _M_out_end(0), _M_buf_locale(locale()) { }
(gdb) info stack
#0 std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff850) at fstr.cc:296
#1 0x0001643c in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:85
#2 0x0001532c in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff84c, __s=0x80bd0 "utest.txt", __mode=std::_S_out,
__in_chrg=<optimized out>, __vtt_parm=<optimized out>)
at /usr/include/c++/4.9/fstream:645
#3 0x00013644 in main () at fstr.cc:636
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (
this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:87
87 if (has_facet<__codecvt_type>(this->_M_buf_locale))
There, __codecvt_type is
codecvt<UChar, char, CTUC::state_type>
CTUC::state_type is std::mbstate_t
by default as _Char_types
if not overridden/specialized for UChar.
mbstate_t is defined in cwchar as int[6].
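The derivation of the state type can be checked at compile time. In this sketch, filebuf_conversion is a hypothetical helper which mirrors how a file buffer picks its codecvt type from the character traits; it assumes, as the standard specifies, that char_traits<char16_t>::state_type is std::mbstate_t.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <type_traits>

// Hypothetical mirror of the typedefs used by a file buffer.
template <typename CharT, typename Traits = std::char_traits<CharT> >
struct filebuf_conversion {
    typedef typename Traits::state_type state_type;
    typedef std::codecvt<CharT, char, state_type> codecvt_type;
};

// Without a specialization of the traits, the state type is mbstate_t.
static_assert(std::is_same<std::char_traits<char16_t>::state_type,
                           std::mbstate_t>::value,
              "char_traits<char16_t> uses mbstate_t");

bool state_type_is_mbstate() {
    return std::is_same<filebuf_conversion<char16_t>::state_type,
                        std::mbstate_t>::value;
}
```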
For a couple of weeks, got into trouble:
could not reindex htdig, and failed to update berry.
I ended up upgrading to the next OS version: stretch
although probably,
a mere reboot would have solved the cause of the problems.
However,
one consequence was that the htdig.conf file had got corrupted,
and it is only today that I was at last able to log in again over ssh,
and regenerate a config file.
With stretch, the version of gcc is now 6.3.0
So, that's the version I check out in my reference repo.
Marc Girod