Discussion:
About encoding detection and default encoding in SciTE
Sylvain Brunerie
2010-06-11 12:48:23 UTC
Hi,

First, please excuse me if I've misunderstood something (and please
excuse my English, which may not be perfect), but here are my thoughts
concerning encoding matters in SciTE.

– I like the idea of selecting the encoding by checking for the
presence of “coding:utf-8” (or something like that) in the first two
lines of the file, though I'm wondering why we never use the Linux
“file” command, for example, or something else (native) based on the
same principle, to determine the encoding of the file without asking
the user anything.
– The only real issue concerning encoding in SciTE is the default
encoding. When I create a new file, I want it to be UTF-8 encoded, so
that I can write special characters that ISO-8859 doesn't support, but
the default in SciTE seems to be ISO-8859. I therefore use the
code.page property: when I set it to 65001, the default encoding
becomes UTF-8, which is perfect. The problem is that I can no longer
open files in the ISO-8859 encoding. Indeed, in the Encoding menu, the
first item represents the default encoding, and the others represent
UTF-8 and UTF-16 encodings. So the only way to be able to select the
ISO-8859 encoding in this menu is to select it as the default encoding…
Am I wrong?

Thank you for taking the time to read this message and maybe answer it.
:)
And, by the way, thank you very much for making SciTE, IMHO the best
lightweight yet functional text editor (especially once encodings are
easier to handle!).

Regards,
— Sylvain Brunerie
http://innsbay.toile-libre.org
Neil Hodgson
2010-06-12 01:11:41 UTC
– I like the idea of selecting the encoding by checking for the
presence of “coding:utf-8” (or something like that) in the first two
lines of the file, though I'm wondering why we never use the Linux
“file” command, for example, or something else (native) based on the
same principle, to determine the encoding of the file without asking
the user anything.
The Linux file command 'guesses' an encoding statistically and may
do so incorrectly.
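
For instance, here is an illustrative run (-b is brief output, -i
prints the MIME type; the file names and results are hypothetical):

$ file -bi notes.txt
text/plain; charset=utf-8
$ file -bi exported.csv
text/plain; charset=iso-8859-1
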
– The only real issue concerning encoding in SciTE is the default
encoding. When I create a new file, I want it to be UTF-8 encoded, so
that I can write special characters that ISO-8859 doesn't support, but
the default in SciTE seems to be ISO-8859.
If you want to do this then choose a UTF-8 encoding from the File menu.

Neil
Nicolas Chachereau
2010-06-12 17:05:42 UTC
  The Linux file command 'guesses' an encoding statistically and may
do so incorrectly.
What most applications (e.g. gedit) do is default to UTF-8 whenever
possible, and otherwise choose an ASCII variant (say ISO-8859-1). If
the file contains only characters which can be represented in 7-bit
ASCII, it doesn't make any difference whether we edit it as UTF-8 or
ASCII - and if you're writing code and commenting in English, your file
may very well contain only such characters.

In UTF-8, a character like "é" (used in French, which is my mother
tongue) is made of two bytes (0xC3 0xA9), the combination of which is
rendered as "Ã©" in ISO-8859-1. While this is a valid byte sequence
there, it is an improbable one, so it wouldn't be a problem to default
to UTF-8. On the other hand, the "é" character in ASCII variants is
invalid in UTF-8, so we would fall back to the non-UTF-8 variant. (The
same reasoning would apply to any other such character, like "ä", whose
UTF-8 bytes render as "Ã¤".)
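
To make the bytes concrete, here is a minimal standalone sketch - not
SciTE code, just an illustration of the two encodings involved:

#include <cstdio>

int main() {
    // "é" (U+00E9) encoded as UTF-8 takes two bytes...
    const unsigned char utf8[] = {0xC3, 0xA9};
    // ...and a single byte in ISO-8859-1/Windows-1252.
    const unsigned char latin1 = 0xE9;
    // Read as ISO-8859-1, the UTF-8 pair 0xC3 0xA9 shows up as "Ã©":
    // legal but improbable in real text, so defaulting to UTF-8 is safe.
    // Read as UTF-8, the lone 0xE9 is a lead byte announcing a 3-byte
    // sequence; without two continuation bytes (0x80-0xBF) after it,
    // the data is simply not valid UTF-8, so we fall back.
    printf("UTF-8  e-acute: %02X %02X\n", utf8[0], utf8[1]);
    printf("Latin1 e-acute: %02X\n", latin1);
    return 0;
}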

I would be in favor of SciTE adopting such a mechanism. I had looked
into it once, but it seemed difficult to achieve. There was no clear
function for detecting characters invalid in UTF-8 (or there was one in
GTK+, but it wasn't included in the minimum version supported by SciTE
at that time), and chunks of the text were sent to Scintilla to be
rendered - but they were chunks of bytes, not of characters, so it was
possible for a multibyte UTF-8 character to be split and wrongly
detected as invalid.

I wrote a Lua script to emulate this "default to UTF-8" behaviour.
Whenever a file is opened in SciTE, it is run through iconv (using
something like `iconv -f UTF-8 /home/nico/opened-file > /dev/null').
If this doesn't report an error, the script switches the encoding to
UTF-8 "Cookie" (i.e. without BOM); otherwise it stays with the
ISO-8859-1 encoding.

It'd be nice if this were made possible in standard SciTE (say through
a configuration option) instead of having to resort to such an ugly
hack. A more talented programmer can probably find a solution to the
problems I had.
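
For what it's worth, a native version of the same validity test might
look something like this minimal sketch using iconv(3), relying on the
fact that converting UTF-8 to UTF-8 fails with EILSEQ on any invalid
sequence (the function name and buffer size are illustrative, not
SciTE code):

#include <iconv.h>
#include <cerrno>
#include <cstddef>

// Returns true if the byte range decodes cleanly as UTF-8.
bool DecodesAsUTF8(const char *data, size_t len) {
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return false;       // conversion unavailable; caller falls back
    char out[4096];
    char *in = const_cast<char *>(data);
    size_t inLeft = len;
    bool valid = true;
    while (inLeft > 0) {
        char *outPtr = out;
        size_t outLeft = sizeof(out);
        if (iconv(cd, &in, &inLeft, &outPtr, &outLeft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;   // output buffer full: reset it and go on
            valid = false;  // EILSEQ/EINVAL: not well-formed UTF-8
            break;
        }
    }
    iconv_close(cd);
    return valid;
}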

Regards,
Nicolas
Neil Hodgson
2010-06-13 01:16:51 UTC
Post by Nicolas Chachereau
What most applications (e.g. gedit) do is default to UTF-8 whenever
possible, and otherwise choose an ASCII variant (say ISO-8859-1). If
the file contains only characters which can be represented in 7-bit
ASCII, it doesn't make any difference whether we edit it as UTF-8 or
ASCII - and if you're writing code and commenting in English, your file
may very well contain only such characters.
If a file is opened with only 7-bit ASCII then an encoding still
has to be chosen in case the user enters a character not in 7-bit
ASCII.
Post by Nicolas Chachereau
In UTF-8, a character like "é" (used in French, which is my mother
tongue) is made of two bytes (0xC3 0xA9), the combination of which is
rendered as "Ã©" in ISO-8859-1. While this is a valid byte sequence
there, it is an improbable one, so it wouldn't be a problem to default
to UTF-8. On the other hand, the "é" character in ASCII variants is
invalid in UTF-8, so we would fall back to the non-UTF-8 variant. (The
same reasoning would apply to any other such character, like "ä", whose
UTF-8 bytes render as "Ã¤".)
While the 0xE9 byte ("é" in Windows-1252) is invalid by itself in a
UTF-8 file, it is valid as part of many other characters such as "鞔"
(0xE9 0x9E 0x94). Whether "鞔" is more likely in your files than "éž”"
(the same byte sequence in Windows-1252) will depend on your
particular circumstances.

You could look at the whole file and treat any invalid UTF-8
sequence as conclusive proof that the file is not UTF-8, and then
display it using a different encoding, but that ignores the intent of
the file. It is quite common to see files that are supposed to be UTF-8
(and which may have a BOM or cookie) contain invalid UTF-8, possibly
alongside valid multi-byte UTF-8 characters. The right thing to do is
to follow the intent of the file as expressed in a BOM, cookie or
setting rather than automatically switching encoding.
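
As a minimal sketch, reading that declared intent might look like the
following; the enum and function names are illustrative, not SciTE's
actual code:

#include <cstddef>
#include <cstring>
#include <string>

enum DeclaredEncoding { declNone, declUTF8BOM, declUTF8Cookie };

DeclaredEncoding DeclaredIntent(const char *data, size_t len) {
    // A UTF-8 byte order mark (EF BB BF) at the very start of the file.
    if (len >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return declUTF8BOM;
    // Otherwise look for an Emacs/Python-style cookie such as
    // "coding: utf-8", but only within the first two lines.
    size_t limit = 0;
    int newlines = 0;
    while (limit < len && newlines < 2) {
        if (data[limit] == '\n')
            ++newlines;
        ++limit;
    }
    const std::string head(data, limit);
    if (head.find("coding:") != std::string::npos &&
        head.find("utf-8") != std::string::npos)
        return declUTF8Cookie;
    return declNone;
}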

While there may be scope for a mode in which guesses are made, it
should be an option. I'd also like any contributor of such a patch to
commit to fixing any bugs found in the feature.

Neil
Nicolas Chachereau
2010-06-13 08:57:16 UTC
  If a file is opened with only 7-bit ASCII then an encoding still
has to be chosen in case the user enters a character not in 7-bit
ASCII.
That's right. What I meant was: we can choose UTF-8, because it works
and is becoming/has become a de-facto standard (at least in most
recent Linux distributions). As long as the user doesn't enter a
character not in 7-bit ASCII, he won't even notice we chose UTF-8.
  While the 0xE9 byte ("é" in Windows-1252) is invalid by itself in a
UTF-8 file, it is valid as part of many other characters such as "鞔"
(0xE9 0x9E 0x94). Whether "鞔" is more likely in your files than "éž”"
(the same byte sequence in Windows-1252) will depend on your
particular circumstances.
Well, yes, in theory. Can you give any real-world examples where the
sequence "éž" should be displayed in Windows-1252/ISO-8859-1?
  You could look at the whole file and treat any invalid UTF-8
sequence as conclusive proof that the file is not UTF-8, and then
display it using a different encoding, but that ignores the intent of
the file. It is quite common to see files that are supposed to be UTF-8
(and which may have a BOM or cookie) contain invalid UTF-8, possibly
alongside valid multi-byte UTF-8 characters. The right thing to do is
to follow the intent of the file as expressed in a BOM, cookie or
setting rather than automatically switching encoding.
Treating any invalid UTF-8 sequence as "conclusive proof" that the
file is not UTF-8, which is what I propose, is only meant for files
that don't express their "intent" in a BOM, cookie or setting. Right
now, a file which is supposed to be UTF-8 but doesn't use a BOM or
cookie will be opened by SciTE in ASCII mode, whether it contains
invalid characters or not. With my proposal, the behaviour would be
exactly the same for a file which contains invalid UTF-8. If, however,
the file doesn't contain any invalid UTF-8 sequences, it won't be a
problem to open it in UTF-8. If that's not what the user wanted, he can
still switch the display encoding, just as right now he has to switch
to UTF-8.
  While there may be scope for a mode in which guesses are made, it
should be an option. I'd also like any contributor of such a patch to
commit to fixing any bugs found in the feature.
I was proposing this as an option. However, I don't understand why you
keep speaking of "guessing"... it's not really guessing, it's just
displaying the file as UTF-8 if possible. Just as, right now, the
file is displayed as ASCII* no matter what (I'm still speaking about
files without a BOM or cookie).

Regards,
Nicolas

* actually, displayed using the encoding set with the "code.page" option
Neil Hodgson
2010-06-14 02:03:25 UTC
Post by Nicolas Chachereau
Well, yes, in theory. Can you give any real-world examples where the
sequence "éž" should be displayed in Windows-1252/ISO-8859-1?
Umm... Haven't you just changed sides on this ;-) You were
initially saying that an "é" (0xE9) made the file invalid as UTF-8 and
now the unlikelihood of "éž" makes it UTF-8?

The point I was trying to make is that if you want to implement
something based on UTF-8 validity then the implementation should be
accurate and ''' "é" character in ASCII variants is invalid in UTF-8
''' is just wrong. The byte value 0xe9 is not at all invalid in UTF-8.

Anyway, searching for "éž" shows the Czech word "též" (meaning
"also") seems to be fairly common, with millions of search hits. Czech
would more likely be in Windows-1250, but 1250 uses the same values as
1252 for these characters. I'd expect it to be sometimes used with
punctuation like quotes or ellipses – "též…" where "éž…" has the same
bytes as "鞅" in UTF-8. Unfortunately it's hard to search the net for
punctuation…
Post by Nicolas Chachereau
Treating any invalid UTF-8 sequence as "conclusive proof" that the
file is not UTF-8, which is what I propose, is only meant for files
that don't express their "intent" in a BOM, cookie or setting.
If the setting (code.page) takes precedence then how do you express
the encoding to be used when the file is not UTF-8?
Post by Nicolas Chachereau
I was proposing this as an option. However, I don't understand why you
keep speaking of "guessing"... it's not really guessing,
Much of your text talks about likelihoods such as whether "é" is
expected text or whether "éž”" will ever occur in real world examples
in a particular encoding. If you are going to base a feature on
likelihoods then sometimes the less likely case will turn up and the
implementation has guessed wrong. The 'file' command mentioned in the
original post uses statistical models to determine the probability of
various encodings and chooses the best match: it doesn't always choose
correctly so I'd call it a "guess".

Neil
Nicolas Chachereau
2010-06-14 14:21:50 UTC
  Umm... Haven't you just changed sides on this ;-) You were
initially saying that an "é" (0xE9) made the file invalid as UTF-8 and
now the unlikelihood of "éž" makes it UTF-8?
  The point I was trying to make is that if you want to implement
something based on UTF-8 validity then the implementation should be
accurate and ''' "é" character in ASCII variants is invalid in UTF-8
''' is just wrong. The byte value 0xe9 is not at all invalid in UTF-8.
I'm sorry, I haven't made myself clear... 0xE9 on its own, i.e.
between "h" and "o" (French "théorie"), makes a file invalid as UTF-8.
As part of a multibyte character, it might be valid. So while your
point is valid, it isn't a problem for the system I'm trying to
describe.
  Anyway, searching for "éž" shows the Czech word "též" (meaning
"also") seems to be fairly common, with millions of search hits. Czech
would more likely be in Windows-1250, but 1250 uses the same values as
1252 for these characters. I'd expect it to be sometimes used with
punctuation like quotes or ellipses – "též…" where "éž…" has the same
bytes as "鞅" in UTF-8. Unfortunately it's hard to search the net for
punctuation…
Okay, fair point.
Post by Nicolas Chachereau
Treating any invalid UTF-8 sequence as "conclusive proof" that the
file is not UTF-8, which is what I propose, is only meant for files
that don't express their "intent" in a BOM, cookie or setting.
  If the setting (code.page) takes precedence then how do you express
the encoding to be used when the file is not UTF-8?
I wasn't sure which setting you were speaking of. If it is code.page,
then I don't understand how the "intent" of the file would be
specified through this setting.
Post by Nicolas Chachereau
I was proposing this as an option. However, I don't understand why you
keep speaking of "guessing"... it's not really guessing,
  Much of your text talks about likelihoods such as whether "é" is
expected text or whether "éž”" will ever occur in real world examples
in a particular encoding. If you are going to base a feature on
likelihoods then sometimes the less likely case will turn up and the
implementation has guessed wrong. The 'file' command mentioned in the
original post uses statistical models to determine the probability of
various encodings and chooses the best match: it doesn't always choose
correctly so I'd call it a "guess".
I don't know enough about the internals of the file command, so I
won't say anything about that. It's true that I'm thinking about the
probability of particular sequences in particular encodings, which is
only a way to try to satisfy the user in the majority of cases, since
there is no reliable way to determine the encoding of a file with
neither BOM nor cookie. You are right: in some cases, it *will* be
wrong. However, I think those cases will be very rare – your Czech
example shows I may be too optimistic, but even in that case I think a
Czech-language file encoded in Windows-1250 would contain at least one
sequence that is invalid in UTF-8 (e.g. 0x9E (ž) or 0xE9 (é) on their
own). Besides, right now, while it is not "guessing", SciTE is *always*
wrong about UTF-8 files without a BOM or cookie – which are becoming
increasingly frequent in the Linux world (I have no statistics to back
that up, it's just my feeling).

Anyway, I'm willing to continue discussing this if you have other
questions/remarks or if I'm still forgetting something. My main
question at this point, however, is: would you accept such an
implementation, if it is made an option (say default.to.utf8 or
something like that)? If anyone is willing to give me a hand on this,
I'd be grateful – as I said, I had looked into it in the past and it
seemed difficult to achieve.

Regards,
Nicolas
Neil Hodgson
2010-06-15 00:28:48 UTC
Post by Nicolas Chachereau
I wasn't sure which setting you were speaking of. If it is code.page,
then I don't understand how the "intent" of the file would be
specified through this setting.
You normally use code.page to set the encoding that you (or your
group) have decided to work with and refine this for any directory
trees or directories that use different encodings. It is fairly rare
to place files with different encodings in a single directory, but this
can be handled for the UTF-8 + other-encoding case by adding cookies or
BOMs. I suppose this could be extended to allow file-pattern-specific
code.page settings if this were a common need.
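
For instance, with per-directory property files enabled, a tree-wide
encoding can be pinned like this (the file contents below are a
sketch):

# In the user options file, enable directory-level properties:
properties.directory.enable=1

# In SciTEDirectory.properties at the root of a UTF-8 project tree:
code.page=65001
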
Post by Nicolas Chachereau
Besides,
right now, while it is not "guessing", SciTE is *always* wrong about
UTF-8 files without a BOM or cookie
Not true when code.page is set to UTF-8.
Post by Nicolas Chachereau
– which are becoming increasingly
frequent in the Linux world (I have no statistics to back that up,
it's just my feeling).
Is your problem that you are opening files from projects that have
no defined encoding? I'm trying to imagine what workflow you are
following to produce files in unexpected encodings. Possibly you are
viewing external projects and do not know what the project has defined
as its encoding?
Post by Nicolas Chachereau
Anyway, I'm willing to continue discussing this if you have other
questions/remarks or if I'm still forgetting something. My main
question at this point, however, is: would you accept such an
implementation, if it is made an option (say default.to.utf8 or
something like that)?
I'm willing to accept something that is completely defined and has
a reasonable implementation. For a start what values are permitted for
default.to.utf8 and what do they mean?

Neil
Nicolas Chachereau
2010-06-16 09:46:41 UTC
  You normally use code.page to set the encoding that you (or your
group) have decided to work with and refine this for any directory
trees or directories that use different encodings. [snip]
Okay. The thing is: I use SciTE to open any text file I want to read
or edit; I'm not using it only on "projects" in clearly defined
directory trees. I use Ubuntu, and my locale is "fr_CH.utf8".

Here are a few examples of the problem we are talking about:
- I redirect the output of a shell command into a file, which I later
open using SciTE. This file is encoded using UTF-8, but has neither a
cookie nor a BOM. Since my system is in French and many of my files
have French names (e.g. my directory ~/études), it is quite common for
those files to contain characters outside the 7-bit range. So I end up
with mojibake [1].

- I open one of my chat logs, which are text files produced by Pidgin.
They use UTF-8, but don't have a BOM. They're in French. The same
problem occurs.

- I download a web page. It may be in French, or simply use
typographic characters in UTF-8 (like John Gruber's Markdown Syntax
page [2]). It doesn't use a BOM (which is understandable, since it can
cause some problems [3]). Starting to see a pattern?

So why don't I just set code.page=65001 globally? Well, there are
files which are still encoded using ISO-8859-1 (or Windows-1252). For
example, if I download a page from the Swiss Historical Dictionary, it
is encoded in ISO-8859-1 [4]. If I change code.page, SciTE can't
display this encoding correctly anymore. I also have some files lying
around from the time I was a Windows user. Some of them are in French
(or have French comments when it is code), so I have the same problem
with characters outside the 7-bit range.
For a start what values are permitted for
default.to.utf8 and what do they mean?
0 = keep the current behaviour
1 = if the file has neither a cookie nor a BOM, display it using
UTF-8 unless there are invalid byte sequences, in which case
display it using the default code.page. When creating a new file, use
UTF-8 Cookie.
When I say "display using UTF-8", I really mean that the display
encoding is switched inside SciTE: that is, if the user goes to File
-> Encoding, it is switched to UTF-8 Cookie, and he can switch back to
"Code Page Property" if he wants to. See how I implemented it in Lua
for my own purposes: http://paste2.org/p/879744
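
Putting the proposal together, the open-file decision could look
something like this sketch; DeclaredIntent and DecodesAsUTF8 stand for
the illustrative helpers sketched earlier in the thread, and none of
this is real SciTE code:

// Hypothetical outline of opening a file with default.to.utf8=1.
int ChooseCodePage(const char *data, size_t len, int codePageProperty,
                   bool defaultToUTF8) {
    const int cpUTF8 = 65001;
    // 1. A BOM or cookie states the file's intent and always wins.
    if (DeclaredIntent(data, len) != declNone)
        return cpUTF8;
    // 2. Otherwise, if the option is on and every byte sequence is
    //    valid UTF-8, open the file as UTF-8 Cookie.
    if (defaultToUTF8 && DecodesAsUTF8(data, len))
        return cpUTF8;
    // 3. Fall back to the code.page property (the current behaviour).
    return codePageProperty;
}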

I hope you understand better why I'm somewhat unhappy with the current
behaviour. (I know I'm not the only one... this comes up every once in
a while on the mailing list.)

Regards,
Nicolas

[1] http://en.wikipedia.org/wiki/Mojibake
[2] http://daringfireball.net/projects/markdown/syntax
[3] http://www.w3.org/International/tutorials/tutorial-char-enc/#bomhow
[4] E.g. this page about the industrial revolution:
http://hls-dhs-dss.ch/textes/f/F13825.php
math mhb
2010-06-13 03:12:23 UTC
Hi Sylvain,

You may want to try our modified and extended version of SciTE, called
All-In-One IDE (Sc1IDE) and formerly known as SciTE LaTeX IDE, from
http://code.google.com/p/scitelatexide/.

Our editor integrates many extensions and new features, one of which is
automatically detecting UTF-8 files and switching encoding upon opening
them, whether or not there is something like "coding:utf-8" in the
header. With this feature you need not set UTF-8 as the default
encoding, since there is no need to switch to UTF-8 manually for
Unicode files. I think the optimal choice is to leave the default
encoding as your Windows locale's ANSI encoding, which is very
convenient for opening any file (ANSI or UTF-8); when you create a new
file and want to enter UTF-8 characters, you just switch to UTF-8
manually and save once, which makes the editor work in UTF-8 mode for
the current file only and does not prevent opening other ANSI files.

You can also try SciTE-Ru from http://code.google.com/p/scite-ru/,
since the UTF-8 autodetection code from Sc1IDE has been borrowed into
SciTE-Ru, so this feature is also available there. Sc1IDE has in turn
borrowed many features from SciTE-Ru, which makes our editor quite a
general IDE for programming in any language and for text-editing jobs
like HTML coding.

For non-Chinese users of Sc1IDE, please change or uncomment the
code.page setting in Sc1IDE\config\codepage.properties. By default,
Sc1IDE uses a Simplified Chinese interface for the convenience of
Chinese users.

I hope our editor resolves your problem and brings you a good extra
experience. :-)
--
Best regards,

Hongbin Ma
mozers
2010-07-02 21:19:50 UTC
Post by math mhb
Our editor integrates many extensions and new features, one of
which is automatically detecting UTF-8 files and switching encoding
upon opening them, whether or not there is something like
"coding:utf-8" in the header.
You can also try SciTE-Ru from http://code.google.com/p/scite-ru/,
since the UTF-8 autodetection code from Sc1IDE has been borrowed
into SciTE-Ru, so this feature is also available there.
Post by Neil Hodgson
The 'file' command mentioned in the original post uses statistical
models to determine the probability of various encodings and chooses
the best match: it doesn't always choose correctly so I'd call it a
"guess".
I agree with Neil: this really is a "guess".
But your algorithm guesses wonderfully, at 99%.
It works fine in SciTE-Ru 2.12 (with other patches).
I made a patch based on the original SciTE 2.12 (current CVS version),
which I wanted Neil to try.
Alas, the autodetection does not work there. :(
Probably your patch depends on something specific to SciTE-Ru.
--
mozers
<http://scite.net.ru>
Neil Hodgson
2010-07-03 00:52:11 UTC
Post by mozers
But your algorithm guesses wonderfully, at 99%.
This code has some problems. It is longer than it needs to be since
it recognizes 5 and 6 byte UTF-8 sequences which have been invalid
since RFC 3629 in 2003. It will read beyond the end of the buffer when
a potential start byte is located near the end of the buffer.

Files are read in fixed-length blocks. For UTF-8, this will often
mean that there is a partial character at the end of the block, which
will cause this code to reject the block as UTF-8. To avoid reading
partial characters it should probably terminate several (4?) bytes
before the end of the buffer.
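
A sketch of that trimming, assuming at most 3 bytes of one UTF-8
character can be left dangling at a block boundary (the function name
is illustrative):

#include <cstddef>

// Return how many leading bytes of the block end on a character
// boundary; the remainder should be carried over to the next block.
size_t CompleteSpan(const unsigned char *block, size_t len) {
    size_t i = len;
    size_t cont = 0;
    // Step back over trailing continuation bytes (10xxxxxx), at most 3.
    while (i > 0 && cont < 3 && (block[i - 1] & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0 || (block[i - 1] & 0xC0) != 0xC0)
        return len;             // no trailing lead byte: nothing is cut
    const unsigned char lead = block[i - 1];
    const size_t need =         // continuation bytes the lead announces
        (lead & 0xE0) == 0xC0 ? 1 :
        (lead & 0xF0) == 0xE0 ? 2 : 3;
    if (cont >= need)
        return len;             // the final character is complete
    return i - 1;               // defer the partial sequence
}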

It does not implement Nicolas' preferred semantics, which are to
treat the file as UTF-8 if there are no invalid characters. The patch
requires at least one valid UTF-8 multi-byte character to treat it as
UTF-8. I'm not saying one or the other way is right: it's just that it
has to be clearly defined.

Neil
Nicolas Chachereau
2010-07-25 21:27:33 UTC
First, a question for everybody:

When you open a file and it could be shown correctly both in "UTF-8"
and in "Code Page Property" mode, which one would you want to
use?

This doesn't matter as long as you only type code and English, but
if you want to type "café", "Übermensch" or put a price with the € sign,
how would you want your file to be saved by default?
Using UTF-8 or Code Page Property?

When you create a new file, do you want it to use UTF-8 by default,
or the Code Page Property?

I'd like some feedback on this. Do you want to be able to choose the
behaviour?

So that was the question for everybody. The rest of this email gets
more technical ^^
  This code has some problems. It is longer than it needs to be since
it recognizes 5 and 6 byte UTF-8 sequences which have been invalid
since RFC 3629 in 2003. It will read beyond the end of the buffer
when a potential start byte is located near the end of the buffer.
It doesn't detect some other invalid sequences either: 0x80 to 0xC1
and 0xF5 to 0xFF can't start a sequence, 0xED must be followed by a
byte in the range 0x80 to *0x9F* (not 0xBF), and so on.

I have tried to implement a better version: http://paste2.org/p/927504

I didn't like the use of bit-shift operators in mozers' patch, as they
made the code hard to understand. They probably also led to the
problems with the invalid sequences. So this function uses simple
comparisons.

It only iterates over one byte at a time, in order to avoid reading past
the end of the buffer when it is not null-terminated.
  Files are read in fixed-length blocks. For UTF-8, this will often
mean that there is a partial character at the end of the block, which
will cause this code to reject the block as UTF-8. To avoid reading
partial characters it should probably terminate several (4?) bytes
before the end of the buffer.
[snip]
To deal with this, the function returns the start of the multibyte
sequence that appears to be cut off. I plan to prepend this sequence to
the next block before doing the detection. It works in a little
proof-of-concept program I made.

I wasn't sure how to implement this. Is it okay to return strings like
"" (which is false, i.e. there are invalid bytes), and "\1" (everything
okay)? It feels a bit hacky.

What would be the preferred way to join the two strings?
- use SString
- use gui_string
- use std::string
- use C functions on char*

Or should I just stop worrying about those "broken" sequences?

One last question: where should I put this function? In Utf8_16, like
SciTE-Ru does? Or SciTEIO, like (say) SciTEBase::DiscoverLanguage?

Regards,
Nicolas
Neil Hodgson
2010-07-27 01:47:00 UTC
Post by Nicolas Chachereau
To deal with this, the function returns the start of the multibyte
sequence that appears to be cut off. I plan to prepend this sequence to
the next block before doing the detection. It works in a little
proof-of-concept program I made.
Prepending blocks doesn't look that nice to me.
Post by Nicolas Chachereau
I wasn't sure how to implement this. Is it okay to return strings like
"" (which is false, i.e. there are invalid bytes), and "\1" (everything
okay)? It feels a bit hacky.
Yes. Since the blocks are being added to Scintilla before making
the final encoding decision it may as well wait until they are all in
Scintilla and then run the check over the whole file.

Alternatively, change the detector to a state machine that can be
reentered with the result from the previous block. There appear to me
to be 12 states, although it may be possible to collapse some (s2_1,
s3_2 and s4_3 may be the same (sMultiple_last)):

sStart            Start (and character complete)
sFailure          Failure

Partial states:
s2_1              Consumed 1 byte of a 2-byte sequence
s3_1, s3_2        Consumed 1 or 2 bytes of a 3-byte sequence
s4_1, s4_2, s4_3  Consumed 1, 2 or 3 bytes of a 4-byte sequence

Special partial states:

s3_1_E0, s3_1_ED  Consumed 1 byte of a 3-byte sequence and have a
                  special-case range
s4_1_F0, s4_1_F4  Consumed 1 byte of a 4-byte sequence and have a
                  special-case range

For each byte in the block, switch on the current state, check that
the byte is valid for this state (mostly a simple range check, except
for sStart), and move on to the next state if it is, or set the state
to sFailure if it isn't.
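
Sketching that machine out, byte-at-a-time so it can be reentered for
each block; the state names follow the list above and everything else
is illustrative rather than actual SciTE code:

#include <cstddef>

enum UTF8State {
    sStart, sFailure,
    s2_1,                 // consumed 1 byte of a 2-byte sequence
    s3_1, s3_2,           // 3-byte sequence, generic ranges
    s4_1, s4_2, s4_3,     // 4-byte sequence, generic ranges
    s3_1_E0, s3_1_ED,     // 3-byte special cases after 0xE0 / 0xED
    s4_1_F0, s4_1_F4      // 4-byte special cases after 0xF0 / 0xF4
};

static bool InRange(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Feed one block; pass the returned state into the call for the next
// block. The file is valid UTF-8 iff the state after the last block
// is sStart.
UTF8State ScanUTF8(UTF8State state, const unsigned char *block, size_t len) {
    for (size_t i = 0; i < len && state != sFailure; ++i) {
        const unsigned char b = block[i];
        switch (state) {
        case sStart:
            if (b <= 0x7F) state = sStart;                 // ASCII
            else if (InRange(b, 0xC2, 0xDF)) state = s2_1;
            else if (b == 0xE0) state = s3_1_E0;
            else if (b == 0xED) state = s3_1_ED;
            else if (InRange(b, 0xE1, 0xEF)) state = s3_1;
            else if (b == 0xF0) state = s4_1_F0;
            else if (b == 0xF4) state = s4_1_F4;
            else if (InRange(b, 0xF1, 0xF3)) state = s4_1;
            else state = sFailure;        // 0x80-0xC1 and 0xF5-0xFF
            break;
        case s2_1: case s3_2: case s4_3:  // last continuation byte
            state = InRange(b, 0x80, 0xBF) ? sStart : sFailure;
            break;
        case s3_1:
            state = InRange(b, 0x80, 0xBF) ? s3_2 : sFailure;
            break;
        case s3_1_E0:  // 0xE0 needs 0xA0-0xBF (rejects overlong forms)
            state = InRange(b, 0xA0, 0xBF) ? s3_2 : sFailure;
            break;
        case s3_1_ED:  // 0xED needs 0x80-0x9F (rejects surrogates)
            state = InRange(b, 0x80, 0x9F) ? s3_2 : sFailure;
            break;
        case s4_1:
            state = InRange(b, 0x80, 0xBF) ? s4_2 : sFailure;
            break;
        case s4_1_F0:  // 0xF0 needs 0x90-0xBF (rejects overlong forms)
            state = InRange(b, 0x90, 0xBF) ? s4_2 : sFailure;
            break;
        case s4_1_F4:  // 0xF4 needs 0x80-0x8F (stays <= U+10FFFF)
            state = InRange(b, 0x80, 0x8F) ? s4_2 : sFailure;
            break;
        case s4_2:
            state = InRange(b, 0x80, 0xBF) ? s4_3 : sFailure;
            break;
        case sFailure:
            break;
        }
    }
    return state;
}
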
Post by Nicolas Chachereau
One last question: where should I put this function? In Utf8_16, like
SciTE-Ru does? Or SciTEIO, like (say) SciTEBase::DiscoverLanguage?
Probably Utf8_16.

Neil
math mhb
2010-07-26 03:06:35 UTC
Dear Mozers,

Thanks for your email dated Jul 3. I will try your patch when I can; I
am very busy these days.
--
Best regards,

Hongbin Ma