Unicode source code?

Paolo Redaelli

Unicode source code?

I'm still working, in what little spare time real life leaves me, on
eiffel-gcc-xml.

I'm updating it to work with the latest SmartEiffel snapshots and to
produce low-level wrappers that use the plug-in mechanism.

I noticed that XML_COMPOSITE_NODE "took the place" of XML_NODE and that
attribute names and values are now Unicode strings.

So my question is: can Eiffel source code be encoded in Unicode?

If, as I suspect, that is not the case, what is the preferred encoding?
Since we are handling the XML representation of a C/C++
program/library, I think that something as naive as

print((U" arbitrary Unicode string from GCC-XML output").as_utf8)

could be fine, since it would end up producing plain ASCII for almost
any conceivable gccxml output. I have already checked that UTF-8 in
comments is easily digested by the compiler.
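
The reasoning above rests on a property of UTF-8: text made only of ASCII characters encodes to exactly the same bytes in both encodings. A minimal sketch in Python (not SmartEiffel), where the sample string is a made-up stand-in for GCC-XML output rather than real tool output:

```python
# Hypothetical sample of the kind of text GCC-XML might emit
# (an assumption for illustration, not real tool output).
sample = "std::vector<int> const& values"

utf8_bytes = sample.encode("utf-8")
ascii_bytes = sample.encode("ascii")

# For pure-ASCII text the two encodings are byte-for-byte identical,
# and every byte stays below 0x80.
same_bytes = utf8_bytes == ascii_bytes
all_below_128 = all(b < 0x80 for b in utf8_bytes)

print(same_bytes, all_below_128)
```

Only input that actually contains non-ASCII characters would produce bytes of 0x80 or above.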

Since I'm writing about the encoding of SmartEiffel source files, I
would like to open a little discussion about a question that has been
puzzling me for a while.

I remember that in OOSC2 Meyer said that symbols like Döppleganger
should not be allowed, for several reasons that were agreeable and
reasonable.

Now I'm wondering whether it would instead be useful to allow Unicode
source code, so that we can use mathematical operators and Greek
letters that look "normal" to many people, for example to distinguish
between scalar and dot products in matrices.
It would allow us to write math-heavy code in a more natural way,
with features like:

infix "∋", has (an_element: like item): BOOLEAN  is
infix "∌", doesnt_have (an_element: like item): BOOLEAN is
infix "≤", infix "<=" is
infix "≥", infix ">="
infix "⨯", cross_product
and so on.

Also, symbols like Greek letters (α,β,γ,δ,η,μ,φ) are widespread in
many scientific expressions and, if allowed, would make code more
readable than alpha,beta,gamma,delta, because expressions would be much
shorter without sacrificing understandability and readability for the
intended readers. I think that you will agree with me that SmartEiffel
source code is meant to be read by people and only after that compiled
by a computer....
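
As a point of comparison from outside Eiffel, Python 3 accepts Greek letters in identifiers, which gives a feel for how such code reads. This is only an illustration of the readability argument, not a statement about what SmartEiffel accepts; the formula and the variable names are invented for the example:

```python
import math

# Greek letters are valid identifier characters in Python 3.
α = math.pi / 6        # angle in radians
β = 2.0                # scale factor
φ = β * math.sin(α)    # reads close to the textbook notation

# The same computation with spelled-out ASCII names.
alpha = math.pi / 6
beta = 2.0
phi = beta * math.sin(alpha)

# Both spellings compute the same value.
result_matches = φ == phi
print(result_matches)
```

Math operators such as ∋ or ≤ are a separate matter: Python does not allow them as operators or identifiers, so the infix examples above have no direct equivalent there.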

Thanks in advance for your attention,
        Paolo

PS: See the Unicode range from U+2200 to U+22FF for math symbols.
I found http://live.gnome.org/Gucharmap useful to browse Unicode
symbols.
Cyril ADRIAN

Re: Unicode source code?

Hi Paolo,

On Tue, Nov 4, 2008 at 10:30 AM, Paolo Redaelli <[hidden email]> wrote:
> Also symbols like Greek letters (α,β,γ,δ,η,μ,φ) are widespread in
> many scientific expressions and if allowed would make code more
> readable than alpha,beta,gamma,delta, because expressions will be much
> shorter without sacrificing understandability and readability of the
> intended readers. I think that you will agree with me that SmartEiffel
> source code is meant to be read by people and only after that compiled
> by a computer....

Java has allowed it since the beginning. But I have never seen anything written in other than standard 7-bit ASCII alphanumerics, except in comments. The reason, I guess, is twofold (but then, maybe I'm biased):
1- our keyboards do not make it easy to input Greek or, more generally, non-ASCII characters, so entering anything that is not ASCII is usually cumbersome.
2- anyway, most people tend to write code in English; I do, because doing otherwise disrupts the thought process with complex inter-language translations.

In countries with alphabets other than Latin, inputting something in their own language may be more straightforward... But is Unicode widespread there? Or are different "code pages" used?
At the office I currently work with Chinese people overseas, and even though I have never seen their keyboards, they seem to be able to enter ASCII characters quite easily. We share screens and work on code together, and the only latency is the network, not the typing speed.

Well, I don't know if it is worth it... But I have no strong opinion either way.

Just my 2 cents...

Best regards,
--
Cyril ADRIAN - http://www.cadrian.net/~cyril
Hendrik Boom-2

Re: Unicode source code?

On Tue, Nov 04, 2008 at 09:44:53PM +0100, Cyril ADRIAN wrote:

> Hi Paolo,
>
> On Tue, Nov 4, 2008 at 10:30 AM, Paolo Redaelli <[hidden email]>wrote:
>
> > Also symbols like Greek letters (α,β,γ,δ,η,μ,φ) are widespread in
> > many scientific expressions and if allowed would make code more
> > readable than alpha,beta,gamma,delta, because expressions will be much
> > shorter without sacrificing understandability and readability of the
> > intended readers. I think that you will agree with me that SmartEiffel
> > source code is meant to be read by people and only after that compiled
> > by a computer....

Yes, there are characters that would be useful.  But there are also
ambiguities.  For one thing, Unicode has "combining" code points, so
that several Unicode code points combine to make one character.  Some of
the characters so built also have code points of their own.  Should it
matter whether you type e-grave as an e and a grave or as one symbol?
Should the two versions be recognised as separate identifiers or as the
same identifier?  It would be confusing either way.
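
The e-grave ambiguity is easy to reproduce. A small Python sketch (Python is used only for illustration; nothing here is SmartEiffel API) that builds both spellings and shows that they render alike yet compare as different strings:

```python
import unicodedata

precomposed = "\u00e8"   # LATIN SMALL LETTER E WITH GRAVE, one code point
combining = "e\u0300"    # 'e' followed by COMBINING GRAVE ACCENT, two code points

# Names of the code points that make up the two-code-point spelling.
names = [unicodedata.name(c) for c in combining]

# Both spellings display as the same character, but the underlying
# code point sequences differ, so a naive comparison says "not equal".
looks_same_but_differs = precomposed != combining
lengths = (len(precomposed), len(combining))

print(names, looks_same_but_differs, lengths)
```

A compiler that compared identifiers code point by code point would therefore treat the two spellings as different names.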

Also, most languages in the world don't have a distinction between upper
and lower case.  All the rules SmartEiffel seems to want to enforce
about which case of letters to use for which things would have to be
abandoned.
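
To make the case point concrete, here is a rough Python check over a handful of sample characters. The choice of characters is arbitrary and the helper `has_case` is a hypothetical name introduced for this sketch:

```python
import unicodedata

# One arbitrary sample character per script.
samples = {
    "Latin": "e",
    "Greek": "δ",
    "Han": "漢",
    "Hiragana": "あ",
    "Arabic": "ع",
    "Hebrew": "ש",
}

def has_case(ch):
    """True when the character changes under upper- or lower-casing."""
    return ch.upper() != ch or ch.lower() != ch

cased = {script: has_case(ch) for script, ch in samples.items()}

# General categories: "Ll" is a lowercase letter, "Lo" is a letter
# with no case at all.
categories = {script: unicodedata.category(ch) for script, ch in samples.items()}

print(cased)
print(categories)
```

Of these samples only the Latin and Greek letters have case; the others fall in category "Lo", so a rule such as "class names are upper case" has nothing to act on.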

Finally, if all you want is to be able to use Unicode in character
strings for printing, UTF-8 works fine, with no (or almost no) changes
to the SmartEiffel implementation.  The UTF-8 encoding was designed to
make this easy.  Only if you want to play games with individual
characters would you have to put in a little effort to tease them out.

Not to mention the Unicode tools in the SmartEiffel libraries, of
course.
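
A short Python sketch of what "teasing out" individual characters involves: a byte-oriented program can pass UTF-8 text through unchanged, but one character may span several bytes. The sample string is an arbitrary choice for the example:

```python
text = "α ≤ β"   # arbitrary sample containing non-ASCII characters

# What a byte-oriented program sees.
raw = text.encode("utf-8")

# Passing the bytes through untouched preserves the text exactly.
round_trip = raw.decode("utf-8")

# Counting bytes is not the same as counting characters.
byte_count = len(raw)
char_count = len(text)

# Number of UTF-8 bytes used by each character.
per_char = [(ch, len(ch.encode("utf-8"))) for ch in text]

print(byte_count, char_count, per_char)
```

Printing or copying such a string needs no special handling; indexing, slicing or counting characters is where the multi-byte sequences have to be decoded first.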

-- hendrik


>
>
> Java has allowed it since the beginning. But I never saw anything written
> otherwise than with standard ASCII7 alphanum, except in comments. The
> reason, I guess, is twofold: (but then, maybe I'm biased)
> 1- our keyboards do not allow to easily input greek or in general non-ascii
> characters, so entering something that's not ascii is usually cumbersome.
> 2- anyway most people tend to write code in English; I do, because doing
> otherwise disrupts the thought process with complex inter-language
> translations.
>
> In countries with other alphabets than latin maybe inputting something in
> their own language may be more straightforward... But is Unicode widespread?
> Or other different "code pages"?
> At the office I currently work with Chinese people overseas, and even though
> I never saw their keyboards they seem to be able to enter ASCII characters
> quite easily. We share screens and work on code together, and the only
> latency is the network, not the typing speed.

I don't know about Chinese specifically, but I do know that for Japanese
there are so-called "input methods" that allow one to type a romaji
version of the text (which follows specific conventions for Roman
alphabet versions of thousands of Japanese characters) and it makes
reasonable guesses as to the proper characters.  Now Japanese has
homonyms (what language doesn't) and so there are frequently
alternatives to be chosen from.  The input method makes a guess, and the
typist can press the space bar to have it switch to alternatives.

This actually seems to work.

>
> Well, I don't know if it is worth it... But I have no strong opinion either
> way.
>
> Just my 2 cents...
>
> Best regards,
> --
> Cyril ADRIAN - http://www.cadrian.net/~cyril
Hendrik Boom-2

Re: Unicode source code?

On Fri, Jun 19, 2009 at 03:50:48PM +0100, Colin Paul Adams wrote:

> >>>>> "Hendrik" == hendrik  <[hidden email]> writes:
>
>     Hendrik> Yes, there are characters that would be useful.  But
>     Hendrik> there are also ambiguities.  For one thing, Unicode has
>     Hendrik> "combining" code-points, so that several Unicode
>     Hendrik> code-points combine to make one character.  Some of the
>     Hendrik> characters so built also have code points of their own.
>     Hendrik> Should it matter whether you type e-grave as an e and a
>     Hendrik> grave or as one symbol?  Should the two versions be
>     Hendrik> recognised as separate identifiers or as the same
>     Hendrik> identifier?  It would be confusing either way.
>
> You normalize to the appropriate form if you want to erase such
> distinctions.

Yes, but don't the normalization rules change slightly from version to
version of Unicode?

>
>     Hendrik> Also, most languages in the world don't have a
>     Hendrik> distinction between upper and lower case.  All the rules
>     Hendrik> Smarteiffel seems to want to enforce about which case of
>     Hendrik> letters to use for which things would have to be
>     Hendrik> abandoned.
>
> Yes - it's a ghetto approach to language design.

So even if SmartEiffel wants to maintain case distinction and predefine
words (such as THEN) to have particular cases, it will have to give up
on enforcing case distinction for user-defined words if we adopt Unicode
identifiers.

-- hendrik
Colin Paul Adams

Re: Unicode source code?

>>>>> "Hendrik" == hendrik  <[hidden email]> writes:

    Hendrik> On Fri, Jun 19, 2009 at 03:50:48PM +0100, Colin Paul Adams wrote:
    >> >>>>> "Hendrik" == hendrik <[hidden email]> writes:
    >>
    Hendrik> Yes, there are characters that would be useful.  But
    Hendrik> there are also ambiguities.  For one thing, Unicode has
    Hendrik> "combining" code-points, so that several Unicode
    Hendrik> code-points combine to make one character.  Some of the
    Hendrik> characters so built also have code points of their own.
    Hendrik> Should it matter whether you type e-grave as an e and a
    Hendrik> grave or as one symbol?  Should the two versions be
    Hendrik> recognised as separate identifiers or as the same
    Hendrik> identifier?  It would be confusing either way.
    >>
    >> You normalize to the appropriate form if you want to erase such
    >> distinctions.

    Hendrik> Yes, but don't the normalization rules change slightly
    Hendrik> from version to version of Unicode?

No - they are stable.
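
For readers who have not used normalization, a minimal Python sketch of the idea: both spellings of an accented word are brought to the same normalization form before they are compared. The function name `same_identifier` is hypothetical and only illustrates how an identifier comparison could be defined:

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

def same_identifier(a, b, form="NFC"):
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

raw_equal = precomposed == decomposed                       # compares code points as typed
nfc_equal = same_identifier(precomposed, decomposed)        # composed form
nfd_equal = same_identifier(precomposed, decomposed, "NFD")  # decomposed form

print(raw_equal, nfc_equal, nfd_equal)
```

Which form is used matters less than applying the same one to both sides of the comparison.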
    >>
    Hendrik> Also, most languages in the world don't have a
    Hendrik> distinction between upper and lower case.  All the rules
    Hendrik> Smarteiffel seems to want to enforce about which case of
    Hendrik> letters to use for which things would have to be
    Hendrik> abandoned.
    >>
    >> Yes - it's a ghetto approach to language design.

    Hendrik> So even if SmartEiffel wants to maintain case distinction
    Hendrik> and predefine words (such as THEN) to have particular
    Hendrik> cases, it will have to give up on enforcing case
    Hendrik> distinction for user-defined words if we adopt Unicode
    Hendrik> identifiers.

That would be the case.
--
Colin Adams
Preston Lancashire