From: James Kass
Date: 3 February 2008 00:21
Subject: [indic] Re: New Malayalam codepoints
To: Indic Discussion List

Cibu C J wrote,

> Of course, Unicode and IETF have specs to specify locale or script based
> exceptions. The fact that chillus have to make use of all those is a pretty
> good indication that it is an excellent idea to encode them and remove all
> those requirements for exceptions.

Any time settings are referred to as "default", it means that they are subject to change. Often the ability to change settings extends all the way down the line to the end-user. Making a setting "default" is a good indication that the engineer(s) expect people to want the ability to change that setting.

Chillu forms should not require custom settings in most instances. As sequences, they would not cause exceptional behaviour in any application which did not strip ZWJs from the data. In those cases where applications strip ZWJs from the data, users appear to consider the process either beneficial or benign. In cases where ZWJ-stripping is regarded as malign, it may be an excellent idea to change the default settings so that the application becomes workable.

> Those exceptions and character properties like 'default-ignorable' are there
> for a reason. It is there to choose between a coarse or fine tuned
> implementation based on the resources the implementor has. It is a great
> thing for the language that the script can remain intact in a coarse
> implementation as well. For Malayalam, that will be more or less true after
> chillu encoding. So it will be better supported in resource constrained
> platforms or implementations.

The source data itself should remain intact regardless of an application's tuning. In this way the author's intent is preserved and the script remains intact as a matter of course.

I'm not understanding how atomic chillu encoding makes Malayalam better supported in resource constrained environments. Would you elaborate?
To me, a resource constrained environment suggests a system with limited memory, where duplicate encodings would add to the strain. (I do understand that character properties exist for good reason.)

Let's suppose you are involved with a hypothetical search engine company called "DataQuest" and, for one reason or another, decide to research the frequency of web pages which offer text in the fictional Klingon language/script using the ConScript Unicode Registry's Private Use Area Unicode encoding. You might proceed by entering some common words into the search box of your engine. If your engine restricts P.U.A. characters, or maps them all to zero, your search would return nothing. However, since you pulled those common words from a web page in the first place, you know that such pages exist.

Would this be a good indication that it is time to change the settings? Or would it be better to encode Klingon in TUS? (smile)

If that hypothetical situation is too far-fetched, suppose you are working for a real company and Malayalee users were complaining about search results being too fuzzy because your collation interface was stripping certain characters for comparison purposes.

Best regards,

James Kass
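The ZWJ-stripping scenario discussed in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any real search engine's code: the helper name `strip_joiners` is hypothetical, and the chillu is represented as the sequence NA + chandrakkala + ZWJ that the thread refers to.

```python
# Illustrative sketch only: shows why stripping joiners makes a chillu
# sequence indistinguishable from a vowelless consonant.
ZWJ = "\u200D"      # ZERO WIDTH JOINER
ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER
NA = "\u0D28"       # MALAYALAM LETTER NA
VIRAMA = "\u0D4D"   # MALAYALAM SIGN VIRAMA (chandrakkala)


def strip_joiners(text: str) -> str:
    """Hypothetical coarse normalisation: drop ZWJ/ZWNJ before comparing."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


chillu_n_sequence = NA + VIRAMA + ZWJ   # chillu N encoded as a sequence
vowelless_na = NA + VIRAMA              # NA with visible chandrakkala

# The stored data distinguishes the two spellings ...
print(chillu_n_sequence == vowelless_na)                                # False
# ... but a comparison that strips joiners treats them as the same.
print(strip_joiners(chillu_n_sequence) == strip_joiners(vowelless_na))  # True
```

Whether that conflation is "benign" or "malign" depends on the application; the point of the sketch is only that the distinction lives in the source data and is lost in the stripped comparison form, not in the stored text.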
From: Rajeev J Sebastian
Date: 3 February 2008 07:15
Subject: [indic] Re: New Malayalam codepoints
To: Cibu C J
Cc: James Kass, Indic Discussion List

Cibu,

Good that you finally understand what "tailoring" means.

Now for the problems of your "theory":

In the Unicode ecosystem, we can consider the following levels of systems:

1) Higher-level applications (including advanced rendering, spell-check, grammar-check, etc.)
2) Low-level applications (including rendering, input, sorting, etc.)
3) The Unicode encoding itself

Your theory seeks to disambiguate chillus from vowelless consonants (for some odd reason, even though even you cannot state without contradicting yourself that they are the same), i.e., you want atomic chillus at level 3 in order to support one application at level 2, namely rendering.

If the atomic chillus come into force, then all applications other than rendering require a "tailoring" which equates

    <atomic chillu> == <consonant> + chandrakkala

For example:

    in sorting, atomic-chillu-NA == NA + chandrakkala
    in IDN,     atomic-chillu-NA == NA + chandrakkala

Implementation of the input method "Inscript" will also require the "tailoring"

    atomic-chillu-NA <= kNA + kChandrakkala + kNUK (aka ZWJ)

where kX means "key for character X on keyboard", and <= means "produced by key sequence".

Implementation of the input method "Typewriter" will have to disambiguate when the user means an atomic chillu from when the user means a vowelless consonant (similarly for other input methods).

People like Santhosh Thottingal will probably agree that atomic-chillu vs vowelless-consonant is a meaningless difference as regards spell-check, because the system can never really rely on the user to actually type an atomic chillu when he means a vowelless consonant and vice versa.

In other words, your theory seeks atomic chillus, but for every application other than rendering, we need to add tailorings for Malayalam to state that this disastrous atomic-chillu == vowelless-consonant.
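The kind of tailoring described above, equating an atomic chillu with consonant + chandrakkala for comparison purposes, could look roughly like the following Python sketch. It is not taken from any implementation mentioned in this thread. The function name `fold_chillus` is hypothetical, and the atomic chillu code points are an assumption relative to this message: they are the values later published in Unicode 5.1, whereas at the time of writing they were still a proposal.

```python
# Illustrative sketch of a comparison "tailoring" that folds both chillu
# spellings onto consonant + chandrakkala. Not a real collation library.
VIRAMA = "\u0D4D"   # MALAYALAM SIGN VIRAMA (chandrakkala)
ZWJ = "\u200D"      # ZERO WIDTH JOINER

# Assumption: atomic chillu code points as later published in Unicode 5.1,
# each mapped to the base consonant of its sequence spelling.
CHILLU_BASE = {
    "\u0D7A": "\u0D23",  # CHILLU NN -> NNA
    "\u0D7B": "\u0D28",  # CHILLU N  -> NA
    "\u0D7C": "\u0D30",  # CHILLU RR -> RA
    "\u0D7D": "\u0D32",  # CHILLU L  -> LA
    "\u0D7E": "\u0D33",  # CHILLU LL -> LLA
    "\u0D7F": "\u0D15",  # CHILLU K  -> KA
}


def fold_chillus(text: str) -> str:
    """Hypothetical comparison key: atomic chillu == consonant + chandrakkala."""
    folded = []
    for ch in text:
        base = CHILLU_BASE.get(ch)
        # Replace an atomic chillu by its consonant + chandrakkala spelling.
        folded.append(base + VIRAMA if base is not None else ch)
    # Also fold the sequence spelling (consonant + chandrakkala + ZWJ).
    return "".join(folded).replace(VIRAMA + ZWJ, VIRAMA)


atomic_chillu_n = "\u0D7B"
sequence_chillu_n = "\u0D28" + VIRAMA + ZWJ
vowelless_na = "\u0D28" + VIRAMA

# All three spellings produce the same comparison key.
print(fold_chillus(atomic_chillu_n) == fold_chillus(sequence_chillu_n))  # True
print(fold_chillus(atomic_chillu_n) == fold_chillus(vowelless_na))       # True
```

A key of this sort would have to be applied wherever equivalence matters (sorting, IDN comparison, spell-check lookups), which is the per-application cost the message is pointing at.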
The case of rendering is a little more complicated. But it was resolved in a practical manner by the workshop held at Kerala University. Please see the docs for more info.

But we can say one thing regarding rendering: it too requires two mappings for the gChilluNa glyph, viz., the atomic-chillu-na CMAP entry and a GSUB rule (gNa + gChandrakkala, or gNa + gChandrakkala + gZWJ, depending on the shaping engine involved).

So then, what's the real use of your atomic chillus, other than further problems, and tailoring all other applications to require a map from atomic chillu back to consonant + chandrakkala?

To Everyone Else: The UTC meeting is on the 4th, i.e., tomorrow; have you submitted your opposition to atomic chillus?

Regards

Rajeev J Sebastian