- Jun 04, 2015
Jungshik Shin authored
1. Add a one-way (encoding-only/fromUnicode) mapping for U+2212 to Shift_JIS, EUC-JP and ISO-2022-JP. The last (ISO-2022-JP) just uses the Shift_JIS mapping.
   See https://www.w3.org/Bugs/Public/show_bug.cgi?id=28661
2. Make the GBK alias list compliant with the encoding spec.
3. Add "\xA3\xA0 => U+3000" to GBK (windows-936) and gb18030. This makes it possible to remove the corresponding override in Blink.
4. Make the following changes to GBK (windows-936). See [1]
   - Add U+01F9 <=> \xA8\xBF
   - Drop U+E7C8 <=> \xA8\xBF
5. The following change is put on hold (NOT included in this CL) until the resolution of [1]
   - Add U+1E3F <=> \xA8\xBC
   - Drop U+E7C7 <=> \xA8\xBC

The corresponding Blink CL is https://codereview.chromium.org/1167523003/

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c3

BUG=425417,493824
TEST=Once ICU is rolled to this CL, Blink layout test fast/encoding/*.
R=jsbell@chromium.org

Review URL: https://codereview.chromium.org/1162723008
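As an illustration of what a one-way (encode-only) mapping means in item 1, here is a minimal sketch. The table names `DECODE` and `ENCODE` and both helper functions are hypothetical, and the one-row table stands in for the real Shift_JIS data. It assumes the Encoding Standard rule that U+2212 is encoded as if it were U+FF0D, whose Shift_JIS bytes are 0x81 0x7C.

```python
# Hypothetical one-row tables; a real converter table has thousands of rows.
# Round-trip entry: bytes 0x81 0x7C <-> U+FF0D FULLWIDTH HYPHEN-MINUS.
DECODE = {b"\x81\x7c": "\uff0d"}

# Start the encoder table as the inverse of the decoder table...
ENCODE = {char: raw for raw, char in DECODE.items()}
# ...then add the encode-only entry: U+2212 MINUS SIGN maps to the same
# bytes, but nothing ever decodes back to U+2212.
ENCODE["\u2212"] = b"\x81\x7c"


def encode_char(char: str) -> bytes:
    """Encode one character (fromUnicode direction)."""
    return ENCODE[char]


def decode_pair(raw: bytes) -> str:
    """Decode one two-byte sequence (toUnicode direction)."""
    return DECODE[raw]
```

Encoding U+2212 and decoding the result yields U+FF0D rather than U+2212, which is exactly why the mapping is described as one-way.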
-
- Mar 19, 2015
Jungshik Shin (jungshik at google) authored
1. Update ucmlocal.mk and convertrs.txt to refer to euc-kr-html.ucm instead of windows-949.ucm.
2. Tighten the valid code range for the following converters: EUC-KR, Shift_JIS, Big5.
   This adds an ASCII-range byte back to the stream, per the encoding spec, when it is either illegal as a 'trail byte' or there is no assigned code point for the "lead + trail" sequence. For instance, with this change, '0xF3 0x41' in EUC-KR is converted to 'U+FFFD U+0041' instead of 'U+FFFD'.
   This change requires adding 2 to 8 new states to the conversion table of each converter mentioned above, leading to a 6.5kB net increase in the final data size.
3. Tighten the trail byte range for 2-byte sequences starting with 0x8E from [A1,E2] to [A1,DF] in EUC-JP and update the corresponding generating script.
4. Change the substitution characters for EUC-JP and Shift_JIS to match other converters, i.e. make them produce U+FFFD when encountering invalid input. Before this change, they emitted U+001A.
5. Enable the 'U_CHARSET_IS_UTF8' configuration flag. Chromium/Blink does not rely on ICU for code conversion between the 'system native encoding' (if it's one of the legacy encodings) and Unicode. With this configuration, we can cut down the code size a bit.
6. Update icudtl.dat (all platforms), the assembly files (mac, linux) and the icudata dll (windows).

See https://codereview.chromium.org/1026453002 for a new Blink test added (fast/encoding/char-decoding-invalid-trail.html).

BUG=450312,430823
TEST=Blink: fast/encoding/char-decoding-{truncated,invalid-trail}.html
TEST=base_unittests --gtest_filter=*Conv*, browser_tests --gtest_filter=*ncoding*
R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/984233002
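The behaviour described in item 2 can be sketched as follows. This is not ICU's state-table implementation: `decode_two_byte` and the one-entry `TABLE` are hypothetical, and the lead-byte range 0x81 to 0xFE is assumed from the Encoding Standard's EUC-KR decoder. The point it shows is that an ASCII byte following a lead byte is put back into the stream when the pair has no mapping.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry table: (lead, trail) -> character.
# 0xB0 0xA1 is U+AC00 in EUC-KR; every other pair is treated as unassigned.
TABLE = {(0xB0, 0xA1): "\uac00"}


def decode_two_byte(data: bytes) -> str:
    """Decode a simplified two-byte legacy encoding."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # plain ASCII
            out.append(chr(lead))
            i += 1
            continue
        if not 0x81 <= lead <= 0xFE or i + 1 >= len(data):
            out.append(REPLACEMENT)          # bad lead or truncated input
            i += 1
            continue
        trail = data[i + 1]
        char = TABLE.get((lead, trail))
        if char is not None:
            out.append(char)
            i += 2
            continue
        out.append(REPLACEMENT)
        # An ASCII trail byte is not consumed: it is decoded again on the
        # next iteration. A non-ASCII trail byte is swallowed with the lead.
        i += 1 if trail < 0x80 else 2
    return "".join(out)
```

With this sketch, `decode_two_byte(b"\xf3\x41")` yields U+FFFD followed by "A", matching the example in the commit message. The older behaviour corresponds to always advancing by two.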
-
- Jan 21, 2015
Jungshik Shin (jungshik at google) authored
A. Converter update per the HTML encoding spec, along with changes in the encoding name alias table.
B. Remove all the code for converters Blink and Chromium do not need (SCSU, Lotus, ISO-2022-xx other than JP, BOCU, UTF-7, etc).

This is reapplying the following CLs (that we used for ICU 52.1) to ICU 54.1:
  https://codereview.chromium.org/598383002
  https://codereview.chromium.org/654153002

We have two upstream bugs filed for A and B above:
  http://www.icu-project.org/trac/ticket/11296
  http://www.icu-project.org/trac/ticket/10303

In addition to A and B, we unified Big5 and Big5-HKSCS per the encoding spec (bug 277868). That also includes properly supporting the four 2-character sequences (see http://crbug.com/277868#c3). big5_gen.sh deviates from the current spec to work around a bug in the spec (see https://www.w3.org/Bugs/Public/show_bug.cgi?id=27878).

Moreover, ucmlocal.mk is added to list only the encodings we want to support.

Also, tighten the state table for windows-949-2000.ucm, which we use for EUC-KR for now, and drop the 'base' map for windows-{936,949}-2000.ucm.

Finally, add euc-kr-html.ucm along with scripts/euckr_gen.sh, but it is not yet used pending the resolution of bug 450312.

Data size checkpoint: 20,566,864 bytes (the original ICU 54 = 25,343,024 bytes)

BUG=277868, 428145, 450312
TEST=net_unittests --gtest_filter="*ilenameUtil*"
TEST=base_unittests --gtest_filter="*Conv*"
TEST=browser_tests --gtest_filter="*ncoding*"
TEST=Blink: fast/encoding/*
R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/839713003
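For readers unfamiliar with the "four 2-character sequences" mentioned above, the sketch below assumes they are the four Big5 pointers that the Encoding Standard's Big5 decoder maps to a base letter plus a combining mark. The names `BIG5_PAIRS`, `big5_pointer` and `decode_special` are hypothetical, and this is not how big5_gen.sh or ICU represent them.

```python
# Pointers that decode to two code points each (assumed from the
# Encoding Standard's Big5 decoder).
BIG5_PAIRS = {
    1133: "\u00ca\u0304",  # E with circumflex + combining macron
    1135: "\u00ca\u030c",  # E with circumflex + combining caron
    1164: "\u00ea\u0304",  # e with circumflex + combining macron
    1166: "\u00ea\u030c",  # e with circumflex + combining caron
}


def big5_pointer(lead: int, trail: int) -> int:
    """Compute the table pointer for a lead/trail byte pair."""
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def decode_special(lead: int, trail: int):
    """Return the two-character string for a special pair, else None."""
    return BIG5_PAIRS.get(big5_pointer(lead, trail))
```

Under these assumptions the four byte pairs are 0x88 0x62, 0x88 0x64, 0x88 0xA3 and 0x88 0xA5. A converter whose table allows only one code point per byte sequence needs special handling for them.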
-
- Apr 29, 2014
jshin@chromium.org authored
- Add missing half-width kana entries (omitted by mistake).
- Drop the 'extra' decoding-only mapping.
  See https://www.w3.org/Bugs/Public/show_bug.cgi?id=25266
- Regenerate ICU data files (*dat and assembly source files) for Linux, Mac, Windows and Android. (They are not shown at codereview.chromium.org because they're too large.)

BUG=132145,78847
TEST=When ICU is rolled in, base_unittests --gtest_filter=*ICU* and layout tests
R=jsbell@chromium.org

Review URL: https://codereview.chromium.org/251203003

git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@266919 4ff67af0-8c30-449e-8e8b-ad334ec8d88c
-
- Apr 07, 2014
jshin@chromium.org authored
1. Remove all the obsolete patches. There are lots of them because most of the local patches to ICU 4.6.1 have either been accepted upstream or become obsolete. The largest local patch removed is our set of patches for the CJ word breaker, because they were upstreamed. Android didn't apply the CJK word breaker patch to ICU 4.6 to reduce the data size. In a follow-up CL, we'll have an Android-specific change for this issue.
   Besides, we don't include patches for files we add locally because patches for new files are redundant. Instead, they're mentioned in README.chromium.
2. We don't need platform-specific headers any more (pmac, plinux, pwin, etc). They're combined into a single file, and all platforms we care about are well supported except for one issue on Android/QNX. putil.patch takes care of it.
3. Breakiterator patches for a few remaining issues. We also use a much smaller Khmer dictionary (upstream fix pending).
4. Converter
   - Added two WHATWG-encoding-standard-compliant mapping tables (derived directly from the spec with a script) for EUC-JP and CP866.
   - Disabled various non-HTML5 encodings such as SCSU, BOCU, UTF-7 and CESU-8, saving ~30kB in code size. Even though we link statically, they're still pulled in as a part of uconv.
   - Disabled ISO-2022-JP-[1-4] in ucnv2022.c.
   - Removed a number of encoding alias entries in the alias table, leading to a ~40kB data size reduction.
5. Locale data: not yet updated. We need to trim it substantially.
6. Unihan collation removal is now done with a script (scripts/remove_unihan.sh).
7. Updated timezone data to the latest (2014b) as of today.
8. Customized transliterator for Greek uppercasing.
9. Updated data-build-related patches. The Windows data build patch has yet to be updated.
10. The updated ICU data file/assembly source files are not included in this CL. They'll be updated in a separate CL. With all the size reduction changes applied, the data size went down from > 23MB to 12.4MB. However, it's still 2.5MB larger than the ICU 4.6.1 data. The locale data trimming will bring it down further.
11. Update README.chromium accordingly. The only exceptions are item #5 and the Android entry in item #3 (breakiterator; see #1 above).

BUG=259715,76328
TEST=Following the procedure outlined in README.chromium, one can build the icu data file.
R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/224943002

git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@262192 4ff67af0-8c30-449e-8e8b-ad334ec8d88c
-