Commits · 98218d1e92b919412ac4b27e5af8e37138d7e347 · fuchsia-mirror / fuchsia.googlesource.com-third_party-icu

Jun 04, 2015

Add U+2212 to Japanese converters and update GBK/gb18030 aliases and tables. · 9939a5d5

Jungshik Shin authored 9 years ago

1. Add a one-way (encoding-only/fromUnicode) mapping for U+2212 to
   Shift_JIS, EUC-JP and ISO-2022-JP. The last just uses Shift_JIS.

   See https://www.w3.org/Bugs/Public/show_bug.cgi?id=28661

2. Make the GBK alias list compliant with the encoding spec.

3. Add "\xA3\xA0 => U+3000" to GBK (windows-936) and gb18030. This makes it possible to remove the corresponding override in Blink.

4. Make the following changes to GBK (windows-936). See [1].
  - Add U+01F9 <=> \xA8\xBF
  - Drop U+E7C8 <=> \xA8\xBF

5. The following change is on hold (NOT included in this CL) until [1] is resolved:

  - Add U+1E3F <=> \xA8\xBC
  - Drop U+E7C7 <=> \xA8\xBC
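
A minimal Python sketch of the one-way mapping described in item 1. It is not
the ICU .ucm change itself. The byte sequence 0x81 0x7C and its round-trip
partner U+FF0D are assumptions based on the Shift_JIS encoder rule in the
encoding spec, and the table and function names are hypothetical.

```python
# Hypothetical miniature tables; the real data lives in ICU's .ucm files.
# Assumption: 0x81 0x7C round-trips with U+FF0D (FULLWIDTH HYPHEN-MINUS).
ROUNDTRIP = {"\uff0d": b"\x81\x7c"}
# Encoding-only (fromUnicode) entries: used when encoding, never when decoding.
ENCODE_ONLY = {"\u2212": b"\x81\x7c"}  # U+2212 MINUS SIGN

# The decode table is built from round-trip entries only.
DECODE = {raw: ch for ch, raw in ROUNDTRIP.items()}


def encode_char(ch: str) -> bytes:
    """Encode one character, trying round-trip entries before one-way ones."""
    if ch in ROUNDTRIP:
        return ROUNDTRIP[ch]
    if ch in ENCODE_ONLY:
        return ENCODE_ONLY[ch]
    raise ValueError(f"unmappable character U+{ord(ch):04X}")


def decode_pair(raw: bytes) -> str:
    """Decode one two-byte sequence using round-trip entries only."""
    return DECODE[raw]
```

Under these assumptions, encoding U+2212 and decoding the result yields U+FF0D
rather than U+2212, which is what makes the mapping encoding-only.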

The corresponding Blink CL is https://codereview.chromium.org/1167523003/

[1] https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c3

BUG=425417,493824
TEST=Once ICU is rolled to this CL, Blink layout test fast/encoding/*.
R=jsbell@chromium.org

Review URL: https://codereview.chromium.org/1162723008


Mar 19, 2015

Update CJK converters and their generating scripts · dafa8443

Jungshik Shin (jungshik at google) authored 10 years ago

1. Update ucmlocal.mk and convertrs.txt to refer to euc-kr-html.ucm
instead of windows-949.ucm

2. Tighten up the valid code range for the following converters:

   EUC-KR, Shift_JIS, Big5

This puts an ASCII-range byte back into the stream, per the encoding
spec, when it is either illegal as a trail byte or the "lead + trail"
sequence has no assigned code point.
For instance, with this change, '0xF3 0x41' in EUC-KR is converted to
'U+FFFD U+0041' instead of 'U+FFFD'.

This change requires adding 2 to 8 new states to the conversion
table of each converter mentioned above, leading to a 6.5 kB net increase
in the final data size.

3. Tighten the trail byte range for 2-byte sequences starting with 0x8E
from [A1,E2] to [A1,DF] in EUC-JP and update the corresponding generating
script.

4. Change the substitution characters for EUC-JP and Shift_JIS to
match other converters, i.e. make them produce U+FFFD when encountering
invalid input. Before this change, they emitted U+001A.

5. Enable 'U_CHARSET_IS_UTF8' configuration flag.
Chromium/Blink does not rely on ICU for the code conversion between
the 'system native encoding' (if it's one of legacy encodings)
and Unicode. With this configuration, we can cut down the code size
a bit.

6. Update the icudtl.dat (all platforms) and assembly files (mac,linux)
   and the icudata dll (windows)
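
The trail-byte handling in item 2 can be sketched in Python as below. This is
a toy decoder, not the ICU state table: TOY_INDEX holds a single illustrative
entry (0xB0 0xA1 => U+AC00, an assumption about the EUC-KR table), and all
names are hypothetical.

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry index keyed by (lead, trail); the real table is large.
TOY_INDEX = {(0xB0, 0xA1): "\uac00"}


def decode_euc_kr_sketch(data: bytes) -> str:
    """Decode bytes, leaving an ASCII byte in the stream after a bad pair."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        i += 1
        if lead < 0x80:                 # plain ASCII passes through
            out.append(chr(lead))
            continue
        if not 0x81 <= lead <= 0xFE:    # not a valid lead byte
            out.append(REPLACEMENT)
            continue
        if i >= len(data):              # truncated sequence at end of input
            out.append(REPLACEMENT)
            break
        trail = data[i]
        ch = TOY_INDEX.get((lead, trail))
        if ch is not None:
            out.append(ch)
            i += 1
            continue
        out.append(REPLACEMENT)
        if trail >= 0x80:
            i += 1                      # a non-ASCII trail byte is consumed
        # An ASCII trail byte is NOT consumed; the next iteration decodes it.
    return "".join(out)
```

With this sketch, the bytes 0xF3 0x41 decode to U+FFFD followed by U+0041,
matching the behavior described in item 2.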

See https://codereview.chromium.org/1026453002 for a new Blink test
added (fast/encoding/char-decoding-invalid-trail.html).

BUG=450312,430823
TEST=Blink: fast/encoding/char-decoding-{truncated,invalid-trail}.html
TEST=base_unittests --gtest_filter=*Conv*, browser_tests --gtest_filter=*ncoding*
R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/984233002


Jan 21, 2015

ICU update to 54 step 3 · afd723ba

Jungshik Shin (jungshik at google) authored 10 years ago

A. Converter update per HTML encoding spec along with changes in
  the encoding name alias table.
B. Remove all the code for converters that Blink and Chromium do not need
(SCSU, Lotus, ISO-2022-xx other than JP, BOCU, UTF-7, etc).

This reapplies the following CLs (that we used for ICU 52.1) to ICU 54.1:

https://codereview.chromium.org/598383002
https://codereview.chromium.org/654153002

We have two upstream bugs filed for A and B above:
  http://www.icu-project.org/trac/ticket/11296
  http://www.icu-project.org/trac/ticket/10303

In addition to A and B, we unified Big5 and Big5-HKSCS per
the encoding spec (bug 277868). That also includes properly supporting
the four 2-character sequences (see http://crbug.com/277868#c3).
big5_gen.sh deviates from the current spec to work around a bug
in the spec (see https://www.w3.org/Bugs/Public/show_bug.cgi?id=27878).

Moreover, ucmlocal.mk is added to list only encodings we want to support.

Also, tighten the state table for windows-949-2000.ucm, which we use
for EUC-KR for now, and drop the 'base' map for windows-{936,949}-2000.ucm.

Finally, add euc-kr-html.ucm along with scripts/euckr_gen.sh, but
it is not yet used pending the resolution of bug 450312.

Data size checkpoint: 20,566,864 bytes (the original ICU 54=25,343,024)

BUG=277868, 428145, 450312
TEST=net_unittests --gtest_filter="*ilenameUtil*"
TEST=base_unittests --gtest_filter="*Conv*"
TEST=browser_tests --gtest_filter="*ncoding*"
TEST=Blink: fast/encoding/*
R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/839713003


Apr 29, 2014

Update EUC-JP per WHATWG encoding spec · b76b3106

jshin@chromium.org authored 10 years ago

- Add missing half-width kana entries (omitted by mistake)
- Drop the 'extra' decoding-only mapping. See
  https://www.w3.org/Bugs/Public/show_bug.cgi?id=25266
- Regenerate ICU data files (*dat and assembly source files) for Linux,
  Mac, Windows and Android. (They will not be shown at
  codereview.chromium.org because they're too large.)
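
For context on the half-width kana entries, here is a minimal Python sketch of
how such a two-byte sequence can be decoded. It assumes the encoding spec's
EUC-JP rule that lead byte 0x8E followed by a trail byte in [A1,DF] maps
linearly onto U+FF61..U+FF9F; the function name is hypothetical and this is
not the generated mapping table.

```python
from typing import Optional


def decode_halfwidth_kana(lead: int, trail: int) -> Optional[str]:
    """Return the half-width katakana for a 0x8E-led pair, or None if invalid."""
    if lead == 0x8E and 0xA1 <= trail <= 0xDF:
        # Assumed linear mapping: trail 0xA1 corresponds to U+FF61.
        return chr(0xFF61 - 0xA1 + trail)
    return None
```

Under that assumption, a trail byte outside [A1,DF] after 0x8E is not a valid
half-width kana sequence and the function returns None.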

BUG=132145,78847
TEST=When ICU is rolled in, base_unittests --gtest_filter=*ICU*
and layout tests

R=jsbell@chromium.org

Review URL: https://codereview.chromium.org/251203003

git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@266919 4ff67af0-8c30-449e-8e8b-ad334ec8d88c


Apr 07, 2014

ICU 52 local changes part1 · 4dfa619c

jshin@chromium.org authored 10 years ago

1. Remove all the obsolete patches. There are lots of them because most of
the local patches to ICU 4.6.1 have either been accepted or become obsolete.
The largest local patch removed is the one for the CJ word breaker, which
was upstreamed.

Android didn't apply the CJK word breaker patch to ICU 4.6 to reduce the
data size. In a follow-up CL, we'll have an Android-specific change for this issue.

Besides, we don't include patches for files we locally add because the
patches for new files are redundant. Instead, they're mentioned in
README.chromium.

2. We don't need platform-specific headers any more (pmac, plinux, pwin, etc).
They're combined into a single file and all platforms we care about are
well-supported except for one issue on Android/QNX. putil.patch takes care
of it.

3. Breakiterator patches for a few remaining issues. We also use
a much smaller Khmer dictionary (upstream fix pending).

4. Converter
- Introduced two WHATWG-encoding-standard-compliant mapping tables
(derived directly from the spec with a script) for EUC-JP
and CP866
- Disabled various non-HTML5 encodings such as SCSU, BOCU, UTF-7, CESU-8,
saving ~30kB in the code size. Even though we link statically, they're
still pulled in as a part of uconv.
- Disabled ISO-2022-JP-[1-4] in ucnv2022.c
- Removed a number of encoding alias entries in the alias table
leading to ~40kB data size reduction.

5. Locale data: not yet updated. We need to trim it substantially.

6. Unihan collation removal is now done with a script (scripts/remove_unihan.sh)

7. Updated timezone data to the latest (2014b) as of today.

8. Customized transliterator for Greek uppercasing

9. Updated data build related patches. The windows data build patch has yet
to be updated.

10. The updated ICU data file/assembly source files are not included in this
CL. They'll be updated in a separate CL.
With all the size reduction changes applied, the data size went down
from > 23MB to 12.4MB. However, it's still 2.5MB larger than ICU 4.6.1
data. The locale data trimming will bring it down further.

11. Update README.chromium accordingly. The only exceptions are
item #5 and the android entry in item #3 (breakiterator. see #1 above)

BUG=259715,76328
TEST=Following the procedure outlined in README.chromium, one can build
the icu data file.

R=jsbell@chromium.org, mark@chromium.org

Review URL: https://codereview.chromium.org/224943002

git-svn-id: http://src.chromium.org/svn/trunk/deps/third_party/icu52@262192 4ff67af0-8c30-449e-8e8b-ad334ec8d88c
