Translator's preface
This is a translation of the explainer section of the Intl.Segmenter proposal, which will likely be added to the next ECMAScript specification.
The proposal is already implemented in V8 and is available without a flag as of version 8.7 (more precisely, 8.7.38 and later), so it can be tested in Google Chrome Canary (starting with version 87.0.4252.0) or in Node.js V8 Canary (starting with version v15.0.0-v8-canary202009025a2ca762b8; Windows binaries are available for v15.0.0-v8-canary202009173b56586162).
If you test in an earlier version with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be out of date. Check your results against the output shown in the code examples.
After the translation, links are provided to background material on the problems this proposal addresses.
Intl.Segmenter: Unicode segmentation in JavaScript
The proposal is at Stage 3, championed by Richard Gibson.
Motivation
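A code point is not a «letter» or a displayed unit on the screen. That role belongs to the grapheme, which can consist of multiple code points (for example, accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm for finding the boundaries between graphemes. This can be useful when implementing advanced editors and input methods, or other forms of text processing.
Unicode also defines an algorithm for finding boundaries between words and sentences, which CLDR (Common Locale Data Repository) tailors per locale. These boundaries can be useful, for example, when implementing a text editor that has commands for jumping between or highlighting words and sentences.
Grapheme, word, and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves.
Chrome has shipped its own non-standard segmentation API, Intl.v8BreakIterator, for several years. For a few reasons, however, that API does not seem suitable for standardization. This explainer outlines a new API that aims to be more in line with modern JavaScript API design, that is, design from ES2015 onward.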
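Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.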
// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
// Use it to get an iterable of segments for a string (note the two spaces after "?").
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);
// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}
// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»
For more flexible traversal, the API also supports direct random access.
// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;
current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }
current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }
current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }
current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }
current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }
current = segments.containing(current.index + current.segment.length)
// → undefined
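The same random-access call can also walk a string from the end toward the start. The following is a minimal sketch, not part of the proposal: the generator name segmentsBackward is hypothetical, and it relies only on containing() returning undefined for an out-of-bounds index.

```javascript
// Hypothetical helper (not part of the proposal): yields segments from the
// last to the first by repeatedly asking for the segment that contains the
// code unit just before the current segment's start.
function* segmentsBackward(input, segmenter) {
  const segments = segmenter.segment(input);
  let current = segments.containing(input.length - 1);
  while (current !== undefined) {
    yield current;
    // containing(-1) is out of bounds and returns undefined, ending the loop.
    current = segments.containing(current.index - 1);
  }
}

const wordSegmenter = new Intl.Segmenter("fr", { granularity: "word" });
const reversed = [...segmentsBackward("Allons-y!", wordSegmenter)].map((s) => s.segment);
console.log(reversed);
// → [ "!", "y", "-", "Allons" ]
```

Each step is an independent lookup, so no iterator state has to be kept between calls.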
API
new Intl.Segmenter(locale, options)
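Creates a new locale-dependent Segmenter.
If options is provided, it is treated as an object whose granularity property specifies the segmentation level ("grapheme", "word", or "sentence"; the default is "grapheme").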
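As a rough illustration of the constructor and its granularity option, the sketch below checks the default and counts graphemes. The helper name countGraphemes is hypothetical; resolvedOptions() is not described in the text above and is used here on the assumption that the implementation provides it, as the specification draft does.

```javascript
// With no options, the granularity defaults to "grapheme".
const defaultSegmenter = new Intl.Segmenter("en");
const defaultGranularity = defaultSegmenter.resolvedOptions().granularity;
console.log(defaultGranularity); // → "grapheme"

// Hypothetical helper: counts user-perceived characters, not code units.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const accented = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
console.log(accented.length);          // → 2 (code units)
console.log(countGraphemes(accented)); // → 1 (grapheme)

// A granularity outside the three supported values is rejected.
let rejected = false;
try {
  new Intl.Segmenter("en", { granularity: "line" });
} catch (error) {
  rejected = error instanceof RangeError;
}
console.log(rejected); // → true
```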
Intl.Segmenter.prototype.segment(string)
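Creates a new Iterable %Segments% instance for the input string, using the Segmenter's locale and granularity.
Segments are described by plain objects with the following data properties:
- segment is the string segment.
- index is the code unit index in the string at which the segment begins.
- input is the string being segmented.
- isWordLike is true when the granularity is "word" and the segment is word-like (consisting of letters, numbers, ideographs, and so on); false when the granularity is "word" and the segment is not word-like (consisting of spaces, punctuation, and so on); and undefined when the granularity is not "word".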
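The segment data properties can be inspected directly. This is a small sketch reusing the "Allons-y!" string from the random-access example; the variable names are illustrative only.

```javascript
const sample = "Allons-y!";

// Word granularity: every data object carries isWordLike as a boolean.
const wordData = [...new Intl.Segmenter("fr", { granularity: "word" }).segment(sample)];
const words = wordData.filter((s) => s.isWordLike).map((s) => s.segment);
console.log(words); // → [ "Allons", "y" ]

// Every data object also refers back to the string being segmented.
const allReferToInput = wordData.every((s) => s.input === sample);
console.log(allReferToInput); // → true

// Grapheme granularity (the default): isWordLike is undefined.
const [firstGrapheme] = new Intl.Segmenter("fr").segment(sample);
console.log(firstGrapheme.segment, firstGrapheme.index, firstGrapheme.isWordLike);
// → A 0 undefined
```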
Methods of %Segments%.prototype:
%Segments%.prototype.containing(index)
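Returns a segment data object describing the segment in the string that includes the code unit at the specified index, or undefined if the index is out of bounds.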
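One plausible use of containing() is finding the word under a cursor position. The sketch below assumes a hypothetical helper named wordAt; it is one way to combine containing() with isWordLike, not an API from the proposal.

```javascript
// Hypothetical helper: returns the word-like segment that covers the given
// code unit position, or undefined if there is none.
function wordAt(text, position, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const data = segmenter.segment(text).containing(position);
  return data && data.isWordLike ? data.segment : undefined;
}

console.log(wordAt("Allons-y!", 2, "fr")); // → "Allons"
console.log(wordAt("Allons-y!", 6, "fr")); // → undefined ("-" is not word-like)
console.log(wordAt("Allons-y!", 9, "fr")); // → undefined (index is out of bounds)
```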
%Segments%.prototype[Symbol.iterator]
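Creates a new %SegmentIterator% instance that lazily finds segments in the input string using the Segmenter's locale and granularity, keeping track of its position within the string.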
Methods of %SegmentIterator%.prototype:
%SegmentIterator%.prototype.next()
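The next() method implements the Iterator interface: it finds the next segment and returns a corresponding IteratorResult object whose value property is a segment data object as described above.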
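To make the iterator protocol concrete, the sketch below drives an iterator by hand instead of using for...of. It uses grapheme granularity on a plain ASCII string so the expected segments are unambiguous; the variable names are illustrative only.

```javascript
const abcSegments = new Intl.Segmenter("en", { granularity: "grapheme" }).segment("abc");

// Each call to [Symbol.iterator]() creates a fresh iterator with its own position.
const iterator = abcSegments[Symbol.iterator]();

const first = iterator.next();
console.log(first.done, first.value.segment, first.value.index); // → false a 0

iterator.next(); // "b"
iterator.next(); // "c"

// Once the string is exhausted, the result is { value: undefined, done: true }.
const end = iterator.next();
console.log(end.done, end.value); // → true undefined

// A second iterator starts again from the beginning of the string.
const restarted = abcSegments[Symbol.iterator]().next();
console.log(restarted.value.segment); // → a
```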
FAQ
? ?
— , . . . CLDR. , CLDR/ICU , .
API ?
, 3- , . TC39 . ; , , .
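Why is line breaking not included?
Line breaking was provided in an earlier version of this API, but it was removed because a line breaking API on its own would be incomplete: line breaking is typically used when laying out text, and text layout requires a larger set of APIs (for example, determining the width of a rendered string). For this reason, we suggest continuing development of a line breaking API as part of the CSS Houdini effort.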
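Why is hyphenation not included?
Hyphenation is expected to have a different sort of API shape, for several reasons: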
- .
- .
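- Hyphenation interacts with line layout and font rendering in a more complex way, so it may be better to expose it at that level (that is, in a Web API (Web Platform) rather than in ECMAScript).
- Hyphenation is simply less well developed in the internationalization world. CLDR and ICU do not support it yet. Some web browsers are only now getting support for it in CSS, and it is often not done perfectly. It could use more time to mature. By contrast, grapheme, word, sentence, and line breaks have been in the Unicode specification for a long time; this project is ready to go.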
?
%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .
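Why is this an Intl API instead of String methods?
All of these break types are locale-dependent, and some allow complex options. The result of the segment() method is a SegmentIterator. For many non-trivial cases like this, analogous APIs are placed on the Intl object, in ECMA-402. This allows the work done on each instantiation to be shared, which improves performance. A convenience method on String could be added as a follow-on proposal.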
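What exactly does the index refer to?
An index n refers to a code unit index that could be the start of a segment. For example, when iterating by words in English over "Hello, world\u{1F499}" (the last character is an emoji, a blue heart), segments start at indexes 0, 5, 6, 7, and 12. That is: ┃Hello┃,┃ ┃world┃\u{1F499}┃, where the boundaries are counted in code units, not code points.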
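The difference between code units and code points shows up as soon as the string contains a character outside the Basic Multilingual Plane. A minimal sketch using the same string; the variable names are illustrative only.

```javascript
const greeting = "Hello, world\u{1F499}";
const englishWords = new Intl.Segmenter("en", { granularity: "word" });

// Collect the starting index of every segment.
const starts = Array.from(englishWords.segment(greeting), (s) => s.index);
console.log(starts); // → [ 0, 5, 6, 7, 12 ]

// The emoji occupies two code units (12 and 13) but is a single code point.
console.log(greeting.length);      // → 14 (code units)
console.log([...greeting].length); // → 13 (code points)

// Index 13 falls inside the emoji, so it maps back to the segment starting at 12.
const insideEmoji = englishWords.segment(greeting).containing(13);
console.log(insideEmoji.index); // → 12
```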
?
, next().
What happens when I try to use random access with non-Number values?
It sounds like someone works in QA ;)
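Random access arguments are converted to Numbers: null to 0, booleans to 0 or 1, strings by parsing them as numbers, objects by first converting them to primitives, and Symbol, BigInt, undefined, and NaN values fail*. Fractional components are truncated, and infinite Numbers are accepted as-is (although they are always out of bounds, and therefore never yield a segment).
* Translator's note: the original says "fail". In Chrome Canary, Symbol and BigInt values throw a TypeError, while undefined and NaN are treated as 0.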
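The conversions can be checked against an implementation. The sketch below reflects the behavior the translator's note reports for Chrome Canary (undefined and NaN treated as 0, TypeError for Symbol and BigInt); other engines or versions may differ, and the variable names are illustrative only.

```javascript
const coercionSegments = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");

// Each argument is converted to a Number and truncated before the lookup.
const lookups = {
  null: coercionSegments.containing(null).segment,           // null → 0
  true: coercionSegments.containing(true).segment,           // true → 1
  string: coercionSegments.containing("6").segment,          // "6" → 6
  fraction: coercionSegments.containing(6.9).segment,        // 6.9 → 6
  object: coercionSegments.containing({ valueOf: () => 7 }).segment, // → 7
  undefined: coercionSegments.containing(undefined).segment, // treated as 0
  NaN: coercionSegments.containing(NaN).segment,             // treated as 0
};
console.log(lookups);

// Infinite values are accepted but are always out of bounds.
const atInfinity = coercionSegments.containing(Infinity);
console.log(atInfinity); // → undefined

// Symbol and BigInt values cannot be converted to a Number.
const threw = [Symbol("s"), 1n].map((bad) => {
  try {
    coercionSegments.containing(bad);
    return false;
  } catch (error) {
    return error instanceof TypeError;
  }
});
console.log(threw); // → [ true, true ]
```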
Further reading on Unicode in JavaScript.
- Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Dmitri Pavlutin. What every JavaScript developer should know about Unicode
- Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
- Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
- Jonathan New. "\u{1F4A9}".length === 2
- Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
- Mathias Bynens. JavaScript has a Unicode problem
- Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
- Mathias Bynens. Unicode property escapes in JavaScript regular expressions
- Mathias Bynens. Unicode sequence property escapes
- Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources