2016-03-02 00:39:01 +00:00
Test for unicode regular expression processing

On success, you will see a series of "PASS" messages, followed by "TEST COMPLETE".
2016-03-04 01:24:28 +00:00
PASS "a".match(/a/u)[0].length is 1
PASS "a".match(/A/ui)[0].length is 1
2016-03-02 00:39:01 +00:00
PASS "a".match(/a/u)[0].length is 1
PASS "a".match(/A/iu)[0].length is 1
PASS "Ȓ".match(/Ȓ/u)[0].length is 1
2016-03-04 01:24:28 +00:00
PASS "Ȓ".match(/Ȓ/u)[0].length is 1
2016-03-02 00:39:01 +00:00
PASS "ሴ".match(/ሴ/u)[0].length is 1
2016-03-04 01:24:28 +00:00
PASS "ሴ".match(/ሴ/u)[0].length is 1
PASS "⪼".match(/⪼/u)[0].length is 1
2016-03-02 00:39:01 +00:00
PASS "㿭".match(/㿭/u)[0].length is 1
PASS "𒍅".match(/𒍅/u)[0].length is 2
PASS "𒍅".match(/𒍅/u)[0].length is 2
2016-03-04 01:24:28 +00:00
PASS "𝌆".match(/𝌆/u)[0].length is 2
2016-03-02 00:39:01 +00:00
PASS /𐑏/u.test("𐑏") is true
PASS /𐑏/u.test("𐑏") is true
PASS "𝌆".match(/𝌆/u)[0].length is 2
PASS /(𐀀|𐐀|𐐩)/u.test("𐐀") is true
PASS "𐄣".match(/a|𐄣|b/u)[0].length is 2
PASS "b".match(/a|𐄣|b/u)[0].length is 1
PASS /(?:a|𐄣|b)x/u.test("𐄣") is false
PASS /(?:a|𐄣|b)x/u.test("𐄣x") is true
PASS /(?:a|𐄣|b)x/u.test("b") is false
PASS /(?:a|𐄣|b)x/u.test("bx") is true
PASS "a𐄣x".match(/a𐄣b|a𐄣x/u)[0].length is 4
PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐀x") is true
PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐩x") is true
PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐁x") is true
PASS /(𐀀|𐐀|𐐩)x/ui.test("𐐨x") is true
PASS "𐐩".match(/a|𐐁|b/iu)[0].length is 2
PASS "B".match(/a|𐄣|b/iu)[0].length is 1
PASS /(?:A|𐄣|b)x/iu.test("𐄣") is false
PASS /(?:A|𐄣|b)x/iu.test("𐄣x") is true
PASS /(?:A|𐄣|b)x/iu.test("b") is false
PASS /(?:A|𐄣|b)x/iu.test("bx") is true
PASS "a𐄣X".match(/a𐄣b|a𐄣x/iu)[0].length is 4
PASS "Ťx".match(/ťx/iu)[0].length is 2
2016-04-14 00:47:40 +00:00
PASS /\w/iu.test("ſ") is true
PASS /\w/iu.test("K") is true
ES6 Change: Unify handling of RegExp CharacterClassEscapes \w and \W and Word Asserts \b and \B
https://bugs.webkit.org/show_bug.cgi?id=158505
Reviewed by Geoffrey Garen.
Source/JavaScriptCore:
This change makes the CharacterClassEscape \w match the inverse of
\W, and vice versa, for RegExps with both the unicode and ignore case flags.
Before this change, both /\w/ui and /\W/ui RegExps would match the characters
k, K, s, S, \u017f (Latin Small Letter Long S) and \u212a (Kelvin Sign).
This was due to how the ES6 standard defined matching of character classes,
specifically that the abstract operation "Canonicalize()" is called for the
character to be matched AND for the characters in the character class we are
matching against. This change makes \W always be the inverse of \w.
It is still the case that the set of characters that match \w changes
depending on a regular expression's flags.
The only real changes occur for regular expressions with both the unicode and
ignore case flags set. Updated the character class generator to make
nonwordUnicodeIgnoreCaseChar not include k, K, s, S, \u017f and \u212a.
Changed BytecodePattern.wordcharCharacterClass to use the correct
word character class for the flags. Simplified character class setup
in the pattern to use m_pattern.wordUnicodeIgnoreCaseCharCharacterClass and
invert as appropriate when unicode and ignore case are both set.
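To illustrate the invariant described above, here is a minimal JavaScript sketch (not part of this change) that checks, for the six characters named in this message, that \w and \W give opposite answers when both the unicode and ignore case flags are set. The helper name classify and the samples array are hypothetical and exist only for this illustration.

```javascript
// Characters singled out in the commit message: k, K, s, S,
// U+017F (Latin Small Letter Long S) and U+212A (Kelvin Sign).
const samples = ["k", "K", "s", "S", "\u017f", "\u212a"];

// Hypothetical helper (not a WebKit API): reports whether a character
// matches \w and \W under the unicode + ignore case flags.
function classify(ch) {
  return { word: /\w/iu.test(ch), nonWord: /\W/iu.test(ch) };
}

const results = samples.map((ch) => ({ ch, ...classify(ch) }));
for (const r of results) {
  // Print the escaped form so the long s and Kelvin sign are distinguishable.
  console.log(JSON.stringify(r.ch), "\\w:", r.word, "\\W:", r.nonWord);
}
```

On an engine that includes this change, each character is expected to match \w and not \W, consistent with the PASS lines recorded in this file.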
* create_regex_tables:
* yarr/YarrInterpreter.h:
(JSC::Yarr::BytecodePattern::BytecodePattern):
* yarr/YarrPattern.cpp:
(JSC::Yarr::YarrPatternConstructor::atomBuiltInCharacterClass):
LayoutTests:
Updated and added test cases.
* js/regexp-unicode-expected.txt:
* js/script-tests/regexp-unicode.js:
Canonical link: https://commits.webkit.org/177243@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@202490 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2016-06-27 17:38:55 +00:00
PASS /\W/iu.test("ſ") is false
PASS /\W/iu.test("K") is false
2016-04-14 00:47:40 +00:00
PASS /[\w\d]/iu.test("ſ") is true
PASS /[\w\d]/iu.test("K") is true
PASS /[^\w\d]/iu.test("ſ") is false
PASS /[^\w\d]/iu.test("K") is false
2016-06-27 17:38:55 +00:00
PASS /[\W\d]/iu.test("ſ") is false
PASS /[\W\d]/iu.test("K") is false
PASS /[^\W\d]/iu.test("ſ") is true
PASS /[^\W\d]/iu.test("K") is true
2016-04-14 00:47:40 +00:00
PASS /\w/iu.test("S") is true
PASS /\w/iu.test("K") is true
2016-06-27 17:38:55 +00:00
PASS /\W/iu.test("S") is false
PASS /\W/iu.test("K") is false
2016-04-14 00:47:40 +00:00
PASS /[\w\d]/iu.test("S") is true
PASS /[\w\d]/iu.test("K") is true
PASS /[^\w\d]/iu.test("S") is false
PASS /[^\w\d]/iu.test("K") is false
2016-06-27 17:38:55 +00:00
PASS /[\W\d]/iu.test("S") is false
PASS /[\W\d]/iu.test("K") is false
PASS /[^\W\d]/iu.test("S") is true
PASS /[^\W\d]/iu.test("K") is true
PASS "Grasſoden is old German for grass".match(/.*?\Bs\u017foden/iu)[0] is "Grasſoden"
PASS "Grasſoden is old German for grass".match(/.*?\B\u017foden/iu)[0] is "Grasſoden"
PASS "Grasſoden is old German for grass".match(/.*?\Boden/iu)[0] is "Grasſoden"
PASS "Grasſoden is old German for grass".match(/.*?\Bden/iu)[0] is "Grasſoden"
PASS "Water freezes at 273K which is 0C.".split(/\b\s/iu) is ["Water","freezes","at","273K","which","is","0C."]
2016-03-02 00:39:01 +00:00
PASS "𝌆".match(/^.$/u)[0].length is 2
PASS "It is 78°".match(/.*/u)[0].length is 9
2016-03-04 01:24:28 +00:00
PASS stringWithDanglingFirstSurrogate.match(/.*/u)[0].length is 3
PASS stringWithDanglingSecondSurrogate.match(/.*/u)[0].length is 3
2016-03-02 00:39:01 +00:00
PASS "𝌆".match(/[𝌆a]/)[0].length is 1
PASS "𝌆".match(/[a𝌆]/u)[0].length is 2
PASS "𝌆".match(/[𝌆a]/u)[0].length is 2
PASS "𝌆".match(/[a-𝌆]/)[0].length is 1
PASS "𝌆".match(/[a-𝌆]/u)[0].length is 2
PASS "X".match(/[ -𐑏]/u)[0].length is 1
PASS "က".match(/[ -𐑏]/u)[0].length is 1
PASS "𐐧".match(/[ -𐑏]/u)[0].length is 2
PASS re1.test("Z") is false
PASS re1.test("က") is false
PASS re1.test("𐐀") is false
PASS re2.test("A") is true
PASS re2.test("") is false
PASS re2.test("𒍅") is true
2020-02-06 21:36:48 +00:00
PASS /[𐰁<>#<23>]/u.exec("𐰁").toString() is "𐰁"
PASS /[<5B>𐰁<EFBFBD>]/u.exec("𐰁").toString() is "𐰁"
PASS /[<5B>#<23>𐰁]/u.exec("𐰁").toString() is "𐰁"
PASS /[<5B>𐰁<EFBFBD>]/u.exec("𐰁").toString() is "𐰁"
PASS /[𐰁<>#<23>]{2}/u.exec("𐰁") is null
PASS /[<5B>𐰁<EFBFBD>]{2}/u.exec("𐰁") is null
PASS /[<5B>#<23>𐰁]{2}/u.exec("𐰁") is null
PASS /[<5B>𐰁<EFBFBD>]{2}/u.exec("𐰁") is null
PASS /<2F>|<7C>|𐰁/u.exec("𐰁").toString() is "𐰁"
PASS /<2F>|𐰁|<7C>/u.exec("𐰁").toString() is "𐰁"
PASS /<2F>|<7C>|𐰁/u.exec("<22>").toString() is "<22>"
PASS /<2F>|𐰁|<7C>/u.exec("<22>").toString() is "<22>"
PASS /<2F>𐰁/u.exec("𐰁") is null
PASS /<2F>𐰁/u.exec("<22>") is null
PASS "<22>𐰁".match(/<2F>𐰁/u)[0].length is 3
2016-03-31 00:38:20 +00:00
PASS /𝌆{2}/u.test("𝌆𝌆") is true
PASS /𝌆{2}/u.test("𝌆𝌆") is true
Implement Unicode RegExp support in the YARR JIT
https://bugs.webkit.org/show_bug.cgi?id=174646
Reviewed by Filip Pizlo.
Source/JavaScriptCore:
This support is only implemented for 64 bit platforms. It wouldn't be too hard to add support
for 32 bit platforms with a reasonable number of spare registers. This code slightly refactors
register usage to reduce the number of callee save registers used for non-Unicode expressions.
For Unicode expressions, there are several more registers used to store constant values for
processing surrogate pairs as well as discerning whether a character belongs to the Basic
Multilingual Plane (BMP) or one of the Supplemental Planes.
This implements JIT support for Unicode expressions very similar to how the interpreter works.
Just like in the interpreter, backtracking code uses more space on the stack to save positions.
Moved the BackTrackInfo* structs to YarrPattern as separate functions. Added xxxIndex()
functions to each of these to simplify how the JIT code reads and writes the structure fields.
Given that reading surrogate pairs and transforming them into a single code point takes a
little processing, the code that implements reading a Unicode character is implemented as a
leaf function added to the end of the JIT'ed code. The calling convention for
"tryReadUnicodeCharacterHelper()" is non-standard given that the rest of the code assumes
that argument values stay in argument registers for most of the generated code.
That helper takes the starting character address in one register, regUnicodeInputAndTrail,
and uses another dedicated temporary register, regUnicodeTemp. The result is typically
returned in regT0. If another return register is requested, we'll create an inline copy of
that function.
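As a rough illustration of what "reading a Unicode character" involves, the following JavaScript sketch combines a surrogate pair into a single code point. It is not the JIT helper itself: the function name readCodePoint and its return shape are hypothetical, and the handling of a dangling surrogate (returned as a single code unit) is an assumption based on the dangling-surrogate tests in this file.

```javascript
// Hypothetical model of reading one Unicode character from UTF-16 input.
// Returns the code point and how many 16-bit code units were consumed.
function readCodePoint(str, index) {
  const lead = str.charCodeAt(index);
  // A lead (high) surrogate lies in 0xD800..0xDBFF.
  if (lead >= 0xd800 && lead <= 0xdbff && index + 1 < str.length) {
    const trail = str.charCodeAt(index + 1);
    // A trail (low) surrogate lies in 0xDC00..0xDFFF.
    if (trail >= 0xdc00 && trail <= 0xdfff) {
      // Standard UTF-16 decoding of a supplementary plane code point.
      const codePoint = 0x10000 + ((lead - 0xd800) << 10) + (trail - 0xdc00);
      return { codePoint, length: 2 };
    }
  }
  // BMP character, or a surrogate without a partner: one code unit.
  return { codePoint: lead, length: 1 };
}

console.log(readCodePoint("\u{1D306}", 0)); // one supplementary character
console.log(readCodePoint("a", 0));         // one BMP character
```

The length field is what makes the "index" bookkeeping different for Unicode patterns: a successful read can advance by two code units instead of one.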
Added a new flag to CharacterClass to signify if a class has non-BMP characters. This flag
is used in optimizeAlternative() where we swap the order of a fixed character class term with
a fixed character term that immediately follows it. Since the non-BMP character class may
increment "index" when matching, that must be done first before trying to match a fixed
character term later in the string.
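The reasoning above can be sketched in JavaScript under some loudly stated assumptions: a character class is modeled as a list of inclusive code point ranges, and the names hasNonBMPCharacters (borrowed from the function list below but reimplemented here), matchClass and canSwapWithFollowingCharacter are illustrative only. The swap condition reflects one reading of this message, not the actual optimizeAlternative() code.

```javascript
// Assumed model: a class is an array of [begin, end] code point ranges.
function hasNonBMPCharacters(ranges) {
  return ranges.some(([, end]) => end > 0xffff);
}

// Number of UTF-16 code units consumed when the class matches at `index`,
// or 0 when it does not match.
function matchClass(ranges, str, index) {
  const cp = str.codePointAt(index);
  if (cp === undefined) return 0;
  const matched = ranges.some(([begin, end]) => cp >= begin && cp <= end);
  if (!matched) return 0;
  return cp > 0xffff ? 2 : 1;
}

// Assumption: checking the following fixed character first is only safe
// when the class always consumes exactly one code unit, because otherwise
// the position of that character depends on what the class matched.
function canSwapWithFollowingCharacter(ranges) {
  return !hasNonBMPCharacters(ranges);
}

const asciiLetters = [[0x61, 0x7a]];
const spaceToDeseret = [[0x20, 0x1044f]];
console.log(canSwapWithFollowingCharacter(asciiLetters));
console.log(canSwapWithFollowingCharacter(spaceToDeseret));
```

With the second class, the character that follows may sit one or two code units later, which is why the class has to be matched before the fixed character term.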
Given the usefulness of the LEA instruction on X86 to create a single pointer value from a
base with index and offset, which the YARR JIT uses heavily, I added a new macroAssembler
function, getEffectiveAddress64(), with an ARM64 implementation. It just calls x86Lea64()
on X86-64. Also added an ImplicitAddress version of load16Unaligned().
(JSC::MacroAssemblerARM64::load16Unaligned):
(JSC::MacroAssemblerARM64::getEffectiveAddress64):
* assembler/MacroAssemblerX86Common.h:
(JSC::MacroAssemblerX86Common::load16Unaligned):
(JSC::MacroAssemblerX86Common::load16):
* assembler/MacroAssemblerX86_64.h:
(JSC::MacroAssemblerX86_64::getEffectiveAddress64):
* create_regex_tables:
* runtime/RegExp.cpp:
(JSC::RegExp::compile):
* yarr/YarrInterpreter.cpp:
* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
(JSC::Yarr::YarrGenerator::readCharacter):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::matchAssertionWordchar):
(JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generate):
(JSC::Yarr::YarrGenerator::backtrack):
(JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::generateReturn):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):
* yarr/YarrJIT.h:
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
(JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
(JSC::Yarr::BackTrackInfoBackReference::beginIndex):
(JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
(JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
(JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):
LayoutTests:
Updated tests.
* js/regexp-unicode-expected.txt:
* js/script-tests/regexp-unicode.js:
Canonical link: https://commits.webkit.org/192507@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@221052 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2017-08-22 22:43:08 +00:00
PASS "𐐅𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
PASS "𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
2016-03-31 00:38:20 +00:00
PASS "𐐁𐐁𐐀".match(/𐐁{1,3}/u)[0] is "𐐁𐐁"
PASS "𐐁𐐩".match(/𐐁{1,3}/iu)[0] is "𐐁𐐩"
PASS "𐐁𐐩𐐪𐐩".match(/𐐁{1,}/iu)[0] is "𐐁𐐩"
2016-03-02 00:39:01 +00:00
PASS "𐌑𐌑𐌑".match(/𐌑*a|𐌑*./u)[0] is "𐌑𐌑𐌑"
PASS "a𐌑𐌑".match(/a𐌑*?$/u)[0] is "a𐌑𐌑"
PASS "a𐌑𐌑𐌑c".match(/a𐌑*cd|a𐌑*c/u)[0] is "a𐌑𐌑𐌑c"
PASS "a𐌑𐌑𐌑c".match(/a𐌑+cd|a𐌑+c/u)[0] is "a𐌑𐌑𐌑c"
PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?./u)[0] is "𐌑𐌑"
PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?$/u)[0] is "𐌑𐌑𐌑"
PASS "a𐌑𐌑𐌑c".match(/a𐌑*?cd|a𐌑*?c/u)[0] is "a𐌑𐌑𐌑c"
PASS "a𐌑𐌑𐌑c".match(/a𐌑+?cd|a𐌑+?c/u)[0] is "a𐌑𐌑𐌑c"
PASS "𐌑𐌑𐌑".match(/𐌑+?a|𐌑+?./iu)[0] is "𐌑𐌑"
PASS "𐐪𐐪𐌑".match(/𐐂*|𐐂*𐌑/iu)[0] is "𐐪𐐪𐌑"
PASS "𐐪𐐪𐌑".match(/𐐂+|𐐂+𐌑/iu)[0] is "𐐪𐐪𐌑"
PASS "𐐪𐐪𐌑".match(/𐐂*?|𐐂*?𐌑/iu)[0] is "𐐪𐐪𐌑"
PASS "𐐪𐐪𐌑".match(/𐐂+?|𐐂+?𐌑/iu)[0] is "𐐪𐐪𐌑"
PASS "ab𐌑c𐨁".match(/abc|ab𐌑cd|ab𐌑c𐨁d|ab𐌑c𐨁/u)[0] is "ab𐌑c𐨁"
PASS "ab𐐨c𐨁".match(/abc|ab𐐀cd|ab𐐀c𐨁d|ab𐐀c𐨁/iu)[0] is "ab𐐨c𐨁"
PASS /abc|ab𐐀cd|ab𐐀c𐨁d|ab𐐀c𐨁/iu.test("qwerty123") is false
PASS "a𐐨𐐨𐐨c".match(/ac|a𐐀*cd|a𐐀+cd|a𐐀+c/iu)[0] is "a𐐨𐐨𐐨c"
PASS "ab𐐨𐐨𐐨c𐨁".match(/abc|ab𐐀*cd|ab𐐀+c𐨁d|ab𐐀+c𐨁/iu)[0] is "ab𐐨𐐨𐐨c𐨁"
PASS "ab𐐨𐐨𐐨".match(/abc|ab𐐨*./u)[0] is "ab𐐨𐐨𐐨"
PASS "ab𐐨𐐨𐐨".match(/abc|ab𐐀*./iu)[0] is "ab𐐨𐐨𐐨"
2016-03-24 14:19:37 +00:00
PASS "𐐀".match(/a*/u)[0].length is 0
PASS "𐐀".match(/a*/ui)[0].length is 0
PASS "𐐀".match(/\d*/u)[0].length is 0
PASS "123𐐀".match(/\d*/u)[0] is "123"
PASS "12X3𐐀4".match(/\d{0,1}/ug) is ["1", "2", "", "3", "", "4", ""]
2017-08-22 22:43:08 +00:00
PASS "𐐂𐐅𐐅𐐂𐐅𐐅𐐅".match(/𐐅{3}/u)[0] is "𐐅𐐅𐐅"
2017-08-23 22:24:30 +00:00
PASS "a𐐐𐐐b".match(/a(𐐐*?)bc|a(𐐐*?)b/ui)[0] is "a𐐐𐐐b"
2016-03-02 00:39:01 +00:00
PASS match3[0] is "a𐐐𐐐b"
PASS match3[1] is undefined.
PASS match3[2] is "a𐐐𐐐b"
PASS match4[0] is "a𐐸𐐸b"
PASS match4[1] is undefined.
PASS match4[2] is "𐐸𐐸"
PASS match5[0] is "a𐐒𐐒b𐐒𐐒"
PASS match5[1] is undefined.
PASS match5[2] is "𐐒𐐒"
PASS match6[0] is "a𐐒𐐒b𐐺𐐒"
PASS match6[1] is undefined.
PASS match6[2] is "𐐒𐐒"
2016-03-08 18:35:58 +00:00
PASS /ſtop/ui.test("stop") is true
PASS /stop/ui.test("ſtop") is true
PASS /Kelvin/ui.test("kelvin") is true
PASS /KELVIN/ui.test("Kelvin") is true
2016-03-04 01:24:28 +00:00
PASS /\u{1}/.test("u") is true
PASS /\u{4}/.test("u") is false
PASS /\u{4}/.test("uuuu") is true
PASS "800-555-1212".match(/[0-9\-]*/u)[0].length is 12
2017-08-22 22:43:08 +00:00
PASS "🂡🃑🂸🃉🃚".match(re7)[0] is "🂡🃑"
PASS "🂡🃑🂱🃉🃚".match(re7)[0] is "🂡🃑🂱"
PASS "🂡🃑🂱🃁🃚".match(re7)[0] is "🂡🃑🂱🃁"
PASS "🂣🃑🂱🃁🃚".match(re7)[0] is "🃑🂱🃁"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]*a|[𐌐𐌑]*./iu)[0] is "𐌑𐌐𐌑"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]*?a|[𐌐𐌑]*?./iu)[0] is "𐌑"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]+a|[𐌐𐌑]+./iu)[0] is "𐌑𐌐𐌑"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]+?a|[𐌐𐌑]+?./iu)[0] is "𐌑𐌐"
2019-04-03 23:51:12 +00:00
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑]+?a$|[𐌐𐌑]+?.$/iu)[0] is "𐌑𐌐𐌑"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑x]+a|[𐌐𐌑x]+./iu)[0] is "𐌑𐌐𐌑"
PASS "𐌑𐌐𐌑".match(/[𐌁𐌑x]+?a|[𐌐𐌑x]+?./iu)[0] is "𐌑𐌐"
Implement Unicode RegExp support in the YARR JIT
https://bugs.webkit.org/show_bug.cgi?id=174646
Reviewed by Filip Pizlo.
Source/JavaScriptCore:
This support is only implemented for 64-bit platforms. It wouldn't be too hard to add support
for 32-bit platforms with a reasonable number of spare registers. This code slightly refactors
register usage to reduce the number of callee-save registers used for non-Unicode expressions.
For Unicode expressions, several more registers are used to store constant values for
processing surrogate pairs as well as discerning whether a character belongs to the Basic
Multilingual Plane (BMP) or one of the Supplementary Planes.
This implements JIT support for Unicode expressions very similar to how the interpreter works.
Just like in the interpreter, backtracking code uses more space on the stack to save positions.
Moved the BackTrackInfo* structs to YarrPattern as separate functions. Added xxxIndex()
functions to each of these to simplify how the JIT code reads and writes the structure fields.
Given that reading surrogate pairs and transforming them into a single code point takes a
little processing, the code that implements reading a Unicode character is implemented as a
leaf function added to the end of the JIT'ed code. The calling convention for
"tryReadUnicodeCharacterHelper()" is non-standard given that the rest of the code assumes
that argument values stay in argument registers for most of the generated code.
That helper takes the starting character address in one register, regUnicodeInputAndTrail,
and uses another dedicated temporary register, regUnicodeTemp. The result is typically
returned in regT0. If another return register is requested, we'll create an inline copy of
that function.
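As a rough illustration of what reading a Unicode character involves, the sketch below combines a UTF-16 lead and trail surrogate into one code point. It is plain JavaScript, not the generated machine code, and the name readCodePoint and its return shape are assumptions made for this example only.

```javascript
// Hypothetical helper for illustration; it is not the JIT's
// tryReadUnicodeCharacterHelper(). It decodes the code point that starts
// at `index` and reports how many UTF-16 code units it occupies.
function readCodePoint(str, index) {
  const lead = str.charCodeAt(index);
  // Lead (high) surrogates occupy 0xD800..0xDBFF.
  if (lead >= 0xD800 && lead <= 0xDBFF && index + 1 < str.length) {
    const trail = str.charCodeAt(index + 1);
    // Trail (low) surrogates occupy 0xDC00..0xDFFF.
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      const codePoint = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
      return { codePoint, length: 2 };
    }
  }
  // BMP character or unpaired surrogate: one code unit.
  return { codePoint: lead, length: 1 };
}
```

A non-BMP character therefore advances the input position by two code units, which is consistent with the "length is 2" expectations in this test file.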
Added a new flag to CharacterClass to signify if a class has non-BMP characters. This flag
is used in optimizeAlternative() where we swap the order of a fixed character class term with
a fixed character term that immediately follows it. Since the non-BMP character class may
increment "index" when matching, that must be done first before trying to match a fixed
character term later in the string.
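The idea behind the flag can be sketched in JavaScript. The function below is a hypothetical stand-in that takes the class members as a string; it is not the CharacterClassConstructor code, and it only shows why a class with non-BMP members can consume a variable number of code units.

```javascript
// Hypothetical stand-in for the non-BMP flag: returns true if any member
// of the class lies above U+FFFF and so needs a surrogate pair in UTF-16.
function hasNonBMPCharacters(members) {
  for (const ch of members) { // for...of iterates by code point
    if (ch.codePointAt(0) > 0xFFFF) return true;
  }
  return false;
}

// A class with a non-BMP member may consume one or two code units, so a
// fixed character after it cannot be checked at a constant offset first.
const classThenChar = /[\u{10123}a]x/u;
const twoUnitMatch = "\u{10123}x".match(classThenChar)[0].length; // class took 2 units
const oneUnitMatch = "ax".match(classThenChar)[0].length;         // class took 1 unit
```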
Given the usefulness of the LEA instruction on X86 to create a single pointer value from a
base with index and offset, which the YARR JIT uses heavily, I added a new macroAssembler
function, getEffectiveAddress64(), with an ARM64 implementation. It just calls x86Lea64()
on X86-64. Also added an ImplicitAddress version of load16Unaligned().
(JSC::MacroAssemblerARM64::load16Unaligned):
(JSC::MacroAssemblerARM64::getEffectiveAddress64):
* assembler/MacroAssemblerX86Common.h:
(JSC::MacroAssemblerX86Common::load16Unaligned):
(JSC::MacroAssemblerX86Common::load16):
* assembler/MacroAssemblerX86_64.h:
(JSC::MacroAssemblerX86_64::getEffectiveAddress64):
* create_regex_tables:
* runtime/RegExp.cpp:
(JSC::RegExp::compile):
* yarr/YarrInterpreter.cpp:
* yarr/YarrJIT.cpp:
(JSC::Yarr::YarrGenerator::optimizeAlternative):
(JSC::Yarr::YarrGenerator::matchCharacterClass):
(JSC::Yarr::YarrGenerator::tryReadUnicodeCharImpl):
(JSC::Yarr::YarrGenerator::tryReadUnicodeChar):
(JSC::Yarr::YarrGenerator::readCharacter):
(JSC::Yarr::YarrGenerator::jumpIfCharNotEquals):
(JSC::Yarr::YarrGenerator::matchAssertionWordchar):
(JSC::Yarr::YarrGenerator::generateAssertionWordBoundary):
(JSC::Yarr::YarrGenerator::generatePatternCharacterOnce):
(JSC::Yarr::YarrGenerator::generatePatternCharacterFixed):
(JSC::Yarr::YarrGenerator::generatePatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterGreedy):
(JSC::Yarr::YarrGenerator::generatePatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackPatternCharacterNonGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassOnce):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassOnce):
(JSC::Yarr::YarrGenerator::generateCharacterClassFixed):
(JSC::Yarr::YarrGenerator::generateCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassGreedy):
(JSC::Yarr::YarrGenerator::generateCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::backtrackCharacterClassNonGreedy):
(JSC::Yarr::YarrGenerator::generate):
(JSC::Yarr::YarrGenerator::backtrack):
(JSC::Yarr::YarrGenerator::generateTryReadUnicodeCharacterHelper):
(JSC::Yarr::YarrGenerator::generateEnter):
(JSC::Yarr::YarrGenerator::generateReturn):
(JSC::Yarr::YarrGenerator::YarrGenerator):
(JSC::Yarr::YarrGenerator::compile):
* yarr/YarrJIT.h:
* yarr/YarrPattern.cpp:
(JSC::Yarr::CharacterClassConstructor::CharacterClassConstructor):
(JSC::Yarr::CharacterClassConstructor::reset):
(JSC::Yarr::CharacterClassConstructor::charClass):
(JSC::Yarr::CharacterClassConstructor::addSorted):
(JSC::Yarr::CharacterClassConstructor::addSortedRange):
(JSC::Yarr::CharacterClassConstructor::hasNonBMPCharacters):
(JSC::Yarr::YarrPatternConstructor::setupAlternativeOffsets):
* yarr/YarrPattern.h:
(JSC::Yarr::CharacterClass::CharacterClass):
(JSC::Yarr::BackTrackInfoPatternCharacter::beginIndex):
(JSC::Yarr::BackTrackInfoPatternCharacter::matchAmountIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::beginIndex):
(JSC::Yarr::BackTrackInfoCharacterClass::matchAmountIndex):
(JSC::Yarr::BackTrackInfoBackReference::beginIndex):
(JSC::Yarr::BackTrackInfoBackReference::matchAmountIndex):
(JSC::Yarr::BackTrackInfoAlternative::offsetIndex):
(JSC::Yarr::BackTrackInfoParentheticalAssertion::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesOnce::beginIndex):
(JSC::Yarr::BackTrackInfoParenthesesTerminal::beginIndex):
LayoutTests:
Updated tests.
* js/regexp-unicode-expected.txt:
* js/script-tests/regexp-unicode.js:
Canonical link: https://commits.webkit.org/192507@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@221052 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2017-08-22 22:43:08 +00:00
PASS "C83|НАЧАТЬ".match(re8)[0] is "C83|НАЧАТЬ"
PASS "This.Is.16.Chars|НАЧАТЬ".match(re8)[0] is "This.Is.16.Chars|НАЧАТЬ"
PASS "Testing\nሴ 1 2 3".match(/^[က-] 1 2 3/um)[0] is "ሴ 1 2 3"
PASS "Testing\n𐃰 1 2 3".match(/^[က-] 1 2 3/um)[0] is "𐃰 1 2 3"
PASS "g\nሴ 1 2 3".match(/g\n^[က-] 1 2 3/um)[0] is "g\nሴ 1 2 3"
PASS "g\n𐃰 1 2 3".match(/g\n^[က-] 1 2 3/um)[0] is "g\n𐃰 1 2 3"
PASS "Testing ሴ\n1 2 3".match(/Testing [က-]$/um)[0] is "Testing ሴ"
PASS "Testing 𐃰\n1 2 3".match(/Testing [က-]$/um)[0] is "Testing 𐃰"
PASS "Testing ሴ\n1 2 3".match(/g [က-]$\n1/um)[0] is "g ሴ\n1"
PASS "Testing 𐃰\n1 2 3".match(/g [က-]$\n1/um)[0] is "g 𐃰\n1"
2016-03-04 01:24:28 +00:00
PASS "this is ba test".match(/is b\cha test/u)[0].length is 11
PASS new RegExp("\\/", "u").source is "\\/"
2020-03-31 01:27:10 +00:00
PASS r = new RegExp("\\u{110000}", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
2017-08-22 22:43:08 +00:00
PASS r = new RegExp("𐐅{2147483648}", "u") threw exception SyntaxError: Invalid regular expression: pattern exceeds string length limits.
2020-01-30 21:27:11 +00:00
PASS /{/u threw exception SyntaxError: Invalid regular expression: incomplete {} quantifier for Unicode pattern.
PASS /[a-\d]/u threw exception SyntaxError: Invalid regular expression: invalid range in character class for Unicode pattern.
2020-01-31 17:59:26 +00:00
PASS /]/u threw exception SyntaxError: Invalid regular expression: unmatched ] or } bracket for Unicode pattern.
2020-04-05 08:12:54 +00:00
PASS /\5/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\01/u threw exception SyntaxError: Invalid regular expression: invalid octal escape for Unicode pattern.
PASS /[\23]/u threw exception SyntaxError: Invalid regular expression: invalid octal escape for Unicode pattern.
2020-02-02 00:20:04 +00:00
PASS /\c9/u threw exception SyntaxError: Invalid regular expression: invalid \c escape for Unicode pattern.
2020-01-30 21:27:11 +00:00
PASS r = new RegExp("\\-", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
PASS r = new RegExp("\\a", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
PASS r = new RegExp("[\\a]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
PASS r = new RegExp("[\\B]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
PASS r = new RegExp("\\x", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
PASS r = new RegExp("[\\x]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
2020-03-31 01:27:10 +00:00
PASS r = new RegExp("\\u", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
PASS r = new RegExp("[\\u]", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
PASS r = new RegExp("\\u{", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
PASS r = new RegExp("\\u{\udead", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
2020-01-30 21:27:11 +00:00
PASS /\1/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\2/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\3/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\4/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\5/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\6/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\7/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\8/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
PASS /\9/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
2017-04-13 02:51:18 +00:00
PASS /(.)\1/u did not throw exception.
PASS /(.)(.)\2/u did not throw exception.
2020-01-30 21:27:11 +00:00
PASS /(.)(.)\3/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
2017-04-13 02:51:18 +00:00
PASS /\1/ did not throw exception.
PASS /\2/ did not throw exception.
PASS /\3/ did not throw exception.
PASS /\4/ did not throw exception.
PASS /\5/ did not throw exception.
PASS /\6/ did not throw exception.
PASS /\7/ did not throw exception.
PASS /\8/ did not throw exception.
PASS /\9/ did not throw exception.
2016-03-02 00:39:01 +00:00
PASS successfullyParsed is true
TEST COMPLETE