haikuwebkit/Source/WTF/wtf/SmallSet.h

306 lines
10 KiB
C
Raw Permalink Normal View History

Make AirAllocateRegistersByGraphColoring use less memory https://bugs.webkit.org/show_bug.cgi?id=225848 Reviewed by Filip Pizlo. Source/JavaScriptCore: We've had some jetsam problems caused by the main Air register allocator, which caused us to lower Options::maximumTmpsForGraphColoring. Hence this patch tries to improve the memory usage of the allocator. It includes several changes: - Change the datastructure used for representing the interference graph. Before it was effectively a HashSet<std::pair<uint16_t, uint16_t>. Now, it is either a Bitvector (for n < 400 for now, can be tweaked easily), or a Vector<LikelyDenseUnsignedIntegerSet<uint16_t>> otherwise. LikelyDenseUnsignedIntegerSet is a new datastructure introduced by this patch, it is either a HashSet if very sparse, or a BitVector + an amount to shift it by. This is by far the largest memory reduction in this patch, it reduces the maximum memory used for an interference graph in tsf-wasm in JetStream2 from 16MB to 700kB, and in mruby-wasm.aotoki.dev from 262MB to 20MB (the later only happen when we increase Options::maximumTmpsForGraphColoring.. this is the exact function which caused us to lower it). Its effect on smaller functions in JetStream2 is rarely as dramatic but always an improvement, and improvements between 2x and 5x are extremely common (10x to 30x are significantly rarer but do occur). - In order to easily test this change and any further change to this datastructure, the old approach was preserved as InterferenceHashSet, and a template to run two such datastructures in parallel, checking their equivalence was added: InstrumentedInterferenceGraph. Running with it and reportInterferenceGraphMemoryUse set to true was used to compute the numbers given above. - There was already some template parameter to change the size of the tmp indices from unsigned to uint16_t but the code failed to compile unless it was unsigned. I fixed this, made more consistent use of it, and switched to uint16_t in the very common case that we have less than 65k Tmps (we can have more despite the option because of spilling). This halved the memory usage of various other datastructures in the register allocator - unspillableTmps was a HashSet<unsigned>. Since it is often quite dense (often around 20% on JetStream2), I replaced it by a Bitvector instead - m_biases was a HashMap<IndexType, HashSet<IndexType>>. Since it is extremely rare that the sets have more than 8 elements (from looking at some instrumented run of JetStream2), I replaced it by HashMap<IndexType, SmallSet<IndexType>>. This not only significantly reduces memory, but nearly halves the time spent in assignColors(around 80ms -> 40ms in JetStream 2) - UseCounts was needlessly general: it is only used by the register allocator (all other references to UseCounts refer to the completely different B3::UseCounts), so there is no point in it computing, and then storing lots of irrelevant data. A float is also more than enough precision (especially since it is pretty much always 1, 10, 100, or 1000 in practice…). Also, since we only need it indexed by Tmps, we can use a Vector with AbsoluteTmpMapper instead of its HashMap. These changes are not just memory savings, they also make selectSpill way faster (570ms -> 250ms on my machine on JetStream2) - While I was at it, I did a couple of other tweaks to the logic of selectSpill. In particular, instead of having to check for isFastTmp every time, I just put the fast tmps directly in unspillableTmps, which prevents them from getting added to m_spillWorklist in the first place. This + a bit of clean-up (for example putting an early exit instead of setting score to infinity in the case of dead tmps) resulted in a further perf win (to roughly 200ms spent in selectSpill() on JetStream2) All together, this patch reduces the time spent in the register allocator by roughly 15 to 20% in JetStream2 (tested both with the Briggs and the IRC allocators on my MBP 2019). I do not yet have precise performance numbers for this exact patch, but benchmarking a previous version of it (with a less optimized interference graph) resulted in significant RAMification improvements (around 1%), and more surprisingly some JetStream2 improvements on weaker machines (e.g. an iPhone 7 gained > 1%). I believe these gains come either from less trashing of the caches, or less contention caused by the memory traffic. I will try to update the bugzilla with more up-to-date thorough results when I get them. This patch does not increase Options::maximumTmpsForGraphColoring, I intend to do that in a separate patch to make it easier to revert in case of a problem. * b3/B3ReduceLoopStrength.cpp: (JSC::B3::ReduceLoopStrength::reduceByteCopyLoopsToMemcpy): * b3/air/AirAllocateRegistersAndStackByLinearScan.cpp: * b3/air/AirAllocateRegistersByGraphColoring.cpp: (JSC::B3::Air::allocateRegistersByGraphColoring): * b3/air/AirCode.h: (JSC::B3::Air::Code::forEachFastTmp const): * b3/air/AirUseCounts.h: (JSC::B3::Air::UseCounts::UseCounts): (JSC::B3::Air::UseCounts::isConstDef const): (JSC::B3::Air::UseCounts::numWarmUsesAndDefs const): (JSC::B3::Air::UseCounts::dump const): * parser/Nodes.h: Source/WTF: Two changes: the addition of LikelyDenseUnsignedIntegerSet, and various improvements to Small(Ptr)Set. The latter include: - Renaming SmallPtrSet into SmallSet, as it now supports integers as well as pointers. - Reducing its size by sharing the same storage for m_buffer and for m_smallStorage. This is safe to do, because all operations branch on isSmall() which depends purely on m_capacity. - Adding trivial size(), isEmpty() and memoryUse() methods - Adding a comment at the top of the file explaining when to use, and (more importantly) not to use SmallSet. LikelyDenseUnsignedIntegerSet is an even more specialized data structure, that can represent sets of unsigned integers very compactly if they are clustered. Finally I added an outOfLineMemoryUse() method to BitVector, making it more convenient to compare the memory consumption of different data structures in the register allocator. * WTF.xcodeproj/project.pbxproj: * wtf/BitVector.h: * wtf/CMakeLists.txt: * wtf/LikelyDenseUnsignedIntegerSet.cpp: Copied from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/LikelyDenseUnsignedIntegerSet.h: Added. (WTF::LikelyDenseUnsignedIntegerSet::LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::~LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::contains const): (WTF::LikelyDenseUnsignedIntegerSet::add): (WTF::LikelyDenseUnsignedIntegerSet::size const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::iterator): (WTF::LikelyDenseUnsignedIntegerSet::iterator::m_shift): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator++): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator* const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator== const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator!= const): (WTF::LikelyDenseUnsignedIntegerSet::begin const): (WTF::LikelyDenseUnsignedIntegerSet::end const): (WTF::LikelyDenseUnsignedIntegerSet::memoryUse const): (WTF::LikelyDenseUnsignedIntegerSet::validate const): (WTF::LikelyDenseUnsignedIntegerSet::isBitVector const): (WTF::LikelyDenseUnsignedIntegerSet::isValidValue const): (WTF::LikelyDenseUnsignedIntegerSet::transitionToHashSet): (WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector): * wtf/SmallPtrSet.h: Removed. * wtf/SmallSet.cpp: Renamed from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/SmallSet.h: Added. Tools: Simply added some tests for SmallSet. * TestWebKitAPI/TestWebKitAPI.xcodeproj/project.pbxproj: * TestWebKitAPI/Tests/WTF/SmallSet.cpp: Added. (TestWebKitAPI::testSmallSetOfUnsigned): (TestWebKitAPI::testSmallSetOfPointers): (TestWebKitAPI::testVectorsOfSmallSetsOfUnsigned): (TestWebKitAPI::TEST): Canonical link: https://commits.webkit.org/237893@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@277714 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2021-05-19 03:24:14 +00:00
/*
* Copyright (C) 2016-2021 Apple Inc. All Rights Reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
* EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
* EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
* PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#pragma once
#include <wtf/Assertions.h>
#include <wtf/FastMalloc.h>
#include <wtf/HashFunctions.h>
#include <wtf/Noncopyable.h>
namespace WTF {
DECLARE_ALLOCATOR_WITH_HEAP_IDENTIFIER(SmallSet);
// Functionally, this class is very similar to Variant<Vector<T, SmallArraySize>, HashSet<T>>
// It is optimized primarily for space, but is also quite fast
// Its main limitation is that it has no way to remove elements once they have been added to it
// Also, instead of being fully parameterized by a HashTrait parameter, it always uses -1 (all ones) as its empty value
// Relatedly, it can only store objects of up to 64 bit size (but that particular limitation should be fairly easy to lift if needed)
// Use it whenever you need to store an unbounded but probably small number of unsigned integers or pointers.
template<typename T, typename Hash = PtrHashBase<T, false /* isSmartPtr */>, unsigned SmallArraySize = 8>
class SmallSet {
WTF_MAKE_FAST_ALLOCATED;
WTF_MAKE_NONCOPYABLE(SmallSet);
static_assert(std::is_trivially_destructible<T>::value, "We currently don't support non-trivially destructible types.");
static_assert(!(SmallArraySize & (SmallArraySize - 1)), "Inline size must be a power of two.");
static_assert(sizeof(T*) <= SmallArraySize * sizeof(T), "This class has not been tested for m_inline.buffer larger than m_inline.smallStorage");
public:
SmallSet()
: m_inline()
{
initialize();
}
// We take care to have SmallSet have partial move semantics allowable through
// memcpy. It's partial move semantics because our destructor should not be called
// on the SmallPtrObject in the old memory we were moved from (otherwise, we might free m_buffer twice)
// unless that old memory is reset to be isSmall(). See move constructor below.
// To maintain these semantics, we determine if we're small by checking our size
// and not our m_buffer pointer. And when we're small, we don't do operations on
// m_buffer, instead, we perform operations on m_smallStorage directly. The reason we want
// these semantics is that it's beneficial to have a Vector that contains SmallSet
// (or an object with SmallSet as a field) be allowed to use memcpy for its move operation.
SmallSet(SmallSet&& other)
{
memcpy(static_cast<void*>(this), static_cast<void*>(&other), sizeof(SmallSet));
Make AirAllocateRegistersByGraphColoring use less memory https://bugs.webkit.org/show_bug.cgi?id=225848 Reviewed by Filip Pizlo. Source/JavaScriptCore: We've had some jetsam problems caused by the main Air register allocator, which caused us to lower Options::maximumTmpsForGraphColoring. Hence this patch tries to improve the memory usage of the allocator. It includes several changes: - Change the datastructure used for representing the interference graph. Before it was effectively a HashSet<std::pair<uint16_t, uint16_t>. Now, it is either a Bitvector (for n < 400 for now, can be tweaked easily), or a Vector<LikelyDenseUnsignedIntegerSet<uint16_t>> otherwise. LikelyDenseUnsignedIntegerSet is a new datastructure introduced by this patch, it is either a HashSet if very sparse, or a BitVector + an amount to shift it by. This is by far the largest memory reduction in this patch, it reduces the maximum memory used for an interference graph in tsf-wasm in JetStream2 from 16MB to 700kB, and in mruby-wasm.aotoki.dev from 262MB to 20MB (the later only happen when we increase Options::maximumTmpsForGraphColoring.. this is the exact function which caused us to lower it). Its effect on smaller functions in JetStream2 is rarely as dramatic but always an improvement, and improvements between 2x and 5x are extremely common (10x to 30x are significantly rarer but do occur). - In order to easily test this change and any further change to this datastructure, the old approach was preserved as InterferenceHashSet, and a template to run two such datastructures in parallel, checking their equivalence was added: InstrumentedInterferenceGraph. Running with it and reportInterferenceGraphMemoryUse set to true was used to compute the numbers given above. - There was already some template parameter to change the size of the tmp indices from unsigned to uint16_t but the code failed to compile unless it was unsigned. I fixed this, made more consistent use of it, and switched to uint16_t in the very common case that we have less than 65k Tmps (we can have more despite the option because of spilling). This halved the memory usage of various other datastructures in the register allocator - unspillableTmps was a HashSet<unsigned>. Since it is often quite dense (often around 20% on JetStream2), I replaced it by a Bitvector instead - m_biases was a HashMap<IndexType, HashSet<IndexType>>. Since it is extremely rare that the sets have more than 8 elements (from looking at some instrumented run of JetStream2), I replaced it by HashMap<IndexType, SmallSet<IndexType>>. This not only significantly reduces memory, but nearly halves the time spent in assignColors(around 80ms -> 40ms in JetStream 2) - UseCounts was needlessly general: it is only used by the register allocator (all other references to UseCounts refer to the completely different B3::UseCounts), so there is no point in it computing, and then storing lots of irrelevant data. A float is also more than enough precision (especially since it is pretty much always 1, 10, 100, or 1000 in practice…). Also, since we only need it indexed by Tmps, we can use a Vector with AbsoluteTmpMapper instead of its HashMap. These changes are not just memory savings, they also make selectSpill way faster (570ms -> 250ms on my machine on JetStream2) - While I was at it, I did a couple of other tweaks to the logic of selectSpill. In particular, instead of having to check for isFastTmp every time, I just put the fast tmps directly in unspillableTmps, which prevents them from getting added to m_spillWorklist in the first place. This + a bit of clean-up (for example putting an early exit instead of setting score to infinity in the case of dead tmps) resulted in a further perf win (to roughly 200ms spent in selectSpill() on JetStream2) All together, this patch reduces the time spent in the register allocator by roughly 15 to 20% in JetStream2 (tested both with the Briggs and the IRC allocators on my MBP 2019). I do not yet have precise performance numbers for this exact patch, but benchmarking a previous version of it (with a less optimized interference graph) resulted in significant RAMification improvements (around 1%), and more surprisingly some JetStream2 improvements on weaker machines (e.g. an iPhone 7 gained > 1%). I believe these gains come either from less trashing of the caches, or less contention caused by the memory traffic. I will try to update the bugzilla with more up-to-date thorough results when I get them. This patch does not increase Options::maximumTmpsForGraphColoring, I intend to do that in a separate patch to make it easier to revert in case of a problem. * b3/B3ReduceLoopStrength.cpp: (JSC::B3::ReduceLoopStrength::reduceByteCopyLoopsToMemcpy): * b3/air/AirAllocateRegistersAndStackByLinearScan.cpp: * b3/air/AirAllocateRegistersByGraphColoring.cpp: (JSC::B3::Air::allocateRegistersByGraphColoring): * b3/air/AirCode.h: (JSC::B3::Air::Code::forEachFastTmp const): * b3/air/AirUseCounts.h: (JSC::B3::Air::UseCounts::UseCounts): (JSC::B3::Air::UseCounts::isConstDef const): (JSC::B3::Air::UseCounts::numWarmUsesAndDefs const): (JSC::B3::Air::UseCounts::dump const): * parser/Nodes.h: Source/WTF: Two changes: the addition of LikelyDenseUnsignedIntegerSet, and various improvements to Small(Ptr)Set. The latter include: - Renaming SmallPtrSet into SmallSet, as it now supports integers as well as pointers. - Reducing its size by sharing the same storage for m_buffer and for m_smallStorage. This is safe to do, because all operations branch on isSmall() which depends purely on m_capacity. - Adding trivial size(), isEmpty() and memoryUse() methods - Adding a comment at the top of the file explaining when to use, and (more importantly) not to use SmallSet. LikelyDenseUnsignedIntegerSet is an even more specialized data structure, that can represent sets of unsigned integers very compactly if they are clustered. Finally I added an outOfLineMemoryUse() method to BitVector, making it more convenient to compare the memory consumption of different data structures in the register allocator. * WTF.xcodeproj/project.pbxproj: * wtf/BitVector.h: * wtf/CMakeLists.txt: * wtf/LikelyDenseUnsignedIntegerSet.cpp: Copied from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/LikelyDenseUnsignedIntegerSet.h: Added. (WTF::LikelyDenseUnsignedIntegerSet::LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::~LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::contains const): (WTF::LikelyDenseUnsignedIntegerSet::add): (WTF::LikelyDenseUnsignedIntegerSet::size const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::iterator): (WTF::LikelyDenseUnsignedIntegerSet::iterator::m_shift): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator++): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator* const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator== const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator!= const): (WTF::LikelyDenseUnsignedIntegerSet::begin const): (WTF::LikelyDenseUnsignedIntegerSet::end const): (WTF::LikelyDenseUnsignedIntegerSet::memoryUse const): (WTF::LikelyDenseUnsignedIntegerSet::validate const): (WTF::LikelyDenseUnsignedIntegerSet::isBitVector const): (WTF::LikelyDenseUnsignedIntegerSet::isValidValue const): (WTF::LikelyDenseUnsignedIntegerSet::transitionToHashSet): (WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector): * wtf/SmallPtrSet.h: Removed. * wtf/SmallSet.cpp: Renamed from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/SmallSet.h: Added. Tools: Simply added some tests for SmallSet. * TestWebKitAPI/TestWebKitAPI.xcodeproj/project.pbxproj: * TestWebKitAPI/Tests/WTF/SmallSet.cpp: Added. (TestWebKitAPI::testSmallSetOfUnsigned): (TestWebKitAPI::testSmallSetOfPointers): (TestWebKitAPI::testVectorsOfSmallSetsOfUnsigned): (TestWebKitAPI::TEST): Canonical link: https://commits.webkit.org/237893@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@277714 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2021-05-19 03:24:14 +00:00
other.initialize();
}
SmallSet& operator=(SmallSet&& other)
{
this->~SmallSet();
new (this) SmallSet(WTFMove(other));
return *this;
}
~SmallSet()
{
if (!isSmall())
SmallSetMalloc::free(m_inline.buffer);
}
// We could easily include an iterator in this to fully match the HashSet interface, but currently none of our clients require it.
struct AddResult {
bool isNewEntry;
};
inline AddResult add(T value)
{
ASSERT(isValidEntry(value));
if (isSmall()) {
for (unsigned i = 0; i < m_size; i++) {
if (m_inline.smallStorage[i] == value)
return { false };
}
if (m_size < SmallArraySize) {
m_inline.smallStorage[m_size] = value;
++m_size;
return { true };
}
grow(std::max(64u, SmallArraySize * 2));
// Fall through. We're no longer small :(
}
// If we're more than 3/4ths full we grow.
if (UNLIKELY(m_size * 4 >= m_capacity * 3)) {
grow(m_capacity * 2);
ASSERT(!(m_capacity & (m_capacity - 1)));
}
T* bucket = this->bucket(value);
if (*bucket != value) {
*bucket = value;
++m_size;
return { true };
}
return { false };
}
inline bool contains(T value) const
{
ASSERT(isValidEntry(value));
if (isSmall()) {
// We only need to search up to m_size because we store things linearly inside m_smallStorage.
for (unsigned i = 0; i < m_size; i++) {
if (m_inline.smallStorage[i] == value)
return true;
}
return false;
}
T* bucket = this->bucket(value);
return *bucket == value;
}
class iterator {
WTF_MAKE_FAST_ALLOCATED;
public:
iterator& operator++()
{
m_index++;
ASSERT(m_index <= m_capacity);
while (m_index < m_capacity && m_buffer[m_index] == emptyValue())
m_index++;
return *this;
}
T operator*() const { ASSERT(m_index < m_capacity); return static_cast<T>(m_buffer[m_index]); }
bool operator==(const iterator& other) const { ASSERT(m_buffer == other.m_buffer); return m_index == other.m_index; }
bool operator!=(const iterator& other) const { ASSERT(m_buffer == other.m_buffer); return !(*this == other); }
private:
template<typename U, typename H, unsigned S> friend class WTF::SmallSet;
unsigned m_index;
unsigned m_capacity;
T* m_buffer;
};
iterator begin() const
{
iterator it;
it.m_index = std::numeric_limits<unsigned>::max();
it.m_capacity = m_capacity;
if (isSmall())
it.m_buffer = const_cast<T*>(m_inline.smallStorage);
else
it.m_buffer = m_inline.buffer;
++it;
return it;
}
iterator end() const
{
iterator it;
it.m_index = m_capacity;
it.m_capacity = m_capacity;
if (isSmall())
it.m_buffer = const_cast<T*>(m_inline.smallStorage);
else
it.m_buffer = m_inline.buffer;
return it;
}
inline unsigned size() const { return m_size; }
inline bool isEmpty() const { return !size(); }
unsigned memoryUse() const
{
unsigned memory = sizeof(SmallSet);
if (!isSmall())
memory += m_capacity * sizeof(T);
return memory;
}
private:
constexpr static T emptyValue()
{
if constexpr (std::is_pointer<T>::value)
return static_cast<T>(bitwise_cast<void*>(std::numeric_limits<uintptr_t>::max()));
return std::numeric_limits<T>::max();
}
bool isValidEntry(const T value) const
{
return value != emptyValue();
}
inline bool isSmall() const
{
return m_capacity == SmallArraySize;
}
inline void initialize()
{
m_size = 0;
m_capacity = SmallArraySize;
memset(static_cast<void*>(m_inline.smallStorage), -1, sizeof(T) * SmallArraySize);
Make AirAllocateRegistersByGraphColoring use less memory https://bugs.webkit.org/show_bug.cgi?id=225848 Reviewed by Filip Pizlo. Source/JavaScriptCore: We've had some jetsam problems caused by the main Air register allocator, which caused us to lower Options::maximumTmpsForGraphColoring. Hence this patch tries to improve the memory usage of the allocator. It includes several changes: - Change the datastructure used for representing the interference graph. Before it was effectively a HashSet<std::pair<uint16_t, uint16_t>. Now, it is either a Bitvector (for n < 400 for now, can be tweaked easily), or a Vector<LikelyDenseUnsignedIntegerSet<uint16_t>> otherwise. LikelyDenseUnsignedIntegerSet is a new datastructure introduced by this patch, it is either a HashSet if very sparse, or a BitVector + an amount to shift it by. This is by far the largest memory reduction in this patch, it reduces the maximum memory used for an interference graph in tsf-wasm in JetStream2 from 16MB to 700kB, and in mruby-wasm.aotoki.dev from 262MB to 20MB (the later only happen when we increase Options::maximumTmpsForGraphColoring.. this is the exact function which caused us to lower it). Its effect on smaller functions in JetStream2 is rarely as dramatic but always an improvement, and improvements between 2x and 5x are extremely common (10x to 30x are significantly rarer but do occur). - In order to easily test this change and any further change to this datastructure, the old approach was preserved as InterferenceHashSet, and a template to run two such datastructures in parallel, checking their equivalence was added: InstrumentedInterferenceGraph. Running with it and reportInterferenceGraphMemoryUse set to true was used to compute the numbers given above. - There was already some template parameter to change the size of the tmp indices from unsigned to uint16_t but the code failed to compile unless it was unsigned. I fixed this, made more consistent use of it, and switched to uint16_t in the very common case that we have less than 65k Tmps (we can have more despite the option because of spilling). This halved the memory usage of various other datastructures in the register allocator - unspillableTmps was a HashSet<unsigned>. Since it is often quite dense (often around 20% on JetStream2), I replaced it by a Bitvector instead - m_biases was a HashMap<IndexType, HashSet<IndexType>>. Since it is extremely rare that the sets have more than 8 elements (from looking at some instrumented run of JetStream2), I replaced it by HashMap<IndexType, SmallSet<IndexType>>. This not only significantly reduces memory, but nearly halves the time spent in assignColors(around 80ms -> 40ms in JetStream 2) - UseCounts was needlessly general: it is only used by the register allocator (all other references to UseCounts refer to the completely different B3::UseCounts), so there is no point in it computing, and then storing lots of irrelevant data. A float is also more than enough precision (especially since it is pretty much always 1, 10, 100, or 1000 in practice…). Also, since we only need it indexed by Tmps, we can use a Vector with AbsoluteTmpMapper instead of its HashMap. These changes are not just memory savings, they also make selectSpill way faster (570ms -> 250ms on my machine on JetStream2) - While I was at it, I did a couple of other tweaks to the logic of selectSpill. In particular, instead of having to check for isFastTmp every time, I just put the fast tmps directly in unspillableTmps, which prevents them from getting added to m_spillWorklist in the first place. This + a bit of clean-up (for example putting an early exit instead of setting score to infinity in the case of dead tmps) resulted in a further perf win (to roughly 200ms spent in selectSpill() on JetStream2) All together, this patch reduces the time spent in the register allocator by roughly 15 to 20% in JetStream2 (tested both with the Briggs and the IRC allocators on my MBP 2019). I do not yet have precise performance numbers for this exact patch, but benchmarking a previous version of it (with a less optimized interference graph) resulted in significant RAMification improvements (around 1%), and more surprisingly some JetStream2 improvements on weaker machines (e.g. an iPhone 7 gained > 1%). I believe these gains come either from less trashing of the caches, or less contention caused by the memory traffic. I will try to update the bugzilla with more up-to-date thorough results when I get them. This patch does not increase Options::maximumTmpsForGraphColoring, I intend to do that in a separate patch to make it easier to revert in case of a problem. * b3/B3ReduceLoopStrength.cpp: (JSC::B3::ReduceLoopStrength::reduceByteCopyLoopsToMemcpy): * b3/air/AirAllocateRegistersAndStackByLinearScan.cpp: * b3/air/AirAllocateRegistersByGraphColoring.cpp: (JSC::B3::Air::allocateRegistersByGraphColoring): * b3/air/AirCode.h: (JSC::B3::Air::Code::forEachFastTmp const): * b3/air/AirUseCounts.h: (JSC::B3::Air::UseCounts::UseCounts): (JSC::B3::Air::UseCounts::isConstDef const): (JSC::B3::Air::UseCounts::numWarmUsesAndDefs const): (JSC::B3::Air::UseCounts::dump const): * parser/Nodes.h: Source/WTF: Two changes: the addition of LikelyDenseUnsignedIntegerSet, and various improvements to Small(Ptr)Set. The latter include: - Renaming SmallPtrSet into SmallSet, as it now supports integers as well as pointers. - Reducing its size by sharing the same storage for m_buffer and for m_smallStorage. This is safe to do, because all operations branch on isSmall() which depends purely on m_capacity. - Adding trivial size(), isEmpty() and memoryUse() methods - Adding a comment at the top of the file explaining when to use, and (more importantly) not to use SmallSet. LikelyDenseUnsignedIntegerSet is an even more specialized data structure, that can represent sets of unsigned integers very compactly if they are clustered. Finally I added an outOfLineMemoryUse() method to BitVector, making it more convenient to compare the memory consumption of different data structures in the register allocator. * WTF.xcodeproj/project.pbxproj: * wtf/BitVector.h: * wtf/CMakeLists.txt: * wtf/LikelyDenseUnsignedIntegerSet.cpp: Copied from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/LikelyDenseUnsignedIntegerSet.h: Added. (WTF::LikelyDenseUnsignedIntegerSet::LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::~LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::contains const): (WTF::LikelyDenseUnsignedIntegerSet::add): (WTF::LikelyDenseUnsignedIntegerSet::size const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::iterator): (WTF::LikelyDenseUnsignedIntegerSet::iterator::m_shift): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator++): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator* const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator== const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator!= const): (WTF::LikelyDenseUnsignedIntegerSet::begin const): (WTF::LikelyDenseUnsignedIntegerSet::end const): (WTF::LikelyDenseUnsignedIntegerSet::memoryUse const): (WTF::LikelyDenseUnsignedIntegerSet::validate const): (WTF::LikelyDenseUnsignedIntegerSet::isBitVector const): (WTF::LikelyDenseUnsignedIntegerSet::isValidValue const): (WTF::LikelyDenseUnsignedIntegerSet::transitionToHashSet): (WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector): * wtf/SmallPtrSet.h: Removed. * wtf/SmallSet.cpp: Renamed from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/SmallSet.h: Added. Tools: Simply added some tests for SmallSet. * TestWebKitAPI/TestWebKitAPI.xcodeproj/project.pbxproj: * TestWebKitAPI/Tests/WTF/SmallSet.cpp: Added. (TestWebKitAPI::testSmallSetOfUnsigned): (TestWebKitAPI::testSmallSetOfPointers): (TestWebKitAPI::testVectorsOfSmallSetsOfUnsigned): (TestWebKitAPI::TEST): Canonical link: https://commits.webkit.org/237893@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@277714 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2021-05-19 03:24:14 +00:00
ASSERT(isSmall());
}
inline void grow(unsigned size)
{
// We memset the new buffer with -1, so for consistency emptyValue() must return something which is all 1s.
#if !defined(NDEBUG)
if constexpr (std::is_pointer<T>::value)
ASSERT(bitwise_cast<intptr_t>(emptyValue()) == -1ll);
else if constexpr (sizeof(T) == 8)
ASSERT(bitwise_cast<int64_t>(emptyValue()) == -1ll);
else if constexpr (sizeof(T) == 4)
ASSERT(bitwise_cast<int32_t>(emptyValue()) == -1);
else if constexpr (sizeof(T) == 2)
ASSERT(bitwise_cast<int16_t>(emptyValue()) == -1);
else if constexpr (sizeof(T) == 1)
ASSERT(bitwise_cast<int8_t>(emptyValue()) == -1);
else
RELEASE_ASSERT_NOT_REACHED();
#endif
size_t allocationSize = sizeof(T) * size;
bool wasSmall = isSmall();
T* oldBuffer = wasSmall ? m_inline.smallStorage : m_inline.buffer;
unsigned oldCapacity = m_capacity;
T* newBuffer = static_cast<T*>(SmallSetMalloc::malloc(allocationSize));
memset(static_cast<void*>(newBuffer), -1, allocationSize);
Make AirAllocateRegistersByGraphColoring use less memory https://bugs.webkit.org/show_bug.cgi?id=225848 Reviewed by Filip Pizlo. Source/JavaScriptCore: We've had some jetsam problems caused by the main Air register allocator, which caused us to lower Options::maximumTmpsForGraphColoring. Hence this patch tries to improve the memory usage of the allocator. It includes several changes: - Change the datastructure used for representing the interference graph. Before it was effectively a HashSet<std::pair<uint16_t, uint16_t>. Now, it is either a Bitvector (for n < 400 for now, can be tweaked easily), or a Vector<LikelyDenseUnsignedIntegerSet<uint16_t>> otherwise. LikelyDenseUnsignedIntegerSet is a new datastructure introduced by this patch, it is either a HashSet if very sparse, or a BitVector + an amount to shift it by. This is by far the largest memory reduction in this patch, it reduces the maximum memory used for an interference graph in tsf-wasm in JetStream2 from 16MB to 700kB, and in mruby-wasm.aotoki.dev from 262MB to 20MB (the later only happen when we increase Options::maximumTmpsForGraphColoring.. this is the exact function which caused us to lower it). Its effect on smaller functions in JetStream2 is rarely as dramatic but always an improvement, and improvements between 2x and 5x are extremely common (10x to 30x are significantly rarer but do occur). - In order to easily test this change and any further change to this datastructure, the old approach was preserved as InterferenceHashSet, and a template to run two such datastructures in parallel, checking their equivalence was added: InstrumentedInterferenceGraph. Running with it and reportInterferenceGraphMemoryUse set to true was used to compute the numbers given above. - There was already some template parameter to change the size of the tmp indices from unsigned to uint16_t but the code failed to compile unless it was unsigned. I fixed this, made more consistent use of it, and switched to uint16_t in the very common case that we have less than 65k Tmps (we can have more despite the option because of spilling). This halved the memory usage of various other datastructures in the register allocator - unspillableTmps was a HashSet<unsigned>. Since it is often quite dense (often around 20% on JetStream2), I replaced it by a Bitvector instead - m_biases was a HashMap<IndexType, HashSet<IndexType>>. Since it is extremely rare that the sets have more than 8 elements (from looking at some instrumented run of JetStream2), I replaced it by HashMap<IndexType, SmallSet<IndexType>>. This not only significantly reduces memory, but nearly halves the time spent in assignColors(around 80ms -> 40ms in JetStream 2) - UseCounts was needlessly general: it is only used by the register allocator (all other references to UseCounts refer to the completely different B3::UseCounts), so there is no point in it computing, and then storing lots of irrelevant data. A float is also more than enough precision (especially since it is pretty much always 1, 10, 100, or 1000 in practice…). Also, since we only need it indexed by Tmps, we can use a Vector with AbsoluteTmpMapper instead of its HashMap. These changes are not just memory savings, they also make selectSpill way faster (570ms -> 250ms on my machine on JetStream2) - While I was at it, I did a couple of other tweaks to the logic of selectSpill. In particular, instead of having to check for isFastTmp every time, I just put the fast tmps directly in unspillableTmps, which prevents them from getting added to m_spillWorklist in the first place. This + a bit of clean-up (for example putting an early exit instead of setting score to infinity in the case of dead tmps) resulted in a further perf win (to roughly 200ms spent in selectSpill() on JetStream2) All together, this patch reduces the time spent in the register allocator by roughly 15 to 20% in JetStream2 (tested both with the Briggs and the IRC allocators on my MBP 2019). I do not yet have precise performance numbers for this exact patch, but benchmarking a previous version of it (with a less optimized interference graph) resulted in significant RAMification improvements (around 1%), and more surprisingly some JetStream2 improvements on weaker machines (e.g. an iPhone 7 gained > 1%). I believe these gains come either from less trashing of the caches, or less contention caused by the memory traffic. I will try to update the bugzilla with more up-to-date thorough results when I get them. This patch does not increase Options::maximumTmpsForGraphColoring, I intend to do that in a separate patch to make it easier to revert in case of a problem. * b3/B3ReduceLoopStrength.cpp: (JSC::B3::ReduceLoopStrength::reduceByteCopyLoopsToMemcpy): * b3/air/AirAllocateRegistersAndStackByLinearScan.cpp: * b3/air/AirAllocateRegistersByGraphColoring.cpp: (JSC::B3::Air::allocateRegistersByGraphColoring): * b3/air/AirCode.h: (JSC::B3::Air::Code::forEachFastTmp const): * b3/air/AirUseCounts.h: (JSC::B3::Air::UseCounts::UseCounts): (JSC::B3::Air::UseCounts::isConstDef const): (JSC::B3::Air::UseCounts::numWarmUsesAndDefs const): (JSC::B3::Air::UseCounts::dump const): * parser/Nodes.h: Source/WTF: Two changes: the addition of LikelyDenseUnsignedIntegerSet, and various improvements to Small(Ptr)Set. The latter include: - Renaming SmallPtrSet into SmallSet, as it now supports integers as well as pointers. - Reducing its size by sharing the same storage for m_buffer and for m_smallStorage. This is safe to do, because all operations branch on isSmall() which depends purely on m_capacity. - Adding trivial size(), isEmpty() and memoryUse() methods - Adding a comment at the top of the file explaining when to use, and (more importantly) not to use SmallSet. LikelyDenseUnsignedIntegerSet is an even more specialized data structure, that can represent sets of unsigned integers very compactly if they are clustered. Finally I added an outOfLineMemoryUse() method to BitVector, making it more convenient to compare the memory consumption of different data structures in the register allocator. * WTF.xcodeproj/project.pbxproj: * wtf/BitVector.h: * wtf/CMakeLists.txt: * wtf/LikelyDenseUnsignedIntegerSet.cpp: Copied from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/LikelyDenseUnsignedIntegerSet.h: Added. (WTF::LikelyDenseUnsignedIntegerSet::LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::~LikelyDenseUnsignedIntegerSet): (WTF::LikelyDenseUnsignedIntegerSet::contains const): (WTF::LikelyDenseUnsignedIntegerSet::add): (WTF::LikelyDenseUnsignedIntegerSet::size const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::iterator): (WTF::LikelyDenseUnsignedIntegerSet::iterator::m_shift): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator++): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator* const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator== const): (WTF::LikelyDenseUnsignedIntegerSet::iterator::operator!= const): (WTF::LikelyDenseUnsignedIntegerSet::begin const): (WTF::LikelyDenseUnsignedIntegerSet::end const): (WTF::LikelyDenseUnsignedIntegerSet::memoryUse const): (WTF::LikelyDenseUnsignedIntegerSet::validate const): (WTF::LikelyDenseUnsignedIntegerSet::isBitVector const): (WTF::LikelyDenseUnsignedIntegerSet::isValidValue const): (WTF::LikelyDenseUnsignedIntegerSet::transitionToHashSet): (WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector): * wtf/SmallPtrSet.h: Removed. * wtf/SmallSet.cpp: Renamed from Source/WTF/wtf/SmallPtrSet.cpp. * wtf/SmallSet.h: Added. Tools: Simply added some tests for SmallSet. * TestWebKitAPI/TestWebKitAPI.xcodeproj/project.pbxproj: * TestWebKitAPI/Tests/WTF/SmallSet.cpp: Added. (TestWebKitAPI::testSmallSetOfUnsigned): (TestWebKitAPI::testSmallSetOfPointers): (TestWebKitAPI::testVectorsOfSmallSetsOfUnsigned): (TestWebKitAPI::TEST): Canonical link: https://commits.webkit.org/237893@main git-svn-id: https://svn.webkit.org/repository/webkit/trunk@277714 268f45cc-cd09-0410-ab3c-d52691b4dbfc
2021-05-19 03:24:14 +00:00
m_capacity = size;
for (unsigned i = 0; i < oldCapacity; i++) {
if (oldBuffer[i] != emptyValue()) {
T* ptr = bucketInBuffer(newBuffer, static_cast<T>(oldBuffer[i]));
*ptr = oldBuffer[i];
}
}
if (!wasSmall)
SmallSetMalloc::free(oldBuffer);
m_inline.buffer = newBuffer;
}
inline T* bucket(T target) const
{
ASSERT(!isSmall());
return bucketInBuffer(m_inline.buffer, target);
}
inline T* bucketInBuffer(T* buffer, T target) const
{
ASSERT(!(m_capacity & (m_capacity - 1)));
unsigned bucket = Hash::hash(target) & (m_capacity - 1);
unsigned index = 0;
while (true) {
T* ptr = buffer + bucket;
if (*ptr == emptyValue())
return ptr;
if (*ptr == target)
return ptr;
index++;
bucket = (bucket + index) & (m_capacity - 1);
}
}
unsigned m_size;
unsigned m_capacity;
union U {
T* buffer;
T smallStorage[SmallArraySize];
U() { };
} m_inline;
};
} // namespace WTF
using WTF::SmallSet;