First

2024-07-02 18:10:39 +02:00
commit 48ab06b1d9
733 changed files with 321088 additions and 0 deletions
@@ -0,0 +1,416 @@
+## Introduction to tuple_vector
+
+`tuple_vector` is a data container that is designed to abstract and simplify
+the handling of a "structure of arrays" layout of data in memory. In
+particular, it mimics the interface of `vector`, including functionality to do
+inserts, erases, push_backs, and random-access. It also provides a
+`RandomAccessIterator` and corresponding functionality, making it compatible
+with most STL (and STL-esque) algorithms such as ranged-for loops, `find_if`,
+`remove_if`, or `sort`.
+
+When used or applied properly, this container can improve performance of
+some algorithms through cache-coherent data accesses or allowing for
+sensible SIMD programming, while keeping the structure of a single
+container, to permit a developer to continue to use existing algorithms in
+STL and the like.
+
+## Review of "Structure of arrays" data layouts
+
+When trying to improve the performance of some code, it can sometimes be
+desirable to transform how some data is stored in memory to be laid out not as
+an "array of structures", but as a "structure of arrays". That is, instead of
+storing a series of objects as a single contiguous chunk of memory, one or 
+more data members are instead stored as separate chunks of memory that are 
+handled and accessed in parallel to each other.
+
+This can be beneficial in two primary respects:
+
+1) To improve the cache coherency of the data accesses, e.g. by utilizing more
+data that is loaded per cache line loaded from memory, and thereby reducing 
+the amount of time waiting on memory accesses from off-CPU memory.
+This presentation from Mike Acton touches on this, among other things:
+https://www.youtube.com/watch?v=rX0ItVEVjHc
+
+2) To allow the data to be more easily loaded and utilized by SIMD kernels, 
+by being able to load memory directly into a SIMD register.
+This is touched on in this presentation from Andreas Fredriksson for writing
+code with SIMD intrinsics:
+http://www.gdcvault.com/play/1022249/SIMD-at-Insomniac-Games-How
+...and as well in this guide for writing performant ISPC kernels:
+https://ispc.github.io/perfguide.html
+
+## How TupleVecImpl works
+
+`tuple_vector` inherits from `TupleVecImpl`, which
+provides the bulk of the functionality for those data containers. It manages
+the memory allocated, marshals data members to each array of memory, generates
+the necessary iterators, and so on. 
+
+When a `tuple_vector` is declared, it is alongside a list of types, or "tuple
+elements", indicating what data to store in the container, similar to how `tuple`
+operates. `TupleVecImpl` uses this list of tuple elements to then inherit from a series of
+`TupleVecLeaf` structures, which each have their own pointer to an array of their
+corresponding type in memory. When dereferencing the container, either to fetch a
+tuple of references or just fetching pointers to the memory, it is these pointers
+that are utilized or fetched. 
+
+While each `TupleVecLeaf` contains a pointer to its own block of memory, they
+are not individual memory allocations. When `TupleVecImpl` needs to grow its
+capacity, it calculates the total size needed for a single allocation, taking
+into account the number of objects for the container, the size of each tuple
+element's type, and the alignment requirements for each type. Pointers into the
+allocation for each tuple element are also determined at the same time, which
+are passed to each `TupleVecLeaf`. From there, many of the interactions with
+`TupleVecImpl`, to modify or access members of the container, then reference
+each `TupleVecLeaf`'s data pointer in series, using parameter packs to repeat
+each operation for each parent `TupleVecLeaf`.
+
+## How tuple_vector's iterator works
+
+`TupleVecImpl` provides a definition to an iterator type, `TupleVecIter`.
+As mentioned above, `TupleVecIter` provides all of the functionality to operate
+as a `RandomAccessIterator`. When it is dereferenced, it provides a tuple of
+references, similar to `at()` or `operator[]` on `TupleVecImpl`, as opposed to
+a reference of some other type. As well, a customization of `move_iterator` for
+`TupleVecIter` is provided, which will return a tuple of rvalue-references.
+
+The way that `TupleVecIter` operates internally is to track an index into the
+container, as well as a copy of all of the `TupleVecImpl`'s `TupleVecLeaf`
+pointers at the time of the iterator's construction. As a result, modifying the
+iterator involves just changing the index, and dereferencing the iterator into
+the tuple of references involves dereferencing each pointer with an offset
+specified by that index.
+
+Of the various ways of handling the multitude of references, this tended to
+provide the best code-generation. For example, having a tuple of pointers that
+are collectively modified with each iterator modification resulted in the compiler
+not being able to accurately determine which pointers were relevant to the final
+output of some function, creating many redundant operations. Similarly, having
+the iterator refer to the source `TupleVecImpl` for the series of pointers
+often resulted in extra, unnecessary, data hops to the `TupleVecImpl` to repeatedly
+fetch data that was not practically mutable, but theoretically mutable. While this
+solution is the heaviest in terms of storage, the resulted assembly tends to be
+competitive with traditional structure-of-arrays setups.
+
+## How to work with tuple_vector, and where to use it
+	
+Put simply, `tuple_vector` can be used as a replacement for `vector`. For example,
+instead of declaring a structure and vector as:
+
+```
+struct Entity
+{
+	bool active;
+	float lifetime;
+	Vec3 position;
+}
+vector<Entity> entityVec;
+```
+
+...the `tuple_vector` equivalent of this can be defined as:
+
+```
+tuple_vector<bool, float, Vec3> entityVec;
+```
+
+In terms of how `tuple_vector` is modified and accessed, it has a similar
+featureset as `vector`, except where `vector` would accept or return a single
+value, it instead accepts or returns a tuple of values or unstructured series
+of equivalent arguments.
+
+For example, the following functions can be used to access the data, either by
+fetching a tuple of references to a series of specific values, or the data
+pointers to the tuple elements:
+
+```
+tuple<bool&, float&, Vec3&> operator[](size_type)
+tuple<bool&, float&, Vec3&> at(size_type)
+tuple<bool&, float&, Vec3&> iterator::operator*()
+tuple<bool&&, float&&, Vec3&&> move_iterator::operator*()
+tuple<bool*, float*, Vec3*> data()
+
+// extract the Ith tuple element pointer from the tuple_vector
+template<size_type I>
+T* get<I>()
+// e.g. bool* get<0>(), float* get<1>(), and Vec3* get<2>()
+
+// extract the tuple element pointer of type T from the tuple_vector
+// note that this function can only be used if there is one instance
+// of type T in the tuple_vector's elements
+template<typename T>
+T* get<T>()
+// e.g. bool* get<bool>(), float* get<float>(), and Vec3* get<Vec3>()
+```
+
+And `push_back(...)` has the following overloads, accepting either values or tuples as needed.
+
+```
+tuple<bool&, float&, Vec3&> push_back()
+push_back(const bool&, const float&, const Vec3&)
+push_back(tuple<const bool&, const float&,const  Vec3&>)
+push_back(bool&&, float&&, Vec3&&)
+push_back(tuple<bool&&, float&&, Vec3&&>)		
+```
+...and so on, and so forth, for others like the constructor, `insert(...)`,
+`emplace(...)`, `emplace_back(...)`, `assign(...)`, and `resize(...)`.
+
+As well, note that the tuple types that are accepted or returned for 
+`tuple_vector<Ts...>` have typedefs available in the case of not wanting to use
+automatic type deduction:
+```
+typedef eastl::tuple<Ts...> value_tuple;
+typedef eastl::tuple<Ts&...> reference_tuple;
+typedef eastl::tuple<const Ts&...> const_reference_tuple;
+typedef eastl::tuple<Ts*...> ptr_tuple;
+typedef eastl::tuple<const Ts*...> const_ptr_tuple;
+typedef eastl::tuple<Ts&&...> rvalue_tuple;
+```
+With this, and the fact that the iterator type satisfies
+the `RandomAccessIterator` requirements, it is possible to use `tuple_vector` in
+most ways and manners that `vector` was previously used, with few structural
+differences.
+
+However, even if not using it strictly as a replacement for `vector`, it is
+still useful as a tool for simplifying management of a traditional structure of
+arrays. That is, it is possible to use `tuple_vector` to just perform a single
+large memory allocation instead of a series of smaller memory allocations,
+by sizing the `tuple_vector` as needed, fetching the necessary pointers with
+`data()` or `get<...>()`, and carrying on normally.
+
+One example where this can be utilized is with ISPC integration. Given the
+following ISPC function definition:
+
+	export void simple(uniform float vin[], uniform float vfactors[], uniform float vout[], uniform int size);
+
+...which generates the following function prototype for C/C++ usage:
+
+	extern void simple(float* vin, float* vfactors, float* vout, int32_t size);
+		
+...this can be utilized with some raw float arrays:
+```
+float* vin = new float[NumElements];
+float* vfactors = new float[NumElements];
+float* vout = new float[NumElements];
+
+// Initialize input buffer
+for (int i = 0; i < NumElements; ++i)
+{
+	vin[i] = (float)i;
+	vfactors[i] = (float)i / 2.0f;
+}
+
+// Call simple() function from simple.ispc file
+simple(vin, vfactors, vout, NumElements);
+
+delete vin;
+delete vfactors;
+delete vout;
+```
+or, with `tuple_vector`:
+
+```
+tuple_vector<float, float, float> simpleData(NumElements);
+float* vin = simpleData.get<0>();
+float* vfactors = simpleData.get<1>();
+float* vout = simpleData.get<2>();
+
+// Initialize input buffer
+for (int i = 0; i < NumElements; ++i)
+{
+	vin[i] = (float)i;
+	vfactors[i] = (float)i / 2.0f;
+}
+
+// Call simple() function from simple.ispc file
+simple(vin, vfactors, vout, NumElements);
+```
+		
+`simpleData` here only has a single memory allocation during its construction,
+instead of the three in the first example, and also automatically releases the
+memory when it falls out of scope.
+
+It is possible to also skip a memory allocation entirely, in some circumstances.
+EASTL provides "fixed" counterparts of many data containers which allows for a 
+data container to have an inlined buffer of memory. For example, 
+`eastl::vector<typename T>` has the following counterpart:
+
+	eastl::fixed_vector<typename T, size_type nodeCount, bool enableOverflow = true>
+
+This buffer allows for enough space to hold a `nodeCount` number of `T` objects,
+skipping any memory allocation at all, until the requested size becomes
+greater than `nodeCount` - assuming `enableOverflow` is True.
+
+There is a similar counterpart to `eastl::tuple_vector<typename... Ts>` available as well:
+
+	eastl::fixed_tuple_vector<size_type nodeCount, bool enableOverflow, typename... Ts>
+
+This does the similar legwork in creating an inlined buffer, and all of the
+functionality of `tuple_vector` otherwise is supported. Note the slight
+difference in declaration, though: `nodeCount` and `enableOverflow` are defined
+first, and `enableOverflow` is not a default parameter. This change arises out
+of restrictions surrounding variadic templates, in that they must be declared
+last, and cannot be mixed with default template parameters.
+
+Lastly, `eastl::vector` and other EASTL data containers support custom Memory Allocator
+types, through their template parameters. For example, `eastl::vector`'s full declaration
+is actually:
+
+	eastl::vector<typename T, typename AllocatorType = EASTLAllocatorType>
+
+However, because such a default template parameter cannot be used with
+variadic templates, a separate type for `tuple_vector` is required for such a
+definition:
+
+	eastl::tuple_vector_alloc<typename AllocatorType, typename... Ts>
+
+Note that `tuple_vector` uses EASTLAllocatorType as the allocator.
+
+## Performance comparisons/discussion
+
+A small benchmark suite for `tuple_vector` is included when running the
+EASTLBenchmarks project. It provides the following output on a Core i7 3770k
+(Skylake) at 3.5GHz, with DDR3-1600 memory.
+
+The `tuple_vector` benchmark cases compare total execution time of similar
+algorithms run against `eastl::tuple_vector` and `std::vector`, such as
+erasing or inserting elements, iterating through the array to find a specific
+element, sum all of the elements together via operator[] access, or just
+running `eastl::sort` on the data containers. More information about the
+EASTLBenchmarks suite can be found in EASTL/doc/EASTL Benchmarks.html
+	
+Benchmark | STD execution time | EASTL execution time | Ratio
+--------- | -------- | ---------- | -----
+`tuple_vector<AutoRefCount>/erase       ` |   1.7 ms |   1.7 ms | 1.00 
+`tuple_vector<MovableType>/erase        ` | 104.6 ms | 106.3 ms | 0.98 
+`tuple_vector<MovableType>/reallocate   ` |   1.3 ms |   1.7 ms | 0.77  -
+ |  |  | 
+`tuple_vector<uint64>/erase             ` |   3.4 ms |   3.5 ms | 0.98 
+`tuple_vector<uint64>/insert            ` |   3.4 ms |   3.4 ms | 0.99 
+`tuple_vector<uint64>/iteration         ` |  56.3 us |  81.4 us | 0.69  -
+`tuple_vector<uint64>/operator[]        ` |  67.4 us |  61.8 us | 1.09 
+`tuple_vector<uint64>/push_back         ` |   1.3 ms | 818.3 us | 1.53  +
+`tuple_vector<uint64>/sort              ` |   5.8 ms |   7.3 ms | 0.80 
+ |  |  | 
+`tuple_vector<uint64,Padding>/erase     ` |  34.7 ms |  32.9 ms | 1.05 
+`tuple_vector<uint64,Padding>/insert    ` |  41.0 ms |  32.6 ms | 1.26 
+`tuple_vector<uint64,Padding>/iteration ` | 247.1 us |  80.5 us | 3.07  +
+`tuple_vector<uint64,Padding>/operator[]` | 695.7 us |  81.1 us | 8.58  +
+`tuple_vector<uint64,Padding>/push_back ` |  10.0 ms |   6.0 ms | 1.67  +
+`tuple_vector<uint64,Padding>/sort      ` |   8.2 ms |  10.1 ms | 0.81 
+ |  |  | 
+`vector<AutoRefCount>/erase             ` |   1.3 ms |   1.2 ms | 1.05 
+`vector<MovableType>/erase              ` | 104.4 ms | 109.4 ms | 0.95 
+`vector<MovableType>/reallocate         ` |   1.5 ms |   1.5 ms | 0.95 
+ |  |  | 
+`vector<uint64>/erase                   ` |   4.3 ms |   3.6 ms | 1.20 
+`vector<uint64>/insert                  ` |   4.8 ms |   4.8 ms | 1.01 
+`vector<uint64>/iteration               ` |  71.5 us |  77.3 us | 0.92 
+`vector<uint64>/operator[]              ` |  90.7 us |  87.2 us | 1.04 
+`vector<uint64>/push_back               ` |   1.6 ms |   1.2 ms | 1.38  +
+`vector<uint64>/sort                    ` |   7.7 ms |   8.2 ms | 0.93 
+		
+First off, `tuple_vector<uint64>`'s performance versus `std::vector<uint64>` is
+comparable, as expected, as the `tuple_vector`'s management for one type
+becomes very similar to just a regular vector. The major notable exception is
+the iteration case, which runs `eastl::find_if`. This
+performance differences is a consequence of the iterator design, and how
+it works with indices, not a direct pointer, so the code generation suffers slightly
+in this compute-bound scenario. This is worth noting as a demonstration of a
+case where falling back to pointer-based iteration by fetching the `begin` and
+`end` pointers of that tuple element may be preferable, instead of using the
+iterator constructs.
+
+The set of `tuple_vector<uint64,Padding>` tests are more interesting. 
+This is a comparison between a single `std::vector` with a
+structure containing a `uint64` and 56 bytes of padding, and a `tuple_vector` with
+two elements: one for `uint64` and one for 56 bytes of padding. The erase,
+insert, push_back, and sort cases all perform at a similar relative rate as
+they did in the `tuple_vector<uint64>` tests - demonstrating that operations
+that have to touch all of elements do not have a significant change in
+performance.
+
+However, iteration and operator[] are very different, because
+those only access the `uint64` member of both `vector` and `tuple_vector` to run
+some operation. The iteration test now runs 3x faster whereas before it ran
+0.7x as fast, and operator[] runs 8.5x faster, instead of 1.1x. This
+demonstrates some of the utility of `tuple_vector`, in that these algorithms end
+up being limited by the CPU's compute capabilities, as opposed to being
+limited by how fast they can load memory in from DRAM. 
+		
+In a series of other tests, generally speaking, `tuple_vector` tends to perform
+on par with manual management of multiple arrays in many algorithms and
+operations, often even generating the same code. It should be noted that
+significant degrees of inlining and optimization are required to get the most out
+of `tuple_vector`. Compared to accessing a series of arrays or vectors,
+`tuple_vector` does perform a multitude of extra trivial function calls internally
+in order to manage the various elements, or interact with `eastl::tuple` through
+its interface, so running in debug configurations can run significantly slower
+in some cases, e.g. sometimes running at 0.2x the speed compared to vector.
+		
+## The problem of referencing tuple elements
+
+This will be experienced shortly after using `tuple_vector` in most capacities,
+but it should be noted that the most significant drawback is that there is no
+way to **symbolically** reference each tuple element of the `tuple_vector` - much
+in the same way as `tuple`. For example, if translating a struct such as...
+
+```
+struct Entity
+{
+	float x, y, z;
+	float lifetime;
+};
+```
+...to `tuple_vector`, it will exist as:
+
+```
+tuple_vector<float, float, float, float> entityVec;
+```
+
+...and can only be accessed in a manner like `entityVec.get<3>()` to refer to
+the `lifetime` member. With existing tools, the only good alternatives are to
+encapsulate each float as a separate struct to give it unique typenames...
+
+```
+struct entityX { float val; };
+struct entityY { float val; };
+struct entityZ { float val; };
+struct entityLifetime { float val; };
+
+tuple_vector<entityX, entityY, entityZ, entityLifetime> entityVec;
+```
+...and then access each tuple element by typename like
+`entityVec.get<entityLifetime>()`; or, creating an enumerated value to replace
+the indices...
+
+```
+enum EntityTypeEnum
+{
+	entityX = 0,
+	entityY = 1,
+	entityZ = 2,
+	entityLifetime = 3
+};
+
+tuple_vector<float, float, float, float> entityVec;
+```
+
+...and then access each tuple element by the enumerated value:
+`entityVec.get<entityLifetime>()`.
+
+Either way, there is a fairly significant maintenance and readability issue
+around this. This is arguably more severe than with `tuple` on its own
+because that is generally not intended for structures with long lifetime.
+
+Ideally, if the language could be mutated to accommodate such a thing, it would
+be good to have some combination of typenames and symbolic names in the
+declaration, e.g. something like
+
+```
+tuple_vector<float x, float y, float z, float lifetime> entityVec;
+```
+and be able to reference the tuple elements not just by typename or index, but
+through their corresponding symbol, like `entityVec.get<lifetime>()`. Or, it may
+be interesting if the necessary `get` functions could be even automatically
+generated through a reflection system, e.g. `entityVec.get_lifetime()`.
+All of this remains a pipe dream for now.