EASTL/doc/Bonus/tuple_vector_readme.md

## Introduction to tuple_vector

`tuple_vector` is a data container that is designed to abstract and simplify
the handling of a "structure of arrays" layout of data in memory. In
particular, it mimics the interface of `vector`, including functionality to do
inserts, erases, push_backs, and random-access. It also provides a
`RandomAccessIterator` and corresponding functionality, making it compatible
with most STL (and STL-esque) algorithms such as ranged-for loops, `find_if`,
`remove_if`, or `sort`.

When used or applied properly, this container can improve performance of
some algorithms through cache-coherent data accesses or allowing for
sensible SIMD programming, while keeping the structure of a single
container, to permit a developer to continue to use existing algorithms in
STL and the like.

## Review of "Structure of arrays" data layouts

When trying to improve the performance of some code, it can sometimes be
desirable to transform how some data is stored in memory to be laid out not as
an "array of structures", but as a "structure of arrays". That is, instead of
storing a series of objects as a single contiguous chunk of memory, one or
more data members are instead stored as separate chunks of memory that are
handled and accessed in parallel to each other.

This can be beneficial in two primary respects:

1) To improve the cache coherency of the data accesses, e.g. by utilizing more
data that is loaded per cache line loaded from memory, and thereby reducing
the amount of time waiting on memory accesses from off-CPU memory.
This presentation from Mike Acton touches on this, among other things:
https://www.youtube.com/watch?v=rX0ItVEVjHc

2) To allow the data to be more easily loaded and utilized by SIMD kernels,
by being able to load memory directly into a SIMD register.
This is touched on in this presentation from Andreas Fredriksson for writing
code with SIMD intrinsics:
http://www.gdcvault.com/play/1022249/SIMD-at-Insomniac-Games-How
...and as well in this guide for writing performant ISPC kernels:
https://ispc.github.io/perfguide.html

## How TupleVecImpl works

`tuple_vector` inherits from `TupleVecImpl`, which
provides the bulk of the functionality for those data containers. It manages
the memory allocated, marshals data members to each array of memory, generates
the necessary iterators, and so on.

When a `tuple_vector` is declared, it is alongside a list of types, or "tuple
elements", indicating what data to store in the container, similar to how `tuple`
operates. `TupleVecImpl` uses this list of tuple elements to then inherit from a series of
`TupleVecLeaf` structures, which each have their own pointer to an array of their
corresponding type in memory. When dereferencing the container, either to fetch a
tuple of references or just fetching pointers to the memory, it is these pointers
that are utilized or fetched.

While each `TupleVecLeaf` contains a pointer to its own block of memory, they
are not individual memory allocations. When `TupleVecImpl` needs to grow its
capacity, it calculates the total size needed for a single allocation, taking
into account the number of objects for the container, the size of each tuple
element's type, and the alignment requirements for each type. Pointers into the
allocation for each tuple element are also determined at the same time, which
are passed to each `TupleVecLeaf`. From there, many of the interactions with
`TupleVecImpl`, to modify or access members of the container, then reference
each `TupleVecLeaf`'s data pointer in series, using parameter packs to repeat
each operation for each parent `TupleVecLeaf`.

## How tuple_vector's iterator works

`TupleVecImpl` provides a definition to an iterator type, `TupleVecIter`.
As mentioned above, `TupleVecIter` provides all of the functionality to operate
as a `RandomAccessIterator`. When it is dereferenced, it provides a tuple of
references, similar to `at()` or `operator[]` on `TupleVecImpl`, as opposed to
a reference of some other type. As well, a customization of `move_iterator` for
`TupleVecIter` is provided, which will return a tuple of rvalue-references.

The way that `TupleVecIter` operates internally is to track an index into the
container, as well as a copy of all of the `TupleVecImpl`'s `TupleVecLeaf`
pointers at the time of the iterator's construction. As a result, modifying the
iterator involves just changing the index, and dereferencing the iterator into
the tuple of references involves dereferencing each pointer with an offset
specified by that index.

Of the various ways of handling the multitude of references, this tended to
provide the best code-generation. For example, having a tuple of pointers that
are collectively modified with each iterator modification resulted in the compiler
not being able to accurately determine which pointers were relevant to the final
output of some function, creating many redundant operations. Similarly, having
the iterator refer to the source `TupleVecImpl` for the series of pointers
often resulted in extra, unnecessary, data hops to the `TupleVecImpl` to repeatedly
fetch data that was not practically mutable, but theoretically mutable. While this
solution is the heaviest in terms of storage, the resulted assembly tends to be
competitive with traditional structure-of-arrays setups.

## How to work with tuple_vector, and where to use it

Put simply, `tuple_vector` can be used as a replacement for `vector`. For example,
instead of declaring a structure and vector as:

```
struct Entity
{
	bool active;
	float lifetime;
	Vec3 position;
}
vector<Entity> entityVec;
```

...the `tuple_vector` equivalent of this can be defined as:

```
tuple_vector<bool, float, Vec3> entityVec;
```

In terms of how `tuple_vector` is modified and accessed, it has a similar
featureset as `vector`, except where `vector` would accept or return a single
value, it instead accepts or returns a tuple of values or unstructured series
of equivalent arguments.

For example, the following functions can be used to access the data, either by
fetching a tuple of references to a series of specific values, or the data
pointers to the tuple elements:

```
tuple<bool&, float&, Vec3&> operator[](size_type)
tuple<bool&, float&, Vec3&> at(size_type)
tuple<bool&, float&, Vec3&> iterator::operator*()
tuple<bool&&, float&&, Vec3&&> move_iterator::operator*()
tuple<bool*, float*, Vec3*> data()

// extract the Ith tuple element pointer from the tuple_vector
template<size_type I>
T* get<I>()
// e.g. bool* get<0>(), float* get<1>(), and Vec3* get<2>()

// extract the tuple element pointer of type T from the tuple_vector
// note that this function can only be used if there is one instance
// of type T in the tuple_vector's elements
template<typename T>
T* get<T>()
// e.g. bool* get<bool>(), float* get<float>(), and Vec3* get<Vec3>()
```

And `push_back(...)` has the following overloads, accepting either values or tuples as needed.

```
tuple<bool&, float&, Vec3&> push_back()
push_back(const bool&, const float&, const Vec3&)
push_back(tuple<const bool&, const float&,const  Vec3&>)
push_back(bool&&, float&&, Vec3&&)
push_back(tuple<bool&&, float&&, Vec3&&>)
```
...and so on, and so forth, for others like the constructor, `insert(...)`,
`emplace(...)`, `emplace_back(...)`, `assign(...)`, and `resize(...)`.

As well, note that the tuple types that are accepted or returned for
`tuple_vector<Ts...>` have typedefs available in the case of not wanting to use
automatic type deduction:
```
typedef eastl::tuple<Ts...> value_tuple;
typedef eastl::tuple<Ts&...> reference_tuple;
typedef eastl::tuple<const Ts&...> const_reference_tuple;
typedef eastl::tuple<Ts*...> ptr_tuple;
typedef eastl::tuple<const Ts*...> const_ptr_tuple;
typedef eastl::tuple<Ts&&...> rvalue_tuple;
```
With this, and the fact that the iterator type satisfies
the `RandomAccessIterator` requirements, it is possible to use `tuple_vector` in
most ways and manners that `vector` was previously used, with few structural
differences.

However, even if not using it strictly as a replacement for `vector`, it is
still useful as a tool for simplifying management of a traditional structure of
arrays. That is, it is possible to use `tuple_vector` to just perform a single
large memory allocation instead of a series of smaller memory allocations,
by sizing the `tuple_vector` as needed, fetching the necessary pointers with
`data()` or `get<...>()`, and carrying on normally.

One example where this can be utilized is with ISPC integration. Given the
following ISPC function definition:

	export void simple(uniform float vin[], uniform float vfactors[], uniform float vout[], uniform int size);

...which generates the following function prototype for C/C++ usage:

	extern void simple(float* vin, float* vfactors, float* vout, int32_t size);

...this can be utilized with some raw float arrays:
```
float* vin = new float[NumElements];
float* vfactors = new float[NumElements];
float* vout = new float[NumElements];

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);

delete vin;
delete vfactors;
delete vout;
```
or, with `tuple_vector`:

```
tuple_vector<float, float, float> simpleData(NumElements);
float* vin = simpleData.get<0>();
float* vfactors = simpleData.get<1>();
float* vout = simpleData.get<2>();

// Initialize input buffer
for (int i = 0; i < NumElements; ++i)
{
	vin[i] = (float)i;
	vfactors[i] = (float)i / 2.0f;
}

// Call simple() function from simple.ispc file
simple(vin, vfactors, vout, NumElements);
```

`simpleData` here only has a single memory allocation during its construction,
instead of the three in the first example, and also automatically releases the
memory when it falls out of scope.

It is possible to also skip a memory allocation entirely, in some circumstances.
EASTL provides "fixed" counterparts of many data containers which allows for a
data container to have an inlined buffer of memory. For example,
`eastl::vector<typename T>` has the following counterpart:

	eastl::fixed_vector<typename T, size_type nodeCount, bool enableOverflow = true>

This buffer allows for enough space to hold a `nodeCount` number of `T` objects,
skipping any memory allocation at all, until the requested size becomes
greater than `nodeCount` - assuming `enableOverflow` is True.

There is a similar counterpart to `eastl::tuple_vector<typename... Ts>` available as well:

	eastl::fixed_tuple_vector<size_type nodeCount, bool enableOverflow, typename... Ts>

This does the similar legwork in creating an inlined buffer, and all of the
functionality of `tuple_vector` otherwise is supported. Note the slight
difference in declaration, though: `nodeCount` and `enableOverflow` are defined
first, and `enableOverflow` is not a default parameter. This change arises out
of restrictions surrounding variadic templates, in that they must be declared
last, and cannot be mixed with default template parameters.

Lastly, `eastl::vector` and other EASTL data containers support custom Memory Allocator
types, through their template parameters. For example, `eastl::vector`'s full declaration
is actually:

	eastl::vector<typename T, typename AllocatorType = EASTLAllocatorType>

However, because such a default template parameter cannot be used with
variadic templates, a separate type for `tuple_vector` is required for such a
definition:

	eastl::tuple_vector_alloc<typename AllocatorType, typename... Ts>

Note that `tuple_vector` uses EASTLAllocatorType as the allocator.

## Performance comparisons/discussion

A small benchmark suite for `tuple_vector` is included when running the
EASTLBenchmarks project. It provides the following output on a Core i7 3770k
(Skylake) at 3.5GHz, with DDR3-1600 memory.

The `tuple_vector` benchmark cases compare total execution time of similar
algorithms run against `eastl::tuple_vector` and `std::vector`, such as
erasing or inserting elements, iterating through the array to find a specific
element, sum all of the elements together via operator[] access, or just
running `eastl::sort` on the data containers. More information about the
EASTLBenchmarks suite can be found in EASTL/doc/EASTL Benchmarks.html

Benchmark | STD execution time | EASTL execution time | Ratio
--------- | -------- | ---------- | -----
`tuple_vector<AutoRefCount>/erase       ` |   1.7 ms |   1.7 ms | 1.00
`tuple_vector<MovableType>/erase        ` | 104.6 ms | 106.3 ms | 0.98
`tuple_vector<MovableType>/reallocate   ` |   1.3 ms |   1.7 ms | 0.77  -
 |  |  |
`tuple_vector<uint64>/erase             ` |   3.4 ms |   3.5 ms | 0.98
`tuple_vector<uint64>/insert            ` |   3.4 ms |   3.4 ms | 0.99
`tuple_vector<uint64>/iteration         ` |  56.3 us |  81.4 us | 0.69  -
`tuple_vector<uint64>/operator[]        ` |  67.4 us |  61.8 us | 1.09
`tuple_vector<uint64>/push_back         ` |   1.3 ms | 818.3 us | 1.53  +
`tuple_vector<uint64>/sort              ` |   5.8 ms |   7.3 ms | 0.80
 |  |  |
`tuple_vector<uint64,Padding>/erase     ` |  34.7 ms |  32.9 ms | 1.05
`tuple_vector<uint64,Padding>/insert    ` |  41.0 ms |  32.6 ms | 1.26
`tuple_vector<uint64,Padding>/iteration ` | 247.1 us |  80.5 us | 3.07  +
`tuple_vector<uint64,Padding>/operator[]` | 695.7 us |  81.1 us | 8.58  +
`tuple_vector<uint64,Padding>/push_back ` |  10.0 ms |   6.0 ms | 1.67  +
`tuple_vector<uint64,Padding>/sort      ` |   8.2 ms |  10.1 ms | 0.81
 |  |  |
`vector<AutoRefCount>/erase             ` |   1.3 ms |   1.2 ms | 1.05
`vector<MovableType>/erase              ` | 104.4 ms | 109.4 ms | 0.95
`vector<MovableType>/reallocate         ` |   1.5 ms |   1.5 ms | 0.95
 |  |  |
`vector<uint64>/erase                   ` |   4.3 ms |   3.6 ms | 1.20
`vector<uint64>/insert                  ` |   4.8 ms |   4.8 ms | 1.01
`vector<uint64>/iteration               ` |  71.5 us |  77.3 us | 0.92
`vector<uint64>/operator[]              ` |  90.7 us |  87.2 us | 1.04
`vector<uint64>/push_back               ` |   1.6 ms |   1.2 ms | 1.38  +
`vector<uint64>/sort                    ` |   7.7 ms |   8.2 ms | 0.93

First off, `tuple_vector<uint64>`'s performance versus `std::vector<uint64>` is
comparable, as expected, as the `tuple_vector`'s management for one type
becomes very similar to just a regular vector. The major notable exception is
the iteration case, which runs `eastl::find_if`. This
performance differences is a consequence of the iterator design, and how
it works with indices, not a direct pointer, so the code generation suffers slightly
in this compute-bound scenario. This is worth noting as a demonstration of a
case where falling back to pointer-based iteration by fetching the `begin` and
`end` pointers of that tuple element may be preferable, instead of using the
iterator constructs.

The set of `tuple_vector<uint64,Padding>` tests are more interesting.
This is a comparison between a single `std::vector` with a
structure containing a `uint64` and 56 bytes of padding, and a `tuple_vector` with
two elements: one for `uint64` and one for 56 bytes of padding. The erase,
insert, push_back, and sort cases all perform at a similar relative rate as
they did in the `tuple_vector<uint64>` tests - demonstrating that operations
that have to touch all of elements do not have a significant change in
performance.

However, iteration and operator[] are very different, because
those only access the `uint64` member of both `vector` and `tuple_vector` to run
some operation. The iteration test now runs 3x faster whereas before it ran
0.7x as fast, and operator[] runs 8.5x faster, instead of 1.1x. This
demonstrates some of the utility of `tuple_vector`, in that these algorithms end
up being limited by the CPU's compute capabilities, as opposed to being
limited by how fast they can load memory in from DRAM.

In a series of other tests, generally speaking, `tuple_vector` tends to perform
on par with manual management of multiple arrays in many algorithms and
operations, often even generating the same code. It should be noted that
significant degrees of inlining and optimization are required to get the most out
of `tuple_vector`. Compared to accessing a series of arrays or vectors,
`tuple_vector` does perform a multitude of extra trivial function calls internally
in order to manage the various elements, or interact with `eastl::tuple` through
its interface, so running in debug configurations can run significantly slower
in some cases, e.g. sometimes running at 0.2x the speed compared to vector.

## The problem of referencing tuple elements

This will be experienced shortly after using `tuple_vector` in most capacities,
but it should be noted that the most significant drawback is that there is no
way to **symbolically** reference each tuple element of the `tuple_vector` - much
in the same way as `tuple`. For example, if translating a struct such as...

```
struct Entity
{
	float x, y, z;
	float lifetime;
};
```
...to `tuple_vector`, it will exist as:

```
tuple_vector<float, float, float, float> entityVec;
```

...and can only be accessed in a manner like `entityVec.get<3>()` to refer to
the `lifetime` member. With existing tools, the only good alternatives are to
encapsulate each float as a separate struct to give it unique typenames...

```
struct entityX { float val; };
struct entityY { float val; };
struct entityZ { float val; };
struct entityLifetime { float val; };

tuple_vector<entityX, entityY, entityZ, entityLifetime> entityVec;
```
...and then access each tuple element by typename like
`entityVec.get<entityLifetime>()`; or, creating an enumerated value to replace
the indices...

```
enum EntityTypeEnum
{
	entityX = 0,
	entityY = 1,
	entityZ = 2,
	entityLifetime = 3
};

tuple_vector<float, float, float, float> entityVec;
```

...and then access each tuple element by the enumerated value:
`entityVec.get<entityLifetime>()`.

Either way, there is a fairly significant maintenance and readability issue
around this. This is arguably more severe than with `tuple` on its own
because that is generally not intended for structures with long lifetime.

Ideally, if the language could be mutated to accommodate such a thing, it would
be good to have some combination of typenames and symbolic names in the
declaration, e.g. something like

```
tuple_vector<float x, float y, float z, float lifetime> entityVec;
```
and be able to reference the tuple elements not just by typename or index, but
through their corresponding symbol, like `entityVec.get<lifetime>()`. Or, it may
be interesting if the necessary `get` functions could be even automatically
generated through a reflection system, e.g. `entityVec.get_lifetime()`.
All of this remains a pipe dream for now.