Skip to content

Latest commit

 

History

History
177 lines (130 loc) · 15 KB

25.processing_and_analysis.md

File metadata and controls

177 lines (130 loc) · 15 KB

Processing and Analysis of Data

yt provides several interfaces for accessing the data available in a given dataset. As described in @sec:data_objects, the primary means of accessing data is through "data objects" that apply selections to the dataset. These objects present dictionary-like interfaces that return data; below, we describe what options are available for the data that is returned (@sec:field_system), as well as high-level interfaces for applying aggregations and reductions (@sec:arraylike_operations).

Field System {#sec:field_system}

In yt, there are three types of "fields" that define values at a given spatial location. The first of these is an "on-disk" field, representing the raw, unmodified (except potentially up-cast to 64 bit precision) values read from the data storage that defines the dataset, such as files or bucket storage; while yt does provide routines for reading these fields, they are passed largely unmodified and so we do not discuss them in depth. The second type of field is a "derived field," which is a functional definition of how to process or combine one or more fields that exist in the dataset. Finally, providing the closure necessary for these derived fields to be accessed independently of their naming convention are "alias fields" that provide mappings between platform- or format-specific names for fields and those used internally in yt.

Fields are also defined by their "sampling type" to distinguish between those fields defined in a volume-filling fashion (i.e., cell-based fields) and those that are defined by discrete samples that may or may not require closure or convolution functions to be applied. Fields that are defined as a collection of discrete samples can be combined or filtered differently than those that are defined in a volume-filling manner, as described in @sec:particle_filters and @sec:particle_unions.

Field Aliases {#sec:field_aliases}

Small differences in naming fields can prove disproportionately challenging for writing platform-neutral analysis code. For instance, if one platform names the "density" field dens and another refers to it as Density (or, as we have seen in one platform, even the unicode character for $\rho$) then any platform-independent derived field that utilizes density must be defined multiple times to refer to this fundamentally identical quantity. (An important note here is that in many cases, the reverse problem is true -- some codes may refer to things with the same name but with different underlying definitions, which provides an additional challenge to the analysis process by requiring disambiguation.)

To address this issue, yt defines a set of fundamental fields, along with a naming convention for extensibility, that are provided as "aliases" for the dataset-specific field names. This enables a consistent ontology to be defined for fields in yt, upon which the remainder of derived fields can rely. Typically these are defined by the authors of a given dataset format frontend, wherein a translation or lookup table is provided to match the on-disk fields to those expected by yt.

In some cases, it is through a combination of derived and aliased fields that the full set of data is made available to the researcher; for instance, some datasets do not store velocity as a quantity on disk, but instead store momentum. In this case, momentum is aliased from the on-disk field to the yt field, and then a derived field is generated to seamlessly provide access to the velocity field wherever it is needed.

Derived Fields {#sec:derived_fields}

In addition to the fields that are defined in the dataset, yt recognizes that there exist essentially infinite fields in potentia that can be defined. For instance, commonly in astrophysics datasets the "density" of different elemental abundances are stored (which provides a natural conservation scheme with the density) in the dataset. A simple derived field might be defined to provide the "fraction" of a given field: $$ f_{X} \equiv \frac{\rho_{X}}{\rho} $$ yt provides the ability to define this as a derived field in a functional form. For instance, if the density of helium is stored as the field-tuple ("gas", "helium_density") we can define the function as:

def _helium_fraction(field, data):
    return data["gas", "helium_density"] / data["gas", "density"]

Note that here, the argument field is a field definition object and data is a data object which we are using for our selection. This is the form that derived fields in yt take; these can be supplied to the function add_field (or they can use derived_field as a decorator) and they will become available for all data objects.

These fields can accept parameters (associated with the base data object used for selection) and can require that spatial information is made available to the derived field; this can enable the calculation of finite-difference stencils for operations such as averaging and operators such as the gradient.

Derived fields are an extremely integral component of yt and are the gateway to enabling low-memory overhead calculations and sharing of analysis code. In addition, yt includes a large number of fields available, many of which are dynamically constructed according to metadata available in the dataset, to jump-start analysis. Researchers using yt can load a dataset and immediately compute, for instance, the velocity divergence and yt will construct the appropriate finite different stencil, fill in any missing zones at the edge of individual boundaries, and return an array that can be accessed, visualized or processed.

Particle Filters {#sec:particle_filters}

Many of the data formats that yt accepts define particles as mixtures of a single set of attributes (such as position, velocity, etc) and then a "type" -- for instance, intermingling dark matter particles with "star" particles. Where simulations are concerned, this can produce much more efficient code; since particles are typically evolved in the same fashion, storing them adjacent in memory can speedup operations such as time evolution steps. However, when reading the data in, they often need to be handled in fundamentally different ways. The analysis of dark matter particles in a galaxy, for instance, needs to be conducted differently than the analysis of collisional particles, or particles that arise from other phenomena (such as gas). yt provides a method for creating new "particle types" on the fly and applying existing derived fields to them. By adding a new "filter" method, particles that meet this criteria ("high-mass Black Holes," for instance, or "star clusters more than 1 billion years old") are accessible in a new field tuple. This enables all existing memory-conservative operations to act on them.

This filter, for example, checks and returns only those particles whose field particle_type is set to a value of 2.

@yt.particle_filter(requires=["particle_type"], filtered_type="all")
def stars(pfilter, data):
    filter = data[(pfilter.filtered_type, "particle_type")] == 2
    return filter

In this case, yt also infers the name of the newly filtered type from the name of the function, and they become stars. Now all existing operations will work on field-tuples beginning with "stars" as their field type.

Particle Unions {#sec:particle_unions}

The opposite operation to that in @sec:particle_filters is also accessible, by which multiple particle types can be combined and viewed as a single logical type. For instance, if "star particles" and "black hole" particles are distinct in a simulation of galaxy formation, they can be combined into a logical union:

u = ParticleUnion("massive_objects", ["bh", "stars"])
ds.add_particle_union(u)

Since unions are restricted to combinations in full of different types, their creation requires only specification of the particle types to combine. The set of available fields is the intersection of the fields available for all the combined types. If both particle types share fields A and B but only one shares C, the union will only have fields A and B accessible to it.

Field Detection {#sec:field_detection}

yt determines at dataset instantiation time the fields that are available to be computed. This provides the ability for researchers to query what fields are available, and additionally as a side-effect it provides information to the yt IO routines which fields need to be computed for a given derived field. By utilizing this information, yt can "resolve" all required fields when a derived field is requested. As such, it is able to identify that ("gas", "velocity_divergence") relies on the velocity fields along each axis. If these are the fields that exist in the dataset, the resolution process concludes here. If, however, they need to be computed from the momentum and density fields, those become the fields that are read from the dataset.

This resolution of field dependencies enables yt to read only the fields that are necessary and to do so in a single pass over a file, reducing the initialization and seeking time within a file. Particularly in environments where metadata operations (required for an open system call) or seek operations (where dataset chunks may need to be looked up within a file as indexed by a header) are expensive, this can have significant impact on the overall performance, and by operating on a chunk-by-chunk basis, it further reduces the need to store multiple fields in memory simultaneously.

This computation does, however, come with an overhead. Detecting the fields that are required (and thus determining which fields are available) can be expensive, as many small sympy objects are created in the unit handling subsystem and many redundant calculations performed in the yt-specific field resolution code. This is an area of great interest for future optimizations, as the current situation benefits the access of large derived fields over iteration over many small datasets. In particular, an enormous amount of time in the unit testing framework is spent detecting fields for datasets that are only used once and then discarded.

Array-like Operations {#sec:arraylike_operations}

In yt, a newly-constructed data selector contains no data -- this enables data selectors for large regions, in extremely large datasets, to be lightweight and cheap to construct. By ensuring that these objects don't immediately consume resources, they can be manipulated and operated on in a high-level fashion, without taxing the computational power. While these data objects can return the full set of data they include, yt also provides array-like operations that do not require immediate access to the full set of numerical values, and which align with the mental-model for data processing that yt exposes. As an example, consider the following two operations:

dd = ds.all_data()
dd["gas", "density"].max()

and

dd = ds.all_data()
dd.max(("gas", "density"))

Both are available in yt. As a side-effect of Python's object model, the first will access the ("gas", "density") item in the object dd, itself a concatenated numpy array, and then execute the max method on it. The second will call the max method on the data object, supplying to it the name of the field. This allows yt to decide how to decompose, parallelize and process the data in a memory-efficient way, and spread across multiple processors. Additionally, by emphasizing that the "maximum" is being taken on the data object, rather than the numerical data, other operations can be exposed that build on the underlying data organization. For instance, taking the maximum along a given (spatial) axis:

sp = ds.sphere("center", (10.0, "m"))
sp.max(("gas", "temperature"), axis="z")

This translates our meaning -- find the maximum value along the z-axis -- into a dimensionality reduction operation that uses yt's built-in "projection" method. These operations, on data objects (rather than the underlying arrays of values that are accessible through them) provide dataframe-like methods for querying very large, spatially registered data.

The array-like operations utilized in yt attempt to map to conceptually similar operations in numpy. Unlike numpy, however, these utilize yt's dataset-aware "chunking" operations, in a manner philosophically similar to the chunking operations used in the parallel computation library dask. Below, we outline the three classes of operations that are available, based on the type of their return value.

Reduction to Scalars {#sec:arrayops-scalar}

Traditional array operations that map from an array to a scalar are accessible utilizing familiar syntax. These include:

  • min(field_specification), max(field_specification), and ptp(field_specification)
  • argmin(field_specification, axis), and argmax(field_specification, axis)
  • mean(field_specification, weight), std(field_specification, weight), and sum(field_specification)

In addition to the advantages of allowing the parallelism and memory management be handled by yt, these operations are also able to accept multiple fields. This allows multiple fields to be queried in a single pass over the data, rather than multiple passes. Additionally, the min and max operations will automatically cache the results during a single pass, which means that calling max immediately after min (and vice versa) on the same data object and field will not require a recomputation.

In the case of argmin and argmax, the default returned "axis" will be the spatial coordinates of the minimum or maximum field value (respectively).
However, by specifying an axis or set of axes that correspond to fields, the field values will be queried at these minimum or maximum points. This allows, for instance, to query the value of "density" at the minimum "temperature." The operations mean and sum are available here in a non-spatial form, where they simply compute the scalar reduction independent of the spatial registration of the dataset.

Reduction to Vectors {#sec:arrayops-vector}

  • profile(axes, fields, profile_specification)

The profile operation provides weighted or unweighted histogramming in one or two dimensions. This function accepts the axes along which to compute the histogram as well as the fields to compute, and information about whether the binning should be an accumulation, an average, or a weighted average. These operations are described in more detail in reference profile section.

Remapping Operations {#sec:arrayops-remap}

  • mean(field_specification, weight, axis)
  • sum(field_specification, axis)
  • integrate(field_specification, weight, axis)

These functions map directly to different methods used by the projection data object. Both mean and sum, when supplied a spatial axis, will compute a dimensionally-reduced projection, remapped into a pixel coordinate plane. Importantly, if the dataset is a finite-volume dataset (grid, octree, etc), the results of these operations will be a variable-resolution mesh, rather than a fixed resolution image buffer.