Skip to content

Commit 93542ae

Browse files
authored
Merge pull request #653 from scipp/sqw-docs
Finish SQW docs
2 parents de5f036 + 8e35c50 commit 93542ae

File tree

3 files changed

+166
-25
lines changed

3 files changed

+166
-25
lines changed

docs/developer/file-formats/sqw.md

Lines changed: 143 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,16 @@ ScippNeutron can write and, to a limited degree, read SQW files using {mod}`scip
77

88
```{note}
99
There is no official specification for the format of SQW files.
10-
The contents of this document are based the little bits of documentation that are available, from reading the source code of Horace, and from experimenting with Horace and SQW files.
11-
It should *not* be regarded as authoritative.
10+
The contents of this document are based on what documentation is available, on reading the source code of Horace, and on experiments with Horace and SQW files.
11+
This document should *not* be regarded as authoritative.
1212
```
1313

14+
See, e.g., [07-sqw_redesign.md](https://github.com/pace-neutrons/Horace/blob/3c00958d578c625a7f2274ef7b34557578abb934/documentation/add/07-sqw_redesign.md) for some documentation provided by the developers of Horace.
15+
1416
## Overview
1517

1618
SQW files are binary files that encode neutron counts both in a flexible binned format called 'pixels' and a multidimensional histogram called 'DND' (or D1D through D4D for concrete cases).
17-
In addition to the data, SQW files store metadata that describe how the experiment was set up and how the data were binned.
19+
In addition to the data, SQW files store metadata that describes how the experiment was set up and how the data was binned.
1820

1921
### Data Blocks
2022

@@ -43,7 +45,7 @@ Block names are split into a 'name' and a 'level2_name' which in ScippNeutron ar
4345
- Histogram data.
4446
* - experiment_info
4547
- instruments
46-
- Information about the instrument where the data were measured. One {class}`SqwIXNullInstrument <scippneutron.io.sqw.SqwIXNullInstrument>` per [run](#Runs). (Horace defines several instrument classes but so far, we have only use the null instrument.)
48+
- Information about the instrument where the data was measured. One {class}`SqwIXNullInstrument <scippneutron.io.sqw.SqwIXNullInstrument>` per [run](#Runs). (Horace defines several instrument classes. But so far, the null instrument has been sufficient.)
4749
* - experiment_info
4850
- samples
4951
- Information on the sample that was measured. One {class}`SqwIXSample <scippneutron.io.sqw.SqwIXSample>` per [run](#Runs).
@@ -55,7 +57,7 @@ Block names are split into a 'name' and a 'level2_name' which in ScippNeutron ar
5557
- Information about the pixel data. Largely hidden and automated in ScippNeutron.
5658
* - pix
5759
- data_wrap
58-
- 'Pixel' data, i.e., binned observations as a function of momentum and energy transfer. Stored as a $9 \times N$ array with [rows](https://github.com/pace-neutrons/Horace/blob/master/documentation/add/05_file_formats.md):
60+
- 'Pixel' data, i.e., binned observations as a function of momentum and energy transfer. Stored as a $9 \times N$ array with these [rows](https://github.com/pace-neutrons/Horace/blob/master/documentation/add/05_file_formats.md) in order:
5961
- `u1`, `u2`, `u3`, `u4`: usually momentum transfers $Q_x$, $Q_y$, $Q_z$ and energy transfer $\Delta E$.
6062
- `irun`: [Run](#Runs) index.
6163
- `idet`: Detector (pixel) index.
@@ -66,11 +68,17 @@ Block names are split into a 'name' and a 'level2_name' which in ScippNeutron ar
6668

6769
### Runs
6870

69-
Data can be collected in multiple runs or 'settings' where each run has a fixed set of sample parameters.
71+
Data can be collected in multiple runs or 'settings' where each run has a fixed set of sample and instrument parameters.
7072
In SQW, runs are encoded by assigning a run index `irun` to each observation and providing an appropriate instrument, sample, and experiment object.
7173
The instrument and sample objects are often the same for all runs.
7274
This can be encoded by way of 'unique object containers' which store only one copy of each unique object and indices to associate these objects with runs.
7375

76+
### FORTRAN Layout
77+
78+
Horace, being a MATLAB program, uses FORTRAN order for arrays and 1-based indexing.
79+
{mod}`scippneutron.io.sqw` converts between these conventions and the Python conventions automatically in most cases.
80+
But some data, most notably the pixel array, must be provided with 1-based indices.
81+
7482
### Byte Order
7583

7684
SQW files do not have a dedicated mechanism for encoding or detecting byte order.
@@ -103,14 +111,14 @@ For Horace version 4.0, this header is
103111

104112
### Block Allocation Table
105113

106-
The block allocation table (BAT) encodes which data blocks exist and where they are located in the file.
114+
The block allocation table (BAT) encodes which [data blocks](#data-blocks) exist and where they are located in the file.
107115
The BAT is a sequence of data block descriptors.
108116
The structure of the BAT is
109117
```text
110-
| size | n_blocks | descriptor | ...
111-
| :u32 | :u32 | :descriptor | ...
118+
| size | n_blocks | descriptor_0 | ... | descriptor_n
119+
| :u32 | :u32 | :descriptor | ... | :descript
112120
```
113-
where `descriptor` is
121+
where the `:descriptor` type is
114122
```text
115123
| block_type | name | level2_name | position | size | locked |
116124
| :char array | :char array | :char array | :u64 | u32 | u32 |
@@ -144,9 +152,11 @@ Blocks with type `"dnd_data_block"` are stored as a
144152
| :u32 | :u32 | :u32 | ... | :f64 array | :f64 array | :u64 array |
145153
```
146154
- `ndim` and `length_i` encode a [shape](#shape) but uses a `u32` for `ndim` instead of the usual `u8`.
147-
- `values` currently unknown but looks like a spectroscopy pattern.
148-
- `errors` currently unknown, uncertainties of `values`?
149-
- `counts` currently unknown but looks like a spectroscopy pattern.
155+
- `values` is the average intensity per bin (`pix.bins.mean()` or, equivalently, `pix.bins.sum() / pix.bins.size()`).
156+
- `errors` is the standard deviation per bin (`sc.stddevs(pix.bins.mean())`).
157+
- `counts` is the number of observations in each bin (`pix.bins.size()`).
158+
159+
Note the example code at the end of the next section.
150160

151161
#### Pixel Data Blocks
152162

@@ -159,9 +169,49 @@ The column-major layout results in a file of the form
159169
| ...
160170
```
161171

162-
Columns must be sorted in ascending order according to the `u` rows where `u1` is sorted, for each constant `u1`, `u2` is sorted, etc.
172+
Pixels are sorted in bins following the same order as in the [DND data block](#dnd-data-blocks).
173+
Given the [FORTRAN layout](#fortran-layout), the pixels are sorted according to
174+
```python
175+
sorted_bins = binned_observations.transpose(['u4', 'u3', 'u2', 'u1'])
176+
```
177+
Within each bin, the data is unordered.
178+
179+
The number of bins is encoded by the `n_bins_all_dims` attribute of the DND metadata block.
180+
The number of pixels in each bin is determined by the ``counts`` array in the [DND data block](#dnd-data-blocks).
181+
And the ranges of the ``u``s are determined by the ``img_range`` attribute of the DND metadata block.
182+
```python
183+
import itertools
184+
import numpy as np
185+
import scipp as sc
186+
from scippneutron.io import sqw
187+
188+
with sqw.Sqw.open("file.sqw") as f:
189+
metadata = f.read_data_block("data", "metadata")
190+
dnd = f.read_data_block("data", "nd_data")
191+
pix = f.read_data_block("pix", "data_wrap")
192+
193+
n_pix = dnd[2].astype(int)
194+
195+
# Compute the bin edges:
196+
img_range = metadata.axes.img_range
197+
n_bins = metadata.axes.n_bins_all_dims.values
198+
u_edges = [
199+
sc.linspace(f'u{i}', img_range[i][0], img_range[i][1], n_bins[i] + 1)
200+
for i in range(len(n_bins))
201+
]
202+
203+
# Access pixels bin-by-bin:
204+
bin_offsets = np.r_[0, np.cumsum(n_pix.flat)]
205+
for bin_index, (l, k, j, i) in enumerate( # noqa: E741
206+
itertools.product(*(range(nb) for nb in n_bins[::-1]))
207+
):
208+
pix_slice = slice(bin_offsets[bin_index], bin_offsets[bin_index + 1])
209+
observations_in_bin = pix[pix_slice]
210+
```
211+
212+
## Data Types
163213

164-
## Objects
214+
This section describes the data types used in SQW files.
165215

166216
### Logical
167217

@@ -174,19 +224,87 @@ All numbers are encoded as their byte representation in the target byte order.
174224

175225
### Character Array
176226

177-
A sequence of ASCII characters with *known* length.
178-
Encoded as
227+
A sequence of ASCII characters including the length.
228+
Encoded as characters ``c0`` through ``cn``:
179229
```text
180230
| length | c0 | c1 | ... |
181231
| :uint32 | :uint8 | :uint8 | ... |
182232
```
183233

184234
### Fixed Character Array
185235

186-
A sequence of ASCII characters with *unknown* length.
187236
Like [Character Array](#character-array) but the length is not stored in the file.
237+
So the length must be known from context.
238+
239+
### Struct
188240

189-
### Shape
241+
A collection of named fields with different types.
242+
243+
The field names are encoded as [Fixed Character Arrays](#fixed-character-array) with the lengths up front.
244+
The field values are encoded as a [Cell Arrays](#cell-array).
245+
```text
246+
| n_fields | name_len_0 | ... | name_len_n |
247+
| :uint32 | :uint32 | ... | :uint32 |
248+
249+
| name_0 | ... | name_n |
250+
| :fixed_char_array | ... | :fixed_char_array |
251+
252+
| value_0 | ... | value_n |
253+
| :cell_array | ... | :cell_array |
254+
```
255+
256+
When used in an [Object Array](#object-array) with more than one element, the fields are stored in a 'struct of arrays' fashion.
257+
This means that if the surrounding object array has shape `S`, so does each [cell array](#cell-array).
258+
For example, given structs
259+
```python
260+
class MyStruct:
261+
i: int
262+
decimal: float
263+
264+
data = [MyStruct(i=10, decimal=0.1), MyStruct(i=20, decimal=0.2)]
265+
```
266+
the data will be encoded as
267+
```text
268+
| 2 | 1 | 7 | "i" | "decimal" | 10 | 20 | 0.1 | 0.2 |
269+
```
270+
271+
### Cell Array
272+
273+
Cell arrays are MATLAB’s way of encoding heterogeneous arrays where the elements can have different types.
274+
Each element is encoded as an [object array](#object-array).
275+
276+
In SQW, cell arrays can occur either standalone or within an object array (by way of a struct).
277+
Standalone cell arrays encode their shape in the same way as an object array.
278+
Otherwise, they do *not* store their shape; it needs to be provided by the [object array](#object-array) that the cell array is part of.
279+
280+
### Object Array
281+
282+
Object arrays are homogeneous, multidimensional arrays of any object type.
283+
They encode their element type using a type tag:
284+
```text
285+
| type_tag | shape | data |
286+
| :uint8 | :Shape | :Any |
287+
```
288+
See ``TypeTag`` in the source code for a list of supported types and their tags.
289+
290+
Object arrays have some non-trivial interactions with different types.
291+
292+
- For **simple types** like numbers, logical, and strings, ``data`` is simply an array of the given type in FORTRAN layout.
293+
- For **structs**, the data is encoded in a 'struct of arrays' fashion as described in the [Struct](#struct) section.
294+
- For **cell arrays**, i.e., if the type tag indicated 'cell array', the entire object array is instead processed as a [cell array](#cell-array). This means that ``data`` is an array of object arrays.
295+
- For **self-serializing** types, there is no shape and ``data`` has to encode how to read it.
296+
297+
### Self Serializing
298+
299+
Some data is 'self-serializing', meaning that it does not rely on an external type tag and shape to describe the data.
300+
Instead, if used inside an object array, this data must contain its own type tag.
301+
In ScippNeutron, this case is treated like a nested object array.
302+
303+
### Auxiliary Types
304+
305+
These types are not represented by type tags but occur in fixed locations in the file.
306+
307+
#### Shape
190308

191309
The shape of a multidimensional array.
192310
Encoded as
@@ -196,10 +314,12 @@ Encoded as
196314
```
197315
where `ndim` is the number of dimensions and there are this many `length_i` entries.
198316

199-
### Struct
317+
#### Data Block Type
200318

201-
TODO
319+
Each data block is encoded as a struct with, among others, fields 'serial_name' and 'version' which together identify the block.
320+
This is used to determine how to deserialize the block into a Python model.
202321

203-
### Object Array
322+
#### Unique Object Container
204323

205-
### Self Serializing
324+
'Unique object container' and 'Unique reference container' are always used together and represent an array of objects where duplicates are only stored once.
325+
Each 'Unique reference container' contains a 'Unique object container' which in turn contains an array of indices that map from array index to a unique object.

src/scippneutron/io/sqw/_build.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -175,6 +175,28 @@ def add_empty_dnd_data(self, metadata: SqwDndMetadata) -> SqwBuilder:
175175
def add_dnd_data(
176176
self, metadata: SqwDndMetadata, *, data: sc.Variable, counts: sc.Variable
177177
) -> SqwBuilder:
178+
"""Add a DND data block (image) to the file.
179+
180+
Attention
181+
---------
182+
``data`` and ``counts`` must be given in row-major layout.
183+
They are converted to column-major (FORTRAN) layout internally.
184+
This means that the data must have shape ``['u1', 'u2', 'u3', 'u4']``.
185+
186+
Parameters
187+
----------
188+
metadata:
189+
Metadata describing the DND data.
190+
data:
191+
The values and variances of the image.
192+
counts:
193+
The number of observations ('pixels') per bin.
194+
195+
Returns
196+
-------
197+
:
198+
Self.
199+
"""
178200
builder = self._add_dnd_metadata(metadata)
179201
builder._dnd_data = _DndData(
180202
shape=tuple(map(int, metadata.axes.n_bins_all_dims)), # type: ignore[call-overload]

src/scippneutron/io/sqw/_read_write.py

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -51,8 +51,7 @@ def get(self, ty: ir.TypeTag, pos: int) -> _T:
5151
def read_object_array(sqw_io: LowLevelSqw) -> ir.ObjectArray | ir.CellArray:
5252
position = sqw_io.position
5353
ty = ir.TypeTag(sqw_io.read_u8())
54-
if ty == ir.TypeTag.serializable: # TODO
55-
# raise RuntimeError(f'!!!! {ty.value} {sqw_io.position-1}')
54+
if ty == ir.TypeTag.serializable:
5655
# This type object does not encode a shape, so just attempt
5756
# to read its contents.
5857
return read_object_array(sqw_io)

0 commit comments

Comments
 (0)