Skip to content

Latest commit

 

History

History
226 lines (188 loc) · 8.91 KB

README_soeq.md

File metadata and controls

226 lines (188 loc) · 8.91 KB

soeq - Sieve of Eratosthenes prime search (bit)

Table of content

Same prime number search as soep, but now the sieve is implemented using one bit per odd number. This allows the jobs to target the primes up to 1.E8, resulting in a 6.25 MByte sieve array size.

The basic algorithm is as described under soep. Only the special considerations for a 'one bit per sieve array element' implementation are covered in the Language and Compiler Notes.

Same input file format as soep.

This can be only be implemented efficiently in languages which directly support bit handling.

The bit handling is done by logical operations on data types like unsigned char in C. The most straight forward implementation uses bit masks for testing and clearing a bit in the sieve array. Mapping the numbers to bits from right (LSB) to left (MSB) gives for example an algorithm core in C like

    const unsigned char tstmask[] = {0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80};
    const unsigned char clrmask[] = {0xfe,0xfd,0xfb,0xf7,0xef,0xdf,0xbf,0x7f};
    ...
    for (n=3; n<=nmsqrt; n+=2) {
      i = n/2;
      if ((prime[i>>3] & tstmsk[i&0x7]) == 0) continue;
      for (i=(n*n)/2; i<=bimax ; i+=n) prime[i>>3] &= clrmsk[i&0x7];
    }

The bit masks can be implemented as short lookup tables, like tstmsk and clrmsk in the example above, or calculated on the fly like

tstmsk[ind]  -->   (1<<(ind))
clrmsk[int]  -->  ~(1<<(ind))

It depends on compiler and the hardware whether a memory lookup or a few integer instructions are faster.

The code supports both lookup and in-line style for the bit masks. This controlled by the LOOKUP preprocessor symbol. Default is using in-line.

Assembler - soeq_asm.asm

The code uses on-the-fly calculation of the bits masks, conceptually similar to the C implementation described above.

Here a right (MSB) to left (MSB) mapping is used, this reversal of bit order allows to generate the clrmsk by right shifting X'FF7F' in a register which is 32 bit wide. This leads to a very compact inner loop

    SIEVI    LR    R2,R3              i
             NR    R2,R10             i&0x7
             LR    R1,R11             0xff7f
             SRL   R1,0(R2)           0xff7f>>(i&0x7)
             LR    R2,R3              i
             SRL   R2,3               i>>3
             IC    R0,0(R2,R9)        prime[i>>3]
             NR    R0,R1              & 0xff7f>>(i&0x7)
             STC   R0,0(R2,R9)        prime[i>>3] &= 0xff7f>>(i&0x7)
             BXLE  R3,R6,SIEVI

See also Author's Note.

Pascal - soeq_pas.pas

Pascal offers handling of sets. Because the sieve algorithm works on the set of natural numbers it's tempting to use Pascal sets. The book 'Pascal User Manual and Report, 2nd Edition', published in 1975 by Springer, has indeed on page 53 a prime search algorithm only based on sets.

Early Pascal implementations often used a fixed size bit pattern to represent sets, with some natural word size. The CDC6600 compiler used 60 bits, the Stanford compiler for MVS a double word with 64 bits.

That makes it natural to implement the sieve as an array of sets. There is even an example of this variant in the 2nd Edition book.

Some benchmarking revealed that set operations are relatively expensive, despite their apparent simplicity with fixed underlying set size. Further benchmarking showed that for sets operators

  • <= is faster than in
  • * is faster than -

The actual Pascal implementation differs therefore from the sketch given in the 2nd Edition book, and is in fact in spirit quite similar to the C implementation.

PL/I offers, as only language available on tk4-, direct handling of bits. An array of BIT(1) is implemented as a packed bit array. Therefore the byte version soep_pli.pli and the bit version soeq_pli.pli just differ by the base type of the sieve array and minor adoptions due to this type change.

The PL/I compiler available with MVS3.8J restricts array bounds to 16 bit integer values. To avoid this 64k storage limitation the PRIME array is two-dimensional. In addition, the compiler restricts the size of global statict aggregates to 2 MByte, larger allocations result in a

  IEM1088I  THE SIZE OF AGGREGATE PRIME IS GREATER THAN 2,097,151 BYTES.
            STORAGE ALLOCATION WILL BE UNSUCCESSFUL.

Local aggregates have no size limit, the sieve algorithm is therefore implemented in a separate PROC where the PRIME array is a local object. The dimensions are chosen such that the index calculation can be efficiently done (MOD is inlined by the compiler):

    DCL PRIME(0:JMAX,0:8191)  BIT(1);
    ...
    DO I=IMIN TO IMAX BY N;
      PRIME(I/8192,MOD(I,8192)) = '0'B;
    END;

The access the BIT(1) array is implemented via run-time library function calls, inspection of the generated assembler code shows

      PRIME(...) = '1'B;    --> IHEIOAT
      PRIME(...) = '0'B;    --> IHEBSKA
      IF (PRIME(...))       --> IHEBSD0

The extra code for index splitting and the rather indirect way the BIT(1) is accessed cost a lot of CPU cycles. As result the seoq PL/I code is rather slow.

The jobs directory contains four types of jobs for soeq named

soeq_*_t.JES     --> print primes up to 100k (or implementation limit)
soeq_*_f_10.JES  --> print number of primes up to 10M
soeq_*_f.JES     --> print number of primes up to 100M
soeq_*_p.JES     --> print primes up to 10M (print speed test)

The _t and _p jobs serve the same purpose as described for soep.

The soeq_*_f.JES should in output the equivalent of

pi(        10):          4
pi(       100):         25
pi(      1000):        168
pi(     10000):       1229
pi(    100000):       9592
pi(   1000000):      78498
pi(  10000000):     664579
pi( 100000000):    5761455

An initial round of benchmark tests was done in July 2018

  • on an Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz (Quad-Core with HT)
  • using tk4- update 08
  • staring hercules with NUMCPU=2 MAXCPU=2 ./mvs
  • using CLASS=C jobs, thus only one test job running at a time

The key result is the GO-step time of the soeq_*_f type jobs, as packaged for a 100M search, for different compilers. The table is sorted from fastest to slowest results and shows in the last column the time normalized to the fastest case (asm):

Compiler ID 100M search */asm
asm 8.12 1.00
gcc 13.87 1.70
jcc 20.77 2.55
pas 42.00 5.17
pli 159.35 19.62

See also the benchmark summary for an overview table and a compiler ranking.

It's interesting to compare for the 10M searches the times with the simpler soep algorithm

Compiler ID 10M soeq 10M soep seoq/soep
asm 0.83 0.45 1.84
gcc 1.19 0.76 1.57
jcc 1.67 1.02 1.64
pas 4.19 2.12 1.98
pli 15.05 3.00 5.02

The assembler code soeq_asm.asm was much inspired by SYS2.JCLLIB(PRIMASM) in tk4-. This code is very compact and very elegant, and was for me the single most important help to get me back to writing System/370 assembler code after several decades. So again kudos to Juergen Winkelmann, the author of this code.

Nevertheless I wondered whether the speed of this highly tuned assembler implementation could be still improved. My suspicion was that the EX instruction is expensive, and that inline calculation of the bit masks is faster than a table look-up. To have a quantitative basis I wrote an instruction benchmark, which after a lot of work grew into s370_perf, which is now a GitHub project of its own under wfjm/s370-perf. With the instruction timings at hand it turned out that my suspicion was correct. So in the end my code likely beats Juergens code in speed, but it is by far not as elegant and compact.