view tagu.txt @ 516:4cd7d914b0e3

rebase to current hsx/hotspot-comp
author jrose
date Sun, 06 Oct 2013 23:32:13 -0700
line wrap: on
line source
Notes for tagged unions in the JVM
John Rose, 9/2011

 - near-complete coverage of jlong range, without gaps
 - complete coverage of jdouble range (via longBitsToDouble)
 - support object heap sizes 1000 .. 10^6 times larger than present
 - fast conversion to/from reference
 - fast conversion to/from jlong
 - fast conversion to/from jdouble (via longBitsToDouble, etc.)
 - allow code space (~ 33 bits) for primitive values other than longs and references
 - portable specification of jlong range
 - portable specification of jdouble range (except perhaps for NaN bits)
 - for non-float values, occupy unused jdouble code points (there are 2^53 - 3)
 - if data loss occurs (because of silent truncation), magnitudes must be preserved

Key architectural constant:  MAX_UNION_LONG_VALUE = 0x7ff0_000000000000.
This value is chosen so that there are enough code points to represent all doubles.
All values in MAX_UNION_LONG_VALUE+1 to Long.MAX_VALUE are reserved for references.
This is true regardless of the internal representation of the tagged union.
The internal representation will in general have unused code-points.

The relevant user-visible ranges of primitives are thus:
  0x8000_000000000000 .. 0x7ff0_000000000000 : partial jlong, full jdouble
  0x7ff0_000000000001 .. 0x7fff_fffffffffffe : illegal values (never encoded or decoded)
  0x7fff_ffffffffffff                        : sentinel value (never encoded)

The sentinel value (Long.MAX_VALUE) may be returned from the non-signaling
decode operations, unionToLongSafe and unionToDoubleSafe.

Key implementation choice:  Which type is favored by being stored natively?
If longs, we say "long-native".  This also happens to be "double-native",
since all points above MAX_UNION_LONG_VALUE encode the single NaN value.
Decoding a long-native representation to a reference costs a test and add.

If addresses are stored natively (with padding), we say "address-native".
If addresses occupy one 50-bit range, decoding a long costs a test and add.
Address-native is the only representation that works well with Hotspot's "oops_do" APIs.

If the padding is all zero, we say "positive-address-native".
In this case, the representation is not "double-native", since native references
when loaded as doubles represent denormalized numbers or zero.

The bias for decoding must be such that for a pointer p, p+encBias is somewhere
in the illegal jlong value range.  Of course decBias = uint64_t(-encBias).

Preferred "positive-address-native" representation as padded address (jlong/intptr_t):
  0x8000_000000000000 .. 0xfff1_000000000000 : biased jlong/jdouble
  0xfff1_000000000001 .. 0xfffe_ffffffffffff : unused code points (never stored)
  0xffff_000000000000 .. 0xffff_ffffffffffff : natural negative address (2^48-1 code points)
  0x0000_000000000000                        : natural null address
  0x0000_000000000001 .. 0x0000_ffffffffffff : natural positive address (2^48-2 code points)
  0x0001_000000000000 .. 0x7fff_ffffffffffff : biased jlong/jdouble

The encoding and decoding biases are 0x8001_000000000000 and 0x7fff_000000000000.
  0x8000_000000000000 .. 0x7ff0_000000000000 : partial jlong (decoded)
  0x0001_000000000000 :: 0xfff1_000000000000 : encoded (wrapped range)
  0xfff1_000000000001 .. 0x0000_ffffffffffff : non-jlong code points (range complement)

If pointer signs are negative, the representation is "negative-address-native".
In this case, the pointer padding is not zeroes but ones, and the bias differs.
The null address (all zero bits) must be adjoined to the negative range after -1.

(The "negative-address-native" representation can also be "double-native",
since such values are NaNs.  They include small-magnitude negative jlongs
like jlong(-1), and so are are key values of the jlong range.
Unless long and double bit patterns are decoupled, treating such values
as native doubles does not pay off.  Solaris is notable for allocating
negative addresses to managed heaps.  Most OSs allocate positive addresses.
Also, all 32-bit addresses can be treated as positive regardless of bit 31.)

JVM structure:
  BasicType: T_UNION = 3
  signature sytax: 'U'
  occupies two stack slots, like long and double
  garbage collector can find embedded references, on stack/locals/registers/heap

API in package sun.misc:
class Union {
  public static final long MAX_UNION_LONG_VALUE = 0x7ff0_000000000000;

  public static final long NO_UNION_LONG_VALUE  = 0x7fff_ffffffffffff;
  // NO_UNION_LONG_VALUE is similar in magnitude to MAX_UNION_LONG_VALUE
  // it is also a NaN, when viewed as a double

  public static boolean isReference(U<?> x) { ... }

  public static U<?> longToUnion(long x) throws IllegalArgumentException { encodes the long }
  public static U<?> doubleToUnion(double x) { encodes the double (no error) }
  public static <T> U<T> referenceToUnion(T x) { encodes the reference (no error) }

  public static long unionToLong(U<?> x) throws IllegalUnionException { returns long value }
  public static double unionToDouble(U<?> x) throws IllegalUnionException { returns double value }
  public static <T> T unionToReference(U<T> x) throws IllegalUnionException { returns reference }

  public static long unionToLongSafe(U<?> x) { returns long value else NO_UNION_LONG_VALUE }
  public static double unionToDoubleSafe(U<?> x) { returns double value else NaN }
  public static <T> T unionToReferenceSafe(U<T> x) { returns reference else null }
class IllegalUnionException extends RuntimeException { ... }

Details of representation (assuming decoding and encoding biases above):

private final long __ENCODING_BIAS = 0x8001_000000000000;
private final long __DECODING_BIAS = 0x7fff_000000000000;
assert((long)(__ENCODING_BIAS + __DECODING_BIAS) == 0);

private final long __ADDRESS_LIMIT = 0x0001_000000000000;
private final bool __NEGATIVE_ADDRESSES = true;  // can set false in ILP32, Windows, etc.

int64_t union_is_object(int64_t ux) {
  bool r0 = (ux + __DECODING_BIAS) > MAX_UNION_LONG_VALUE;
  bool r1 = ((uint64_t)ux >> 48 == 0x0000
             || (__NEGATIVE_ADDRESSES && (uint64_t)ux >> 48 == 0xffff));
  bool r2 = ((uint64_t)ux < (uint64_t)__ADDRESS_LIMIT
             || (__NEGATIVE_ADDRESSES && (uint64_t)ux > -__ADDRESS_LIMIT));
  bool r3 = ((uint64_t)ux + (__NEGATIVE_ADDRESSES ? __ADDRESS_LIMIT : 0)
             < (uint64_t)__ADDRESS_LIMIT);
  assert(r0 == r1 && r1 == r2 && r2 == r3);
  return r1;

jlong union_to_jlong(int64_t ux, bool safe = false) {
  jlong x = ux + __DECODING_BIAS;
  if (x <= MAX_UNION_LONG_VALUE) {
    return x;
  if (!safe)  throw new IllegalUnionException("unexpected reference");

int64_t jlong_to_union(jlong x, bool safe = false) {
    return x + __ENCODING_BIAS;
  if (!safe)  throw new IllegalUnionException("unexpected reference");
  // convert to null reference
  assert(union_to_object(0) == (object)null);
  return 0;

jdouble union_to_jdouble(int64_t ux, bool safe = false) {
  return jdouble_cast(union_to_jlong(ux, safe));

int64_t jdouble_to_union(jdouble dx) {
  bool safe = true;  // always safe to yield a NaN
  return jlong_to_union(jlong_cast(dx, safe));

object union_to_object(int64_t ux, bool safe = false) {
  if (union_is_object(ux))
    return object_cast((uintptr_t)(uint64_t) ux);
  if (!safe)  throw new IllegalUnionException("unexpected non-reference");
  return null;

int64_t object_to_union(object rx) {
  // this is an error-free operation; encode all references, including null
  return object_uncast(rx);

object object_cast(uintptr_t x) {
  if (x == 0) {
    assert((object)x == (object)null);
  } else {
    x *= OBJECT_HEAP_SCALE;  // might be a no-op
    x += OBJECT_HEAP_BASE;   // might be a no-op
  return (object)x;

uintptr_t object_uncast(object rx) {
  uintptr_t x = ((uintptr_t)rx);
  if (rx == (object)null) {
    assert(x == 0);
  } else {
    x -= OBJECT_HEAP_BASE;   // might be a no-op
    x /= OBJECT_HEAP_SCALE;  // might be a no-op
    assert(x > 0 && x < __ENCODING_BIAS);
  return x;

typedef void oops_do_function(object* x);
void union_oops_do(int64_t *field, oops_do_function* fn) {
  // if primitive or null, return immediately
  if (!union_is_object(*field) || (*field) == 0)  return;
  object* objAddr = (object*) field;
  if (sizeof(object) != sizeof(int64_t) && __BIG_ENDIAN)
    objAddr = (object*) (field+1) - 1;

FTR, relevant ranges which are native to the double format:
  0x0000_000000000000 .. 0x7fef_ffffffffffff : positive rationals
  0x7ff0_000000000000                        : positive infinity
  0x7ff0_000000000001 .. 0x7ff7_ffffffffffff : nonstandard NaNs
  0x7ff8_000000000000                        : standard NaN (for Java)
  0x7ff8_000000000001 .. 0x7fff_ffffffffffff : nonstandard NaNs
  0x8000_000000000000                        : negative zero
  0x8000_000000000001 .. 0xffef_ffffffffffff : negative rationals
  0xfff0_000000000000                        : negative infinity
  0xfff0_000000000001 .. 0xffff_ffffffffffff : nonstandard NaNs

All NaN ranges are available as code points for non-double values.
Only one NaN code point needs to be reserved to represent all NaNs in common.

Non-preferred "long-native" representation using jlong (int64_t):
  0x8000_000000000000 .. 0x7ff0_000000000000 : native partial jlong; native full jdouble
  0x7ff0_000000000001 .. 0x7fff_ffffffffffff : biased reference (2^52 - 1 code points)
This is non-preferred because the biased references are hard for the Hotspot GC to process.

Non-preferred "unsigned-long-native" representation using non-standard type "julong" (u8/uint64_t)
  0x0000_000000000000 .. 0xfff0_000000000000 : partial unsigned long; full jdouble
  0xfff0_000000000001 .. 0xffff_ffffffffffff : biased reference (2^52 - 1 code points)
This is worse than "long-native" because it opens a hole in the middle of the jlong range.