Floating-Point Types
Overview
Rux provides a comprehensive set of IEEE 754-compliant floating-point primitive types, spanning from compact 8-bit representations to extended 512-bit precision. All float types are value types stored on the stack by default; heap allocation occurs only when boxed or placed inside a reference type.
Floating-point arithmetic in Rux follows IEEE 754-2019 semantics unless explicitly configured otherwise via compiler flags. This means results are deterministic, rounding modes are well-defined, and special values (Inf, -Inf, NaN) are natively supported across all float widths.
let x: float32 = 3.14;
let y: float64 = 2.718281828459045;
let z: float = 1.0; // defaults to float64
Float Types
Rux defines eight distinct float types. Each type maps to a fixed-width IEEE 754 (or extended) format:
| Type | Size | Approximate range | Decimal precision |
|---|---|---|---|
| float8 | 1 byte | ±1.6 × 10⁻² to ±4.5 × 10² | ~1–2 digits |
| float16 | 2 bytes | ±6.1 × 10⁻⁵ to ±6.5 × 10⁴ | ~3–4 digits |
| float32 | 4 bytes | ±1.2 × 10⁻³⁸ to ±3.4 × 10³⁸ | ~7 digits |
| float64 | 8 bytes | ±2.2 × 10⁻³⁰⁸ to ±1.8 × 10³⁰⁸ | ~15–16 digits |
| float80 | 10 bytes | ±3.4 × 10⁻⁴⁹³² to ±1.2 × 10⁴⁹³² | ~18–19 digits |
| float128 | 16 bytes | ±3.4 × 10⁻⁴⁹³² to ±1.2 × 10⁴⁹³² | ~33–34 digits |
| float256 | 32 bytes | ±1.1 × 10⁻⁷⁸⁹¹³ to ±1.6 × 10⁷⁸⁹¹³ | ~71–72 digits |
| float512 | 64 bytes | ±1.7 × 10⁻⁷⁸⁹¹³ to ±1.3 × 10⁷⁸⁹¹³ | ~148–150 digits |
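The precision column can be spot-checked for the two hardware-native widths in any IEEE 754 environment. The following is an illustration in Python rather than Rux (CPython floats are IEEE 754 binary64, and `struct` packs binary32):

```python
import struct

def to_f32(x: float) -> float:
    # Round a binary64 value to binary32, then widen it back losslessly
    return struct.unpack('f', struct.pack('f', x))[0]

pi = 3.141592653589793
print(f"{to_f32(pi):.17g}")  # ~7 accurate digits: 3.1415927410125732
print(f"{pi:.17g}")          # ~16 accurate digits: 3.141592653589793
```

The same round-trip trick is a convenient way to measure how many digits any narrower format actually preserves.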
float8
Minimal precision; intended for machine-learning weight storage, lookup tables, and other scenarios where memory density outweighs accuracy requirements. Not suitable for general arithmetic.
let w: float8 = 1.5;
⚠️ float8 arithmetic is subject to heavy quantization error. Avoid in numerical algorithms.
float16
Half-precision. Commonly used in GPU compute kernels, graphics pipelines, and ML inference. Arithmetic operations may be emulated in software on CPU targets that lack native FP16 instructions.
let half: float16 = 0.0625;
float32
Single-precision. The standard type for graphics, game physics, signal processing, and any domain where ~7 significant digits are sufficient.
let pi: float32 = 3.1415927;
float64
Double-precision. The default float type in Rux. Suitable for scientific computing, financial calculations, and most general-purpose work where precision matters.
let e: float64 = 2.718281828459045;
float80
Extended-precision, matching the x87 FPU 80-bit format. Provides extra guard bits during intermediate calculations. Available only on targets that support it; will fall back to float128 on unsupported architectures.
let precise: float80 = 1.234567890123456789;
float128
Quadruple-precision according to IEEE 754. Useful for compensated summation, high-accuracy numerical integration, and as a reference implementation when validating lower-precision algorithms. Software-emulated on most hardware.
let quad: float128 = 1.0f128; // suffix denotes float128 literal
float256
Octuple-precision. Intended for symbolic mathematics, cryptographic applications, and extreme-precision scientific work. Always software-emulated; expect significant performance overhead.
let ultra: float256 = 1.0f256; // suffix denotes float256 literal
float512
Maximum-precision float in Rux. Reserved for specialized research, arbitrary-precision scaffolding, and long-running simulations where error accumulation must be tightly bounded. Always software-emulated; not recommended for hot paths.
let ext: float512 = 1.0e512; // explicit type annotation required
Default Float Type
When a floating-point literal is written without an explicit type annotation, Rux infers float64:
let a = 1.0; // float64
let b = 3.14; // float64
let c: float = 0.5; // float is an alias for float64
float is a built-in type alias:
type float = float64; // compiler intrinsic alias
To select a different default, use an explicit annotation or a literal suffix:
let a: float32 = 1.0;
let b = 1.0f32; // float32 via suffix
let c = 1.0f16; // float16 via suffix
let d = 1.0f128; // float128 via suffix
Float Literals
Decimal Literals
let a = 1.0;
let b = -0.5;
let c = 3.141592653589793;
Scientific Notation
let avogadro = 6.022e23; // 6.022 × 10²³
let planck = 6.626e-34; // 6.626 × 10⁻³⁴
let big = 1.5E+10;
Literal Suffixes
| Suffix | Type |
|---|---|
| f8 | float8 |
| f16 | float16 |
| f32 | float32 |
| f64 | float64 |
| f80 | float80 |
| f128 | float128 |
| f256 | float256 |
| (none) | float64 |
let a = 1.0f32; // float32
let b = 1.0f16; // float16
let c = 1.0f128; // float128
Type Conversion
Implicit Widening
Rux allows implicit widening conversions where no precision is lost:
let a: float32 = 1.5;
let b: float64 = a; // OK — widening, implicit
let c: float128 = b; // OK — widening, implicit
Widening order: float8 → float16 → float32 → float64 → float80 → float128 → float256 → float512
Explicit Narrowing
Narrowing conversions must be explicit to prevent silent precision loss:
let a: float64 = 3.141592653589793;
let b: float32 = a as float32; // explicit cast required
let c: float16 = a as float16; // explicit cast required
Casting to a narrower type rounds to nearest (ties to even) by default.
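The ties-to-even rule can be observed directly in any IEEE 754 environment. A Python sketch, not Rux, using `struct` to narrow values that lie exactly halfway between two adjacent float32 values (CPython floats are binary64):

```python
import struct

def to_f32(x: float) -> float:
    # Narrow binary64 to binary32 (round-to-nearest, ties-to-even), widen back
    return struct.unpack('f', struct.pack('f', x))[0]

ulp = 2.0 ** -23  # spacing between adjacent float32 values just above 1.0

# Exactly halfway between 1.0 (even significand) and 1.0 + ulp (odd): rounds down
print(to_f32(1.0 + ulp / 2) == 1.0)              # True
# Exactly halfway between 1.0 + ulp (odd) and 1.0 + 2*ulp (even): rounds up
print(to_f32(1.0 + 1.5 * ulp) == 1.0 + 2 * ulp)  # True
```

Ties go to the neighbor whose significand ends in a zero bit, which keeps rounding statistically unbiased over many operations.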
Conversion Between Float and Integer Types
let f: float64 = 42.9;
let i: int32 = f as int32; // truncates toward zero → 42
let n: int64 = 100;
let g: float64 = n as float64; // widening int→float
⚠️ Casting NaN or Inf to an integer type produces implementation-defined behavior and will trigger a runtime panic in debug builds.
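The truncation and special-value hazards can be reproduced in most host languages. A Python sketch (Python's int() conversion also truncates toward zero; unlike the debug-panic behavior described above, Python rejects NaN and Inf with exceptions):

```python
import math

# float → int conversion truncates toward zero
print(int(42.9))    # 42
print(int(-42.9))   # -42 (toward zero, not floor)

# NaN and Inf have no integer value; Python raises instead of panicking
try:
    int(math.nan)
except ValueError:
    print("NaN → int rejected")
try:
    int(math.inf)
except OverflowError:
    print("Inf → int rejected")
```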
Arithmetic Operations
All standard arithmetic operators are defined for float types. Mixed float and integer operands require explicit casts; mixed-width float operands widen implicitly (see Type Conversion).
Operators
| Operator | Description | Example |
|---|---|---|
+ | Addition | 1.0 + 2.0 |
- | Subtraction | 5.0 - 3.0 |
* | Multiplication | 2.0 * 3.0 |
/ | Division | 10.0 / 4.0 |
% | Floating remainder | 5.5 % 2.0 |
- | Unary negation | -x |
** | Exponentiation | 2.0 ** 10.0 |
let a: float64 = 10.0;
let b: float64 = 3.0;
let sum = a + b; // 13.0
let diff = a - b; // 7.0
let prod = a * b; // 30.0
let quot = a / b; // 3.333...
let rem = a % b; // 1.0
let pow = a ** b; // 1000.0
Division Behavior
1.0 / 0.0 // → +Inf (positive infinity)
-1.0 / 0.0 // → -Inf (negative infinity)
0.0 / 0.0 // → NaN
Division by zero does not throw an exception; it produces Inf or NaN per IEEE 754.
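Once produced, these special values flow through subsequent arithmetic identically in any IEEE 754 implementation. A Python sketch (note that CPython raises ZeroDivisionError for a literal 1.0 / 0.0 rather than returning infinity, so math.inf is constructed directly; the propagation below is pure IEEE 754):

```python
import math

inf = math.inf
print(inf + 1.0)             # inf
print(-1.0 * inf)            # -inf
print(math.isnan(inf - inf)) # True: Inf minus Inf is undefined
print(math.isnan(0.0 * inf)) # True: zero times Inf is undefined
```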
Compound Assignment
var x: float64 = 10.0;
x += 5.0; // 15.0
x -= 3.0; // 12.0
x *= 2.0; // 24.0
x /= 4.0; // 6.0
x **= 2.0; // 36.0
Comparison Operations
Standard Comparisons
let a: float64 = 1.0;
let b: float64 = 2.0;
a == b // false
a != b // true
a < b // true
a <= b // true
a > b // false
a >= b // false
NaN Comparison Rules
Every comparison involving NaN returns false, except !=, which returns true. NaN is not equal to anything, including itself:
let n = Float.NaN;
n == n // false ← NaN is never equal to itself
n != n // true
n < 1.0 // false
n > 1.0 // false
Always use Float.IsNaN() to check for NaN:
if Float.IsNaN(value)
{
// Handle NaN case
}
Best Practices, Common Pitfalls & Recommendations
✅ Choose the Right Precision
- Use float32 for graphics, physics simulations, and ML model weights where memory bandwidth matters.
- Use float64 (the default) for general-purpose numeric code, scientific work, and financial totals.
- Reserve float128+ for reference implementations, symbolic math, or when validating lower-precision algorithms.
- Avoid float8 and float16 outside of hardware-accelerated contexts (GPU, neural accelerators).
✅ Never Use == on Computed Floats
// ❌ Unreliable
if (0.1 + 0.2 == 0.3) { ... }
// ✅ Reliable
if (Float.approx_eq(0.1 + 0.2, 0.3)) { ... }
✅ Always Handle NaN Explicitly
// ❌ Silently wrong — NaN comparisons always return false
if (result < threshold) { ... }
// ✅ Explicit
if (Float.IsNaN(result)) {
HandleError()
} else if (result < threshold) {
...
}
✅ Prefer Math.Fma() for Precision-Critical Multiply-Add
// ❌ Two rounding errors
let r = a * b + c;
// ✅ One rounding error (fused multiply-add)
let r = Math.Fma(a, b, c);
✅ Be Aware of Catastrophic Cancellation
Subtracting two nearly equal numbers can cause massive relative error:
let a: float64 = 1.000000001;
let b: float64 = 1.000000000;
let diff = a - b; // leading digits cancel; relative error explodes
Use compensated summation (Kahan) or higher-precision intermediates when needed.
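Kahan's compensated summation keeps a second accumulator holding the low-order bits that each addition rounds away. A Python sketch (CPython floats are IEEE 754 binary64, so this mirrors float64 behavior; math.fsum serves as a correctly rounded reference):

```python
import math

def kahan_sum(values):
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # re-inject the bits lost on the previous step
        t = total + y
        comp = (t - total) - y  # recover what this addition rounded away
        total = t
    return total

data = [0.1] * 10
print(sum(data))        # naive: 0.9999999999999999
print(kahan_sum(data))  # compensated: 1.0
print(math.fsum(data))  # correctly rounded reference: 1.0
```

The compensation step relies on floating-point subtraction of nearby values being exact, so an optimizer must not "simplify" `(t - total) - y` to zero.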
✅ Use Literal Suffixes for Clarity
// ❌ Ambiguous — requires reading the annotation
let x: float32 = 1.0;
// ✅ Self-documenting
let x = 1.0f32;
✅ Avoid Mixed-Width Arithmetic Without Intent
Mixed expressions silently widen:
let a: float32 = 1.0f32;
let b: float64 = 2.0;
// a is widened to float64 before addition
let c = a + b; // float64
Be explicit when mixing types to document intent.
⚠️ Infinity Propagates
let inf = Float.Inf;
inf + 1.0 // → Inf
inf * -1.0 // → -Inf
inf - inf // → NaN ← be careful
inf * 0.0 // → NaN ← be careful
⚠️ Performance of Wide Floats
float128, float256, and float512 are software-emulated on virtually all current hardware. Expect 10–100× slowdowns compared to float64. Profile before committing to wide floats in hot paths.
Notes
IEEE 754 Compliance
- float16, float32, float64, and float128 are fully IEEE 754-2019 compliant interchange formats.
- float80 follows the Intel 80-bit extended format (not a standard IEEE 754 binary format).
- float8 follows the E4M3 variant (4 exponent bits, 3 mantissa bits) common in ML frameworks; behavior at the boundary differs slightly from standard IEEE 754 subnormals.
- float256 and float512 are Rux extensions beyond the IEEE 754-2019 standard; they use the same structural rules (sign, biased exponent, significand) but are not standardized externally.
Rounding Mode
The default rounding mode is round-to-nearest, ties-to-even (IEEE 754 roundTiesToEven). Alternate rounding modes can be set per-scope:
import float.rounding
float.rounding.with_mode(.toward_zero) {
// all float ops in this block use truncation
let x = 3.9f32 as int32; // → 3
}