Chasing Panaitopol Primes
From Brute Force Pain to Elegant Math
contents
- the problem
- the naïve beginning - two loops, too much pain
- first algebraic simplification (still stuck with two variables)
- the breakthrough - variable substitution
- bounding the search space
- the implementation evolution
- the parallelization experiments
- micro-optimizations that mattered
- the final performance journey
- key lessons learned
- the complete optimized solution
- final thoughts
- bonus: what's next?
I've been working through Project Euler problems, and Problem 291 ("Panaitopol primes") sent me down a fascinating rabbit hole. What started as a naïve double loop taking hours transformed into a 5-second solution through mathematical insight, careful optimization, and some surprising lessons about parallelization. Here's the complete journey.
The Problem
A Panaitopol prime is a prime number that can be expressed as:
$p = \frac{a^4 - b^4}{a^3 + b^3}$
where a and b are positive integers with a > b.
The challenge: Find how many Panaitopol primes exist below $5 \times 10^{15}$. Yeah, that's a 5 followed by 15 zeroes: 5_000_000_000_000_000. To put that in perspective, light would need about 528.5 years to travel that many kilometers, which is roughly how long light from Betelgeuse takes to reach Earth.
The Naïve Beginning - Two Loops, Too Much Pain
Looking at the formula for Panaitopol primes:
$p = \frac{a^4 - b^4}{a^3 + b^3}$
My first instinct was the obvious one - just try all possible (a, b) pairs. But here's the catch: not all (a, b) pairs produce integer results! The division needs to be exact. So I implemented it with remainder checking:
```python
from sympy import isprime

def panaitopol(x, y):
    x3 = x ** 3
    x4 = x3 * x
    y3 = y ** 3
    y4 = y3 * y
    # divmod returns (quotient, remainder) in one call
    return divmod(x4 - y4, x3 + y3)

primes = set()
for i in range(1, 5001):
    for j in range(1, i):
        pp, remainder = panaitopol(i, j)
        if remainder == 0 and isprime(pp):
            primes.add(pp)
```
Even with bounds of just 5,000, this was painfully slow. That's $O(n^2)$ candidate pairs, and at the actual problem limit this approach would simply never finish. Plus, I had to check divisibility for each pair - double the pain!
First Algebraic Simplification (Still Stuck with Two Variables)
I tried simplifying the expression algebraically. Let's start with the original formula and factor both the numerator and denominator:
$p = \frac{a^4 - b^4}{a^3 + b^3}$
For the numerator, we can use the difference of squares twice:
$a^4 - b^4 = (a^2)^2 - (b^2)^2 = (a^2 - b^2)(a^2 + b^2)$
And then factor $a^2 - b^2$ further:
$a^4 - b^4 = (a - b)(a + b)(a^2 + b^2)$
For the denominator, we use the sum of cubes formula:
$a^3 + b^3 = (a + b)(a^2 - ab + b^2)$
So our fraction becomes:
$p = \frac{(a - b)(a + b)(a^2 + b^2)}{(a + b)(a^2 - ab + b^2)}$
We can cancel the common factor $(a + b)$:
$p = \frac{(a - b)(a^2 + b^2)}{a^2 - ab + b^2}$
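If you'd rather let a computer double-check the algebra, here's a quick numeric spot-check (a minimal Go sketch of my own; it compares the two fractions by cross-multiplying, so no division or divisibility assumptions are needed):

```go
package main

import "fmt"

func main() {
    // The two forms are equal iff lhsNum*rhsDen == rhsNum*lhsDen.
    // Small a, b keep every product comfortably inside uint64.
    for a := uint64(2); a <= 50; a++ {
        for b := uint64(1); b < a; b++ {
            lhsNum := a*a*a*a - b*b*b*b
            lhsDen := a*a*a + b*b*b
            rhsNum := (a - b) * (a*a + b*b)
            rhsDen := a*a - a*b + b*b
            if lhsNum*rhsDen != rhsNum*lhsDen {
                fmt.Println("factoring mistake at", a, b)
                return
            }
        }
    }
    fmt.Println("simplification verified for all a > b up to 50")
}
```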
This was cleaner, but computationally? Still O(n²). I still needed to check every (a, b) pair, and the formula still had two independent variables. The key insight I was missing: I needed to find a relationship between a and b that would guarantee integer results and reduce the problem to one variable.
The Breakthrough - Variable Substitution
This was the key insight! What if instead of treating a and b as independent variables, I could couple them? I tried substituting:
- a = n + k
- b = n
This transforms our simplified formula into:
$p = \frac{(n + k - n)\left((n + k)^2 + n^2\right)}{(n + k)^2 - n(n + k) + n^2}$
Simplifying the numerator:
$\text{numerator} = k \left((n + k)^2 + n^2\right)$
And the denominator becomes:
$\text{denominator} = (n + k)^2 - n(n + k) + n^2$
Expanding the denominator:
$= n^2 + 2nk + k^2 - n^2 - nk + n^2$
$= n^2 + nk + k^2$
Now, for this to produce an integer, the denominator has to divide the numerator. Through experimentation (and some mathematical analysis), I discovered that k = 1 is where something magical happens.
Let's look at k = 1:
$\text{numerator} = 1 \times \left((n + 1)^2 + n^2\right) = 2n^2 + 2n + 1$
$\text{denominator} = n^2 + n + 1$
At first glance this looks like a dead end: numerator and denominator are coprime, since $2(n^2 + n + 1) - (2n^2 + 2n + 1) = 1$, so the fraction never reduces on its own. The magic is that the original formula is homogeneous of degree 1: scaling both a and b by a constant c scales p by exactly c. Choosing $c = n^2 + n + 1$, i.e. $a = (n + 1)(n^2 + n + 1)$ and $b = n(n^2 + n + 1)$, cancels the denominator and leaves an integer:
$p = n^2 + (n + 1)^2 = 2n^2 + 2n + 1$
And a short gcd argument shows this construction captures every Panaitopol prime: writing $a = gm$, $b = gn$ with $\gcd(m, n) = 1$, the factor $m^2 - mn + n^2$ is coprime to both $m - n$ and $m^2 + n^2$, so it must divide g, and primality then forces $m - n = 1$.
This changed everything! From O(n²) to O(n) - a massive algorithmic improvement. Instead of checking all (a, b) pairs, I only needed to iterate through values of n.
```python
def panaitopol_simplified(n):
    # Or equivalently: 2 * n * (n + 1) + 1
    return n * n + (n + 1) * (n + 1)
```
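To convince yourself that the scaled construction from the derivation really does feed back into the original formula, here's a quick spot-check (a minimal Go sketch; n stays small so that a^4 fits in a uint64):

```go
package main

import "fmt"

func main() {
    // Build a = (n+1)(n²+n+1), b = n(n²+n+1) and plug them into the
    // ORIGINAL formula; the result should equal 2n² + 2n + 1.
    for n := uint64(1); n <= 20; n++ {
        c := n*n + n + 1
        a, b := (n+1)*c, n*c
        num := a*a*a*a - b*b*b*b
        den := a*a*a + b*b*b
        want := 2*n*(n+1) + 1
        if num%den != 0 || num/den != want {
            fmt.Println("mismatch at n =", n)
            return
        }
    }
    fmt.Println("construction verified for n = 1..20")
}
```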
Bounding the Search Space
With the simplified formula, I could now bound the search space properly:
$2n^2 + 2n + 1 < 5 \times 10^{15}$
Solving for n:
$n < \sqrt{5 \times 10^{15} / 2} \approx 5 \times 10^7$
Only 50 million values to check instead of exploring a 2D space!
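In code I computed this bound defensively, since a floating-point square root can be off by one at this magnitude. A small Go sketch of the idea:

```go
// Largest n with 2n(n+1)+1 < limit: start from the float estimate,
// then nudge it to correct for possible rounding error.
maxN := uint64(math.Sqrt(float64(limit) / 2))
for 2*maxN*(maxN+1)+1 >= limit {
    maxN--
}
for 2*(maxN+1)*(maxN+2)+1 < limit {
    maxN++
}
```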
The Implementation Evolution
Python + SymPy (6 minutes)
```python
from sympy import isprime
from math import isqrt
from tqdm import tqdm

LIMIT = 5 * (10**15)
count = 0
# Bound n by sqrt(LIMIT / 2), not sqrt(LIMIT), so we don't count
# primes at or above the limit
for n in tqdm(range(1, isqrt(LIMIT // 2) + 1)):
    p = 2 * n * (n + 1) + 1
    if p < LIMIT and isprime(p):
        count += 1
```
First Go Implementation (43 seconds)
Using Go's big.Int.ProbablyPrime:
```go
var b big.Int
count := 0
for n := uint64(1); n <= maxN; n++ {
    p := 2*n*(n+1) + 1
    b.SetUint64(p)
    if b.ProbablyPrime(1) {
        count++
    }
}
```
Optimized Go with Custom Miller-Rabin (5 seconds!)
The game-changer was implementing a specialized 64-bit Miller-Rabin test:
```go
func deterministicMillerRabin64(n uint64) bool {
    // Deterministic for all n < 2^64 using the first twelve prime
    // bases {2, 3, 5, ..., 37}
    // ... specialized implementation (sketched below)
}

// With bit manipulation optimization
p := ((n * (n + 1)) << 1) + 1
if deterministicMillerRabin64(p) {
    count++
}
```
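Here's what a complete version of that test can look like. This is a sketch rather than a verbatim copy of my code (the mulmod, powmod, and mrBases names are mine): math/bits provides overflow-free 64-bit modular multiplication, and the first twelve primes are a known deterministic witness set for every 64-bit input:

```go
package main // helpers live alongside the main program

import "math/bits"

// The first twelve primes. As Miller-Rabin bases they are a known
// deterministic witness set for every n < 2^64.
var mrBases = []uint64{2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37}

// mulmod returns (a * b) mod m without overflow, via the 128-bit
// product from math/bits (requires a, b < m).
func mulmod(a, b, m uint64) uint64 {
    hi, lo := bits.Mul64(a, b)
    _, rem := bits.Div64(hi, lo, m)
    return rem
}

// powmod returns base^exp mod m by binary exponentiation.
func powmod(base, exp, m uint64) uint64 {
    result := uint64(1)
    base %= m
    for exp > 0 {
        if exp&1 == 1 {
            result = mulmod(result, base, m)
        }
        base = mulmod(base, base, m)
        exp >>= 1
    }
    return result
}

func deterministicMillerRabin64(n uint64) bool {
    if n < 2 {
        return false
    }
    // Screen by the base primes first; this also guarantees every
    // base below is nonzero mod n.
    for _, p := range mrBases {
        if n%p == 0 {
            return n == p
        }
    }
    // Write n-1 = d * 2^s with d odd.
    d, s := n-1, 0
    for d&1 == 0 {
        d >>= 1
        s++
    }
    for _, a := range mrBases {
        x := powmod(a, d, n)
        if x == 1 || x == n-1 {
            continue
        }
        composite := true
        for i := 0; i < s-1; i++ {
            x = mulmod(x, x, n)
            if x == n-1 {
                composite = false
                break
            }
        }
        if composite {
            return false
        }
    }
    return true
}
```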
The Parallelization Experiments
Lessons from Over-Concurrentization
I experimented with varying the number of goroutines:
[Graph: time to run the program (seconds) vs. number of goroutines]
Key findings:
- Sweet spot: 8 goroutines (matching my CPU's logical cores)
- Diminishing returns: Beyond 8, context switching overhead dominates
- Over-concurrentization hurts: Too many goroutines actually degraded performance
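The experiment itself was nothing fancy - just timing a full run at each worker count, something like this sketch (it assumes the countPanaitopolPrimes helper from the final solution at the end of this post):

```go
// Time one full run per goroutine count to find the saturation point.
for _, workers := range []int{1, 2, 4, 8, 16, 32, 64, 128} {
    start := time.Now()
    count := countPanaitopolPrimes(5_000_000_000_000_000, workers)
    fmt.Printf("%3d goroutines: count=%d in %v\n", workers, count, time.Since(start))
}
```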
Failed Distributed Computing Attempts
I also tried:
- Python multiprocessing: Helped, but Python overhead was still significant
- PySpark: Framework overhead completely dominated the actual computation
- Spark in Google Colab: Even worse due to JVM startup costs
Lesson: Don't reach for distributed computing until you've exhausted single-machine optimizations.
Micro-Optimizations That Mattered
Bit Shifting vs Multiplication
```go
// Before
p := 2*n*(n+1) + 1

// After - ~1.67% improvement
p := ((n * (n + 1)) << 1) + 1
```
Over 50 million iterations, even 1-2% adds up!
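If you want to verify this kind of micro-difference on your own machine, Go's built-in benchmark harness is the honest way to do it. A minimal sketch (save it as a _test.go file; the sink variable is mine and exists to keep the compiler from discarding the loop body as dead code):

```go
package main

import "testing"

// sink defeats dead-code elimination so the benchmarks measure real work.
var sink uint64

func BenchmarkMultiply(b *testing.B) {
    for i := 0; i < b.N; i++ {
        n := uint64(i)
        sink = 2*n*(n+1) + 1
    }
}

func BenchmarkShift(b *testing.B) {
    for i := 0; i < b.N; i++ {
        n := uint64(i)
        sink = ((n * (n + 1)) << 1) + 1
    }
}
```

Run it with go test -bench=. and compare the ns/op columns.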
Trial Division Pre-filtering
```go
// Quick elimination of ~90% of composites
if !passesTrial(p) {
    continue
}
```
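The snippet assumes a passesTrial helper. A minimal sketch of what it can look like, using the small primes up to 61 that the pseudocode later in this post mentions (the trialPrimes name is illustrative):

```go
// Primes up to 61, matching the trial-division step in the pseudocode.
var trialPrimes = []uint64{
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
    31, 37, 41, 43, 47, 53, 59, 61,
}

// passesTrial reports whether p survives trial division by the small
// primes above; p being one of those primes itself counts as passing.
func passesTrial(p uint64) bool {
    for _, q := range trialPrimes {
        if p%q == 0 {
            return p == q
        }
    }
    return true
}
```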
The LRU Cache Trap
I tried adding @lru_cache to the Python version:
```python
from functools import lru_cache

@lru_cache(maxsize=None)
def panaitopol(x, y):
    ...  # same computation as before
```
Result: Performance got WORSE! Why? With 25 million unique calls and zero cache hits, I was just adding overhead. Lesson: caching only helps when you have repeated computations.
The Final Performance Journey
| Approach | Time | Speedup (vs. previous) | Notes |
| --- | --- | --- | --- |
| Python naïve O(n²) | Hours | 1x | Double nested loop |
| Python with safe division | Still hours | ~1x | Fixed correctness, not speed |
| Python simplified O(n) | 6 min | ~100x | Mathematical breakthrough! |
| Go with big.Int | 43 sec | ~8x | Language change |
| Go with custom Miller-Rabin | 5 sec | ~8.6x | Algorithm optimization |
| Total improvement | | ~7,200x | Math + Engineering |
Key Lessons Learned
- Mathematical insight beats raw compute power: The O(n²) to O(n) reduction was worth more than all parallelization combined.
- Premature parallelization is a root of evil: I tried Spark, considered Hadoop, experimented with channels - but the real wins came from algorithm and implementation improvements.
- Know your tools deeply:
  - Python's sympy.isprime: convenient but slow
  - Go's big.Int.ProbablyPrime: has overhead for small numbers
  - A custom Miller-Rabin for 64-bit integers: blazing fast
- Profile everything: My goroutine experiments revealed the exact saturation point of my hardware.
- Small optimizations matter at scale: Even changing 2*n to n<<1 gave measurable improvements over 50 million iterations.
- Caching isn't always the answer: My LRU cache experiment showed that blindly adding caching can hurt performance.
The Complete Optimized Solution
Here's the final algorithm in pseudocode that runs in under 5 seconds:
```
Algorithm: Count Panaitopol Primes
Input: LIMIT = 5 × 10^15

1. Calculate upper bound for n:
     max_n = floor(sqrt(LIMIT / 2))
2. Initialize count = 0
3. Split range [1, max_n] into chunks for parallel processing
4. For each chunk in parallel:
     For n from chunk_start to chunk_end:
       a. Calculate p = 2n(n+1) + 1
          (using bit shift: p = ((n × (n+1)) << 1) + 1)
       b. If p >= LIMIT, skip
       c. Quick trial division by small primes (2, 3, 5, ..., 61)
          If divisible by any, skip (unless p equals that prime)
       d. Apply deterministic Miller-Rabin with the first twelve
          prime bases {2, 3, 5, ..., 37} (deterministic for all n < 2^64)
       e. If prime, increment local count
5. Sum all local counts from parallel workers
6. Return total count
```
The key optimizations:
- Mathematical reduction: From O(n²) double loop to O(n) single loop
- Bit manipulation: Replace multiplication by 2 with left shift
- Trial division: Eliminate ~90% of composites cheaply
- Deterministic primality: No probabilistic false positives
- Parallelization: Use all CPU cores with balanced chunks
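Putting the pieces together, here's a condensed Go sketch of the whole pipeline. Treat it as an outline of my final program rather than a verbatim dump (it reuses the passesTrial and deterministicMillerRabin64 helpers sketched earlier):

```go
package main

import (
    "fmt"
    "math"
    "runtime"
    "sync"
    "sync/atomic"
)

// countPanaitopolPrimes counts primes of the form 2n(n+1)+1 below limit,
// splitting [1, maxN] into one contiguous chunk per worker goroutine.
func countPanaitopolPrimes(limit uint64, workers int) uint64 {
    // Largest n with 2n(n+1)+1 < limit, corrected for float rounding.
    maxN := uint64(math.Sqrt(float64(limit) / 2))
    for 2*maxN*(maxN+1)+1 >= limit {
        maxN--
    }
    for 2*(maxN+1)*(maxN+2)+1 < limit {
        maxN++
    }

    var total uint64
    var wg sync.WaitGroup
    chunk := maxN/uint64(workers) + 1
    for w := 0; w < workers; w++ {
        lo := uint64(w)*chunk + 1
        if lo > maxN {
            break
        }
        hi := lo + chunk - 1
        if hi > maxN {
            hi = maxN
        }
        wg.Add(1)
        go func(lo, hi uint64) {
            defer wg.Done()
            var local uint64
            for n := lo; n <= hi; n++ {
                p := ((n * (n + 1)) << 1) + 1 // 2n(n+1) + 1
                if passesTrial(p) && deterministicMillerRabin64(p) {
                    local++
                }
            }
            atomic.AddUint64(&total, local)
        }(lo, hi)
    }
    wg.Wait()
    return total
}

func main() {
    const limit = 5_000_000_000_000_000 // 5 × 10^15
    fmt.Println(countPanaitopolPrimes(limit, runtime.NumCPU()))
}
```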
Final Thoughts
What started as a brute force two-dimensional search became a clean, optimized solution through:
- Mathematical simplification (the biggest win)
- Algorithmic optimization (custom primality testing)
- Language selection (Go's performance)
- Careful parallelization (just enough, not too much)
- Micro-optimizations (bit operations, trial division)
The journey from hours to 5 seconds taught me that before throwing compute power at a problem, always ask: "Can this be expressed more elegantly?" Often, the answer transforms the impossible into the trivial.
Final answer: ?????? Panaitopol primes exist below $5 \times 10^{15}$. I hope you have fun figuring out the answer yourself! 😄
Want to try this yourself? The optimized Go implementation is sketched above. For the mathematically curious, I recommend starting with the algebraic derivation - it's a beautiful piece of number theory.
Bonus: What's Next?
If you enjoyed this optimization journey, you might also like:
- Exploring SIMD instructions for even faster primality testing
- Investigating why certain (a, b) pairs produce primes more often
- Generalizing to other forms of specialized primes
- Applying these optimization patterns to other Project Euler problems
What Project Euler problems have sent you down similar rabbit holes? I'd love to hear about your optimization journeys!