Hash Collision Guide for Phone Hashing
A hash collision occurs when two different inputs produce the same hash output. For cryptographic hash functions, collisions should be computationally infeasible. However, MD5 and SHA-1 have been broken—practical collision attacks exist. This guide explains hash collision risks, their implications for phone number hashing, and how to choose algorithms that remain collision-resistant.
What is a Hash Collision?
By the pigeonhole principle, hash functions must have collisions: they map a large (or infinite) input space to a fixed output space. For a 256-bit hash, there are 2^256 possible outputs. The question is whether finding a collision is feasible. A cryptographically secure hash function makes it infeasible—requiring on the order of 2^(n/2) operations for an n-bit hash (birthday attack). Weaker functions allow faster collision finding.
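To make the birthday bound concrete, the following is a minimal sketch that runs a birthday search against a deliberately truncated hash. The truncation to 24 bits, the counter-based inputs, and the helper names `truncated_hash` and `find_collision` are assumptions chosen for illustration; the full 256-bit search is not feasible.

```python
import hashlib


def truncated_hash(data: bytes, bits: int) -> int:
    """Return the top `bits` bits of SHA-256(data) as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)


def find_collision(bits: int):
    """Birthday search: hash distinct inputs until two share a truncated hash."""
    seen = {}  # truncated hash value -> input that produced it
    counter = 0
    while True:
        candidate = str(counter).encode()
        h = truncated_hash(candidate, bits)
        if h in seen:
            # Two different inputs, same truncated output.
            return seen[h], candidate, counter + 1
        seen[h] = candidate
        counter += 1


# Illustrative run against a 24-bit hash (an assumption for the demo).
first, second, attempts = find_collision(bits=24)
print(first, second, attempts)
```

With a 24-bit output, the search usually ends after a few thousand attempts, in the neighborhood of 2^12, rather than the 2^24 a naive estimate would suggest. Each extra output bit multiplies the work by about 1.41, which is why the same search against all 256 bits is out of reach.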
MD5 Collision
MD5 produces a 128-bit output. In 2004, researchers demonstrated a practical MD5 collision: two different inputs that hash to the same value. Since then, MD5 collision attacks have been refined and automated to the point where they run in minutes or less on commodity hardware.
Implications for phone hashing: For phone numbers, the risk is lower than for arbitrary inputs because attackers cannot freely choose phone numbers in many scenarios. However, MD5's collision weakness means you cannot trust it for integrity (e.g., detecting tampering) or in security-critical contexts. For legacy compatibility only, see our MD5 hash lookup guide.
SHA-1 Collision
SHA-1 (160-bit) was deprecated by NIST in 2011, years before the 2017 SHAttered attack produced the first publicly demonstrated SHA-1 collision. While more expensive than MD5 collisions, SHA-1 collisions are now practical for well-resourced attackers.
Implications: Same as MD5—avoid SHA-1 for new systems. Use only when integrating with legacy systems that require it. See our SHA-1 hash lookup guide.
SHA-256 and Collision Resistance
No practical collision attacks exist for SHA-256. The birthday bound for SHA-256 is 2^128 operations—far beyond current computational capability. SHA-256 remains the recommended choice for cryptographic hashing, including phone number hashing.
Collision Probability for Phone Numbers
Phone numbers occupy a tiny fraction of the hash space. The probability that any given pair of distinct phone numbers collides under SHA-256 is approximately 1 in 2^256, and by the birthday bound it stays negligible even across billions of numbers. The concern is not random collision but adversarial collision: an attacker crafting two different inputs that produce the same hash. For MD5/SHA-1, that's feasible; for SHA-256, it is not.
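The per-pair figure can be extended to a whole dataset with the standard birthday approximation: for n distinct inputs and a b-bit hash, the chance of at least one accidental collision is roughly 1 - exp(-n(n-1)/2 / 2^b). The sketch below evaluates it; the dataset size of 10 billion numbers and the 64-bit truncated case are assumptions added for comparison, and the function name is hypothetical.

```python
import math


def birthday_collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items random hashes collide."""
    pairs = n_items * (n_items - 1) // 2
    expected_collisions = pairs / 2 ** hash_bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-expected_collisions)


n = 10 ** 10  # assumed dataset size: ten billion distinct numbers
for name, bits in [("64-bit truncation", 64), ("MD5", 128), ("SHA-256", 256)]:
    print(f"{name}: {birthday_collision_probability(n, bits):.3e}")
```

Under these assumptions the accidental-collision probability is vanishingly small for both 128-bit and 256-bit outputs, while a hash truncated to 64 bits would almost certainly collide. This is why the practical difference between MD5 and SHA-256 lies in adversarial collisions, not random ones.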
When Collisions Matter
- Data integrity: If you use hashes to verify that data hasn't been tampered with, collisions allow substitution attacks. Use SHA-256.
- Deduplication: If you use hashes to identify duplicates, a collision could incorrectly merge two different numbers. With SHA-256, this is astronomically unlikely.
- Lookup: Hash lookup assumes one-to-one mapping for practical purposes. Collisions could cause false matches. With SHA-256, the risk is negligible.
Mitigation Strategies
- Use SHA-256 for new implementations. Avoid MD5 and SHA-1 for security-sensitive or integrity-critical use.
- Document algorithm choice: If you must use MD5/SHA-1 for legacy reasons, document the risk and plan migration.
- Consider HMAC: Keyed hashing adds a layer of protection; even with a weak underlying hash, the key complicates collision exploitation.
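The keyed-hashing option can be sketched with Python's standard `hmac` module. The function name, the example key, and the assumption that numbers arrive already normalized (here in E.164 form) are illustrative and not part of the guide; in practice the key would come from a secret store.

```python
import hashlib
import hmac


def keyed_phone_hash(phone: str, key: bytes) -> str:
    """Return the HMAC-SHA256 of an already-normalized phone number, as hex."""
    return hmac.new(key, phone.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; never hard-code a real key.
key = b"example-key-load-from-a-secret-store"
digest = keyed_phone_hash("+15555550100", key)

# Compare digests in constant time when checking a submitted value.
matches = hmac.compare_digest(digest, keyed_phone_hash("+15555550100", key))
print(digest, matches)
```

Without the key, an attacker cannot compute the digest for any input, so collision attacks that rely on evaluating the hash offline do not apply directly. The trade-off is that every party that needs to produce matching hashes must share the key.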
Collision Detection
In theory, you can detect collisions by checking your dataset for hashes shared by different inputs. If two different phone numbers hashed to the same value, you'd have a collision. For SHA-256, the probability is negligible for any realistic dataset size. For MD5, accidental collisions between random phone numbers are also extremely rare, even though crafted collisions are feasible.
In a deduplication pipeline, duplicate hashes are expected (same number, same hash). But if you're hashing numbers that should be unique and see duplicates, investigate. The likely causes, in order, are duplicate input data, a pipeline bug (e.g., normalization that collapses different numbers into the same string, or the wrong algorithm being applied), and only then a true cryptographic collision. Log and alert on duplicate hashes, and monitor the duplicate hash rate; a sudden increase may indicate a pipeline bug.
True SHA-256 collisions are not known to have occurred in the wild, so observing one would be a significant event. Before concluding it's a cryptographic breakthrough, verify your hash implementation with a known test vector: hash a standard string (e.g., "abc") and compare the result to the published SHA-256 output. If your implementation produces the correct value for test vectors but "collides" for phone numbers, the issue is likely in normalization or input handling, not the hash function itself. Document any suspected collision and preserve the inputs for analysis; they may reveal a bug that affects more than one record.
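The test-vector check and the duplicate-hash check described above can be sketched as follows. The helper names and the `hash_fn` parameter are hypothetical; the expected digest is the published SHA-256 value for the string "abc".

```python
import hashlib
from collections import defaultdict

# Published SHA-256 test vector for the ASCII string "abc".
SHA256_ABC = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def implementation_ok(hash_fn=sha256_hex) -> bool:
    """Check the hash function under test against the known test vector."""
    return hash_fn("abc") == SHA256_ABC


def find_suspect_hashes(numbers, hash_fn=sha256_hex):
    """Return hashes shared by more than one DISTINCT input string.

    Repeated copies of the same number are ignored, since they are
    expected to produce the same hash.
    """
    inputs_by_hash = defaultdict(set)
    for number in numbers:
        inputs_by_hash[hash_fn(number)].add(number)
    return {h: nums for h, nums in inputs_by_hash.items() if len(nums) > 1}


numbers = ["+15555550100", "+15555550100", "+15555550101"]
print(implementation_ok(), find_suspect_hashes(numbers))
```

Any entry returned by `find_suspect_hashes` holds the conflicting inputs themselves, which is the evidence worth preserving. If `implementation_ok()` fails, the hash implementation or its encoding step is the first thing to examine.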
Migration from Weak Algorithms
If you're migrating from MD5 or SHA-1 to SHA-256:
- Dual-hash period: Store both old and new hashes during transition; lookups can check both.
- Re-hash from source: You cannot derive SHA-256 from MD5; you need the original phone number. Re-hash from source data when possible.
- Deprecation timeline: Set a date to remove support for weak algorithms; communicate to integrators.
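A dual-hash transition can be sketched roughly as below. The in-memory dictionaries stand in for whatever storage is actually used, and all function names are hypothetical. The `usedforsecurity=False` flag assumes Python 3.9 or later and marks the MD5 call as a legacy, non-security use.

```python
import hashlib


def sha256_hex(phone: str) -> str:
    return hashlib.sha256(phone.encode("utf-8")).hexdigest()


def md5_hex(phone: str) -> str:
    # Legacy algorithm, kept only for the transition period.
    return hashlib.md5(phone.encode("utf-8"), usedforsecurity=False).hexdigest()


def build_indexes(source_numbers):
    """Re-hash from the original numbers; SHA-256 cannot be derived from MD5."""
    by_sha256 = {sha256_hex(n): n for n in source_numbers}
    by_md5 = {md5_hex(n): n for n in source_numbers}
    return by_sha256, by_md5


def lookup(hash_value: str, by_sha256, by_md5):
    """Check the new index first, then fall back to the legacy one."""
    hash_value = hash_value.lower()
    if hash_value in by_sha256:
        return by_sha256[hash_value], "sha256"
    if hash_value in by_md5:
        return by_md5[hash_value], "md5"  # legacy path: worth logging
    return None


by_sha256, by_md5 = build_indexes(["+15555550100", "+15555550101"])
print(lookup(md5_hex("+15555550101"), by_sha256, by_md5))
```

Returning the algorithm alongside the match makes it possible to count how often the legacy path is still used, which helps decide when the deprecation date is safe to enforce.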
Collision Attacks in Practice
Researchers have demonstrated MD5 collisions in minutes on commodity hardware. Tools like HashClash and custom scripts automate the process. For SHA-1, the 2017 SHAttered attack cost an estimated $110,000 in cloud compute. As hardware improves, attack costs drop. Assume that MD5 and SHA-1 hashes can be collided by motivated attackers. For integrity-critical applications (e.g., digital signatures, certificate transparency), never use these algorithms.
Impact on Phone Hashing Specifically
For phone number hashing, the collision risk is somewhat different. Attackers cannot freely choose arbitrary phone numbers in many scenarios—they must work within valid number formats. However, if an attacker can craft two different phone-like strings that collide (e.g., one valid, one malformed), they could potentially substitute one for the other in systems that don't validate format. Defense in depth: use SHA-256 and validate input format before hashing.
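Validating before hashing might look like the sketch below. It assumes numbers are normalized to E.164 (a leading "+", then up to 15 digits with no leading zero), which this guide does not specify; the pattern and function name are illustrative.

```python
import hashlib
import re

# Assumed format: E.164, e.g. "+15555550100". ASCII digits only, so that
# look-alike Unicode digits are rejected.
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def hash_phone(phone: str) -> str:
    """Hash a phone number with SHA-256, rejecting anything not in E.164 form."""
    if not E164_PATTERN.fullmatch(phone):
        raise ValueError(f"not a valid E.164 number: {phone!r}")
    return hashlib.sha256(phone.encode("ascii")).hexdigest()


digest = hash_phone("+15555550100")
print(digest)
```

Because malformed strings never reach the hash function, a crafted colliding input would also have to be a well-formed number, which narrows the attacker's options considerably.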
Summary
Hash collision is a real risk for MD5 and SHA-1. For phone number hashing, use SHA-256 to avoid collision-based attacks. See our cryptography basics for more on hash function properties. To perform lookups with collision-resistant algorithms, visit /hashes and /reverse-lookup.