====== Unfucking BIND Lost Unused Keys ======

===== Problem =====

BIND has changed its DNSSEC management mechanism from ''auto-dnssec maintain;'' to policy-based management (''dnssec-policy''). This by itself causes many headaches, but no immediate bugs, unless the policy generates keys for signature schemes that are not actually used. That happens, for example, when the policy definition is changed later: initially RSA signatures were defined, but the registrar doesn't actually want them, so you change the policy to stop generating them, yet there are now some RSA RRSIGs in the dynamically signed zone.

So we just adjust the policy and remove the unneeded keys, right?

WRONG! [[https://narkive.com/5GuCbBeC|Do not stupidly delete ZSK files]]! This used to be mostly fine with ''maintain'', but it absolutely breaks keymgr in many fun ways. The most obvious is that any time incremental signing is due (usually every few weeks), BIND runs into an infinite loop: it attempts to clean up the (expired) signature, fails to find the key file it needs to keep track of the key state, and retries a few milliseconds later. Forever. Each attempt also leaves journal entries in the zone's ''.signed.jnl'' file, which, depending on settings, can grow without bound, although something else usually breaks first, leading to "non-minimal diff" log messages.

===== Diagnosis =====

This is easy: we get millions of log lines along the lines of

  zone_maintenance: zone security.fail/IN (signed): enter
  zone_resigninc: zone security.fail/IN (signed): enter
  dns_zone_findkeys: error reading Ksecurity.fail.+010+24081.private: file not found
  dns_zone_findkeys: error reading Ksecurity.fail.+014+65102.private: file not found
  dns_zone_findkeys: error reading Ksecurity.fail.+015+19645.private: file not found
  dns_zone_findkeys: error reading Ksecurity.fail.+015+17916.private: file not found

===== Recreating the missing keys =====

First, collect the problematic keys from the log file.
It's going to be large, so we use ''tail'' to save some cycles. We're in a loop anyway, so the most recent lines already name every affected key.

  tail -n 500000 named.log | perl -ne '/reading (.*)\.private: file/ && print "$1\n"' | sort -u > ~/missingkeys.txt

Now we need to give BIND something to write its state into. But we don't have the key material anymore? Turns out we do (at least the public part), in the currently published DNSKEY RRset! So we query ourselves for every domain that reports an issue and let ''dnssec-importkey'' create a fake state file. We then immediately expire this key.

  #!/bin/bash
  # Recreate placeholder key files for every key listed in ~/missingkeys.txt.
  test -d dist || mkdir dist
  while read -r k ; do
    # skip keys that were already recreated
    test -f "./dist/$k.key" && continue
    # derive the zone name: strip the leading "K" and the "+alg+tag" suffix
    d=${k#K}
    d=${d%%+*}
    # grab the key material and import into new files
    dig @::1 DNSKEY "$d" | dnssec-importkey -f - "$d" > /dev/null
    if [ -f "$k.key" ] ; then
      # expire it
      dnssec-settime -D now "$k.key"
      # save for manual apply
      chown bind:bind "$k".*
      echo "Created:" "$k".*
      mv "$k".* ./dist/
    else
      echo "Not in RRset: $k"
    fi
    # cleanup work dir
    rm -f "K$d"+*.{key,private}
  done < ~/missingkeys.txt

Now we can copy the files from ''dist'' to BIND's key directory and tell it to reload the keys (oddly enough, just copying them in does nothing to the stuck loop, even though it should start "finding" the files immediately). For each affected zone, do this:

  d=security.fail ; rndc loadkeys "$d" && sleep 30 && rndc sign "$d" && sleep 20 && tail -n 50000 /var/log/named.log | grep "$d"

Why manually instead of one big loop? Mostly because I'm a scaredy cat and want to see the results for each. It's fun to sit on a ''tail -f /var/log/named.log'' and see the messages getting sparser and sparser.

Why ''loadkeys'' //and// ''sign''? The first only removes the signature and CDS binding, the second also removes the CDNSKEY records.

===== Cleanup =====

It's worth mentioning again: do NOT delete the key files, even after we've removed them from the zone. BIND will have taken the Deleted value we gave it, turned it into Inactive and added the usual delay to the new Deleted timer.
Once that expires, the fake key files will be purged automatically.
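
The per-zone ''rndc'' step in the recreation section needs the list of affected zones. A minimal sketch for deriving it from ''~/missingkeys.txt'', using the same prefix/suffix stripping as the recreation script. The function names ''key_zone'' and ''affected_zones'' are made up for this example and are not part of BIND:

```shell
#!/bin/bash
# Hypothetical helper (not a BIND tool): turn a key file base name such as
# "Ksecurity.fail.+015+19645" into its zone name "security.fail".
key_zone() {
    local d=${1#K}            # drop the leading "K"
    d=${d%%+*}                # drop "+<algorithm>+<keytag>"
    printf '%s\n' "${d%.}"    # drop the trailing root dot
}

# Hypothetical helper: read key base names from stdin and
# print each affected zone exactly once, sorted.
affected_zones() {
    local k
    while read -r k; do
        [ -n "$k" ] && key_zone "$k"
    done | sort -u
}

# Usage (file name taken from the steps above):
#   affected_zones < ~/missingkeys.txt
```

The output is one zone per line, which can be worked through by hand with the ''rndc loadkeys'' / ''rndc sign'' one-liner while watching the log.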