
WiredTiger Surgery: What to Do When You’re Out of Options

If you’re reading this at 2 a.m., staring at a corrupt production MongoDB instance and wondering if there’s any way out... you’re not alone. We’ve been there. Take a slow breath: smell the cookies, blow out the candles. There is a way through this.


This is the story of a MongoDB instance that went down hard — no backups, no time, and no easy answers. Just one stubborn team, a few smart folks (shoutout to John and Noah), and one desperate idea: Surgery.


WiredTiger Surgery.

Scene One: Everything Breaks

The incident started like most nightmares do — a disk filled up, and MongoDB ran out of memory.


That seemingly simple event kicked off a full-blown meltdown. The data directory was corrupted. The server wouldn’t start. There were no working backups. Not even an old mongodump tucked away somewhere.


We tried starting up the node with both MongoDB 7 and MongoDB 8, but nothing worked. Every attempt crashed, ending with an opaque error about a “missing or invalid admin.system.keys” file.


With no clean startup path, we reached for our first lifeline: a repair attempt.


⚠️ Important: Before attempting any repair or use of the wt utility, always make a full backup of the dbPath. These operations can overwrite files, and there's no undo.
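
If you need a quick way to take that copy, something as simple as the following works (the path is the dbPath from our environment, so adjust it to yours):


# with mongod stopped, take a full copy of the data directory
sudo cp -a /data/db /data/db.backup-$(date +%Y%m%d)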

Being sure to back up the full contents of the data directory first, we ran:


mongod --dbpath /data/db --repair 

And for a moment, it looked promising. Logs scrolled, collections were being processed, and it felt like things might just work.


But right at the end — boom:


UnsupportedFormat: Unable to find metadata for table:index-31-<id> Index: {name: _id_, ns: admin.system.keys} - version either too old or too new for this mongod

The repair crashed out.


No startup, no shell, no data. Just one stubborn error and a database that wouldn’t come back.

Scene Two: The WT Utility Enters the Room

This is where things get a little lower-level.

We started to explore the wt command-line utility, a standalone tool from the WiredTiger project that lets you inspect and manipulate .wt files directly. It’s not bundled with MongoDB, and it’s not for the faint of heart.


⚠️ Important: The wt utility must match your MongoDB version exactly. Using the wrong version can lead to misleading errors or prevent recovery altogether.


To find the right one, run mongod --version, then check out the corresponding mongodb-x.y branch from the WiredTiger GitHub repo. For example, MongoDB 8.0.x requires mongodb-8.0.
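
For example, for a MongoDB 8.0.x deployment, grabbing the matching source looks roughly like this:


git clone https://github.com/wiredtiger/wiredtiger.git
cd wiredtiger
# the branch name tracks the MongoDB major.minor version
git checkout mongodb-8.0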


Detailed instructions for building WiredTiger are available in the official WiredTiger build guide.
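
As a rough sketch (exact steps vary by branch and platform, so defer to the guide if anything differs), a CMake build from the repo root looked something like:


mkdir build && cd build
cmake ../.
make -j$(nproc)
# the wt binary lands in this build directory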


Once compiled, the wt binary can be used to inspect and manipulate the .wt files directly. Proceed with caution. You’re now holding the scalpel.

Scene Three: What is WT Surgery?

WiredTiger is the default storage engine for MongoDB. It handles how data is stored, accessed, and managed, functioning much like a file system built into your database. It provides performance and concurrency advantages through features such as document-level locking and compression.


Under the hood, WiredTiger stores collections and indexes as individual .wt files, typically named using internal identifiers like collection-42-1234567890123456789.wt. But the real magic, and the real trouble when things go wrong, lives in the metadata.


That metadata is stored in a few special files:

  • _mdb_catalog.wt: maps collections and indexes to their internal IDs

  • WiredTiger.wt: general metadata for the storage engine

  • WiredTigerHS.wt: the history store, used for multi-version concurrency control (MVCC)


These files act as the control plane for WiredTiger. When something breaks (like a corrupted index entry), the metadata may need to be inspected, altered, or even surgically rewritten to allow MongoDB to start again.
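
Before changing anything, it can be reassuring just to see what WiredTiger itself thinks exists. With a backup already taken, something like this lists the tables recorded in that metadata:


# peek at the tables WiredTiger knows about (collection-* and index-* entries)
sudo ./wt -h /data/db list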

Scene Four: The Actual Surgery

This was the moment we held our breath.

We had collections we knew were salvageable, but they were being blocked by a single broken index. Dropping the entire collection felt like overkill, especially when the data was otherwise intact.


So we decided to go deeper.


What follows is a surgical approach to remove a corrupted index from the WiredTiger catalog while preserving the associated collection. This method only works for general (non-_id) indexes. If the _id index is damaged, the situation gets much trickier (see Scene Five).


🔎 Note: Because of how fragile and low-level these operations were, we created a Git repo to track changes to the metadata files (_mdb_catalog.wt, WiredTiger.wt, etc.). This made it much easier to roll back cleanly when a step didn’t go as planned.
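
A minimal sketch of that safety net, assuming the same dbPath as above:


cd /data/db
sudo git init
sudo git add _mdb_catalog.wt WiredTiger.wt WiredTigerHS.wt
sudo git -c user.name=recovery -c user.email=recovery@localhost commit -m "baseline before surgery"
# commit again after each edit so any step can be rolled back cleanly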


Here’s how we did it, step by step.

  1. Dump the catalog in a readable text form so you can find the collection and index IDs:

sudo ./wt -h /data/db dump -x file:_mdb_catalog.wt | tail -n +7 | awk 'NR%2 == 0 { print }' | xxd -r -p | bsondump --quiet > catalog.json
  2. Search for the collection name in catalog.json and note the collection and index IDs (a quick grep, shown after the example below, can help):

{"md":{"ns":"customerData.people","options":{"uuid":{"$binary":{"base64":"Zi44WLokR12GGLZbBFXWUQ==","subType":"04"}}},"indexes":[{"spec":{"v":{"$numberInt":"2"},"key":{"email":{"$numberInt":"1"}},"name":"email_1"},"ready":true,"multikey":false,"multikeyPaths":{"email":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"backgroundSecondary":false},{"spec":{"v":{"$numberInt":"2"},"key":{"id":{"$numberInt":"1"}},"name":"_id"},"ready":true,"multikey":false,"multikeyPaths":{"id":{"$binary":{"base64":"AA==","subType":"00"}}},"head":{"$numberLong":"0"},"backgroundSecondary":false}]},"idxIdent":{"email_1":"index-45-7749382120629297684","_id":"index-46-7749382120629297684"},"ns":"customerData.people","ident":"collection-44-7749382120629297684"}
  3. Make a backup of the collection and index files:

sudo ./wt -h /data/db dump table:collection-44-7749382120629297684 > people-out.txt
sudo ./wt -h /data/db dump table:index-45-7749382120629297684 > email-idx-out.txt
  4. Dump the catalog itself into a text file you can edit and reload later:

sudo ./wt -h /data/db dump table:_mdb_catalog > catalog-out.txt
  5. Drop the corrupt table and index object:

sudo ./wt -h /data/db drop table:collection-44-7749382120629297684
sudo ./wt -h /data/db drop table:index-45-7749382120629297684
  6. Edit the catalog dump. Locate the line that contains the collection and index IDs, then delete that line and the line just before it, which should be a record key (an escaped integer). It will look something like the text below; in our example, the entry we are removing is key \97 and the value line that follows it:

\96
v\01\ ... \00collection-47-7749382120629297684\00\00
\97
v\01\00\00\03md\00\ec\00\00\00\02ns\00\16\00\00\00customerData.accounts\00\03options\00 \00\00\00\05uuid\00\10\00\00\00\04\84\eb\dd\b2iQK\ae\8f\f4\03^\0f\ad\0a\08\00\04indexes\00\97\00\00\00\030\00\8f\00\00\00\03spec\00.\00\00\00\10v\00\02\00\00\00\03key\00\0e\00\00\00\10_id\00\01\00\00\00\00\02name\00\05\00\00\00_id_\00\00\08ready\00\01\08multikey\00\00\03multikeyPaths\00\10\00\00\00\05_id\00\01\00\00\00\00\00\00\12head\00\00\00\00\00\00\00\00\00\08backgroundSecondary\00\00\00\00\00\03idxIdent\00,\00\00\00\02_id_\00\1d\00\00\00index-48-7749382120629297684\00\00\02ns\00\16\00\00\00customerData.accounts\00\02ident\00"\00\00\00collection-47-7749382120629297684\00\00
\98
v\01\ ... \00collection-47-7749382120629297684\00\00

  7. Load the updated catalog:

sudo ./wt -h /data/db load -f catalog-out.txt
  8. Run an initial repair. This step resets internal metadata and may rename your old _mdb_catalog.wt to _mdb_catalog.wt.corrupt, creating a fresh one:

sudo mongod --repair --wiredTigerEngineConfigString="salvage=true"

🔎 Note: At this point, you can also confirm that the catalog is empty (often only containing admin.system) using wt dump table:_mdb_catalog


  9. Delete any entries from catalog-out.txt that are already present in the freshly created _mdb_catalog. If the same table appears in both, loading will create a duplicate entry and cause errors on startup. We found that admin.system.keys and a few orphaned collections were common sources of these conflicts.


  10. Reload your edited catalog into the blank file:

sudo ./wt -h /data/db load -f catalog-out.txt
  11. Run --repair again to rebuild MongoDB’s internal metadata with your updated catalog:

sudo mongod --repair --wiredTigerEngineConfigString="salvage=true"
  12. Restore the file permissions:

sudo chown -R mongod:mongod /data
  13. Start mongod (hopefully):

mongod --dbpath /data/db

This process was meticulous and unforgiving, but it worked. We used this sequence repeatedly: removing one bad index at a time, running --repair, and gradually watching the system come back to life.

Scene Five: Not For the Faint of Heart

Fixing a broken index is one thing. But when the _id index is corrupted, things get a lot more complicated.


The _id index is the primary key for every collection in MongoDB. It’s required, and MongoDB relies on it to load the collection’s metadata. That made simply dropping it too risky. We had no idea how MongoDB would behave without it. Would it ignore the collection? Refuse to start? Crash outright?


While inspecting the index files, a couple of useful patterns emerged:

  • _id indexes often had empty data segments. As long as the structure was valid, MongoDB could rebuild the rest during repair.

  • The "source" field in the index metadata didn’t need to point to anything specific. Leaving it blank worked just fine.


To validate this, we spun up a clean MongoDB instance, dumped the structure of a working _id index, and loaded it into our recovery environment. When we ran --repair with salvage enabled, MongoDB filled in the missing pieces.
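
For the curious, that detour looked roughly like the sketch below; the port, scratch path, and throwaway namespace are illustrative rather than what we actually used:


# spin up a throwaway instance and insert one document so a healthy _id index exists
mkdir -p /tmp/clean-db
mongod --dbpath /tmp/clean-db --port 27027 --fork --logpath /tmp/clean-db/mongod.log
mongosh --port 27027 --eval 'db.getSiblingDB("scratch").donor.insertOne({x: 1})'
mongosh --port 27027 --eval 'db.getSiblingDB("admin").shutdownServer()'
# reuse the catalog-decoding one-liner from Scene Four to find the ident of that _id index,
# then dump its structure for the transplant
./wt -h /tmp/clean-db dump table:<ident-of-the-healthy-_id-index> > good_index.txt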


So when we hit a corrupted _id index in our actual environment, we tried something different — a kind of transplant.

  1. We started by dumping a valid _id index from a known-good collection; in our case, something from the config or admin databases.

wt -h . dump table:index-27-config.system.indexes-_id_ > good_index.txt
  2. We then carefully dropped the broken index and its associated WiredTiger table.

wt -h . drop file:index-31-app.collection-_id_
wt -h . drop table:index-31-app.collection-_id_
  3. With the bad data out of the way, we re-created the index table manually using wt create, and then applied the correct metadata using wt alter.


🔎 Note: Yes, the parentheses really matter. There’s a check in MongoDB's wiredtiger_util.cpp that expects this exact format.

wt -h . create table:index-31-app.collection-_id_
wt -h . alter table:index-31-app.collection-_id_ "app_metadata=(formatVersion=8)"
  4. At this point, we had a properly structured shell of an index table. We could now load the dumped structure back in using wt load:

wt -h . load -f good_index.txt
  5. And at last, we ran:

mongod --repair --wiredTigerEngineConfigString="salvage=true"

It worked. A win at last. But we still had more corrupt indexes to deal with.

Scene Six: The Signs of Life

We kept going. A few more index removals, a few more cautious repairs. Then we ran mongod one more time.


  • The _repair_incomplete marker file was gone

  • mongod started without crashing

  • We could connect using mongosh

  • The collections were back

  • Document counts actually matched what we expected


It wasn’t over, but the fog had lifted. For the first time, this felt like a database again.
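
A couple of those checks are easy to script. The namespace below is the one from our earlier catalog example, so substitute your own:


# confirm no repair marker is left behind and that the server answers real queries
ls /data/db | grep -i repair || echo "no repair marker present"
mongosh --quiet --eval 'db.getSiblingDB("customerData").people.countDocuments()'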

Scene Seven: A Fresh Start

After the excitement of seeing mongod finally spin up without error (collections intact, indexes repaired, document counts matching expectations), there was still a lingering question: could we really trust this instance to stay stable?


Corruption can leave subtle artifacts behind, and we didn’t want to take chances. So we decided to give the data a clean home.

We ran a full mongodump and restored everything into a brand-new deployment using mongorestore.
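
In broad strokes (the host and path below are placeholders, not our real ones), that migration looks like:


# dump everything from the recovered standalone...
mongodump --host localhost --port 27017 --out /backups/recovered
# ...then restore it into the brand-new deployment
mongorestore --host <new-deployment-host> --port 27017 /backups/recovered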


In our case, all of this work was done on a standalone node; this process was not tested on a replica set. If we had been working with a replica set, our plan would have been to perform the WiredTiger surgery on a single node, then use mongodump to extract the recovered data, restore it into a brand-new replica set, and let MongoDB handle syncing the other members.


This gave us peace of mind that our recovered data wasn’t sitting on top of lingering internal issues. It was a fresh start.

Lessons Learned (The Hard-Earned Kind)

This was a gnarly situation, and we learned a lot as we navigated it together. A few key takeaways that might help the next team facing something similar:


  1. Breathe first. Even in the most hopeless-looking situations, taking a moment to pause, investigate, and think things through can reveal a path forward. There's usually something worth trying that can start to move things in the right direction.

  2. Have a backup strategy in place. Even a simple mongodump on a cron job can change the outcome of a disaster (see the example cron entry after this list).

  3. Always take a full copy of your data directory before trying any repairs. Especially before invoking --repair or using the wt tool. Remember, you can’t undo.

  4. Try --repair first. In many cases, MongoDB will rebuild a corrupt cluster if you give it the chance.

  5. Version control your metadata files. Before making edits, consider using Git to track changes to the core .wt files. It’s a lifesaver when troubleshooting step-by-step repairs or rolling back from a mistake.

  6. Use the exact version of the WT utility that matches your MongoDB deployment. Mismatched versions can cause false errors or block recovery.

  7. The WT utility lets you remove corrupt indexes without losing the entire collection. You just need to surgically target what’s broken.

  8. Having a collaborative, persistent team is everything. Huge thanks again to John and Noah. This was a true joint recovery effort.
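
On the backup point specifically, even something as small as the cron entry sketched below (path and schedule are placeholders) would have changed our night entirely:


# nightly mongodump at 2 a.m. into date-stamped folders
0 2 * * * mongodump --out /backups/mongo-$(date +\%F)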

For the Desperate Reader

If you're reading this because you're where we were — out of options, alone, and about to perform surgery on your production database — take a deep breath.

We made it through. You can too.


If you’re stuck and need someone to sanity-check your next move or just say, “yep, that’s the same error I had,” feel free to reach out.



You’re not alone. Even when things seem completely broken, your database is probably more resilient than you think.


Acknowledgments 

This article was the result of a joint recovery effort between Clarity Business Solutions and a dedicated customer team, who brought not only technical creativity to a tough situation, but also helped shape this write-up so others could benefit.

Contributors:

  • Goldie Lesser, MongoDB Consulting Engineer, Clarity Business Solutions

  • Noah Sundberg, Data Scientist & Software Engineer, Visionist, Inc.

  • John Roberts, MongoDB Consulting Engineer, Clarity Business Solutions

Thank you all for your persistence, insight, and collaboration — both during the recovery itself and in helping bring this story to life.

 
 
 