Recovery on a salvaged PostgreSQL cluster with missing files

[Cover art: a distorted bright background with a blooming PostgreSQL logo in front]

Hi! It's been a while. This post has been in my drafts since November of last year. I finally decided to finish it up, since the information presented here was quite hard to pinpoint.

As some of you may be aware, my main server suffered an SSD failure on October 14th, 2023, with no backups available. This SSD held the data for many of the services I host, including my Akkoma instance, the mint.lgbt Matrix and XMPP servers, and even this blog! Thankfully, with the help of the donors now listed on the Coral Castle homepage, I sent it to a local Chilean data recovery service called Nekiori, located in the Los Leones commercial neighborhood of the Providencia district. They were successful at the job, managing to recover around 300 GB of data, not to mention their excellent communication and pricing.

However, no data recovery job is perfect, and in many cases you'll still end up with a few files missing. This is what happened with the salvaged PostgreSQL cluster, and I'll explain how I was able to get a huge chunk of it back within about 2 hours.

Starting up the salvaged cluster

Before beginning, MAKE A COPY OF THE ORIGINAL DATA DIRECTORY. We'll be making lots of destructive changes, so, if anything goes wrong, you'll want a checkpoint you can restart from. I know duplicating clusters is time-consuming and may take up lots of space, but it's better to be safe than sorry (although I expect whoever has to go through this has already learned that valuable lesson). Additionally, this guide only covers getting past PostgreSQL errors; it won't regenerate any pages that were permanently lost.
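
For example, assuming the recovered data directory lives at postgresql/15/data (adjust the path to wherever yours actually is), a recursive copy that preserves permissions is enough:

$ cp -a postgresql/15/data postgresql/15/data.bak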

Let's start by installing PostgreSQL on our work environment, making sure to match the version the recovered cluster was originally running. On Alpine Linux, for instance, if the cluster was running version 15.x, we can install the matching package by running:

# apk add postgresql15
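
If you're unsure which major version the cluster was running, the data directory itself records it in the PG_VERSION file (the path here assumes the same data directory used in the examples below):

$ cat postgresql/15/data/PG_VERSION
15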

Once installed, we'll skip the service provided by the package and instead run the server straight from the CLI, in order to have better control over each parameter. I would recommend using a multiplexer like Byobu, so you can view logs easily and gracefully shut down the database when needed. Otherwise, if you're working on a full-blown desktop, you can use multiple terminals. Let's break down the following command:

$ postgres -k /tmp -p 5433 -D postgresql/15/data/

– postgres starts the server in the foreground.
– -k /tmp tells the server to place its Unix socket in /tmp.
– -p 5433 makes it listen on a port other than the default 5432, so it won't clash with any cluster already running on the machine.
– -D postgresql/15/data/ points to the recovered data directory.

Once we've gone over each parameter, let's run it!

...

All good? No? Don't worry. It's likely to fail on its first launch, complaining that it can't find a directory inside the data folder. If that's the case, create the missing directory (its name should be shown in the log) with:

$ mkdir [data]/[dir]

Then, try to execute the server again. Rinse, repeat, and eventually the server should begin running without a crash.
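
Most of the time the missing directories are ones that are normally empty, so the recovery job simply had no files to bring back for them. If you'd rather create the usual suspects in one go, something like this covers the common ones (treat the list as a guess; the server log remains the authoritative source):

$ cd postgresql/15/data
$ mkdir -p pg_tblspc pg_twophase pg_commit_ts pg_snapshots pg_replslot pg_serial pg_logical/snapshots pg_logical/mappings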

Locate a database

Now, since this cluster is in an unstable state, it shouldn't be used for production. Instead, we'll locate the most relevant databases we want to keep, so we can later dump them. Enter the psql shell with the following command:

$ psql -h /tmp -p 5433 postgres postgres 

Again, here's the breakdown:

– psql is the command for the PostgreSQL shell.
– -h /tmp is the location of the Unix socket we defined on the last step.
– -p 5433 is the port we also defined earlier.
– First postgres is the database name. Most PostgreSQL clusters use this database as a sort of starting point.
– Second postgres is the user we're authenticating as, which in most clusters is also set to be a superuser, granting us elevated privileges.

If all went well, you should now be staring at a screen like this:

psql (15.6)
Type "help" for help.

postgres=#

Now, to list our available databases, we need to run the following command:

postgres=# \l

After pressing enter, we should see a table of every database present in the cluster. Write down the names of the databases to preserve, as shown in the “Name” column. If you want to explore the contents of the database, you can execute a series of commands like these:

postgres=# \c shop # Connects to the database "shop"
shop=# \dt # Displays all tables in database
shop=# SELECT * FROM orders; # Displays all rows in the table "orders"
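
If you'd rather get a quick sense of how much of a table survived without printing every row, a count works just as well (the table name here is just an example):

shop=# SELECT count(*) FROM orders;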

Once we've taken note of all the databases we want to preserve, we can exit psql by running:

postgres=# \q

Dump away!

To ease this process, I've created a script that reads the output of the pg_dump command and creates blank pages in the likely case that some can't be found. We can download it from my GitHub Gist using the following command:

$ curl -LO https://gist.githubusercontent.com/otoayana/7623d05b0b60c71160c37771398bfcaf/raw/ada88e9ad936e317ba19787f8886e24e2c96b123/pg_taprec
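
For the curious, here's a simplified sketch of the general idea: retry the dump, and create a blank stand-in for whatever the server reports as missing. This is only an illustration with placeholder paths and database names, not the script itself, so refer to the actual file for the details:

while ! pg_dump -h /tmp -p 5433 -U postgres shop > shop.sql 2> dump.log; do
    # Pull the path of the missing relation file out of the error output,
    # e.g. 'could not open file "base/16384/16440": No such file or directory'
    missing=$(sed -n 's/.*could not open file "\([^"]*\)".*/\1/p' dump.log | head -n 1)
    [ -n "$missing" ] || { cat dump.log >&2; break; }
    # A zero-length relation file is read back by the server as an empty table
    touch "postgresql/15/data/$missing"
done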

Once downloaded, we'll make it executable with chmod, and then open it with a text editor. For the sake of simplicity, I'll be using nano as an example.

$ chmod +x pg_taprec
$ nano pg_taprec

Inside this file, we'll need to modify the following variables:

– DATA: Path to the cluster's data directory.
– HOST: The location of the PostgreSQL server's Unix socket, as defined earlier.
– DB: The database we want to recover. Set it to the name of one of the databases we noted down earlier.
– USER: A superuser within our cluster. By default, this should stay as postgres.
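
After editing, the variable block at the top of the script might look something like this (the exact formatting in the script may differ, and the paths and names below are placeholders):

DATA=/root/postgresql/15/data
HOST=/tmp
DB=shop
USER=postgres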

Once we've edited the file, let's execute it, making sure to pass the path for the final dump as the first argument.

$ ./pg_taprec [file]

This process may take a while, as it needs to retry dumping whenever it encounters a missing page. In the meantime, go prepare whatever hot drink you like, and maybe read another article on this blog.

Once it's done, rinse and repeat, changing the DB variable to the next database we want to dump and pointing the first argument to a different path; otherwise we'll overwrite the previous dump.

Shut down, then import

Once we've finished dumping every database we picked, let's close the terminal running the server (or terminate the server using Ctrl+C). We can now delete the duplicated cluster and keep the original somewhere safe, just in case we need it at a later time.
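
Alternatively, if you'd rather shut the server down explicitly instead of hunting for the right terminal, pg_ctl can stop it cleanly against the same data directory:

$ pg_ctl -D postgresql/15/data stop -m fast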

Let's copy the database dumps over to our new server and, if it isn't installed yet, install a fresh PostgreSQL server on it, as described at the beginning of this post.

Once the new cluster is running, let's create a new database corresponding to a dump, with an owner that can interact with it:

# su - postgres
$ createuser [owner]
$ createdb [database] -O [owner]

Finally, let's import the database dump into it. We can do so using the following command, making sure to replace [dump] with the path where our dump is stored within the server:

$ psql [database] < [dump]

Rinse and repeat for each dumped database.
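
If you ended up with a whole directory of dumps, a small loop saves some typing; this assumes each dump is named after its database (e.g. dumps/shop.sql) and that every database and owner has already been created as shown above:

$ for f in dumps/*.sql; do psql "$(basename "$f" .sql)" < "$f"; done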

Conclusion

Congrats! You've managed to salvage what you could. However, it's always better to prepare for an emergency such as this one. Consider the following tips for preventing other catastrophes: