
Rescuing a Broken EXT4 System with Ext4Magic and dd

2022/04/14

while i was setting up this machine, i made some mistakes along the way and observed the dreaded fsck messages on boot:

[kernel boot output]
...
/dev/sda1 contains a file system with errors, fsck required.
dropping into emergency shell
> 

so i ran fsck on the disk like the computer told me to:

# fsck /dev/sda1
[...]
Inode 5202 has an invalid extent
        (logical block 0, invalid physical block 2250752, len 2048)
Clear ('a' enables 'yes' to all) <y>? yes to all

and then i typed a ("yes to all") like a noob, expecting that any cornerstone tool of Linux in 2022 would act sanely. i mean, it didn't give me any other apparent option: i had to complete fsck before it would let me boot, and the only option fsck gives me on each error is yes or yes to all. so, obviously i'm supposed to let it fix everything itself.

Fsck fsck

so i let it whir. and it trashed everything. if i had properly read the output, i maybe would have understood that it was simply deleting ("clearing") everything on the disk that it didn't understand. but i'm an imperfect user, and i was in a hurry. so bye-bye /opt/pleroma. bye-bye /home. bye-bye arbitrary chunks of /var/lib/postgres/data/base/50616. i didn't have backups in place: i had naively decided to tackle that after i finished installation and had a clear sense of how this system would be structured long-term.

so do i just live with this, and redo two days of work?

hell no. i'd rather spend three days diving into EXT4 internals than redo anything. so i saved the fsck output, attached (but didn't mount) the drive to a clean system, and got busy.

What Did fsck Actually Do?

at the start of the fsck run was this:

Resize inode not valid.  Recreate<y>? yes

the remainder of the messages were mostly of two forms:

Entry 'admin' in /home (8196) has invalid inode #: 2234378.
Clear? yes

and

Inode 4574 has an invalid extent
        (logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes

in fact, i had done an earlier resize2fs to expand the 8 GiB FS to fit its 2 TiB partition. the docs say you can do this on a live filesystem, but... caveat emptor?

EXT4 defaults to a block size of 4096B (i.e., a traditional page of RAM). physical blocks are a direct reference to some offset into the underlying device, so "physical block 4031998" corresponds to the byte range 16515063808 - 16515067904 on the device, well past the 8 GiB (8589934592 B) mark. inode numbers aren't block addresses (they index into the per-block-group inode tables), but the same pattern shows up there: assuming mkfs defaults of one inode per 16 KiB, an 8 GiB fs only has about half a million inodes, so inode # 2234378 has to live in a block group that was added by the resize. this holds true of every inode and data block fsck complained about: they all sit beyond the original 8 GiB fs.
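
the block-to-byte arithmetic is easy to fumble by hand, so here's a minimal sketch of it. block_byte_range is a made-up helper name, and the 4096 default is an assumption: it's only right if the fs really uses the default block size.

```python
BLOCK_SIZE = 4096  # assumption: the ext4 default; verify it for your fs


def block_byte_range(block: int, block_size: int = BLOCK_SIZE) -> tuple:
    """return the (start, end) byte offsets of a physical block; end is exclusive."""
    start = block * block_size
    return start, start + block_size


def beyond_original_fs(block: int, original_size: int = 8 * 1024**3) -> bool:
    """True if the block starts past the end of the pre-resize fs (8 GiB here)."""
    return block_byte_range(block)[0] >= original_size
```

block_byte_range(4031998) gives the same range quoted above, and beyond_original_fs is a quick way to check the "everything fsck complained about is past 8 GiB" pattern against the rest of the log.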

there's a good chance that all/most of the actual data/inode blocks still hold valid data on-disk, and the EXT4 drivers simply didn't understand their address.

we have two tasks here:

  1. recover the unlinked inodes (i.e. directory entries).
  2. recover the cleared extents (i.e. data blocks within a file).

there's a purpose-built tool for #1, and we can script our own thing for #2.

Recovering Inodes with Ext4Magic

ext4magic is a tool to manage data loss like this. one of its modes is ext4magic -I <inode> -R <device>, wherein you pass it an inode #, it parses the inode data structure off the disk, and then makes a best-effort attempt to recover everything in the fs tree at, or referenced by, that inode.

so for the missing /home/admin directory, i need only run:

# ext4magic -I 2234378 -R -d recovered-inodes /dev/sda1

and out pops the entire directory tree for the admin directory. e.g.

$ tree recovered-inodes
└── <2234378>
    └── gitea
        ├── assets
        │   ├── emoji.json
        │   └── logo.svg
        ├── BSDmakefile
        ├── build
        │   ├── code-batch-process.go
        │   ├── codeformat
        │   │   ├── formatimports.go
        │   │   └── formatimports_test.go
        │   ├── generate-bindata.go
        │   ├── generate-emoji.go
...

everything under that <2234378> even has the correct group/owner/permission bits. you just need to rename <2234378> to admin, chown it to what it was before, and then link it back into /home in your fs (but don't do that yet: put this in some staging directory and link everything back in only after all the data has been recovered).

repeat this for all the "invalid inodes" referenced in the fsck output. then we'll recover the data blocks.
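
if the fsck log is long, pulling the inode numbers out by hand gets tedious. here's a rough sketch that scrapes them from the saved output and builds (but doesn't run) one ext4magic invocation per entry. the regex is only modelled on the one message format quoted above, and invalid_entries / ext4magic_command are hypothetical names, so treat it as a starting point.

```python
import re

# matches lines like:
#   Entry 'admin' in /home (8196) has invalid inode #: 2234378.
ENTRY_RE = re.compile(
    r"Entry '(?P<name>.+?)' in (?P<parent>.+?) \((?P<parent_inode>\d+)\) "
    r"has invalid inode #: (?P<inode>\d+)"
)


def invalid_entries(fsck_output: str) -> list:
    """return (name, parent_dir, inode) for every unlinked directory entry."""
    return [
        (m["name"], m["parent"], int(m["inode"]))
        for m in ENTRY_RE.finditer(fsck_output)
    ]


def ext4magic_command(inode: int, device: str = "/dev/sda1",
                      out_dir: str = "recovered-inodes") -> list:
    """build the argv for recovering one inode; the caller decides whether to run it."""
    return ["ext4magic", "-I", str(inode), "-R", "-d", out_dir, device]
```

keeping the name and parent directory next to each inode number also tells you what to rename each <inode> directory to afterwards.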

Recovering Data Blocks: EXT4 Data Structures

the 2nd class of message was:

Inode 4574 has an invalid extent
        (logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes

to understand that this even is recoverable, it helps to understand the ext4 inode structure. inodes are on-disk data structures, one for every directory entry on the system. an inode might represent a file or a directory. they look similar in both cases, but we only care about inodes which represent files here.

each inode is a fixed-size structure holding metadata, like file type/mtime and -- notably -- file size, and then they link to a dynamically-sized sequence of "extents"; roughly, pointers to where the file data lives on disk. the English translation is like "data bytes 0-32768 occupy the physical blocks starting at block 4156555; bytes 32768-36864 occupy the physical blocks starting at block 6285112". this is all represented in terms of blocks, so based on the file length the last block may only be partially filled with data.
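
to make that translation concrete, here's a toy model of an extent list and the lookup it implies. this is a simplification for illustration, not the real on-disk layout (the real thing is a packed extent tree), and Extent / physical_block_for are invented names.

```python
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE = 4096  # assumption: default ext4 block size


@dataclass
class Extent:
    logical_block: int   # first block of the file this extent covers
    length: int          # number of contiguous blocks
    physical_block: int  # where the first of those blocks lives on the device


def physical_block_for(extents: List[Extent], byte_offset: int,
                       block_size: int = BLOCK_SIZE) -> Optional[int]:
    """map a byte offset within the file to a physical block, or None if unmapped."""
    logical = byte_offset // block_size
    for ext in extents:
        if ext.logical_block <= logical < ext.logical_block + ext.length:
            return ext.physical_block + (logical - ext.logical_block)
    return None


# the example from the text: bytes 0-32768 at block 4156555, bytes 32768-36864 at 6285112
example = [Extent(0, 8, 4156555), Extent(8, 1, 6285112)]
```

a cleared extent is just an entry missing from that list: the lookup returns None, and reads of that range come back as zeros even though the physical blocks were never touched.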

EXT4 (like many file systems) keeps the file data almost entirely outside of the inode structure. fsck tells us that it cleared the extent entries, but not the actual data blocks. i_blocks here is the count of blocks allocated to the inode for holding its data, i.e. the blocks those extents point at (for what seem to be legacy reasons, it's denoted in 512B disk sectors instead of FS blocks, so an i_blocks of 8 is exactly one 4096B block).

so, all the inode metadata is still here; the data blocks exist but are unlinked, and only the extents were lost. if you try reading the file, it'll still present its original length of data, but will show a block's worth of zeros for every logical block whose extent was cleared.
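
one way to sanity-check this on a read-only mount is to scan a file for logical blocks that read back as all zeros. a quick sketch, with zeroed_blocks being a hypothetical helper; note that a zero block is only a hint, since sparse files and legitimately zero-filled data look the same.

```python
def zeroed_blocks(path: str, block_size: int = 4096) -> list:
    """return the logical block numbers of `path` that contain only NUL bytes."""
    found = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            # the final chunk may be shorter than a block; compare at its own length
            if chunk == bytes(len(chunk)):
                found.append(index)
            index += 1
    return found
```

the block numbers it reports should line up with the "logical block" values in the fsck messages for that inode.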

Recovering Data Blocks

so we just need to link the data blocks back into the extents structure. we could dive deeper into EXT4 data structures and twiddle those bits, but that would lead us into having to understand the inode and block allocators. instead, we can just dump the block-level data, and use fs-level APIs to put it back.

ext4magic -B <block> will dump the full 4096 bytes of some physical block. but because the physical block is a direct index into the device, we can also just use dd. for example, let's recover this cleared extent:

Inode 4574 has an invalid extent
        (logical block 0, invalid physical block 4031998, len 1)
Clear<y>? yes
Inode 4574, i_blocks is 8, should be 0.  Fix<y>? yes

first, we'll want to know which file this comes from:

$ mkdir preserved
# mount the drive READ-ONLY:
$ sudo mount -o ro /dev/sda1 preserved
$ find preserved/ -inum 4574
preserved/etc/passwd

that's, uh, an important file. does the data block still hold proper data?

$ dd if=/dev/sda1 of=/dev/stdout bs=4096 skip=4031998 count=1
root:x:0:0::/root:/bin/bash
bin:x:1:1::/:/usr/bin/nologin
daemon:x:2:2::/:/usr/bin/nologin
mail:x:8:12::/var/spool/mail:/usr/bin/nologin
ftp:x:14:11::/srv/ftp:/usr/bin/nologin
http:x:33:33::/srv/http:/usr/bin/nologin
nobody:x:65534:65534:Nobody:/:/usr/bin/nologin
[...]
<NUL><NUL><NUL>[...]

yes!

let's set up a scratch space. we can construct an overlay of our rootfs where we place all the recovered and patched entries, and then apply that to the original device once we're done recovering.

$ mkdir recovered

go ahead and manually link all the entries we recovered with ext4magic -I earlier into this recovered directory and fix up their group/owner/permissions.

now we can patch individual files by copying them from preserved/<path> to recovered/<path> and then dding specific byte ranges from /dev/sda1 into recovered/<path>. for example:

$ mkdir -p recovered/etc
$ sudo cp preserved/etc/passwd recovered/etc/passwd
$ sudo dd if=/dev/sda1 of=recovered/etc/passwd bs=4096 skip=4031998 count=1
$ ls -l preserved/etc/passwd
-rw-r--r-- 1 root root 3528 /etc/passwd
$ sudo truncate --size=3528 recovered/etc/passwd

because dd copies the whole block, we have that additional step of truncating the file to its original size.
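
the same read-patch-truncate dance can be done without dd. here's a minimal python sketch of it, assuming 4096B blocks and a target file that already exists (i.e. it was copied out of preserved/ first); read_blocks and patch_blocks are names made up for this example.

```python
BLOCK_SIZE = 4096  # assumption: default ext4 block size


def read_blocks(device: str, physical_block: int, count: int = 1,
                block_size: int = BLOCK_SIZE) -> bytes:
    """read `count` whole blocks from the device, starting at `physical_block`."""
    with open(device, "rb") as dev:
        dev.seek(physical_block * block_size)
        return dev.read(count * block_size)


def patch_blocks(path: str, logical_block: int, data: bytes, final_size: int,
                 block_size: int = BLOCK_SIZE) -> None:
    """write `data` into `path` at `logical_block`, then restore the original length."""
    with open(path, "r+b") as f:  # r+b: update in place, don't truncate on open
        f.seek(logical_block * block_size)
        f.write(data)
        f.truncate(final_size)
```

for the example above that would be patch_blocks('recovered/etc/passwd', 0, read_blocks('/dev/sda1', 4031998), 3528), run as root since it reads the raw device.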

Bringing It Together

we've successfully recovered (into the recovered directory):

  1. all unlinked directory entries.
  2. the cleared extent in /etc/passwd.

we still need to:

  1. recover all other cleared extents.
  2. link the recovered data back into the real file system.

step 2 is a simple rsync. step 1 is some nasty dd work. i demoed it for a file with only one cleared extent, but some files have many cleared extents, often non-contiguous.

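assume the presence of a script patch_file.py (see Appendix) which takes:

  1. -i <inode>: the inode # from the fsck output (only used for bookkeeping).
  2. -f <path>: the file to patch, relative to the fs root.
  3. --auto-len (optional): guess the file's length instead of reusing the length recorded in preserved/.
  4. one or more <logical block>,<len>,<physical block> triples, one per cleared extent.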

then we can parse the fsck output and script the rest of step 1.

# fsck /dev/sda1 
[...]

Inode 34215 has an invalid extent
        (logical block 14, invalid physical block 2207227, len 1)
Clear? yes

Inode 58213 has an invalid extent
        (logical block 0, invalid physical block 3456000, len 1024)
Clear? yes

Inode 58213 has an invalid extent
        (logical block 1024, invalid physical block 3463168, len 623)
Clear? yes

Inode 58213, i_blocks is 13176, should be 0.  Fix? yes

Inode 58222 has an invalid extent
        (logical block 0, invalid physical block 2207151, len 1)
Clear? yes

Inode 58222, i_blocks is 8, should be 0.  Fix? yes

[...]

run find preserved/ -inum <inode> on each of these inodes to find the file they correspond to, and then you can create this script from that snippet of fsck output:

./patch_file.py -i 34215 -f var/log/pacman.log 14,1,2207227
./patch_file.py -i 58213 -f usr/bin/yay 0,1024,3456000 1024,623,3463168
./patch_file.py -i 58222 -f etc/fstab 0,1,2207151
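
writing those lines by hand is error-prone once there are hundreds of extents, so here's a sketch of generating them from the saved fsck output. the regex is modelled only on the message format shown above, the inode-to-path mapping still has to come from find, and cleared_extents / patch_commands are hypothetical names.

```python
import re
from collections import defaultdict

# matches the two-line message:
#   Inode 34215 has an invalid extent
#           (logical block 14, invalid physical block 2207227, len 1)
EXTENT_RE = re.compile(
    r"Inode (\d+) has an invalid extent\s*"
    r"\(logical block (\d+), invalid physical block (\d+), len (\d+)\)"
)


def cleared_extents(fsck_output: str) -> dict:
    """map inode -> list of 'logical,len,physical' triples, in log order."""
    by_inode = defaultdict(list)
    for inode, logical, physical, length in EXTENT_RE.findall(fsck_output):
        by_inode[int(inode)].append(f"{logical},{length},{physical}")
    return dict(by_inode)


def patch_commands(fsck_output: str, paths: dict) -> list:
    """render one patch_file.py invocation per inode; `paths` maps inode -> file path."""
    lines = []
    for inode, triples in cleared_extents(fsck_output).items():
        path = paths.get(inode, "<unknown>")  # inodes find couldn't resolve
        lines.append(f"./patch_file.py -i {inode} -f {path} " + " ".join(triples))
    return lines
```

any line that comes out with <unknown> as its path is an inode find couldn't resolve, and needs to be identified by hand.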

sometimes find won't find the inode that fsck updated. for example, if you booted the system after running fsck, Linux will notice that certain files have been corrupted and will replace them with placeholders, destroying the original inode. these are usually the more important files, so you can dump the data block with that dd command and compare it to notable entries on a good file system to "guess" what it was originally. since we don't have the original inode, we've lost metadata like its length, so use the --auto-len flag to guess the length by trimming the trailing zeros off the recovered data block.

take this snippet of fsck output:

Inode 4997 has an invalid extent
        (logical block 0, invalid physical block 3831801, len 1)
Clear<y>? yes
Inode 4997, i_blocks is 8, should be 0.  Fix<y>? yes

try to find the file:

$ find preserved/ -inum 4997
# (no output)

but we dump physical block 3831801 and notice that it looks a lot like /etc/shadow. so:

./patch_file.py -i 4997 --auto-len -f etc/shadow 0,1,3831801

once you've patched all the files, then bring the file system back online, writeable, and copy over your changes.

$ sudo umount preserved
$ mkdir sda1
$ sudo mount /dev/sda1 sda1
$ sudo rsync -av --checksum recovered/ sda1/
$ sync && sudo umount sda1

if all went well, you can boot the disk now. cheers 🍻

Appendix

the patch_file.py script:

#!/usr/bin/env python3

'''
replaces zero-pages, or partial zero-pages within a single file
'''

import os
import subprocess
import sys

PAGE_LEN = 4096
IN_DIR = 'preserved'
OUT_DIR = 'recovered'

def patch_range(file_: str, logical_block: int, n_blocks: int, physical_block: int):
    '''
    patch a whole range of blocks within the file
    '''
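    subprocess.check_output([
        'dd',
        'if=/dev/sda1',
        f'of={file_}',
        f'bs={PAGE_LEN}',
        # without notrunc, dd cuts the output file off at the seek offset,
        # discarding any good data that follows the patched range
        'conv=notrunc',
        f'seek={logical_block}',
        f'skip={physical_block}',
        f'count={n_blocks}',
    ])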

def copy_for_patch(path: str) -> str:
    in_path = os.path.join(IN_DIR, '.', path)
    out_path = os.path.join(OUT_DIR, path)
    subprocess.check_output(['rsync', '-a', '--relative', in_path, OUT_DIR + '/'])
    return out_path

def estimate_length(path: str) -> int:
    '''
    return the length the file would have with its trailing zero bytes removed
    '''
    with open(path, 'rb') as f:
        contents = f.read()
    # rstrip drops every trailing NUL byte in one pass
    return len(contents.rstrip(b'\0'))

def main(path: str, auto_len: bool, patches: list):
    path = copy_for_patch(path)
    old_size = os.stat(path).st_size
    for patch in patches:
        logical_block, n_blocks, physical_block = patch
        patch_range(path, logical_block, n_blocks, physical_block)

    if auto_len:
        os.truncate(path, estimate_length(path))
    else:
        os.truncate(path, old_size)
    
def parse_args(args: list):
    '''
    return:
      str: the relative file being operated on,
      bool: auto-estimate len,
      list: the ranges to patch
    '''
    i = 0
    inode = None
    file_ = None
    auto_len = False
    ranges = []
    while i < len(args):
        arg = args[i]
        if arg == '-i':
            inode = int(args[i+1])
            i += 2
        elif arg == '-f':
            file_ = args[i+1]
            i += 2
        elif arg == '--auto-len':
            auto_len = True
            i += 1
        else:
            logical_block, n_blocks, physical_block = map(int, arg.split(','))
            #vvv not actually required, but indicative of an error
            assert logical_block < physical_block
            ranges.append((logical_block, n_blocks, physical_block))
            i += 1
    # inode doesn't actually get used
    # it's useful just to keep the script invocations organized
    return file_, auto_len, ranges


if __name__ == '__main__':
    main(*parse_args(sys.argv[1:]))