Syncing a libferris filesystem with an XML file or database

Syncing It


With libferris, FUSE, and rsync, you can synchronize a filesystem with a dissimilar data source.

By Ben Martin


Admins use rsync to synchronize two filesystem trees. With a few tricks, you can combine rsync with FUSE and libferris [1][2][3] to synchronize a filesystem with another data source, such as an XML file or a PostgreSQL database. Libferris is a user-space virtual filesystem (VFS) that lets you mount almost any data source as a filesystem. Examples of data sources libferris can mount include XML files, Berkeley db4 files, RPM packages, relational databases, LDAP servers, web servers, and applications like the X Window System, Emacs, xmms, Amarok, and Firefox.

Libferris also includes evolving support for mounting web services. For example, you can interface a libferris directory with a photo-sharing website like 23hq or Flickr. In this article, I will discuss some of the possibilities for using rsync to synchronize a libferris filesystem with an XML file or database.

The ferrisfs application lets you expose libferris filesystems through FUSE. In its most basic form, ferrisfs requires two arguments. The first, passed with --url (or -u, as in the later listings), is the URL of the libferris filesystem to expose. The last argument is the mount point, that is, where you want the FUSE filesystem to appear in your Linux kernel filesystem tree. Normally, I create a fuse subdirectory in my home directory where all my FUSE mount points appear.

Metadata and Search

Apart from mounting miscellaneous data sources, the other two goals of libferris are metadata handling and filesystem search.

Libferris comes with support for automatic metadata extraction and lets you add explicit metadata to any file on any filesystem regardless of the user's write permission.

As an example of libferris' metadata capability, consider adding a handy tag to a file on an FTP server in libferris for later identification. Even if the user does not have write access to the FTP server, libferris will store the metadata in Resource Description Framework (RDF) to associate the tag with the file. On the other hand, if you add a metadata tag to a file in a home directory, libferris will store the metadata in a kernel extended attribute to give non-libferris applications access via the attr(1) interface.

Metadata extraction in libferris covers simple cases such as extracting the dimensions and Exif data of image files, as well as more advanced cases. For example, if you tag files in the F-Spot photo management tool, you can then access those tags using libferris.

Filesystem search support in libferris allows you to create multiple filesystem indexes.

Plugins let you build indexes using PostgreSQL, Lucene, Xapian, and other tools. You can even link indexes together to create a federation.

Recent versions support using libferris through FUSE, giving unmodified applications direct access to anything libferris sees as a filesystem.

Steps

Listing 1 shows some of the steps for setting up an interaction with a libferris-backed FUSE filesystem. First, a very basic XML file is created, and its root element is mounted at ~/fuse/simple-xml.

Listing 1: FUSE Interaction on a Mounted XML File
$ cat simple-xml.xml
<simple-xml>
  <something/>
</simple-xml>
$ mkdir simple-xml
$ ferrisfs --url ~/fuse/simple-xml.xml/simple-xml \
    simple-xml
$ ll simple-xml
total 0
-rwx------ 0 ferristester ferristester 0 Jan  1  1970 something*
$ date >| simple-xml/something
$ cat simple-xml.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<simple-xml>
  <something mtime="1179838137">Tue May 22 22:48:57 EST 2007
</something>
</simple-xml>

Notice that the --url parameter selects the first element in the XML file as the libferris filesystem (instead of the XML file itself).

XML files must have a single root element; by mounting that root element instead of the XML file, you avoid exposing this detail to the applications using the FUSE filesystem.

Normal filesystem metadata is mirrored in the XML file using XML attributes. By updating the contents of a file under the FUSE mount point, libferris both updates the contents of the XML element and records the modification time in an XML attribute.
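
Because the metadata ends up as ordinary XML attributes, it can be read back with standard text tools. The following is a minimal sketch, assuming a simple file in which the attribute sits on the same line as its element's opening tag; xml_attr is a hypothetical helper name, not something libferris provides, and a real XML parser would be the safer choice for anything more complex.

```shell
# xml_attr NAME: print the value of attribute NAME for each line on stdin
# that carries it. Line-based and regex-driven, so it only suits simple
# files such as simple-xml.xml; it is not a general XML parser.
xml_attr() {
    sed -n "s/.*[[:space:]]$1=\"\([^\"]*\)\".*/\1/p"
}

# Read the modification time recorded for the "something" element,
# using the line that appears in Listing 1.
printf '%s\n' '<something mtime="1179838137">Tue May 22 22:48:57 EST 2007' |
    xml_attr mtime
# prints 1179838137
```

The value is the same seconds-since-epoch number that lstat(2) would report for the file under the FUSE mount point.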

Listing 2 shows rsync on a libferris-backed FUSE filesystem. First, the source-native-fs directory is created and populated with some simple test files. Other than the use of the -T (--temp-dir) command-line option, the command looks like any other invocation of rsync.

Listing 2: Rsync to XML
$ mkdir source-native-fs
$ cd source-native-fs
$ date >datefile1.txt
$ date >datefile2.txt
$ touch emptyA
$ echo -n "hi there" > main
$ cd ~/fuse
$ mkdir ~/fuse/rsync-junk
$ rsync -avz -T ~/fuse/rsync-junk \
   source-native-fs/ simple-xml/
$ cat simple-xml.xml
<?xml version="1.0" encoding="UTF-8" ...?>
<simple-xml atime="1179838274" mode="40775"...
   mtime="1179838199">
  <something mtime="1179838137"
  >Tue May 22 22:48:57 EST 2007
</something>
  <datefile1.txt atime="1179838337" mode="100664"...
     mtime="1179838179">Tue May 22 22:49:39 EST 2007
</datefile1.txt>
...
  <main atime="1179838338" mode="100664"...
     mtime="1179838199">hi there</main>
</simple-xml>
$ rsync -avz  --delete-after \
  -T ~/fuse/rsync-junk \
   source-native-fs/ simple-xml/
building file list ... done
deleting something
sent 159 bytes  received 20 bytes  358.00 bytes/sec
total size is 66  speedup is 0.37
$ grep -c something simple-xml.xml
0
$ fusermount -u simple-xml

The final rsync invocation uses the --delete-after option to remove the something file, which was originally part of the XML file but is not part of the source filesystem passed to rsync.

The grep command checks that something is no longer part of the XML file after the sync.
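
In a script, the same check is usually made on grep's exit status rather than on its printed output. The sketch below assumes element names without regular expression metacharacters; element_absent is a hypothetical helper name used only for illustration.

```shell
# element_absent NAME: succeed (exit status 0) when the XML on stdin has
# no opening or self-closing <NAME> tag. NAME is used as a basic regular
# expression, so names containing "." or similar need escaping.
element_absent() {
    ! grep -q -- "<$1[ />]"
}

# Example on a cut-down version of the synced XML file.
if printf '%s\n' '<simple-xml>' '  <main>hi there</main>' '</simple-xml>' |
    element_absent something
then
    echo "something is no longer in the XML file"
fi
```

Against the real file, the call would be element_absent something < simple-xml.xml.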

The previous section showed data being synced between a native kernel filesystem (ext3 in this case) and a subtree in an XML file.

Sync Across Filesystem Types

The libferris and FUSE combination allows you to convert between different data formats while you are performing the sync. By exposing part of an XML file through libferris and FUSE, you can keep various parts of an XML file in sync with other data, perhaps with many different rsync invocations that each cover a different part of a single XML file.

The ability to rsync between different filesystems like this is convenient when the two filesystems provide different features and you want a combination of them. For example, many tools make editing XML simple, but accessing a single element (file) in XML is much slower than accessing a single file in a db4 file.

The commands shown in Listing 3 keep a db4 file in sync with the contents of an XML file. The simple-xml FUSE filesystem, which is based on the simple-xml.xml file in Listing 1, is reused here. If there are attributes in the XML file that are not the standard lstat(2) attributes, they are exposed by the libferris FUSE filesystem as extended attributes.

Listing 3: Rsyncing an XML File into a db4 File
$ fcreate `pwd` --create-type=db4 name=db4.db
$ mkdir db4
$ ferrisfs -u ~/fuse/db4.db db4
$ rsync -avz --delete-after -T ~/fuse/rsync-junk  simple-xml/ db4/
$ db_dump -p db4.db
VERSION=3
format=print
type=btree
db_pagesize=4096
HEADER=END
 /atime
 1179840317
 /datefile1.txt/atime
 1179840317
 /datefile1.txt/mode
 100664
 /datefile1.txt/mtime
 1179838179
...
 datefile1.txt
 Tue May 22 22:49:39 EST 2007\0a
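
The db_dump -p output in Listing 3 alternates key and value lines, each indented by one space, after the HEADER=END marker. As a rough sketch that assumes only this layout, a small awk filter can pair each key with its value to make the stored metadata easier to scan; db_dump_pairs is a hypothetical helper name, not part of Berkeley DB or libferris.

```shell
# db_dump_pairs: read "db_dump -p" output on stdin and print one
# "key<TAB>value" line per record. Lines before HEADER=END are skipped,
# and reading stops being significant at DATA=END.
db_dump_pairs() {
    awk '
        /^HEADER=END$/ { in_data = 1; next }
        /^DATA=END$/   { in_data = 0 }
        in_data {
            sub(/^ /, "")                 # drop the one-space indent
            if (have_key) { print key "\t" $0; have_key = 0 }
            else          { key = $0; have_key = 1 }
        }
    '
}

# Example using a few records taken from Listing 3.
db_dump_pairs <<'EOF'
VERSION=3
format=print
type=btree
db_pagesize=4096
HEADER=END
 /datefile1.txt/mode
 100664
 /datefile1.txt/mtime
 1179838179
DATA=END
EOF
```

With a mounted and synced db4 file, the equivalent call would be db_dump -p db4.db | db_dump_pairs.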

The rsync command can sync extended attributes across filesystems with the -X (--xattrs) command-line option. One complication is that libferris creates many virtual attributes to expose extra metadata about the filesystem, and rsync will try to sync all of them.

To get around this extra metadata libferris wants to offer, the ferrisfs command has the option to limit what attributes are reported from the FUSE filesystem. For example, using --show-ea=user.dislikes will make the FUSE filesystem report only the user.dislikes extended attribute. The result is that rsync will only try to sync that one extended attribute instead of a lot of other metadata that libferris makes available.

Another complication of syncing extended attributes is that filesystems report user-modifiable attributes with the user. prefix, so the attribute dislikes is only readable by getxattr(2) under the name user.dislikes. Because many XML files are unlikely to have the user. prefix in their XML attributes, ferrisfs provides the --prepend-user-dot-prefix-to-ea-regex command-line option, which adds user. to any attribute that matches the given regular expression.

Listing 4 shows a first attempt to sync XML attributes as well as file content with ferrisfs and rsync. The first db_dump execution shows that none of the XML attributes have been written to the Berkeley db4 file. Using the rsync -X (--xattrs) command-line option to try to correct this gives the error message about "as-xml" not being available through getxattr().

Listing 4: Using Rsync to Sync XML Attributes
$ fcreate `pwd` --create-type=db4 name=target.db
$ mkdir target
$ ferrisfs -u `pwd`/target.db target
$ cat attributes-in-xml.xml
<main>
  <sub1 attr1="hello" second="world"/>
  <gaw  another="value"/>
</main>
$ mkdir attributes-in-xml
$ ferrisfs -u `pwd`/attributes-in-xml.xml/main \
    attributes-in-xml
$ rsync -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
$ db_dump -p target.db
VERSION=3
...
HEADER=END
 gaw
 sub1
DATA=END
$ rsync -X -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
...building file list ...
rsync: rsync_xal_get: lgetxattr(".","as-xml",37199)
failed: Input/output error (5)
...
$ db_dump -p target.db
VERSION=3
...
HEADER=END
 gaw
 sub1
DATA=END
$ fusermount -u attributes-in-xml
$ ferrisfs -u `pwd`/attributes-in-xml.xml/main   \
   --show-ea-regex="(attr1|another|second)"      \
   --prepend-user-dot-prefix-to-ea-regex=".*"    \
    attributes-in-xml
$ rsync -X -avz --delete-after -T ~/fuse/rsync-junk \
    attributes-in-xml/ target/
$ db_dump -p target.db
...
HEADER=END
 /gaw/user.another
 value
 /sub1/user.attr1
 hello
 /sub1/user.second
 world
 gaw
 sub1
DATA=END

The trick is to use the ferrisfs --show-ea-regex and --prepend-user-dot-prefix-to-ea-regex options to only show the extended attributes you are interested in. If an attribute that matches show-ea-regex is available for a virtual libferris file, ferrisfs will export that attribute to FUSE as an extended attribute. As the final db_dump shows, the XML attributes are now available in the db4 file as well.
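
The combined effect of the two options can be modeled with a short filter. This is only an illustration of the idea, not how ferrisfs is implemented: it assumes both regular expressions must match the whole attribute name, which the article does not state, and export_ea_names is a hypothetical helper name.

```shell
# export_ea_names SHOW_REGEX PREFIX_REGEX
# Reads candidate attribute names on stdin, one per line. Names matching
# SHOW_REGEX are kept; kept names that also match PREFIX_REGEX are printed
# with a leading "user.". Whole-name matching is an assumption here.
export_ea_names() {
    show_re=$1
    prefix_re=$2
    grep -E -x -- "$show_re" | while IFS= read -r name; do
        if printf '%s\n' "$name" | grep -E -q -x -- "$prefix_re"; then
            printf 'user.%s\n' "$name"
        else
            printf '%s\n' "$name"
        fi
    done
}

# The sub1 element from Listing 4 offers attr1 and second; as-xml stands
# in for the virtual attributes that libferris adds on its own.
printf '%s\n' attr1 second as-xml |
    export_ea_names '(attr1|another|second)' '.*'
# prints user.attr1 and user.second, one per line
```

The as-xml attribute that made the first rsync -X run fail is filtered out because it does not match the first expression.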

Listing 5 is a simple table in a PostgreSQL database. The table can be mounted by using the postgresql:// or pg:// URL in libferris, as the ferrisls command shows. Using a PostgreSQL table as the source for rsync presents no new issues with how to invoke ferrisfs, as shown in Listing 6. Each column in the table becomes an extended attribute in the target filesystem.

When the file contents of a tuple are read by libferris, the result is an XML-serialized version of the data. Because the extended attributes give the same information in broken-down form, you don't really care about the tuple's file content. Listing 6 solves this issue with the --force-empty-file-contents-regex option, which reports that all the tuples are zero-byte files.

Listing 5: Accessing a PostgreSQL Database
$ psql ferristester
ferristester=> \d foobar
            Table "public.foobar"
 Column  |          Type          | Modifiers
---------+------------------------+-----------
 fooid   | integer                | not null
 fooname | character varying(100) |
 e       | character varying(100) |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (fooid)
ferristester=> select * from foobar;
 fooid | fooname |           e
-------+---------+-----------------------
    10 | William |
    45 | Rick    | 15 credibility street
  3002 | Satou   | Tokyo
   101 | John    | Some data
(4 rows)
ferristester=> \q
$ ferrisls --xml pg://localhost/ferristester/foobar
<?xml version="1.0" encoding="UTF-8" ... ?>
<ferrisls>
  <ferrisls e="" fooid="" fooname="" ...
    name="foobar" primary-key="fooid" ...
   url="pg:///localhost/ferristester/foobar">
    <context e="" fooid="10"
      fooname="William" name="10".../>
    <context e="Tokyo" fooid="3002"
      fooname="Satou" name="3002".../>
...
  </ferrisls>
</ferrisls>

Listing 6: Rsyncing Data Out of a Table
$ mkdir pg
$ ferrisfs --show-ea=user.fooid,user.fooname,user.e \
  --prepend-user-dot-prefix-to-ea-regex=".*"  \
  --force-empty-file-contents-regex=".*" \
  -u pg://localhost/ferristester/foobar pg
$ ls -l pg
total 0
-rwx------ 0 ferristester ferristester 50 Jan  1  1970 10
-rwx------ 0 ferristester ferristester 57 Jan  1  1970 101
-rwx------ 0 ferristester ferristester 55 Jan  1  1970 3002
-rwx------ 0 ferristester ferristester 68 Jan  1  1970 45
$ cd pg
$ attr -l 101
Attribute "fooid" has a 3 byte value for 101
Attribute "fooname" has a 4 byte value for 101
Attribute "e" has a 9 byte value for 101
$ attr -g fooname 101
Attribute "fooname" had a 4 byte value for 101:
John
$ cd ..
$ mkdir target
$ rsync -Cavz -X -T ~/fuse/rsync-junk pg/ target/
building file list ... done
./
10
101
3002
45
7
sent 762 bytes  received 136 bytes  1796.00 bytes/sec
total size is 0  speedup is 0.00
$ cd target
$ attr -l 3002
Attribute "e" has a 5 byte value for 3002
Attribute "fooid" has a 4 byte value for 3002
Attribute "fooname" has a 5 byte value for 3002
$ attr -g e 3002
Attribute "e" had a 5 byte value for 3002:
Tokyo
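
Since every column of a tuple is now an extended attribute on a zero-byte file, a whole row can be read back by listing the attribute names and fetching each one. The sketch below parses the attr -l output format shown in Listing 6; attr_names and dump_row are hypothetical helper names, and dump_row is only defined, not run, because it needs attr(1) and a populated target directory.

```shell
# attr_names: extract the attribute names from "attr -l" output on stdin.
attr_names() {
    sed -n 's/^Attribute "\([^"]*\)" has a [0-9][0-9]* byte value for .*/\1/p'
}

# dump_row FILE: print name=value for every extended attribute on FILE.
# Relies on attr(1); -q asks attr to print only the value.
dump_row() {
    attr -l "$1" | attr_names | while IFS= read -r name; do
        printf '%s=%s\n' "$name" "$(attr -q -g "$name" "$1")"
    done
}

# Demonstrate the parsing step on the attr -l output from Listing 6.
attr_names <<'EOF'
Attribute "fooid" has a 3 byte value for 101
Attribute "fooname" has a 4 byte value for 101
Attribute "e" has a 9 byte value for 101
EOF
# prints fooid, fooname, and e, one per line
```

Run inside the target directory, dump_row 3002 would then list the three columns of that tuple.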

Syncing into PostgreSQL

Synchronizing information into a PostgreSQL database with rsync presents extra issues because a database table does not behave exactly like a filesystem. For example, as shown in Listing 5, the primary key of the table is fooid. Without specifying at least the primary key of the tuple to create, you cannot make a new file in a mounted PostgreSQL table.

Also, when the file contents of a tuple are read by libferris, the result is an XML-serialized version of the tuple itself. Updating both the XML-serialized version of a tuple and each individual table column through the extended attributes would be twice the effort. The --throw-away-write-to-file-contents-regex command-line option to ferrisfs solves the latter problem by ignoring anything that is written to the file's contents for files that have a URL matching the given regular expression. Updates must happen via the extended attributes interface.

The --delay-commit-path ferrisfs command-line option was added to solve the primary key issue. The nominated path allows new files to be created and extended attributes written on those new files without immediately trying to update the database. Listing 7 shows how to rsync into a PostgreSQL table.

Listing 7: Rsyncing into a PostgreSQL Table
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
  --prepend-user-dot-prefix-to-ea-regex=".*"  \
  --throw-away-write-to-file-contents-regex=".*" \
  --delay-commit-path=pg:///localhost/ferristester/foobar \
  --delay-commit-path-trigger-ea=user.fooname \
  --throw-away-write-to-ea-regex=".*foobar" \
  -u pg://localhost/ferristester/foobar pg
$ rsync -avz -X -T ~/fuse/rsync-junk target/ pg/
building file list ... done
10
101
3002
45
7
sent 756 bytes  received 130 bytes  590.67 bytes/sec
total size is 0  speedup is 0.00
$ cd target
$ ll
total 28K
-rwx------ 1 ferristester ferristester 50 Jan  1  1970 10*
-rwx------ 1 ferristester ferristester 68 Jan  1  1970 45*
-rwx------ 1 ferristester ferristester 57 Jan  1  1970 101*
-rwx------ 1 ferristester ferristester 55 Jan  1  1970 3002*
$ attr -g fooname 10
Attribute "fooname" had a 7 byte value for 10:
William
$ attr -s fooname -V "Willie" 10
Attribute "fooname" set to a 6 byte value for 10:
Willie
$ touch 7
$ attr -s fooid -V 7 7
Attribute "fooid" set to a 1 byte value for 7:
7
$ attr -s fooname -V new-item 7
Attribute "fooname" set to a 8 byte value for 7:
new-item
$ cd ..
$ rsync -avz -X -T ~/fuse/rsync-junk target/ pg/

The commands shown in Listing 8 create a second table and then populate it from foobar using rsync. If the commands from the mkdir command down are run again at a later time, then foo2 is updated using rsync with changes from the foobar table.

Listing 8: Keeping a Copy of a PostgreSQL Table
$ psql ferristester
ferristester=> create table foo2
  ( fooid serial primary key,
    fooname varchar(100),
    e varchar(100));
ferristester=> \q
$ mkdir -p foo2
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
  --prepend-user-dot-prefix-to-ea-regex=".*"  \
  --force-empty-file-contents-regex=".*" \
  --force-empty-read-from-ea-regex=".*foobar" \
  -u pg://localhost/ferristester/foobar pg
$ ferrisfs --show-ea=user.fooname,user.e,user.fooid \
  --prepend-user-dot-prefix-to-ea-regex=".*"  \
  --throw-away-write-to-file-contents-regex=".*" \
  --delay-commit-path=pg:///localhost/ferristester/foo2 \
  --delay-commit-path-trigger-ea=user.fooname \
  --throw-away-write-to-ea-regex=".*foo2" \
  -u pg://localhost/ferristester/foo2 foo2
$ rsync -avz -X -T ~/fuse/rsync-junk pg/ foo2/
$ fusermount -u pg
$ fusermount -u foo2

Future Directions

Support for rsync with PostgreSQL currently revolves around single tables. In the future, this support should expand to allow rsync to operate on an entire database at once.

Also, adding support for other syncing solutions like Unison [5] and Harmony [6] will be very interesting.

INFO
[1] libferris: http://witme.sourceforge.net/libferris.web/
[2] rsync: http://rsync.samba.org/
[3] Filesystem in Userspace: http://fuse.sourceforge.net/
[4] fuselagefs and delegatefs: http://sourceforge.net/project/showfiles.php?group_id=16036&package_id=225200
[5] Unison bidirectional sync: http://www.cis.upenn.edu/~bcpierce/unison/
[6] Harmony bidirectional sync: http://www.seas.upenn.edu/~harmony/
THE AUTHOR

Ben Martin has been working on filesystems for more than 10 years. He is currently working toward a PhD. His research focuses on combining semantic filesystems with formal concept analysis to improve human-filesystem interaction.