Postgresql 11: Domain Array Types

Arrays of Domain

One subtle new feature in PostgreSQL 11 is domain type arrays. I say subtle, because you may not have noticed that they were not supported before, or perhaps, as was my experience, your first attempt to make use of an array over a domain happened to be with version 11, and it “just worked.”

If you share the philosophy to have the database do as much as reasonably possible to ensure data consistency (and I think you should), then you know that domains are a great tool to that end, letting you express a set of constraints in one place, then reference them in as many tables or datatypes as needed with much less repetition or copy/paste errors.

For example, given this simplified subset of a schema making reference to a domain required_text over text ensuring that the value is nonempty, not null, as well as has no leading / trailing whitespace:

create domain required_text text
    check (trim(value) = value
        and length(value) > 0) not null;

create table email_address
(
    id serial primary key,
    address required_text
        -- Not a production quality email constraint, but...
        check (address like '%@%')
);

create table email(
    id serial primary key,
    sender_id int
        references email_address,
    subject text
    --  ...
);

create table email_recipient (
    email_id int not null
        references email,
    recipient_id int not null
        references email_address,
    primary key (email_id, recipient_id)
);

PostgreSQL implements domains using its flexible and user-expandable type system, where each type known to the database exists as a row in system catalog table pg_catalog.pg_type. The primary key for this table is an integer-ish field/type named oid. The oid value corresponding to a type’s pg_type row is usually (ah, sweet foreshadowing) how a PostgreSQL server describes a result set’s column types in the client/server protocol with the RowDescription message which indicates the PostgerSQL-side datatypes, informing the client’s driver as to what client-side types and/or parsing logic to use to decode the result set’s values.

Basic scalar types are described in the system catalog’s pg_type table by rows with typtype set to ‘b’ for “base type.” Domains are described with a ‘d’, for, well, domain. Arrays are also modeled as types in the system catalog, described with typcategory value A for “Array” and through having their oid referenced by their scalar’s row in column typarray. Convention dictates that the array type’s name will be the same as the scalar type’s, but with an underscore prepended.

All that’s a mouthful. Let’s just look:

scratch=# select
    oid, typname, typtype, typcategory, typarray, typbasetype
    from pg_type
    where typname in ('text', 'required_text', '_text', '_required_text');

   oid   |    typname     | typtype | typcategory | typarray | typbasetype
---------+----------------+---------+-------------+----------+-------------
      25 | text           | b       | S           |     1009 |           0
    1009 | _text          | b       | A           |        0 |           0
 2741346 | required_text  | d       | S           |  2741345 |          25
 2741345 | _required_text | b       | A           |        0 |           0
(4 rows)

Here we see a subset of the pg_type descriptions for types text and our domain required_text plus their corresponding array types. The required_text domain is described with typtype value ‘d,’ and references its base type’s oid in the typbasetype column. Being scalars supporting arrays, they name their array types by oid in the typarray column: 1009 and 2741345. The array types _text and _required_text are marked as arrays according to typcategory value ‘A’ (and through other pg_type columns we’re not interested in for the sake of this post). The oids for required_text and _required_test will be based on the state of your database cluster when you actually run the code, whereas the values for text and _text, being builtins, are well known and set up by initdb when you initialized your cluster.

(Why does the array type _required_text end up having a lower oid than its scalar type? Because the code for CREATE TYPE ... AS DOMAIN in /src/backend/commands/typecmds.c’s DefineDomain() creates the array type oid first since the scalar domain’s type will need to forward reference it.)

Array types don’t have a corresponding typarray value, because, well, PostgreSQL arrays are inherently multi-dimensional, and we don’t want to have infinite recursion in the system catalog.

Given the above schema, you may have an overview query projecting both the sender and all recipients by address along the lines of:

select
    em.id,
    em.subject,
    sa.address as sender,
    (select array_agg(ea.address)
            from email_recipient e_r
            join email_address ea
                on (e_r.recipient_id = ea.id)
            where e_r.email_id = em.id)
        as recipients
    from email em
        join email_address sa
            on em.sender_id = sa.id
;

whose results may be something like:

 id |      subject      |           sender           |                 recipients
----+-------------------+----------------------------+--------------------------------------------
  1 | Anyone Out there? | joe@foo.com                | {mary@jane.com,majortom@space.control.com}
  2 | Yup!              | majortom@space.control.com | {joe@foo.com,mary@jane.com}

Excellent! Prior versions of PostgreSQL would tell you:

    ERROR:  could not find array type for data type required_text

which you’d then probably solve through downcasting ea.address to text:

...
    (select array_agg(ea.address::text)
...

to end up with a vanilla array of text and just move on. But no more need in PG 11!

But when embedding this same query into a client application (here I will use python and the psycopg2 driver), you are then met with a novel surprise —- your application code does not recieve a beautiful array or list of strings like it would have if you had done that SQL-side downcast, but instead probably just a single string which includes both the array delimiting { and } characters as well as the interior commas:

import psycopg2

con = psycopg2.connect('dbname=scratch')
cur = con.cursor()

cur.execute('''
select
    em.id,
    em.subject,
    sa.address as sender,
    (select array_agg(ea.address)
            from email_recipient e_r
            join email_address ea
                on (e_r.recipient_id = ea.id)
            where e_r.email_id = em.id)
        as recipients
    from email em
        join email_address sa
            on em.sender_id = sa.id
''')

for row in cur:
    print(row)

cur.close()
con.close()

Results:

$ python /tmp/test.py
(1, 'Anyone Out there?', 'joe@foo.com', '{mary@jane.com,majortom@space.control.com}')
(2, 'Yup!', 'majortom@space.control.com', '{joe@foo.com,mary@jane.com}')

Boo! What’s with those big strings at the end instead of a list of individual addresses? And why did the individual required_text values projected in the sender column come across A-OK? What’s going on? This isn’t helpful at all!

Answer

Ends up what is happening is a difference in how PostgreSQL 11 currently describes results containing arrays of domain versus results containing scalar domain values. Scalar domain values are described in result sets using the underlying base type’s oid since time immemorial, whilst the array of domain is being described with its unique oid which your database driver and application most likely do not have a mapping to a native type (yet).

Rephrased, the server is performing type erasure in the client/server protocol for scalar domain projections, making their projected values be indistinguishable from those of the underlying base type. This is has been happening since PostgreSQL’s support for domains for the ease of client applications / and ‘simple’ drivers who need not concern themselves / differentiate between domain values and the underlying base types. This type erasure has popped up as a surprise and discussion point for the occassional user who wanted to use domains to trigger special client-side type mappings, but then find they cannot because registering the domain’s oid with their driver ends up never getting tickled. Their solution is to then just go ahead and make a real type, but leveraging their implicit base type’s input, output, operator class, etc. PostgreSQL functions.

Let’s peek into how these results are being presented by replacing the for loop over the results with one over the cursor’s description:

...
for column in cur.description:
    print('Name: %s / oid: %d' % (column.name, column.type_code))
...

and we see:

$ python /tmp/test.py
Name: id / oid: 23
Name: subject / oid: 25
Name: sender / oid: 25
Name: recipients / oid: 2741345

Interesting! The scalar domain column sender is not described with the oid we observed in the pg_types table (2741346), but instead is presented to the client with the oid of its base type, 25, corresponding to text. The equivalent conversion is not currently done for arrays of domains — they’re presented back to the client with their unadulterated “array of domain” oid, which, out of the box, your database client most likely won’t know how to handle. Psycopg2 defaults to presenting the value as a string when it doesn’t know otherwise, so a single string with curlybraces and commas is handed back to the application code. I’ve brought this arguable asymmetry in how PostgreSQL treats returning domains to client applications up on pgsql-hackers, and it seems it will stay this way at least for PostgreSQL 11.

Fortunately, psycopg2 lets us register new type adaptors, and has a convienence function to create a type adaptor for an array type based upon the the type adaptor for the corresponding scalar type. So, with some lookups in both the system catalog as well as within the psycopg2 module, we can craft a flexible registration function which introspects into our database to see what domains we have, and to register their array type’s oid to be handled equivalently to that of an array of the domain’s base type. We’ve only created a domain over text, but the following ought to handle domains over int or any other of the core PostgreSQL types.

import psycopg2, psycopg2.extras, psycopg2.extensions


def register_domain_array_types(con):

    cur = con.cursor(cursor_factory=psycopg2.extras.NamedTupleCursor)
    # Learn user-created domain names, their array type's oids, their
    # base type's oid
    cur.execute('''
        select
            pgt.typname,        -- Name of this domain type,
            pgt.typbasetype,    -- its base type's oid,
                                -- base type's name,
            pgt_base.typname as base_typname,
            pgt.typarray       -- and its array type's oid

        from pg_catalog.pg_type pgt
                -- Dig out the base type for the domain ...
                join pg_catalog.pg_type pgt_base
                    on (pgt.typbasetype = pgt_base.oid)
        where
            -- Domains only please!
            pgt.typtype = 'd'
        ''')

    for c in cur.fetchall():
        # Get the psycopg type adaptor for the domain's base type's oid.
        base_type_adaptor = psycopg2.extensions.string_types.get(c.typbasetype)

        if not base_type_adaptor:
            raise Exception('Unknown type adaptor for %s'
                            ' for domain array type %s'
                            % (c.base_typname, c.typname))

        # Create new psycopg-side array type(converter) ...
        array_type = psycopg2.extensions.new_array_type(
                        # oids to trigger use of this typecoverter.
                        # The array's oid only please.
                        (c.typarray,),

                        # Name for this type: let's just tack '[]'
                        # onto the domain's name
                        c.typname + '[]',

                        # Typeconverter for the scalar type. Let's use the one for the
                        # domain's base type!
                        base_type_adaptor)

        # ... and then register it. Done!
        psycopg2.extensions.register_type(array_type)

    cur.close()

So, after making a call to that routine, and expecting the recipients projection to now be a list of strings:

import psycopg2, psycopg2.extras, psycopg2.extensions

def register_domain_array_types(con):
    ...

con = psycopg2.connect('dbname=scratch')
register_domain_array_types(con)

cur = con.cursor(cursor_factory=psycopg2.extras.NamedTupleCursor)

cur.execute('''
select
    em.id,
    em.subject,
    sa.address as sender,
    (select array_agg(ea.address)
            from email_recipient e_r
            join email_address ea
                on (e_r.recipient_id = ea.id)
            where e_r.email_id = em.id)
        as recipients
    from email em
        join email_address sa
            on em.sender_id = sa.id
''')

for row in cur:
    print(row)
    for recip in row.recipients:
        print('    Recipient: %s' % (recip,))
    print('')

cur.close()
con.close()

Now gives us something we like:

$ python array_type_example.py
Record(id=1, subject='Anyone Out there?', sender='joe@foo.com', recipients=['mary@jane.com', 'majortom@space.control.com'])
    Recipient: mary@jane.com
    Recipient: majortom@space.control.com

Record(id=2, subject='Yup!', sender='majortom@space.control.com', recipients=['joe@foo.com', 'mary@jane.com'])
    Recipient: joe@foo.com
    Recipient: mary@jane.com

$

So, research your own driver’s custom type registration capability, and make use of a post-connection setup function to scrape the system catalog for domains and their array type’s oids to make full use of domain arrays at the client side!

Next time I’ll augment and generalize this setup function to educate psycopg2 about custom composite types and show how useful they can be when writing json microservices.

References: