Arrays of Domain
One subtle new feature in PostgreSQL 11 is domain type arrays. I say subtle, because you may not have noticed that they were not supported before, or perhaps, as was my experience, your first attempt to make use of an array over a domain happened to be with version 11, and it “just worked.”
If you share the philosophy to have the database do as much as reasonably possible to ensure data consistency (and I think you should), then you know that domains are a great tool to that end, letting you express a set of constraints in one place, then reference them in as many tables or datatypes as needed with much less repetition or copy/paste errors.
For example, given this simplified subset of a schema making reference to a domain required_text
over text
ensuring that the value is nonempty, not null, as well as has no leading / trailing whitespace:
create domain required_text text
check (trim(value) = value
and length(value) > 0) not null;
create table email_address
(
id serial primary key,
address required_text
-- Not a production quality email constraint, but...
check (address like '%@%')
);
create table email(
id serial primary key,
sender_id int
references email_address,
subject text
-- ...
);
create table email_recipient (
email_id int not null
references email,
recipient_id int not null
references email_address,
primary key (email_id, recipient_id)
);
PostgreSQL implements domains using its flexible and user-expandable type system, where each type
known to the database exists as a row in system catalog table
pg_catalog.pg_type
. The primary key for this table
is an integer-ish field/type named oid
. The oid value corresponding to a type’s pg_type
row is usually
(ah, sweet foreshadowing) how a PostgreSQL server describes a result set’s column types in the
client/server protocol with the RowDescription
message which indicates the PostgerSQL-side datatypes, informing
the client’s driver as to what client-side types and/or parsing logic to use to decode the result set’s values.
Basic scalar types are described in the system catalog’s pg_type
table by rows with typtype
set to ‘b’
for “base type.” Domains
are described with a ‘d’, for, well, domain. Arrays are also modeled as types in the system catalog,
described with typcategory
value A
for “Array” and through having their oid referenced by
their scalar’s row in column typarray
.
Convention dictates that the array type’s name will be the same as the scalar type’s, but with an underscore prepended.
All that’s a mouthful. Let’s just look:
scratch=# select
oid, typname, typtype, typcategory, typarray, typbasetype
from pg_type
where typname in ('text', 'required_text', '_text', '_required_text');
oid | typname | typtype | typcategory | typarray | typbasetype
---------+----------------+---------+-------------+----------+-------------
25 | text | b | S | 1009 | 0
1009 | _text | b | A | 0 | 0
2741346 | required_text | d | S | 2741345 | 25
2741345 | _required_text | b | A | 0 | 0
(4 rows)
Here we see a subset of the pg_type
descriptions for types text
and our domain
required_text
plus their corresponding array types. The required_text
domain
is described with typtype
value ‘d,’ and references its base type’s oid in the
typbasetype
column. Being scalars supporting arrays, they name their array
types by oid in the typarray
column: 1009 and 2741345. The array types _text
and _required_text
are marked as arrays according to typcategory
value ‘A’
(and through other pg_type
columns we’re not interested in for the sake of this
post). The oids for required_text
and _required_test
will
be based on the state of your database cluster when you actually run the code, whereas
the values for text
and _text
, being builtins, are well known and set up by initdb
when you
initialized your cluster.
(Why does the array type _required_text
end up having a lower oid than its scalar type? Because
the code for CREATE TYPE ... AS DOMAIN
in /src/backend/commands/typecmds.c’s DefineDomain()
creates the array type oid first since the scalar domain’s type will need to forward reference it.)
Array types don’t have a corresponding typarray
value, because, well, PostgreSQL
arrays are inherently multi-dimensional, and we don’t want to have infinite
recursion in the system catalog.
Given the above schema, you may have an overview query projecting both the sender and all recipients by address along the lines of:
select
em.id,
em.subject,
sa.address as sender,
(select array_agg(ea.address)
from email_recipient e_r
join email_address ea
on (e_r.recipient_id = ea.id)
where e_r.email_id = em.id)
as recipients
from email em
join email_address sa
on em.sender_id = sa.id
;
whose results may be something like:
id | subject | sender | recipients
----+-------------------+----------------------------+--------------------------------------------
1 | Anyone Out there? | joe@foo.com | {mary@jane.com,majortom@space.control.com}
2 | Yup! | majortom@space.control.com | {joe@foo.com,mary@jane.com}
Excellent! Prior versions of PostgreSQL would tell you:
ERROR: could not find array type for data type required_text
which you’d then probably solve through downcasting ea.address
to text:
...
(select array_agg(ea.address::text)
...
to end up with a vanilla array of text and just move on. But no more need in PG 11!
But when embedding this same query into a client application (here I will use python and the psycopg2 driver), you
are then met with a novel surprise —- your application code does not recieve a beautiful array or list of strings
like it would have if you had done that SQL-side downcast, but instead probably just a single string which includes
both the array delimiting {
and }
characters as well as the interior commas:
import psycopg2
con = psycopg2.connect('dbname=scratch')
cur = con.cursor()
cur.execute('''
select
em.id,
em.subject,
sa.address as sender,
(select array_agg(ea.address)
from email_recipient e_r
join email_address ea
on (e_r.recipient_id = ea.id)
where e_r.email_id = em.id)
as recipients
from email em
join email_address sa
on em.sender_id = sa.id
''')
for row in cur:
print(row)
cur.close()
con.close()
Results:
$ python /tmp/test.py
(1, 'Anyone Out there?', 'joe@foo.com', '{mary@jane.com,majortom@space.control.com}')
(2, 'Yup!', 'majortom@space.control.com', '{joe@foo.com,mary@jane.com}')
Boo! What’s with those big strings at the end instead of a list of individual addresses? And why did the individual
required_text
values projected in the sender
column come across A-OK?
What’s going on? This isn’t helpful at all!
Answer
Ends up what is happening is a difference in how PostgreSQL 11 currently describes results containing arrays of domain versus results containing scalar domain values. Scalar domain values are described in result sets using the underlying base type’s oid since time immemorial, whilst the array of domain is being described with its unique oid which your database driver and application most likely do not have a mapping to a native type (yet).
Rephrased, the server is performing type erasure in the client/server protocol for scalar domain projections, making their projected values be indistinguishable from those of the underlying base type. This is has been happening since PostgreSQL’s support for domains for the ease of client applications / and ‘simple’ drivers who need not concern themselves / differentiate between domain values and the underlying base types. This type erasure has popped up as a surprise and discussion point for the occassional user who wanted to use domains to trigger special client-side type mappings, but then find they cannot because registering the domain’s oid with their driver ends up never getting tickled. Their solution is to then just go ahead and make a real type, but leveraging their implicit base type’s input, output, operator class, etc. PostgreSQL functions.
Let’s peek into how these results are being presented by replacing the for loop over the results with one over the cursor’s description:
...
for column in cur.description:
print('Name: %s / oid: %d' % (column.name, column.type_code))
...
and we see:
$ python /tmp/test.py
Name: id / oid: 23
Name: subject / oid: 25
Name: sender / oid: 25
Name: recipients / oid: 2741345
Interesting! The scalar domain column sender
is not described with the oid we observed in the pg_types
table
(2741346), but instead is presented to the client with the oid of its base type, 25, corresponding to text
.
The equivalent conversion is not currently done for arrays of domains — they’re presented back to the
client with their unadulterated
“array of domain” oid, which, out of the box, your database client most likely won’t know how to handle. Psycopg2
defaults to presenting the value as a string when it doesn’t know otherwise, so a single string with curlybraces and
commas is handed back to the application code. I’ve brought this arguable asymmetry in how PostgreSQL treats
returning domains to client applications
up on pgsql-hackers, and it seems it will stay this way at least for PostgreSQL 11.
Fortunately, psycopg2 lets us register new type adaptors, and has a convienence function to create a type adaptor for an array type based upon the the type adaptor for the corresponding scalar type. So, with some lookups in both the system catalog as well as within the psycopg2 module, we can craft a flexible registration function which introspects into our database to see what domains we have, and to register their array type’s oid to be handled equivalently to that of an array of the domain’s base type. We’ve only created a domain over text, but the following ought to handle domains over int or any other of the core PostgreSQL types.
import psycopg2, psycopg2.extras, psycopg2.extensions
def register_domain_array_types(con):
cur = con.cursor(cursor_factory=psycopg2.extras.NamedTupleCursor)
# Learn user-created domain names, their array type's oids, their
# base type's oid
cur.execute('''
select
pgt.typname, -- Name of this domain type,
pgt.typbasetype, -- its base type's oid,
-- base type's name,
pgt_base.typname as base_typname,
pgt.typarray -- and its array type's oid
from pg_catalog.pg_type pgt
-- Dig out the base type for the domain ...
join pg_catalog.pg_type pgt_base
on (pgt.typbasetype = pgt_base.oid)
where
-- Domains only please!
pgt.typtype = 'd'
''')
for c in cur.fetchall():
# Get the psycopg type adaptor for the domain's base type's oid.
base_type_adaptor = psycopg2.extensions.string_types.get(c.typbasetype)
if not base_type_adaptor:
raise Exception('Unknown type adaptor for %s'
' for domain array type %s'
% (c.base_typname, c.typname))
# Create new psycopg-side array type(converter) ...
array_type = psycopg2.extensions.new_array_type(
# oids to trigger use of this typecoverter.
# The array's oid only please.
(c.typarray,),
# Name for this type: let's just tack '[]'
# onto the domain's name
c.typname + '[]',
# Typeconverter for the scalar type. Let's use the one for the
# domain's base type!
base_type_adaptor)
# ... and then register it. Done!
psycopg2.extensions.register_type(array_type)
cur.close()
So, after making a call to that routine, and expecting the recipients
projection to now be a list of strings:
import psycopg2, psycopg2.extras, psycopg2.extensions
def register_domain_array_types(con):
...
con = psycopg2.connect('dbname=scratch')
register_domain_array_types(con)
cur = con.cursor(cursor_factory=psycopg2.extras.NamedTupleCursor)
cur.execute('''
select
em.id,
em.subject,
sa.address as sender,
(select array_agg(ea.address)
from email_recipient e_r
join email_address ea
on (e_r.recipient_id = ea.id)
where e_r.email_id = em.id)
as recipients
from email em
join email_address sa
on em.sender_id = sa.id
''')
for row in cur:
print(row)
for recip in row.recipients:
print(' Recipient: %s' % (recip,))
print('')
cur.close()
con.close()
Now gives us something we like:
$ python array_type_example.py
Record(id=1, subject='Anyone Out there?', sender='joe@foo.com', recipients=['mary@jane.com', 'majortom@space.control.com'])
Recipient: mary@jane.com
Recipient: majortom@space.control.com
Record(id=2, subject='Yup!', sender='majortom@space.control.com', recipients=['joe@foo.com', 'mary@jane.com'])
Recipient: joe@foo.com
Recipient: mary@jane.com
$
So, research your own driver’s custom type registration capability, and make use of a post-connection setup function to scrape the system catalog for domains and their array type’s oids to make full use of domain arrays at the client side!
Next time I’ll augment and generalize this setup function to educate psycopg2 about custom composite types and show how useful they can be when writing json microservices.
References: