Today I Learned

A Hashrocket project

230 posts about #sql

Specify behavior for nulls in a unique index

Postgres 15 gave us the ability to specify how we want null values to be treated when dealing with unique indexes.

By default, nulls are considered unique values in Postgres:

create table users (name text, email text unique);
-- CREATE TABLE
insert into users values ('Joe', null), ('Jane', null);
-- INSERT 0 2

This default behavior can also be explicitly set using the nulls distinct clause:

create table users (name text, email text unique nulls distinct);
-- CREATE TABLE
insert into users values ('Joe', null), ('Jane', null);
-- INSERT 0 2

To change the default behavior and prevent nulls from being considered unique values, you can use the nulls not distinct clause:

create table users (name text, email text unique nulls not distinct);
-- CREATE TABLE
insert into users values ('Joe', null), ('Jane', null);
-- ERROR:  duplicate key value violates unique constraint "users_email_key"
-- DETAIL:  Key (email)=(null) already exists.
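
The same clause can also be attached to a standalone unique index rather than a column constraint. This is a minimal sketch that assumes Postgres 15 or later; the subscribers table and the index name are made up for illustration:

```sql
-- Hypothetical table and index names; requires Postgres 15+.
create table subscribers (name text, email text);

-- Treat nulls as equal to each other when checking uniqueness.
create unique index subscribers_email_idx
  on subscribers (email) nulls not distinct;

insert into subscribers values ('Joe', null);   -- succeeds
insert into subscribers values ('Jane', null);  -- fails: duplicate key value
```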

See this change in the Postgres 15 release notes

Return Types In PSQL Pattern Match

If you use the pattern match operators in Postgres, you'll want to mind whether the values passed to these statements can be null.

If you use a string, you will get a boolean return -

select 'a' like '%b%';
?column?
----------
 f
(1 row)

select 'a' like '%a%';
?column?
----------
 t
(1 row)

But if you select a null in one of these statements, the return is null as well -

select null like '%a%';
?column?
----------
 ø
(1 row)

Moral of the story - if you're expecting a boolean, you can coalesce the column before the pattern match -

select coalesce(null, '') like '%a%';
?column?
----------
 f
(1 row)

https://www.postgresql.org/docs/current/functions-matching.html

Comparing Nullable values in PostgreSQL

PostgreSQL treats NULL values a bit differently than most of the languages we work with. That said, if you try this:

SELECT * FROM users WHERE has_confirmed_email <> TRUE

This returns only users with has_confirmed_email = FALSE. Rows where the column is NULL are ignored, because NULL <> TRUE evaluates to NULL rather than TRUE. If you want users with has_confirmed_email as FALSE or NULL, you can use IS DISTINCT FROM:

SELECT * FROM users WHERE has_confirmed_email IS DISTINCT FROM TRUE
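
To see the difference side by side, here is a small self-contained sketch that compares both operators against every possible boolean state. The column aliases are only illustrative:

```sql
select
  v                     as value,
  v <> true             as not_equal,
  v is distinct from true as is_distinct
from (values (true), (false), (null::boolean)) as t(v);
-- value | not_equal | is_distinct
-- true  | f         | f
-- false | t         | t
-- null  | null      | t
```

Only is distinct from yields a real boolean for the null row, which is why it keeps that row in a where clause.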

How to search case sensitively in Postgres

There's nothing special about this, I'm just a dummy who only ever used ilike and never thought twice about it.

Today, I learned that ilike is just a case-insensitive version of like.

This would return any rows with a name of Peter

select * from users
where name ilike '%peter%'
;

This would not return any rows with a name of Peter

select * from users
where name like '%peter%'
;

Stay tuned for tomorrow's TIL where I tell you about how I learned the sky is blue! 😂

Geocode an Address in PostgreSQL with PostGIS

I recently learned that you can use PostGIS alongside the Tiger Geocoder extension to geocode an address in Postgres. This is especially handy if you have a specific locale (US or state level) that you need to geocode. In my case, I need lat/long coordinates for addresses in Florida and Illinois.

Another reason I like this is because it is free - no need to pay for an additional service.

Here's what the API looks like:

select
  result.rating,
  ST_X(result.geomout) as longitude,
  ST_Y(result.geomout) as latitude
from geocode('320 1st St N, Jacksonville Beach FL', 1) as result;

 rating |     longitude      |      latitude
--------+--------------------+--------------------
      1 | -81.39021163774713 | 30.291481272126084
(1 row)

https://postgis.net/docs/manual-3.3/postgis_installation.html#loading_extras_tiger_geocoder

h/t Mary Lee

On-disk size of a table/view in Postgres

In Postgres, if you want to see how much disk space your relation (including data and any indexes) is taking, you can use pg_total_relation_size('<relation_name>'):

SELECT pg_total_relation_size('<relation_name>') as size;

This can be used in conjunction with pg_size_pretty() to give a more readable output

SELECT pg_size_pretty(pg_total_relation_size('<relation_name>')) as size;
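
If you want to know where that total comes from, Postgres also ships functions for the individual pieces. A sketch, using 'users' as a placeholder relation name:

```sql
-- 'users' is a placeholder; substitute your own relation name.
select
  pg_size_pretty(pg_relation_size('users'))       as heap_only,        -- main data fork only
  pg_size_pretty(pg_table_size('users'))          as table_with_toast, -- data plus TOAST, no indexes
  pg_size_pretty(pg_indexes_size('users'))        as indexes,          -- all indexes on the table
  pg_size_pretty(pg_total_relation_size('users')) as total;            -- table plus indexes
```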

Use `is distinct from` to match `null` records

Let's say you want to find all purchases that don't match a specific coupon. You can use != to filter them:

select * from purchases where coupon != 'JULY4';

The problem with that is that it doesn't match records that have null values. One way to solve that is by adding an or:

select * from purchases where coupon != 'JULY4' or coupon is null;

Better than that is to use is distinct from:

select * from purchases where coupon is distinct from 'JULY4';

Remove Padding from Postgres Formatting Functions

Earlier today, I was trying to join a table on a formatted string row, but wasn't getting the results I expected. Turns out that my formatting string had blank padding and I discovered "fill mode".

When using postgres formatting functions, like to_char, some of the formatting options include padding in the result. For example, the day format string will be blank padded to 9 chars.

select to_char(current_date, 'day');

  to_char
-----------
 sunday

You can use the "fill mode" (FM) option to remove any leading zeros or blank padding characters by prepending it to your format option:

select to_char(current_date, 'FMday')

 to_char
---------
 sunday
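
Because the padding is invisible in psql output, comparing lengths makes it easier to see. A small sketch using a fixed date (January 1, 2023 was a Sunday) instead of current_date so the result is repeatable:

```sql
select
  length(to_char(date '2023-01-01', 'day'))   as padded,   -- 9
  length(to_char(date '2023-01-01', 'FMday')) as trimmed;  -- 6
```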

https://www.postgresql.org/docs/current/functions-formatting.html

Understanding Query I/O in Postgres with BUFFERS

The EXPLAIN command in Postgres can help you understand the query plan for a given query. Furthermore, you can use EXPLAIN ANALYZE to see the estimated query plan and cost vs the actual time and rows.

To take it a step further, you can use EXPLAIN (ANALYZE, BUFFERS) to include buffer usage counts (such as shared blocks hit and read) that show how much I/O each part of your query performed.

explain (analyze, buffers)
  select
    *
  from floor_plans
  order by created_at desc
;

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Sort  (cost=2.56..2.60 rows=18 width=159) (actual time=0.062..0.065 rows=16 loops=1)
   Sort Key: created_at DESC
   Sort Method: quicksort  Memory: 28kB
   Buffers: shared hit=2
   ->  Seq Scan on floor_plans  (cost=0.00..2.18 rows=18 width=159) (actual time=0.018..0.032 rows=16 loops=1)
         Buffers: shared hit=2
 Planning Time: 0.111 ms
 Execution Time: 0.106 ms
(8 rows)

EXPLAIN ANALYZE actually executes the query, so if you run this with a query that writes, make sure you wrap it in a BEGIN...ROLLBACK block.
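
A minimal sketch of that pattern, reusing the floor_plans table from above. The update itself is made up purely for illustration:

```sql
begin;

-- The update really runs here, so the plan shows actual timings and buffers...
explain (analyze, buffers)
  update floor_plans
  set created_at = now()
  where created_at is null;

-- ...but rolling back discards the changes it made.
rollback;
```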

https://www.postgresql.org/docs/current/using-explain.html

Add primary key to table

You can add a primary key to a table with an alter table statement:

Table with no primary key:

create table no_pks (id int generated by default as identity not null);
insert into no_pks select from generate_series(0,999);
[local] dillon@dillon=# \d no_pks
                           Table "public.no_pks"
 Column |  Type   | Collation | Nullable |             Default
--------+---------+-----------+----------+----------------------------------
 id     | integer |           | not null | generated by default as identity

You can add it later:

alter table no_pks add primary key (id);
[local] dillon@dillon=# \d no_pks
                           Table "public.no_pks"
 Column |  Type   | Collation | Nullable |             Default
--------+---------+-----------+----------+----------------------------------
 id     | integer |           | not null | generated by default as identity
Indexes:
    "no_pks_pkey" PRIMARY KEY, btree (id)

Print unknown exceptions in PL/pgSQL

When trying to figure out why a function raised an exception, you can print the error code and look it up in Table A.1 of Appendix A (PostgreSQL Error Codes).

One method is to capture others and then raise the magic sqlstate variable (only available in exception handlers)

create or replace function do_it(name text)
  returns void
as $$
begin
  select 42 from nothing;
exception
  when others then
    raise '%: %', sqlstate, sqlerrm;
end;
$$
  security definer
  language plpgsql
;

Then you can view the error:

select do_it('hi');
ERROR:  42P01: relation "nothing" does not exist
CONTEXT:  PL/pgSQL function do_it(text) line 6 at RAISE

Ignore ~/.psqlrc when using psql

You can ignore your ~/.psqlrc when running psql commands by using the -X or --no-psqlrc flags.

So when you have all this in your rc file:

/* ~/.psqlrc */
\x auto
\timing
\set PROMPT1 '%[%033[1m%]%M %n@%/%R%[%033[0m%]%# '
\set PROMPT2 '[more] %R > '
\pset null '☒'
\setenv PSQL_PAGER pspg
\setenv PAGER pspg

This command becomes quite noisy:

psql -c 'select 1'
Expanded display is used automatically.
Timing is on.
Null display is "☒".
Time: 0.210 ms
 ?column?
----------
        1
(1 row)

Time: 0.297 ms

If you run without the config file:

psql -X -c 'select 1'
 ?column?
----------
        1
(1 row)

on-line manual pages:

-X,
--no-psqlrc
    Do not read the start-up file (neither the system-wide psqlrc file nor the
    user's ~/.psqlrc file).

Only fk constraints may be altered in PostgreSQL

Only foreign key constraints may be altered in PostgreSQL:

create extension citext;
create table users (id int generated by default as identity primary key);
create table user_emails (
  user_id int not null references users,
  email citext primary key
);

Now we can see the constraint:

[local] dillon@test=# \d user_emails
             Table "public.user_emails"
 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 user_id | integer |           | not null |
 email   | citext  |           | not null |
Indexes:
    "user_emails_pkey" PRIMARY KEY, btree (email)
Foreign-key constraints:
    "user_emails_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)

But now we can change the foreign key to be deferrable:

alter table user_emails
  alter constraint user_emails_user_id_fkey deferrable initially immediate;

After:

[local] dillon@test=# \d user_emails
             Table "public.user_emails"
 Column  |  Type   | Collation | Nullable | Default
---------+---------+-----------+----------+---------
 user_id | integer |           | not null |
 email   | citext  |           | not null |
Indexes:
    "user_emails_pkey" PRIMARY KEY, btree (email)
Foreign-key constraints:
    "user_emails_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id) DEFERRABLE

Check if string starts with character

In ruby I would use something like

"PostgreSQL".downcase.start_with?("p")
=> true

and the equivalent in a query would be:

select lower(left('PostgreSQL', 1)) = 'p';

 ?column?
----------
 t
(1 row)

and if you have the citext extension enabled you could do:

select left('PostgreSQL', 1)::citext = 'p';

 ?column?
----------
 t
(1 row)

The same expression works in a where clause:

select *
from users
where left(display_name, 1)::citext = 'a'
;

Select first element from array_agg

It's as simple as:

select zip_code, (array_agg(company_name))[1]
from locations
group by zip_code
;

 zip_code |        array_agg
----------+--------------------------
 90210    | In-N-out
 46368    | Johnny's Round the Clock
(2 rows)

source:

create table locations (
  id bigint generated by default as identity primary key,
  zip_code text not null,
  company_name text not null
);

insert into locations (zip_code, company_name) values 
  ('46368', 'Johnny''s Round the Clock'),
  ('90210', 'In-N-out'),
  ('46368', 'Albanese Candy');
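
Note that array_agg makes no promise about element order unless you ask for one, so which company comes back "first" is up to the executor. If that matters, an order by inside the aggregate pins it down. A sketch against the locations table above, with an illustrative alias:

```sql
select
  zip_code,
  (array_agg(company_name order by company_name))[1] as first_company
from locations
group by zip_code
order by zip_code;
-- 46368 | Albanese Candy
-- 90210 | In-N-out
```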

Export CSV of query without COPY

Clients task me to export data to a CSV from time to time. My favorite workflow for authoring queries is a split tmux session, vim (with the excellent coc-sql) on one side, and a psql session on the other. I modify my query and then re-execute via \i.

This is great, until I'm ready to export the CSV. Historically I've edited the query, cumbersomely breaking my REPL flow by turning it into a COPY command. Now I do this:

\pset format csv
\o output-file.csv
\i my-query.sql

To reset my psql session output...

\pset format aligned
\o

And thus my-query.sql remains a working query.

PostgreSQL Indexes on Partition tables

When we CREATE INDEX in a partitioned table, PostgreSQL automatically "recursively" creates the same index on all its partitions.

As this operation could take a while, we can pass ONLY when creating the index on the main table so that it is not created on every partition, and then create the same index on each partition individually.

CREATE INDEX index_users_on_country ON ONLY users USING btree (country);

CREATE INDEX users_shard_0_country_idx ON users_shard_0 USING btree (country);
CREATE INDEX users_shard_1_country_idx ON users_shard_1 USING btree (country);
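
One follow-up step is worth sketching: an index created with ON ONLY is marked invalid on the parent until every partition has an index attached to it. Assuming the index names used above, attaching them would look like this:

```sql
-- Attach each partition's index to the parent index.
alter index index_users_on_country
  attach partition users_shard_0_country_idx;
alter index index_users_on_country
  attach partition users_shard_1_country_idx;
-- Once all partitions are attached, the parent index becomes valid.
```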

Explicitly Use Index For EXPLAIN in PostgreSQL

Sometimes the EXPLAIN of a query shows a sequential scan where you expected an index to be used. This is expected: the PostgreSQL planner picks whichever method it estimates to be cheapest.

                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Seq Scan on widgets  (cost=0.00..70.92 rows=1038 width=402)
   Filter: ((deleted_at IS NULL) AND ((widget_type)::text = 'foo'::text))

But what if you WANT to see how the index will be used? Well you can force sequential scanning to be turned off.

SET enable_seqscan TO off;

Now if we view the query plan again we can see the index in use.

                                    QUERY PLAN
-----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on widgets  (cost=32.87..100.76 rows=1038 width=402)
   Recheck Cond: ((widget_type)::text = 'foo'::text)
   Filter: (deleted_at IS NULL)
   ->  Bitmap Index Scan on index_widgets_on_widget_type  (cost=0.00..32.61 rows=1111 width=0)
         Index Cond: ((widget_type)::text = 'foo'::text)

Hooray!

Now just don't forget to re-enable the previous functionality

SET enable_seqscan TO on;
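
If you are worried about forgetting that last step, one option is set local, which only lasts until the end of the current transaction. A sketch, assuming the widgets query from the plans above:

```sql
begin;

-- Applies only inside this transaction.
set local enable_seqscan to off;

explain
  select *
  from widgets
  where widget_type = 'foo'
    and deleted_at is null;

-- The setting reverts automatically on rollback (or commit).
rollback;
```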

Postgres regex matching with squiggles

Postgres supports POSIX regex pattern matching using the squiggle (~) operator

-- Check for a match
select 'wibble' ~ 'ubb';
-- returns false
select 'wibble' ~ 'ibb';
-- returns true

-- Case insensitive match checking
select 'wibble' ~ 'IBB';
-- returns false
select 'wibble' ~* 'IBB';
-- returns true

-- Check for no match
select 'wibble' !~ 'ibb';
-- returns false
select 'wibble' !~ 'ubb';
-- returns true

Full postgres pattern matching documentation can be found here

Find the position of a substring in postgres

Postgres has a strpos function for finding the position of a substring in a given string:

select strpos('wibble', 'ibb');

The function returns an integer representing the location of the substring in the provided string, or a 0 if the substring cannot be found in the provided string (the locations are 1-based indexes, so you don't have to worry about collisions!).
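
For example, a quick sketch of both cases (the column aliases are just labels):

```sql
select
  strpos('wibble', 'ibb') as found,      -- 2
  strpos('wibble', 'xyz') as not_found;  -- 0
```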

Split text in postgres

You can split text in postgres

select email from users;
bob@example.com
mary@example.com
select split_part(email, '@', 1) from users;
bob
mary
select split_part(email, '@', 2) from users;
example.com
example.com

Which can be helpful when sanitizing data:

update users
  set email = split_part(email, '@', 1) || '@example.com';

How to show constraints in MySQL

Working in Postgres, I've gotten used to seeing the constraints on a table listed with \d. Working in MySQL I wasn't seeing anything similar with describe <table>;. Turns out you can see the foreign key constraints by selecting from information_schema.referential_constraints:

SELECT * 
  FROM information_schema.referential_constraints 
  WHERE constraint_schema='db_name' 
  AND table_name='table_of_interest';
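
Since referential_constraints only covers foreign keys, a broader listing (primary keys, unique constraints, and so on) can be sketched with information_schema.table_constraints. The schema and table names are the same placeholders as above:

```sql
SELECT constraint_name, constraint_type
  FROM information_schema.table_constraints
  WHERE table_schema='db_name'
  AND table_name='table_of_interest';
```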

prevent execution when creating materialized views

When working with foreign data wrappers, one uses a materialized view to store the downloaded foreign table data. The process of downloading could be very expensive and managed by another process or program, but your program needs to define the parameters of the materialized view/foreign table. Maybe the data is loaded out of band by cron:

@daily /usr/bin/psql -d cool_db -c 'refresh materialized view big_view;'

So to create the materialized view without loading the data we use the WITH NO DATA clause:

create foreign table measurement_y2016m07
    partition of measurement for values from ('2016-07-01') to ('2016-08-01')
    server server_07;

create materialized view big_view
  as select *
  from measurement_y2016m07
  with no data;

This way we are able to execute the create foreign table and create materialized view statements in a very short amount of time. A different process can start the download with refresh materialized view

Postgres Identity Column

The Postgres wiki recommends against using the serial types and suggests identity columns (added in Postgres 10) as their replacement.

Old way:

create table todos (
  id bigserial primary key,
  todo text not null
);

The new way with identity columns:

create table todos (
  id bigint generated by default as identity primary key,
  todo text not null
);

Data:

insert into todos (todo) values
  ('write a til'),
  ('get some coffee');

select *
from todos;
 id |      todo
----+-----------------
  1 | write a til
  2 | get some coffee
(2 rows)

Source: PG wiki: Don't use serial

Supercharge Your Script with psql -c 🥞

Want to execute a PostgreSQL command from the command line? You can! The --command or -c flag takes a string argument that will be executed on your database of choice.

I've been using it as part of a script that creates a remote database backup, downloads the backup, drops and creates a local database, dumps the database backup into the local database, and then runs a select statement on the dataset. That final command looks like this (query has been simplified):

$ psql -d tilex_prod_backup -c "select count(*) from posts";

 count
-------
  2311
(1 row)

Heroku Psql Oneline 🐘

I cooked up this Heroku/subshell command today, and I like it:

$ psql $(heroku config:get DATABASE_URL)

This will connect me to the primary PostgreSQL database for my Heroku application in the psql client. Now I'm ready to query!

NOTE: why not $ heroku pg:psql? I'm not sure. I think Heroku was reporting status issues today, and I wanted to bypass any infrastructure I could.

Postgres `null` and `where <VALUE> not in`

Always watch out for null in Postgres. When null sneaks into a result set it may confuse the results of your query.

Without nulls a where in query could look like this:

psql> select 'found it' as c0 where 1 in (1);
    c0
----------
 found it
(1 row)

For the where in clause a null does not change the results.

psql> select 'found it' as c0 where 1 in (null, 1);
    c0
----------
 found it
(1 row)

The where not in formulation however is sensitive to null. Without a null it looks like this:

psql> select 'found it' as c0 where 17 not in (1);
    c0
----------
 found it
(1 row)

Add in the null and the results can be counterintuitive:

psql> select 'found it' as c0 where 17 not in (1, null);
 c0
----
(0 rows)
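
The reason is that not in expands to a chain of <> comparisons joined with and, and any comparison with null yields null. A small sketch showing the two forms side by side (aliases are illustrative):

```sql
select
  17 not in (1, null)      as not_in_result,  -- null
  (17 <> 1 and 17 <> null) as expanded;       -- null (true and null)
```

Since the where clause only keeps rows where the condition is true, a null result filters the row out.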

Watch out for those nulls!!

Equality comparison and null in postgres

null is weird in postgres. Sure, it's a way of saying that there is no data. But if there is a null value Postgres doesn't want to be responsible for filtering the null value unless you explicitly tell it to.

psql> select 1 where null;
 ?column?
----------
(0 rows)

Comparing null to null with = returns null, not true.

psql> select 1 where null = null;
 ?column?
----------
(0 rows)

And comparing a value to null returns neither true nor false, but null.

psql> select 1 where 17 != null or 17 = null;
 ?column?
----------
(0 rows)

So when we apply a comparison to a nullable column over many rows, we have to be cognisant that null rows will not be included.

psql> select x.y from (values (null), (1), (2)) x(y) where x.y != 1;
 y
---
 2
(1 row)

To include the rows which have null values we have to explicitly ask for them with is null.

psql> select x.y from (values (null), (1), (2)) x(y) where x.y != 1 or x.y is null;
 y
---
 ø
 2
(2 rows)

What a cursor is in postgres

SQL has a structure called a CURSOR that, according to the docs:

Rather than executing a whole query at once, read the query result a few rows at a time

This is mainly for solving memory usage issues, but probably not very applicable to web applications. Here's an example syntax to periodically fetch a limited number of rows:

begin;
declare posts_cursor cursor for select * from posts;
fetch 10 from posts_cursor;
fetch 10 from posts_cursor;
commit;

This is only a simple example, and it is not advised in practice because it leaves a transaction open. More powerful use cases would be iterating over a query in a function to update a small number of records at a time.
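
As a rough sketch of the function-style use case: a PL/pgSQL for loop over a query uses a cursor internally, so rows are processed as they are fetched instead of being loaded all at once. The id column on posts is an assumption made for illustration:

```sql
do $$
declare
  post record;
begin
  -- The loop fetches rows through an implicit cursor.
  for post in select id from posts loop
    -- Replace this with the per-row work you need (id is an assumed column).
    raise notice 'processing post %', post.id;
  end loop;
end;
$$;
```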

`random()` in subquery is only executed once

I discovered this morning that random() when used in a subquery doesn't really do what you think it does.

Random generally looks like this:

> select random() from generate_series(1, 3)
      random
-------------------
 0.856217631604522
 0.427044434007257
 0.237484132871032
(3 rows)

But when you use random() in a subquery the function is only evaluated one time.

> select (select random()), random() from generate_series(1, 3);
      random       |      random
-------------------+-------------------
 0.611774671822786 | 0.212534857913852
 0.611774671822786 | 0.834582580719143
 0.611774671822786 | 0.415058249142021
(3 rows)

So be careful with a query like this:

insert into things (widget_id) 
select 
  (select id from widgets order by random() limit 1)
from generate_series(1, 1000);

It results in 1000 entries in things, all with the same widget_id.
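
One workaround I'm aware of is to make the subquery depend on the outer row, so that Postgres has to re-run it for each row instead of evaluating it once. This is a sketch under that assumption; the series alias and the always-true predicate are only there to create the dependency:

```sql
insert into things (widget_id)
select
  (
    select id
    from widgets
    where series.n is not null  -- always true; references the outer row
    order by random()
    limit 1
  )
from generate_series(1, 1000) as series(n);
```

It is worth checking the result with a quick count(distinct widget_id) before relying on it.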

Where in with multiple values in postgres

Postgres has a record type that you can use with a comma-separated list of values inside parentheses like this:

> SELECT pg_typeof((1, 2));

 pg_typeof
-----------
 record
(1 row)

What is also interesting is that you can compare records:

> select (1, 2) = (1, 2);

 ?column?
----------
 t
(1 row)

And additionally, a select statement results in a record:

> select (1, 2) = (select 1, 2);

 ?column?
----------
 t
(1 row)

What this allows you to do is to create a where statement where the expression can check to see that 2 or more values are contained in the results of a subquery:

> select true where (1, 2) in (
  select x, y
  from
    generate_series(1, 2) x,
    generate_series(1, 2) y
);

 bool
------
 t
(1 row)

This is useful when you declare composite keys for your tables.
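
For instance, here is a self-contained sketch of filtering on a two-column key against a literal list of pairs. The enrollments name and its columns are made up for illustration:

```sql
select *
from (values (1, 10), (1, 20), (2, 10)) as enrollments(student_id, course_id)
where (student_id, course_id) in ((1, 10), (2, 10));
-- returns the rows (1, 10) and (2, 10)
```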

Change PostgreSQL psql prompt colors

Today I learned how to change psql prompt to add some color and manipulate which info to show:

$ psql postgres
postgres=# \set PROMPT1 '%[%033[1;32m%]@%/ => %[%033[0m%]%'
@postgres => \l
                                           List of databases
            Name            |   Owner    | Encoding |   Collate   |    Ctype    |   Access privileges
----------------------------+------------+----------+-------------+-------------+-----------------------
 postgres                   | postgres   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0                  | postgres   | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                            |            |          |             |             | postgres=CTc/postgres
 template1                  | postgres   | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
                            |            |          |             |             | postgres=CTc/postgres
(3 rows)

Check this documentation if you want to know more.

Are All Values True in Postgres

If you have values like this:

chriserin=# select * from (values (true), (false), (true)) x(x);
 x
---
 t
 f
 t

You might want to see if all of them are true. You can do that with bool_and:

chriserin=# select bool_and(x.x) from (values (true), (false), (true)) x(x);
bool_and 
---
 f

And when they are all true:

chriserin=# select bool_and(x.x) from (values (true), (true), (true)) x(x);
bool_and 
---
 t
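
The counterpart for "is any value true" is bool_or. A quick sketch running both over the same mixed values (aliases are illustrative):

```sql
select
  bool_and(x) as all_true,  -- f
  bool_or(x)  as any_true   -- t
from (values (true), (false), (true)) as t(x);
```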

Aggregate Arrays In Postgres

array_agg is a great aggregate function in Postgres but it gets weird when aggregating other arrays.

First let's look at what array_agg does on rows with integer columns:

select array_agg(x) from (values (1), (2), (3), (4)) x (x);
-- {1,2,3,4}

It puts each value into an array. What if our values are arrays?

select array_agg(x)
from (values (Array[1, 2]), (Array[3, 4])) x (x);
-- {{1,2},{3,4}}

But this doesn't work when the arrays have different numbers of elements:

select array_agg(x)
from (values (Array[1, 2]), (Array[3, 4, 5])) x (x);
-- ERROR:  cannot accumulate arrays of different dimensionality

If you are trying to accumulate elements to process in your code, you can use jsonb_agg.

select jsonb_agg(x.x)
from (values (Array[1, 2]), (Array[3, 4, 5])) x (x);
-- [[1, 2], [3, 4, 5]]

The advantage of using Postgres arrays however is being able to unnest those arrays downstream:

select unnest(array_agg(x))
from (values (Array[1, 2]), (Array[3, 4])) x (x);
--      1
--      2
--      3
--      4

ALL CAPS SQL

A while ago I read The Mac Is Not a Typewriter by Robin Williams. In it, the author claims:

Many studies have shown that all caps are much harder to read. We recognize words not only by their letter groups, but also by their shapes, sometimes called the "coastline." --pg. 31, The Mac Is Not a Typewriter

I've found this to be true. When we teach SQL, students are often surprised that we don't capitalize PostgreSQL keywords, preferring this:

select * from posts limit 5;

To this, which you might see in an SQL textbook:

SELECT * FROM posts LIMIT 5;

My arguments against the latter syntax: it's practically redundant in PostgreSQL, it's harder to type, and it's unnecessary because any good text editor highlights the keywords. Now I have another: such writing has been, in typesetting, shown to be harder to read. WHERE and LIMIT look similar from a distance in all-caps, but they mean and do different things.

It's a style opinion each developer gets to refine for themselves. To quote Williams: "Be able to justify the choice."