Some time ago I was involved in a project that aimed to systematize a pandemonium of databases. The toponym databases were especially hard. The source data was a deep XML tree with regions and cities at the top and streets and houses at the bottom. Flattened, it looked roughly like this:
| id | parent_id | name       |
|----|-----------|------------|
| 1  | 2         | Kremlin    |
| 2  |           | Moscow     |
| 3  | 2         | Red Square |

…and so on, for 1M+ unsorted records.
I had to load this tree into PostgreSQL so it could be queried. A typical query would fetch all the streets of the cities within a specific region, or the whole path from a particular street at the bottom up to the root node, and so on.
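To see why plain storage is painful here, consider a naive path-to-root query over a simple adjacency list. This is a minimal sketch: the `toponyms` table and its columns are my own illustration of the data above, not the project's actual schema.

```sql
-- Hypothetical schema mirroring the flattened tree above
CREATE TABLE toponyms (
    id        integer PRIMARY KEY,
    parent_id integer,  -- NULL for root nodes such as Moscow
    name      text NOT NULL
);

-- Walk from one node (id = 1, "Kremlin") up to the root
WITH RECURSIVE path AS (
    SELECT id, parent_id, name
    FROM toponyms
    WHERE id = 1
    UNION ALL
    SELECT t.id, t.parent_id, t.name
    FROM toponyms t
    JOIN path p ON t.id = p.parent_id
)
SELECT name FROM path;
```

Every level of the hierarchy costs another pass of the recursive join, which is exactly why a dedicated hierarchical pattern is worth the trouble on a tree this deep.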
Since there would be a lot of such queries, they had to run as fast as possible. Therefore, I needed a hierarchical storage pattern to speed them up.
So I did the following:
- Imported the table;
- Chose a suitable hierarchical pattern;
- Calculated the extra reference data used by the chosen pattern.
Speed mattered at each of the three stages. I wanted everything done in next to no time, because few things are worse than an unresponsive server, and downtime was unacceptable.
Importing the initial data
In PostgreSQL, a table is best populated with the COPY FROM statement. It works significantly faster than multiple INSERT statements. Make sure to prepare the input data for COPY FROM: it accepts either CSV or a binary file/string as input, and binary is faster.
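Assuming the `toponyms` table sketched above, a CSV import might look like the following. The file path is a placeholder; the binary format is faster still, but its layout is more involved, so CSV is shown here for brevity.

```sql
-- Bulk-load the flattened tree from a CSV dump (hypothetical path).
-- With FORMAT csv, an empty unquoted field becomes NULL,
-- which handles root rows like Moscow that have no parent_id.
COPY toponyms (id, parent_id, name)
FROM '/tmp/toponyms.csv'
WITH (FORMAT csv, HEADER true);
```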