Finding release dates from database dump

heyoni · February 18, 2018, 8:23am

Just wondering, how would you go about SQL searching by release dates for individual releases or release groups? Specifically, I’m looking at release_country and release_unknown_country but those two just have separated year, month and date fields.

I hacked up a way to do it by creating a view that turns the year/month/day columns into an actual date value but when I do that, the queries take forever to run…is there a better way to do this?

dns_server · February 19, 2018, 7:14am

Can you paste the query / view definition you are trying to run?
It may be easier to use pastebin or a google doc with your query.

heyoni · February 20, 2018, 9:28pm

Hi! First of all, thanks for taking the time to even look into this. I just found out most of the devs are actually in IRC rather than here.

Anyways, my goal is to create a sort of super table that lists the artists name, with any release_groups and the earliest release_dates attached to whatever album/single it finds since I don’t want it to list re-releases. To achieve this, I’m joining together the following tables:

artist
artist_credit
artist_credit_name
release
release_group
release_event_dates

release_event_dates is basically just a UNION between release_country and release_unknown_country that turns the day, month and year columns into dates. I’ve included the code for the two views at the bottom of my post. This is the query I try to run just to see what’s new this week:

SELECT * FROM release_groups_single_date WHERE released >= '2018-01-16' AND released =< '2018-02-16'

I’m sure someone more experience can already see the problem in that it takes forever to execute. I can post the output from EXPLAIN if you’d like, but the EXPLAIN ANALYZE says it takes 15 seconds to execute. If I run a similar search on the website it returns results instantly.

Things I’ve tried

Adding indexes where required
Removing the COALESCE and MIN functions
Using Materialized views

The last one basically solved it for me, and I just run a cron job where it updates the view once a day. Still though, I’m very curious as to how the website achieves such good performance so I’ve been tinkering with the virtual machine but can’t seem to get the search-server docker to run for some reason.

Once I get the virtual machine search-server up and running, I intend to log all queries and dig through the website’s source to see what it’s doing and try to mimic. I’m a bit new at this, so everything takes forever and I would appreciate it if anyone could point me in the right direction.

Thanks for reading! Let me know if there’s any info you think I should add.

/edit After looking more closely at the search help page, it looks like it uses a search engine called Lucene, so I’m guessing that means it doesn’t query the database directly. There may not be an answer to my question then :x

CREATE VIEW release_event_dates AS
 SELECT q.release,
    to_date(concat(COALESCE((q.date_year)::integer, 1900), '/', COALESCE((q.date_month)::integer, 1), '/', COALESCE((q.date_day)::integer, 1)), 'YYYY/MM/DD'::text) AS date,
    q.country
   FROM ( SELECT release_country.release,
            release_country.date_year,
            release_country.date_month,
            release_country.date_day,
            release_country.country
           FROM release_country
        UNION ALL
         SELECT release_unknown_country.release,
            release_unknown_country.date_year,
            release_unknown_country.date_month,
            release_unknown_country.date_day,
            NULL::integer AS int4
           FROM release_unknown_country) q;

SELECT artist.name,
    release.name AS rname,
    release_group.id,
    min(release_event_dates.date) AS released
   FROM release_group
     JOIN release ON release_group.id = release.release_group
     JOIN release_event_dates ON release.id = release_event_dates.release
     JOIN artist_credit ON artist_credit.id = release.artist_credit
     JOIN artist_credit_name ON artist_credit.id = artist_credit_name.artist_credit
     JOIN artist ON artist.id = artist_credit_name.artist
  GROUP BY release_group.id, artist.name, release.name;

dns_server · February 21, 2018, 2:01pm

Your date calculation looks good but if you convert that to a function you may be able to use a function based index.
This allows you to create a function based index and this should hopefully be faster (needs testing)

create or replace function calc_date(date_year integer,date_month integer,date_day integer) returns date as $$
select to_date(concat(COALESCE((date_year)::integer, 1900), '/', COALESCE((date_month)::integer, 1), '/', COALESCE((date_day)::integer, 1)), 'YYYY/MM/DD'::text) $$
LANGUAGE SQL immutable;

create index release_unknown_country_over_date on release_unknown_country(calc_date(date_year,date_month,date_day));
create index release_country_over_date on release_country(calc_date(date_year,date_month,date_day));

When I look at the explain plan the difference is not that large.
As the releases table is huge most of the time is spent looking at that table.

alastairp · February 21, 2018, 8:04pm

Did you find that release_groups have a record of the first release date, which might be something that is useful for you (and should remove your need for your view)

github.com

metabrainz/musicbrainz-server/blob/a1c0c019be8135a3580b214086e57e7338da48ff/admin/sql/CreateTables.sql#L3036-L3044


      
          CREATE TABLE release_group_meta ( -- replicate
              id                  INTEGER NOT NULL, -- PK, references release_group.id CASCADE
              release_count       INTEGER NOT NULL DEFAULT 0,
              first_release_date_year   SMALLINT,
              first_release_date_month  SMALLINT,
              first_release_date_day    SMALLINT,
              rating              SMALLINT CHECK (rating >= 0 AND rating <= 100),
              rating_count        INTEGER
          );

heyoni · February 23, 2018, 8:36pm

I had no idea that such a table existed! I’m going to compare that to the min date and see if I get the same thing. Does this mean that the earliest release dates are duplicated in the database?

@dns_server

When I look at the explain plan the difference is not that large.

I’ve never used functions or function indexes so I might try it out just for fun, but are you saying that performance-wise it had no impact?

dns_server · February 24, 2018, 12:31am

@heyoni
You are joining a huge table with a small table so most of the time was reading the big table from disk.
The performance gains from calculating things one way or another way don’t make a huge difference in this case.