Description
Once T299417: Normalize templatelinks table is done.

Event Timeline
Change 966213 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud
Change 966213 merged by Ladsgroup:
[operations/puppet@production] wikireplicas: Allow pagelinks.pl_target_id to be replicated to the cloud
Hey @Ladsgroup, we're using that table in the image suggestions production data pipeline; see this query.
While I'll try to monitor this ticket and mailing list announcements, it would be great if you could ping me before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!
Before dropping columns, please notify WP:VPT at enwp and similar pages on other projects. Dropping these columns will break a lot of regular and ad hoc queries using Quarry, {{database report}}, etc. Identifying and modifying them will be a major exercise.
I will, but noting that this has already been announced multiple times and in multiple venues: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/KWWEWNFVXDXOHZBJO32NQLBKC5LUTEAB/
We deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/966213 today with a maintain-views run on all wikireplica servers.
Change 977218 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki
Change 977218 merged by jenkins-bot:
[operations/mediawiki-config@master] beta: Stop writing to the old columns of pagelinks in fawiki
I changed the PK of pagelinks on fawiki in the beta cluster:
ALTER TABLE pagelinks DROP PRIMARY KEY, ADD PRIMARY KEY (`pl_from`, `pl_target_id`);
Then I stopped writing to the old columns, and dropped them along with their indexes:
ALTER TABLE pagelinks DROP COLUMN pl_namespace, DROP COLUMN pl_title;
DROP INDEX pl_namespace ON pagelinks;
DROP INDEX pl_backlinks_namespace ON pagelinks;
The size of the table went from 9.1M to 448K, but the catch is that the majority of the size reduction came from the implicit optimize run while changing the PK, which alone brought it down to 996K. So the net gain from dropping the columns is cutting the table to around 45% of its size, which is pretty good regardless.
Running it on simplewiki gave 38% of the original size. On top of that, you get a cleanup from running OPTIMIZE TABLE, which by itself can cut the table to between half and one fifth of its size, depending on the wiki.
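For context, the sizes quoted above can be checked with a query like the following (a minimal sketch; information_schema figures are approximate and only refresh after an analyze/optimize):

SELECT table_name,
       round((data_length + index_length) / 1024 / 1024, 1) AS size_mib
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'pagelinks';
-- data_length covers the clustered index (the rows), index_length the secondary
-- indexes, so this reflects both the column drops and the index drops.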
I fully understand the reasoning behind this normalization, but (unexpectedly) it will cause some queries to run substantially slower. For instance:
SELECT page_title, pl_title,
       CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction
FROM page
JOIN pagelinks ON page_id = pl_from
LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت'
WHERE page_namespace = 0
  AND pl_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to FROM categorylinks c1
    WHERE page_id = c1.cl_from AND c1.cl_to = 'مقالههای_نامزد_حذف_سریع'
  )
versus"
SELECT page_title, lt_title,
       CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction
FROM page
JOIN pagelinks ON page_id = pl_from
JOIN linktarget ON pl_target_id = lt_id
LEFT JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت'
WHERE page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND NOT EXISTS (
    SELECT c1.cl_to FROM categorylinks c1
    WHERE page_id = c1.cl_from AND c1.cl_to = 'مقالههای_نامزد_حذف_سریع'
  )
It would be nice if we had analytics on the queries run against the pagelinks table (both in production and on the replicas) and what the average or median query duration looks like over time.
Looking at the EXPLAIN results for those two queries, the first one runs more quickly not only because of denormalization, but also because the WHERE condition on pl_namespace uses the pl_namespace index on the pagelinks table. However, it seems the second one does *not* use the lt_namespace_title index on the linktarget table (where I expected it to be used as a partial index). Any ideas on how to improve my query or the indexes?
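For anyone reproducing this, a minimal way to compare index choices is a sketch like the one below, using the tables from the queries above with the filters trimmed to the parts that drive index selection:

EXPLAIN
SELECT page_title, lt_title
FROM page
JOIN pagelinks ON page_id = pl_from
JOIN linktarget ON pl_target_id = lt_id
WHERE page_namespace = 0
  AND lt_namespace IN (2, 3);
-- The "key" column shows which index each table actually uses;
-- "rows" gives the optimizer's estimate of how many rows it will scan.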
Maybe this is what you're looking for:
SELECT page_title, lt_title,
       CASE WHEN c2.cl_to IS NULL THEN NULL ELSE '{{yes}}' END AS under_construction
FROM page
JOIN pagelinks ON page_id = pl_from
JOIN linktarget ON pl_target_id = lt_id
LEFT JOIN categorylinks c2 ON page_id = c2.cl_from
WHERE c2.cl_to = 'صفحههای_گسترده_در_دست_ساخت'
  AND page_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from FROM categorylinks c1
    WHERE c1.cl_to = 'مقالههای_نامزد_حذف_سریع'
  )
(If I understood your query correctly, you want articles that are in the "under construction" category but not in the "candidates for speedy deletion" one?)
No. The query finds all articles linking to a User or User_talk page, and marks which of them have an "under construction" category. So the last part is optional, not mandatory. Your suggested query makes it mandatory: moving c2.cl_to into the WHERE clause filters out the rows where it is NULL, effectively turning the LEFT JOIN into an inner join.
But thanks for taking a look.
Ah, I see. Unfortunately I don't think there is a way to make the query instantly fast: the combination with the highest cardinality (pl_from_namespace plus the target's namespace) now literally sits in two different tables, and neither field has enough cardinality on its own to speed up the lookup in its respective table. I think it'll get a bit worse once the data is properly populated in linktarget (and more links to the User and User_talk namespaces start being added to linktarget). OTOH, dropping the old columns makes all queries faster by increasing the share of lookups served from memory, and this took 48s for me:
SELECT page_title, lt_title,
       CASE WHEN page_id IN (SELECT cl_from FROM categorylinks WHERE cl_to = 'صفحههای_گسترده_در_دست_ساخت')
            THEN '{{yes}}' ELSE NULL END AS under_construction
FROM page
JOIN pagelinks ON page_id = pl_from
JOIN linktarget ON pl_target_id = lt_id
WHERE page_namespace = 0
  AND pl_from_namespace = 0
  AND lt_namespace IN (2, 3)
  AND page_id NOT IN (
    SELECT c1.cl_from FROM categorylinks c1
    WHERE c1.cl_to = 'مقالههای_نامزد_حذف_سریع'
  )
(adding pl_from_namespace = 0 helps a bit)
@Huji Hi, the data has been fully populated for fawiki. If you want to check and switch your tools and queries, it should be fine to do so now.
Change #1024450 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@master] Run pagelinks migration for third parties too
Change #1024450 merged by jenkins-bot:
[mediawiki/core@master] Run pagelinks migration for third parties too
Does "everywhere" include Manual:Pagelinks table, which has yet to be updated this year and doesn't indicate that any columns are deprecated? Nor document the new way?
I migrated all my code. In most cases, queries are running just a tad slower. In a few cases, they seem to run a tad faster. In all cases, my tests showed the results were identical.
Change #1028778 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis
Change #1028778 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop writing to old columns of pagelinks in most wikis
Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:32:32Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]]
Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:38:11Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-05-07T16:05:02Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1028778|Stop writing to old columns of pagelinks in most wikis (T352010 T299947)]] (duration: 32m 29s)
Something along the lines of:
Two columns of the pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use pl_target_id instead (a foreign key to the linktarget table). See T222224 for the reasoning.
@Ladsgroup Added to https://meta.wikimedia.org/wiki/Tech/News/2024/20 -- Thanks for the draft wording, always appreciated!
I think I've gotten the <code> highlighting and non-translation snippets correct, but please glance-check.
It will be frozen for translations in ~20 hours, if any tweaks (or link additions/changes) are needed before then.
Can't use pagelinks without linktarget; can't use linktarget without pagelinks
Chicken and egg problem
The join columns were wrong; here is the correct query: https://quarry.wmcloud.org/query/82698
Thanks. So Quiddity, the advice "Users must use pl_target_id instead (a foreign key to the linktarget table)" wasn't sufficient to explain clearly what I needed to do. To better explain it for rusty SQL users like me, I'd suggest:
Two columns of the pagelinks table (pl_namespace, pl_title) are being dropped soon. Users must use two columns of the new linktarget table instead (lt_namespace, lt_title).
In your existing SQL queries:
- Replace JOIN pagelinks with JOIN linktarget, and pl_ with lt_ in the ON clause
- Below that, add JOIN pagelinks ON lt_id = pl_target_id (see the sketch below)
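For concreteness, a minimal before/after sketch of that recipe, using a hypothetical backlinks query (real queries will differ, but the join rewrite is the same):

-- Before: pagelinks still carries the target's namespace and title
SELECT p.page_title, pl.pl_from
FROM page p
JOIN pagelinks pl ON pl.pl_namespace = p.page_namespace AND pl.pl_title = p.page_title;

-- After: the target's namespace and title live in linktarget,
-- and pagelinks joins to it via pl_target_id
SELECT p.page_title, pl.pl_from
FROM page p
JOIN linktarget lt ON lt.lt_namespace = p.page_namespace AND lt.lt_title = p.page_title
JOIN pagelinks pl ON lt.lt_id = pl.pl_target_id;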
Thanks for the detailed recommendation! Updated at https://meta.wikimedia.org/wiki/Tech/News/2024/20
Sidenote (Tech News insider stuff): I'm hesitant about including that much technical detail (as it's a burden for the translators, and confusing + bulky for those completely unfamiliar with the topic...), but we're trying to improve Tech News (cf. T363876) so we need to experiment with some aspects in order to get clear example-based feedback, which I think this will help with. :)
Change #1043756 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@master] schema: Drop old pagelinks columns
I am running MW 1.43-wmf.9 and started running update.php last night. It seems that the updater is very slow (running migrateLinksTable on pagelinks). I have now started running this migration in a separate process. Do you have any idea how long the migration might take? My first estimate is 12 days for our 12M pages. This seems very slow; does that make sense?
It depends on how connected those pages are (i.e., the size of the pagelinks table). 12 days for 12M sounds about right. You could potentially increase the batch size if they are mostly connected to one set of popular pages.
Thank you. This is very helpful. I think one of the problems is that we are also writing to the wiki simultaneously.
MariaDB [(none)]> SELECT table_name AS `Table`,
    round(((data_length + index_length) / 1024 / 1024 / 1024), 2) AS `Size in GB`
    FROM information_schema.TABLES
    WHERE table_name LIKE '%links'
    ORDER BY (data_length + index_length) DESC;
+---------------+------------+
| Table         | Size in GB |
+---------------+------------+
| pagelinks     |      22.43 |
| externallinks |       9.92 |
| templatelinks |       7.45 |
| categorylinks |       0.09 |
| iwlinks       |       0.01 |
+---------------+------------+
And it is a bit faster than originally expected, so maybe only 6 days. (Edit: it was actually much faster: real 1510m58.783s, i.e. just over a day.)
Change #1043756 merged by jenkins-bot:
[mediawiki/core@master] schema: Drop old pagelinks columns
tacsipacsi opened https://gitlab.wikimedia.org/toolforge-repos/poty-stuff/-/merge_requests/2
Update pagelinks access
I didn’t want to spam here… It looks like @CodeReviewBot is smarter than I thought. Sorry!
legoktm merged https://gitlab.wikimedia.org/toolforge-repos/poty-stuff/-/merge_requests/2
Update pagelinks access