Page MenuHomePhabricator

daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name
Closed, ResolvedPublic

Description

The daily-article-l import failed while importing subscribers. More details to come, but for now I've deleted the list out of Mailman3, it should still be active on Mailman2.

Event Timeline

sqlalchemy.exc.InternalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (pymysql.err.InternalError) (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x8Dku...' for column `mailman3`.`user`.`display_name` at row 1") [SQL: 'UPDATE user SET display_name=%(display_name)s WHERE user.id = %(user_id)s'] [parameters: {'display_name': 'Priya😍kushwaha', 'user_id': 85452}] (Background on this error at: http://sqlalche.me/e/2j85)

Yes, that's an emoji in the display name.

CREATE TABLE `user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `display_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `password` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `_user_id` char(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `_created_on` datetime DEFAULT NULL,
  `is_server_owner` tinyint(1) DEFAULT NULL,
  `_preferred_address_id` int(11) DEFAULT NULL,
  `preferences_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `ix_user__user_id` (`_user_id`),
  KEY `ix_user_preferences_id` (`preferences_id`),
  KEY `_preferred_address` (`_preferred_address_id`),
  CONSTRAINT `_preferred_address` FOREIGN KEY (`_preferred_address_id`) REFERENCES `address` (`id`) ON DELETE SET NULL,
  CONSTRAINT `user_ibfk_1` FOREIGN KEY (`preferences_id`) REFERENCES `preferences` (`id`),
  CONSTRAINT `CONSTRAINT_1` CHECK (`is_server_owner` in (0,1))
) ENGINE=InnoDB AUTO_INCREMENT=85453 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

I really don't understand why this field is forced to utf8 and not uf8mb4.

@jcrespo and I discussed this on #wikimedia-databases. In summary,

This also affected the education@ list.

Legoktm renamed this task from daily-article-l import to Mailman3 failed to daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name.May 7 2021, 9:44 PM

I fixed education and re-migrated it. daily-article-l otoh, has lots of such cases, will clean that later.

The person that answered seems like someone in the know, and even if we don't get a fix, it provide us the understanding that the hardcoding of the charset (not the collation, _bin) was not on done purpose, which would give us more confidence to migrate to utf8mb4 on our side without collateral issues.

Offtopic, but I found in the past several projects, even big ones (OpenStack, Etherpad, Gerrit) where software contributors didn't realized the subtleties of each individual engine until we pointed them. And honestly, it is fair because having to maintain many of them and even different ones within each type (MariaDB, MySQL-pre 8.0, MySQL-post 8.0, Percona, ...) is not an easy task. I think we, as SREs/DBAs can contribute in this subject.

Charsets are something in which the Wikimedia community is a bit special because of the large number of languages we have to maintain (>300), easily discovering compatibility issues. And Mediawiki only avoided many problems by maintaining the charset operations at pure application side (using binary strings on persistence layer)- which is a lot of work but pays of in the long term. That is why I suggested to report it first upstream, as they will know way more about the reasoning behind the status quo, and the room to modify it ourselves. We can even test it and report back/send a contribute back to fix the "proper unicode" support.

Mentioned in SAL (#wikimedia-operations) [2021-05-19T06:18:29Z] <Amir1> upgrading daily-article-l to mailman3 (T282271 T280322)

Ladsgroup assigned this task to Legoktm.

Migrated like a charm