The daily-article-l import failed while importing subscribers. More details to come, but for now I've deleted the list out of Mailman3, it should still be active on Mailman2.
Description
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Legoktm | T282271 daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name | |||
Resolved | • Marostegui | T282621 Mailman3 schema change: change utf8 columns to utf8mb4 |
Event Timeline
sqlalchemy.exc.InternalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (pymysql.err.InternalError) (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x8Dku...' for column `mailman3`.`user`.`display_name` at row 1") [SQL: 'UPDATE user SET display_name=%(display_name)s WHERE user.id = %(user_id)s'] [parameters: {'display_name': 'Priya😍kushwaha', 'user_id': 85452}] (Background on this error at: http://sqlalche.me/e/2j85)
Yes, that's an emoji in the display name.
CREATE TABLE `user` ( `id` int(11) NOT NULL AUTO_INCREMENT, `display_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL, `password` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL, `_user_id` char(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL, `_created_on` datetime DEFAULT NULL, `is_server_owner` tinyint(1) DEFAULT NULL, `_preferred_address_id` int(11) DEFAULT NULL, `preferences_id` int(11) DEFAULT NULL, PRIMARY KEY (`id`), KEY `ix_user__user_id` (`_user_id`), KEY `ix_user_preferences_id` (`preferences_id`), KEY `_preferred_address` (`_preferred_address_id`), CONSTRAINT `_preferred_address` FOREIGN KEY (`_preferred_address_id`) REFERENCES `address` (`id`) ON DELETE SET NULL, CONSTRAINT `user_ibfk_1` FOREIGN KEY (`preferences_id`) REFERENCES `preferences` (`id`), CONSTRAINT `CONSTRAINT_1` CHECK (`is_server_owner` in (0,1)) ) ENGINE=InnoDB AUTO_INCREMENT=85453 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
I really don't understand why this field is forced to utf8 and not uf8mb4.
@jcrespo and I discussed this on #wikimedia-databases. In summary,
- Using utf8 is wrong because it doesn't support emojis and other non-Western languages. utf8mb4 is really what we want.
- We default to utf8mb4 for everything except these fields which are explicitly forced to utf8 in mailman: https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/database/types.py#L105
- For now just drop the emoji from the username and continue with the import. Given this is an announce-only list display name doesn't really matter.
- Work with upstream to switch these fields from using utf8_bin to utf8mb4_bin, which is also case sensitive (see https://mariadb.com/kb/en/supported-character-sets-and-collations/)
- Finally perform an alter/schema change on our database, switching to utf8mb4
I fixed education and re-migrated it. daily-article-l otoh, has lots of such cases, will clean that later.
The person that answered seems like someone in the know, and even if we don't get a fix, it provide us the understanding that the hardcoding of the charset (not the collation, _bin) was not on done purpose, which would give us more confidence to migrate to utf8mb4 on our side without collateral issues.
Offtopic, but I found in the past several projects, even big ones (OpenStack, Etherpad, Gerrit) where software contributors didn't realized the subtleties of each individual engine until we pointed them. And honestly, it is fair because having to maintain many of them and even different ones within each type (MariaDB, MySQL-pre 8.0, MySQL-post 8.0, Percona, ...) is not an easy task. I think we, as SREs/DBAs can contribute in this subject.
Charsets are something in which the Wikimedia community is a bit special because of the large number of languages we have to maintain (>300), easily discovering compatibility issues. And Mediawiki only avoided many problems by maintaining the charset operations at pure application side (using binary strings on persistence layer)- which is a lot of work but pays of in the long term. That is why I suggested to report it first upstream, as they will know way more about the reasoning behind the status quo, and the room to modify it ourselves. We can even test it and report back/send a contribute back to fix the "proper unicode" support.
Mentioned in SAL (#wikimedia-operations) [2021-05-19T06:18:29Z] <Amir1> upgrading daily-article-l to mailman3 (T282271 T280322)