daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Legoktm
	May 7 2021, 6:58 PM

Description

The daily-article-l import failed while importing subscribers. More details to come, but for now I've deleted the list out of Mailman3, it should still be active on Mailman2.

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		Legoktm	T282271 daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name
		Resolved		• Marostegui	T282621 Mailman3 schema change: change utf8 columns to utf8mb4

Event Timeline

Legoktm created this task.May 7 2021, 6:58 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 7 2021, 6:58 PM

sqlalchemy.exc.InternalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (pymysql.err.InternalError) (1366, "Incorrect string value: '\\xF0\\x9F\\x98\\x8Dku...' for column `mailman3`.`user`.`display_name` at row 1") [SQL: 'UPDATE user SET display_name=%(display_name)s WHERE user.id = %(user_id)s'] [parameters: {'display_name': 'Priya😍kushwaha', 'user_id': 85452}] (Background on this error at: http://sqlalche.me/e/2j85)

Yes, that's an emoji in the display name.

CREATE TABLE `user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `display_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `password` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `_user_id` char(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `_created_on` datetime DEFAULT NULL,
  `is_server_owner` tinyint(1) DEFAULT NULL,
  `_preferred_address_id` int(11) DEFAULT NULL,
  `preferences_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `ix_user__user_id` (`_user_id`),
  KEY `ix_user_preferences_id` (`preferences_id`),
  KEY `_preferred_address` (`_preferred_address_id`),
  CONSTRAINT `_preferred_address` FOREIGN KEY (`_preferred_address_id`) REFERENCES `address` (`id`) ON DELETE SET NULL,
  CONSTRAINT `user_ibfk_1` FOREIGN KEY (`preferences_id`) REFERENCES `preferences` (`id`),
  CONSTRAINT `CONSTRAINT_1` CHECK (`is_server_owner` in (0,1))
) ENGINE=InnoDB AUTO_INCREMENT=85453 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

I really don't understand why this field is forced to utf8 and not uf8mb4.

Matthewrbowker subscribed.May 7 2021, 7:07 PM

@jcrespo and I discussed this on #wikimedia-databases. In summary,

Using utf8 is wrong because it doesn't support emojis and other non-Western languages. utf8mb4 is really what we want.
- We default to utf8mb4 for everything except these fields which are explicitly forced to utf8 in mailman: https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/database/types.py#L105
For now just drop the emoji from the username and continue with the import. Given this is an announce-only list display name doesn't really matter.
Work with upstream to switch these fields from using utf8_bin to utf8mb4_bin, which is also case sensitive (see https://mariadb.com/kb/en/supported-character-sets-and-collations/)
Finally perform an alter/schema change on our database, switching to utf8mb4

Maintenance_bot added a project: SRE.May 7 2021, 7:45 PM

This also affected the education@ list.

Legoktm renamed this task from daily-article-l import to Mailman3 failed to daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name.May 7 2021, 9:44 PM

I fixed education and re-migrated it. daily-article-l otoh, has lots of such cases, will clean that later.

I filed https://gitlab.com/mailman/mailman/-/issues/891 upstream

Legoktm added a project: Upstream.May 8 2021, 1:27 AM

Peachey88 moved this task from Backlog to Reported Upstream on the Upstream board.May 8 2021, 3:08 AM

• Ladsgroup moved this task from Backlog to Upstream on the Wikimedia-Mailing-lists board.May 8 2021, 3:42 AM

Jony subscribed.May 8 2021, 1:13 PM

• Ladsgroup mentioned this in T282310: Old pending actions in migrated ML were not imported.May 8 2021, 4:53 PM

In T282271#7071118, @Legoktm wrote:

I filed https://gitlab.com/mailman/mailman/-/issues/891 upstream

The person that answered seems like someone in the know, and even if we don't get a fix, it provide us the understanding that the hardcoding of the charset (not the collation, _bin) was not on done purpose, which would give us more confidence to migrate to utf8mb4 on our side without collateral issues.

Offtopic, but I found in the past several projects, even big ones (OpenStack, Etherpad, Gerrit) where software contributors didn't realized the subtleties of each individual engine until we pointed them. And honestly, it is fair because having to maintain many of them and even different ones within each type (MariaDB, MySQL-pre 8.0, MySQL-post 8.0, Percona, ...) is not an easy task. I think we, as SREs/DBAs can contribute in this subject.

Charsets are something in which the Wikimedia community is a bit special because of the large number of languages we have to maintain (>300), easily discovering compatibility issues. And Mediawiki only avoided many problems by maintaining the charset operations at pure application side (using binary strings on persistence layer)- which is a lot of work but pays of in the long term. That is why I suggested to report it first upstream, as they will know way more about the reasoning behind the status quo, and the room to modify it ourselves. We can even test it and report back/send a contribute back to fix the "proper unicode" support.

Legoktm mentioned this in T282621: Mailman3 schema change: change utf8 columns to utf8mb4.May 11 2021, 9:56 PM

Legoktm closed subtask T282621: Mailman3 schema change: change utf8 columns to utf8mb4 as Resolved.May 19 2021, 6:07 AM

Mentioned in SAL (#wikimedia-operations) [2021-05-19T06:18:29Z] <Amir1> upgrading daily-article-l to mailman3 (T282271 T280322)

Stashbot mentioned this in T280322: Upgrade mailing lists from mailman2 to 3 in batches.May 19 2021, 6:18 AM

Migrated like a charm

jcrespo mentioned this in T281070: hyperkitty didn't import all wikitech-l messages.Apr 12 2022, 11:09 AM

daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display nameClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name
Closed, ResolvedPublic
Actions

Related Objects
Search...