Page MenuHomePhabricator

CommunityConfiguration: Switch to generating schemas from PHP classes rather than committing them as JSON files directly
Closed, ResolvedPublic3 Estimated Story Points

Description

Implementation of T351232: Community configuration 2.0: Consider generating JSONSchema from PHP classes rather than committing them directly:

Experience has also shown that is its beneficial to maintain the schema as constants in a PHP class, from which we can then generate a JSON Schema. This is nice because we can use constants for things like namespace IDs, and can make use of (e.g. nullable types like "?string", or use "list" and "map" rather than "array" and "object").

While I'm unsure about the implementation details, I strongly agree with the idea of using PHP things like constants and types. This is one of the main disadvantages of extension.json files.

Let's create a concept of a SchemaBuilder, which will be able to convert a PHP class into a schema (as of now, a JSONSchema specifically). This schema can then be passed to the validation library (in our case, JsonSchema\Validator), or used in places like the configuration editor or reader (to get the allowlist).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 1007664 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/CommunityConfiguration@master] Switch to storing configuration schemas as PHP classes

https://gerrit.wikimedia.org/r/1007664

Implementation notes

Using PHP classes as the source of truth for the JSON schemas will help in several ways. We will be able to easily load the schemas as needed, without having to wonder in which file on the filesystem it is exactly stored (MediaWiki's autoloader for PHP classes would do that for us). We will also be able to restrict the JSONSchema capabilities to a strictly defined subset, decreasing the chances of client extensions formatting their schema in a way we didn't anticipate or work against.

I reviewed maintenance/generateConfigSchema.php and found it generates a schema using data gathered by the ReflectionSchemaSource class. The ReflectionSchemaSource class expects to operate on top of a schema-class, which has a set of PHP constants, each being an array describing a certain configuration option. In MediaWiki Core, an example of such a class is MainConfigSchema. Each constant supports the following options:

  • type for specifying its datatype
  • obsolete for declaring an option as obsolete
  • default and dynamicDefault for determining a default value of a configuration option (either statically, or dynamically using a PHP callable)

Optionally, PHPDoc can be read as a description into the JSON Schema. We will not need this capability for CommunityConfiguration, as the schema is computed and directly passed to the validator, so no human would be able to read the description key if present.

For CommunityConfiguration, I ended up introducing a JsonSchema interface, which introduces a list of constants to be referred within the schema class, as well as default values for special-constants such as SCHEMA_URI or VERSION (see below).

ReflectionSchemaSource reads the specified schema class via PHP reflection and returns a list of defined configuration options. This can be then used by CommunityConfiguration to build a valid JSON Schema. A basic schema might look like this:

[
	'$schema' => 'https://json-schema.org/draft-07/schema#',
	'$id' => str_replace( '\\', '/', $this->className ) . '/1.0.0',
	'type' => 'object',
	'properties' => ReflectionSchemaSource::load()['config-schema'] ?? [],
	'additionalProperties' => false,
];

Such a schema has several properties we need to build around:

  • Version in $id is hardcoded: This can be easily fixed by defining a special constant VERSION in the schema class, which could be properly set within the schema class to bump the version. Same approach was selected fro the $schema, although there it might make more sense to just hardcode it, as we do not expect client extensions to use arbitrary versions of the schema standard.
  • No support for required: There is no way to specify which options are considered required and which are optional. This could be reasonably easily added to ReflectionSchemaSource (or a dedicated version of such a class) if needed. It is not included in the first version (see https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CommunityConfiguration/+/1007664).
  • type is hardcoded: Personally, I consider this to not be an issue. For most (if not all) providers, it makes sense for the top-level type to be object anyway. When a single-setting would be desirable, it can be achieved by introducing a single top-level key easily.
  • additionalProperties is hardcoded: Personally, I cannot imagine a scenario in which additionalProperties would need to be allowed at the top level. However, if that was the case, it can be easily implemented later similar to the $id version described above.

Questions

  • Is it a good idea to use ReflectionSchemaSource directly? Or does Growth need to maintain its own version of the class?
  • Do we need to mark options as required (is there a safe fallback we can make use of instead)? If so, is it reasonable to add support for required to the ReflectionSchemaSource class (depends on the answer to the previous question)?
  • How to handle per-type validation requirements? For example, how would { "type": "string", "minLength": 2, "maxLength": 3} or { "type": "string", "pattern": "^(\\([0-9]{3}\\))?[0-9]{3}-[0-9]{4}$"} be represented?
Urbanecm_WMF set the point value for this task to 3.Mar 5 2024, 5:59 PM

Change 1007664 merged by jenkins-bot:

[mediawiki/extensions/CommunityConfiguration@master] Switch to storing configuration schemas as PHP classes

https://gerrit.wikimedia.org/r/1007664

  • Is it a good idea to use ReflectionSchemaSource directly? Or does Growth need to maintain its own version of the class?

I think re-using ReflectionSchemaSource (or some code extracted from it) is a good idea. We are likely to start using it in the REST API as well, to define scheas for request and response bodies. It would be annoying of we ended up having multiple slightly different ways to representschemas in PHP.

  • Do we need to mark options as required (is there a safe fallback we can make use of instead)? If so, is it reasonable to add support for required to the ReflectionSchemaSource class (depends on the answer to the previous question)?

I think that would be reasonable, yes. Though it should be noted that in JSON schema, "required" is a list of propertiy names that lives on the container object, it's not a flag on the definition of individual fields.

  • How to handle per-type validation requirements? For example, how would { "type": "string", "minLength": 2, "maxLength": 3} or { "type": "string", "pattern": "^(\\([0-9]{3}\\))?[0-9]{3}-[0-9]{4}$"} be represented?

I think the best approach would be to make use of the format mechansim defined by JSON Schema. Custom validators could be registered for each format. They could even be part of the schema class: static functions that have a name that matches "validateXyz" could be collected and returned by ReflectionSchemaSource. During validation, we could look at the "format" associated with a given string property, and call the validator function.

For validation against a configuration, e.g. checking for a group name or namespace name, a non-static validator would have to be registerred explicitly, perhaps using a hook.

  • Do we need to mark options as required (is there a safe fallback we can make use of instead)? If so, is it reasonable to add support for required to the ReflectionSchemaSource class (depends on the answer to the previous question)?

I think that would be reasonable, yes. Though it should be noted that in JSON schema, "required" is a list of propertiy names that lives on the container object, it's not a flag on the definition of individual fields.

I'm sorry, I probably didn't phrase the question very clearly. By the first subquestion ("should we do that"), I was referring to whether we even want support for required in the context of CC. I have two reasons why that might not be a good idea:

  • Whenever the configuration wikipage does not exist, CommunityConfiguration behaves exactly the same as if its content was {} (an empty JSON dictionary). Whenever a configuration key is not available in the wikipage, CommunityConfiguration determines a fallback value (right now, it comes from GlobalVarConfig, but we're considering shifting to the default key in the JSON schema) and uses that instead. If we support required, then the default behaviour of {} will not work (or to be more precise, it wouldn't pass the validator and the client extension would need to handle it as a "CommunityConfiguration is unavailable" error scenario).
  • If a client extension defines MediaWiki:Foo.json with a certain schema and then decides to add new (required) options to the schema, none of the existing configuration pages would pass the validator after deployment. If we combine additionalProperties: false and required, we might easily end up requiring on-wiki edits upon train deployment and train rollback, which is definitely not a good idea. Data Engineering solves this problem with analytics schemas by only allowing compatible changes to schemas (required is supported, but you're not allowed to add new fields into it as part of schema updates, as that is not a compatible change).

On the other hand, conceptually it makes some sense to allow client extensions to say "XYZ needs to be configured by the wiki admins in some way".

With the first subquestion ("do we want required support"), I meant to ask whether it is a good idea to outright say that required is not supported at all in our JSONSchema implementation (or in other words, is always forced to []), or whether there is a better solution for the required issues I mentioned above. The second subquestion was targetting the case in which we decide "okay, we want required" (which is the less important part of the question, admittedly).

Do you have any thoughts on this please, @daniel?

  • How to handle per-type validation requirements? For example, how would { "type": "string", "minLength": 2, "maxLength": 3} or { "type": "string", "pattern": "^(\\([0-9]{3}\\))?[0-9]{3}-[0-9]{4}$"} be represented?

I think the best approach would be to make use of the format mechansim defined by JSON Schema. Custom validators could be registered for each format. They could even be part of the schema class: static functions that have a name that matches "validateXyz" could be collected and returned by ReflectionSchemaSource. During validation, we could look at the "format" associated with a given string property, and call the validator function.

For validation against a configuration, e.g. checking for a group name or namespace name, a non-static validator would have to be registerred explicitly, perhaps using a hook.

Thanks for the advice. I'll look into this and come back to you if I have any questions.