https://wikimediafoundation.org/wiki/User:Sthottingal-WMF
Maintainer and engineer for ContentTranslation, UniversalLanguageSelector, and general MediaWiki internationalization
CXServer removed its dependency on service-runner (T357950: Remove servicerunner dependency for cxserver) and was deployed to production last week. Since service-runner and the associated service template had a huge influence on how a Node.js service is written, it was not an easy migration. This was also partly because cxserver was written in 2015 and has since grown into a complex system.
Looks good to me
It looks like newlines in the ECS JSON data are treated as separate log entries, and log.level is interpreted as NOTICE when the value is "error". I can't tell whether this is related to the containerd migration. If newlines are not allowed in ECS log messages, we can fix it on the cxserver side (stack traces, if present in a log message, will contain newlines). But it would be nice if there were a way to validate this without deploying a change and seeing what happens.
Since we prioritize user experience, and sending a large chunk causes a proportional delay in response time, cxserver accepting larger chunks will not help users. Clients need to send smaller chunks of content in sequential batches, as in the sketch below.
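A minimal sketch of the sequential-batch approach; the endpoint URL and the payload/response shapes are illustrative assumptions, not the documented cxserver API:

```
# Send content in smaller sequential chunks instead of one large request.
import requests

MT_URL = "https://cxserver.wikimedia.org/v2/translate/en/es/Google"  # hypothetical

def translate_in_chunks(html_sections):
    """POST each section separately, in order, and collect the results."""
    session = requests.Session()
    translated = []
    for html in html_sections:
        resp = session.post(MT_URL, data={"html": html}, timeout=60)
        resp.raise_for_status()
        translated.append(resp.json().get("contents"))
    return translated
```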
Status: cxserver is deployed in production. A few metrics-reporting issues were noticed (some metrics missing) and are currently being fixed. Other than that, all APIs are functional.
@akosiaris We deployed this code in staging. The only issue we observe is that ECS logging is not parsed by Logstash.
Current status:
Observations from staging deployment:
The Google translation issue seems valid, but it is totally unrelated to the original one. Google translation and markup fixes are handled by the CX-cxserver project; MinT has no code related to Google.
In the design, is 'Search for a topic' about searching for a topic like "Arts", "Maths", or "Geography" (as in https://www.mediawiki.org/wiki/ORES/Articletopic)?
Or should the user type an article title like London_Fashion_Week? From the example "Cubism" I assume it is about searching for an article and then using it as a seed for suggestions.
The following JS error is noted in the console:
jquery.js:3783 jQuery.Deferred exception: Cannot read properties of undefined (reading '$el') TypeError: Cannot read properties of undefined (reading '$el') at Object.displayDrawer (https://en.m.wikipedia.org/w/load.php?lang=en&modules=skins.minerva.scripts&skin=minerva&version=emw6f:13:428) at eval (https://en.m.wikipedia.org/w/load.php?lang=en&modules=skins.minerva.scripts&skin=minerva&version=emw6f:46:886) at mightThrow (https://en.m.wikipedia.org/w/load.php?lang=en&modules=%40wikimedia%2Fcodex%…-styles%2Cjquery%2Cvue%7Cmobile.startup&skin=minerva&version=1evph:193:648) at process (https://en.m.wikipedia.org/w/load.php?lang=en&modules=%40wikimedia%2Fcodex%…-styles%2Cjquery%2Cvue%7Cmobile.startup&skin=minerva&version=1evph:194:309) undefined
Trying to debug the issue. It seems the issue occurs with the nllb-wikipedia (default) model and not with the nllb-600m model.
The UserGetLanguageObject hook handler returns a user language. It uses the browser's Accept-Language header for this purpose unless there is a language cookie or a uselang override in the URL.
@isarantopoulos Agreed, let us recheck after two weeks. From our team's perspective, we expect frequent deployments this quarter, as the recommendation API is the basis for two of our KRs.
@isarantopoulos As discussed, @KartikMistry will be deploying the recommendation API for the LPL team. If he can get deployment access, we can avoid a dependency on the ML team for frequent deployments.
@akosiaris Horizontal scaling is required; we will reach out for that separately, very soon.
@GMikesell-WMF It is a backend feature, verifiable by developers. You may skip it; we have already verified it.
Hi @jijiki, any updates on this request? Thanks.
- Machine-Translate Content Endpoint
- HTTP Verb: POST
- Production Endpoint:
- <domain>/api/rest_v1/transform/html/from/{from}
- cxserver Endpoint:
- <domain>/v1/transform/html/from/{from}
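A minimal sketch exercising the mapping above with a POST through both entry points; the domains and the {"html": ...} body shape are illustrative assumptions, not the documented request format:

```
# Call the same transform via the production REST path and directly on cxserver.
import requests

FROM = "en"
production = f"https://en.wikipedia.org/api/rest_v1/transform/html/from/{FROM}"
direct = f"https://cxserver.wikimedia.org/v1/transform/html/from/{FROM}"

payload = {"html": "<p>Hello world</p>"}  # assumed body shape
for url in (production, direct):
    resp = requests.post(url, data=payload, timeout=60)
    print(url, resp.status_code)
```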
In T357950#10091560, @MSantos wrote: This is great work! One question that I have is whether there's a plan to incorporate the changes into the service template. Would that be in scope?
A) Languages that already have a Wikipedia and MT support. We can enable the new Google support as a non-default option to give them another choice, with no need for specific coordination:
This issue has been resolved. The API is working as expected now.
A minor issue to address before closing the ticket is the broken documentation at https://api.wikimedia.org/service/lw/recommendation/docs
Currently https://api.wikimedia.org/service/lw/recommendation/v1api/v1/translation?source=en&target=fr&count=12&seed=Apple works.
An ideal API URL should be
The https://translation.googleapis.com/language/translate/v2/languages API for listing supported languages shows all the new languages. However, actual translation fails for the new languages:
{ "error": { "code": 400, "message": "Bad language pair: en|to", "errors": [ { "message": "Bad language pair: en|to", "domain": "global", "reason": "badRequest" } ], "details": [ { "@type": "type.googleapis.com/google.rpc.BadRequest", "fieldViolations": [ { "field": "target", "description": "Target language: to" } ] } ] }
Thanks @kevinbazira. I also tested, LGTM.
@jijiki That should be OK. Our team's capacity is also thin this month.
Meanwhile, we implemented diskcache-based caching, which we plan to use as a fallback cache option (for use in dev boxes, testing, etc.); a sketch follows.
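A minimal sketch of the diskcache-based fallback; the cache path and the memoized function are illustrative assumptions, not the actual code:

```
# Cache expensive results on local disk so dev boxes and tests need no
# shared cache backend.
import diskcache

cache = diskcache.Cache("/tmp/recommendation-cache")

@cache.memoize(expire=3600)  # keep results on local disk for an hour
def fetch_recommendations(source, target, seed):
    # ...expensive upstream call goes here...
    return []
```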
@kevinbazira I added a CXSERVER_HEADER config value to match the env values in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1058574/2/helmfile.d/ml-services/recommendation-api-ng/values-ml-staging-codfw.yaml#23
@Isaac, both are good ideas. We had discussed the second one in our team. The first one adds flexibility with minimal technical cost on our side.
@Isaac As a real example to work on for the first iteration, I created the simplest version of a campaign at https://meta.wikimedia.org/wiki/User:Santhosh.thottingal/Essential_Biography. You can see our campaign marker in the page, along with the list.
@kevinbazira This looks great. There is a minor blocker for production deployment, though. Our CX client-side code sends query params as s, t, n, etc., and the new API does not accept them. I submitted a patch for this, and if it goes out with this week's train, we should be able to deploy the new API by early next week.
As per the discussions regarding early technical iterations towards this goal, we decided the following:
Do we really need an external library here? What are the limitations we see with vanilla JS?
Node 20 supports a native test runner, so we can use that opportunity to untangle our tests from service-runner. cxserver tests need not depend on service-runner; maybe they can use Express for a test server.
Initial exploration: https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1055769/
CampaignEvents is installed, though. If we want API-based output like api.php?action=query&prop=translationcampaign&titles=WikiForHumanRights returning the List page, we can enhance that extension. For an MVP, some marker in the page is enough so that these pages can be retrieved using the search API, probably using incontent or hastemplate as outlined in https://www.mediawiki.org/wiki/Help:CirrusSearch. A sketch follows.
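A minimal sketch of retrieving such pages through the search API; the marker template name "Translation campaign" is hypothetical, not an existing template:

```
# Find campaign pages via CirrusSearch's hastemplate: filter.
import requests

API = "https://meta.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "list": "search",
    "srsearch": 'hastemplate:"Translation campaign"',
    "format": "json",
}
for hit in requests.get(API, params=params).json()["query"]["search"]:
    print(hit["title"])
```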
I had thought about this topic+article mixing, and I have an idea for its implementation, but I deferred it to another patch, once these are merged and tested.
A suggestion for supporting both server-side rendering and future client-side interactivity:
The source code at https://github.com/wikimedia/research-recommendation-api has a lot of legacy code and broken or unmaintained dependencies. The web frontend uses Bower, jQuery, and similarly old tooling. Recent updates by the Machine Learning team got it somewhat functional, to the extent that it is integrated with Lift Wing. But adding new features requires more fixups to get a smooth local development experience. We can ignore the web frontend part (AKA gapfinder) for now, as we are interested only in the API.
My preference is to enhance the "new" recommendation API at https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_content_translation_recommendation so that it can accept a topic (for example: Chemistry, History, Africa, Music) and give recommendations. It should accept more than one topic. We can also look at an intersection of topic and article at a later stage. A sketch of the proposed call follows.
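A minimal sketch of how the enhanced API could be called; the topics parameter and its pipe-separated multi-value form are proposals here, not the current API:

```
# Proposed topic-based recommendation request (parameters are hypothetical).
import requests

API = "https://api.wikimedia.org/service/lw/recommendation/v1api/v1/translation"
params = {"source": "en", "target": "ml", "count": 12, "topics": "Chemistry|Music"}
print(requests.get(API, params=params).json())
```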
This ticket proposes to adjust the tab that the user navigates to by default, by considering the user's previous selections and the existence of previous contributions:
Internally, in CX production and in our developer workflows, we directly use the cxserver APIs, not the RESTBase APIs like https://en.wikipedia.org/api/rest_v1/#/Transforms/doMT.
Dda + nukta forming the same ligature rendering as rra is a common issue in Gurmukhi fonts. For example, Ektype's Mukta has this issue. This practice of giving the nukta form and RRA the same shape is not advised, yet many fonts do it. That is why you see two different shapes as reported above. Common users, unaware of this encoding difference and focusing only on rendering, use them interchangeably. This wrong usage appears in corpora. For example, in many Dravidian scripts I have seen people using 0 (zero) in place of ഠ, : (colon) instead of ഃ (visarga), and so on. A neural MT system learns these, and the same issues appear in MT output. I have seen this issue in many other languages too.
@elukey Thanks for these details. Currently in our code, models are downloaded by a bootstrap shell script (called via the docker entrypoint mechanism) using simple wget. These models are then mounted on the docker volume. So our server code just assumes the models are present at a configurable file-system location. Do you see any issue if we follow this approach? Do the caveats you mentioned complicate it?
A round-trip technique like wikitext → HTML → wikitext is one way to achieve this. However, it has limitations. For example, if the wikitext has a template and one of the template parameters is nested wikitext, we will miss it in the HTML rendering (for example, i18n sentences with plural syntax). So the translation will be incomplete.
I was able to reproduce this and find the pattern that causes the issue: repeated references. Only the first one gets fixed in MT; from the second one onwards, it appears as plain text. A few months back I addressed this by keeping a search start position in the lookup logic, but it is not catching repetitions outside the sentence. I am exploring potential solutions.
The CX entrypoint is also duplicated if you click multiple times while the language selector is loading: