Internal Server Errors from Zotero with nytimes.com
Open, Medium | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Edit a page
  • Click the Cite tool
  • Paste a nytimes.com article link

What happens?:

The automatic cite creation tool now consistently fails on nytimes links. I swear it worked before?

Screenshot 2022-11-15 at 2.12.57 PM.png (536×850 px, 90 KB)

What should have happened instead?:

It should automatically create a citation using the "cite news" template.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

I've checked, and the URLs work locally for me.

Usually in this situation I'd assume we'd been blocked upstream for traffic, but it's odd it'd be inconsistent if that's the case.

Can you provide URLs that DO work? I've tried a few more and none of them work.

Edit: is it just nytimes.com that works? Perhaps the root URL has a different access policy than the article URLs.

@akosiaris I can't find anything in the logs... did we stop logging 404s, or am I just bad at finding things?

FWIW: I am able to reproduce this issue as well using the URL to an article [i] chosen randomly from the nytimes.com homepage:

Screen Shot 2022-11-17 at 8.32.00 AM.png (626×988 px, 126 KB)


i. https://www.nytimes.com/live/2022/11/17/us/election-news-results

ppelberg triaged this task as Medium priority. Nov 17 2022, 4:33 PM
ppelberg moved this task from To Triage to Triaged on the VisualEditor board.

Can you provide URLs that DO work? I've tried a few more and none of them work.

This worked for me just now: https://www.nytimes.com/interactive/2022/11/16/us/elections/republicans-house-congress.html

Stupid question: could this have something to do with their paywall?

@akosiaris I can't find anything in the logs... did we stop logging 404s, or am I just bad at finding things?

There hasn't been any change that I am aware of. Also, trying "https://www.nytimes.com/live/2022/11/17/us/election-news-results" just now worked fine. This appears to be related to some behavior on the nytimes.com site. Rate limiting, perhaps?

Setting a query filter of

"request.query.search": "https://www.nytimes.com/live/2022/11/17/us/election-news-results"

in logstash returns nothing for the last 3 days.

I think this bug could be closed because it doesn't seem possible to reproduce anymore. Maybe it was some temporary rate limiting.

Hopefully it will stay closed but if it happens again we can re-open

WhatamIdoing subscribed.

Same song, second verse. This hasn't worked for weeks.

I was hoping today's redeploy of zotero would fix this, but it seems to have fixed it only for eqiad and not for codfw, which is... weird.

Nytimes links are causing internal server errors. Before the deploy it was both data centers; now it's only one.

It's hard to debug because logging is turned off for Zotero in prod, so I don't actually have any way of finding out what the issue is (afaik); and it works locally, so my local logs aren't helpful :/

@jijiki - I fixed the dashboard at least so you can switch between eqiad and codfw now but I'm not sure it helps with this.

We are logging the errors from zotero in citoid https://logstash.wikimedia.org/app/dashboards#/view/5eaf4e40-f6b6-11eb-85b7-9d1831ce7631?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15m,to:now))&_a=(description:'Dashboard%20for%20citoid%20service%20(service-runner%20service).%20https:%2F%2Fwww.mediawiki.org%2Fwiki%2FCitoid',filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:kubernetes.namespace_name,negate:!f,params:(query:citoid),type:phrase),query:(match_phrase:(kubernetes.namespace_name:citoid)))),fullScreenMode:!f,options:(darkTheme:!f,useMargins:!t),query:(language:lucene,query:'*nytimes*'),timeRestore:!f,title:citoid,viewMode:view)

But it's not helpful. We just know they're happening, not what's actually going wrong with zotero.

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.codfw.wmnet:4969/web   # internal server error

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web   # works

curl -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.discovery.wmnet:4969/web   # internal server error
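
For anyone who wants to repeat the comparison above in one go, here is a rough Python sketch of the same three requests. The endpoint URLs are the ones from the curl commands; the function names, the timeout, and the choice to skip certificate verification for all three endpoints (the curl commands only pass -k for the first two) are assumptions made for illustration. It only runs meaningfully from a host that can reach the internal service addresses.

```python
import ssl
import urllib.error
import urllib.request

# Endpoints copied from the curl commands in this thread.
ENDPOINTS = {
    "codfw": "https://zotero.svc.codfw.wmnet:4969/web",
    "eqiad": "https://zotero.svc.eqiad.wmnet:4969/web",
    "discovery": "https://zotero.discovery.wmnet:4969/web",
}


def post_url(endpoint, article_url, timeout=30.0):
    """POST article_url as text/plain and return the HTTP status code."""
    # Rough equivalent of curl -k: skip certificate verification (assumption:
    # applied to every endpoint here for simplicity).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        endpoint,
        data=article_url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the status code is what we want.
        return err.code


def compare_endpoints(article_url, endpoints=None, post=post_url):
    """Return {endpoint name: HTTP status} for the same article URL."""
    targets = ENDPOINTS if endpoints is None else endpoints
    return {name: post(url, article_url) for name, url in targets.items()}
```

Calling compare_endpoints('http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html') returns one status code per endpoint, which makes a per-data-center difference easy to spot. The post parameter exists only so the comparison logic can be exercised without network access.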

Mvolz renamed this task from Automatic cite tool no longer works with nytimes.com to Internal Server Errors from Zotero with nytimes.com. Mar 7 2024, 1:19 PM

Zotero is using url downloader to access the internet. Its logs end up in logstash, e.g.

https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.10?id=kO1nHo4B1Aouzw__XpT6

Note the HTTP 403, which translates to Forbidden. Given that this URL works from pretty much everywhere else, that implies we got banned. If the ban is automatic and not manual, I expect it to have some trigger and some expiration. That means that in a few weeks' time, when we'll be switching over to eqiad from codfw, you'll after a while see eqiad getting banned and at some point, hopefully, codfw getting unbanned.

I assume we are issuing so many requests to nytimes.com that we are triggering either Fastly's automated measures (they appear to be fronted by at least Fastly) or someone's dashboards at nytimes.com.

Two suggestions:

  • Zotero should NOT be erroring out on a 403 but rather reporting back the 403 to Citoid and Citoid logging it in a nice, discoverable way.
  • Reaching out to nytimes.com, pointing out why the tool exists and asking for an exemption. If they ask which IP space the tool might reach out from, that would be 208.80.152.0/22 for IPv4 and 2620:0:860::/46 for IPv6, if they ever decide to invest in it.
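
To make the first suggestion more concrete, here is a minimal sketch of what "reporting back the 403 and logging it in a discoverable way" could look like on the calling side. This is not Citoid's or Zotero's actual code: the function names, the event name, and the field names are all hypothetical. The idea is just that the upstream status and the requested host end up as separate structured fields, so they can be filtered on directly instead of searching for '*nytimes*' in free text.

```python
import json
import logging
from urllib.parse import urlsplit


def upstream_failure_record(requested_url, upstream_status):
    """Build a structured record for a failed upstream fetch.

    All field names are hypothetical, chosen only for illustration.
    """
    return {
        "event": "upstream_fetch_failed",
        "requested_url": requested_url,
        "requested_host": urlsplit(requested_url).hostname,
        "upstream_status": upstream_status,
    }


def report_upstream_failure(logger, requested_url, upstream_status):
    """Log the failure as one JSON line and return the record to the caller."""
    record = upstream_failure_record(requested_url, upstream_status)
    logger.warning(json.dumps(record, sort_keys=True))
    return record


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_upstream_failure(
        logging.getLogger("citoid-sketch"),
        "http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html",
        403,
    )
```

With something along these lines, a 403 from the publisher would surface as upstream_status 403 for requested_host www.nytimes.com rather than as an opaque internal server error.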

Is the HTTP response body for those 403s saved anywhere?

No, it is not (it's up to Zotero to save them), but if I am right and we got banned, it should be reproducible with a curl call with the proper User-Agent header from urldownloader2004. If the ban is automatic, maybe the body is a Fastly captcha (indicating Zotero got identified as a bot, which is arguably correct). If it is manual, I'd expect either a message to contact them, or nothing.
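
A rough sketch of that check in Python, in case it is easier to script than curl. The User-Agent string is deliberately left as a parameter because the exact value Zotero sends is not given in this thread. The body classification is a crude guess that follows the reasoning above (captcha, some other message, or nothing) and is not a reliable detector. The function names are made up for illustration.

```python
import urllib.error
import urllib.request


def fetch_status_and_body(url, user_agent, timeout=30.0, max_bytes=2000):
    """GET url with the given User-Agent; return (status, start of body)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read(max_bytes).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 403 arrives here; its body is still readable from the exception.
        return err.code, err.read(max_bytes).decode("utf-8", errors="replace")


def describe_block_body(body):
    """Very rough guess at what kind of response body came back (assumption)."""
    text = body.strip().lower()
    if not text:
        return "empty body"
    if "captcha" in text:
        return "possible automated challenge"
    return "other message"
```

Run from the proxy host, fetch_status_and_body(article_url, user_agent) followed by describe_block_body(body) would show whether the 403 reproduces and give a first hint as to whether the block looks automated.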

Update: Citoid is working for nytimes.com today.

Thank you for the update, Sherry!

Note: the Editing Team is thinking through a couple of ways we could incrementally improve the current experience.

You can expect to see another comment from @Esanders, @Mvolz , or me before this week is over outlining those approaches.

It's broken again. Edits like this can only be done manually.

There was a spike in errors that dissipated ~160 minutes before your edit.

https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=1726077600000&to=1726102800000&viewPanel=46

Great spot...thank you for stopping by to make us aware, @WhatamIdoing!

For some broader context of where this work stands...

On 3 August 2024, there was a significant decrease in Citoid request volume (T372438) which seems to have [i] translated into an overall reduction in the rate at which Citoid requests fail.

As evidenced by what you experienced and noted in T323169#10139545, this doesn't seem to have fully resolved the issue. In response, we're doing an analysis in T374624 (led by @MNeisler) to distinguish Citoid requests that fail because the content cannot be found and requests that fail because the request was blocked.
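
As a small illustration of the distinction described above (content not found versus request blocked), failures could be bucketed by upstream HTTP status along these lines. Which status codes count as "blocked" or "not found" is an assumption made for this sketch; it is not the method used in T374624.

```python
from collections import Counter

# Assumed groupings, for illustration only.
BLOCKED_STATUSES = {401, 403, 429}
NOT_FOUND_STATUSES = {404, 410}


def classify_status(status):
    """Map an upstream HTTP status code to a coarse failure category."""
    if 200 <= status < 300:
        return "ok"
    if status in BLOCKED_STATUSES:
        return "blocked"
    if status in NOT_FOUND_STATUSES:
        return "not_found"
    return "other"


def tally_statuses(statuses):
    """Count how many responses fall into each category."""
    return Counter(classify_status(s) for s in statuses)
```

For example, tally_statuses([403, 404, 403, 200, 500]) counts two blocked responses, one not found, one ok, and one other.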

Knowing the above will inform what we do next, e.g. reaching out to publishers directly.

In the meantime, if/when you notice Citoid failing, we'd value y'all continuing to make us aware.


i. Thank you for linking to this @jeremyb-phone