Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Ivan Shmakov <ivan@siamics.netREMOVE.invalid>
Newsgroups: comp.misc,comp.infosystems.www.misc
Subject: URIs within URIs: google.com/url?q= et al.
Followup-To: comp.misc
Date: Fri, 20 Dec 2024 18:42:28 +0000
Organization: Dbus-free station.
Lines: 103
Message-ID: <7CehOkmKaRKK7ejb@violet.siamics.net>
References: <67447ce1$0$22$882e4bbb@reader.netnews.com>
    <vi3ecs$35u53$1@dont-email.me>
    <6c4ae24b-7bb8-7d84-8f74-1f5fc14c0ec0@example.net>
    <87ed2yjkl8.fsf@tilde.institute>
    <55db8483-58f0-c3dc-de0b-7f44881fa180@example.net>
    <87jzcp4pzy.fsf@enoch.nodomain.nowhere>
    <4875e490-ad30-d644-345f-4a09c1935c6b@example.net>
    <87frnb52zf.fsf@enoch.nodomain.nowhere>
Injection-Date: Fri, 20 Dec 2024 19:50:10 +0100 (CET)
Injection-Info: dont-email.me;
    posting-host="feb8f1092de6af2a8a8acda4f44c1277";
    logging-data="3770179";
    mail-complaints-to="abuse@eternal-september.org";
    posting-account="U2FsdGVkX1/dzKgb24f0id3J8DRktGqX"
Cancel-Lock: sha1:ve15xdl/4SXWlCAkDUDYgLKceSc=
License: CC0-1.0 (original contributions only)
Bytes: 5382

>>>>> On 2024-11-28, Mike Spencer wrote:

    [Cross-posting to news:comp.infosystems.www.misc just in case,
    but setting Followup-To: comp.misc still.  Feel free to
    disregard, though; if anything, I'll be monitoring both groups
    for some time for responses.]

> Here's a curiosity:

> Google also sends all of your clicks on search results back through
> Google.  I assume y'all knew that.

> If you search for (say):

>     leon "the professional"

> you get:

>     https://www.google.com/url
>     ?q=https://en.wikipedia.org/wiki/L%25C3%25A9on:_The_Professional
>     &sa=U&ved=2ahUKEwi [snip tracking hentracks/data]

> Note that the "real" URL which Google proposes to proxy for you
> contains non-ASCII characters:

>     en.wikipedia.org/wiki/L%25C3%25A9on:_The_Professional

> Wikipedia does *not* *have* a page connected to that URL!
> But if you click the link and send it back through Google, you reach
> the right Wikipedia page that *does* exist:

>     en.wikipedia.org/wiki/Leon:_The_Professional

And this page clearly states (search for "Redirected from" there)
that it was reached via an alias.  If you follow the "Article" link
from there, it'll lead you to .../L%C3%A9on:_The_Professional
instead, which is the proper URI for that Wikipedia article.

Think of it.  Suppose that Google has to return something like
http://example.com/?o=p&q=http://example.net/ as one of the results.
Can you just put it after google.com/url?q= directly without
ambiguity?  You'd get:

    http://google.com/url?q=http://example.com/?o=p&q=http://example.net/&...
                                                   ^1                    ^2

Normally, the URI would start after ?q= and go until the first
occurrence of & (^1), but in this case, it's actually the second
(^2) that terminates the intended URI.

Naturally, Google avoids it by %-encoding the ?s and &s, like:

    http://google.com/url?q=http://example.com/%3fo=p%26q=http://example.net/&...

By the same token, they need to escape %s themselves, should the
original URI contain any, so e. g. http://example.com/%d1%8a becomes
.../url?q=http://example.com/%25d1%258a&... .

Of course, Google didn't invent any of this: unless I be mistaken,
that's how HTML <form method="get" />s have worked from the get-go.
And you /do/ need something like Hello%3f%20%20Anybody%20home%3f
to put it after /guestbook?comment=.

FWIW, I tend to use the following Perl bits for %-encoding and
decoding, respectively:

    s {[^0-9A-Za-z/_.-]}{${ \sprintf ("%%%02x", ord ($&)); }}g;
    s {%([0-9a-fA-F]{2})}{${ \chr (hex ($1)); }}g;

> AFAICT, when spidering the net, Google finds the page that *does*
> exist, modifies it according to (opaque, unknown) rules of orthography
> and delivers that to you.  When you send that link back through
> Google, Google silently reverts the imposed orthographic "correction"
> so that the link goes to an existing page.

> Isn't that weird?
There's this bit near the end of the .../Leon:_The_Professional
page (lines split for readability):

<script type="application/ld+json">{
"@context":"https:\/\/schema.org",
"@type":"Article",
"name":"L\u00e9on: The Professional",
"url":"https:\/\/en.wikipedia.org\/wiki\/L%C3%A9on:_The_Professional",
[...]

I'm pretty certain that Google /does/ parse JSON-LD like in the
above, so I can only presume that when it finds a Web document that
points to a different "url": in this way, it (sometimes?) uses the
latter in preference to the original URI.

I've been thinking of adopting JSON-LD for my own Web pages
(http://am-1.org/~ivan/ , http://users.am-1.org/~ivan/ , etc.), but
so far have only used the (arguably more readable)
http://microformats.org/wiki/microformats2 (which I hope search
engines will at some point add support for).  Consider, e. g.:

    http://pin13.net/mf2/?url=http://am-1.org/~ivan/qinp-2024/112.l-system.en.xhtml

Note that ?url= above needs the exact same %-treatment as does
Google's /url?q=.  Naturally, the HTML form at http://pin13.net/mf2/
will do it for you.  (Or, rather: instruct your Web user agent to
do so.)
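
For those who'd rather poke at the URI-within-URI problem in something other than Perl, here's a rough Python sketch of the same idea.  The names pct_encode and pct_decode are made up for this example, and the escaping rule (everything outside 0-9A-Za-z/_.- becomes lowercase %xx) is only a guess at how the Perl substitutions behave when applied to UTF-8 bytes.  It is not a claim about what Google actually runs, and "&sa=U" merely stands in for the trailing parameters elided above.  Note that this rule also escapes ":" and "=", so its output differs cosmetically from the hand-written examples, though it decodes to the same thing.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: same "safe" set as the Perl character class [0-9A-Za-z/_.-].
_UNSAFE = re.compile(rb"[^0-9A-Za-z/_.-]")
_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")


def pct_encode(text: str) -> str:
    """Replace each UTF-8 byte outside the safe set with %xx (lowercase hex)."""
    raw = text.encode("utf-8")
    return _UNSAFE.sub(lambda m: b"%%%02x" % m.group()[0], raw).decode("ascii")


def pct_decode(text: str) -> str:
    """Turn every %xx back into the byte it stands for, then read as UTF-8."""
    raw = text.encode("utf-8")
    out = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return out.decode("utf-8")


# The inner URI from the example, which has a ?query of its own.
TARGET = "http://example.com/?o=p&q=http://example.net/"

# Pasted in verbatim: the outer query now carries two q= parameters,
# and the first one is cut short at the inner "&".
naive = "http://google.com/url?q=" + TARGET + "&sa=U"
naive_q = parse_qs(urlsplit(naive).query)["q"]

# %-encoded first: the outer query carries exactly one q=, and
# decoding it gives the inner URI back intact.
nested = "http://google.com/url?q=" + pct_encode(TARGET) + "&sa=U"
nested_q = parse_qs(urlsplit(nested).query)["q"]

print(naive_q)
print(nested_q)
print(pct_encode("http://example.com/%d1%8a"))
```

The same pct_encode call would apply to whatever goes after pin13.net/mf2/?url= or /guestbook?comment=; in practice one would more likely reach for urllib.parse.quote and unquote, which implement the standard rules rather than this hand-rolled approximation.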