AiK, nothing stops you from writing "User-agent: Yandex" in the same group as the Host directive :).
Besides, your proposal is definitely unacceptable:
1) It violates the robots.txt standard. The Disallow directive has a single, unambiguous interpretation.
Note that robots which follow the standard are supposed to simply ignore a line starting with 'Host'.
2) It would not let you disallow a mirror you don't know about (someone "parked" their unneeded host on your IP).
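For illustration, a record of the kind described above might look like this (Host is a Yandex-specific extension, and www.myhost.ru is a placeholder name); robots that follow only the original standard simply skip the unknown Host line:

```
User-agent: Yandex
Disallow: /cgi-bin
Host: www.myhost.ru
```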
On Fri, 2003-01-10 at 18:46, Alexander Melkov wrote:
> The best solution for the webmaster is to disallow indexing of all the secondary mirrors with
> robots.txt:
> User-agent: *
> Disallow: /
I don't agree. There is value in robots indexing the mirrors,
so that the mirrors can be found, and so that the resulting load
can be shared across servers.
I think a better solution is to use a LINK tag that allows
a document to define some other document as its "original".
Then your robot could verify that, and decide how to display
it, say by listing the original first, and copies below it.
Or if you didn't want to be that comprehensive, you could
just follow that link tag, and index only the original instead.
No such LINK tag "rel" attribute exists at the moment, but:
- any HTML author can add it, without having to be webmasters
(or programmers)
- adding it doesn't break any existing standard.
- the w3c is presumably quite in favour of LINK
- there is a process of having this adopted as a W3C
recommendation, that people might actually pay attention to.
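A minimal sketch of what such a tag could look like, assuming a hypothetical rel value named "original" (no such link type is standardized, and the URL is a placeholder), placed in the HEAD of every mirrored copy:

```html
<head>
  <!-- Hypothetical: "original" is not a registered link type. -->
  <!-- Points from this copy to the document it mirrors. -->
  <link rel="original" href="http://www.myhost.ru/page.html">
</head>
```

A robot could then fetch the referenced URL, verify that its content matches, and either list the original first or index only the original.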
> - but there are problems with virtual hosts and 'domain name parking' on somebody else's IPs. Most
> people have difficulties even with this SSI example for Apache servers:
> <!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->
> User-Agent: *
> Disallow: /
> <!--#endif -->
This sounds like it would achieve what you propose, and is easily
ported to cgi/php/servlets whatever.
If you mean that some people have difficulty applying this to
their server (because they don't have administrative control,
technical know-how, or a server platform that supports it),
then that's more a problem with their servers -- if they
want this, they should make it happen.
> For that reason we are about to introduce an extension to robots.txt, the 'Host:' field. I'll try to
> explain what we've decided. Is there anything that is very wrong?
>
> The idea is simple: a webmaster can disallow all addresses except his main mirror with a single Host
> directive.
> Disallow: /forum
> Disallow: /cgi-bin
> Host: www.myhost.ru
> The value of the Host field is not really a host name but a network location, i.e. host:port, where
> port 80 is assumed by default. If our robot does not recognize this value as a correct location (i.e.
> the host name violates RFC 952, RFC 921, etc., or the port is 0, or the top-level domain of the host
> name doesn't exist), it ignores the line. Multiple Host lines are also allowed.
> If there is at least one correct Host line in the record, our robot matches the current host name and
> port against each correct Host line. If none of them match, a "Disallow: /" line is implied at the
> end of the record (otherwise the record has no extra effect).
Some comments:
- This doesn't help you with mirrors you don't control --
if a 3rd party copies your documents, without the robots.txt file,
then these copies still get indexed. (With the LINK idea
this isn't a problem).
- people will be confused by this; they're already confused
enough (by where to put the file, how records are separated,
how the pattern matching works etc).
- you're not going to get all robots to adopt this just because
you do, and having lots of inconsistent extensions to the
robots.txt doesn't help anyone.
- As you say, a server can already change its rules based on
the hostname used. Adding additional mechanisms then seems
somewhat superfluous.
- New fields in robots.txt may well cause some robots to break.
- I am no longer involved with robots, and have no interest
in extending the robots.txt standard. The HTML 4 spec
mentions robots.txt, so the W3C is probably the best organisation
to take it forward.
Personally I think this feature would only address the problems
of very few people, in an inadequate way, while complicating
the lives of all other webmasters (and robot writers) in the world,
and causing interoperability problems. So, no, I don't support it.
Regards,
-- Martijn
Hello Martijn!
I'm one of the developers of the Russian search engine Yandex (www.yandex.ru). To keep our robot's
indexing fast, our search database clean, and our link-popularity calculation correct, we maintain a
database of mirror hosts.
When you have a number of mirrors, you have to choose one of them as the main one. The fact is that no
automatic algorithm can guess which mirror any particular webmaster actually considers the main
mirror of his site.
The best solution for the webmaster is to disallow indexing of all the secondary mirrors with
robots.txt:
User-agent: *
Disallow: /
- but there are problems with virtual hosts and 'domain name parking' on somebody else's IPs. Most
people have difficulties even with this SSI example for Apache servers:
<!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->
User-Agent: *
Disallow: /
<!--#endif -->
For that reason we are about to introduce an extension to robots.txt, the 'Host:' field. I'll try to
explain what we've decided. Is there anything that is very wrong?
The idea is simple: a webmaster can disallow all addresses except his main mirror with a single Host directive.
Disallow: /forum
Disallow: /cgi-bin
Host: www.myhost.ru
The value of the Host field is not really a host name but a network location, i.e. host:port, where
port 80 is assumed by default. If our robot does not recognize this value as a correct location (i.e.
the host name violates RFC 952, RFC 921, etc., or the port is 0, or the top-level domain of the host
name doesn't exist), it ignores the line. Multiple Host lines are also allowed.
If there is at least one correct Host line in the record, our robot matches the current host name and
port against each correct Host line. If none of them match, a "Disallow: /" line is implied at the
end of the record (otherwise the record has no extra effect).
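The validation and matching rules above can be sketched in Python. This is my own illustrative reading of the rules, not Yandex's actual code: the host-name check is a rough RFC 952-style pattern, and the "top-level domain exists" test is approximated by merely requiring a dot in the name.

```python
# Illustrative sketch of the proposed Host-line semantics (assumed names,
# not Yandex's implementation).
import re

# Rough RFC 952-style host-name pattern: dot-separated alphanumeric labels,
# hyphens allowed inside a label.
HOST_RE = re.compile(
    r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$',
    re.IGNORECASE)

def parse_host_line(value):
    """Parse 'host' or 'host:port'; return (host, port) or None if invalid."""
    host, _, port_str = value.partition(':')
    try:
        port = int(port_str) if port_str else 80   # port 80 assumed by default
    except ValueError:
        return None                                # non-numeric port: ignore line
    # '.' in host crudely stands in for "the TLD exists".
    if port == 0 or '.' not in host or not HOST_RE.match(host):
        return None                                # invalid line is ignored
    return (host.lower(), port)

def implied_disallow_all(record_host_values, current_host, current_port=80):
    """True if the record implies 'Disallow: /' for the current mirror."""
    valid = [p for p in map(parse_host_line, record_host_values) if p]
    if not valid:                  # no correct Host line: record unchanged
        return False
    return (current_host.lower(), current_port) not in valid
```

So for a record containing `Host: www.myhost.ru`, a request served as `mirror.myhost.ru` would get an implied `Disallow: /`, while `www.myhost.ru` itself would not.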
Sorry for the off-topic.
Gray, we're talking about the images in the directory
Age,
> I see, we'll try serving the images through a script.
Hey, wait! What do you have mod_expires for?
The Expires header tells the user agent (a browser or a robot) that, up to the specified date, it may assume the document has not changed.
Accordingly, out of consideration for your users, it is useful to send this header for static documents, especially images.
In the case of MSIE (with default settings) this would, for example, avoid a pile of requests for images on the first visit to a site after the program starts (provided, of course, the cache had enough room for those images :)). gray, this applies to you too :).
As for the yandex.ru robot, as far as I know it does not use Expires, i.e. a webmaster cannot change the robot's crawl order for his site by means of meta tags or HTTP headers.
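For instance, a minimal Apache configuration using mod_expires for static images might look like this (the one-week lifetime and the listed image types are arbitrary assumptions):

```
# Assumes mod_expires is loaded into Apache.
<IfModule mod_expires.c>
    ExpiresActive On
    # Clients may treat cached images as fresh for a week after fetching.
    ExpiresByType image/gif  "access plus 7 days"
    ExpiresByType image/jpeg "access plus 7 days"
    ExpiresByType image/png  "access plus 7 days"
</IfModule>
```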
euhenio,
> that indirect ones are counted too
Some of them are taken into account in link ranking. I won't say which ones; feel free to guess :).
They are not counted when calculating VIC (the weighted citation index).
First, of all those directories only "www.km.ru/estate/" has been indexed.
Second, "unmerged" mirrors can noticeably increase the CI (citation index), but not the VIC.
Moreover, if a *.km.ru project is not listed in the Yandex catalog, its content will not affect the CI either.
> a site linking "incorrectly" to the president got caught in the nepot filter?
euhenio, something like that. I can't say for sure, as it is outside my area of interest.
> If the Runet considers...
No censorship at all. Say, if the newspaper AiF moves an article from the fourth page to the 18th, is that censorship? :)
mager, let me tell you a secret: manual editing of Yandex's results for a specific query is technically impossible, and implementing such a service would be hard.
So far, the manipulations of the VIC of linking sites, already discussed here in detail, have been quite sufficient, something you yourself managed to run into recently.