melkov

Rating
57
Registered
25.01.2001
Position
postgraduate student (DMMC), yandex.ru programmer
Interests
search engine(s), 3d engines

AiK, nothing stops you from writing "User-agent: Yandex" in the group that contains the Host directive :).

Besides, your proposal is definitely unacceptable:

1) It violates the robots.txt standard: the Disallow directive has a single, unambiguous interpretation.

Note that robots which follow the standard are expected to simply ignore any line starting with 'Host'.

2) It would not let you block a mirror you don't even know about (say, someone has "parked" an unneeded host of theirs on your IP).

On Fri, 2003-01-10 at 18:46, Alexander Melkov wrote:

> The best solution for the webmaster is to disallow indexing of all the secondary mirrors with

> robots.txt:

> User-agent: *

> Disallow: /

I don't agree. There is value in robots indexing the mirrors,

so that the mirrors can be found, and so that the resulting load

can be shared across servers.

I think a better solution is to use a LINK tag that allows

a document to define some other document as its "original".

Then your robot could verify that, and decide how to display

it, say by listing the original first, and copies below it.

Or if you didn't want to be that comprehensive, you could

just follow that link tag, and index only the original instead.

No such LINK tag "rel" attribute exists at the moment, but:

- any HTML author can add it, without having to be a webmaster

(or a programmer)

- adding it doesn't break any existing standard.

- the W3C is presumably quite in favour of LINK

- there is a process of having this adopted as a W3C

recommendation, that people might actually pay attention to.
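A minimal sketch of such markup, say in the <head> of a page served from a mirror (the rel value "original" is hypothetical; as noted above, no such value exists yet, and the URL reuses the www.myhost.ru example from later in this thread):

<link rel="original" href="http://www.myhost.ru/page.html">

A robot that understood the tag could fetch the target, verify that the two documents really match, and then list or index the original accordingly.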

> - but there are problems with virtual hosts and 'domain name parking' to somebody else's IPs. Most

> people have difficulty even with this SSI example for Apache servers:

> <!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->

> User-Agent: *

> Disallow: /

> <!--#endif -->

This sounds like it would achieve what you propose, and is easily

ported to cgi/php/servlets whatever.
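For instance, a rough CGI port in Python might look like this (my sketch, not something from the thread; it assumes the server maps /robots.txt to the script, and reuses www.main_name.ru from the SSI example above):

#!/usr/bin/env python
# Serve robots.txt dynamically: forbid indexing whenever the
# request arrived at anything other than the main mirror.
import os

MAIN_HOST = "www.main_name.ru"  # the main host from the SSI example

print("Content-Type: text/plain")
print()
if os.environ.get("HTTP_HOST", "").lower() != MAIN_HOST:
    print("User-agent: *")
    print("Disallow: /")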

If you mean that some people have difficulty applying this to

their server (because they don't have administrative control,

technical know-how, or a server platform that supports it),

then that's more a problem with their servers -- if they

want this, they should make it happen.

> For that reason we are about to introduce an extension to robots.txt, the 'Host:' field. I'll try to

> explain what we've decided. Is there anything that is very wrong?

>

> The idea is simple: a webmaster can disallow all addresses except his main mirror with a single Host

> directive.

> User-Agent: *

> Disallow: /forum

> Disallow: /cgi-bin

> Host: www.myhost.ru

>

> The value of the Host field is not actually a host but a network location, i.e. host:port, where port 80 is

> assumed by default. If our robot does not recognize this value as a correct location (e.g. the host

> name violates RFC 952, 921, etc., or port=0, or the top-level domain of the host name doesn't

> exist), it ignores the line. Also, multiple Host lines are allowed.

>

> If there is at least one correct Host line in the record, our robot matches the current host name and port

> against each (correct) Host line. If none of them match, this implies a "Disallow: /" line at the end of the

> record (otherwise, it does nothing).

Some comments:

- This doesn't help you with mirrors you don't control --

if a 3rd party copies your documents, without the robots.txt file,

then these copies still get indexed. (With the LINK idea

this isn't a problem).

- people will be confused by this; they're already confused

enough (by where to put the file, how records are separated,

how the pattern matching works etc).

- you're not going to get all robots to adopt this just because

you do, and having lots of inconsistent extensions to

robots.txt doesn't help anyone.

- As you say, a server can already change its rules based on

the hostname used. Adding additional mechanisms then seems

somewhat superfluous.

- New fields in robots.txt may well cause some robots to break.

- I am no longer involved with robots, and have no interest

in extending the robots.txt standard. The HTML 4 spec

mentions robots.txt, so the W3C is probably the best organisation

to take it forward.

Personally I think this feature would only address the problems

of very few people, in an inadequate way, while complicating

the lives of all other webmasters (and robot writers) in the world,

and causing interoperability problems. So, no, I don't support it.

Regards,

-- Martijn

Hello Martijn!

I'm one of the developers of the Russian search engine Yandex (www.yandex.ru). To keep our robot's

indexing fast, our search database clean, and our link popularity calculation correct, we maintain a

database of mirror hosts.

When you have a number of mirrors, you have to choose one of them as the main one. The fact is that no

automatic selection algorithm can guess which mirror any particular webmaster really considers the

main mirror of his site.

The best solution for the webmaster is to disallow indexing of all the secondary mirrors with

robots.txt:

User-agent: *

Disallow: /

- but there are problems with virtual hosts and 'domain name parking' to somebody else's IPs. Most

people have difficulty even with this SSI example for Apache servers:

<!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->

User-Agent: *

Disallow: /

<!--#endif -->

For that reason we are about to introduce an extension to robots.txt, the 'Host:' field. I'll try to

explain what we've decided. Is there anything that is very wrong?

The idea is simple: a webmaster can disallow all addresses except his main mirror with a single Host directive.

User-Agent: *

Disallow: /forum

Disallow: /cgi-bin

Host: www.myhost.ru

The value of the Host field is not actually a host but a network location, i.e. host:port, where port 80 is

assumed by default. If our robot does not recognize this value as a correct location (e.g. the host

name violates RFC 952, 921, etc., or port=0, or the top-level domain of the host name doesn't

exist), it ignores the line. Also, multiple Host lines are allowed.

If there is at least one correct Host line in the record, our robot matches the current host name and port

against each (correct) Host line. If none of them match, this implies a "Disallow: /" line at the end of the

record (otherwise, it does nothing).
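In Python, the matching logic just described might be sketched like this (the helper names and the simplified validity checks are mine, not Yandex's actual code):

# Sketch of the Host-line handling described above.
def parse_host_line(value):
    # Split "host:port" into its parts; port 80 is assumed by default.
    host, _, port_s = value.strip().partition(":")
    try:
        port = int(port_s) if port_s else 80
    except ValueError:
        return None
    # The real checks would also verify RFC 952 host syntax, a known TLD, etc.
    if not host or port == 0:
        return None
    return (host.lower(), port)

def allowed_by_host_lines(current_host, current_port, host_lines):
    # With no correct Host line the record is left as it is; otherwise a
    # non-matching host:port implies "Disallow: /" for this record.
    valid = [loc for loc in map(parse_host_line, host_lines) if loc]
    if not valid:
        return True
    return (current_host.lower(), current_port) in valid

For example, with "Host: www.myhost.ru" in the record, a request arriving as mirror.myhost.ru:80 would match no Host line, so the record would effectively end in "Disallow: /".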

Sorry for the off-topic.

Gray, we are talking about images in a directory

Age,

> I see, we'll try serving the images through a script.

Whoa, hold on! What do you think mod_expires is for?

The Expires header tells the user agent (a browser or a robot) that it may assume the document is unchanged up until the specified date.

Accordingly, out of consideration for your users, it is useful to send this header for static documents, especially images.
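For example, a minimal mod_expires sketch for an Apache server (the one-month lifetime is my arbitrary choice), in httpd.conf or a .htaccess file:

ExpiresActive On
# let clients treat cached images as fresh for a month after access
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"

With something like this, a browser or robot that already holds a copy of an image will not even ask the server about it until the date in the Expires header.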

In the case of MSIE (with its default settings) this would, for example, avoid a pile of image requests on the first visit to a site after the browser starts (provided, of course, the cache had enough room for those images :)). gray, this applies to you too :).

As for the yandex.ru robot, as far as I know it does not use Expires, i.e. a webmaster cannot change the order in which the robot crawls his site by means of meta tags and HTTP headers.

euhenio,

> that indirect ones are also taken into account

Some of them are taken into account in link-based ranking. Which ones, I won't tell; you are free to guess :).

They are not taken into account when calculating ВИЦ (the weighted citation index).

First, of all these directories, only "www.km.ru/estate/" is indexed.

Second, "unmerged" mirrors can noticeably increase the ИЦ (citation index), though not the ВИЦ.

Moreover, if a *.km.ru project is not listed in the Yandex catalog, its contents will not affect the ИЦ either.

> "неправильно" ссылающийся на президента, попал в непот-фильтр?

euhenio, что-то вроде того. Точно сказать не могу, т.к. это вне зоны моих интересов.

> If the Runet thinks...

No censorship whatsoever. Say, if the newspaper AiF moves an article from page four to page 18, is that censorship? :)

mager, I'll let you in on a secret: manually editing Yandex's results for a specific query is technically impossible, and implementing such a service would be hard.

So far, the already thoroughly discussed manipulations of the ВИЦ of linking sites have been quite enough, something you yourself happened to run into just recently.
