TLink
For further assistance, please check the forum.
I. Presentation
TLink is an engine capable of browsing the WWW. It parses pages, extracts the links they contain, and then follows those links. In this way TLink can explore the WWW.
Of course, TLink does more than browse the WWW. It can perform various tasks, such as detecting dead links.
TLink is designed for moderate-sized jobs. As a rule, a job on a site should not exceed about 100,000 pages. The aim of TLink is to perform diverse tasks on well-defined parts of the WWW (browsing one web site, for example). However, TLink is highly customizable, which allows it to perform tasks outside its original design context. Even so, do not expect to turn TLink into a kind of Googlebot.
TLink works with a plugin system. You can create or download plugins that perform almost any task. The most obvious task is the detection of dead links, but you can also create plugins that track a certain kind of URL, or a plugin that builds a map of the browsed site. You then just have to configure TLink to use the plugin you have made or downloaded.
Plugin creation has been simplified as much as possible. If you have basic knowledge of C++ (and know how to create DLLs), you should be able to write a plugin in a few hours. This lets you customize TLink so that it can perform any task you want.
TLink has been designed so that it can be called by external programs, such as a script or a task scheduler. To that end, the whole configuration of TLink is done through configuration files (in text format). A script can therefore generate an option file and then call TLink without any user intervention.
II. Functioning
1. Outline
In very general terms, TLink browses the web from one page to another. On each browsed page, TLink retrieves the links and adds them to its list of links to visit. When a page is parsed, TLink sends some information about this page (URL of the page, status code of the HTTP header...) to the plugin, which performs the requested task. The work done by the plugin can vary widely. When the work is finished (all the links have been browsed), the plugin is called one more time in order to generate the output files (report, results, errors...).
The problem is that browsing a site could quickly turn into browsing the whole WWW. To avoid that, you must give TLink restrictions indicating which sites it can browse and which it cannot. While navigating, TLink avoids sites that are not explicitly allowed. In this way, you can keep TLink in a restricted area, your web site for example.
TLink can, however, be configured to test the links it is not allowed to browse. If TLink finds a link to an external site, it just tests whether the page is valid and calls the plugin. It does not retrieve the links in the page. This is a way to test dead links (even if they are outside the browsed site) without browsing the entire web.
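The browsing loop described above can be sketched in a few lines of Python. This is only an illustration and not TLink's actual implementation: the in-memory `FAKE_WEB` table, the `is_included` stand-in for the restrictions, and the `crawl` function are all hypothetical names, and the "plugin" is reduced to a list of collected results.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (HTTP status, links found on the page).
FAKE_WEB = {
    "http://site.example/": (200, ["http://site.example/a", "http://other.example/x"]),
    "http://site.example/a": (200, ["http://site.example/", "http://site.example/missing"]),
    "http://site.example/missing": (404, []),
    "http://other.example/x": (200, ["http://other.example/y"]),
}


def is_included(url):
    # Stand-in for the restrictions: only site.example may be browsed.
    return url.startswith("http://site.example/")


def crawl(start_urls, test_restricted=True):
    """Return (url, status, parsed) tuples, as a plugin might receive them."""
    to_visit = deque()
    seen = set()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            to_visit.append(url)
    report = []
    while to_visit:
        url = to_visit.popleft()
        included = is_included(url)
        if not included and not test_restricted:
            continue  # restricted links are simply ignored
        status, links = FAKE_WEB.get(url, (404, []))
        report.append((url, status, included))  # "call the plugin"
        if included:
            # Only included pages are parsed for new links.
            for link in links:
                if link not in seen:
                    seen.add(link)
                    to_visit.append(link)
    return report
```

In this sketch the external page `http://other.example/x` is tested but its own links are never followed, which is how dead external links can be detected without leaving the site.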
In order to improve speed, TLink can browse more than one page at the same time. The problem is that too many simultaneous requests can cause errors on some sites (for example sites that use databases). To avoid that, TLink lets you specify the maximum number of simultaneous connections allowed, depending on the type of page. For example, you can browse PHP, ASP or HTML pages slowly and images quickly. In this way you will not get errors, because only pages can use databases. This system permits very fast browsing without errors. Navigation speed can thus be increased by 30 to 60%.
2. Navigation
During navigation, TLink uses a link manager. This manager remembers the pages already browsed and the pages still to browse, which allows TLink to browse the WWW without visiting the same page twice. You can specify the maximum number of links that the link manager will remember. If this maximum is reached and new links are found, they are ignored.
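As a rough illustration of this behaviour, here is a minimal link manager in Python. The class name `LinkManager`, its methods and the `max_links` argument are assumptions made for the sketch; TLink's internal design is not documented here.

```python
from collections import deque


class LinkManager:
    """Hypothetical link manager: remembers every URL seen, up to max_links."""

    def __init__(self, max_links):
        self.max_links = max_links
        self._known = set()      # URLs already browsed or waiting to be browsed
        self._pending = deque()  # URLs still to browse, in discovery order

    def add(self, url):
        """Queue url; return False for a duplicate or when the limit is reached."""
        if url in self._known or len(self._known) >= self.max_links:
            return False
        self._known.add(url)
        self._pending.append(url)
        return True

    def next(self):
        """Return the next URL to browse, or None when nothing is left."""
        return self._pending.popleft() if self._pending else None
```

Keeping browsed and pending URLs in the same set is what prevents a page from being visited twice, and it also makes the memory limit easy to enforce.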
To let TLink begin navigating, you must pass at least one link to the link manager. This is the only URL you have to pass explicitly to TLink, and TLink will start navigating from it. It is also possible to pass several links explicitly to the link manager. For example, you can create a list of pages to test and then pass all of them explicitly. This is very useful if the pages are not linked together.
However, keep in mind that even if these URLs are passed explicitly, they are still subject to the restrictions. A URL can be passed explicitly to the link manager and nevertheless be restricted. In that case, the page will either not be browsed or just be tested (without retrieving new links), depending on the configuration you have chosen.
3. Restrictions
In order to keep TLink in a restricted work area, you must use restrictions. There are two types of links: included links, which are in the work area, and restricted links, which are not allowed by the restrictions. If a URL is included, the page is parsed, and the links it contains are retrieved and added to the link manager. If a page is restricted, you have two choices (configurable in the option file): the page can simply be ignored, or it can just be tested (without downloading the page content). By default, a link is always considered restricted unless a rule explicitly authorizes it.
The rules are defined on URLs only. To include a link in the work area, you must create a rule that authorizes it. You can also define rules that restrict (or include) a set of links. Four types of rules are available (defined below). When a new link is found, TLink tests the rules in order (the order in which they are defined in the option file). The last rule that matches the link determines whether the link is included or restricted, so the order in which you enter the rules in the configuration file is important.
Here are the different rules that can be used :
Restriction_Domain
Includes all links whose domain is exactly the specified string.
Example : You have defined a rule Domain "www.domaine.com"
The following links will be included :
The following links will be restricted :
Restriction_DomainEnd
Includes all links whose domain ends with the specified string.
Example : You have defined a rule DomainEnd ".domaine.com"
The following links will be included :
The following links will be restricted :
Restriction_UrlBegin
Includes all links whose URL begins with the specified string.
Example : You have defined a rule UrlBegin "www.domaine.com/dossier"
The following links will be included :
The following links will be restricted :
Restriction_NUrlBegin
Restricts all links whose URL begins with the specified string.
Example : You have defined a rule NUrlBegin "http://www.domaine.com/forum"
The following links will be included :
The following links will be restricted :
You must therefore systematically create the appropriate rules so that all the links of the web site you want to browse are included by the restrictions. Keep in mind that if no rule matches, the link is considered restricted.
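The "last matching rule wins, restricted by default" logic can be sketched as follows. This is an illustration under assumptions, not TLink's code: the function names are hypothetical, the domain is taken to be the URL's host name, and the UrlBegin/NUrlBegin rules are compared as plain string prefixes against the full URL (including the scheme), which may differ from what TLink actually does.

```python
from urllib.parse import urlsplit


def _domain(url):
    # Host name of the URL; a scheme is assumed when none is given.
    return urlsplit(url if "://" in url else "http://" + url).hostname or ""


def _matches(rule_type, value, url):
    if rule_type == "Restriction_Domain":
        return _domain(url) == value
    if rule_type == "Restriction_DomainEnd":
        return _domain(url).endswith(value)
    if rule_type in ("Restriction_UrlBegin", "Restriction_NUrlBegin"):
        return url.startswith(value)  # assumption: plain prefix on the full URL
    raise ValueError("unknown rule type: " + rule_type)


def is_included(rules, url):
    """rules is an ordered list of (rule_type, value); the last match decides."""
    included = False  # no matching rule means the link is restricted
    for rule_type, value in rules:
        if _matches(rule_type, value, url):
            # NUrlBegin restricts; the three other rule types include.
            included = rule_type != "Restriction_NUrlBegin"
    return included
```

With a Domain rule for `www.domaine.com` followed by an NUrlBegin rule for `http://www.domaine.com/forum`, forum pages come out restricted; with the two rules in the opposite order they would be included, because the Domain rule would then be the last one to match.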
If you simply want to pass TLink a list of links to test, define no rules at all and pass all the links to TLink explicitly in the option file. In this way, all the links will be tested but TLink will not retrieve any other link.
If you want to browse a whole site except one part, you can use NUrlBegin restrictions. For example, if you want to browse the whole domain www.domaine.com while avoiding the forum, you just have to use the following rules (in this order) :
4. Links classes
Some web sites return error messages if you browse too many pages at the same time. This is the case if the site uses a database (a forum, for example). The number of simultaneous connections allowed on databases is often limited. If you make too many simultaneous connections, you may get error pages instead of the real site pages. In the same way, many public FTP servers limit the number of simultaneous connections allowed from the same IP address.
In contrast, static pages (generally .HTML), images, PDF files... can generally be browsed quickly, with a large number of simultaneous connections.
TLink introduces a system with three classes of links. Each class is browsed at a different speed, and you can configure the speed of each class separately. The class of a page is determined from its URL (often from the extension).
Here are the 3 classes of links :
TLink uses the protocol (HTTP or FTP) specified in the URL to determine whether the link is class 3. If the link is not class 3, TLink uses the extension to try to guess whether it is class 1 or 2. A link is always considered class 1 unless its extension matches a class 2 extension. If you want, you can define your own class 2 extensions (TLink also ships with a built-in list of predefined class 2 extensions).
By default, all restricted pages are considered class 2 (because their content is not downloaded). However, you can let TLink use the standard rules to determine the class of restricted pages.
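The classification steps can be sketched in Python as below. Several points are assumptions made for illustration: class 3 is taken to mean FTP links, the extension set is an invented sample rather than TLink's predefined list, and the restricted-page check is applied before the protocol check.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative sample only; TLink's predefined class 2 list is not reproduced here.
CLASS2_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf", ".css"}


def link_class(url, restricted=False, restricted_as_class2=True):
    """Return 1, 2 or 3 for the given URL."""
    if restricted and restricted_as_class2:
        return 2  # default: restricted pages are only tested, not downloaded
    parts = urlsplit(url)
    if parts.scheme.lower() == "ftp":
        return 3  # assumption: class 3 corresponds to the FTP protocol
    extension = posixpath.splitext(parts.path)[1].lower()
    return 2 if extension in CLASS2_EXTENSIONS else 1
```

Anything that is not recognised falls back to class 1, which matches the rule that a link is class 1 unless its extension says otherwise.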
You can then define separately the number of simultaneous connections used for class 1, 2 and 3 links.
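One common way to enforce a separate connection limit per class is a semaphore per class. The sketch below shows that technique in Python; it is not taken from TLink, and `ClassLimiter`, `fake_fetch`, `browse_all` and the limits used are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ClassLimiter:
    """Caps the number of simultaneous fetches for each link class."""

    def __init__(self, limits):
        # limits maps a class number to its maximum simultaneous connections.
        self._slots = {c: threading.BoundedSemaphore(n) for c, n in limits.items()}
        self._lock = threading.Lock()
        self._active = {c: 0 for c in limits}
        self.peak = {c: 0 for c in limits}  # highest concurrency observed per class

    def run(self, link_class, fetch, url):
        with self._slots[link_class]:  # blocks while the class is at its limit
            with self._lock:
                self._active[link_class] += 1
                self.peak[link_class] = max(self.peak[link_class], self._active[link_class])
            try:
                return fetch(url)
            finally:
                with self._lock:
                    self._active[link_class] -= 1


def fake_fetch(url):
    time.sleep(0.01)  # stands in for a network request
    return url


def browse_all(limiter, jobs, workers=8):
    """jobs is a list of (link_class, url); returns the fetched URLs in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: limiter.run(job[0], fake_fetch, job[1]), jobs))
```

With limits such as `{1: 2, 2: 6, 3: 1}`, dynamic pages never see more than two requests at once while images can still be fetched six at a time.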
Although it is not recommended, you can disable the classification system. In this case, all links are considered class 1, and you will have to choose a number of simultaneous connections that never causes errors, whatever the content.
III. Configuration
1. Option file
TLink is configured with an option file. You must call tlink with the path of the option file as its parameter. The path must be fully qualified and must not be placed in quotes, even if it contains spaces.
Example :
The option file must be in text format (UNIX or DOS line endings). The extension is ignored, so you can create .cfg, .config... configuration files if you prefer.
The general format of an entry in the configuration file is the following :
Parameter is a parameter name
The nature of Value depends on Parameter (it can be a number, a string, ...)
If Value contains spaces, you must quote it (with single or double quotes).
To add comments, use a #. This symbol can be used at the beginning of a line or in the middle. When TLink encounters this symbol, it simply ignores the rest of the line.
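To make the quoting and comment rules concrete, here is a small Python sketch of how one line of such a file could be read. It assumes that the parameter name and its value are separated by whitespace, which is a guess and not something confirmed by this documentation; `parse_option_line` is a hypothetical helper, not part of TLink.

```python
def parse_option_line(line):
    """Return (parameter, value), or None for a blank or comment-only line."""
    # Everything from the first # onward is ignored, as described above.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    # Assumption: the name and the value are separated by whitespace.
    parts = line.split(None, 1)
    name = parts[0]
    value = parts[1].strip() if len(parts) > 1 else ""
    # Remove one pair of matching single or double quotes around the value.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return name, value
```

Because the rest of the line is dropped at the first #, a value containing that character (a URL fragment, for instance) would be cut short in this sketch, even inside quotes.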
Example :
2. Base configuration
Here is a list of usable parameters :
3. Restrictions
The restrictions are also defined in the configuration file. To add new restrictions, use the following syntax :
Where Type_Restriction can be : Restriction_Domain, Restriction_DomainEnd, Restriction_UrlBegin or Restriction_NUrlBegin.
Example : to add a rule that would authorise the domain www.domaine.com, use :
The order of the restrictions is important: they are evaluated in the order in which they appear in the configuration file.
4. Explicit links
You must give TLink at least one link so that it can start navigating. Use the following syntax :
Example :
You can pass TLink as many links as necessary. These links are subject to the restrictions you have defined.
IV. Examples of use
Here are some examples of configuration files
You want to browse a whole site :
You want to browse the whole site except the forum :
You want to browse just a part of the site :
You want to test a list of links (without browsing) :