
TLink



For further assistance, please check the forum.
I. Presentation
TLink is an engine that browses the WWW. It parses pages, extracts the links they contain and then follows those links, which lets it explore the web.
Of course, TLink does more than just 'browse' the WWW. It can perform different tasks, such as detecting dead links.
TLink is designed for moderate-sized tasks. As a rule, a job on one site should not exceed about 100,000 pages. The aim of TLink is to perform diverse tasks on well-defined parts of the WWW (browsing a single web site, for example). TLink is highly customizable, however, which allows it to perform tasks outside its original design scope. Still, do not expect to turn TLink into a kind of Googlebot.
TLink works with a system of plugins. You can create or download plugins that perform almost any task. The most obvious task is the detection of dead links, but you can also create a plugin that tracks a certain kind of URL, or one that builds a map of the browsed site. You then just configure TLink to use the plugin you have made or downloaded.
Creating plugins has been made as simple as possible. If you have basic knowledge of C++ (and know how to create DLLs), you should be able to write a plugin in a few hours. This lets you customize TLink so that it can perform any task you want.
TLink is designed to be called by external programs, such as a script or a task scheduler. To that end, the whole configuration of TLink can be done with configuration files (in text format). A script can therefore generate an option file and then call TLink without any user intervention.
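As an illustration of the scripted use described above, here is a minimal Python sketch that builds the text of an option file and the command line used to launch TLink. The function names `build_option_file` and `build_command` are hypothetical helpers, not part of TLink; the parameter names written to the file are the ones documented in the Configuration section.

```python
def build_option_file(start_url, domain, plugin="default.dll",
                      output_dir=r"c:\mydocs", max_links=10000):
    """Return the text of an option file that browses a single domain."""
    lines = [
        f"MaxLinks={max_links}",
        "AutoExit=on",  # do not wait for a key press when an error occurs
        f'Plugin="{plugin}"',
        f'PluginOutputDirectory="{output_dir}"',
        f'Restriction_Domain="{domain}"',
        f'Link="{start_url}"',
    ]
    return "\n".join(lines) + "\n"


def build_command(tlink_exe, option_path):
    """Return the command line; the option file path is passed unquoted."""
    return f"{tlink_exe} {option_path}"
```

A script would write the returned text to a file, then run the command line returned by `build_command` with the fully qualified path of that file.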
II. Functioning
1. Outline
In general terms, TLink browses the web from one page to the next. On each page it visits, TLink retrieves the links and adds them to its list of links to visit. Once a page is parsed, TLink sends some information about it (URL of the page, HTTP status code...) to the plugin, which performs the requested task. The work done by the plugin can vary widely. When the job is finished (all links have been browsed), the plugin is called one more time to generate the output files (report, results, errors...).
The problem is that browsing one site could quickly turn into browsing the whole WWW. To avoid that, you must give TLink restrictions that indicate which sites it may browse and which it may not. While navigating, TLink avoids sites that are not explicitly allowed. In this way, you can keep TLink within a restricted area, your own web site for example.
TLink can, however, be configured to test the links it is not allowed to browse. If TLink finds a link to an external site, it only tests whether the page is valid and calls the plugin; it does not retrieve the links in that page. This is a way to detect dead links (even those pointing outside the browsed site) without browsing the entire web.
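The loop described above can be sketched roughly as follows. This is a simplified, single-threaded illustration in Python, not TLink's actual code: the `fetch` callable, the `is_included` callable and the plugin methods `on_page` and `finish` are all hypothetical names chosen for the example (real TLink plugins are C++ DLLs).

```python
from collections import deque


def crawl(start_urls, fetch, is_included, plugin, test_restricted=True):
    """Breadth-first walk; `fetch(url)` returns (status_code, links_on_page)."""
    queue = deque(dict.fromkeys(start_urls))  # explicit links, without duplicates
    known = set(queue)
    while queue:
        url = queue.popleft()
        included = is_included(url)
        if not included and not test_restricted:
            continue  # restricted links are simply ignored
        status, links = fetch(url)
        plugin.on_page(url, status)  # hypothetical per-page plugin callback
        if included:
            # Only included pages contribute new links to visit.
            for link in links:
                if link not in known:
                    known.add(link)
                    queue.append(link)
    return plugin.finish()  # hypothetical final call that produces the output
```

In this sketch a restricted page is still fetched in order to be tested, but its links are never followed, which is what keeps the walk inside the allowed area.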
To improve speed, TLink can browse more than one page at the same time. The problem is that too many simultaneous requests can cause errors on some sites (for example, sites that use databases). To avoid that, TLink lets you specify the maximum number of simultaneous connexions allowed, depending on the type of page. For example, you can browse PHP, ASP or HTML pages slowly and images quickly. In this way, you avoid errors, because only pages are likely to use a database. This system permits fast browsing without errors. Navigation speed can thus be increased by 30 to 60%.
2. Navigation
During navigation, TLink uses a link manager. This manager keeps track of the pages already browsed and the pages still to browse, which lets TLink browse the WWW without visiting the same page twice. You can specify the maximum number of links the link manager will hold. If this maximum is reached and new links are found, they are ignored.
To let TLink begin navigating, you must pass at least one link to the link manager. It is the only URL you have to pass explicitly to TLink, and TLink starts navigating from it. You can also pass several links explicitly. For example, you can create a list of pages to test and pass them all to the link manager. This is very useful if the pages are not linked together.
Keep in mind, however, that even explicitly passed URLs are subject to the restrictions. A URL can be passed explicitly to the link manager and still be restricted. In that case, the page is either not browsed at all or only tested (without retrieving new links), depending on the configuration you have chosen.
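To make the role of the link manager concrete, here is a small Python sketch of a structure with the same behaviour: it remembers every link it has accepted, never queues the same link twice, and ignores new links once a maximum is reached. The class name `LinkManager` and its methods are assumptions made for this example; TLink's internal implementation is not documented here.

```python
from collections import deque


class LinkManager:
    """Remembers known links and hands out the ones still to visit."""

    def __init__(self, max_links=10000):
        self.max_links = max_links
        self._known = set()      # every link ever accepted
        self._pending = deque()  # links not yet visited

    def add(self, url):
        """Queue `url`; return False if it is a duplicate or the limit is reached."""
        if url in self._known or len(self._known) >= self.max_links:
            return False
        self._known.add(url)
        self._pending.append(url)
        return True

    def next(self):
        """Return the next link to visit, or None when nothing is left."""
        return self._pending.popleft() if self._pending else None
```

Explicit links would simply be passed to `add` before navigation starts, in the same way as links found later on browsed pages.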
3. Restrictions
To keep TLink within a restricted work area, you must use restrictions. We distinguish 2 types of links : included links, which are in the work area, and restricted links, which are not allowed by the restrictions. If a URL is included, the page is parsed and the links it contains are retrieved and added to the link manager. If a page is restricted, you have 2 choices (configurable in the option file) : the page can simply be ignored, or it can just be tested (without downloading the page content). By default, a link is always considered restricted unless a rule explicitly authorizes it.
The rules are defined on URLs only. To include a link in the work area, you must create a rule that authorizes it. You can also define rules that restrict (or include) a set of links. There are 4 types of rules (defined below). When a new link is found, TLink tests the rules in order (the order in which they are defined in the option file). The last rule that matches the link determines whether the link is included or restricted. The order in which you enter the rules in the configuration file is therefore important.
Here are the different rules that can be used :
Restriction_Domain
Include all the links whose domain is exactly the specified string.
Example : You established a rule Domain "www.domaine.com"
The following links will be included :
http://www.domaine.com
http://www.domaine.com/
http://www.domaine.com/index.php
http://www.domaine.com/dossier/
The following links will be restricted :
http://domaine.com
http://prefixe.domaine.com
http://www.autredomaine.com
Restriction_DomainEnd
Include all the links whose domain ends with the specified string.
Example : You established a rule DomainEnd ".domaine.com"
The following links will be included :
http://www.domaine.com/
http://prefixe.domaine.com/dossier
http://prefixe1.prefixe2.domaine.com/index.php
The following links will be restricted :
http://domaine.com
http://www.autredomaine.com
http://www.domaine-autre.com
Restriction_UrlBegin
Include all the links whose URL begins with the specified string.
Example : You established a rule UrlBegin "www.domaine.com/dossier"
The following links will be included :
http://www.domaine.com/dossier
http://www.domaine.com/dossier/index.php
http://www.domaine.com/dossier/sous_dossier/
The following links will be restricted :
http://www.autredomaine.com
http://prefixe.domaine.com/dossier
http://www.domaine.com/index.php
Restriction_NUrlBegin
Restrict all the links whose URL begins with the specified string.
Example : You established a rule NUrlBegin "http://www.domaine.com/forum"
The following links will not be restricted by this rule (they remain included only if another rule includes them) :
http://www.domaine.com
http://www.autredomaine.com
http://www.domaine.com/dossier/
http://www.autredomaine.com/forum
The following links will be restricted :
http://www.domaine.com/forum
http://www.domaine.com/forum/dossier/
You must therefore systematically create the appropriate rules so that every link of the web site you want to browse is included. Keep in mind that if no rule matches, the link is considered restricted.
If you simply want to give TLink a list of links to test, define no rules and pass all the links explicitly in the option file. All of them will then be tested, but TLink will not retrieve any other links.
If you want to browse a whole site except for one part, you can use NUrlBegin restrictions. For example, to browse the whole domain www.domaine.com while avoiding the forum, use the following rules (in this order) :
Domain "www.domaine.com"
NUrlBegin "www.domaine.com/forum"
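The "last matching rule wins, restricted by default" logic can be sketched in Python as below. This is only an illustration of the evaluation order, not TLink's own code. In particular, it assumes that a UrlBegin or NUrlBegin string may be written with or without the `http://` prefix, since the examples in this document use both forms; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def _url_begins(url, prefix):
    # Assumption: the prefix is compared with and without the scheme part.
    without_scheme = url.split("://", 1)[-1]
    return url.startswith(prefix) or without_scheme.startswith(prefix)


def is_included(url, rules):
    """`rules` is an ordered list of (rule_type, string) pairs."""
    host = urlsplit(url).hostname or ""
    included = False  # no matching rule: the link is restricted
    for rule_type, value in rules:
        if rule_type == "Domain" and host == value:
            included = True
        elif rule_type == "DomainEnd" and host.endswith(value):
            included = True
        elif rule_type == "UrlBegin" and _url_begins(url, value):
            included = True
        elif rule_type == "NUrlBegin" and _url_begins(url, value):
            included = False
    return included  # result of the last rule that matched


rules = [("Domain", "www.domaine.com"), ("NUrlBegin", "www.domaine.com/forum")]
print(is_included("http://www.domaine.com/index.php", rules))      # True
print(is_included("http://www.domaine.com/forum/dossier/", rules))  # False
```

Swapping the two rules would make the Domain rule the last match for forum pages, so the forum would be included again; this is why the order matters.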
4. Link classes
Some web sites return error messages if you browse too many pages at the same time. This is the case when the site uses a database (a forum, for example), because the number of simultaneous connexions allowed on a database is often limited. If you open too many simultaneous connexions, you may get error pages instead of the real site pages. Likewise, many public FTP servers limit the number of simultaneous connexions allowed from the same IP.
By contrast, static pages (generally .HTML), images, PDF files... can generally be browsed quickly, with a large number of simultaneous connexions.
TLink introduces a system with 3 classes of links. Each class is browsed at a different speed, and you can configure the speed of each class separately. The class of a page is determined from its URL (often from the extension).
Here are the 3 classes of links :
Class 1
All HTTP dynamic pages (PHP, ASP...)
Class 2
All HTTP static objects (images, files...)
Class 3
Any object located on an FTP server.
TLink uses the protocol (HTTP or FTP) specified in the URL to determine whether the link is class 3. If it is not, TLink uses the extension to guess whether the link is class 1 or 2. A link is always considered class 1 unless its extension matches a class 2 extension. If you want, you can define your own class 2 extensions (TLink also ships with a built-in list of predefined class 2 extensions).
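The classification just described could look like the following Python sketch. The extension set shown here is a made-up example, not TLink's predefined list, and `link_class` is a hypothetical name.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical class 2 extensions, for illustration only.
CLASS2_EXTENSIONS = {".gif", ".jpg", ".png", ".pdf", ".zip"}


def link_class(url, class2_extensions=CLASS2_EXTENSIONS):
    """Return 3 for FTP links, 2 for known static extensions, else 1."""
    parts = urlsplit(url)
    if parts.scheme == "ftp":
        return 3
    extension = posixpath.splitext(parts.path)[1].lower()
    return 2 if extension in class2_extensions else 1


print(link_class("ftp://ftp.domaine.com/pub/file.ext"))  # 3
print(link_class("http://www.domaine.com/logo.png"))      # 2
print(link_class("http://www.domaine.com/index.php"))     # 1
```

A URL without any extension, such as a folder URL, falls back to class 1 in this sketch, in line with the rule that class 1 is the default.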
By default, all restricted pages are considered class 2 (because their content is not downloaded). However, you can let TLink use the standard rules to determine the class of restricted pages.
You can then define separately the number of simultaneous connexions used for class 1, 2 and 3 links.
Although it is not recommended, you can disable the classification system. In that case, all links are considered class 1, and you have to choose a number of simultaneous connexions that never causes errors, whatever the content.
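One common way to enforce a separate limit of simultaneous connexions per class is to give each class its own counting semaphore. The Python sketch below illustrates that general technique; it is not a description of how TLink is implemented, and `ClassLimiter` and `do_fetch` are hypothetical names.

```python
import threading


class ClassLimiter:
    """Caps the number of simultaneous fetches for each link class."""

    def __init__(self, max_threads):
        # max_threads maps a class number to its allowed concurrency,
        # e.g. {1: 1, 2: 5, 3: 1}.
        self._slots = {
            link_class: threading.BoundedSemaphore(limit)
            for link_class, limit in max_threads.items()
        }

    def fetch(self, link_class, do_fetch, url):
        # Blocks while all slots of this class are busy.
        with self._slots[link_class]:
            return do_fetch(url)
```

Worker threads would call `fetch` with the class of each link, so slow dynamic pages and fast static objects are throttled independently.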
III. Configuration
1. Option file
TLink is configured with an option file. You must call tlink with the path of the option file as parameter. The path must be fully qualified and must not be placed in quotes. It can contain spaces.
Example :
tlink.exe c:\nouveau dossier\tlink\myconfig.txt
The option file must be in text format (UNIX or DOS). The extension is ignored, so you can create .cfg, .config... configuration files if you prefer.
The general format of an entry in the configuration file is the following :
Parameter=Value
Parameter is a parameter name.
The nature of Value depends on Parameter (it can be a number, a string, ...).
If Value contains spaces, you must quote it (with single or double quotes).
To add comments, use a #. This symbol can appear at the beginning of a line or in the middle. When TLink encounters it, it simply ignores the rest of the line.
Example :
# This line will be ignored
# For the 3 following lines, quotes are optional
MaxLinks=1000 # Fix the max number of links that will be browsed
MaxLinks='1000'
MaxLinks="1000"
# Here quotes are required
AgentName="TLink (http://www.coldsource.net/projets/tlink/)"
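For scripts that need to read or check an option file, the format above can be parsed with a few lines of Python. This sketch is an assumption-laden reading of the format, not TLink's parser: in particular it assumes that a # inside quotes is part of the value, which this document does not specify. Entries are returned as an ordered list of pairs because some parameters (Link, Restriction_...) can appear several times and their order matters.

```python
def parse_option_line(line):
    """Return (parameter, value), or None for a blank or comment-only line."""
    quote = None
    kept = []
    for ch in line:
        if quote:
            if ch == quote:
                quote = None  # closing quote
        elif ch in "'\"":
            quote = ch        # opening quote
        elif ch == "#":
            break             # comment: ignore the rest of the line
        kept.append(ch)
    text = "".join(kept).strip()
    if not text:
        return None
    name, sep, value = text.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in: {line!r}")
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]   # drop the surrounding quotes
    return name.strip(), value


def parse_option_file(text):
    """Return the entries of an option file as an ordered list of pairs."""
    entries = []
    for line in text.splitlines():
        entry = parse_option_line(line)
        if entry is not None:
            entries.append(entry)
    return entries
```

Since `splitlines` handles both line-ending styles, the same function works for option files saved in UNIX or DOS format.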
2. Base configuration
Here is a list of usable parameters :
DNSExpirationTime
Type : integer
Default value : 60000
TLink uses a DNS cache to speed up navigation. This option lets you specify how long (in milliseconds) an entry is kept in the DNS cache. After this period, TLink queries the DNS server again.
DNSMaxEntries
Type : integer
Default value : 10000
The maximum number of DNS entries that will be cached.
RcvTimeout
Type : integer
Default value : 30000
The receive timeout (in milliseconds) for HTTP connexions.
SndTimeout
Type : integer
Default value : 30000
The send timeout (in milliseconds) for HTTP connexions.
ConnectTimeout
Type : integer
Default value : 20000
The timeout (in milliseconds) when connecting to a remote host, for HTTP connexions.
RcvBuf
Type : integer
Default value : 16384
The buffer size for reception (in bytes) for HTTP connexions.
SndBuf
Type : integer
Default value : 16384
The buffer size for emission (in bytes) for HTTP connexions.
PacketSize
Type : integer
Default value : 1024
The packet size (in bytes).
FtpRcvTimeout
Type : integer
Default value : 30000
The receive timeout (in milliseconds) for FTP connexions.
FtpSndTimeout
Type : integer
Default value : 30000
The send timeout (in milliseconds) for FTP connexions.
FtpConnectTimeout
Type : integer
Default value : 20000
The timeout (in milliseconds) when connecting to a remote host, for FTP connexions.
FtpRcvBuf
Type : integer
Default value : 16384
The buffer size for reception (in bytes) for FTP connexions.
FtpSndBuf
Type : integer
Default value : 16384
The buffer size for emission (in bytes) for FTP connexions.
MaxRedirections
Type : integer
Default value : 30
The maximum number of HTTP redirections allowed before a page is reached. By default, if TLink is redirected from one page to another more than 30 times, it considers the link dead. As long as TLink receives responses with a 3xx code (307, for example), it follows the redirections and tries to reach the final page.
AgentName
Type : string
Default value : "TLink (http://www.coldsource.net/projets/tlink)"
TLink will use this value in the 'Agent-Name' field of HTTP headers.
MaxPageLinks
Type : integer
Default value : 1024
The maximum number of links that will be retrieved from a single page. Any others are ignored.
MaxCookies
Type : integer
Default value : 300
The maximum number of cookies that will be memorized.
MaxLinks
Type : integer
Default value : 10000
The maximum number of links that will be added to the link manager. This is the maximum total number of links that will be browsed or tested. Other links will be ignored.
MaxThreads_Class1
Type : integer
Default value : 1
The number of simultaneous connexions that will be used for class 1 links.
MaxThreads_Class2
Type : integer
Default value : 1
The number of simultaneous connexions that will be used for class 2 links.
MaxThreads_Class3
Type : integer
Default value : 3
The number of simultaneous connexions that will be used for class 3 links.
C2ExtensionsFile
Type : string
Default value : ""
The path of a file containing class 2 extensions. You can use this option to customize the detection of class 2 links.
ForceClass1Links
Type : on | off
Default value : off
Force all links to class 1. When this option is on, link classification is disabled and all links are considered class 1.
RestrictedLinksClass2
Type : on | off
Default value : on
Allow the classification of restricted links as class 2 (independently of the extensions).
DiscardRestrictedLinks
Type : on | off
Default value : off
Tell TLink to ignore restricted links. These links will not even be tested.
AutoExit
Type : on | off
Default value : off
Activate auto exit. If this option is enabled and an error occurs, TLink displays an error message and exits after 5 seconds (the user does not have to press a key). Use this option if you launch TLink from a script or a task scheduler.
Plugin
Type : string
Default value : ""
The name of the plugin that will be used. It's the DLL name (with extension and without path). The DLL must be in the 'Plugins' folder. This parameter is required.
PluginOutputDirectory
Type : string
Default value : ""
This path is passed to the plugin. It is in this folder that new files will be created. If you are using the default.dll plugin, this folder must exist and the last character must not be a \ (ex: c: or c:\folder). This parameter is required.
PluginParameters
Type : string
Default value : ""
This string must not exceed 4095 characters. It is passed to the plugin; its meaning depends on the plugin you are using.
ProxyAddr
Type : IP address
Default value : ""
The IP address of the proxy to use with HTTP connexions
ProxyPort
Type : integer
Default value : 8080
The port used to connect to the proxy
3. Restrictions
The restrictions are also defined in the configuration file. To add new restrictions, use the following syntax :
Type_Restriction="Parameter"
Where Type_Restriction can be : Restriction_Domain, Restriction_DomainEnd, Restriction_UrlBegin or Restriction_NUrlBegin.
Example : to add a rule that would authorise the domain www.domaine.com, use :
Restriction_Domain="www.domaine.com"
The order of the restrictions is important : they are evaluated in the order in which they appear in the configuration file.
4. Explicit links
You must give TLink at least one link so that it can start navigating. Use the following syntax :
Link=URL
Example :
Link="http://www.domaine.com/index.php"
You can pass TLink as many links as necessary. These links are subject to the restrictions you have defined.
IV. Examples of use
Here are some examples of configuration files
You want to browse a whole site :
MaxLinks=30000

MaxThreads_Class1=3
MaxThreads_Class2=5
MaxThreads_Class3=1

Plugin="default.dll"
PluginOutputDirectory="c:\mydocs"

Restriction_Domain="www.domaine.com"
Link="http://www.domaine.com"
You want to browse the whole site except the forum :
MaxLinks=30000

MaxThreads_Class1=3
MaxThreads_Class2=5
MaxThreads_Class3=1

Plugin="default.dll"
PluginOutputDirectory="c:\mydocs"

Restriction_Domain="www.domaine.com"
Restriction_NUrlBegin="www.domaine.com/forum"
Link="http://www.domaine.com"
You want to browse just a part of the site :
MaxLinks=30000

MaxThreads_Class1=3
MaxThreads_Class2=5
MaxThreads_Class3=1

Plugin="default.dll"
PluginOutputDirectory="c:\mydocs"

Restriction_UrlBegin="www.domaine.com/part"
Link="http://www.domaine.com/part/"
You want to test a list of links (without browsing) :
MaxLinks=100

MaxThreads_Class1=3
MaxThreads_Class2=5
MaxThreads_Class3=1

Plugin="default.dll"
PluginOutputDirectory="c:\mydocs"

Link="http://www.domaine.com/page1.php"
Link="http://www.domaine.com/page2.php"
Link="http://www.domaine.com/page3.php"
Link="http://www.autredomaine.com/folder/"
Link="http://www.autredomaine.com/index.php"
Link="ftp://ftp.domaine.com/pub/file.ext"
Copyright © 2005 Coldsource.net