Analysis and classification of web pages using sublinks and tabs
Functionality – website analysis
The implemented fallback of following sublinks when a page contains too little text does not always produce the expected results. Websites are built in countless ways, and although the program handles many page variants when identifying sublinks, it cannot anticipate every way in which authors place resources on a site. The tests below illustrate that, for one website, following links or tabs to collect more words can help.
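The fallback described above can be sketched roughly as follows. This is a minimal illustration and not the program's actual code: the `MIN_WORDS` threshold, the `fetch` callable (any function that maps a URL to an HTML string), and all function names are assumptions introduced here. The sketch visits only same-host links one level deep and uses the Python standard library only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MIN_WORDS = 100  # hypothetical threshold; the source does not state one


class PageParser(HTMLParser):
    """Collects visible words and href targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.split())


def parse_page(html):
    """Return (words, hrefs) extracted from an HTML string."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


def same_site_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique links on the same host."""
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in hrefs:
        url = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(url).netloc == host and url != base_url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_words(base_url, fetch, min_words=MIN_WORDS):
    """Return home-page words, adding sublink text only if the page is too short."""
    words, hrefs = parse_page(fetch(base_url))
    if len(words) >= min_words:
        return words
    for url in same_site_links(base_url, hrefs):
        sub_words, _ = parse_page(fetch(url))
        words.extend(sub_words)
    return words
```

Passing `fetch` as a parameter keeps the sketch testable without network access. In practice it could wrap an HTTP client of choice.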
Site: https://silyzbrojne.pl
Expected Category: Politics, Law and Government Institutions
Topics: armed forces
A table showing the results for the home page, and after going through the tabs (improvement)
Ridge Algorithm:
- Before following sublinks: Politics, Law and Government Institutions: 100%
- Before following sublinks: Media, News and Weather: 96.39%
- After following sublinks: Politics, Law and Government Institutions: 100%
- After following sublinks: Media, News and Weather: 77.17%
Chart generated without going through the sublinks
Graph generated after going through sublinks (improved results for the website)
Unfortunately, following tabs often makes the results worse as well. An example is a link leading to the terms of service or to legal clauses. The wrong category will then most likely be returned, and this is not the fault of the algorithms, which rely on textual input alone.
Site Example: FColumbus
Company Website: https://fcolumbus.pl/
Expected Category: People and Social Media
Topics: personal development
A table showing the results for the home page, and after going through the tabs (deterioration)
Although more words were collected, the algorithms performed worse. The reason is that the analyzed website contains many tabs, which can mislead the algorithms by loading words from different categories. These tabs include: Sponsors, Foundation Statute, Contact.
Chart generated without going through the sublinks
Ridge Algorithm:
- Before following sublinks: Finance, Banking and Insurance: 100% (incorrect)
- Before following sublinks: People and Social Media: 89.44% (expected)
- After following sublinks: Career, Education and Religion: 100% (incorrect)
- After following sublinks: People and Social Media: fell out of the top three categories returned
The red region indicates the expected category, which should reach 100%
Graph generated after going through sub-links (performance degradation)
This situation can be mitigated by defining the names of tabs that the program should not visit, for example: Terms and Conditions, Clause, Contact. The problem cannot be avoided completely, because many websites are built unconventionally and their elements do not fit any well-defined structure.
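One way to express such an exclusion list is sketched below. It is an illustration under stated assumptions, not the program's actual mechanism: the `EXCLUDED_TABS` set, the `AnchorCollector` class, and the `allowed_links` function are hypothetical names, and matching is done on the exact link text after normalizing whitespace and letter case.

```python
from html.parser import HTMLParser

# Hypothetical exclusion list using the tab names given as examples in the text.
EXCLUDED_TABS = {"terms and conditions", "clause", "contact"}


class AnchorCollector(HTMLParser):
    """Collects (link text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None  # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so " Contact " matches "contact".
            label = " ".join("".join(self._text).split())
            self.anchors.append((label, self._href))
            self._href = None


def allowed_links(html, excluded=EXCLUDED_TABS):
    """Return hrefs whose link text is not on the exclusion list."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [href for label, href in collector.anchors
            if label.casefold() not in excluded]
```

Exact matching keeps the filter predictable but misses variants such as "Contact us". A substring or pattern match would catch more tabs at the cost of occasionally skipping useful pages.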
Below is a visualization for https://fcolumbus.pl/, which shows how difficult it is for classification algorithms to work on a heterogeneous set of input data without going through sublinks.