Automatic website classification
Machine learning in the eAuditor system analyzes web content and assigns the appropriate category. Check out what possibilities it provides!
Table of contents
1. Using AI to classify web pages
2. Application of machine learning in eAuditor
3. Operation of the Bayesian classifier
4. Correctness and timing of website classification
5. Why did we introduce machine learning into the eAuditor system?
6. Benefits of machine learning for eAuditor users
Using AI to classify web pages
Many companies struggle to determine exactly what they need and what they can achieve by investing in AI (Artificial Intelligence) and machine learning technology. The most common barrier is a lack of knowledge about, and confidence in, the data the company collects.
Application of machine learning in eAuditor: enabling detailed security analysis
Machine learning in the eAuditor system analyzes the content of web pages and assigns each page the appropriate category.
Website classification can be useful in any organization where monitoring and controlling user activity has a real impact on security.
The machine learning algorithm classifies any website quickly and efficiently by its content, so that it can be assigned to the appropriate category. The website classification module in the eAuditor system is designed to tolerate unexpected events: a server-side error or an expired website does not interrupt its operation, and it continues assigning websites to the appropriate categories.
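The fault tolerance described above can be sketched as a loop that records a failed page and moves on to the next one instead of stopping. This is only an illustration, not eAuditor's actual code: the function `classify_all` and the stand-ins `fake_fetch` and `fake_classify` are hypothetical names, and the example URLs are placeholders.

```python
def classify_all(urls, fetch, classify):
    """Classify every URL; a failing page is recorded, never fatal.

    `fetch` and `classify` are passed in as functions (hypothetical
    stand-ins for a page downloader and a trained classifier).
    """
    categories = {}
    failures = {}
    for url in urls:
        try:
            content = fetch(url)
        except OSError as error:
            # Connection errors and timeouts are OSError subclasses in
            # Python; note the failure and continue with the next URL.
            failures[url] = str(error)
            continue
        categories[url] = classify(content)
    return categories, failures


# Stand-in functions so the sketch runs without network access.
def fake_fetch(url):
    pages = {"https://news.example": "election parliament vote"}
    if url not in pages:
        raise ConnectionError(f"cannot reach {url}")
    return pages[url]


def fake_classify(content):
    return "law and politics" if "election" in content else "other"


categories, failures = classify_all(
    ["https://expired.example", "https://news.example"],
    fake_fetch,
    fake_classify,
)
print(categories)
print(failures)
```

The expired page ends up in `failures`, while the page after it is still classified, which is the behavior the paragraph above describes.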
Operation of the Bayesian classifier
The Bayesian classifier, which is based on Bayes’ theorem, is particularly well suited to high-dimensional problems. Despite its simplicity, it often performs better than other, more complicated classification methods. The classifier is trained in supervised learning mode, that is, from examples labeled with the correct category. For the algorithm to work correctly and keep improving, human supervision is needed to continuously review its results and correct any errors. A classification is correct as long as the correct category receives a higher probability than the others.
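To make the idea concrete, here is a minimal sketch of one common Bayesian text classifier, multinomial naive Bayes with add-one smoothing. The article does not say which variant eAuditor uses, so the choice of variant, the class name `NaiveBayesSketch`, and the tiny training texts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training pages
        self.vocabulary = set()

    def train(self, text, category):
        # Supervised learning: each text comes with its correct category.
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def log_scores(self, text):
        # log P(category) + sum of log P(word | category), per Bayes' theorem
        # with the "naive" assumption that words are independent.
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            scores[category] = score
        return scores

    def classify(self, text):
        # The predicted category is simply the most probable one.
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


classifier = NaiveBayesSketch()
# Made-up training examples, far smaller than any real training set.
classifier.train("election parliament government vote minister", "law and politics")
classifier.train("court law bill parliament senate", "law and politics")
classifier.train("movie music celebrity concert festival", "entertainment")
classifier.train("film actor series concert premiere", "entertainment")

predicted = classifier.classify("parliament passed a new bill after the vote")
print(predicted)
```

Correcting a wrong result in this sketch amounts to calling `train` again with the page text and the right category, which mirrors the human supervision described above.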
Worth remembering!
In practice, the algorithm may indicate a different category than expected. This happens especially on news sites that carry many articles on many topics and industries.
Correctness and timing of website classification
As part of eAuditor’s machine learning test, 1,000 random and unpopular websites were categorized. The number is now close to 5 million! The correctness of category assignment for these sites exceeds 95%. The obstacle to better results does not lie in the algorithm, which finds the category with the highest probability. The problem is that a single website can fall into several categories at once, and each of them can be correct.
Example:
The website www.onet.pl can be categorized as both news and media, as well as entertainment or law and politics.
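One way to see why several categories can be plausible at once is to look at the few most probable categories instead of only the winner. The sketch below turns per-category log scores into normalized probabilities and ranks them. The function `top_categories` is hypothetical, and the scores are made-up numbers for illustration, not measurements of any real site.

```python
import math


def top_categories(log_scores, k=3):
    """Return the k most probable categories with normalized probabilities."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(log_scores.values())
    weights = {c: math.exp(s - highest) for c, s in log_scores.items()}
    total = sum(weights.values())
    probabilities = {c: w / total for c, w in weights.items()}
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up log scores for a broad news portal (illustrative values only).
example_scores = {
    "news and media": -41.2,
    "entertainment": -41.9,
    "law and politics": -42.4,
    "sports": -55.0,
}

ranked = top_categories(example_scores)
for category, probability in ranked:
    print(f"{category}: {probability:.2f}")
```

When the leading probabilities are close together, as in this made-up case, any of the top categories could reasonably be called correct.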
Why did we introduce machine learning into the eAuditor system?
- the database of websites with assigned categories is huge and takes up a lot of space (over 1 TB). The number of web pages is not in the thousands or even the millions; today it is hard to estimate at all,
- a ready-made database does not cover even 75% of the pages viewed by our customers; full coverage is physically impossible,
- websites can change their category faster than off-the-shelf category databases are updated,
- databases require constant updating, which is costly and time-consuming,
- machine learning categorizes websites individually for each user’s needs.
Benefits of machine learning for eAuditor users
- automatically assigning a category to each web page visited,
- high classification efficiency,
- auto-adaptation to each user of the eAuditor system,
- no need to maintain or update a database of website categories,
- automatic reclassification in case of algorithm modification or website modification,
- independence from external providers of such a database,
- reduction of system operating costs,
- the ability to integrate with Hyprovision DLP for blocking selected types of sites.