Optimization of Chatbot Intent Classification: A Comparative Analysis of NLP and LLM Embedding Techniques

Authors

  • Anvar Zokhidov, Former Senior AI Manager, Tenge Bank (Halyk Group); ML Engineer, Orange (France Telecom)

Abstract

In the field of natural language processing (NLP) and large language models (LLMs), we conducted a comprehensive study of machine learning models and methods for extracting, understanding, and generating textual information. The main focus was on applying models of different generations to an intent classification task, with classifiers trained, validated, and tested on synthetic text data generated by ChatGPT (GPT-4o).
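
To make this setup concrete, the sketch below builds a small intent dataset and produces the stratified train/validation/test split the abstract describes. The utterances and intent labels here are hypothetical placeholders standing in for the GPT-4o-generated corpus, which is larger and not reproduced here.

from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the GPT-4o-generated utterances.
samples = {
    "balance": [
        "What is my balance?", "Show my account balance",
        "How much money do I have?", "Balance on my card please",
        "Check my current balance", "Tell me my remaining funds",
    ],
    "card_block": [
        "Block my card", "I lost my card, freeze it",
        "Deactivate my debit card", "My card was stolen",
        "Please block my card", "Freeze my credit card",
    ],
    "transfer": [
        "Send money to my friend", "Transfer funds abroad",
        "Wire 100 dollars to this IBAN", "Make a payment transfer",
        "How do I transfer money?", "Move money between accounts",
    ],
}
texts = [t for utterances in samples.values() for t in utterances]
labels = [intent for intent, utterances in samples.items() for _ in utterances]

# 70/15/15 stratified split: carve out a held-out pool, then halve it
# into validation and test sets.
X_train, X_pool, y_train, y_pool = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_pool, y_pool, test_size=0.50, stratify=y_pool, random_state=42)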

The experiments began with baseline models such as TF-IDF, Word2Vec, Doc2Vec, and FastText, and ended with Transformer-based models. Each model was used to extract embeddings from the text data, which served as the foundation for the subsequent intent classification. This initial stage yielded valuable insight into the effectiveness of the various embedding methods and their impact on the downstream chatbot classification task. This article describes the optimization carried out across these stages of experimentation, culminating in a holistic understanding of the various NLP and LLM models and their applications.
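
A minimal sketch of such a comparison, assuming the splits from the previous snippet, is shown below: every representation feeds the same downstream classifier, so only the embedding method varies between runs. The model name "all-MiniLM-L6-v2" is an illustrative choice rather than the one used in the study, and Word2Vec, Doc2Vec, or FastText vectors from Gensim would slot into evaluate() in the same way.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sentence_transformers import SentenceTransformer

def evaluate(train_vecs, val_vecs, y_train, y_val, name):
    # Same downstream classifier for each representation, so that
    # only the embedding method varies between runs.
    clf = LogisticRegression(max_iter=1000).fit(train_vecs, y_train)
    acc = accuracy_score(y_val, clf.predict(val_vecs))
    print(f"{name}: validation accuracy = {acc:.3f}")

# 1) Sparse lexical representation: TF-IDF.
tfidf = TfidfVectorizer()
evaluate(tfidf.fit_transform(X_train), tfidf.transform(X_val),
         y_train, y_val, "TF-IDF")

# 2) Dense Transformer-based sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
evaluate(encoder.encode(X_train), encoder.encode(X_val),
         y_train, y_val, "Sentence-Transformer")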

Published

2025-06-23

How to Cite

Zokhidov, A. (2025). Optimization of Chatbot Intent Classification: A Comparative Analysis of NLP and LLM Embedding Techniques. International Journal of Informatics and Data Science Research, 2(6), 11–19. Retrieved from https://scientificbulletin.com/index.php/IJIDSR/article/view/1051