TY - THES
T1 - Grocery2Net: a web-based hybrid CNN and ViT model for image classification and semantic segmentation
A1 - Madriaga, Charissa Mae P.
A2 - Yusiong, John Paul T.
LA - English
UL - https://tuklas.up.edu.ph/Record/UP-8027390931312010012
AB - Fruits, vegetables, and packaged products, essential for daily consumption, are among the most common items in grocery stores, making it important to develop models that can classify and segment products effectively for sorting, inventory management, and enhancing customer experience. Current approaches rely on convolutional neural networks (CNNs) and Vision Transformers (ViTs) separately, each with distinct strengths and weaknesses. Traditional CNNs excel at extracting localized spatial features but struggle with long-range dependencies, while ViTs capture global context but require large datasets, extensive computation, and may struggle with fine local details. Moreover, there is limited research that explores multitask learning for both classification and segmentation. To address these limitations, this study presents Grocery2Net, a hybrid model that combines the strengths of both architectures to perform simultaneous classification and segmentation. The model was trained on a publicly available grocery dataset and evaluated using metrics such as Top-1 and Top-5 accuracy, segmentation accuracy, mean Intersection over Union (mIoU), precision, recall, and F1 score, and was benchmarked against the state-of-the-art. The final model, which uses a ResNeXt50_32x4d CNN encoder and a MiT-B2 ViT encoder with two parallel UNets, achieved an average Top-1 classification accuracy of 98.78%, Top-5 accuracy of 99.94%, an mIoU of 94.43%, and a segmentation F1 score of 97.14%. These results demonstrate that the hybrid model outperforms standalone architectures on the same dataset, achieving higher classification accuracy and more precise segmentation maps.
CN - LG 993.5 2025 C66 M33
KW - Computer vision.
ER -