dc.description.abstract | In recent decades, significant growth and diversification in sources of User-Generated Content (UGC) have been observed. Social media emerges as one of the primary sources of UGC, offering numerous advantages over traditional data sources, such as affordability, vastness, and diversity across various domains of application (for example, tourism, health, public policies). However, the highly unstructured nature of social media posts introduces several challenges. The language diversity and specificity of social media posts, characterized by features such as brevity, frequent grammatical errors, and the use of special characters, combined with the substantial volume and noisy nature of the data, make analyzing social media data a complex endeavour. This thesis introduces a novel multilingual framework, the APs Framework, designed to streamline the processing and analysis of social media data. This framework is generic in two aspects: it can be applied across various social media platforms and is adaptable to different application domains. The genericity of the application domain is supported by semantic representations of domain knowledge (for example, through thesauri or ontologies). The APs Framework aims to provide domain-independent insights from social media to non-computer scientists, such as stakeholders in various domains (for example, tourism offices in the tourism domain), thereby enhancing their analytical capabilities. The APs Framework is structured into four phases: Collect, Transform, Analyze, and Valorize. In the Collect phase, a generic and iterative methodology for constructing thematic datasets from social media is proposed. This approach seeks to mitigate the challenges of creating accurate and representative datasets amidst the voluminous and noisy nature of social media. The objective is to shift from ad hoc extraction techniques, prevalent in existing studies, to a more systematic, semi-automatic process.
This methodology incorporates human feedback at various stages and utilizes both content-based and metadata-based filtering techniques, alongside semantic domain descriptions, to offer a standardized and reusable method for thematic dataset building from social media. The methodology was evaluated both qualitatively and quantitatively through the development of an X/Twitter dataset focused on tourism in the Basque Country region. The Transform phase tackles the challenge of converting multilingual, unstructured text data into structured knowledge within a given application domain. It concentrates on three pivotal knowledge extraction tasks: (1) Sentiment Analysis, (2) Named Entity Recognition (NER) for Locations, and (3) Fine-grained Thematic Concept Extraction. Given the scarcity of multilingual training resources in the tourism domain, the process of manually generating a novel annotated training dataset for this domain is detailed. Subsequently, the thesis explores optimal strategies for the multilingual analysis of social media content in tourism, comparing rule-based and deep learning-based approaches (including fine-tuning and prompting-based few-shot learning with various language models). This exploration aims to identify the minimal number of annotated examples necessary for achieving competitive results across these tasks, leveraging various training techniques and language models. This phase addresses the challenge of minimizing manual annotation efforts without compromising the results' quality, considering the time-consuming and expensive nature of manual data annotation. In the Analyze phase, we hypothesize that adapting the theory of proxemics, traditionally applied in physical contexts, to social media could offer a novel approach to crafting meaningful, domain-adaptable indicators for various end-users. The theory is formally redefined, leading to the development of a modular and extensible proxemic data model.
This model is capable of representing social media entities and their interactions in a domain-independent manner. Leveraging this model, ProxMetrics, a toolkit and formula for generating adaptable indicators from social media, is introduced. These indicators, conceptualized as proxemic similarity measures, span multidimensional social media entities, including users, groups, places, themes, and temporal periods. They are highly customizable, allowing for the adjustment of the five proxemic dimensions (Distance, Identity, Location, Movement, and Orientation) to address various domain requirements. The toolkit and models underwent qualitative evaluations in collaboration with a local tourism office to model and address various local touristic requirements. Finally, the Valorize phase addresses the challenge of presenting social media indicators and analyses to non-computer scientist users, such as domain stakeholders, in an accessible and domain-independent manner. To this end, TextBI, a multimodal generic dashboard, is proposed. This tool is designed to display multidimensional annotations and indicators over volumes of multilingual social media data, focusing on four core dimensions: spatial, temporal, thematic, and personal, while also accommodating additional enrichment data, such as sentiment and engagement. The dashboard offers various visualization modes, including frequency, movement, association, and proxemics, combining features from Business Intelligence (interactivity, combined filtering, synchronization of visuals), Geographical Information Systems (spatial view at multiple granularities), and Linguistic Information Visualization tools (text-based analyses). Unlike most existing dashboards, it is generic enough to operate across different domains, provided the data adheres to the specified data model.
The effectiveness of this dashboard was validated in the tourism domain through evaluations conducted by tourism offices, assessing its applicability and relevance. The framework's twofold genericity (application domain and data source) is demonstrated through the application of each phase in another domain of application: local public policies, leveraging data from municipality review platforms. | es_ES |