Final Examination
MA course Linguistic data: quantitative analysis and visualization, HSE, Moscow
In the project, students explore the use of rival forms in written or oral speech. This can be a choice between:
Possible research tracks (1)
To give an idea, the choice between rival units can be driven by certain contextual factors (e.g. words, syntactic patterns), features of the rival units themselves (e.g. the gender of the noun, the tense of the verb), the genre and register of the text, and sociolinguistic parameters (e.g. age, sex, profession, education, place of birth). We expect students to explore at least three factors of any kind in their case study. For inspiration, you can look at some examples of data annotation in the open repository Trolling (https://opendata.uit.no/dataverse/trolling).
During the project, the students formulate their initial hypothesis, collect data (either corpus-based or experimental), annotate the data and do a preliminary descriptive, exploratory and inferential statistical analysis. After that, they can update their hypothesis, include more factors or exclude some, and collect more data or annotate more parameters in order to improve the empirical basis for their analysis. The amount of data collected should be enough to support the statistical analysis. It goes without saying that it is strictly prohibited to exclude data that contradict the hypothesis or to commit any other kind of hypothesis-biased fraud.
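As an illustration of the preliminary descriptive and inferential step described above, here is a minimal sketch in Python (standard library only). Everything in it is an assumption made for the example: the toy counts, the factor names `genre` and `noun_gender`, and the helper functions `crosstab` and `chi_square` are invented and do not come from any real dataset or from the course materials. It cross-tabulates each factor against the rival form and computes a Pearson chi-square statistic and Cramér's V, one factor at a time. This is only a first look at the data and does not count as one of the required multi-factor techniques.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy counts: (genre, noun_gender, rival form, number of tokens).
counts = [
    ("fiction", "f", "A", 15), ("fiction", "m", "A", 15),
    ("fiction", "f", "B", 5),  ("fiction", "m", "B", 5),
    ("news", "f", "A", 5),     ("news", "m", "A", 5),
    ("news", "f", "B", 15),    ("news", "m", "B", 15),
]
# Expand the counts into one annotated row per observation.
rows = [
    {"genre": g, "noun_gender": ng, "form": f}
    for g, ng, f, k in counts
    for _ in range(k)
]

def crosstab(rows, factor, outcome):
    """Count how often each factor level co-occurs with each outcome form."""
    return Counter((r[factor], r[outcome]) for r in rows)

def chi_square(table):
    """Return (chi2, df, Cramér's V) for a contingency table.

    Plain Pearson statistic, without a continuity correction.
    """
    row_levels = sorted({key[0] for key in table})
    col_levels = sorted({key[1] for key in table})
    n = sum(table.values())
    row_tot = {r: sum(table[(r, c)] for c in col_levels) for r in row_levels}
    col_tot = {c: sum(table[(r, c)] for r in row_levels) for c in col_levels}
    chi2 = 0.0
    for r in row_levels:
        for c in col_levels:
            expected = row_tot[r] * col_tot[c] / n
            chi2 += (table[(r, c)] - expected) ** 2 / expected
    df = (len(row_levels) - 1) * (len(col_levels) - 1)
    k = min(len(row_levels), len(col_levels)) - 1
    v = sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, df, v

# One test per factor; the outcome is always the rival form.
results = {
    factor: chi_square(crosstab(rows, factor, "form"))
    for factor in ("genre", "noun_gender")
}
for factor, (chi2, df, v) in results.items():
    print(f"{factor}: chi2={chi2:.2f}, df={df}, Cramér's V={v:.2f}")
```

In this toy data the form is associated with genre but not with noun gender. In a real project, such single-factor checks would be followed by models that consider several factors jointly.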
The students prepare the final project in written form as an electronic document (R markdown) that includes the following parts:
Language under analysis: any natural language
Type of analysis: multi-factor analysis. At least two multi-factor analysis techniques should be demonstrated in the paper.
Type of the project: individual or group (max. 2 people) project
Language of the project paper: English
The students can either compile the dataset specifically for this research paper or make use of data collected for their term papers, dissertations or other research activities. If you recycle the results of your previous research, please be prepared to add more data to your dataset based on your experience with their analysis. If you use data collected and annotated by other people, please indicate this in your research paper. In this case, you will have to (a) write an additional section on “previous research on the topic, comparison of current results with previous studies” or (b) do additional model testing to explore the effects of data size, missing data treatment, data sampling, custom methods within models, etc.
Corpus data samples and experimental data sets are usually annotated manually. However, you can do any kind of data preprocessing and use any means of automating data annotation, if you wish. The quality of the annotation and its interpretability will be assessed.
Examination grades: 10 – 500 points or higher, 4 (‘passed’) – 200 points
Research paper template
Examples of project papers (of different quality, for orientation only)