Another question: is there any published literature on the correlation between general productivity (say, in the workplace) and the number of words in a language? Does having more words in your mother tongue make you more or less productive in a group?
Thank you, Daniel!
1. Can all of them be both nouns and adjectives ?
My intuition is that the adjectival sense is more basic, and nouns are more natural if they've been established in conventional usage. But just thinking about this at the moment I can't come up with any examples that don't seem to function as nouns, so I'm not sure. It's also possible some nouns exist but other synonymous adjectives block the formation/usage of the correlated adjective, but again I can't think of examples.*

[*To me, 'communist' [adj.] feels like an adjectival counterpart to the identical noun, rather than the other way around, for example. That's the inverse of the situation I described for -ist in general. Not sure why. Maybe just frequency, and because something like "communist country" is ambiguous between a noun-noun compound and adj-noun phrase.]

2. Is it considered as a conversion A>N or N>A by morphologists ?
I can't answer in general, but this would depend on your analysis. Is the derivation from X>A (social>socialist) and also in parallel X>N (social>socialist)? Or is the derivation first from the root, then to the adjective, and then as a third step to the noun? If it's a secondary derivation, then yes this would presumably be considered conversion.

Note that "X" above could be an adjective (social-ist) or a noun (sex-ist) or just a bare root (commun-ist?). Many derivational affixes are restricted to combining with a certain word class, so maybe there are actually several -ist affixes that combine slightly differently with different word classes, and therefore might give slightly different results for your questions, although keep in mind that analogy could still hold their usage together in some ways (e.g., I don't imagine much of a functional difference between "communist" and "socialist" regardless of their slightly different origins morphologically).

3. Are there adjectival doublets in -ist and -istic that are perfectly synonymous ?
Perfect synonymy rarely exists, given connotations, frequency of use, associations with particular speakers, etc. But these are pretty close. Still, one form seems to be the default for most examples I can think of. No exactly synonymous pair comes to mind, but it wouldn't surprise me if some are in free variation (though again, possibly with minor individual differences in connotation or preference).

This reminds me of the -ic/-ical pair, which for whatever reason seems easier to discuss at the moment. This applies especially to some linguistics terminology, so it's easy to come up with examples:
morphological / *?morphologic
syntactic / ?syntactical

I don't know that I've ever seen "morphologic" in real use, but "syntactical" comes up fairly often (I suspect most often from non-native speakers, but also from some native speakers). Obviously in typical syntactic research there is no intended distinction between the two forms. So they are, I guess, 'perfectly synonymous'. However, to my ears 'syntactical' sounds off, and even though it has a clear and identical meaning, I much prefer 'syntactic'. That preference is entirely arbitrary, though, because in the other pair it is the longer form, 'morphological', that sounds better. Again, this could just be a question of frequency of use. Some other pairs are more flexible, I think.

Extending this a bit, something else that is interesting is how adverbs are formed from these adjectives: it's typically the long form that is used:
*syntacticly / syntactically
[However, pronunciation of those would be identical, so maybe that's only a spelling issue.]

*communistly / communistically
*socialistly / socialistically
(But: sexistly, racistly)

Yet I see a subtle distinction in meaning: the ending -istic (along with -istically) has an "aboutness" sense that isn't found with just -ist. And that also applies to -ical ("aboutness"), vs. -ic (more general).

"Syntactical" sounds to me like it should mean something like "of or relating to meta-analysis of syntax", or something like that. "This is a syntactical paper". "Syntactic" just means "related to syntax", etc. It's almost like "-ical" has a hint of being a double derivation. Maybe that's just an iconic property of being a longer form (and that matches my intuition about why "syntactical" sounds wrong-- it's just longer, and not needed, because the shorter form sounds fine-- blocking).

There may also be a few cases where nouns and adjectives are contrastive:
impressionist: noun meaning 'impressionism artist; or just one who makes impressions'; or adjective meaning 'related to impressionism; or just based on impressions'
impressionistic: (only) adjective meaning 'based on impressions'

Also perhaps related are the differences in meaning between 'sexist' and 'socialist' -- two very different kinds of ideas there!

Edit: while browsing for unrelated reasons, I just came across a potentially relevant paper about -ic/-ical, by Gries here:
I have questions about these derivatives’ category in English.
1. Can all of them be both nouns and adjectives ?
2. Is it considered as a conversion A>N or N>A by morphologists ?
3. Are there adjectival doublets in -ist and -istic that are perfectly synonymous ?
Thank you.
The longest threads speak to individuals, not general interest.

As for Eurocentrism in linguistics in general, it's because more Europeans and Americans work in linguistics than people from other parts of the world, and it's something that many linguists are hoping will improve, both by focusing our research on other areas and by getting people from other areas and speakers of non-Western languages involved as linguists themselves. There's also a bit of a feedback loop. For example, a historical linguistics class could certainly be taught based on the Austronesian family-- there's plenty of data and it's also conveniently very clean data (often languages separated on different islands and not in too much contact after separation), but we've all been taught in classes mostly about Indo-European history for reasons of tradition, available textbooks, the knowledge of our instructors, etc. Things are improving, slowly, especially as some of the areas of research about European languages begin to dry up (relatively few big new things are left to be discovered, although even for English many minor details that could have profound influence on theory are still being investigated and debated). So, for example, as Indo-Europeanists who already have a reasonable understanding of IE and aren't answering many new questions turn their attention elsewhere, we might see more of a focus on other areas.

The familiar will probably always dominate discussions, questions and research, but we can work on also representing other things well, and also shifting what is familiar. Intro to Linguistics classes around the world should emphasize signed languages more, for example, because that is a severely understudied area, even more than geographically diverse oral languages. It's happening, but slowly.

Especially at the amateur level, it shouldn't be surprising that people are interested in things close to them. But there are also plenty of questions about things elsewhere including on this forum.
So, what do you think, why is this forum Eurocentric? I mean, the longest thread on this forum (according to the statistics) is "The Language of Old Europe" and the third longest thread is called "Croatian toponyms". Why do those things interest people more than the Native American or the Aboriginal languages?
Your hesitation is appropriate.

However, what is interesting about Zipf's Law is that it applies regardless of the particular data set you're looking at. (Obviously more approximately on smaller data sets.) So, it's possible there could be reasons for using it even on small data sets in your approach. But yes, look into existing research, and make sure there is a principled reason for approaching it that way.
Daniel, thank you a lot for your comment! :) You're right, I will search more for existing measures. I wish I had more mathematical background; maybe it's also time for me to read more about statistics in general. I was too tempted to use log just because the results were amazing, but the more I think about it, the more I'm convinced that it is not statistically relevant here. It would be if I used the whole corpus, so maybe I will look into that.
I can't answer this from the technical perspective of a computational linguist, but I'll try to comment briefly.

When we use a statistical test, we are usually comparing two things. We want to show (roughly) that one is significantly more likely than another. There are far more complex statistical tests, but they all boil down to basically that core. This means that you can either compare two values to see whether one is bigger than the other (whether the difference between them is significant), or test whether one value is significantly different from an expected value (e.g., 0, where any occurrence of the event would be unexpected; 50%, as in a coin toss; etc.). Since the things you will be comparing will be similar, the comparison is usually unit-less. So in that sense, a log-transform may not be a problem at all. Of course it might mess up the statistical test mathematically, so beware of that. But in principle this may be fine, if you pick an appropriate statistical test. (Note that your intuitions about statistical distributions are irrelevant, so you shouldn't use logs just because you think they look better, but rely instead on a statistical test to tell you whether the results are 'interesting', that is, significant, in a technical sense.)
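
To make the "compare against an expected value" case concrete, here is a minimal sketch of an exact two-sided binomial test against a coin-toss rate of 50%. The function name and the example counts are made up for illustration, and this is only one of many possible tests, not a recommendation for your data:

```python
from math import comb

def binomial_two_sided_p(successes, trials, p0=0.5):
    """Exact two-sided binomial p-value against an expected rate p0.

    Sums the probability of every outcome that is no more likely
    than the observed one under the expected rate.
    """
    def pmf(k):
        # Probability of exactly k successes in `trials` attempts.
        return comb(trials, k) * p0**k * (1 - p0) ** (trials - k)

    observed = pmf(successes)
    # Small tolerance so that ties are not lost to floating-point rounding.
    total = sum(pmf(k) for k in range(trials + 1)
                if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: 60 "heads" out of 100 tosses.
print(binomial_two_sided_p(60, 100))  # about 0.057
```

With these hypothetical counts the p-value lands just above the conventional 0.05 threshold, so the difference from 50% would not usually be called significant, whatever it looks like by eye.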

Regarding frequency and log likelihood for words in a corpus, you probably already know about Zipf's law:
But that partially addresses what you're observing: the relationship between the frequency of words is typically linear on a logarithmic scale. So there may be some sound motivation behind your decision.
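
A minimal sketch of that log-log relationship, assuming you have a plain list of tokens. The function name and the toy data are made up for illustration, and an ordinary least-squares fit on the logged values is only a rough way to estimate the slope, not the preferred method for fitting a power law:

```python
from collections import Counter
from math import log

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what the classic form of Zipf's law predicts.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [log(rank) for rank in range(1, len(counts) + 1)]
    ys = [log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy data built so that frequency is exactly 120 / rank.
toy_tokens = []
for rank, word in enumerate(["the", "of", "and", "to", "in"], start=1):
    toy_tokens.extend([word] * (120 // rank))

print(zipf_slope(toy_tokens))
```

Real corpora will not give exactly -1, and the fit gets noisier on small samples, which is the "more approximately on smaller data sets" caveat.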

The best answer is to consult current sources (e.g., journal articles) doing something like what you're trying to do, and then use the same (or slightly adjusted) methodology for your project. That's how you're likely to get published, at least. If you don't have a good reason for doing something else, that's where to start.

There are various approaches to dealing with frequencies in corpora (that's really the whole field of corpus linguistics!), and there are a number of ways to try to standardize values, to find a balance between frequent words and frequent pairings in order to look at interesting patterns of attraction, etc. Don't reinvent the wheel if there's already a method out there, and that way your results will be comparable anyway.
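
As one small illustration of standardizing values, raw counts are commonly converted to a rate per million words so that corpora of different sizes can be compared. This is only a sketch with a made-up function name and toy data, not the specific measure your sources will necessarily use:

```python
from collections import Counter

def per_million(tokens):
    """Relative frequency of each word type, per million tokens."""
    total = len(tokens)
    if total == 0:
        raise ValueError("empty token list")
    return {word: count * 1_000_000 / total
            for word, count in Counter(tokens).items()}

# Toy example: "a" makes up half of a four-token "corpus".
rates = per_million(["a", "a", "b", "c"])
print(rates["a"])  # 500000.0
```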
Hi to all! My name is Nemi and I just started to dive into Corpus Linguistics. Currently I'm doing some research on Twitter. As I am pretty new to all this (my former field was philosophy), I have a lot of questions and was very happy when I discovered this forum. I hope my questions are not too benign, and I apologize in advance if they are :) Thank you for reading and nice to meet all of you!
