Find a good English dictionary for us! #438
Comments
Hello @bee-san, can I please contribute to your code?

@chetnaaggarwal-ca Sure! Go right ahead :D
I found this dictionary of words using https://pypi.org/project/english-words/, but it has only 25487 words, which is almost negligible compared to the 479826 in the UNIX default word list. However, all the words in this list are meaningful.

I had tried to create a fix with the logic that @bee-san suggested in the issue description, using this tool: https://www.english-corpora.org/coca/. But it has a problem: the string "hm" apparently has a frequency almost 15 times higher than more meaningful words like "curate", and there are many similar cases.
https://raw.githubusercontent.com/jeremy-rifkin/Wordlist/master/master.txt, or you could scrape a whole dictionary.
@Itamai the 300k dictionary looks good to me as well. I also found the dictionary that the nltk library uses; it has 236736 words, and as far as I can see they are all meaningful. Which one shall I go with?

The GitHub user @chetnaaggarwal-ca got assigned to this issue, so we should let her do it. I was just giving some ideas :D

Sure thing! I'll wait for that to finish so I can begin #412 :)
I tried to merge dictionaries previously and it didn't work out so well :( I also tried scraping the speeches from the Houses of Parliament :( The dictionary is loaded as JSON, so {word: 1}. What do you think about scraping words from multiple sources and building a frequency distribution, like so:
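A minimal sketch of that merge-and-threshold idea — the corpora and the threshold here are toy placeholders, not the real sources:

```python
from collections import Counter
import json

def build_frequency(corpora):
    """Merge lowercase word counts from several text sources."""
    freq = Counter()
    for text in corpora:
        freq.update(word.lower() for word in text.split())
    return freq

# Toy corpora standing in for the real scraped sources
corpora = [
    "the cat and the dog",
    "bread and butter and jam",
]
freq = build_frequency(corpora)

# Keep only words whose combined count clears a (made-up) threshold
THRESHOLD = 2
common = {word: count for word, count in freq.items() if count >= THRESHOLD}
print(json.dumps(common, sort_keys=True))  # {"and": 3, "the": 2}
```

With real corpora the threshold would need tuning so filler strings fall below it while ordinary words survive.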
So the word "and" appears a lot. Then we add the frequencies together, and if a word goes over a threshold, we use it? We do a similar thing for stop words / the top 1000 words, except this is for every word in the dictionary :)
Oh right.
If we used English Wikipedia, I think it'd cover a significant number of words, and for the words that do not appear, we could always do something like this:
It's unlikely for any given text to be comprised entirely of rare words that have never appeared (or rarely appear) in English Wikipedia :)
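That fallback could be sketched like this — `KNOWN_WORDS` and the 60% cutoff are made-up placeholders: count what fraction of a candidate text's words appear in the word list, and accept the text if most of them do.

```python
# KNOWN_WORDS stands in for the real Wikipedia-derived word list,
# and the 0.6 cutoff is an arbitrary placeholder.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def looks_like_english(text, min_known=0.6):
    """Accept a candidate plaintext if most of its words are known."""
    words = text.lower().split()
    if not words:
        return False
    known = sum(1 for word in words if word in KNOWN_WORDS)
    return known / len(words) >= min_known

print(looks_like_english("the quick brown fox zyzzyva"))  # True: 4/5 known
print(looks_like_english("qxf zzkj wvvp"))                # False: none known
```

A couple of rare words then no longer sink an otherwise-English candidate.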
Awesome!
I found https://github.com/IlyaSemenov/wikipedia-word-frequency, where they have conveniently calculated the frequency of English words on Wikipedia for us (https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/master/results/enwiki-20190320-words-frequency.txt). Perhaps I could use this along with the dictionaries we currently use to figure out "common words" and split them into two files, one of common words and one of uncommon words, which we can then manually filter to see whether any words from the uncommon file need to be in the common file.
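A rough sketch of that split — assuming the frequency file uses `word count` lines, and with toy data standing in for the real files:

```python
def split_dictionary(words, freq_lines, threshold):
    """Split a word list into common/uncommon using 'word count' lines."""
    freq = {}
    for line in freq_lines:
        word, count = line.split()
        freq[word] = int(count)
    common = [w for w in words if freq.get(w, 0) >= threshold]
    uncommon = [w for w in words if freq.get(w, 0) < threshold]
    return common, uncommon

# Toy stand-ins for the enwiki frequency file and our dictionary;
# the threshold of 1000 is an arbitrary placeholder.
freq_lines = ["the 2000000", "curate 1500", "hm 90"]
common, uncommon = split_dictionary(
    ["the", "curate", "hm", "zyzzyva"], freq_lines, threshold=1000
)
print(common, uncommon)  # ['the', 'curate'] ['hm', 'zyzzyva']
```

Words missing from the frequency file default to a count of 0, so they land in the uncommon file for manual review.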
I have also found a highly reliable source called Datamuse, who provide an awesome API that uses Google's n-grams to provide the frequency of words used in texts, books, and other material that Google has access to: https://www.datamuse.com/api/. For example: https://api.datamuse.com/words?sp=awesome&md=f&max=1. However, they have a cap of 100k requests per day. We can get around this cap if 4-5 of us run a script to get the frequency of every single word in the dictionary we are currently using, and then merge the collected data into one file. We can then manipulate this data using whatever threshold we require. This solution (in my opinion) sounds like the most viable one so far. If y'all agree, I can get the script done (including the splitting of the file so four of us can run it).
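For illustration, the `f:<value>` tag that Datamuse returns when `md=f` is requested could be parsed like this — the sample entry below is hand-made, not a live API response:

```python
def parse_frequency(entry):
    """Pull the numeric value out of an 'f:<value>' tag, if present."""
    for tag in entry.get("tags", []):
        if tag.startswith("f:"):
            return float(tag[2:])
    return None

# Hand-made sample shaped like one element of a Datamuse md=f response
sample = {"word": "awesome", "score": 3693, "tags": ["f:32.75"]}
print(parse_frequency(sample))  # 32.75
```

A collection script would call the API for each word, feed each result through a parser like this, and write the word/frequency pairs out for merging.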
Are you still working on this? |
Reassigned issue due to no communication :) |


Hello spoooopyyy hackers🎃
This is a Hacktoberfest only issue!👻
This is also data-sciency!
The Problem
Our English dictionary contains words that aren't English, and does not contain common English words.
Examples of non-common words in the dictionary:
This is our current dictionary:
https://github.com/dwyl/english-words
What we want
An English dictionary, in JSON format, without the horrible non-English words and with common English words included.
Ideas on how to achieve this
You'll likely need to use data science: parse English text (such as books / stories) and find uncommon words so they can be removed, and potentially add more words.
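One toy first pass along those lines, keeping the {word: 1} JSON shape the dictionary already uses — the corpus and dictionary here are placeholders, not the real data:

```python
from collections import Counter
import json

def prune(dictionary, corpus_text, min_count=1):
    """Drop dictionary entries that never appear in the reference corpus."""
    counts = Counter(corpus_text.lower().split())
    return {word: 1 for word in dictionary if counts[word] >= min_count}

# Toy stand-ins for the real dictionary and a real English corpus
dictionary = {"apple": 1, "banana": 1, "aaargh": 1}
corpus = "apple pie and banana bread apple"
pruned = prune(dictionary, corpus)
print(json.dumps(pruned, sort_keys=True))  # {"apple": 1, "banana": 1}
```

With a large corpus, raising `min_count` trades recall for precision: higher values cut more junk but also more legitimate rare words.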
I'm not the best data scientist in the world, so what you decide will be good.
You can also publish this work outside of Ciphey, such as in a separate GitHub repository -- so long as Ciphey can use it <3
While I'm not an expert data scientist, I have studied it -- so if you need help leave a comment :)