I think we need to be clear on the terminology being used here. For example, in your first question, you say you want to match the first 300 characters but then mention other "characters" that you imply should not be counted. Similarly, when you talk about 300 "words" in the title and second question, you don't define what a "word" is in your context (this has been discussed in many posts in this forum, with probably as many definitions arrived at as there were situations put forward).
I'm going to assume that you want to count only alphanumeric characters plus the underscore (this happens to match the definition of the '\w' character set shortcut) and match all characters from the start of the text until either the end of the text is reached or 300 such characters have been matched.
To start with, we can use
(with the "singleline" option set) to match up to the first 300 characters, whatever they are. Next we use the fact that the '\W' character set shortcut is the complement of the '\w' one that we are targeting, and that if we try to match any character against either the set or its complement, we will always have a match. Therefore we can express the same pattern as
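The patterns themselves are not shown above, so the following is only a minimal sketch of what the two equivalent forms might look like. It uses Python's `re` module as a stand-in for the .NET engine (`re.DOTALL` plays the role of the "singleline" option), a count of 10 instead of 300 to keep the sample short, and a sample text that is purely illustrative.

```python
import re

text = "Hello,\nworld! Second line."
n = 10  # small count for illustration; the discussion uses 300

# Form 1: any character at all, up to n times ("singleline" lets . match newlines).
any_char = re.compile(r"\A.{0,%d}" % n, re.DOTALL)

# Form 2: a character is either in \w or in its complement \W, so this matches the same text.
set_or_complement = re.compile(r"\A(?:\w|\W){0,%d}" % n)

a = any_char.match(text).group()
b = set_or_complement.match(text).group()
print(repr(a), a == b)
```

Both forms consume exactly the same characters; the second is only useful as a stepping stone, because it separates the characters we care about from the ones we do not.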
Now, we do not want to count any character that is matched by the '\W' part. Therefore we need to skip over any such character and so we change this pattern to
As you can see, we skip over (and therefore don't count) any character that matches the '\W' character set, and then count each single character that matches the '\w' set.
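As a rough illustration of the "skip, then count" idea, here is a sketch in Python's `re` syntax (an assumption; the same construct is valid in .NET). The helper name `first_n_word_chars` and the sample text are hypothetical.

```python
import re

def first_n_word_chars(text, n):
    """Match from the start of text until n characters from \\w have been seen."""
    # \W* skips (without counting) any run of non-word characters,
    # then \w consumes exactly one counted character; repeat up to n times.
    pattern = re.compile(r"\A(?:\W*\w){0,%d}" % n)
    return pattern.match(text).group()

sample = "a, b; c... d"
print(repr(first_n_word_chars(sample, 3)))   # 'a, b; c'
print(repr(first_n_word_chars(sample, 10)))  # the whole sample: only 4 word characters exist
```

Because the skipped characters sit inside the repeated group, they are part of the overall match but do not contribute to the count.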
You can now change this pattern to count whatever "character set" you want. For example, to match the first 10 vowels, you could use
(with the "ignore case" option set as necessary)
Note that in this whole discussion, the skipped (and uncounted) characters are still included in the overall match; only the counting itself is driven by the character set we have specified.
To count words, we can use a similar structure but we first need to define what we mean by a "word". In this case I'm going to assume a word is made up of alphabetic characters plus the apostrophe and hyphen (i.e. the pattern element '[a-z'-]+' with the "ignore case" option set to include "words" with capitalised letters).
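The same structure with a different character set might look like the sketch below, which counts vowels instead of word characters. It is written for Python's `re` module as an assumption (with `re.IGNORECASE` as the "ignore case" option), uses a count of 3 rather than 10, and the helper name `first_n_vowels` is hypothetical.

```python
import re

def first_n_vowels(text, n):
    """Match from the start of text until n vowels have been seen."""
    # [^aeiou]* skips non-vowels, [aeiou] counts one vowel; repeat up to n times.
    pattern = re.compile(r"\A(?:[^aeiou]*[aeiou]){0,%d}" % n, re.IGNORECASE)
    return pattern.match(text).group()

print(first_n_vowels("Regular Expressions", 3))  # Regula
```

The result includes the skipped consonants ("R", "g", "l") even though only the three vowels were counted.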
Therefore we can set up the pattern:
What this does is to skip over any non-"word" characters and then capture the characters in a "word". I have included a capture group around the "word" part so that it is easy for us to get to the individual "words". (I note that you are using the .NET regex engine: this will work in this case because of an extension in the .NET regex capability to capture the text in each repeated capture group - look up "captures"; in other regex engines, each repeated capture will overwrite the previously captured text and other techniques are needed.)
To make it a bit easier to identify the "words", we can tell the regex engine not to "capture" the group that includes the non-word characters, as in:
By the way, I tested this on the text of your question (as it is not 300 "words" long, I matched the first 100 only - that is to the "And" and "what" of the 2nd paragraph). This matched the words "wasn't" and "you're" correctly. If you expand the count a bit, you will find that it captures "th" rather than "300th", because it skips the leading digits in the same way it skips the leading whitespace before a "word". This can be fixed but you will need to define exactly what you want to do first.
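A sketch of the word-counting idea follows, under some loudly stated assumptions: it is written for Python's `re` module, which keeps only the last repetition of a capture group, so `re.finditer` over the matched span is swapped in for the .NET "captures" collection. The helper name `first_n_words` and the sample sentence are hypothetical.

```python
import re

WORD = r"[a-z'-]+"  # the "word" definition assumed in this answer

def first_n_words(text, n):
    """Return (matched_span, words) for the first n 'words' in text."""
    # Skip non-"word" characters, then consume one "word"; repeat up to n times.
    span_pattern = re.compile(r"\A(?:[^a-z'-]*%s){0,%d}" % (WORD, n), re.IGNORECASE)
    span = span_pattern.match(text).group()
    # Stand-in for .NET captures: collect each "word" inside the matched span.
    words = [m.group() for m in re.finditer(WORD, span, re.IGNORECASE)]
    return span, words

span, words = first_n_words("It wasn't the 300th try, you're right.", 5)
print(repr(span))   # 'It wasn't the 300th try'
print(words)        # ['It', "wasn't", 'the', 'th', 'try']
```

Note how "300th" contributes only "th": the digits are skipped like any other non-"word" character, which is the behaviour described above.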
The last part of the 2nd question talks about finding the "closest" period to the 300th character. Unfortunately regex engines can't do maths and so they have no concept of "closest". Also, a characteristic of a regex engine is that it can't go back and rescan text without first forgetting any previously captured text that was scanned.
In this situation, to find the "nearest" period, you need to:
- locate the 300th character
- scan back to the previous period character and record the number of characters
- scan forward to the next period character and record the number of characters
- compare the 2 counts and keep the shorter
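The steps above are easier to carry out in ordinary code than inside a regex. A minimal sketch in Python, using plain string searches; the function name `nearest_period`, the tie-breaking rule (prefer the earlier period) and the sample text are assumptions made for illustration.

```python
def nearest_period(text, position):
    """Return the index of the period closest to position, or -1 if there is none."""
    before = text.rfind(".", 0, position + 1)  # scan back (includes position itself)
    after = text.find(".", position)           # scan forward
    if before == -1:
        return after
    if after == -1:
        return before
    # Compare the two distances and keep the shorter; ties go to the earlier period.
    return before if position - before <= after - position else after

sample = "One. Two three four. Five."
print(nearest_period(sample, 6))   # 3
print(nearest_period(sample, 15))  # 19
```

For the original problem, `position` would be 299 (the zero-based index of the 300th character), and the text would be cut just after the returned index.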
If I were trying to do this, I would create a pattern that captured "sentences" defined (in some way) based on the occurrence of a period character. For example
(with the "singleline" option set) will match characters up to and including a period. (Actually it will have problems with, for example, the text of your question where it will see the ".NET" as the end of a sentence followed by "NET" at the start of the next one - another example of how careful you need to be to define what you really want to match.) You can then look at the array of the matches and count the length of each and keep a running sum. When the sum ticks over the 300 mark, you can see if you really should include the last "sentence" or not.
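The running-sum approach could be sketched as follows. This is an assumption-laden illustration in Python: the "sentence" pattern is one plausible form (a lazy run of any characters ending in a period), the rule for the last sentence (keep it only if its end is closer to the limit than the previous sentence end) is one possible choice, and the name `sentences_up_to` is hypothetical.

```python
import re

def sentences_up_to(text, limit):
    """Join whole 'sentences' so that the result ends as close to limit characters as possible."""
    # Lazy match of anything up to and including the next period ("singleline" via DOTALL).
    sentences = re.findall(r".*?\.", text, re.DOTALL)
    kept, total = [], 0
    for sentence in sentences:
        if total + len(sentence) > limit:
            # This sentence crosses the limit: decide whether to include it.
            overshoot = total + len(sentence) - limit
            undershoot = limit - total
            if overshoot < undershoot:
                kept.append(sentence)
            break
        kept.append(sentence)
        total += len(sentence)
    return "".join(kept)

sample = "Aaaa. Bbbbbbbb. Cc."
print(repr(sentences_up_to(sample, 7)))   # 'Aaaa.'
print(repr(sentences_up_to(sample, 12)))  # 'Aaaa. Bbbbbbbb.'
```

Like the pattern discussed above, this naive definition of a "sentence" splits on every period, so text such as ".NET" would still be cut in the wrong place.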